Earlier this month, I had the privilege of attending MLOps World (November 7–8, 2024) in Austin, TX. This international gathering of practitioners brought together an impressive array of minds striving to advance the science of deploying Machine Learning (ML) and AI models into production. With a strong focus on efficiency, scalability, and responsibility, the event showcased insights for teams across industries working in AI, ML, and Generative AI.
In this blog, I will discuss my top five takeaways (in no particular order) from the conference, highlighting key ideas and innovations that stood out.
Serverless Infrastructure For AI With Modal
One of the standout talks, “How to Run Your Own LLMs, From Silicon to Service”, by Charles Frye, showcased Modal, a serverless cloud platform purpose-built for AI, ML, and data applications.
Modal eliminates the headaches of infrastructure management, enabling engineers and researchers to build compute-intensive applications seamlessly. From generative AI models to large-scale batch workflows, Modal accelerates development while maintaining flexibility.
The session included a demo of running a vLLM server on Modal to power your own large language model (LLM) inference service. Frye highlighted how Modal’s documentation empowers developers to experiment with and deploy different LLMs quickly.
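To make this concrete, here is a minimal sketch of what a GPU-backed inference function on Modal can look like. It is not the demo from the talk: the app name, GPU type, and model are my own illustrative assumptions, and Modal’s documentation and its vLLM examples remain the authoritative reference.

```python
# Minimal sketch of a serverless, GPU-backed LLM inference function on Modal
# (illustrative only; not the exact demo from the talk).
import modal

# Container image with vLLM installed; the dependency choice is an assumption.
image = modal.Image.debian_slim().pip_install("vllm")

app = modal.App("llm-inference-sketch")

@app.function(image=image, gpu="A10G", timeout=600)
def generate(prompt: str) -> str:
    from vllm import LLM, SamplingParams

    # Hypothetical open-weights model; swap in whichever LLM you want to serve.
    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
    outputs = llm.generate([prompt], SamplingParams(max_tokens=128))
    return outputs[0].outputs[0].text

@app.local_entrypoint()
def main():
    # `.remote()` runs the function on Modal's infrastructure; no servers to manage.
    print(generate.remote("Why do serverless GPUs help ML teams?"))
```

In a real service you would load the model once per container rather than per call and expose it as a web endpoint, but even this small sketch shows how little infrastructure code sits between you and a running LLM.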
Key takeaway: Modal is a game-changer for those looking to simplify their AI workflows and focus on innovation rather than infrastructure.
Dolt: Version Control For Databases
Imagine Git and MySQL had a baby: that’s Dolt, the world’s first and only version-controlled SQL database. Introduced by Timothy Sehn, Dolt allows users to fork, clone, branch, merge, push, and pull databases like code repositories.
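As a rough illustration of what “Git for data” looks like in practice, the sketch below connects to a locally running Dolt SQL server from Python and calls Dolt’s branch and commit stored procedures. The database, table, and branch names are assumptions for the example, not details from the talk.

```python
# Sketch: branching and committing data in Dolt from Python.
# Assumes `dolt sql-server` is running locally with a database named `training_data`
# containing a hypothetical `labels` table.
import pymysql

conn = pymysql.connect(
    host="127.0.0.1", port=3306, user="root",
    database="training_data", autocommit=True,
)
with conn.cursor() as cur:
    # Create and switch to a branch, analogous to `git checkout -b`.
    cur.execute("CALL DOLT_CHECKOUT('-b', 'relabeling-experiment')")

    # Make an ordinary SQL change on that branch.
    cur.execute("UPDATE labels SET class = 'vehicle' WHERE class = 'car'")

    # Stage and commit the change with a message, analogous to `git commit -am`.
    cur.execute("CALL DOLT_COMMIT('-am', 'Relabel car -> vehicle for experiment')")
conn.close()
```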
One compelling use case came from Flock Safety, a company leveraging computer vision for community safety. Flock uses Dolt to ensure:
Reproducibility: Regenerating machine learning models with consistent inputs.
Explainability: Understanding training set differences by tracking database changes.
Key takeaway: Dolt can bridge the gap between traditional software engineering workflows and data science, improving both reliability and accountability.
Metaflow And The Power of Fast Bakery
Savin Goyal, co-founder and CTO of Outerbounds, gave a fascinating talk titled “Fast and Reproducible: Taming AI/ML Dependencies”. Outerbounds builds on and extends Metaflow, the open-source Python framework originally developed at Netflix for building and scaling AI/ML applications.
The focus of Goyal’s session was Fast Bakery, Outerbounds’ efficient containerization backend. It rapidly packages ML dependencies into reproducible container images, taking as little as 40 seconds compared to over five minutes with GitHub Actions in some cases.
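For context on where a tool like this plugs in: Metaflow already lets you declare step-level dependencies with decorators such as @pypi or @conda, and on the Outerbounds platform a backend like Fast Bakery turns those declarations into baked, reusable container images. The sketch below is a generic Metaflow flow with an assumed package pin, not Goyal’s exact example.

```python
# Sketch: declaring per-step dependencies in a Metaflow flow.
# Run with something like: python train_flow.py --environment=pypi run
from metaflow import FlowSpec, step, pypi

class TrainFlow(FlowSpec):

    @pypi(packages={"scikit-learn": "1.4.2"})  # assumed version pin, for illustration
    @step
    def start(self):
        # The dependency is available only inside this step's isolated environment.
        from sklearn.linear_model import LogisticRegression
        self.model_name = LogisticRegression.__name__
        self.next(self.end)

    @step
    def end(self):
        print(f"Placeholder flow finished; model class was {self.model_name}.")

if __name__ == "__main__":
    TrainFlow()
```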
Key takeaway: Tools like Fast Bakery make reproducibility and deployment significantly faster and less error-prone, especially for large-scale ML workloads.
Advanced Kubernetes Techniques For AI/ML Workflows
In her session “Kubernetes: Advanced Techniques and Integration”, Anu Reddy took a deep dive into optimizing AI/ML workflows on Kubernetes.
Reddy demonstrated how platforms like Google Kubernetes Engine (GKE), paired with tools like Kueue (a Kubernetes-native job queueing system) and Ray (for distributed compute management), can efficiently orchestrate compute resources. A highlight of the session was deploying Retrieval-Augmented Generation (RAG) workflows on GKE, showcasing Kubernetes’ flexibility in handling modern AI applications.
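To give a flavor of the Ray side of that stack, the sketch below fans a toy workload out across Ray workers from Python; on GKE the same code would point at a KubeRay-managed cluster, with Kueue queueing the jobs. The embedding function and inputs are made up for illustration and are not from Reddy’s RAG demo.

```python
# Sketch: distributing an embarrassingly parallel workload with Ray.
# ray.init() starts a local, in-process cluster; on GKE you would instead connect to a
# KubeRay cluster, e.g. ray.init(address="ray://<head-node>:10001").
import ray

ray.init()

@ray.remote
def embed_chunk(chunk: str) -> list[float]:
    # Placeholder "embedding"; a real RAG pipeline would call an embedding model here.
    return [float(len(chunk)), float(sum(map(ord, chunk)) % 997)]

chunks = ["retrieval", "augmented", "generation"]
futures = [embed_chunk.remote(c) for c in chunks]  # schedule tasks across workers
print(ray.get(futures))                            # gather the results
```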
Key takeaway: Kubernetes remains a cornerstone for scaling AI workloads, with advanced tools simplifying resource management for cutting-edge workflows.
Efficiency in AI: Practical Techniques
Shelby Heinecke’s talk, “A Practical Guide to Efficient AI”, offered actionable strategies to improve model efficiency. Two standout topics were:
Efficient architectures: Small LLMs like Phi-3 (3.8B parameters) outperformed larger models like Mistral 7B on benchmarks such as MMLU and GSM-8K. Even MobileLLM-350M, after fine-tuning, rivaled Llama2-7B in performance.
Model quantization: Heinecke showed how reducing a model’s precision can significantly decrease storage needs and latency without sacrificing performance. For example, quantizing Llama2-7B from 16-bit to 4-bit reduced its size from 13.9 GB to 3.9 GB, improving latency while maintaining accuracy on visual question-answering tasks (a rough sketch of 4-bit loading follows this list).
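Here is that rough sketch of the 4-bit idea, using Hugging Face Transformers with a bitsandbytes quantization config. It illustrates the general technique rather than Heinecke’s exact setup; the model name is an assumption (Llama 2 itself requires gated access), and a CUDA GPU is assumed.

```python
# Sketch: loading a causal LM with 4-bit weights via Transformers + bitsandbytes.
# The model name is illustrative, not the one used in the talk.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-v0.1"  # assumed stand-in for a 7B-class model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit precision
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed/stability
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place the quantized weights on the available GPU(s)
)

inputs = tokenizer("Quantization reduces memory because", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```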
She also introduced Mobile AI Bench, an open-source framework for evaluating and optimizing AI models.
Key takeaway: Small, efficient models and quantization are driving forces in making AI more accessible and resource-friendly.
Final Thoughts
MLOps World 2024 was a treasure trove of knowledge for anyone involved in production AI. From cutting-edge infrastructure like Modal to reproducibility with Dolt and efficiency breakthroughs in model design, the conference underscored the rapidly evolving landscape of AI/ML tools and techniques.
Whether you’re an engineer, data scientist, or researcher, these takeaways highlight the exciting opportunities and responsibilities of building better AI systems for the future.
Looking forward to the next one!
If you’re interested in taking your data insights to the next level, schedule a consultation with one of our architects at phData to explore how we can empower your data and AI journey.