Looking for a new job? There are over 200 companies that get notified when you create a profile on our jobs board. Happy hunting.
Coffee Session: MLOps Systems Engineer
On this podcast, we discussed the intersection of ML and systems engineering with Andrew Dye, a software engineer at Union.ai.
Bridging the Gap
Andrew's background is in low-level systems engineering, building firmware for chips with custom silicon and custom instruction sets.
He took his gateway pill into the ML space at Microsoft while working on HoloLens. It required him to work across the stack, from real-time tracking hardware to complex AR algorithms.
At Meta, he worked on the distributed training team, which later became the AI infrastructure organization. It was the perfect mash-up between ML and systems-based work.
The scale of application at Microsoft vs. Meta
The different applications required different approaches.
VR requires a lot of local processing, and latency is critical, so it must be optimized at every layer. Keeping positional updates fast is essential so the renderings can be refreshed in time.
The silicon that HoloLens was built on had several cores, and explicit cache management was used to pass messages across cores without incurring latency penalties.
The scale at Meta was bigger. In the classic sense, it involved bigger models, more complexity, and more compute. For context, the problems at Meta were more or less state of the art, like training on the ImageNet-1k dataset in an hour using 256 P100 GPUs... Yeah, that kind of crazy stuff.
Distributed Training Challenges
Distributing training across multiple devices has become a necessity as model sizes and computation have increased over time. Scheduling and resource constraints are significant concerns from the developers' perspective.
Knowing how many GPUs you can access, how to schedule fairly when a huge training job wants tons of GPUs, and how to optimize their use for maximum overall efficiency are some of the concerns that come to mind.
One interesting solution is a scheduling algorithm that lets the small jobs train first and only kicks off the bigger jobs once enough resources are free. It requires a lot of cooperation with the training stack so that training progress isn't wasted.
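To make that idea concrete, here is a toy sketch of such a scheduler, not the actual system discussed on the podcast. The Job and ToyGangScheduler names are made up for illustration: small jobs are admitted right away, a big job waits until its full GPU gang fits, and jobs are expected to checkpoint so nothing is lost when GPUs are handed back.

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Job:
    name: str
    gpus_needed: int  # gang requirement: the job needs all of these at once

@dataclass
class ToyGangScheduler:
    total_gpus: int
    free_gpus: int = field(init=False)
    queue: deque = field(default_factory=deque)
    running: list = field(default_factory=list)

    def __post_init__(self):
        self.free_gpus = self.total_gpus

    def submit(self, job: Job):
        self.queue.append(job)

    def schedule(self):
        """Admit queued jobs in order, but only if their full gang fits."""
        still_waiting = deque()
        while self.queue:
            job = self.queue.popleft()
            if job.gpus_needed <= self.free_gpus:
                self.free_gpus -= job.gpus_needed
                self.running.append(job)
                print(f"started {job.name} on {job.gpus_needed} GPUs")
            else:
                # The big job waits; smaller jobs behind it may still fit.
                still_waiting.append(job)
        self.queue = still_waiting

    def finish(self, job: Job):
        """A job completed (or checkpointed and exited); reclaim its GPUs."""
        self.running.remove(job)
        self.free_gpus += job.gpus_needed
        self.schedule()  # freed GPUs may let a waiting big job start

# Usage: the small job starts immediately, the 256-GPU job waits its turn.
sched = ToyGangScheduler(total_gpus=256)
small = Job("resnet-sweep", gpus_needed=32)
big = Job("imagenet-1h", gpus_needed=256)
sched.submit(small)
sched.submit(big)
sched.schedule()     # only the small job starts
sched.finish(small)  # GPUs freed -> the big job is admitted
```

In a real training stack, "finish" would be driven by checkpointing and preemption hooks rather than a simple method call, which is exactly the cooperation the podcast pointed to.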
At this meet-up, it was really exciting to have Steven Fines, Sr. Principal ML Architect at CoreLogic, share his perspective on the reasons and motivation behind MLOps.
Unifying framework vs. Production
Without some unifying idea and structure, machine learning models become an operational and compliance headache in production. The same unifying idea applies to complex analytics and non-ML systems like standard statistical learning.
For ML models, compliance concerns include the following:
Uniformly monitoring predictive model performance.
Legal allowances to use the data for that model.
Data handling and automated decision-making regulations, which depend on the usage and industry.
Operational concerns include:
The nightmare of pipelines or manual processes once enough ML products are in production.
High onboarding costs of operations.
Resourcing support for the models, since a model's lifecycle crosses many different domains.
What's the point of MLOps
MLOps is a conceptual framework for developing processes and tooling to support the creation and delivery of ML products.
Viewing MLOps from a practitioner's perspective, it is critical to recognize that models are viewed in one of two lights: they are either revenue-producing (i.e., profit-based products) or cost-reducing (i.e., internal tools for managing business costs). When a model falls cleanly into one of those buckets, it is easier to know why to invest in MLOps.
MLOps is needed as the number of models in production increases, not as team size increases.
Model prototyping and experimenting are crucial parts of the model development journey, where signals are extracted from data and new code is created. To keep track of all the chaos within this phase, MLflow comes to help us. This blog post discusses a possible Python SDK implementation to help your data science team keep track of all the model's experiments, saving everything from code to artifacts to plots and related files.
Stefano Bosisio, Machine Learning Engineer at Trustpilot, helps data science teams have a smooth journey from model prototyping to model deployment. In this article, he shows how your data science team could enjoy experimenting with new models with an MLflow SDK.
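For a flavor of what that kind of tracking looks like, here is a minimal sketch using MLflow's standard tracking API rather than Stefano's actual SDK; the experiment name, helper function, parameters, and files are purely illustrative.

```python
import mlflow
import matplotlib.pyplot as plt

def log_experiment(experiment_name, params, metrics, figure=None, artifacts=None):
    """Log one training run: hyperparameters, metrics, an optional plot, and files."""
    mlflow.set_experiment(experiment_name)
    with mlflow.start_run():
        mlflow.log_params(params)    # e.g. hyperparameters for this run
        mlflow.log_metrics(metrics)  # e.g. validation scores
        if figure is not None:
            mlflow.log_figure(figure, "plots/training_curve.png")
        for path in artifacts or []:
            mlflow.log_artifact(path)  # model files, configs, notebooks, etc.

# Usage with a hypothetical run (the files and values here are made up):
fig, ax = plt.subplots()
ax.plot([0.9, 0.6, 0.4, 0.3])  # pretend training loss curve
log_experiment(
    experiment_name="churn-model",
    params={"model": "xgboost", "max_depth": 6, "n_estimators": 300},
    metrics={"val_auc": 0.87, "val_logloss": 0.31},
    figure=fig,
)
```

Wrapping the logging in a small helper like this is the kind of convenience an in-house SDK provides: data scientists call one function per run instead of remembering the individual MLflow calls.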
Thanks for reading. This issue was written by Nwoke Tochukwu and edited by Demetrios Brinkmann and Jessica Rudd. See you in Slack, YouTube, and podcast land. Oh yeah, and we are also on Twitter if you like chirping birds.