
Happy Thanksgiving to all the 'Mericans in the community. So much cool stuff is happening these days it's hard to keep up. That's why we're here with the weekly newsletter to make sure you don't miss a damn thang.

Brought to you by cheesy stock photos. 🧀

Past Meetup
Great Expectations
James Campbell, the CTO of Superconductive (Great Expectations), came on the meetup last week to talk about Durable Data Discovery.

What does it mean to be a Data Asset? It's the sweet spot where the data you have collected overlaps with the operational purposes you are pursuing. The way you look at the data may differ depending on the operational purpose you are using it for. Not to mention, data assets bring inherited assumptions with them.

What makes a batch? How can you properly define a batch?

My favorite moment of the meetup was when James started talking about how attention is what makes a batch.

Huh?

A batch of data correlates with when you look at it, i.e. when you run a pipeline, the state of the warehouse at the moment you look might be the batch you are after, as opposed to the conventional monthly batch for a dataset.
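To make that concrete, here is a toy illustration of ours (the table and column names are invented, not from the talk), contrasting the calendar-defined batch with the attention-defined one:

```python
import pandas as pd

# A tiny stand-in for a warehouse table.
events = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "amount": [10.0, 25.0, 7.5, 3.2],
    "loaded_at": pd.to_datetime(
        ["2021-11-01", "2021-11-15", "2021-11-20", "2021-11-24"]
    ),
})

# 1. The conventional calendar batch: "all of November's data".
monthly_batch = events[events["loaded_at"].dt.to_period("M") == "2021-11"]

# 2. The attention-driven batch: whatever the warehouse held at the
#    moment the pipeline ran -- defined by observation time, not calendar.
run_time = pd.Timestamp("2021-11-21")
observed_batch = events[events["loaded_at"] <= run_time]
```

The two batches answer different questions, which is exactly why the definition has to be made explicit.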

A few larger ideas come through in the conversation. We really need to automate the testing of data. More on this topic below in the reading group section.
Coffee Sessions
The (Real) Future of MLE
Another week, another amazing podcast! Slater Victoroff, the CTO and founder of Indico Data, joined us to discuss ETL for unstructured data, multimodal machine learning, and his perspective on how data programming is the real future of ML engineering (and how Indico makes it happen!)

This was both a theoretical and practical conversation. Slater has been involved with ML since the AlexNet days. He shared wisdom from along the way, especially on how to think fundamentally about data and information flows in machine learning. We discussed some of the opportunities and flaws of synthetic data and active learning. Both techniques can make models a lot better, but the architecture of the data systems that enable them is crucial to realizing their potential.

On the practical side, Slater walked us through how Indico deals with the ETL challenges of representing unstructured data. We jammed on how data engineering causes more of the challenges in machine learning than modeling does. Representing data in a flexible fashion is crucial to Indico's unconventional modeling solutions, which take modeling out of the data scientist's hands alone and put subject matter experts in a position to build and deploy their own models. Sound fascinating? It really is, and we got into the nitty-gritty of how Indico does this with a sample use case.

Thanks to Slater for joining us and being such a thought-provoking guest! Definitely listen to this session to get some knowledge dropped.

Till next time,
Vishnu
Guest Wisdom
mlctl
"mlctl is our take on how to approach this tooling sprawl and provide a common interface to data scientists and other ML practitioners." - Srivathsan Canchi

With the explosion in tools and opinionated frameworks for machine learning, it's very hard to define standards and best practices for MLOps and ML platforms.

Drawing on their experience building AWS SageMaker and Intuit's ML platform, respectively, Alex Chung and Srivathsan Canchi spoke with @Demetrios and @Vishnu in August about navigating "tooling sprawl". They discussed their efforts to solve this problem organizationally with Social Good Technologies and technically with mlctl, the control plane for MLOps.

Current Meetup
Monitoring!

Model Monitoring: The Million Dollar Problem
If you have listened to the podcast or meetup more than once, you have probably heard Demetrios talk about monitoring and the infamous story from one of our first meetup presenters, Flavio Clesio. The story goes something like this: Flavio's team was getting all the right signals from their monitoring platform, so they thought things were hunky-dory.

Turns out they were wrong. For 18 days their recsys was showing the same recommendation to every single person that visited the webpage. 18 days. Hundreds of thousands of dollars in estimated losses.
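For flavor, here is a minimal, entirely hypothetical sketch (not Flavio's actual stack) of the kind of output-level check that catches this failure mode: alert when the diversity of served recommendations collapses.

```python
import math
from collections import Counter

def prediction_entropy(recommendations: list[str]) -> float:
    """Shannon entropy of served recommendations; a value near zero
    means (nearly) every user is seeing the same item."""
    counts = Counter(recommendations)
    total = len(recommendations)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical daily check over the day's served recommendations.
served_today = ["item_42"] * 1000  # a degenerate recsys output
if prediction_entropy(served_today) < 0.5:
    raise RuntimeError("Recommendation diversity collapsed -- investigate!")
```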

We don't want that to be you. Hence the talk from James last week, and why the next two sessions are going to be about monitoring. Let's finish the year off strong and make sure these war stories become less and less frequent.

In this Meetup, four of Loka’s MLOps experts will provide a thorough overview of model monitoring, the benefits for your business, and reference architectures to get you started.

The crew will also provide an introduction to several managed and open source solutions including SageMaker Model Monitor and Great Expectations. Bonus material includes access to a hands-on exercise for learning SageMaker Model Monitor.


As we know, model and data monitoring is a crucial part of a production ML system — a lot can go wrong: model drift, data anomalies, and upstream data or processing failures. And mistakes can be costly if ML drives mission-critical systems like recommendation systems, credit approval, or fraud detection, easily meaning millions of dollars lost for large organizations.

So, why do so few companies have ML monitoring fully implemented? Don't be that company.


Sub to our public calendar or click the button below to jump into the meetup on Wednesday at 10am PST/5pm BST.
Reading Group
Object Design Style
In the last reading group session, Tim Blazina and Laszlo Sragner joined us to discuss the impact of SWE practices on ML projects. The session helped us go into further detail on many ideas. For example:

Why do we actually need tests? How can we apply them to ML?
Being in control of your code is a very good place to be; otherwise you won't feel safe. Jeff Bezos has a concept of type 1 and type 2 decisions: type 1 decisions are the ones you cannot come back from, and type 2 decisions you can. By testing and using version control you can convert type 1 decisions into type 2, since you can undo your changes and get a clear signal about whether you broke something. Regarding ML, keep a small set of data to test the correctness of your code (e.g. preprocessing, feature engineering), and always have an anti-pipeline for this. This can be useful as an early-on structure.
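As a minimal sketch of that idea (the function and the tiny fixture below are invented for illustration, not from the session), a feature engineering step can be pinned down with an ordinary unit test over a handful of hand-checkable rows:

```python
import pandas as pd

def add_spend_per_visit(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical feature engineering step: average spend per visit."""
    out = df.copy()
    out["spend_per_visit"] = out["total_spend"] / out["visits"].clip(lower=1)
    return out

def test_add_spend_per_visit():
    # A tiny, hand-checkable batch of data, small enough to reason about.
    fixture = pd.DataFrame({"total_spend": [10.0, 0.0], "visits": [2, 0]})
    result = add_spend_per_visit(fixture)
    assert result["spend_per_visit"].tolist() == [5.0, 0.0]
```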

What about data? How do we manage data?
Regarding preprocessing and feature engineering: apply Great Expectations or dbt, and get notified immediately via assertions on things you can still control, like DB schema changes, sanity checks, and missing value handling.
Regarding production: monitoring. For example, monitor the mean, standard deviation, moving averages, moving standard deviations, and missing-value rates of your features. Be aware of feature drift and pay close attention to changes in the data distribution. Also, model performance decreasing over time is where traditional SWE testing breaks down; that is a data science/analysis problem. You can try to put automated tests on it, but it is something that requires analysis from a DS.
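As a rough illustration of that kind of monitor (the thresholds, names, and numbers here are made up for the example), a basic check can compare live feature statistics against a training-time baseline:

```python
import pandas as pd

def check_feature_drift(
    live: pd.Series, baseline_mean: float, baseline_std: float,
    max_z: float = 3.0, max_missing: float = 0.05,
) -> list[str]:
    """Flag simple drift signals: a shifted mean or too many missing values."""
    alerts = []
    missing_rate = live.isna().mean()
    if missing_rate > max_missing:
        alerts.append(f"missing rate {missing_rate:.1%} exceeds {max_missing:.0%}")
    z = abs(live.mean() - baseline_mean) / max(baseline_std, 1e-9)
    if z > max_z:
        alerts.append(f"mean shifted by {z:.1f} std devs from the baseline")
    return alerts

# Hypothetical usage: the baseline stats come from the training set.
todays_amounts = pd.Series([10.0, 12.5, None, 11.0, 250.0])
print(check_feature_drift(todays_amounts, baseline_mean=11.0, baseline_std=2.0))
```

In practice you would track moving averages over a window rather than a single day, but the shape of the check is the same.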

Main takeaways from Object Design Style Guide?
  • use abstractions when you are dealing with external services, and apply dependency inversion to them to increase robustness (see the sketch after this list)
  • distinguish service objects (which perform a task or retrieve information) from entities
  • distinguish private methods from public methods, which can implement an interface and be used by external classes
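Here is a minimal Python sketch of the first two points, with invented names (the book's own examples are in PHP): the caller depends on an abstraction over the external service, so the real client can be swapped for a stub in tests:

```python
from abc import ABC, abstractmethod

class FeatureStore(ABC):
    """Abstraction over an external service; callers depend on this,
    not on any concrete vendor client (dependency inversion)."""

    @abstractmethod
    def get_features(self, user_id: int) -> dict[str, float]: ...

class InMemoryFeatureStore(FeatureStore):
    """Stub used in tests; a production class would wrap the real service."""

    def __init__(self, data: dict[int, dict[str, float]]):
        self._data = data

    def get_features(self, user_id: int) -> dict[str, float]:
        return self._data[user_id]

class Scorer:
    """Service object: performs one task and holds no domain state itself."""

    def __init__(self, store: FeatureStore):
        self._store = store

    def score(self, user_id: int) -> float:
        features = self._store.get_features(user_id)
        return sum(features.values())  # stand-in for a real model call

scorer = Scorer(InMemoryFeatureStore({7: {"age": 0.25, "tenure": 0.75}}))
print(scorer.score(7))  # 1.0
```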
Extras
The Community Got Me A Job
When Adria presented his project ProductizeML to the MLOps community alongside Demetrios earlier this year, he started interacting with a lot of other community members, discussing various collaboration opportunities.

Among these members was the CTO and founder of Lakera AI, a startup in Zurich that redefines how we develop, test, and operate safety-critical AI systems.

Adria was one of Lakera’s first product testers, giving valuable feedback to the team–initially all through the MLOps community workspace. After iterating through various product versions together, both Adria and Lakera decided to join forces!

Adria became Lakera’s first software engineering hire: check out this post for proof! Looking back, Adria believes that the MLOps community provides great value through its members, their combined knowledge, and the chance to find lasting professional and personal relationships. He’s still a daily community visitor and contributor.

If you want to learn more about Lakera AI, feel free to reach out directly. They’re also hiring, so maybe there is another success story in the making?
Best of Slack
Jobs
See you in Slack, YouTube, and podcast land. Oh yeah, and we are also on Twitter if you like chirping birds.


