IMPORTANT!!! Daylight saving time hits Europe before the US, so the meetup will be an hour later for the American folks this week, i.e. at 10am PT.

Reading group is back again this Friday!! More details below.

Coffee Session
Tests
You may have caught this one already in the Mega-Ops newsletter that went out on Sunday. In case you missed it: tests get a bad rap.

I hear you. They are boring. They are a waste of time.

What if I just throw out the project? I spent all this time building so many damn tests!


As Svet put it: "One of the biggest challenges we are facing is not having enough meaningful and grounded discussions about testing. Ultimately the ML folks (of course myself included) have the mindset of 'Let's deploy our model and just accept the fact that it will fail'."
Meetup
KFServing -> KServe Tutorial

Special Guest Host

As much as I like to bag on Alexey for copying the MLOps community all the time in his DTC community, I gotta hand it to him, he is a great co-host.

Theo, aka the world's biggest Kubeflow enthusiast, went deep on KServe last week. In case you missed it, KServe is the new name of the rebranded KFServing project.

Not only did we talk about the rebrand, but Theo also did some live coding while Alexey grilled him on all the details.
Guest Wisdom
How to Learn
Eugene Yan, who needs no introduction, takes us through the Why, What, and How of Online System Design for Recommendations and Search.

After breaking down his understanding, Eugene walks us through various system designs from some of the top companies, like Alibaba, Facebook, DoorDash, and YouTube, and standardizes them into a digestible presentation.

If that wasn't enough, he also gives a quick guide on how to create an MVP!!


If you have not checked out the video yet, jump on it!
Current Meetup
MLOps at Volvo
Vruuuum!

After spending most of his career as a full-stack Data Scientist/ML Engineer, Leonard Aukea has shifted focus towards MLOps and is currently driving Machine Learning Engineering and Operations at Volvo Cars, focusing in particular on how to effectively reap its benefits at enterprise scale.

Leonard will introduce the Volvo Cars ML stack and related work on stitching these services together to reduce friction in the ML value stream. He will also get into some of the learnings along the way and share general thoughts on what it takes to lay a solid ML foundation in a company like Volvo Cars.


Sub to our public calendar or click the button below to jump into the meetup on Wednesday at 10am PST/5pm BST

Reading Group
CI for ML
More Test Talk

The paper for this week's reading group is Building Continuous Integration Services for Machine Learning (Karlaš et al. 2020).

This paper takes Continuous Integration (CI), which has become "a de facto standard for building industrial-strength software" (Karlaš et al. 2020), a step further by applying it to Machine Learning.

The paper is motivated by a core difficulty of ML testing: every time a model iteration is evaluated on the holdout test set, information about that set leaks back into development, which leads to overfitting the test set and to measured scores that diverge from the model's real performance.

An obvious solution is to sample a fresh, independent test set each time a new model version is evaluated. However, labelled data is expensive to collect, so this quickly becomes a cost problem.
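
To build intuition for why reusing one holdout set is risky, here is a toy simulation (our own illustration, not an experiment from the paper): 200 "model iterations" that are pure noise get scored on the same fixed test set, and we keep the best score. The winner looks better than chance even though every candidate's true accuracy is exactly 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed holdout test set with balanced binary labels.
n_test = 1_000
y_test = rng.integers(0, 2, size=n_test)

# Simulate 200 "model iterations" that are pure noise, so the true
# accuracy of every candidate is exactly 0.5, whatever the test set says.
best_test_acc = 0.0
for _ in range(200):
    predictions = rng.integers(0, 2, size=n_test)
    test_acc = (predictions == y_test).mean()
    best_test_acc = max(best_test_acc, test_acc)  # select on the reused test set

print(f"best measured test accuracy: {best_test_acc:.3f}")  # typically ~0.53-0.55
print("true accuracy of every candidate: 0.500")
```

The gap between the selected model's measured score and its true performance is exactly the kind of silent overfitting the paper's CI service is designed to guard against.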

To mitigate this issue, Karlaš et al. (2020) propose an ML development lifecycle with three roles:

  1. Data curator - in charge of providing new test data to this lifecycle
  2. Developer - in charge of iterating on ML models by implementing new models, training, and tuning them
  3. Manager - in charge of defining the test applied to the ML model (and its minimum acceptable quality), monitoring the data currently available for the test set, and setting the maximum number of runs a test set can have before being replaced.

In addition, Karlaš et al. (2020) provide the mathematical reasoning behind:
  1. the maximum number of evaluation runs for a test set
  2. the test set size required to guarantee, with a given probability, that our evaluation metric estimate (e.g. accuracy, F1-score) is within a chosen error tolerance of the true performance (a back-of-the-envelope sketch follows below).
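
As a rough illustration of the second point (our own back-of-the-envelope sketch, not necessarily the paper's exact derivation), a textbook Hoeffding inequality combined with a union bound over the planned number of evaluations already gives a usable estimate of how much labelled data needs to be budgeted:

```python
import math

def required_test_set_size(epsilon: float, delta: float, num_evaluations: int) -> int:
    """Hoeffding bound + union bound over all planned evaluations.

    Returns the number of labelled test examples needed so that, with
    probability at least 1 - delta, each of `num_evaluations` measured
    accuracies stays within +/- epsilon of its true value.
    """
    # Hoeffding: P(|estimate - truth| > epsilon) <= 2 * exp(-2 * n * epsilon**2)
    # Union bound: multiply the failure probability by the number of evaluations.
    return math.ceil(math.log(2 * num_evaluations / delta) / (2 * epsilon ** 2))

# e.g. a 1% error tolerance with 95% confidence over 32 model evaluations
print(required_test_set_size(epsilon=0.01, delta=0.05, num_evaluations=32))
# -> roughly 36,000 labelled examples
```

Numbers like this make concrete why the paper cares about both the maximum number of reuses and the test set size: tens of thousands of fresh labels per test-set rotation is exactly the cost problem described above.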

The authors put this into practice with an experiment showing that, with their framework, the measured test accuracy always stays within the error tolerance of the true test accuracy, whereas with a baseline that keeps the same test set unchanged across model iterations, the measured score diverges from the true score.
Best of Slack
Jobs
See you in Slack, YouTube, and podcast land. Oh yeah, and we are also on Twitter if you like chirping birds.


