Absolute pleasure and honor to learn from Mr. Eugene Yan last week about the system designs of many different companies.
In the madly efficient way that only Eugene can, he abstracted common principles from all the papers. What he presented to us was an x-y plot, as seen in the picture above. He then placed a few core pieces on the graph to help us "land".
After breaking down his framework, Eugene walked through various system designs from some of the top companies. He then brought the diagrams back to his original description, back to home base.
I am so thankful to have people like Eugene around who can see these patterns and articulate them to the rest of us in a
digestible way. If you have not checked out the video, jump on it.
Side note - I am currently in Portugal and brought my nice sound equipment with me, only to realize after the meetup that I had my Zoom audio input set to my earpods... luckily for you, I did minimal talking this session. Still, sorry in advance.
How to become an MLOps Engineer
This week, we had the pleasure of talking to Salwa Nur Mohammed, the CEO of FourthBrain, which is a dev-focused bootcamp that helps professionals learn about MLOps. We get a lot of requests about how to learn MLOps and this conversation focused exactly on that!
You can check out the FourthBrain curriculum here. In the conversation, we talked about how fast MLOps moves, which poses challenges both to learners and teachers.
Salwa shared her perspective on how FourthBrain and all learners can keep their education strategy fresh enough for the current zeitgeist. Furthermore, Salwa, Demetrios, and I
talked about principles of effective learning that are important to keep in mind while embarking on any educational journey.
This was a great conversation with a lot of practical tips, and I hope you all give it a listen!
Till next time, Vishnu
Another Monitoring Tool?
I caught up with Karel Vanhoorebeeck about the monitoring tool their team just released. This space is getting really crowded, which is an obvious signal to me that it's a hard problem to solve and there is a lot of demand for it.
So what was the inspiration for this open source project? Karel told me a lot can go wrong post-deployment with ML systems, mainly because things go wrong or change in the data generation process. For example, the ML team may have overlooked something during model training, or may have introduced a bug during integration and deployment. Or someone or something may have changed the data generation process itself.
Some examples of how he has seen this happen in the past:
- Camera shift due to vibrations. Dust collecting on the lens.
- Data drift due to onboarding new customers on the other side of the world.
- New Android version with more privacy features leading to different data distributions.
- Some other dependency may have changed, leading to faulty data e.g. API change, outages, ...
The Why
"The reason we founded Raymon was because we were frustrated by the lack of tooling to handle ML systems post deployment.
"Most teams set up different DevOps tools like Elastic stack to collect logs and grafana to log metrics.
Combined that with some custom in-house developed tooling for data inspection, visualization, and troubleshooting.
"What we consider really important is easy to understand distance metrics between distributions and tunability, on which you can build flexible alerting schemes.
"These tools are often set up as an afterthought or something to tick off the to-do list. Not much thought goes into how useful they actually are, how easy they are to work with and how exactly they will help an engineer with troubleshooting.
"Only when production issues start occurring people notice they lack this or that functionality and either ignore the tooling and write custom code to debug, or they gradually improve and patch up the tooling.
"We have a broader focus where teams can log all relevant data and metadata related to a model prediction, like pre and post processing information and visualisations. This is especially useful for explainability
information and richer data types like computer vision or sensor data where an image tells more than a thousand words.
"Stitching together and building your own troubleshooting and monitoring tooling takes a lot of time, requires a lot of conceptual scoping work, and unless you really get the time and budget to work on it, the usefulness and user friendliness will be... bad.
"We’re building an observability hub that is basically a place to log all kinds of information that could be relevant for the ML team. This can be the raw data received for a prediction request for example, or it can be the data after a few preprocessing steps, it can be the model output, model confidence, processing times, data after post processing, and so on. All this information is relevant for a DS, so all this information should end up in one integrated platform."
Where Does The Open Source Library Come In?
"It's a toolbox to collect metrics about your data and model predictions into profiles (ModelProfiles we call them). These profiles can then be used to validate data in production, which generates all kinds of data health metrics that can be logged to the Raymon hub. Using these profiles, Raymon knows what all these metrics mean and can auto-configure monitoring. Next to that, the library also allows you to log all kinds of data to our hub.
"The metrics that are collected in a profile can be anything really. The simplest case is probably if you would use them to track input feature distributions and health. We currently offer support for a (limited) set of metrics that we’ve found useful so far in structured data and computer vision. We are now working to support more metrics for more data
domains. All suggestions and ideas are most welcome!"
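To make the profile workflow concrete, here is a minimal, self-contained sketch of the idea behind a ModelProfile as Karel describes it: capture per-feature statistics from training data, then validate production inputs against them and emit health metrics that can be logged to a hub. The class and method names below are illustrative, not Raymon's actual API; see the project's docs for the real interface.

```python
# Illustrative sketch of the "profile" idea: capture per-feature statistics at
# training time, then validate production inputs against them. The names here
# are hypothetical, not Raymon's actual API.
import numpy as np
import pandas as pd

class FeatureProfile:
    """Per-feature statistics captured from training data."""
    def __init__(self, values):
        self.min = float(np.min(values))
        self.max = float(np.max(values))
        self.mean = float(np.mean(values))
        self.std = float(np.std(values))

    def validate(self, value):
        """Simple health metrics for a single production value."""
        return {
            "out_of_range": not (self.min <= value <= self.max),
            "zscore": abs(value - self.mean) / (self.std + 1e-9),
        }

class ModelProfileSketch:
    """Bundle of per-feature profiles built once, e.g. at training time."""
    def __init__(self, train_df: pd.DataFrame):
        self.features = {col: FeatureProfile(train_df[col].to_numpy())
                         for col in train_df.columns}

    def validate(self, row: dict) -> dict:
        """Validate one production input and return health metrics per feature."""
        return {col: prof.validate(row[col]) for col, prof in self.features.items()}

# Example usage: build the profile from training data, then validate each
# production request and ship the resulting metrics to your monitoring hub.
train_df = pd.DataFrame({"age": [23, 31, 45, 52], "income": [28_000, 42_000, 61_000, 75_000]})
profile = ModelProfileSketch(train_df)
health = profile.validate({"age": 34, "income": 150_000})
print(health)  # income flags as out of range; in production, log and alert on this
```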
Engineering Best Practices For Machine Learning
The increasing reliance on applications with ML components calls for mature engineering techniques to ensure these applications are built in a robust and future-proof manner. Moreover, the negative impact that improper use of ML can have on users and society is now widely recognized, and policymakers are working on guidelines aimed at promoting trustworthy development of ML (see the newly proposed EU regulation).
To address these issues, we mined both academic and non-academic literature and compiled a catalog of engineering best practices for the development of ML applications. The catalog was validated with over 500 teams of practitioners, which allowed us to extract valuable information about practice difficulty and the effects of adopting the practices.
Alex Serban will give an overview of his findings, which indicate, for example, that teams tend to neglect traditional software engineering practices, or that effects such as traceability and reproducibility can be accurately predicted from assessing practice adoption.
Moreover, Alex will present a quantitative method to assess a team’s engineering ability to develop software with ML components and suggest improvements for your team’s processes.
Alex Serban works at the intersection of machine learning and software engineering, looking for ways to design,
develop and maintain robust machine learning solutions.
Since robustness has broad implications at each stage of the development life cycle, Alex studies it both from a systems (engineering) perspective and from an algorithmic perspective.
If you haven't heard already, we have a public calendar you can subscribe to. Otherwise, see you tomorrow at 9am PST/5pm BST by clicking on the link below.