Dream Big, Start Small

What's the role of product management in ML? Olalekan Elesin, the director of data platforms at HRS Group, gave us a few tips.

ML Platforms
Knowing where to start often isn't obvious. Discovery work happens in the early phases of figuring out what to build. There might be an awareness of the problem, but the approach to solving it requires a series of different phases.

For example, some phases might center on understanding the issues data scientists face. Solving these issues early makes it easier to design an ML platform later on.

The tools of the trade are one of the most defining aspects of a technical professional. But let's remember, tools are not all that MLOps is about.

However, understanding the close relationship between data professionals and the tools they use is useful for the product architect. To be fair, is all change for the better? Technical buy-in for new tools can be difficult.

Platform Investments
Taking the leap toward a platform investment can be scary for an organization, especially if there is no predefined metric for judging the platform's efficiency, cost, and optimization. When analyzing tradeoffs, Olalekan spoke about the dangers that arise from not having clear definitions of success.

Platform investments are long-term projects that also try to address short-term needs. Business variables are constantly changing, and the goalposts keep moving. How can you survive? Tie yourself to some number the business cares deeply about. Move the needle.

I'll leave you with this final thought from Olalekan: "Process excellence is just as much excellence as engineering and technical excellence."

Christmas came early for us in the community. The product leads from the DoorDash ML platform handed out sweet, sweet candy.

About DoorDash
Most have heard of the company, which started in 2013. If not, it's an online food delivery platform that makes use of machine learning. Some ML use cases at DoorDash are search recommendations, ads, ETA, forecasting, and fraud.

DoorDash's ML platform design is a mixed bag of vendor solutions, open-source tools, and homegrown gems.

Major engineering resources are put into improving the build-to-deploy processes. One key metric the team thinks about often is ML velocity.

In other words, how fast can a data scientist productionize a model? How fast can they create new features and start serving them to the models? What is the delta between the inception of an idea and its arrival in production?
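To make the "delta" concrete, here is a minimal sketch of one way to put a number on ML velocity: the median number of days from idea to production. The records, model names, and the choice of the median are all assumptions for illustration, not how DoorDash measures it.

```python
from datetime import date
from statistics import median

# Hypothetical launch records: (model name, idea proposed, reached production).
launches = [
    ("model_a", date(2023, 1, 9), date(2023, 2, 20)),
    ("model_b", date(2023, 3, 1), date(2023, 3, 22)),
    ("model_c", date(2023, 4, 3), date(2023, 6, 12)),
]


def days_to_production(idea_date, production_date):
    """Days between the inception of an idea and its arrival in production."""
    return (production_date - idea_date).days


def ml_velocity(records):
    """Median idea-to-production time in days (one possible velocity metric)."""
    return median(days_to_production(idea, shipped) for _, idea, shipped in records)


print(ml_velocity(launches))
```

A median is used here so that one unusually slow project doesn't dominate the number. Tracking the same figure for "time to ship a new feature" would cover the second question above.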

The idea is to abstract the technical process and create a path for an easy workflow that enables users to think more about the business logic.

ML Platform Journey
When the team started their journey, commercial ML tools were nonexistent. The platform team HAD to build their solutions in-house. There was nothing else to choose from in 2015.

Much like Olalekan above, the team needed to speak to its users. So how did they do it? They established a "Machine Learning Council". The council consists of representatives from the data science ecosystem, along with leaders from the search teams that rely heavily on ML.

Frequent round table discussions and lunches were held to align ideas in the right direction. Surveys of key pain points and bottlenecks were given twice a year.

In conversations, the team found that many data scientists liked to talk about the importance of data quality monitoring (DQM). But when the survey results came back, the quantitative data showed DQM wasn't even among the top three most pressing priorities.

Why is building real-time data pipelines in machine learning so challenging?

For most projects, things start off with batch feature engineering. That's cool.

A common stack for batch could be:

A data warehouse for central access to data (e.g. Snowflake or Databricks)
A data modeling tool to express feature logic (e.g. dbt)
A scheduler to orchestrate feature computation (e.g. Airflow)

With this stack, scheduled jobs (~daily) kick off the materialization of feature tables, which can be consumed on demand by ML models.
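Here's a rough sketch of what one of those scheduled jobs boils down to. It is not a real Snowflake/dbt/Airflow setup: SQLite from the Python standard library stands in for the warehouse, and the `orders` table, the `user_order_features` table, and the function names are all made up for illustration. A scheduler would call `materialize_order_features` about once a day.

```python
import sqlite3


def materialize_order_features(conn, as_of_date):
    """Rebuild the (hypothetical) feature table from raw orders up to as_of_date."""
    conn.execute("DROP TABLE IF EXISTS user_order_features")
    conn.execute(
        "CREATE TABLE user_order_features ("
        "user_id INTEGER PRIMARY KEY, order_count INTEGER, avg_order_amount REAL)"
    )
    # The feature logic lives in SQL, much as it would in a data modeling tool.
    conn.execute(
        "INSERT INTO user_order_features "
        "SELECT user_id, COUNT(*), AVG(amount) FROM orders "
        "WHERE order_date <= ? GROUP BY user_id",
        (as_of_date,),
    )
    conn.commit()


def read_features(conn, user_id):
    """On-demand read of the materialized features for one user, or None."""
    return conn.execute(
        "SELECT order_count, avg_order_amount FROM user_order_features "
        "WHERE user_id = ?",
        (user_id,),
    ).fetchone()
```

The point of the pattern: the expensive aggregation runs on a schedule, and consumers only ever read the precomputed table.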

All gravy up to here. Then comes the first hurdle.

Online Inference.

Suddenly, machine learning models need very fast access to feature data. Like, SLAs-of-100-ms fast.

The code you wrote to read features from your data warehouse won’t cut it anymore.

So just throw some Redis at it and you're all good, right? Wrong!

Now you've got standing infra to maintain. S*** fails and you are facing dreaded downtime, my friend.
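To see why the online store becomes something you have to babysit, here's a minimal sketch of a feature lookup that tracks a latency budget and falls back to defaults when the store is down. Everything in it is an assumption for illustration: a dict-backed class stands in for a key-value store such as Redis, and the default values and 100 ms budget are placeholders.

```python
import time

# Hypothetical defaults served when no fresh row can be fetched.
DEFAULT_FEATURES = {"order_count": 0, "avg_order_amount": 0.0}


class InMemoryOnlineStore:
    """Dict-backed stand-in for a key-value store such as Redis."""

    def __init__(self):
        self._rows = {}
        self.available = True  # flip to False to simulate an outage

    def put(self, user_id, features):
        self._rows[user_id] = dict(features)

    def get(self, user_id):
        if not self.available:
            raise ConnectionError("online store unavailable")
        return self._rows.get(user_id)


def fetch_features(store, user_id, budget_ms=100.0):
    """Return (features, status). Status is 'ok', 'slow', or 'fallback'.

    The budget is only measured after the fact; this sketch does not
    cancel a lookup that is already running long.
    """
    start = time.perf_counter()
    try:
        row = store.get(user_id)
    except ConnectionError:
        return dict(DEFAULT_FEATURES), "fallback"
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    if row is None:
        return dict(DEFAULT_FEATURES), "fallback"
    return row, ("ok" if elapsed_ms <= budget_ms else "slow")
```

Even in toy form, the model now depends on a live service, and somebody has to decide what gets served when that service is not there.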

And then, it all really hits the fan when you start to need fresh features.

Why is that hard? Well, the first challenge is simply figuring out where fresh data will come from!

Some common sources are:
Streaming data (e.g. Kafka)
Third-party APIs (e.g. Plaid)
Transactional databases (e.g. Postgres)
Internal APIs managed by other teams
Application context

Simply hunting down all of these sources of data (to match what was available in your data warehouse) makes you reach for the aspirin bottle.

Making it worse, you’ll often need a unique set of tooling to query and transform each of these sources of data into features.
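One common way to keep that sprawl manageable is to hide each source behind the same small interface. The sketch below is a hypothetical illustration, not a reference design: `FeatureSource`, both adapter classes, and the feature names are invented, and real adapters would wrap a stream consumer, an API client, or a database driver.

```python
from typing import Protocol


class FeatureSource(Protocol):
    """Anything that can return a dict of features for a user."""

    def fetch(self, user_id: int) -> dict: ...


class StreamCounterSource:
    """Hypothetical adapter: running counts updated by a stream consumer."""

    def __init__(self):
        self._counts = {}

    def on_event(self, user_id):
        self._counts[user_id] = self._counts.get(user_id, 0) + 1

    def fetch(self, user_id):
        return {"events_seen": self._counts.get(user_id, 0)}


class RequestContextSource:
    """Hypothetical adapter: features that arrive with the request itself."""

    def __init__(self, context_by_user):
        self._context = context_by_user

    def fetch(self, user_id):
        return dict(self._context.get(user_id, {}))


def assemble_features(sources, user_id):
    """Merge features from every source; later sources win on key clashes."""
    features = {}
    for source in sources:
        features.update(source.fetch(user_id))
    return features
```

The interface keeps the calling code simple, but notice it does nothing about the real cost: each adapter still wraps its own tooling that someone has to run.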

Now, congrats, you just got yourself more standing infrastructure to monitor: microservices, stream processing jobs, and transactional databases.

This means more monitoring to build and more on-call burden.

Woohoo! If that doesn't sound like fun, I don't know what does!

It's a great time to be navigating this space. Get out there and have some fun with it, but don't say I didn't warn you.