Conference season is upon us! Shoutout to those select few of you with overlapping interests. Which one did you decide to go to: re:Invent or NeurIPS?
Coffee Session: Real-time Recommendations
We had a cool podcast with Sasha Ovsiankin, Sr. Software Engineer, and Rupesh Gupta, Sr. Staff Engineer at LinkedIn. They talked about all the new work on recommendations being done at LinkedIn.
Recommendations
With the large amount of information floating around the internet, we rely on search and recommender systems to sift through it all and present the most relevant information to us.
To serve the most relevant information, a recommender system needs to know the user's intent and preferences. Usually, these are not stated explicitly; instead, they are inferred from the history of the user's actions.
Traditionally, there is a delay between when a user takes an action and when that action can be leveraged to adapt recommendations for that user. This is because user activity is typically processed into features periodically, in a batch environment, before being made available to the recommender system.
The outcome is that the recommender system may not be able to serve the most relevant recommendations for the user's current intent.
Near Real-time Features
Near real-time recommendation can be achieved by computing features from the user's recent actions and then using these features in a machine learning model. These features capture the information contained in the actions the user has recently taken.
So rather than using the total number of times an action occurred as a feature, the actions are summarized into a vector of numbers (an embedding) that serves as the feature.
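As a minimal sketch of this idea (illustrative only, not LinkedIn's actual implementation; the action types and embedding values below are made up), each recent action can be mapped to a learned embedding, and the embeddings pooled into a single feature vector:

```python
import numpy as np

# Hypothetical lookup table mapping action types to learned embeddings.
ACTION_EMBEDDINGS = {
    "click_job_posting": np.array([0.8, 0.1, 0.0, 0.3]),
    "like_post":         np.array([0.1, 0.9, 0.2, 0.0]),
    "search_company":    np.array([0.4, 0.0, 0.7, 0.1]),
}

def summarize_recent_actions(actions: list[str]) -> np.ndarray:
    """Pool the embeddings of a user's recent actions into one feature vector."""
    vectors = [ACTION_EMBEDDINGS[a] for a in actions if a in ACTION_EMBEDDINGS]
    if not vectors:
        return np.zeros(4)  # cold start: no recent activity to summarize
    return np.mean(vectors, axis=0)  # mean pooling; sums or attention also work

# The resulting vector feeds the ranking model alongside the batch features.
user_vector = summarize_recent_actions(["click_job_posting", "like_post"])
```

Because this summary can be recomputed the moment a new action arrives, the model can react to the user's current intent instead of waiting for the next batch job.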
Andrew Jones is a tech lead at GoCardless working across data infrastructure and ML enablement. He talked with us about how to drive ML data quality with the hyped-up "Data Contracts".
GoCardless is a FinTech company based in London. They specialize in recurring bank payments made via Direct Debit or open banking. Some of their customers include DocuSign and the Guardian.
GoCardless has three key ML models in how they provide solutions.
The first one is an internal fraud model. It helps protect GoCardless from being defrauded by people posing as merchants.
'Success plus' and 'Protect plus' are the other two models, and both are part of their product.
'Success plus' is used to diligently retry failed payments at the times when they are most likely to succeed.
'Protect plus' is used to protect merchants from being defrauded by people posing as their customers.
GoCardless Data Platform Problem
One of the main problems with the data platform architecture at GoCardless was that changes made in the upstream Postgres database negatively impacted the downstream consumers. Changes such as schema changes break the ETL pipelines and data transmission, and the consumers of the data then have to manually work out what has changed, how it has changed, and who changed it.
In a nutshell, the data is not built for consumption, because the architecture does not take schema changes and data changes into consideration.
The Better Way
To improve data quality within the data platform, they first needed to work out the qualities of good data. Good data needs to be:
Documented and Discoverable
Versioned, with migrations for breaking schema changes
Reliable
Next, managing data through an explicit API makes it possible to avoid these troubling issues.
Technically, a data contract is an explicit interface between two services, one producing and one consuming data via an API.
It's a fairly stable approach, yet it has the ability to evolve over time.
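As a minimal sketch of what such a contract can look like (illustrative only, not GoCardless's actual implementation; the event name and fields are hypothetical), the contract can be a documented, versioned schema that the producer validates against before anything reaches consumers:

```python
from jsonschema import validate  # pip install jsonschema

# Hypothetical contract for a "payment_created" event. The schema is the
# explicit interface: documented, versioned, and owned by the producer.
PAYMENT_CREATED_V2 = {
    "$id": "events/payment_created/2",  # version carried in the identifier
    "description": "Emitted when a merchant creates a payment.",
    "type": "object",
    "properties": {
        "payment_id": {"type": "string"},
        "amount_pence": {"type": "integer", "minimum": 0},
        "currency": {"type": "string", "enum": ["GBP", "EUR", "USD"]},
        "created_at": {"type": "string", "format": "date-time"},
    },
    "required": ["payment_id", "amount_pence", "currency", "created_at"],
    "additionalProperties": False,  # unexpected fields fail at the producer
}

def publish(event: dict) -> None:
    # Validate against the contract before publishing. A breaking change now
    # requires a new contract version and a migration, rather than a silent
    # change to the upstream Postgres tables that breaks consumers.
    validate(instance=event, schema=PAYMENT_CREATED_V2)
    ...  # hand off to the message bus
```

The contract, not the internal database schema, becomes the thing consumers depend on, which is exactly the decoupling the old architecture lacked.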
We had a great part-2 meetup session with Hamza Tahir, co-creator of ZenML. He went through the basics of ZenML and the launch of the ZenML Server and Dashboard, and showed how easy it is to switch your MLOps stack from a local setup to different cloud environments with ZenML.
ZenML's Itch
One of the biggest problems when deploying a machine learning model is that there are so many tools to pick from. There isn't always good guidance on putting the model into production with the available tools.
The ultimate goal of ZenML is to be a robust, production-ready, end-to-end MLOps framework. It is a flexible, modular, standardized system that integrates with the wide spectrum of open-source MLOps tools.
Friction between the different personas involved at different stages often breaks the process of bringing ML into production.
ZenML standardizes the production process by creating a unified interface. It enables a single source of truth that everyone can rely on for the entire process.
ZenML is an extensible open-source framework to create a unified interface for MLOps pipelines.
ZenML Pipelining Workflow
ZenML pipelines use Python functions with decorators to define workflows.
A ZenML stack defines which infrastructure and tooling ZenML runs on, as well as how and where pipelines run. The stack comprises different components such as an orchestrator, experiment tracker, model deployer, and artifact store. It abstracts the infrastructure configuration from the pipeline, enabling you to define configuration outside the code.
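As a minimal sketch of this workflow (assuming a recent ZenML release; exact import paths vary by version), steps and pipelines are plain decorated Python functions, and the active stack decides where they actually run:

```python
from zenml import pipeline, step

@step
def load_data() -> list[float]:
    # Each step is a plain Python function; ZenML tracks its inputs and outputs.
    return [1.0, 2.0, 3.0]

@step
def train_model(data: list[float]) -> float:
    # Stand-in for real training: the returned value is stored as an artifact.
    return sum(data) / len(data)

@pipeline
def training_pipeline():
    # The pipeline wires steps into a DAG; the stack decides where it executes.
    data = load_data()
    train_model(data)

if __name__ == "__main__":
    training_pipeline()
```

Switching the same pipeline from a local run to a cloud environment is then a matter of activating a different stack, not rewriting the pipeline code.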
Kian Kenyon-Dean (Senior Machine Learning Engineer), Jake Schmidt (Machine Learning Engineer), John Urbanik (Staff Machine Learning Engineer), Ayla Khan (Senior Software Engineer), Jess Leung (Associate Director, Machine Learning Engineering), and Berton Earnshaw (Machine Learning Fellow) at Recursion are solving problems in the pharmaceutical space (decoding biology) in style, with AI/ML.
Drug discovery is time-consuming, difficult, and expensive. It also has a high failure rate. This post explains some of the unique challenges Recursion faces in operationalizing deep learning to build maps of human cellular biology to develop biological insights and accelerate drug discovery.
This post was written in collaboration with our sponsors from Union: Samhita Alla, Software Engineer & Tech Evangelist at Union.ai.
It’s no secret there has been a sharp rise in demand for machine learning as organizations innovate new products that make complex use of data. To keep up with the high volume of ML workflows, ML teams and companies are scrambling to find ways to iterate on and maintain their pipelines collectively.
Data science and machine learning require analysis at every intermediate step of the pipeline, along with data validation and resource optimization, outstripping the capabilities of typical DevOps life cycles. To keep ML pipelines reliable, Flyte makes it painless to orchestrate them at scale. In this article, we'll consider how Flyte enables orchestrating ML pipelines with infrastructure abstraction.
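As a minimal sketch of what that abstraction looks like (illustrative only; the task names and resource values are hypothetical), a Flyte workflow is a typed Python DAG whose infrastructure needs are declared alongside the code:

```python
from typing import List

from flytekit import Resources, task, workflow

@task(requests=Resources(cpu="1", mem="500Mi"))  # per-task resource hints
def clean(raw: List[float]) -> List[float]:
    # Flyte can version, cache, and retry each task independently.
    return [x for x in raw if x >= 0]

@task
def average(values: List[float]) -> float:
    return sum(values) / len(values)

@workflow
def feature_pipeline(raw: List[float]) -> float:
    # The workflow is a typed DAG; Flyte handles scheduling and data passing.
    return average(values=clean(raw=raw))

if __name__ == "__main__":
    # Runs locally for development; the same code runs on a Flyte cluster.
    print(feature_pipeline(raw=[1.0, -2.0, 3.0]))
```

Because the resource requests and execution plumbing live in decorators rather than in the pipeline logic, the same workflow can move between local and cluster execution without changes.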
Thanks for reading. This issue was written by Nwoke Tochukwu and edited by Demetrios Brinkmann and Jessica Rudd. See you in Slack, YouTube, and podcast land. Oh yeah, and we are also on Twitter if you like chirping birds.