There is no learning without something to learn and some way to learn

On Thursday the 21st (that's next Thursday, not this one), we are hosting an AMA with Netflix veterans Romain Cledat and Kedar Sadekar. Stay up to date with all our events by subbing to the public cal.

New Tool Tuesday
Deep Clean
OK, I'll be honest: I shared the first glimpse of this tool, LineaPy, a week ago in the community Slack. Let's just say there were a few words exchanged. These words made me wonder: should I write about this tool?

Yes.

Because it's still a novel attempt at doing something different, even if it ignores the fact that you are not building good habits.

I chatted more with Sangyoon Park, the creator of LineaPy, about the motivations behind building this Python package. So here is his take.

The Why
For data science work to generate actual impact, productionization is an essential step. Yet, going from development to production is often a difficult and time-consuming process as it involves engineering efforts outside the primary responsibility of data scientists. Even with dedicated engineers, the process often becomes challenging as engineers do not have the full context behind the data science work passed to them, which comes in a crude form (e.g., long, messy notebooks). This friction drastically reduces the team’s ability to deliver actionable insights in real-time.

Where Does LineaPy Fit In?
LineaPy traces the sequence of every code execution to capture the non-linear, iterative development process in data science. This comprehensive understanding of the code and its context then allows LineaPy to automatically transform the original development code into cleaned-up, production-ready components (e.g., pipeline operators) that can be easily picked up and used by engineers.
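To make the "messy notebook in, cleaned-up code out" idea concrete, here is a toy sketch that keeps only the statements a chosen result depends on. It is not LineaPy's implementation or API: LineaPy traces code as it executes, while this sketch only inspects the source text statically with Python's built-in ast module. The function name slice_for and the sample notebook are made up for illustration.

```python
import ast


def slice_for(source: str, target: str) -> str:
    """Return only the top-level statements needed to compute `target`.

    Hypothetical helper for illustration; a static approximation, not
    the execution tracing that LineaPy performs.
    """
    statements = ast.parse(source).body
    needed = {target}
    kept = []
    # Walk backwards: keep a statement if it assigns a name we still need,
    # then mark every name it reads as needed too. Comprehension variables
    # are treated like ordinary names, so this may keep slightly too much.
    for stmt in reversed(statements):
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        written = {n.id for n in names if isinstance(n.ctx, ast.Store)}
        read = {n.id for n in names if isinstance(n.ctx, ast.Load)}
        if written & needed:
            needed = (needed - written) | read
            kept.append(stmt)
    return "\n".join(ast.unparse(s) for s in reversed(kept))


# A made-up "messy notebook": debug and scratch are dead ends.
notebook = """
raw = [3, 1, None, 7]
debug = len(raw)
clean = [x for x in raw if x is not None]
print(debug)
scratch = sorted(clean)
total = sum(clean)
"""

print(slice_for(notebook, "total"))
```

Asking for total keeps only the statements that define raw, clean, and total, and drops the debugging and scratch lines. Real notebooks also have imports, function calls with side effects, and in-place mutation, which is where a runtime tracing approach has more information than this static one.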

I like the novel approach to solving the messy notebook problem. However, I would be remiss not to mention some of the feedback from the community. So here are some random quotes...

"Because why build good habits when you could just put duct tape over your bad ones....?"

"Just two lines of code - my classic red flag warning"

Maybe this can be a gateway drug for data scientists to learn deeper SWE best practices. Maybe this will be a crutch that could end up crippling a data scientist. Who knows?

I still think it's worth playing around with to make your own decision.

Coffee Session
Clean Learning Bad Data
Recently we had Curtis Northcutt, the CEO & Co-Founder of Cleanlab, on the pod.

OK, so stay with me. This might sound a little inception-y. What is Cleanlab? Welp, it's a data-centric AI tool that uses the data itself to automatically correct the errors that exist within that data. It sounded a little bit like magic at first. Luckily for us, Curtis loves details. He didn't mind going into the nitty-gritty bits.


MLOps vs Data-centric AI: We keep making up new terms in the Machine Learning space as we go. Data-centric AI focuses on fixing up the data to improve the data pipeline, in contrast to MLOps, which is more of a superset that tends to construct operations around both the data and the model.

Nonetheless, these two still play a little tango. Their landscape lies between the realms of industry and academia. But one thing is for sure: the future will sort out their fate.

Why Data-centric AI: There is still much more work to be done around this concept. Theoretically, the approach focuses on improving data quality as a way to better solve machine learning problems, given that there will always be contaminants within any data, no matter how few. The methods around it show significant gains in machine learning performance when changes are made to just the data.
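As a rough illustration of what "fixing just the data" can look like, here is a minimal sketch that flags examples whose given label disagrees with what a model confidently predicts. It is loosely inspired by the per-class threshold idea from confident learning, the research behind Cleanlab, but it is a simplification and not Cleanlab's API. The function name find_label_issues and the toy numbers are assumptions made up for this example.

```python
from collections import defaultdict


def find_label_issues(labels, pred_probs):
    """Return indices whose given label looks suspicious.

    labels: list of integer class ids (possibly noisy).
    pred_probs: list of per-class probability lists from some model.
    Hypothetical helper; a simplified sketch, not a library function.
    """
    # Per-class threshold: the model's average confidence in class k
    # over the examples that are labeled k.
    sums, counts = defaultdict(float), defaultdict(int)
    for y, probs in zip(labels, pred_probs):
        sums[y] += probs[y]
        counts[y] += 1
    thresholds = {k: sums[k] / counts[k] for k in counts}

    issues = []
    for i, (y, probs) in enumerate(zip(labels, pred_probs)):
        # Classes the model predicts at or above that class's threshold.
        confident = [k for k in thresholds if probs[k] >= thresholds[k]]
        if not confident:
            continue  # the model is unsure, so leave the label alone
        best = max(confident, key=lambda k: probs[k])
        if best != y:
            issues.append(i)
    return issues


# Toy data: example 2 is labeled 0, but the model strongly says class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],
    [0.15, 0.85],
    [0.10, 0.90],
    [0.30, 0.70],
]

print(find_label_issues(labels, pred_probs))
```

On this toy data only example 2 is flagged. In practice the probabilities should come from held-out (cross-validated) predictions, otherwise the model may simply have memorized the noisy labels.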

How should ML engineers think of it: The most important thing to remember is that this approach reminds us that Machine Learning isn't all about math and logic. Having ML engineers in the loop of these machine learning systems is really important. That is why machine learning just enhances our mundane capabilities and does not replace them. After all, we are still the authors of this data.

Coffee Session
To Kubeflow, or Not to Kubeflow
It's not every day you come across a Kubeflow fan. Ryan Russon, the MLOps and Data Science consulting manager at Maven Wave Partners, happens to be a huge fan of Kubeflow.

All About Kubeflow: We tend not to get a lot of yays in a room where Kubeflow is mentioned. But it all comes down to perspective and use case. In the same way that different tools are better options for specific scenarios, "Kubeflow is not the right tool for everybody".

It was natively built for Kubernetes and, as such, it works well in a Kubernetes-forward setting. This seems to be why it isn't easily embraced: the on-ramp isn't exactly straightforward. There are a lot of hurdles to overcome when working with Kubernetes, from security to networking.

Kubeflow in the ML tooling ecosystem: If you look at it critically, Kubeflow brings a pretty interesting flavor to the ML tooling table. It fits the same mold as the full end-to-end ML infra that vendor tools like Vertex AI, SageMaker, and Azure ML provide, but in an open-source form. It provides customizable flexibility around exploration, orchestration, and serving, but without vendor lock-in. This enables portability of MLOps infrastructure across different cloud platforms. Talk about a value prop.

Kubeflow ecosystem maturity: Kubernetes in itself is layers upon layers of abstraction. Kubeflow builds on top of that by adding more abstractions. The goal of an abstraction is to get rid of the complicated bits and make things more straightforward. Unfortunately, working with Kubeflow can sometimes introduce more complexity than you bargained for.

The gods smile upon us though: Kubeflow 1.4 came out recently, and with each new update we can see that the project is listening to the feedback of the Kubeflow community. Time will tell what the future of Kubeflow is, but until then, let's live in the moment.

Sponsored
Snorkel event
Join fellow ML experts at the Future of Data-centric AI 2022 - a free virtual event.

Learn about the latest data-centric approaches to AI application development during 30+ sessions presented by 40+ speakers from across industry and academia at this virtual event hosted by Snorkel AI on August 3-4.

You can register today to learn more, and see the current speaker lineup.
Current Meetup
So Fresh and So Clean
Tommy DANGerous will be leading this week's session for our virtual meetup! Since this week it's all about that clean data, you can guess what we will be talking about. In the session we will quickly identify data issues and easily clean them with an open-source data cleaning tool.

We will go over the following:

  • Use visualizations and reports to understand your data quality issues
  • Use common cleaning functions to quickly fix issues
  • Write custom cleaning actions that can be re-used
  • Use your cleaning pipeline in any environment
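If "custom cleaning actions that can be re-used" sounds abstract, here is a tiny sketch of the idea in plain Python. It does not use the tool from the session; every name here (drop_missing, strip_whitespace, remove_duplicates, run_pipeline) is made up for illustration, and rows are just dictionaries.

```python
from typing import Callable

Row = dict
CleaningAction = Callable[[list[Row]], list[Row]]


def drop_missing(column: str) -> CleaningAction:
    """Build an action that drops rows where `column` is None or absent."""
    def action(rows: list[Row]) -> list[Row]:
        return [r for r in rows if r.get(column) is not None]
    return action


def strip_whitespace(column: str) -> CleaningAction:
    """Build an action that trims surrounding whitespace in a text column."""
    def action(rows: list[Row]) -> list[Row]:
        return [{**r, column: r[column].strip()} for r in rows]
    return action


def remove_duplicates(rows: list[Row]) -> list[Row]:
    """Drop exact duplicate rows, keeping the first occurrence."""
    seen, unique = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def run_pipeline(rows: list[Row], actions: list[CleaningAction]) -> list[Row]:
    """Apply each cleaning action in order and return the cleaned rows."""
    for action in actions:
        rows = action(rows)
    return rows


# The pipeline is just a list, so it can be reused on any dataset.
pipeline = [drop_missing("age"), strip_whitespace("name"), remove_duplicates]

rows = [
    {"name": " Ada ", "age": 36},
    {"name": "Ada", "age": 36},
    {"name": "Bob", "age": None},
]

print(run_pipeline(rows, pipeline))
```

Because each action is a plain function that takes rows and returns new rows, the same pipeline list can run in a notebook, a script, or a scheduled job without changes.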

Tomorrow at 9am PST / 5pm BST it's all going down. Come join us!
We Have Jobs!!
There is an official MLOps community jobs board now. Post a job and get featured in this newsletter!

IRL Meetups
London - July 16th
Lisbon - July 21
NYC - ....Soon!
Seattle - ??
Denver - ??
Best of Slack
Best of Slack is its own newsletter now. Sign up for it here.
Thanks for reading. This issue was written by Nwoke Tochukwu and edited by Demetrios Brinkmann. See you in Slack, YouTube, and podcast land. Oh yeah, and we are also on Twitter if you like chirping birds.


