Data Lies, Flow Dies

OK, I'll be honest: I shared a first glimpse of this tool, LineaPy, a week ago in the community Slack. Let's just say there were a few words exchanged. These words made me wonder: should I write about this tool?

Yes.

Because it's still a novel attempt at doing something different, even if it ignores the fact that you are not building good habits.

I chatted more with Sangyoon Park, the creator of LineaPy, about his motivations for building this Python package. So here is his take.

The Why
For data science work to generate actual impact, productionization is an essential step. Yet, going from development to production is often a difficult and time-consuming process as it involves engineering efforts outside the primary responsibility of data scientists. Even with dedicated engineers, the process often becomes challenging as engineers do not have the full context behind the data science work passed to them, which comes in a crude form (e.g., long, messy notebooks). This friction drastically reduces the team’s ability to deliver actionable insights in real time.

Where Does LineaPy Fit In?
LineaPy traces the sequence of every code execution to capture the non-linear, iterative development process in data science. This comprehensive understanding of the code and its context then allows LineaPy to automatically transform the original development code into cleaned-up, production-ready components (e.g., pipeline operators) that can be easily picked up and used by engineers.
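
To make the idea a bit more concrete, here is a toy sketch. It is not LineaPy's implementation or its API. LineaPy traces code as it executes; the sketch below swaps in simple static analysis with Python's built-in ast module to keep only the statements a chosen variable depends on. The function name backward_slice and the sample "notebook" are made up for illustration, and it only understands flat, top-level statements (no imports, functions, or in-place mutation).

```python
import ast  # ast.unparse needs Python 3.9+


def backward_slice(source: str, target: str) -> str:
    """Return only the top-level statements that `target` depends on."""
    statements = ast.parse(source).body
    needed = {target}  # names we still need a definition for
    kept = []
    # Walk the "notebook" bottom-up, keeping statements that define a needed name.
    for stmt in reversed(statements):
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        assigned = {n.id for n in names if isinstance(n.ctx, ast.Store)}
        if not assigned & needed:
            continue  # nothing we care about is defined here, so drop it
        used = {n.id for n in names if isinstance(n.ctx, ast.Load)}
        if isinstance(stmt, ast.AugAssign):
            used |= assigned  # `x += 1` also reads the old value of x
        needed = (needed - assigned) | used
        kept.append(stmt)
    return "\n".join(ast.unparse(s) for s in reversed(kept))


# Hypothetical messy notebook: `debug` and `scratch` are leftovers from exploration.
notebook = """
raw = [3, 1, 4, 1, 5, 9, 2, 6]
debug = len(raw)
cleaned = [x for x in raw if x > 1]
scratch = sorted(raw)
total = sum(cleaned)
"""

sliced = backward_slice(notebook, "total")
print(sliced)
```

Running it prints just the three statements that feed into total, leaving the exploratory leftovers behind. Real notebooks are much harder than this (out-of-order cell execution, side effects, library calls), which is where runtime tracing earns its keep.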

I like the novel approach to solving the messy notebook problem. However, I would be remiss not to mention some of the feedback from the community. So here are some random quotes...

"Because why build good habits when you could just put duct tape over your bad ones....?"

"Just two lines of code - my classic red flag warning"

Maybe this can be a gateway drug for data scientists to learn deeper SWE best practices. Maybe this will be a crutch that could end up crippling a data scientist. Who knows?

I still think it's worth playing around with to make your own decision.

Recently we had Curtis Northcutt, the CEO & Co-Founder of Cleanlab, on the pod.

OK, so stay with me. This might sound a little inception-y. What is Cleanlab? Welp, it's a data-centric AI tool that uses the data itself to automatically correct the errors that exist within that data. It sounded a little bit like magic at first. Luckily for us, Curtis loves details. He didn't mind going into the nitty-gritty bits.

MLOps vs Data-centric AI: We keep making up new terms in the Machine Learning space as we go. Data-centric AI focuses on how to fix up the data to improve the data pipeline. MLOps, in contrast, is more of a superset that tends to construct operations around both the data and the model.

Nonetheless, these two still play a little tango. Their landscape lies between the realms of industry and academia. But one thing is for sure: the future will sort out their fate.

Why Data-centric AI: There is still much more work to be done around this concept. Theoretically, the approach focuses on improving data quality as a better way of solving some machine learning problems. Given that there will always be contaminants within any data, no matter how few, the methods around it show significant performance gains in machine learning when alterations are made to just the data.
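
As a rough illustration of what "altering just the data" can look like, here is a minimal sketch loosely inspired by the confident-learning idea that Cleanlab builds on. It is not Cleanlab's API or its exact algorithm. The function name flag_label_issues, the toy labels, and the toy probabilities are all made up, and it assumes every class shows up at least once in the labels and that you already have predicted probabilities from some model.

```python
def flag_label_issues(labels, pred_probs):
    """Return indices whose given label disagrees with a confident prediction."""
    n_classes = len(pred_probs[0])

    # Per-class threshold: the model's average confidence in class k,
    # measured over the examples that are labeled k.
    thresholds = []
    for k in range(n_classes):
        own = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        thresholds.append(sum(own) / len(own))

    suspects = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes the model is "confident" about for this example.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # model is unsure, so we don't second-guess the label
        best = max(confident, key=lambda k: p[k])
        if best != y:
            suspects.append(i)
    return suspects


# Toy data: example 2 is labeled 0, but the model strongly believes it is class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]

suspects = flag_label_issues(labels, pred_probs)
print(suspects)  # [2]
```

The flagged rows are candidates for a human to review, relabel, or drop before retraining, which is the "fix the data, not the model" loop in miniature.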

How should ML engineers think of it: The most important thing to remember is that this approach reminds us that Machine Learning isn't all about math and logic. Having ML engineers in the loop of these machine learning systems is really important. That is why machine learning just enhances our mundane capabilities and does not replace them. After all, we are still the authors of this data.

It's not every day you come across a Kubeflow fan. Ryan Russon, the MLOps and Data Science consulting manager at Maven Wave Partners, happens to be a huge fan of Kubeflow.

All About Kubeflow: We tend not to get a lot of yays in a room where Kubeflow is mentioned. But it all falls back to perspective and use case. In the same way that different tools are better options for specific scenarios, "Kubeflow is not the right tool for everybody".

It was natively built for Kubernetes and as such, it works well in a Kubernetes-forward setting. This seems to be why it isn't easily embraced: the on-ramp isn't exactly straightforward. There are a lot of things that need to be overcome when working with Kubernetes, from security to networking.

Kubeflow in the ML tooling ecosystem: If you look at it critically, Kubeflow brings a pretty interesting flavor to the ML tooling table. It fits the same mold as the full end-to-end ML infra that vendor tools like Vertex AI, SageMaker, and Azure ML provide, but in an open-source form. It provides customizable flexibility around exploration, orchestration, and serving, but without vendor lock-in. This enables portability of MLOps infrastructure across different cloud platforms. Talk about a value prop.

Kubeflow ecosystem maturity: Kubernetes in itself is layers upon layers of abstraction. Kubeflow builds on top of that by adding even more abstractions. The goal of an abstraction is to get rid of the complicated bits and make things more straightforward. Unfortunately, working with Kubeflow can sometimes introduce more complexity than you bargained for.

The gods smile upon us though: Kubeflow 1.4 came out recently, and with each new update we can see that the project is listening to the feedback of the Kubeflow community. Time will tell what the future of Kubeflow is, but until then, let's live in the moment.

Tommy DANGerous will be leading this week's session for our virtual meetup! Since this week it's all about that clean data, you can guess what we will be talking about. In the session we will quickly identify data issues and easily clean them with an open-source data cleaning tool.

We will go over the following:

Use visualizations and reports to understand your data quality issues
Use common cleaning functions to quickly fix issues
Write custom cleaning actions that can be re-used
Use your cleaning pipeline in any environment

Tomorrow at 9am PST / 5pm BST it's all going down. Come join us!