
We are doing an async reading group for community member Joe Reis' new book The Fundamentals of Data Engineering. Join us here.

 
Coffee Session
Data Mesh
The Data Mesh approach is centered on decentralizing data architecture for ML. It has both a technical and a social side.

On the technical side, the concept is to treat every piece of data as a product.

These data products need to satisfy certain criteria to be considered useful before being curated into output ports for easy consumption by consumers.
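
To make the idea concrete, here is a minimal sketch (all names invented for illustration, not part of any data mesh tooling) of what a data product descriptor with output ports might look like:

```python
from dataclasses import dataclass, field

@dataclass
class OutputPort:
    """One way consumers can read the product, e.g. a table or an API."""
    name: str      # e.g. "shipments_daily"
    format: str    # e.g. "parquet", "rest_api"
    schema: dict   # column name -> type, the interface consumers rely on

@dataclass
class DataProduct:
    """Minimal, hypothetical descriptor for a data-mesh data product."""
    name: str
    owner_team: str                                       # domain team accountable for it
    quality_checks: list = field(default_factory=list)    # criteria it must pass to be useful
    output_ports: list = field(default_factory=list)      # curated interfaces for consumers

# A domain team publishes one product with two ports: one for analysts, one for ML.
shipments = DataProduct(
    name="shipments",
    owner_team="logistics",
    quality_checks=["freshness < 24h", "no null shipment_id"],
    output_ports=[
        OutputPort("shipments_daily", "parquet", {"shipment_id": "str", "eta": "timestamp"}),
        OutputPort("shipments_features", "parquet", {"shipment_id": "str", "delay_minutes": "float"}),
    ],
)
```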

The data mesh movement primarily started to handle the analytics plane of the data workflow. Over time it has expanded into other planes, like the operational plane.

The concept has also evolved so that a single data product can serve multiple consumers, exposing only what a given user, e.g. a business analyst or data scientist, actually needs.

Socially, transitioning to a data mesh workflow will fail unless everyone across the organization is involved in the conversations about the data mesh culture.

There must be a clear understanding of how what is being built integrates with the data product's outputs, because that affects how a given output is accessed.


 
Coffee Session
Data Modelling Rush
Chad Sanderson, the Head of Product & Data Platform at Convoy, shared some insightful thoughts on rethinking existing data engineering/modeling approaches. We also had a very special guest host, long-time community member Josh Wills.

Data Modelling
The elevator pitch of data modeling is basically the idea of building relationships between core concepts within the data. Typically data can be modeled in two ways.

The physical data model entails using data environments like dbt, Snowflake, etc. to draw the relationships between the data.

The semantic model is typically an entity-relationship design: it abstracts away the storage-level, data-centric relationships and introduces a stand-alone relationship design between the entities themselves.
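
A loose way to see the two layers in code, using SQLAlchemy purely as an illustration (it was not part of the talk): the class and relationship definitions play the role of a semantic model, while the concrete tables they materialize into are the physical model.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

# Semantic layer: entities and the relationship between them,
# described independently of how any warehouse stores the rows.
class Customer(Base):
    __tablename__ = "customers"          # physical detail: the backing table
    id = Column(Integer, primary_key=True)
    name = Column(String)
    orders = relationship("Order", back_populates="customer")

class Order(Base):
    __tablename__ = "orders"
    id = Column(Integer, primary_key=True)
    customer_id = Column(Integer, ForeignKey("customers.id"))
    customer = relationship("Customer", back_populates="orders")

# Physical layer: materialize the model as concrete tables.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
```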

Rethink Process
Modern modeling architectures tend to create a disconnect between the upstream and downstream data pipelines as data grows and changes.

Totally rebuilding the entire data architecture from scratch isn't always the best solution; just swap out the redundant parts. The idea is to make it easy to modify the data model collaboratively and incrementally when scalability problems start to emerge. This is done by redesigning the modeling process to address the full spectrum of relationships that exist between the data entities.

Contracts and Lineage
In this modeling paradigm, contracts technically enforce a schema for the data. Lineage adds a record of how data flows through the model.

They both help to enable traceability between data producers and consumers across the data stream and also highlight backward incompatibility changes in the data.
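
As a hedged sketch of the contract idea (using pydantic, which is my choice here rather than anything Convoy described), the contract is a typed schema that producer output must pass before consumers ever see it:

```python
from datetime import datetime
from pydantic import BaseModel, ValidationError

# The contract: the schema a producer promises and a consumer depends on.
class ShipmentEvent(BaseModel):
    shipment_id: str
    status: str
    updated_at: datetime

def publish(raw_event: dict) -> ShipmentEvent:
    """Validate against the contract before the event reaches consumers."""
    try:
        return ShipmentEvent(**raw_event)
    except ValidationError as err:
        # A failure here surfaces a backward-incompatible change at the producer,
        # instead of silently breaking dashboards and models downstream.
        raise RuntimeError(f"contract violation: {err}") from err

publish({"shipment_id": "s-123", "status": "in_transit", "updated_at": "2022-08-01T12:00:00"})
```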

Carrot-Stick Strategy
There is an implied expectation that data collection, quality, and governance all depend heavily on the data personnel.

This introduces a "hand-to-mouth" scenario between the data producer and the data consumer.

This results in a net negative for performance and cripples efficiency.

There must be just as much care and interest in the data at every level across the stream.

To build this healthy habit at Convoy, they try to abstract away the code and focus more on the data semantics.

This is accomplished by capturing the schemas and mapping them to the actual code in the data warehouse.
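
A toy version of that mapping (invented here, not Convoy's actual tooling): capture the schema once and generate the warehouse code from it, so the two can never drift apart.

```python
# Hypothetical example: a captured schema drives the warehouse code, not the other way around.
shipment_schema = {
    "shipment_id": "VARCHAR",
    "origin": "VARCHAR",
    "delivered_at": "TIMESTAMP",
}

def to_create_table(table: str, schema: dict) -> str:
    """Render the captured schema as warehouse DDL."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in schema.items())
    return f"CREATE TABLE IF NOT EXISTS {table} (\n  {cols}\n);"

print(to_create_table("shipments", shipment_schema))
```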

 
Past Meetup
Synthetic Data Generation
This week we had the pleasure of speaking with Fabiana Clemente from YData, who showed us how to generate synthetic data and validate its correctness, all from a Jupyter notebook.

We learned about the what, how, when, and "Y" of synthetic data, using the open source ydata-synthetic package to generate new tabular data, and then validated the results with Great Expectations.
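
If you want a feel for the flow before opening the notebooks, here is a rough sketch; the class names follow the ydata-synthetic 1.x API and the legacy great_expectations interface, and the file and column names are placeholders, so treat the linked notebooks as the source of truth:

```python
import pandas as pd
import great_expectations as ge
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
from ydata_synthetic.synthesizers.regular import RegularSynthesizer

real = pd.read_csv("data.csv")  # placeholder path

# Train a tabular synthesizer on the real data.
synth = RegularSynthesizer(modelname="ctgan", model_parameters=ModelParameters(batch_size=500))
synth.fit(
    data=real,
    train_arguments=TrainParameters(epochs=100),
    num_cols=["age", "income"],   # assumed numeric columns
    cat_cols=["segment"],         # assumed categorical column
)
fake = synth.sample(1000)

# Validate the synthetic output against expectations derived from the real data.
gdf = ge.from_pandas(fake)
gdf.expect_column_values_to_not_be_null("age")
gdf.expect_column_values_to_be_between("age", real["age"].min(), real["age"].max())
print(gdf.validate().success)
```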

We leveraged Jupyter notebooks, virtual environments, and some fun background music from the lovely Gonçalo, all in support of making our data work better for us.

If you want to follow along and try the notebooks for yourself, the code is fully open and available here. Share what kind of data you synthesize and if you can spot the difference!
 
Sponsored Post
Data-Centric AI Summit
The first Data-Centric AI Summit is coming up fast on Sept 29-30, focusing deeply on practical and educational talks that teach you something awesome. It’s all online and totally free. Come learn from some of the top Data-Centric AI practitioners and platforms from all over the world.

What is Data-Centric AI? It focuses on updating the data to solve a problem versus changing the algorithm or code. That’s a complete reversal of how we’ve thought about AI up until now.

Over the last decade, researchers focused on code and algorithms first. They’d import the data once and leave it fixed. Data-Centric AI flips that on its head and says fix the data itself. Clean the noise. Augment the dataset. Re-label so it’s more consistent.
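
In code, a data-centric iteration is a pass over the dataset rather than a change to the model; a schematic example (file and column names invented):

```python
import pandas as pd

df = pd.read_csv("train.csv")  # placeholder training set

# 1. Clean the noise: drop rows with impossible values instead of hoping the model copes.
df = df[(df["age"] >= 0) & (df["age"] <= 120)]

# 2. Re-label for consistency: collapse label spellings that mean the same class.
df["label"] = df["label"].str.lower().replace({"deffective": "defective"})

# 3. Augment: duplicate and jitter the rare class so the model sees it more often.
rare = df[df["label"] == "defective"].copy()
rare["measurement"] *= 1.01
df = pd.concat([df, rare], ignore_index=True)

# The model and training code stay exactly the same; only the data changed.
```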

The event is brought to you by the AI Infrastructure Alliance and the DCAI Community. Come learn how to make Data-Centric AI practical in the real world.
 
 
New tool Tuesday
Blobz
Patrick Barker wrote a nice little lib to store artifacts in image registries, and I wanted to tell the deeper story of why he created it.

Artifact storage presents unique challenges when building a development platform. When working in the cloud, it means creating separate implementations for S3, GCS, and the other providers' object stores, and it often requires the developer to provision those underlying cloud resources.

When working on-prem or without cloud dependencies, it means running and maintaining a service like MinIO, which adds a lot of operational overhead and is less reliable.

Image registries are one of the most critical pieces of the developer stack, which makes them highly durable and accessible anywhere you write code. Image registries just store OCI-compliant images, which are simply versioned tarballs. The OCI artifacts standard aims to extend image registries so they can store any arbitrary artifact with an associated media type. Tools like ORAS leverage this today.

The OCI artifacts standard is definitely the future of artifact storage, but it's not yet fully supported across all clouds and image registries. While Patrick was working at VMware, a team developed a tool called imgpkg. Its aim was to provide similar functionality to the OCI artifacts standard, but using the currently supported APIs, so it works with any image registry on the market today.

Blobz builds on the work of imgpkg by providing a clean python interface and the ability to add labels to artifacts, as well as references to other artifacts. This is useful in ML because we often have many artifact types and they need to be linked to one another. It also provides a dead simple way of storing things like models without the need to deal with another backend.
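
The real API lives in Patrick's repo; purely as a hypothetical sketch of the kind of interface described above (the module, function, and argument names below are invented):

```python
# Hypothetical usage sketch -- not the actual Blobz API.
# The idea: the image registry is the only storage backend you need.
import blobz  # assumed import name

# Push a model file to any OCI image registry, with labels for discovery.
model_uri = blobz.push(
    "registry.example.com/ml/churn-model:v3",
    files=["model.pkl"],
    labels={"stage": "staging", "framework": "sklearn"},
)

# Push the training dataset and link it to the model via a reference,
# so lineage between artifact types is preserved.
blobz.push(
    "registry.example.com/ml/churn-dataset:2022-08",
    files=["train.parquet"],
    refs=[model_uri],
)

# Anyone with registry access can pull it back down; no extra storage service to run.
blobz.pull("registry.example.com/ml/churn-model:v3", out_dir="./artifacts")
```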

It's still very early days for Blobz; Patrick plans on building out ways of searching the artifacts and tracing through refs to other artifacts. Feel free to contribute or reach out to Patrick in the #open-source channel!
We Have Jobs!!
There is an official MLOps community jobs board now. Post a job and get featured in this newsletter!
IRL Meetups
Utah August 23
Boston — August 25
NYC — August 31
Bangalore — September 3

Thanks for reading. This issue was written by Nwoke Tochukwu and edited by Demetrios Brinkmann and Jessica Rudd. See you in Slack, YouTube, and podcast land. Oh yeah, and we are also on Twitter if you like chirping birds.


