What is Dud?
I heard about Dud (pronounced "duhd", not "dood") a few months ago when its creator, Kevin Hanselman, dropped a few lines about it in the community. Curious and eager to learn more, I caught up with him to hear about the new 0.2.0 release and what exactly the tool aims to do.
Dud is a lightweight tool for versioning data alongside source code and building data pipelines. In practice, Dud extends many of the benefits of source control to large binary data. Essentially, it is a more focused, lighter-weight data version control tool.
It strives to be three things: simple, fast, and transparent.
Simple: Dud should never get in your way (unless you're about to do something stupid). Dud should be less magical, not more. Dud should do one thing well and be a good UNIX citizen.
Fast: Dud should prioritize speed while maintaining sensible assurances of data integrity. Dud should isolate time-intensive operations to keep the majority of the UX as fast as possible. Dud should scale to datasets in the hundreds of gigabytes and/or hundreds of thousands of files.
Transparent: Dud should explain itself early and often. Dud should maintain its state in a human-readable (and ideally human-editable) form.
To summarize with an analogy: Dud is to DVC what Flask is to Django. Both Dud and DVC have their strengths. If you want a "batteries included" suite of tools for managing machine learning projects, DVC can be a good fit for you. If data management is your main area of need and you want something lightweight and fast, Dud may be what you are looking for.