Frequently asked questions
What can’t you do with pure Docker that you would need DDS for? For example, Docker already does a lot of layer caching.
Docker fills a gap that slightly overlaps with DDS:
- Docker allows you to embed arbitrary content (software, models, data, ...) into a single bundle
- It requires a specific language (the Dockerfile instructions)
- Its caching system does not understand the semantics of your code: if you just move code, it will rebuild the layer. In fact, Docker has multiple caching systems that try to address this issue in different ways.
- It requires a specific runtime (the docker system) to run
By contrast, DDS understands your Python code very well, and only your Python code:
- you can run it in any Python environment
- it understands your code: if you copy/paste functions between files, it recognizes that the code is still the same and does not trigger recomputations
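To illustrate what a cache that "understands" code means, here is a minimal sketch. It is not DDS's actual implementation: the `code_signature` helper is hypothetical, and it simply assumes that a cache key can be derived from the syntax tree of a function rather than from the raw file contents. Moving the function or adding comments around it leaves the key unchanged, while a real change to the logic produces a new key.

```python
import ast
import hashlib


def code_signature(source: str) -> str:
    """Hash the syntax tree of some Python source, ignoring its layout.

    Hypothetical helper for illustration only; DDS's own analysis may differ.
    """
    tree = ast.parse(source)
    # ast.dump() leaves out line and column attributes by default, and
    # comments never reach the syntax tree, so only the code itself counts.
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()


original = "def clean(x):\n    return x.strip().lower()\n"
# Same function, pasted further down another file with a comment above it.
moved = "\n\n# moved to another file\ndef clean(x):\n    return x.strip().lower()\n"
# The logic really changed here: lower() became upper().
changed = "def clean(x):\n    return x.strip().upper()\n"

same_after_move = code_signature(original) == code_signature(moved)
same_after_change = code_signature(original) == code_signature(changed)
```

A cache keyed on the raw text of a file, which is closer to how a Docker layer is invalidated, would treat `original` and `moved` as different inputs and rebuild.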
In practice, both systems are complementary:
- you build all your data artifacts (models, cleaned data) with DDS
- you embed them in a final Docker container that you publish as a service, with MLflow for example
Can DDS run in the background automatically like Delta IO?
Not currently, but this is a potential point on the roadmap. DDS already benefits from Delta IO if available, and solves a different problem:
- DDS helps for batch transforms written in Python
- Delta IO can be used for streaming and batch, using Python, Java
- DDS automatically infers all the data dependencies from the code
- Delta IO needs an explicit computation graph provided by the user
Best practices: at which point in the code should you use it?
The rule of thumb is the following: any idempotent calculation that you end up waiting for and that takes more than 0.3 seconds to compute can benefit from DDS.
In practice, this includes:
- fetching data from the internet and returning a pandas dataframe
- using the display() function to show statistics on large tables
- running ML models
With DDS, the general user experience is that any notebook can be made to run in less than 10 seconds. This is a powerful way to communicate results that may depend on long-running calculations.