User guide
The dds package solves the data integration problem in data science codebases. By using the dds package, you can safely assume that:
- data consumed or produced is up to date with the current code (coherence guarantee)
- if a piece of data (machine learning models, datasets, ...) has already been calculated for a given code, it will immediately be used, dramatically accelerating the run of the code (caching)
dds works by inspecting Python code and checking against a central store if its output has already been calculated. In that sense, it can be thought of as a smart caching system that detects if it should rerun calculations. As we will see, this makes dds a very simple foundation to build a feature store that keeps models, transformed data and feature data all in sync.
In order to work, dds needs three pieces of information:
- where to store all the pieces of data (called blobs in dds jargon) that have already been calculated. This is by default in /tmp/dds/internal (or equivalent for your operating system).
- where to store all the paths that are being requested for evaluation. It is by default in /tmp/dds/data.
- what code should be tracked. Using the default configuration is enough for this tutorial in a notebook.
Data functions
The easiest way to use dds is to add a special annotation to data functions. A data function is a function that takes no arguments and returns something (a piece of data) that is of interest to us. Furthermore, it should respect the following conditions:
- it always returns the same result when called repeatedly (determinism)
- it could be replaced just by its result without changing the working of the program (referential transparency)
The first property says that the output does not change if the code is the same and the second property says that we only really care about the output of the function, not what it might decide to do on the side.
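To make the two conditions concrete, here is a small sketch in plain Python (no dds involved; all function names are made up for illustration). The first function satisfies both conditions, while the other two each break one of them.

```python
import random

def total_price():
    # Deterministic and free of side effects: safe to replace by its result.
    return sum([10, 20, 30])

def noisy_price():
    # Not deterministic: a cached result would freeze one random draw.
    return 60 + random.random()

audit_log = []

def logged_price():
    # Not referentially transparent: the append is a side effect that
    # would be skipped if the call were replaced by a stored result.
    audit_log.append("price computed")
    return 60
```

Only a function like total_price is a good candidate for caching: skipping its execution changes nothing that the rest of the program can observe.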
Here is a simple "Hello world" example in dds.
import dds
@dds.data_function("/hello_world")
def hello_world():
    print("hello_world() has been called")
    return "Hello, world"
hello_world()
hello_world() has been called
'Hello, world'
When we called the function, a few things happened:
- dds calculated a unique fingerprint for this function and checked if a blob was already associated with this fingerprint in its storage
- since this is the first run, the function was executed and its result was stored in the storage
- also, because the output is associated with a path (/hello_world), the path /hello_world was filled with the content of the output.
We can in fact see all these outputs in the default store. Here is the file newly created with our welcoming content:
! cat /tmp/dds/user_guide/data/hello_world
Hello, world
But that file is just a link to the unique signature associated with this piece of code:
! readlink /tmp/dds/user_guide/data/hello_world
/tmp/dds/user_guide/internal/blobs/26f46034012ffdebb21af34aea2e6f0775a521122f457f23bf34d4b97facfb3b
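The layout seen above (a readable path that is a symbolic link to a blob named after a signature) can be imitated with the standard library. This is only a sketch of the idea, not the dds implementation: the helper store_blob, the temporary directory and the use of a SHA-256 of the content as the key are assumptions made for illustration (dds derives its signatures from the code, not from the content). It also assumes a platform where symbolic links are available.

```python
import hashlib
import os
import tempfile

def store_blob(root, path, content):
    """Write content under internal/blobs/<key> and link data/<path> to it."""
    # Assumption for this sketch: the key is a hash of the content.
    key = hashlib.sha256(content.encode("utf-8")).hexdigest()
    blob_dir = os.path.join(root, "internal", "blobs")
    data_file = os.path.join(root, "data", path.lstrip("/"))
    os.makedirs(blob_dir, exist_ok=True)
    os.makedirs(os.path.dirname(data_file), exist_ok=True)
    blob_file = os.path.join(blob_dir, key)
    with open(blob_file, "w", encoding="utf-8") as f:
        f.write(content)
    # Re-point the readable path at the blob, replacing any older link.
    if os.path.lexists(data_file):
        os.remove(data_file)
    os.symlink(blob_file, data_file)
    return data_file

root = tempfile.mkdtemp()
link = store_blob(root, "/hello_world", "Hello, world")
with open(link, encoding="utf-8") as f:
    print(f.read())
print(os.readlink(link))
```

Keeping blobs and paths apart means that several paths (or several versions of the code) can point at the same stored result without copying it.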
The hello_world function prints a message whenever it executes. Now, if we try to run it again, it will actually not run, because the code has not changed.
hello_world()
'Hello, world'
In fact, because dds looks at the source code, if you redefine the function with the same content, it still does not recompute:
@dds.data_function("/hello_world")
def hello_world():
    print("hello_world() has been called")
    return "Hello, world"
hello_world()
'Hello, world'
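The behaviour above can be reproduced in miniature with a cache keyed on a fingerprint of the function body. The sketch below is an assumption-laden toy, not how dds computes its fingerprints: it hashes the compiled bytecode, constants and names of the function, and the helpers fingerprint and cached_call are hypothetical.

```python
import hashlib

cache = {}  # fingerprint -> stored result (a stand-in for the blob store)

def fingerprint(func):
    # Hash the compiled body: bytecode plus the constants and names it uses.
    code = func.__code__
    payload = repr((code.co_code, code.co_consts, code.co_names))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_call(func):
    # Run the function only if no result is stored for its fingerprint.
    key = fingerprint(func)
    if key not in cache:
        cache[key] = func()
    return cache[key]

calls = []

def hello_world():
    calls.append("run")
    return "Hello, world"

cached_call(hello_world)

def hello_world():  # same body, defined a second time
    calls.append("run")
    return "Hello, world"

cached_call(hello_world)
print(len(calls))  # the body ran only once
```

Because the key depends on the content of the function and not on the function object, redefining it with an identical body hits the same cache entry.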
Functions can include arbitrary dependencies, as shown with this example. The function f has a dependency on an extra variable:
my_var = 1
@dds.data_function("/f")
def f():
    print("Calling f")
    return my_var
f()
Calling f
1
If we call the function again, as seen before, the function does not get called again:
f()
1
However, if we change any dependency of the function, such as my_var, then the function will get evaluated again:
my_var = 2
f()
Calling f
2
Interestingly, if we change the variable again to its previous value, the function does not get evaluated again! The signature of the function will match a signature that was calculated before, hence there is no need to recompute it.
my_var = 1
f()
1
This mechanism covers all the basic structures in python (functions, dictionaries, lists, basic types, ...).
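One way to picture this dependency tracking is a signature that mixes the function body with the current value of every global name it references. The sketch below is a simplified illustration under that assumption and is not the dds algorithm; the helper signature is hypothetical, and the namespace is passed explicitly as a dictionary to keep the example self-contained.

```python
import hashlib

def signature(func, namespace=None):
    # Combine the bytecode with the value of each referenced global name
    # that can be found in the namespace.
    if namespace is None:
        namespace = func.__globals__
    code = func.__code__
    deps = [(name, repr(namespace[name]))
            for name in code.co_names if name in namespace]
    payload = repr((code.co_code, code.co_consts, deps))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def f():
    return my_var  # looked up as a global name

env = {"my_var": 1}
sig_first = signature(f, env)

env["my_var"] = 2
sig_changed = signature(f, env)

env["my_var"] = 1
sig_restored = signature(f, env)

print(sig_first != sig_changed)   # a changed dependency changes the signature
print(sig_first == sig_restored)  # restoring the value restores the signature
```

Since the restored signature equals the first one, a store keyed on signatures would already hold the result, which is why no recomputation is needed in that case.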
A function that is annotated with a dds annotation is called a data function. It is a function that has not only a name in code but also a data path associated with it, and whose output is captured and stored in a data system.
As we said, the data_function annotation requires little code change but only works for functions that do not have arguments. How do we deal with more complicated functions?
This is the subject of the next section.
Functions with arguments: keep() and eval()
dds can also wrap functions that have arguments using the dds.keep() function. Here is a simple example, in which the hello function expects an extra word to be provided:
def hello(name):
    print(f"Calling function hello on {name}")
    return f"Hello, {name}"
greeting = hello("world")
greeting
Calling function hello on world
'Hello, world'
In order to capture a specific call to this function with dds, the function call has to be wrapped with the dds.keep function:
greeting = dds.keep("/greeting", hello, "world")
greeting
Calling function hello on world
'Hello, world'
Again, try to change the argument of the function to see when it calls the function. This substitution can be done everywhere the call hello("world") was made. The dds.keep call can also be wrapped in a separate function that is used instead of hello. This is in fact how the decorator data_function works.
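The relationship between the two can be sketched with toy stand-ins. Everything below is hypothetical and written for illustration only: keep and data_function here are simplified local functions, not the dds ones, and the cache key is built naively from the function name and the repr of the arguments.

```python
store = {}  # path -> (key, value); a stand-in for the dds store

def keep(path, func, *args):
    # Toy stand-in for dds.keep: rerun only when the call changes.
    key = (func.__qualname__, repr(args))
    if path not in store or store[path][0] != key:
        store[path] = (key, func(*args))
    return store[path][1]

def data_function(path):
    # Toy stand-in for dds.data_function: wrap a zero-argument function
    # so that every call goes through keep().
    def decorator(func):
        def wrapper():
            return keep(path, func)
        return wrapper
    return decorator

calls = []

def hello(name):
    calls.append(name)
    return f"Hello, {name}"

keep("/greeting", hello, "world")
keep("/greeting", hello, "world")  # same call: served from the store
keep("/greeting", hello, "dds")    # new argument: hello runs again

@data_function("/hello_world")
def hello_world():
    calls.append("hello_world")
    return "Hello, world"

hello_world()
hello_world()  # second call is served from the store
print(calls)   # ['world', 'dds', 'hello_world']
```

The decorator adds nothing beyond fixing the path and the absence of arguments, which is why it is the more convenient form for functions that take none.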
The dds.keep construct works well if the arguments can be summarized as a signature. It will fail for complex objects such as files, because dds needs to understand basic information about the input of a function to decide if it has changed or not. As an example:
def hello_from_file(file):
    name = file.readline().strip()
    print("Calling hello_from_file")
    return f"Hello, {name}"
f = open("input.txt", "r")
hello_from_file(f)
Calling hello_from_file
'Hello, world'
# This line will trigger a DDSException
try:
    dds.keep("/greeting", hello_from_file, open("input.txt", "r"))
except dds.DDSException as e:
    print(e)
The type <class '_io.TextIOWrapper'> is currently not supported. The only supported types are 'well-known' types that are part of the standard data structures in the python library. If you think your data type should be supported by DDS, please open a request ticket. General Python classes will not be supported since they can carry arbitrary state and cannot be easily compared. Consider using a dataclass, a dictionary or a named tuple instead.
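To see why plain data structures are easy to support while file handles are not, consider the following sketch of an argument signature. It is a hypothetical helper written for this guide, not dds code: it accepts basic types, lists, tuples (including named tuples), dictionaries and dataclass instances, and rejects everything else with a TypeError.

```python
import dataclasses
import hashlib
import io

def _canonical(value):
    # Turn a supported value into a stable text form.
    if value is None or isinstance(value, (bool, int, float, str, bytes)):
        return f"{type(value).__name__}:{value!r}"
    if isinstance(value, (list, tuple)):
        return "[" + ",".join(_canonical(v) for v in value) + "]"
    if isinstance(value, dict):
        # Sort the entries so that insertion order does not matter.
        items = sorted((_canonical(k), _canonical(v)) for k, v in value.items())
        return "{" + ",".join(f"{k}={v}" for k, v in items) + "}"
    if dataclasses.is_dataclass(value) and not isinstance(value, type):
        return type(value).__name__ + _canonical(dataclasses.asdict(value))
    raise TypeError(f"The type {type(value)} is not supported")

def arg_signature(value):
    """Hash a value built from basic types; reject anything else."""
    return hashlib.sha256(_canonical(value).encode("utf-8")).hexdigest()

@dataclasses.dataclass
class Person:
    name: str

print(arg_signature(Person("world")) == arg_signature(Person("world")))
try:
    arg_signature(io.StringIO("world"))
except TypeError as e:
    print(e)
```

A file-like object carries hidden state (its position, the content on disk) that such a traversal cannot capture, so refusing it is safer than producing a misleading signature.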
How do we still use files? dds does not need to understand the content passed to a function if it is called as a sub-function within dds. More concretely in this example, we can create a wrapper function that contains the file call and the call to the function to keep:
def wrapper_hello():
    f = open("input.txt", "r")
    print(f"Opening file {f}")
    greeting = dds.keep("/greeting", hello_from_file, f)
    return greeting
dds.eval(wrapper_hello)
Opening file <_io.TextIOWrapper name='input.txt' mode='r' encoding='UTF-8'>
Calling hello_from_file
'Hello, world'
Calling the function again shows that:
- we still open the file: the content of wrapper_hello is still executed.
- hello_from_file is not called again: even if we pass a file to it, all the source code to provide the arguments is the same, the function hello_from_file is the same, hence dds assumes that the resulting greeting is going to be the same.
As a result, wrapper_hello is run (it is just evaluated), but all the sub-calls to data functions are going to be cached.
dds.eval(wrapper_hello)
Opening file <_io.TextIOWrapper name='input.txt' mode='r' encoding='UTF-8'>
'Hello, world'
Indirect references: load()
So far, we have seen only one way to access data: using dds.keep (or its shortcut @data_function).
It is not always convenient to refer to the data function that created the piece of data in the first place.
For example, the function that created the data may contain some secrets that should not be accessible.
This is why dds provides an alternative way to access data, using only the path to the data. This is what the dds.load function provides.
For example, if we want to retrieve the data stored in /hello_world, we can directly retrieve it in the following way:
dds.load("/hello_world")
'Hello, world'
Just like the other functions, changes to the underlying data will cause the signature of a dds.load call to change.
This function seems convenient, but it comes at some cost: because it refers only to the result and not to how the data was calculated, it cannot check for circular dependencies, or whether this reference should be updated.
When to use dds.load and when to directly call the function?
- directly calling the function ensures that the most recent version will be taken into account, at the expense of a code dependency
- loading the data indirectly hides the implementation, but may introduce coherency and synchronization issues
Furthermore, some extra rules must be respected when mixing load and other functions. In particular, dds will prevent you from reading a dataset first through load and then evaluating it using keep. For example, this will fail:
@dds.data_function("/f")
def f():
    return 1
def h():
    _ = dds.load("/f")
def g():
    h()
    f()
# This will fail with a DDSException
# dds.eval(g)
Rearranging the call h() after f() solves the problem:
def g():
    f()
    h()
dds.eval(g)
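The ordering rule can be pictured with a toy session that remembers which paths have been loaded. This is purely an illustrative model under simplifying assumptions, not how dds performs the check: the Session class, its methods and the OrderError exception are all hypothetical.

```python
class OrderError(Exception):
    """Raised when a path is evaluated after it has already been loaded."""

class Session:
    def __init__(self):
        self.store = {}      # path -> value
        self.loaded = set()  # paths read through load() so far

    def keep(self, path, func):
        # Evaluating a path that was already read would make the earlier
        # read potentially stale, so the toy model refuses it.
        if path in self.loaded:
            raise OrderError(f"{path} was loaded before being evaluated")
        self.store[path] = func()
        return self.store[path]

    def load(self, path):
        self.loaded.add(path)
        return self.store.get(path)

def load_then_keep(session):
    session.load("/f")
    return session.keep("/f", lambda: 1)

def keep_then_load(session):
    session.keep("/f", lambda: 1)
    return session.load("/f")

print(keep_then_load(Session()))
try:
    load_then_keep(Session())
except OrderError as e:
    print(e)
```

In this model, as in the example above, evaluating first and loading afterwards guarantees that the loaded value is the one produced by the current code.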
Conclusion
In conclusion, dds provides 4 basic functions to track and cache pieces of data:
- data_function is an annotation for functions that take no arguments and return a piece of data that should be tracked.
- keep is a function that wraps function calls. It can be used standalone when the function uses basic types as arguments.
- eval is used in conjunction with keep when data functions take complex arguments.
- load directly loads a piece of data from its path (without having to refer to its generating data function).
By building on these foundations, dds allows you to do many more things such as visualizing all the dependencies between data, speeding up Machine Learning pipelines, and parallelizing your code automatically. The other tutorials provide more information.
! ls /tmp/dds/user_guide/internal/blobs | grep -v meta
1c5018ef452f3aafead20de4d9e1ad5e6920453025813a266fde975387d0b5f5
22deab6baa11ebb1f379519a1c00a0bd9a8e6a93e278b8ae319c2bd95c4fd3dc
26f46034012ffdebb21af34aea2e6f0775a521122f457f23bf34d4b97facfb3b
de7ee19728e267fc76a0c22b4aaa5e28c6d9b7388038de9d422fb257609bb671
e95b2a802746c539eb4b2549abaf12d73366f966862d5682c6930a878cf3557
f2802c71b37ba3eeefaf0a6c6f6fe4cec847cbba8f67e7de8bd2580a27cbb5c