User guide

The dds package solves the data integration problem in data science codebases. By using the dds package, you can safely assume that:

data consumed or produced is up to date with the current code (coherence guarantee)
if a piece of data (machine learning models, datasets, ...) has already been calculated for a given version of the code, it is reused immediately, which can dramatically speed up a run (caching)

dds works by inspecting Python code and checking against a central store whether its output has already been calculated. In that sense, it can be thought of as a smart caching system that detects when it should rerun calculations. As we will see, this makes dds a simple foundation for building a feature store that keeps models, transformed data and feature data all in sync.
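To make the idea concrete, here is a minimal sketch of a fingerprint-based cache. It is an illustration only and not how dds is implemented: the names store, fingerprint and cached_call are hypothetical, the store is a plain dictionary, and the fingerprint is assumed to be a hash of the function's bytecode and constants.

```python
import hashlib

# Hypothetical in-memory stand-in for the central store: fingerprint -> result.
store = {}

def fingerprint(fn):
    """Hash the compiled body of fn (bytecode plus constants)."""
    code = fn.__code__
    payload = code.co_code + repr(code.co_consts).encode()
    return hashlib.sha256(payload).hexdigest()

def cached_call(fn):
    """Run fn only if no result is stored yet for its fingerprint."""
    key = fingerprint(fn)
    if key not in store:
        store[key] = fn()
    return store[key]

calls = []  # records each real execution, for illustration only

def build_dataset():
    calls.append("run")
    return [1, 2, 3]

first = cached_call(build_dataset)
second = cached_call(build_dataset)  # served from the store, not recomputed
```

In this sketch the second call finds the fingerprint in the store and skips the computation, which is the behavior the rest of this guide demonstrates with dds itself.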

In order to work, dds needs three pieces of information:

where to store all the pieces of data (called blobs in dds jargon) that have been already calculated. This is by default in /tmp/dds/internal (or equivalent for your operating system)
where to store all the paths that are being requested for evaluation. It is by default in /tmp/dds/data.
what code should be tracked. Using the default configuration is enough for this tutorial in a notebook.

Data functions

The easiest way to use dds is to add a special annotation to data functions. A data function is a function that takes no arguments and returns something (a piece of data) that is of interest to us. Furthermore, it should respect the following conditions:

it always returns the same result when called repeatedly (determinism)
it could be replaced just by its result without changing the working of the program (referential transparency)

The first property says that the output does not change if the code is the same and the second property says that we only really care about the output of the function, not what it might decide to do on the side.
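As a small illustration of these two conditions (the function names here are made up for the example), compare a function that qualifies as a data function with one that does not:

```python
import random

def total_price():
    # Deterministic and free of side effects: it can safely be
    # replaced by its result.
    return sum([10, 20, 12])

def noisy_price():
    # Not deterministic: caching one result would hide the fact
    # that every call returns something different.
    return 42 + random.random()

cached = total_price()
samples = {noisy_price() for _ in range(5)}
```

Caching total_price is safe because cached always equals a fresh call. Caching noisy_price would silently freeze one arbitrary sample.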

Here is a simple "Hello world" example in dds.

In [3]:

import dds

@dds.data_function("/hello_world")
def hello_world():
    print("hello_world() has been called")
    return "Hello, world"

hello_world()

hello_world() has been called

Out[3]:

'Hello, world'

When we called the function, a few things happened:

dds calculated a unique fingerprint for this function and checked whether a blob was already associated with this fingerprint in its storage
since this is the first run, the function was executed and its result was stored as a blob
also, because the output is associated with a path (/hello_world), the path /hello_world was filled with the content of the output.
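The steps above can be sketched with nothing but the standard library. This is an assumption-laden illustration, not the dds storage code: the directory layout mimics the internal/blobs and data folders of this guide inside a temporary directory, store_result is a hypothetical helper, and the fingerprint is a placeholder hash.

```python
import hashlib
import os
import tempfile

root = tempfile.mkdtemp()
blob_dir = os.path.join(root, "internal", "blobs")
data_dir = os.path.join(root, "data")
os.makedirs(blob_dir)
os.makedirs(data_dir)

def store_result(path, key, content):
    """Write content as a blob named by key, then point the data path at it."""
    blob_path = os.path.join(blob_dir, key)
    if not os.path.exists(blob_path):
        with open(blob_path, "w") as out:
            out.write(content)
    link_path = os.path.join(data_dir, path.lstrip("/"))
    if os.path.lexists(link_path):
        os.remove(link_path)  # repoint the path if it already existed
    os.symlink(blob_path, link_path)
    return link_path

# Placeholder fingerprint; dds derives its own from the code.
key = hashlib.sha256(b"hypothetical fingerprint input").hexdigest()
link = store_result("/hello_world", key, "Hello, world")

with open(link) as stored:
    content = stored.read()
target = os.readlink(link)
```

Reading the path returns the content, while readlink reveals the blob it points to. Separating blobs from paths means a path can be repointed without copying data.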

We can in fact see all these outputs in the default store. Here is the file newly created with our welcoming content:

In [4]:

! cat /tmp/dds/user_guide/data/hello_world

Hello, world

But that file is just a link to the unique signature associated with this piece of code:

In [5]:

! readlink /tmp/dds/user_guide/data/hello_world

/tmp/dds/user_guide/internal/blobs/26f46034012ffdebb21af34aea2e6f0775a521122f457f23bf34d4b97facfb3b

This function prints a message whenever it executes. Now, if we try to run it again, it will actually not run, because the code has not changed.

In [6]:

hello_world()

Out[6]:

'Hello, world'

In fact, because dds looks at the source code, if you redefine the function with the same content, it still does not recompute:

In [7]:

@dds.data_function("/hello_world")
def hello_world():
    print("hello_world() has been called")
    return "Hello, world"

hello_world()

Out[7]:

'Hello, world'

Functions can include arbitrary dependencies, as shown with this example. The function f has a dependency on an extra variable:

In [8]:

my_var = 1

@dds.data_function("/f")
def f():
    print("Calling f")
    return my_var

f()

Calling f

Out[8]:

1

If we call the function again, as seen before, the function does not get called again:

In [9]:

f()

Out[9]:

1

However, if we change any dependency of the function, such as my_var, then the function will get evaluated again:

In [10]:

my_var = 2
f()

Calling f

Out[10]:

2

Interestingly, if we change the variable again to its previous value, the function does not get evaluated again! The signature of the function will match a signature that was calculated before, hence there is no need to recompute it.

In [11]:

my_var = 1
f()

Out[11]:

1

This mechanism covers all the basic structures in Python (functions, dictionaries, lists, basic types, ...).
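The behavior seen with my_var can be reproduced in a simplified sketch. It assumes, for illustration only, that the fingerprint combines the function's bytecode with the values of the simple global variables it refers to. The helpers fingerprint and cached_call are hypothetical and far less thorough than what dds does.

```python
import hashlib

store = {}  # fingerprint -> cached result
runs = []   # records each real execution, for illustration only

# Only these value types are tracked as dependencies in this sketch.
SIMPLE_TYPES = (int, float, str, bool, tuple)

def fingerprint(fn):
    """Hash fn's bytecode together with the simple globals it refers to."""
    code = fn.__code__
    deps = sorted(
        (name, repr(fn.__globals__[name]))
        for name in code.co_names
        if isinstance(fn.__globals__.get(name), SIMPLE_TYPES)
    )
    payload = code.co_code + repr(code.co_consts).encode() + repr(deps).encode()
    return hashlib.sha256(payload).hexdigest()

def cached_call(fn):
    key = fingerprint(fn)
    if key not in store:
        runs.append(key)
        store[key] = fn()
    return store[key]

my_var = 1

def f():
    return my_var

results = []
results.append(cached_call(f))  # computed
my_var = 2
results.append(cached_call(f))  # dependency changed: computed again
my_var = 1
results.append(cached_call(f))  # fingerprint seen before: served from the store
```

Because the key depends on the value of my_var and not on the history of changes, setting it back to 1 yields a fingerprint that is already in the store.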

A function that is annotated with a dds annotation is called a data function. It is a function that has not only a name in code but also a data path associated with it, and for which the output is captured and stored in a data system.

As we said, the data_function annotation requires little code change but only works for functions that do not have arguments. How do we deal with more complicated functions? This is the subject of the next section.

Functions with arguments: keep() and eval()

dds can also wrap functions that have arguments using the dds.keep() function. Here is a simple example, in which the hello function expects an extra word to be provided:

In [12]:

def hello(name):
    print(f"Calling function hello on {name}")
    return f"Hello, {name}"

greeting = hello("world")
greeting

Calling function hello on world

Out[12]:

'Hello, world'

In order to capture a specific call to this function with dds, the function call has to be wrapped with the dds.keep function:

In [13]:

greeting = dds.keep("/greeting", hello, "world")
greeting

Calling function hello on world

Out[13]:

'Hello, world'

Again, try changing the argument of the function to see when the function is actually called. This substitution can be done everywhere hello("world") was called. The call to dds.keep can also be wrapped in a separate function instead of calling hello directly. This is in fact how the decorator data_function works.
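The relationship between a keep-style function and a decorator can be sketched as follows. The names sketch_keep and sketch_data_function are hypothetical stand-ins and do not reproduce the dds implementation. The cache key is assumed to be a hash of the function's bytecode and the repr of its arguments.

```python
import functools
import hashlib

blobs = {}  # fingerprint -> result
paths = {}  # data path -> fingerprint

def sketch_keep(path, fn, *args):
    """Cache fn(*args) under a key built from fn's bytecode and its arguments."""
    code = fn.__code__
    payload = code.co_code + repr(code.co_consts).encode() + repr(args).encode()
    key = hashlib.sha256(payload).hexdigest()
    if key not in blobs:
        blobs[key] = fn(*args)
    paths[path] = key
    return blobs[key]

def sketch_data_function(path):
    """Decorator form: a zero-argument function routed through sketch_keep."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper():
            return sketch_keep(path, fn)
        return wrapper
    return decorate

calls = []  # records each real execution, for illustration only

def hello(name):
    calls.append(name)
    return f"Hello, {name}"

a = sketch_keep("/greeting", hello, "world")  # runs hello
b = sketch_keep("/greeting", hello, "world")  # same call: cached
c = sketch_keep("/greeting", hello, "dds")    # new argument: runs again

@sketch_data_function("/hello_world")
def hello_world():
    calls.append("hello_world")
    return "Hello, world"

d = hello_world()
e = hello_world()  # cached
```

The decorator adds nothing beyond fixing the path and the function ahead of time, which is why it only suits functions without arguments.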

This construct works well if the arguments can be summarized to a signature. It will fail for complex objects such as files, because dds needs to understand basic information about the input of a function to decide if it has changed or not. As an example:

In [14]:

def hello_from_file(file):
    name = file.readline().strip()
    print("Calling hello_from_file")
    return f"Hello, {name}"

f = open("input.txt", "r")
hello_from_file(f)

Calling hello_from_file

Out[14]:

'Hello, world'

In [15]:

# This line will trigger a DDSException
try:
    dds.keep("/greeting", hello_from_file, open("input.txt", "r"))
except dds.DDSException as e:
    print(e)

The type <class '_io.TextIOWrapper'> is currently not supported. The only supported types are 'well-known' types that are part of the standard data structures in the python library. If you think your data type should be supported by DDS, please open a request ticket. General Python classes will not be supported since they can carry arbitrary state and cannot be easily compared. Consider using a dataclass, a dictionary or a named tuple instead.
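To see why some argument types can be summarized and others cannot, consider this simplified sketch. The helpers arg_signature and args_fingerprint are hypothetical, the list of supported types is an assumption for the example, and the error raised is a plain TypeError, not the DDSException shown above.

```python
import hashlib
import io

def arg_signature(value):
    """Return a stable text form for supported values, or raise TypeError."""
    if value is None or isinstance(value, (bool, int, float, str, bytes)):
        return f"{type(value).__name__}:{value!r}"
    if isinstance(value, (list, tuple)):
        inner = ",".join(arg_signature(v) for v in value)
        return f"{type(value).__name__}[{inner}]"
    if isinstance(value, dict):
        # Sort the entries so that insertion order does not affect the result.
        items = sorted((arg_signature(k), arg_signature(v)) for k, v in value.items())
        return "dict{" + ",".join(f"{k}={v}" for k, v in items) + "}"
    raise TypeError(f"The type {type(value)} is not supported in this sketch")

def args_fingerprint(*args):
    joined = "|".join(arg_signature(a) for a in args)
    return hashlib.sha256(joined.encode()).hexdigest()

# Plain values and containers of plain values have a stable fingerprint.
same = (
    args_fingerprint("world", {"b": 2, "a": 1})
    == args_fingerprint("world", {"a": 1, "b": 2})
)

# A file-like object carries hidden state (position, buffers), so it is rejected.
try:
    args_fingerprint(io.StringIO("world"))
    rejected = False
except TypeError:
    rejected = True
```

A file object has no value that can be compared reliably from one run to the next, which is the same reason general classes are refused in the message above.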

How do we still use files? dds does not need to understand the content passed to a function if it is called as a sub-function within dds. More concretely in this example, we can create a wrapper function that contains the file call and the call to the function to keep:

In [16]:

def wrapper_hello():
    f = open("input.txt", "r")
    print(f"Opening file {f}")
    greeting = dds.keep("/greeting", hello_from_file, f)
    return greeting

dds.eval(wrapper_hello)

Opening file <_io.TextIOWrapper name='input.txt' mode='r' encoding='UTF-8'>
Calling hello_from_file

Out[16]:

'Hello, world'

Calling the function again shows that:

we still open the file: the content of wrapper_hello is still executed.
hello_from_file is not called again: even if we pass a file to it, all the source code to provide the arguments is the same, the function hello_from_file is the same, hence dds assumes that the resulting greeting is going to be the same.

As a result, wrapper_hello is run (it is just evaluated), but all the sub-calls to data functions are going to be cached.

In [17]:

dds.eval(wrapper_hello)

Opening file <_io.TextIOWrapper name='input.txt' mode='r' encoding='UTF-8'>

Out[17]:

'Hello, world'

Indirect references: load()

So far, we have seen only one way to access data: using dds.keep (or its shortcut @data_function). It is not always convenient to refer to the data function that created the piece of data in the first place. For example, that function may contain secrets that should not be accessible. This is why dds provides an alternative way to access data, using only the path to the data: the dds.load function.

For example, if we want to retrieve the data stored in /hello_world, we can directly retrieve it in the following way:

In [18]:

dds.load("/hello_world")

Out[18]:

'Hello, world'

Just like the other functions, changes to the underlying data will cause the signature of a dds.load call to change. This function is convenient, but it comes at some cost: because it refers only to the result and not to how the data was calculated, dds cannot check for circular dependencies or whether this reference should be updated.

When to use dds.load and when to directly call the function?

directly calling the function ensures that the most recent version will be taken into account, at the expense of a code dependency
loading the data indirectly hides the implementation, but may introduce coherency and synchronization issues

Furthermore, some extra rules must be respected when mixing load and other functions. In particular, dds will prevent you from reading a dataset first through load and then evaluating it using keep. For example, the following will fail:

In [19]:

@dds.data_function("/f")
def f():
    return 1

def h():
    _ = dds.load("/f")

def g():
    h()
    f()

# This will fail with a DDSException
# dds.eval(g)

Rearranging the call h() after f() solves the problem:

In [20]:

def g():
    f()
    h()
dds.eval(g)
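
The ordering rule can be pictured with a toy model. This is only a sketch of the idea under simplifying assumptions, not the check that dds performs: EvalContext and OrderingError are hypothetical names, and the check runs at call time on a plain dictionary.

```python
class OrderingError(Exception):
    """Raised when a path is produced after it was already loaded."""

class EvalContext:
    def __init__(self, store):
        self.store = store   # data path -> value
        self.loaded = set()  # paths read through load() in this evaluation

    def load(self, path):
        self.loaded.add(path)
        return self.store[path]

    def keep(self, path, fn):
        if path in self.loaded:
            raise OrderingError(f"{path} was loaded before being evaluated")
        self.store[path] = fn()
        return self.store[path]

def produce():
    return 1

# keep first, then load: accepted.
ok_ctx = EvalContext({})
ok_ctx.keep("/f", produce)
ok_value = ok_ctx.load("/f")

# load first, then keep: rejected, because the earlier read may be stale.
bad_ctx = EvalContext({"/f": 1})
bad_ctx.load("/f")
try:
    bad_ctx.keep("/f", produce)
    failed = False
except OrderingError:
    failed = True
```

The point of such a rule is that a value read through load before the path is re-evaluated could belong to an older version of the code, which would break the coherence guarantee.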

Conclusion

In conclusion, dds provides four basic functions to track and cache pieces of data:

data_function is an annotation for functions that take no arguments and return a piece of data that should be tracked
keep is a function that wraps function calls. It can be used standalone when the function uses basic types as arguments.
eval is used in conjunction with keep when data functions take complex arguments.
load directly loads a piece of data from its path (without having to refer to its generating data function)

By building on these foundations, dds allows you to do many more things such as visualizing all the dependencies between data, speeding up Machine Learning pipelines, and parallelizing your code automatically. The other tutorials provide more information.

In [21]:

! ls /tmp/dds/user_guide/internal/blobs | grep -v meta

1c5018ef452f3aafead20de4d9e1ad5e6920453025813a266fde975387d0b5f5
22deab6baa11ebb1f379519a1c00a0bd9a8e6a93e278b8ae319c2bd95c4fd3dc
26f46034012ffdebb21af34aea2e6f0775a521122f457f23bf34d4b97facfb3b
de7ee19728e267fc76a0c22b4aaa5e28c6d9b7388038de9d422fb257609bb671
e95b2a802746c539eb4b2549abaf12d73366f966862d5682c6930a878cf3557
f2802c71b37ba3eeefaf0a6c6f6fe4cec847cbba8f67e7de8bd2580a27cbb5c