Using dds to collaborate between data branches
`dds` can be used within a team to collaborate on data, just like one would collaborate on source code. This tutorial shows how two people, Alice and Bob, can work on the same code base and check out each other's versions of the data.
Let us start with a data function written in the `main` branch.
We will mimic the switch between branches and computers with the two following functions. This assumes that at least the `internal_dir` of the stores is shared between all the collaborators. This is naturally the case when using a shared system such as the Databricks DBFS store, or when mounting a shared drive such as NFS, Microsoft SharePoint, Dropbox, etc.
```python
import dds

def store_main():
    dds.set_store("local",
                  data_dir="/tmp/dds/tut_collab/data_main",
                  internal_dir="/tmp/dds/tut_collab/internal")

def store_fork():
    dds.set_store("local",
                  data_dir="/tmp/dds/tut_collab/data_fork",
                  internal_dir="/tmp/dds/tut_collab/internal")

store_main()
```
This is the code that we have in the `main` branch. Let's run it once to ensure that the content is in the store and is available to everyone.
```python
# main branch
@dds.data_function("/my_data")
def my_data():
    print("calculating my_data")
    return "Alice"

my_data()
```

```
calculating my_data
'Alice'
```
Bob branches the code in his fork. So far, there is no change: when evaluating the data function, he gets the same content.
```python
store_fork()

# fork branch
my_data()
```

```
'Alice'
```
Now, Bob is going to change the content of the branch and update the code.
```python
# fork branch
@dds.data_function("/my_data")
def my_data():
    print("calculating my_data")
    return "Alice, Bob"

my_data()
```

```
calculating my_data
'Alice, Bob'
```
Let's look at the content: the store has two blobs, one for each of the data functions:
```
! ls /tmp/dds/tut_collab/internal/blobs | grep -v meta
16f03a582d2223d294ce9976f7ae4299ad305fbc4a522984ce3c109561ff7851
458a6965f10642f0e9ad3e6e2f9ddad1f437324ca38fda34ab4c3b1e399af7d3
```
In the view of Alice, using the `main` branch, the data still points to the 'Alice' dataset:
```
! cat /tmp/dds/tut_collab/data_main/my_data
Alice
```
And in the view of Bob, working in the `fork` branch, the data is updated:
```
! cat /tmp/dds/tut_collab/data_fork/my_data
Alice, Bob
```
Now, assume that Bob's change has been merged back into the `main` branch: the code in the `main` branch is now the one from the `fork` branch:
```python
# main branch:
@dds.data_function("/my_data")
def my_data():
    print("calculating my_data")
    return "Alice, Bob"
```
When Alice imports the main branch and re-evaluates the code, she gets the updated version without having to recalculate the content of the function:
```python
# main branch
my_data()
```

```
'Alice, Bob'
```
Indeed, the cache was already populated when Bob ran his branch. Alice, working from the `main` branch, does not need to recompute anything: the merged code is the same as the one Bob ran on the `fork` branch, hence the stored artifacts are already there.
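This cache-hit behavior can be illustrated with a toy sketch (not the actual `dds` internals, whose signatures also account for dependencies): key the cache by a hash of the function's source, so identical merged code maps to an existing blob and nothing is recomputed.

```python
import hashlib

blobs = {}         # toy shared "internal_dir": results keyed by code hash
computations = []  # track how many times we actually compute

def evaluate(source_code, compute):
    """Recompute only if this exact code was never evaluated before."""
    key = hashlib.sha256(source_code.encode()).hexdigest()
    if key not in blobs:
        computations.append(key)
        blobs[key] = compute()
    return blobs[key]

fork_code = 'return "Alice, Bob"'    # Bob evaluates this on his fork
evaluate(fork_code, lambda: "Alice, Bob")

merged_code = 'return "Alice, Bob"'  # Alice re-evaluates the merged main
result = evaluate(merged_code, lambda: "Alice, Bob")

print(result)             # 'Alice, Bob'
print(len(computations))  # 1 -- Alice's run was a pure cache hit
```

Because the merged source is byte-identical to what Bob ran, it hashes to the same key and Alice's evaluation never calls `compute`.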
As we can see, switching between branches of data is as easy as switching between branches of code. When re-evaluating the content, `dds` checks whether the objects are already in the shared store; a code branch switch is then just a matter of updating file links to existing objects.
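The link-based layout can be sketched as follows. This is a hypothetical miniature of the store used in this tutorial: blobs live in a shared `internal/blobs` directory keyed by their SHA-256, and each branch's `data_dir` merely links paths to existing blobs, so no bytes move when a branch checks out data.

```python
import hashlib
import os
import tempfile

root = tempfile.mkdtemp()
blob_dir = os.path.join(root, "internal", "blobs")  # shared across branches
os.makedirs(blob_dir)

def put_blob(content: str) -> str:
    """Store content under its SHA-256 hash; return the blob key."""
    key = hashlib.sha256(content.encode()).hexdigest()
    with open(os.path.join(blob_dir, key), "w") as f:
        f.write(content)
    return key

def point(branch: str, path: str, key: str) -> None:
    """A branch's data_dir just maps a path to an existing blob."""
    data_dir = os.path.join(root, f"data_{branch}")
    os.makedirs(data_dir, exist_ok=True)
    # hard link: no bytes are copied when the branch "checks out" the data
    os.link(os.path.join(blob_dir, key), os.path.join(data_dir, path))

k_main = put_blob("Alice")       # blob computed on main
k_fork = put_blob("Alice, Bob")  # blob computed on the fork
point("main", "my_data", k_main)
point("fork", "my_data", k_fork)

print(open(os.path.join(root, "data_main", "my_data")).read())  # Alice
print(open(os.path.join(root, "data_fork", "my_data")).read())  # Alice, Bob
```

Both branch views coexist over the same two blobs, mirroring the two-hash listing of the shared store above.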
How does this work with storage systems that do not support linking, such as the Databricks(R) File System (DBFS) or S3? In this case, there are two possibilities:
- if the data is only meant to be used within `dds`, then the respective stores offer a way to store just links. Switching between branches is very fast, but other systems cannot read the content of the files without changes
- if the data is meant to be shared outside of `dds`, then the stores will copy the content of the blobs to their final destination. Depending on the size and the number of objects to copy, this may take a significant amount of time
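The trade-off between the two modes can be sketched with a toy comparison (hypothetical code, not the actual `dds` store API): a link record costs a few bytes regardless of the artifact's size, while an external copy pays for every byte.

```python
import os
import shutil
import tempfile

root = tempfile.mkdtemp()
blob_path = os.path.join(root, "blob")
with open(blob_path, "wb") as f:
    f.write(b"x" * 1_000_000)  # a 1 MB artifact in the shared store

# Mode 1: data used only within dds -- write a tiny link record
link_path = os.path.join(root, "my_data.link")
with open(link_path, "w") as f:
    f.write(os.path.basename(blob_path))  # just a reference, not the bytes

# Mode 2: data shared with external systems -- copy the full content out
copy_path = os.path.join(root, "my_data")
shutil.copyfile(blob_path, copy_path)

print(os.path.getsize(link_path))  # a few bytes, independent of blob size
print(os.path.getsize(copy_path))  # 1000000 -- cost scales with the data
```

Branch switching in mode 1 only rewrites the small link records; mode 2 re-pays the copy cost for every object that differs between branches.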
To conclude, the "data is code" philosophy of `dds` makes it easy to share and update data in a collaborative environment:
- data is tracked in each branch
- switching between code branches retrieves the corresponding views of the data, just as it does for code
- all the data can be pre-calculated before merging the code, making a combined code+data checkout a fast operation for the target branch