Using dds to collaborate between data branches
`dds` can be used within a team to collaborate on data, just like one would collaborate on source code. This tutorial shows how two people, Alice and Bob, can work on the same code base and check out each other's versions of the data.
Let us start with a data function written in the `main` branch.
We will mimic the switch between branches and computers with the two following functions. This assumes that at least the `internal_dir` of the stores is shared between all the collaborators. This is naturally the case when using a shared system such as the Databricks DBFS store, or when mounting a shared drive such as NFS, Microsoft SharePoint, Dropbox, etc.
```python
import dds

def store_main():
    dds.set_store("local",
                  data_dir="/tmp/dds/tut_collab/data_main",
                  internal_dir="/tmp/dds/tut_collab/internal")

def store_fork():
    dds.set_store("local",
                  data_dir="/tmp/dds/tut_collab/data_fork",
                  internal_dir="/tmp/dds/tut_collab/internal")

store_main()
```
This is the code that we have in the `main` branch. Let's run it once to ensure that the content is in the store and is available to everyone.
```python
# main branch
@dds.data_function("/my_data")
def my_data():
    print("calculating my_data")
    return "Alice"

my_data()
```

```
calculating my_data
'Alice'
```
Bob branches the code in his fork. So far, there is no change: when evaluating the data function, he gets the same content.
```python
store_fork()

# fork branch
my_data()
```

```
'Alice'
```
Now, Bob is going to change the content of the branch and update the code.
```python
# fork branch
@dds.data_function("/my_data")
def my_data():
    print("calculating my_data")
    return "Alice, Bob"

my_data()
```

```
calculating my_data
'Alice, Bob'
```
Let's look at the content: the store has two blobs, one for each of the data functions:
```
! ls /tmp/dds/tut_collab/internal/blobs | grep -v meta
16f03a582d2223d294ce9976f7ae4299ad305fbc4a522984ce3c109561ff7851
458a6965f10642f0e9ad3e6e2f9ddad1f437324ca38fda34ab4c3b1e399af7d3
```
In the view of Alice, using the `main` branch, the data still points to the 'Alice' dataset:
```
! cat /tmp/dds/tut_collab/data_main/my_data
Alice
```
And in the view of Bob, working in the `fork` branch, the data is updated:
```
! cat /tmp/dds/tut_collab/data_fork/my_data
Alice, Bob
```
Now, assume that Bob's change has been merged back into the `main` branch: the code in the `main` branch is now the one from the `fork` branch:
```python
# main branch:
@dds.data_function("/my_data")
def my_data():
    print("calculating my_data")
    return "Alice, Bob"
```
When Alice imports the main branch and re-evaluates the code, she gets the updated version without having to recalculate the content of the function:
```python
# main branch
my_data()
```

```
'Alice, Bob'
```
Indeed, the cache was already populated when Bob ran his branch. Alice, working from the `main` branch, does not need to recompute anything: the merged code is the same as the one Bob ran on the `fork` branch, hence the stored artifacts are already there.
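This cache-hit behavior can be illustrated with a toy sketch (not the actual `dds` internals, whose signatures also account for dependencies): key the cache by a hash of the function's source, so identical merged code maps to an existing blob and nothing is recomputed.

```python
import hashlib

blobs = {}         # toy shared "internal_dir": results keyed by code hash
computations = []  # track how many times we actually compute

def evaluate(source_code, compute):
    """Recompute only if this exact code was never evaluated before."""
    key = hashlib.sha256(source_code.encode()).hexdigest()
    if key not in blobs:
        computations.append(key)
        blobs[key] = compute()
    return blobs[key]

fork_code = 'return "Alice, Bob"'    # Bob evaluates this on his fork
evaluate(fork_code, lambda: "Alice, Bob")

merged_code = 'return "Alice, Bob"'  # Alice re-evaluates the merged main
result = evaluate(merged_code, lambda: "Alice, Bob")

print(result)             # 'Alice, Bob'
print(len(computations))  # 1 -- Alice's run was a pure cache hit
```

Because the merged source is byte-identical to what Bob ran, it hashes to the same key and Alice's evaluation never calls `compute`.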
As we can see, switching between branches of data is as easy as switching between branches of code. When re-evaluating the content, `dds` checks whether the objects are already in the shared store; a code branch switch is then just a matter of updating file links to existing objects.
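The link-based layout can be sketched as follows. This is a hypothetical miniature of the store used in this tutorial: blobs live in a shared `internal/blobs` directory keyed by their SHA-256, and each branch's `data_dir` merely links paths to existing blobs, so no bytes move when a branch checks out data.

```python
import hashlib
import os
import tempfile

root = tempfile.mkdtemp()
blob_dir = os.path.join(root, "internal", "blobs")  # shared across branches
os.makedirs(blob_dir)

def put_blob(content: str) -> str:
    """Store content under its SHA-256 hash; return the blob key."""
    key = hashlib.sha256(content.encode()).hexdigest()
    with open(os.path.join(blob_dir, key), "w") as f:
        f.write(content)
    return key

def point(branch: str, path: str, key: str) -> None:
    """A branch's data_dir just maps a path to an existing blob."""
    data_dir = os.path.join(root, f"data_{branch}")
    os.makedirs(data_dir, exist_ok=True)
    # hard link: no bytes are copied when the branch "checks out" the data
    os.link(os.path.join(blob_dir, key), os.path.join(data_dir, path))

k_main = put_blob("Alice")       # blob computed on main
k_fork = put_blob("Alice, Bob")  # blob computed on the fork
point("main", "my_data", k_main)
point("fork", "my_data", k_fork)

print(open(os.path.join(root, "data_main", "my_data")).read())  # Alice
print(open(os.path.join(root, "data_fork", "my_data")).read())  # Alice, Bob
```

Both branch views coexist over the same two blobs, mirroring the two-hash listing of the shared store above.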
How does this work with storage systems that do not support linking, such as the Databricks(R) File System (DBFS) or S3? In this case, there are two possibilities:
- if the data is only meant to be used within `dds`, then the respective stores offer a way to store just links. Switching between branches is very fast, but other systems cannot read the content of the files without changes
- if the data is meant to be shared outside of `dds`, then the stores will copy the content of the blobs to their final destination. Depending on the size and the number of objects to copy, this may take a significant amount of time
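The trade-off between the two modes can be sketched with a toy comparison (hypothetical code, not the actual `dds` store API): a link record costs a few bytes regardless of the artifact's size, while an external copy pays for every byte.

```python
import os
import shutil
import tempfile

root = tempfile.mkdtemp()
blob_path = os.path.join(root, "blob")
with open(blob_path, "wb") as f:
    f.write(b"x" * 1_000_000)  # a 1 MB artifact in the shared store

# Mode 1: data used only within dds -- write a tiny link record
link_path = os.path.join(root, "my_data.link")
with open(link_path, "w") as f:
    f.write(os.path.basename(blob_path))  # just a reference, not the bytes

# Mode 2: data shared with external systems -- copy the full content out
copy_path = os.path.join(root, "my_data")
shutil.copyfile(blob_path, copy_path)

print(os.path.getsize(link_path))  # a few bytes, independent of blob size
print(os.path.getsize(copy_path))  # 1000000 -- cost scales with the data
```

Branch switching in mode 1 only rewrites the small link records; mode 2 re-pays the copy cost for every object that differs between branches.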
To conclude, the "data is code" philosophy of `dds` makes it easy to share and update data in a collaborative environment:
- data is tracked in each branch
- switching between code branches retrieves the corresponding views of the data, just as it does for code
- all the data can be pre-calculated before merging the code, making a combined code+data checkout a fast operation for the target branch