About
I am currently a software engineer at Databricks.
Prior to that, I did a Ph.D. in Machine Learning in the Department of Computer Science at the University of California at Berkeley. My advisors were Pieter Abbeel and Alexandre Bayen.
Projects
In the course of my work, I have created or co-created a few projects around data science and Machine Learning.
Koalas
Koalas provides a pandas-like API and experience on top of Apache Spark. This dramatically helps the transition from “small scale” data science to “large scale” data science.
Website: GitHub
Announcement: Databricks blog
Presentations:
GraphFrames
GraphFrames is a distributed graph processing interface built on top of Apache Spark. It is the recommended graph interface for Spark until a future inclusion inside Spark. This work was co-authored with Joseph Bradley and Xiangrui Meng.
Website: Documentation, GitHub
Presentations:
spark-sklearn
spark-sklearn is a simple package that helps distribute scikit-learn models on Spark. It is useful when you want to apply or train millions of models on billions of data points.
Website: GitHub, Announcement
TensorFrames
TensorFrames is a wrapper between Google TensorFlow and Apache Spark. While I would not recommend using it directly these days, it drove the efforts to integrate numerical stacks (NumPy, TensorFlow) into Spark while limiting the communication overhead and offering a simple interface. If you want to use TensorFlow in Scala with Spark, though, this is still one of the most efficient ways to do it.
Website: GitHub
Presentations:
Deep Learning Pipelines
Deep Learning Pipelines (DLP) augments Spark’s standard Machine Learning toolkits with Deep Learning technologies. Using this package, one can easily apply DL models to large collections without having to learn new frameworks, while keeping Spark's ease of use. This is one of the recommended ways to apply a Keras or TensorFlow model to billions of data points.
Website: GitHub, Documentation
Presentations:
I currently reside in Amsterdam, the Netherlands. You can reach me through one of the following methods:
Email:
tjhunter@eecs.berkeley.edu
Social networks:
I am also on LinkedIn and Viadeo.
Other presentations
A meetup on MLflow and upcoming features of Spark
A presentation on the status of Project Hydrogen (high performance numerical integration in Spark): video
A few presentations on Geospatial analysis at scale using Deep Learning and Spark (done with my colleague Raela Wang):
A short presentation at Scala By the Bay 2017: video
Webinars
Deep Learning on Apache® Spark™ - Best Practices webinar for Data Science Central
From Pandas to Apache Spark™ webinar for Data Science Central
An old presentation to UK Authority about the modern power of geospatial.
Webinar on MLflow