Tim Hunter's web page

About

I am currently a software engineer at Databricks.

Prior to that, I completed a Ph.D. in Machine Learning in the Department of Computer Science at the University of California, Berkeley. My advisors were Pieter Abbeel and Alexandre Bayen.

Projects

In the course of my work, I have created or co-created a few projects around data science and Machine Learning.

Koalas

Koalas provides a pandas-like API and experience on top of Apache Spark. This greatly eases the transition from “small scale” data science to “large scale” data science.

Website: Github

Announcement: Databricks blog

Presentations:

GraphFrames

GraphFrames is a distributed graph processing interface built on top of Apache Spark. It is the recommended graph interface for Spark until a future inclusion inside Spark. This work was co-authored with Joseph Bradley and Xiangrui Meng.

Website: Documentation Github

Presentations:

spark-sklearn

spark-sklearn is a simple package that helps distribute scikit-learn models on Spark. It is useful when you want to apply or train millions of models on billions of data points.

Website: Github Announcement

TensorFrames

TensorFrames is a wrapper between Google TensorFlow and Apache Spark. While I would not recommend using it directly these days, it drove the efforts to integrate numerical stacks (NumPy, TensorFlow) into Spark while limiting the communication overhead and offering a simple interface. If you want to use TensorFlow in Scala with Spark, though, this is still one of the most efficient ways to do it.

Website: Github

Presentations:

Deep Learning Pipelines

Deep Learning Pipelines (DLP) augments Spark’s standard Machine Learning toolkits with Deep Learning technologies. Using this package, one can easily apply DL models to large collections without having to learn new frameworks, while retaining Spark's ease of use. This is one of the recommended ways to apply a Keras or TensorFlow model to billions of data points.

Website: Github Documentation

Presentations:

Contact

I currently reside in Amsterdam, the Netherlands. You can reach me through one of the following methods:

Email: tjhunter@eecs.berkeley.edu

Social networks: I am also on LinkedIn and Viadeo.

Other presentations

A meetup on MLflow and upcoming features of Spark

A presentation on the status of Project Hydrogen (high performance numerical integration in Spark): video

A few presentations on Geospatial analysis at scale using Deep Learning and Spark (done with my colleague Raela Wang):

A short presentation at Scala By the Bay 2017: video

Webinars

Deep Learning on Apache® Spark™ - Best Practices webinar for Data Science Central

From Pandas to Apache Spark™ webinar for Data Science Central

An old presentation to UK Authority about the modern power of geospatial.

Webinar on MLflow