End-to-end tutorial: scikit-learn tutorial

This tutorial is an adaptation of the Machine Learning tutorial from Elite Data Science. The original tutorial is here:

https://elitedatascience.com/python-machine-learning-tutorial-scikit-learn

Let us start with a few imports.

Setting up the dds store. This is a recommended operation (and is necessary here to generate the documentation).

In [3]:

import dds
dds.set_store("local",
              data_dir="/tmp/dds/tut_sklearn/data",
              internal_dir="/tmp/dds/tut_sklearn/internal")

In [4]:

import sklearn
import pandas as pd
import numpy as np
 
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
import requests
import io
import json

We now import the dds package. By default, the data is stored in a temporary directory.

In [5]:

import dds

The two store directories (data and internal) now exist.

Let's start with a familiar problem: accessing data from the internet. This piece of code downloads a dataset, with the additional twist that the dataset is cached on the local machine.

In [6]:

path_model = "/wine-quality/my_model"
path_model_stats = "/wine-quality/my_model_stats.json"

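@dds.data_function("/wine-quality/raw")
def data():
    # Printed whenever the function body actually runs.
    print("*** in _load_data ***")
    url = "https://raw.githubusercontent.com/zygmuntz/wine-quality/master/winequality/winequality-red.csv"
    # NOTE: verify=False disables TLS certificate verification; remove it
    # unless your environment requires it.
    x = requests.get(url=url, verify=False).content
    # The file is semicolon-separated.
    return pd.read_csv(io.StringIO(x.decode('utf8')), sep=";")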

In [7]:

dds.eval(data, dds_export_graph="/tmp/2.png", dds_extra_debug=True, dds_stages=["analysis"])
from IPython.display import Image
Image("/tmp/2.png")

Out[7]:

[Image: graph exported by dds.eval (dds_export_graph) to /tmp/2.png]

In [8]:

data().head(3)

*** in _load_data ***

Out[8]:

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
0	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5
1	7.8	0.88	0.00	2.6	0.098	25.0	67.0	0.9968	3.20	0.68	9.8	5
2	7.8	0.76	0.04	2.3	0.092	15.0	54.0	0.9970	3.26	0.65	9.8	5

Our complete pipeline. A few points are noteworthy:

- the _load_data message does not appear: the data has already been loaded;
- dds.keep is used to store several pieces of data that depend, in an interconnected fashion, on subsets of the input dataset. The data is still loaded and split on every run, but the ML model and the scoring function are evaluated only once (try rerunning the cell below to see what happens).

In [9]:

def build_model(X_train, y_train):
    print("*** in build_model ***")
    pipeline = make_pipeline(preprocessing.StandardScaler(), 
                             RandomForestRegressor(n_estimators=30))
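    # NOTE: 'auto' is rejected by recent scikit-learn releases, so every fit
    # that uses it fails (see the FitFailedWarning in the output below).
    hyperparameters = {'randomforestregressor__max_features': ['auto', 'sqrt'],
                       'randomforestregressor__max_depth': [None, 5, 3]}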

    clf = GridSearchCV(pipeline, hyperparameters, cv=10)
    
    clf.fit(X_train, y_train)
    return clf
 
    
def model_stats(clf, X_test, y_test) -> str:
    print("*** in model_stats ***")
    pred = clf.predict(X_test)
    return json.dumps({
#         "r2_score": r2_score(y_test, pred), # uncomment me, see what happens
        "mse": mean_squared_error(y_test, pred)
    })
    
    
def pipeline():
    wine_data = data()
    y = wine_data.quality
    X = wine_data.drop('quality', axis=1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                        test_size=0.15, 
                                                        random_state=123, 
                                                        stratify=y)
    clf = dds.keep(path_model, build_model, X_train, y_train)
    dds.keep(path_model_stats, model_stats, clf, X_test, y_test)
    print("*** done ***")


dds.eval(pipeline)

*** in build_model ***
*** in model_stats ***
*** done ***

/home/tjhunter/work/dds_py/.venv/lib/python3.10/site-packages/sklearn/model_selection/_validation.py:425: FitFailedWarning: 
30 fits failed out of a total of 60.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
30 fits failed with the following error:
Traceback (most recent call last):
  File "/home/tjhunter/work/dds_py/.venv/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/tjhunter/work/dds_py/.venv/lib/python3.10/site-packages/sklearn/base.py", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/home/tjhunter/work/dds_py/.venv/lib/python3.10/site-packages/sklearn/pipeline.py", line 427, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/home/tjhunter/work/dds_py/.venv/lib/python3.10/site-packages/sklearn/base.py", line 1145, in wrapper
    estimator._validate_params()
  File "/home/tjhunter/work/dds_py/.venv/lib/python3.10/site-packages/sklearn/base.py", line 638, in _validate_params
    validate_parameter_constraints(
  File "/home/tjhunter/work/dds_py/.venv/lib/python3.10/site-packages/sklearn/utils/_param_validation.py", line 95, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'max_features' parameter of RandomForestRegressor must be an int in the range [1, inf), a float in the range (0.0, 1.0], a str among {'sqrt', 'log2'} or None. Got 'auto' instead.

  warnings.warn(some_fits_failed_message, FitFailedWarning)
/home/tjhunter/work/dds_py/.venv/lib/python3.10/site-packages/sklearn/model_selection/_search.py:979: UserWarning: One or more of the test scores are non-finite: [       nan 0.47861519        nan 0.38602872        nan 0.31607409]
  warnings.warn(
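
The warning above reports 30 failed fits out of 60. The sketch below reproduces that arithmetic from the grid and the cv=10 setting used in build_model: the grid has 2 x 3 = 6 candidates, each fitted on 10 folds, and half of the candidates use the rejected value 'auto'. The name `fixed_grid` is hypothetical and shows only one possible repair, not the author's: it swaps 'auto' for 1.0, which falls in the float range (0.0, 1.0] that the error message lists as valid.

```python
from itertools import product

cv = 10  # number of folds, as in GridSearchCV(..., cv=10)

grid = {
    "randomforestregressor__max_features": ["auto", "sqrt"],
    "randomforestregressor__max_depth": [None, 5, 3],
}

# Every combination of values is one candidate; each is fitted once per fold.
candidates = list(product(*grid.values()))
total_fits = len(candidates) * cv

# Candidates that use the rejected value fail on every fold.
failing = [c for c in candidates if c[0] == "auto"]
failed_fits = len(failing) * cv

print(total_fits, failed_fits)

# Hypothetical repair (an assumption, not part of the original tutorial):
# replace 'auto' with a value that the error message accepts.
fixed_grid = dict(grid, randomforestregressor__max_features=[1.0, "sqrt"])
```

Passing a grid like `fixed_grid` to GridSearchCV should let all candidates fit, although the selected model and the recorded scores may then differ from the outputs shown in this notebook.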

Accessing the output. This can be done in two ways:

- by directly reading the file at its final destination in the store directory. The store in this notebook is configured to write data to /tmp/dds/tut_sklearn/data. Note that my_model_stats.json is directly readable as a text blob, as expected;
- by loading it through its dds path, using dds.load. The latter is preferred because it is compatible with maintaining multiple data forks and branches without having to hardcode such branches.

In [10]:

%%sh
cat /tmp/dds/tut_sklearn/data/wine-quality/my_model_stats.json

{"mse": 0.335625}

In [11]:

dds.load("/wine-quality/my_model_stats.json")

Out[11]:

'{"mse": 0.335625}'
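
Both access routes return the statistics as a JSON string, so a consumer still has to decode it. The small sketch below shows that step with the standard library only. The string is copied from the output above so that the snippet runs standalone; in a live session it would presumably come from dds.load("/wine-quality/my_model_stats.json"). The RMSE computation is an added illustration, not part of the original tutorial.

```python
import json
import math

# The string below is copied from the Out[11] cell above.
raw = '{"mse": 0.335625}'

# Decode the JSON text into a regular Python dictionary.
stats = json.loads(raw)

# Root mean squared error: same unit as the target (wine quality points).
rmse = math.sqrt(stats["mse"])
print(stats["mse"], round(rmse, 3))
```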

Code update: after changing the final model_stats function, only that part is rerun, not build_model.

In [12]:

def model_stats(clf, X_test, y_test) -> str:
    print("*** in model_stats ***")
    pred = clf.predict(X_test)
    return json.dumps({
        "r2_score": r2_score(y_test, pred), # now uncommented
        "mse": mean_squared_error(y_test, pred)
    })

dds.eval(pipeline)

*** in model_stats ***
*** done ***

In [13]:

%%sh
cat /tmp/dds/tut_sklearn/data/wine-quality/my_model_stats.json

{"r2_score": 0.47997310020174844, "mse": 0.335625}
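
The two metrics in the file are linked by R2 = 1 - MSE / Var(y_test), where Var is the population variance of the test targets. The sketch below checks that relation on a tiny made-up sample (the values in `y_true` and `y_pred` are illustrative, not taken from the wine dataset), then uses the two numbers printed above to recover the variance of the test targets that they imply.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def variance(y):
    """Population variance (divides by n, not n - 1)."""
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y) / len(y)


def r2(y_true, y_pred):
    """Coefficient of determination, written as 1 - MSE / Var(y_true)."""
    return 1.0 - mse(y_true, y_pred) / variance(y_true)


# Made-up sample, for illustration only.
y_true = [5, 6, 7, 5]
y_pred = [5.5, 6.0, 6.5, 5.0]
print(mse(y_true, y_pred), r2(y_true, y_pred))

# Variance of the test targets implied by the two values in the JSON file.
implied_var = 0.335625 / (1 - 0.47997310020174844)
print(implied_var)
```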

In [14]:

! ls /tmp/dds/tut_sklearn/internal/blobs | grep -v meta

335afe91bf11f0788b1d8623ff5418a398b8ef6c617544d2d77cdc1dcc3c15b6
5a35d7d812a90852e790254e7d966f039e5807600c595892508fa51d0dca7ca2
9c4883c47e2700445752f5677ae6e7e88e26832b702ab7e68050b0201c2e2b5f
a2bf8e9b34c86816ea608134e7f790a5eb99a7ca399f78733dad8f557c941e3f