End-to-end tutorial: scikit-learn tutorial
This tutorial is an adaptation of the Machine Learning tutorial from Elite Data Science. The original tutorial is here:
https://elitedatascience.com/python-machine-learning-tutorial-scikit-learn
Let us start with a few imports.
Setting up the dds store. This is a recommended operation (here it is necessary to generate the documentation).
import dds
dds.set_store("local",
              data_dir="/tmp/dds/tut_sklearn/data",
              internal_dir="/tmp/dds/tut_sklearn/internal")
import sklearn
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
import requests
import io
import json
We now add the import of the dds package. By default, the data will be stored in the temporary directory.
import dds
The two store directories (data and internal) now exist.
Let's start with a familiar problem of accessing data from the internet. This piece of code will download a dataset, but with the additional twist that the dataset will be cached onto the local machine.
path_model = "/wine-quality/my_model"
path_model_stats = "/wine-quality/my_model_stats.json"

@dds.data_function("/wine-quality/raw")
def data():
    print("*** in _load_data ***")
    url = "https://raw.githubusercontent.com/zygmuntz/wine-quality/master/winequality/winequality-red.csv"
    x = requests.get(url=url, verify=False).content
    return pd.read_csv(io.StringIO(x.decode('utf8')), sep=";")
dds.eval(data, dds_export_graph="/tmp/2.png", dds_extra_debug=True, dds_stages=["analysis"])
from IPython.display import Image
Image("/tmp/2.png")
data().head(3)
*** in _load_data ***
| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
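To try the parsing step without network access, the same pd.read_csv call can be fed an in-memory string. The sketch below is only an illustration: sample is a hand-typed excerpt (four columns of the three rows displayed above), not the downloaded file.

```python
import io

import pandas as pd

# Hand-typed excerpt (assumption): four columns of the three rows shown above,
# using the same ';' separator as the real file.
sample = (
    "fixed acidity;volatile acidity;alcohol;quality\n"
    "7.4;0.70;9.4;5\n"
    "7.8;0.88;9.8;5\n"
    "7.8;0.76;9.8;5\n"
)

# Same parsing call as in data(), applied to the in-memory text.
df = pd.read_csv(io.StringIO(sample), sep=";")

# Split features and target, as the tutorial's pipeline function does.
y = df.quality
X = df.drop("quality", axis=1)
print(X.shape, y.tolist())
```

Column names that contain spaces, such as fixed acidity, are only reachable with the bracket syntax (df["fixed acidity"]), whereas quality also works as an attribute.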
Our complete pipeline. A few points are noteworthy:
- the _load_data message does not appear: the data has already been loaded
- note the use of dds.keep to store multiple interdependent pieces of data that each depend on a subset of the input dataset. The data is still loaded and split on every run, but the ML model and the scoring function are evaluated only once (try rerunning the cell below to see what happens)
def build_model(X_train, y_train):
    print("*** in build_model ***")
    pipeline = make_pipeline(preprocessing.StandardScaler(),
                             RandomForestRegressor(n_estimators=30))
    # NOTE: recent scikit-learn versions reject 'auto' for max_features
    # (see the FitFailedWarning in the output below)
    hyperparameters = { 'randomforestregressor__max_features' : ['auto', 'sqrt'],
                        'randomforestregressor__max_depth': [None, 5, 3]}

    clf = GridSearchCV(pipeline, hyperparameters, cv=10)
    clf.fit(X_train, y_train)
    return clf

def model_stats(clf, X_test, y_test) -> str:
    print("*** in model_stats ***")
    pred = clf.predict(X_test)
    return json.dumps({
        # "r2_score": r2_score(y_test, pred), # uncomment me, see what happens
        "mse": mean_squared_error(y_test, pred)
    })

def pipeline():
    wine_data = data()
    y = wine_data.quality
    X = wine_data.drop('quality', axis=1)
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        test_size=0.15,
                                                        random_state=123,
                                                        stratify=y)
    clf = dds.keep(path_model, build_model, X_train, y_train)
    dds.keep(path_model_stats, model_stats, clf, X_test, y_test)
    print("*** done ***")

dds.eval(pipeline)
*** in build_model ***
*** in model_stats ***
*** done ***
/home/tjhunter/work/dds_py/.venv/lib/python3.10/site-packages/sklearn/model_selection/_validation.py:425: FitFailedWarning:
30 fits failed out of a total of 60.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
30 fits failed with the following error:
Traceback (most recent call last):
  File "/home/tjhunter/work/dds_py/.venv/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/tjhunter/work/dds_py/.venv/lib/python3.10/site-packages/sklearn/base.py", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/home/tjhunter/work/dds_py/.venv/lib/python3.10/site-packages/sklearn/pipeline.py", line 427, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/home/tjhunter/work/dds_py/.venv/lib/python3.10/site-packages/sklearn/base.py", line 1145, in wrapper
    estimator._validate_params()
  File "/home/tjhunter/work/dds_py/.venv/lib/python3.10/site-packages/sklearn/base.py", line 638, in _validate_params
    validate_parameter_constraints(
  File "/home/tjhunter/work/dds_py/.venv/lib/python3.10/site-packages/sklearn/utils/_param_validation.py", line 95, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'max_features' parameter of RandomForestRegressor must be an int in the range [1, inf), a float in the range (0.0, 1.0], a str among {'sqrt', 'log2'} or None. Got 'auto' instead.

  warnings.warn(some_fits_failed_message, FitFailedWarning)
/home/tjhunter/work/dds_py/.venv/lib/python3.10/site-packages/sklearn/model_selection/_search.py:979: UserWarning: One or more of the test scores are non-finite: [ nan 0.47861519 nan 0.38602872 nan 0.31607409]
  warnings.warn(
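The warning above reports that this scikit-learn version rejects max_features='auto', so half of the grid is never fitted. One possible adjustment, sketched here on synthetic data, replaces 'auto' with 1.0, which considers all features at each split (the behaviour 'auto' had for regressors in older releases). The make_regression dataset, the smaller n_estimators and cv=3 are assumptions chosen to keep the example fast; they are not the tutorial's settings, and the tutorial's reported scores would change with a different grid.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the wine data (assumption, for speed only).
X, y = make_regression(n_samples=120, n_features=6, noise=0.5, random_state=0)

pipe = make_pipeline(preprocessing.StandardScaler(),
                     RandomForestRegressor(n_estimators=10, random_state=0))

# 1.0 (a float fraction of the features) replaces the rejected 'auto'.
hyperparameters = {
    "randomforestregressor__max_features": [1.0, "sqrt"],
    "randomforestregressor__max_depth": [None, 5, 3],
}

search = GridSearchCV(pipe, hyperparameters, cv=3)
search.fit(X, y)

# With only valid values in the grid, every combination gets a finite score.
scores = search.cv_results_["mean_test_score"]
print(np.isfinite(scores).all())
```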
Accessing the output. This can be done in 2 ways:
- directly reading the files in their final destination in the store directory. The store in this notebook is configured to write data in /tmp/dds/tut_sklearn/data. Note that my_model_stats.json is directly readable as a text blob, as expected
- loading it through its dds path, using dds.load. The latter is preferred because it is compatible with maintaining multiple data forks and branches without having to hardcode such branches.
%%sh
cat /tmp/dds/tut_sklearn/data/wine-quality/my_model_stats.json
{"mse": 0.335625}
dds.load("/wine-quality/my_model_stats.json")
'{"mse": 0.335625}'
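Since model_stats returns JSON text, the value obtained from the store is a string, and one more decoding step is needed to get the numbers back. A minimal sketch, using the literal string displayed above in place of a live dds.load call:

```python
import json

# Literal copy of the value displayed above; in the notebook it would come
# from dds.load("/wine-quality/my_model_stats.json").
raw = '{"mse": 0.335625}'

stats = json.loads(raw)  # decode the JSON text into a dict
print(stats["mse"])
```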
Code update: updating the final model_stats function simply reruns this part, not build_model.
def model_stats(clf, X_test, y_test) -> str:
    print("*** in model_stats ***")
    pred = clf.predict(X_test)
    return json.dumps({
        "r2_score": r2_score(y_test, pred), # now it was uncommented
        "mse": mean_squared_error(y_test, pred)
    })

dds.eval(pipeline)
*** in model_stats ***
*** done ***
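One way to picture why only model_stats ran again is to give each step a signature derived from its own code and from the signatures of its inputs. The sketch below is a toy model of that idea and makes no claim about how dds is implemented: signature, keep, run and the in-memory store are hypothetical names, and each step's code is represented by a plain version string.

```python
import hashlib

store = {}   # hypothetical in-memory store: signature -> cached value
calls = []   # records which steps actually executed


def signature(code_version: str, *input_sigs: str) -> str:
    """Hash a step's code version together with its inputs' signatures."""
    h = hashlib.sha256(code_version.encode())
    for s in input_sigs:
        h.update(s.encode())
    return h.hexdigest()


def keep(name, code_version, fn, *inputs):
    """Run fn only if no value is stored under this step's signature.

    Each input is a (signature, value) pair, and so is the return value.
    """
    sig = signature(code_version, *(s for s, _ in inputs))
    if sig not in store:
        calls.append(name)
        store[sig] = fn(*(v for _, v in inputs))
    return sig, store[sig]


def run(stats_version):
    model = keep("build_model", "v1", lambda: "fitted model")
    return keep("model_stats", stats_version, lambda m: f"stats of {m}", model)


run("v1")   # first run: both steps execute
run("v1")   # nothing changed: nothing executes
run("v2")   # only model_stats changed: only it executes
print(calls)
```

Because the signature of build_model does not depend on model_stats, editing the latter leaves the cached model untouched, which matches the output observed above.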
%%sh
cat /tmp/dds/tut_sklearn/data/wine-quality/my_model_stats.json
{"r2_score": 0.47997310020174844, "mse": 0.335625}
! ls /tmp/dds/tut_sklearn/internal/blobs | grep -v meta
335afe91bf11f0788b1d8623ff5418a398b8ef6c617544d2d77cdc1dcc3c15b6
5a35d7d812a90852e790254e7d966f039e5807600c595892508fa51d0dca7ca2
9c4883c47e2700445752f5677ae6e7e88e26832b702ab7e68050b0201c2e2b5f
a2bf8e9b34c86816ea608134e7f790a5eb99a7ca399f78733dad8f557c941e3f