From 1d78d6fd60eb79eddb42c3e587e05753ddec949c Mon Sep 17 00:00:00 2001
From: krassowski <5832902+krassowski@users.noreply.github.com>
Date: Thu, 15 Aug 2024 12:55:16 +0100
Subject: [PATCH] Remove examples which require polars
---
.../plot_time_series_lagged_features.ipynb | 381 ------------------
.../gaussian_process/plot_gpr_co2.ipynb | 324 ---------------
2 files changed, 705 deletions(-)
delete mode 100644 content/sklearn/applications/plot_time_series_lagged_features.ipynb
delete mode 100644 content/sklearn/gaussian_process/plot_gpr_co2.ipynb
diff --git a/content/sklearn/applications/plot_time_series_lagged_features.ipynb b/content/sklearn/applications/plot_time_series_lagged_features.ipynb
deleted file mode 100644
index 6c5d8a9..0000000
--- a/content/sklearn/applications/plot_time_series_lagged_features.ipynb
+++ /dev/null
@@ -1,381 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "
The first `import sklearn` can take roughly 10-20s.\n
"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "%pip install polars\n%pip install pyodide-http\nimport pyodide_http\npyodide_http.patch_all()\nimport matplotlib\nimport pandas"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n# Lagged features for time series forecasting\n\nThis example demonstrates how Polars-engineered lagged features can be used\nfor time series forecasting with\n:class:`~sklearn.ensemble.HistGradientBoostingRegressor` on the Bike Sharing\nDemand dataset.\n\nSee the example on\n`sphx_glr_auto_examples_applications_plot_cyclical_feature_engineering.py`\nfor some data exploration on this dataset and a demo on periodic feature\nengineering.\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Analyzing the Bike Sharing Demand dataset\n\nWe start by loading the data from the OpenML repository\nas a pandas dataframe. This will be replaced with Polars\nonce `fetch_openml` adds a native support for it.\nWe convert to Polars for feature engineering, as it automatically caches\ncommon subexpressions which are reused in multiple expressions\n(like `pl.col(\"count\").shift(1)` below). See\nhttps://docs.pola.rs/user-guide/lazy/optimizations/ for more information.\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "import numpy as np\nimport polars as pl\n\nfrom sklearn.datasets import fetch_openml\n\npl.Config.set_fmt_str_lengths(20)\n\nbike_sharing = fetch_openml(\n \"Bike_Sharing_Demand\", version=2, as_frame=True, parser=\"pandas\"\n)\ndf = bike_sharing.frame\ndf = pl.DataFrame({col: df[col].to_numpy() for col in df.columns})"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Next, we take a look at the statistical summary of the dataset\nso that we can better understand the data that we are working with.\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "import polars.selectors as cs\n\nsummary = df.select(cs.numeric()).describe()\nsummary"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let us look at the count of the seasons `\"fall\"`, `\"spring\"`, `\"summer\"`\nand `\"winter\"` present in the dataset to confirm they are balanced.\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "import matplotlib.pyplot as plt\n\ndf[\"season\"].value_counts()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Generating Polars-engineered lagged features\nLet's consider the problem of predicting the demand at the\nnext hour given past demands. Since the demand is a continuous\nvariable, one could intuitively use any regression model. However, we do\nnot have the usual `(X_train, y_train)` dataset. Instead, we just have\nthe `y_train` demand data sequentially organized by time.\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "lagged_df = df.select(\n \"count\",\n *[pl.col(\"count\").shift(i).alias(f\"lagged_count_{i}h\") for i in [1, 2, 3]],\n lagged_count_1d=pl.col(\"count\").shift(24),\n lagged_count_1d_1h=pl.col(\"count\").shift(24 + 1),\n lagged_count_7d=pl.col(\"count\").shift(7 * 24),\n lagged_count_7d_1h=pl.col(\"count\").shift(7 * 24 + 1),\n lagged_mean_24h=pl.col(\"count\").shift(1).rolling_mean(24),\n lagged_max_24h=pl.col(\"count\").shift(1).rolling_max(24),\n lagged_min_24h=pl.col(\"count\").shift(1).rolling_min(24),\n lagged_mean_7d=pl.col(\"count\").shift(1).rolling_mean(7 * 24),\n lagged_max_7d=pl.col(\"count\").shift(1).rolling_max(7 * 24),\n lagged_min_7d=pl.col(\"count\").shift(1).rolling_min(7 * 24),\n)\nlagged_df.tail(10)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Watch out however, the first lines have undefined values because their own\npast is unknown. This depends on how much lag we used:\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "lagged_df.head(10)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can now separate the lagged features in a matrix `X` and the target variable\n(the counts to predict) in an array of the same first dimension `y`.\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "lagged_df = lagged_df.drop_nulls()\nX = lagged_df.drop(\"count\")\ny = lagged_df[\"count\"]\nprint(\"X shape: {}\\ny shape: {}\".format(X.shape, y.shape))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Naive evaluation of the next hour bike demand regression\nLet's randomly split our tabularized dataset to train a gradient\nboosting regression tree (GBRT) model and evaluate it using Mean\nAbsolute Percentage Error (MAPE). If our model is aimed at forecasting\n(i.e., predicting future data from past data), we should not use training\ndata that are ulterior to the testing data. In time series machine learning\nthe \"i.i.d\" (independent and identically distributed) assumption does not\nhold true as the data points are not independent and have a temporal\nrelationship.\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "from sklearn.ensemble import HistGradientBoostingRegressor\nfrom sklearn.model_selection import train_test_split\n\nX_train, X_test, y_train, y_test = train_test_split(\n X, y, test_size=0.2, random_state=42\n)\n\nmodel = HistGradientBoostingRegressor().fit(X_train, y_train)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Taking a look at the performance of the model.\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "from sklearn.metrics import mean_absolute_percentage_error\n\ny_pred = model.predict(X_test)\nmean_absolute_percentage_error(y_test, y_pred)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Proper next hour forecasting evaluation\nLet's use a proper evaluation splitting strategies that takes into account\nthe temporal structure of the dataset to evaluate our model's ability to\npredict data points in the future (to avoid cheating by reading values from\nthe lagged features in the training set).\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "from sklearn.model_selection import TimeSeriesSplit\n\nts_cv = TimeSeriesSplit(\n n_splits=3, # to keep the notebook fast enough on common laptops\n gap=48, # 2 days data gap between train and test\n max_train_size=10000, # keep train sets of comparable sizes\n test_size=3000, # for 2 or 3 digits of precision in scores\n)\nall_splits = list(ts_cv.split(X, y))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Training the model and evaluating its performance based on MAPE.\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "train_idx, test_idx = all_splits[0]\nX_train, X_test = X[train_idx, :], X[test_idx, :]\ny_train, y_test = y[train_idx], y[test_idx]\n\nmodel = HistGradientBoostingRegressor().fit(X_train, y_train)\ny_pred = model.predict(X_test)\nmean_absolute_percentage_error(y_test, y_pred)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The generalization error measured via a shuffled trained test split\nis too optimistic. The generalization via a time-based split is likely to\nbe more representative of the true performance of the regression model.\nLet's assess this variability of our error evaluation with proper\ncross-validation:\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "from sklearn.model_selection import cross_val_score\n\ncv_mape_scores = -cross_val_score(\n model, X, y, cv=ts_cv, scoring=\"neg_mean_absolute_percentage_error\"\n)\ncv_mape_scores"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The variability across splits is quite large! In a real life setting\nit would be advised to use more splits to better assess the variability.\nLet's report the mean CV scores and their standard deviation from now on.\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "print(f\"CV MAPE: {cv_mape_scores.mean():.3f} ± {cv_mape_scores.std():.3f}\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can compute several combinations of evaluation metrics and loss functions,\nwhich are reported a bit below.\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "from collections import defaultdict\n\nfrom sklearn.metrics import (\n make_scorer,\n mean_absolute_error,\n mean_pinball_loss,\n root_mean_squared_error,\n)\nfrom sklearn.model_selection import cross_validate\n\n\ndef consolidate_scores(cv_results, scores, metric):\n if metric == \"MAPE\":\n scores[metric].append(f\"{value.mean():.2f} ± {value.std():.2f}\")\n else:\n scores[metric].append(f\"{value.mean():.1f} ± {value.std():.1f}\")\n\n return scores\n\n\nscoring = {\n \"MAPE\": make_scorer(mean_absolute_percentage_error),\n \"RMSE\": make_scorer(root_mean_squared_error),\n \"MAE\": make_scorer(mean_absolute_error),\n \"pinball_loss_05\": make_scorer(mean_pinball_loss, alpha=0.05),\n \"pinball_loss_50\": make_scorer(mean_pinball_loss, alpha=0.50),\n \"pinball_loss_95\": make_scorer(mean_pinball_loss, alpha=0.95),\n}\nloss_functions = [\"squared_error\", \"poisson\", \"absolute_error\"]\nscores = defaultdict(list)\nfor loss_func in loss_functions:\n model = HistGradientBoostingRegressor(loss=loss_func)\n cv_results = cross_validate(\n model,\n X,\n y,\n cv=ts_cv,\n scoring=scoring,\n n_jobs=2,\n )\n time = cv_results[\"fit_time\"]\n scores[\"loss\"].append(loss_func)\n scores[\"fit_time\"].append(f\"{time.mean():.2f} ± {time.std():.2f} s\")\n\n for key, value in cv_results.items():\n if key.startswith(\"test_\"):\n metric = key.split(\"test_\")[1]\n scores = consolidate_scores(cv_results, scores, metric)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Modeling predictive uncertainty via quantile regression\nInstead of modeling the expected value of the distribution of\n$Y|X$ like the least squares and Poisson losses do, one could try to\nestimate quantiles of the conditional distribution.\n\n$Y|X=x_i$ is expected to be a random variable for a given data point\n$x_i$ because we expect that the number of rentals cannot be 100%\naccurately predicted from the features. It can be influenced by other\nvariables not properly captured by the existing lagged features. For\ninstance whether or not it will rain in the next hour cannot be fully\nanticipated from the past hours bike rental data. This is what we\ncall aleatoric uncertainty.\n\nQuantile regression makes it possible to give a finer description of that\ndistribution without making strong assumptions on its shape.\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "quantile_list = [0.05, 0.5, 0.95]\n\nfor quantile in quantile_list:\n model = HistGradientBoostingRegressor(loss=\"quantile\", quantile=quantile)\n cv_results = cross_validate(\n model,\n X,\n y,\n cv=ts_cv,\n scoring=scoring,\n n_jobs=2,\n )\n time = cv_results[\"fit_time\"]\n scores[\"fit_time\"].append(f\"{time.mean():.2f} ± {time.std():.2f} s\")\n\n scores[\"loss\"].append(f\"quantile {int(quantile*100)}\")\n for key, value in cv_results.items():\n if key.startswith(\"test_\"):\n metric = key.split(\"test_\")[1]\n scores = consolidate_scores(cv_results, scores, metric)\n\nscores_df = pl.DataFrame(scores)\nscores_df"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let us take a look at the losses that minimise each metric.\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "def min_arg(col):\n col_split = pl.col(col).str.split(\" \")\n return pl.arg_sort_by(\n col_split.list.get(0).cast(pl.Float64),\n col_split.list.get(2).cast(pl.Float64),\n ).first()\n\n\nscores_df.select(\n pl.col(\"loss\").get(min_arg(col_name)).alias(col_name)\n for col_name in scores_df.columns\n if col_name != \"loss\"\n)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Even if the score distributions overlap due to the variance in the dataset,\nit is true that the average RMSE is lower when `loss=\"squared_error\"`, whereas\nthe average MAPE is lower when `loss=\"absolute_error\"` as expected. That is\nalso the case for the Mean Pinball Loss with the quantiles 5 and 95. The score\ncorresponding to the 50 quantile loss is overlapping with the score obtained\nby minimizing other loss functions, which is also the case for the MAE.\n\n## A qualitative look at the predictions\nWe can now visualize the performance of the model with regards\nto the 5th percentile, median and the 95th percentile:\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "all_splits = list(ts_cv.split(X, y))\ntrain_idx, test_idx = all_splits[0]\n\nX_train, X_test = X[train_idx, :], X[test_idx, :]\ny_train, y_test = y[train_idx], y[test_idx]\n\nmax_iter = 50\ngbrt_mean_poisson = HistGradientBoostingRegressor(loss=\"poisson\", max_iter=max_iter)\ngbrt_mean_poisson.fit(X_train, y_train)\nmean_predictions = gbrt_mean_poisson.predict(X_test)\n\ngbrt_median = HistGradientBoostingRegressor(\n loss=\"quantile\", quantile=0.5, max_iter=max_iter\n)\ngbrt_median.fit(X_train, y_train)\nmedian_predictions = gbrt_median.predict(X_test)\n\ngbrt_percentile_5 = HistGradientBoostingRegressor(\n loss=\"quantile\", quantile=0.05, max_iter=max_iter\n)\ngbrt_percentile_5.fit(X_train, y_train)\npercentile_5_predictions = gbrt_percentile_5.predict(X_test)\n\ngbrt_percentile_95 = HistGradientBoostingRegressor(\n loss=\"quantile\", quantile=0.95, max_iter=max_iter\n)\ngbrt_percentile_95.fit(X_train, y_train)\npercentile_95_predictions = gbrt_percentile_95.predict(X_test)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can now take a look at the predictions made by the regression models:\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "last_hours = slice(-96, None)\nfig, ax = plt.subplots(figsize=(15, 7))\nplt.title(\"Predictions by regression models\")\nax.plot(\n y_test[last_hours],\n \"x-\",\n alpha=0.2,\n label=\"Actual demand\",\n color=\"black\",\n)\nax.plot(\n median_predictions[last_hours],\n \"^-\",\n label=\"GBRT median\",\n)\nax.plot(\n mean_predictions[last_hours],\n \"x-\",\n label=\"GBRT mean (Poisson)\",\n)\nax.fill_between(\n np.arange(96),\n percentile_5_predictions[last_hours],\n percentile_95_predictions[last_hours],\n alpha=0.3,\n label=\"GBRT 90% interval\",\n)\n_ = ax.legend()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Here it's interesting to notice that the blue area between the 5% and 95%\npercentile estimators has a width that varies with the time of the day:\n\n- At night, the blue band is much narrower: the pair of models is quite\n certain that there will be a small number of bike rentals. And furthermore\n these seem correct in the sense that the actual demand stays in that blue\n band.\n- During the day, the blue band is much wider: the uncertainty grows, probably\n because of the variability of the weather that can have a very large impact,\n especially on week-ends.\n- We can also see that during week-days, the commute pattern is still visible in\n the 5% and 95% estimations.\n- Finally, it is expected that 10% of the time, the actual demand does not lie\n between the 5% and 95% percentile estimates. On this test span, the actual\n demand seems to be higher, especially during the rush hours. It might reveal that\n our 95% percentile estimator underestimates the demand peaks. This could be be\n quantitatively confirmed by computing empirical coverage numbers as done in\n the `calibration of confidence intervals `.\n\nLooking at the performance of non-linear regression models vs\nthe best models:\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "from sklearn.metrics import PredictionErrorDisplay\n\nfig, axes = plt.subplots(ncols=3, figsize=(15, 6), sharey=True)\nfig.suptitle(\"Non-linear regression models\")\npredictions = [\n median_predictions,\n percentile_5_predictions,\n percentile_95_predictions,\n]\nlabels = [\n \"Median\",\n \"5th percentile\",\n \"95th percentile\",\n]\nfor ax, pred, label in zip(axes, predictions, labels):\n PredictionErrorDisplay.from_predictions(\n y_true=y_test,\n y_pred=pred,\n kind=\"residual_vs_predicted\",\n scatter_kwargs={\"alpha\": 0.3},\n ax=ax,\n )\n ax.set(xlabel=\"Predicted demand\", ylabel=\"True demand\")\n ax.legend([\"Best model\", label])\n\nplt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Conclusion\nThrough this example we explored time series forecasting using lagged\nfeatures. We compared a naive regression (using the standardized\n:class:`~sklearn.model_selection.train_test_split`) with a proper time\nseries evaluation strategy using\n:class:`~sklearn.model_selection.TimeSeriesSplit`. We observed that the\nmodel trained using :class:`~sklearn.model_selection.train_test_split`,\nhaving a default value of `shuffle` set to `True` produced an overly\noptimistic Mean Average Percentage Error (MAPE). The results\nproduced from the time-based split better represent the performance\nof our time-series regression model. We also analyzed the predictive uncertainty\nof our model via Quantile Regression. Predictions based on the 5th and\n95th percentile using `loss=\"quantile\"` provide us with a quantitative estimate\nof the uncertainty of the forecasts made by our time series regression model.\nUncertainty estimation can also be performed\nusing [MAPIE](https://mapie.readthedocs.io/en/latest/index.html),\nthat provides an implementation based on recent work on conformal prediction\nmethods and estimates both aleatoric and epistemic uncertainty at the same time.\nFurthermore, functionalities provided\nby [sktime](https://www.sktime.net/en/latest/users.html)\ncan be used to extend scikit-learn estimators by making use of recursive time\nseries forecasting, that enables dynamic predictions of future values.\n\n"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.9.19"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 0
-}
\ No newline at end of file
diff --git a/content/sklearn/gaussian_process/plot_gpr_co2.ipynb b/content/sklearn/gaussian_process/plot_gpr_co2.ipynb
deleted file mode 100644
index 2cc0e5a..0000000
--- a/content/sklearn/gaussian_process/plot_gpr_co2.ipynb
+++ /dev/null
@@ -1,324 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The first `import sklearn` can take roughly 10-20s.\n
"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "%pip install polars\n%pip install pyodide-http\nimport pyodide_http\npyodide_http.patch_all()\nimport matplotlib\nimport pandas"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n# Forecasting of CO2 level on Mona Loa dataset using Gaussian process regression (GPR)\n\nThis example is based on Section 5.4.3 of \"Gaussian Processes for Machine\nLearning\" [1]_. It illustrates an example of complex kernel engineering\nand hyperparameter optimization using gradient ascent on the\nlog-marginal-likelihood. The data consists of the monthly average atmospheric\nCO2 concentrations (in parts per million by volume (ppm)) collected at the\nMauna Loa Observatory in Hawaii, between 1958 and 2001. The objective is to\nmodel the CO2 concentration as a function of the time $t$ and extrapolate\nfor years after 2001.\n\n.. rubric:: References\n\n.. [1] [Rasmussen, Carl Edward. \"Gaussian processes in machine learning.\"\n Summer school on machine learning. Springer, Berlin, Heidelberg, 2003](http://www.gaussianprocess.org/gpml/chapters/RW.pdf).\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "print(__doc__)\n\n# Authors: Jan Hendrik Metzen \n# Guillaume Lemaitre \n# License: BSD 3 clause"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Build the dataset\n\nWe will derive a dataset from the Mauna Loa Observatory that collected air\nsamples. We are interested in estimating the concentration of CO2 and\nextrapolate it for further year. First, we load the original dataset available\nin OpenML as a pandas dataframe. This will be replaced with Polars\nonce `fetch_openml` adds a native support for it.\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "from sklearn.datasets import fetch_openml\n\nco2 = fetch_openml(data_id=41187, as_frame=True)\nco2.frame.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "First, we process the original dataframe to create a date column and select\nit along with the CO2 column.\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "import polars as pl\n\nco2_data = pl.DataFrame(co2.frame[[\"year\", \"month\", \"day\", \"co2\"]]).select(\n pl.date(\"year\", \"month\", \"day\"), \"co2\"\n)\nco2_data.head()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "co2_data[\"date\"].min(), co2_data[\"date\"].max()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We see that we get CO2 concentration for some days from March, 1958 to\nDecember, 2001. We can plot these raw information to have a better\nunderstanding.\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "import matplotlib.pyplot as plt\n\nplt.plot(co2_data[\"date\"], co2_data[\"co2\"])\nplt.xlabel(\"date\")\nplt.ylabel(\"CO$_2$ concentration (ppm)\")\n_ = plt.title(\"Raw air samples measurements from the Mauna Loa Observatory\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We will preprocess the dataset by taking a monthly average and drop month\nfor which no measurements were collected. Such a processing will have an\nsmoothing effect on the data.\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "co2_data = (\n co2_data.sort(by=\"date\")\n .group_by_dynamic(\"date\", every=\"1mo\")\n .agg(pl.col(\"co2\").mean())\n .drop_nulls()\n)\nplt.plot(co2_data[\"date\"], co2_data[\"co2\"])\nplt.xlabel(\"date\")\nplt.ylabel(\"Monthly average of CO$_2$ concentration (ppm)\")\n_ = plt.title(\n \"Monthly average of air samples measurements\\nfrom the Mauna Loa Observatory\"\n)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The idea in this example will be to predict the CO2 concentration in function\nof the date. We are as well interested in extrapolating for upcoming year\nafter 2001.\n\nAs a first step, we will divide the data and the target to estimate. The data\nbeing a date, we will convert it into a numeric.\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "X = co2_data.select(\n pl.col(\"date\").dt.year() + pl.col(\"date\").dt.month() / 12\n).to_numpy()\ny = co2_data[\"co2\"].to_numpy()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Design the proper kernel\n\nTo design the kernel to use with our Gaussian process, we can make some\nassumption regarding the data at hand. We observe that they have several\ncharacteristics: we see a long term rising trend, a pronounced seasonal\nvariation and some smaller irregularities. We can use different appropriate\nkernel that would capture these features.\n\nFirst, the long term rising trend could be fitted using a radial basis\nfunction (RBF) kernel with a large length-scale parameter. The RBF kernel\nwith a large length-scale enforces this component to be smooth. An trending\nincrease is not enforced as to give a degree of freedom to our model. The\nspecific length-scale and the amplitude are free hyperparameters.\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "from sklearn.gaussian_process.kernels import RBF\n\nlong_term_trend_kernel = 50.0**2 * RBF(length_scale=50.0)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The seasonal variation is explained by the periodic exponential sine squared\nkernel with a fixed periodicity of 1 year. The length-scale of this periodic\ncomponent, controlling its smoothness, is a free parameter. In order to allow\ndecaying away from exact periodicity, the product with an RBF kernel is\ntaken. The length-scale of this RBF component controls the decay time and is\na further free parameter. This type of kernel is also known as locally\nperiodic kernel.\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "from sklearn.gaussian_process.kernels import ExpSineSquared\n\nseasonal_kernel = (\n 2.0**2\n * RBF(length_scale=100.0)\n * ExpSineSquared(length_scale=1.0, periodicity=1.0, periodicity_bounds=\"fixed\")\n)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The small irregularities are to be explained by a rational quadratic kernel\ncomponent, whose length-scale and alpha parameter, which quantifies the\ndiffuseness of the length-scales, are to be determined. A rational quadratic\nkernel is equivalent to an RBF kernel with several length-scale and will\nbetter accommodate the different irregularities.\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "from sklearn.gaussian_process.kernels import RationalQuadratic\n\nirregularities_kernel = 0.5**2 * RationalQuadratic(length_scale=1.0, alpha=1.0)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Finally, the noise in the dataset can be accounted with a kernel consisting\nof an RBF kernel contribution, which shall explain the correlated noise\ncomponents such as local weather phenomena, and a white kernel contribution\nfor the white noise. The relative amplitudes and the RBF's length scale are\nfurther free parameters.\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "from sklearn.gaussian_process.kernels import WhiteKernel\n\nnoise_kernel = 0.1**2 * RBF(length_scale=0.1) + WhiteKernel(\n noise_level=0.1**2, noise_level_bounds=(1e-5, 1e5)\n)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Thus, our final kernel is an addition of all previous kernel.\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "co2_kernel = (\n long_term_trend_kernel + seasonal_kernel + irregularities_kernel + noise_kernel\n)\nco2_kernel"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Model fitting and extrapolation\n\nNow, we are ready to use a Gaussian process regressor and fit the available\ndata. To follow the example from the literature, we will subtract the mean\nfrom the target. We could have used `normalize_y=True`. However, doing so\nwould have also scaled the target (dividing `y` by its standard deviation).\nThus, the hyperparameters of the different kernel would have had different\nmeaning since they would not have been expressed in ppm.\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "from sklearn.gaussian_process import GaussianProcessRegressor\n\ny_mean = y.mean()\ngaussian_process = GaussianProcessRegressor(kernel=co2_kernel, normalize_y=False)\ngaussian_process.fit(X, y - y_mean)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now, we will use the Gaussian process to predict on:\n\n- training data to inspect the goodness of fit;\n- future data to see the extrapolation done by the model.\n\nThus, we create synthetic data from 1958 to the current month. In addition,\nwe need to add the subtracted mean computed during training.\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "import datetime\n\nimport numpy as np\n\ntoday = datetime.datetime.now()\ncurrent_month = today.year + today.month / 12\nX_test = np.linspace(start=1958, stop=current_month, num=1_000).reshape(-1, 1)\nmean_y_pred, std_y_pred = gaussian_process.predict(X_test, return_std=True)\nmean_y_pred += y_mean"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "plt.plot(X, y, color=\"black\", linestyle=\"dashed\", label=\"Measurements\")\nplt.plot(X_test, mean_y_pred, color=\"tab:blue\", alpha=0.4, label=\"Gaussian process\")\nplt.fill_between(\n X_test.ravel(),\n mean_y_pred - std_y_pred,\n mean_y_pred + std_y_pred,\n color=\"tab:blue\",\n alpha=0.2,\n)\nplt.legend()\nplt.xlabel(\"Year\")\nplt.ylabel(\"Monthly average of CO$_2$ concentration (ppm)\")\n_ = plt.title(\n \"Monthly average of air samples measurements\\nfrom the Mauna Loa Observatory\"\n)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Our fitted model is capable to fit previous data properly and extrapolate to\nfuture year with confidence.\n\n## Interpretation of kernel hyperparameters\n\nNow, we can have a look at the hyperparameters of the kernel.\n\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": false
- },
- "outputs": [],
- "source": [
- "gaussian_process.kernel_"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Thus, most of the target signal, with the mean subtracted, is explained by a\nlong-term rising trend for ~45 ppm and a length-scale of ~52 years. The\nperiodic component has an amplitude of ~2.6ppm, a decay time of ~90 years and\na length-scale of ~1.5. The long decay time indicates that we have a\ncomponent very close to a seasonal periodicity. The correlated noise has an\namplitude of ~0.2 ppm with a length scale of ~0.12 years and a white-noise\ncontribution of ~0.04 ppm. Thus, the overall noise level is very small,\nindicating that the data can be very well explained by the model.\n\n"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.9.19"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 0
-}
\ No newline at end of file