Merge pull request #12 from OpenSTEF/Update_example_notebooks
Update example notebooks
JonitaRuiter authored Feb 6, 2024
2 parents f84bfd0 + a8851e4 commit cbf3648
Showing 20 changed files with 326 additions and 438 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -129,4 +129,6 @@ dmypy.json
.pyre/

# Trained models
*trained_models*
mlflow_artifacts/
mlflow_trained_models/
trained_models/
4 changes: 4 additions & 0 deletions .gitmodules
@@ -0,0 +1,4 @@
[submodule "openstef"]
path = openstef
url = https://github.com/OpenSTEF/openstef.git
branch = Update-pandas
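The openstef submodule added above is not fetched automatically when this commit is checked out, so it has to be initialised before the example notebooks can import the package. A minimal sketch of doing this from Python; the underlying git commands are standard, and running them via subprocess instead of a shell is only a convenience:

```python
import subprocess

# Fetch the openstef submodule recorded in .gitmodules
subprocess.run(["git", "submodule", "update", "--init"], check=True)

# Optionally pull the branch pinned in .gitmodules (Update-pandas)
subprocess.run(["git", "submodule", "update", "--remote", "openstef"], check=True)
```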
80 changes: 43 additions & 37 deletions examples/01. Train a model using high-level pipelines.ipynb
@@ -10,8 +10,10 @@
}
},
"source": [
"# Example to train a model\n",
"Using the openstf tasks"
"# Train a model\n",
"In this example notebook, a model is trained for a location with id '287'. The data for this location can be found in the 'data' folder. \n",
"First, the prediction job will be defined, which contains the properties of the training and prediction. For example the time horizon, machine learning model and location of the forecast are defined in the prediction job. \n",
"Thereafter, the model can be trained using the input data and prediction job by the ```train_model_pipeline()```. "
]
},
{
@@ -27,12 +29,21 @@
"outputs": [],
"source": [
"import pandas as pd\n",
"import IPython\n",
"from IPython.display import IFrame\n",
"from openstef.pipeline.train_model import train_model_pipeline\n",
"from openstef.pipeline.create_forecast import create_forecast_pipeline\n",
"from openstef.data_classes.prediction_job import PredictionJobDataClass"
]
},
{
"cell_type": "markdown",
"id": "3878fe89",
"metadata": {},
"source": [
"## Prepare for training\n",
"Before a model can be trained, the specifications and data need to be defined. The specification of the model are defined in the prediction job (pj), where for example the machine learning model, latitude, longtide and forecast horizon are specified. Furthermore, the data has to be retrieved from the csv file containing both load, weather and energy market data. "
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -45,28 +56,28 @@
},
"outputs": [],
"source": [
"# define properties of training/prediction. We call this a 'prediction_job'\n",
"# Define properties of training/prediction. We call this a 'prediction_job'\n",
"pj = dict(id=287,\n",
" model='xgb',\n",
" model='xgb', \n",
" quantiles=[10,30,50,70,90],\n",
" forecast_type=\"demand\",\n",
" lat=52.0,\n",
" lon=5.0,\n",
" horizon_minutes=47*60,\n",
" resolution_minutes=15,\n",
" name=\"Example\", \n",
" hyper_params={}, # Note, this should become optional\n",
" feature_names=None, # Note, this should become optional\n",
" hyper_params={}, \n",
" feature_names=None, \n",
" default_modelspecs=None,\n",
" save_train_forecasts=True,\n",
" )\n",
"pj=PredictionJobDataClass(**pj)\n",
"\n",
"# Load input data\n",
"input_data = pd.read_csv('data/get_model_input_pid_287.csv', index_col='index', parse_dates=True)\n",
"\n",
"# Split in training and forecasting data\n",
"train_data = input_data.iloc[:-200,:] # everything except last 200 rows (~ 48 hours)\n",
"to_forecast_data = input_data.iloc[:-200,:] # last 200 rows\n"
"# Split in training and forecasting data. Everything except the last 20 rows will be used for training\n",
"train_data = input_data.iloc[:-200,:] # everything except last 200 rows (~ 48 hours)"
]
},
{
@@ -81,31 +92,18 @@
},
"outputs": [],
"source": [
"train_data.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "09f197a0",
"metadata": {
"ExecuteTime": {
"end_time": "2022-02-09T16:32:18.774539Z",
"start_time": "2022-02-09T16:32:18.740534Z"
}
},
"outputs": [],
"source": [
"to_forecast_data.head()"
"# Print the train data. \n",
"# For every timestamp, bot the load as well as feature data is available. \n",
"display(train_data.head())"
]
},
{
"cell_type": "markdown",
"id": "e0d17724",
"metadata": {},
"source": [
"# Train a model\n",
"Train a model using the high-level pipeline. Store the model and reports on training proces in ./trained_models"
"## Train a model\n",
"Train the model by using the high-level pipeline ```train_model_pipeline```. Store the model and reports on training proces in ./mlflow_artifacts and ./mlflow_trained_models by setting artifact_folder and mlflow_tracking_uri to this respective path. "
]
},
{
@@ -121,12 +119,12 @@
},
"outputs": [],
"source": [
"train_model_pipeline(\n",
"train, val, test=train_model_pipeline(\n",
" pj,\n",
" train_data,\n",
" check_old_model_age=False,\n",
" mlflow_tracking_uri=\"./trained_models\",\n",
" artifact_folder=\"./trained_models\",\n",
" mlflow_tracking_uri=\"./mlflow_trained_models\",\n",
" artifact_folder=\"./mlflow_artifacts\",\n",
" )"
]
},
@@ -135,7 +133,7 @@
"id": "7209dca5",
"metadata": {},
"source": [
"You can find the trained model in ./trained_models, along with reports on the training process"
"Now, you can find the trained model in ./mlflow_trained_models, along with reports on the training process. Below the Predictor0.25 and Predictor47.0 plots are shown, as well as the weight plot. The predictor plots show the prediction of the train, test and validation data. The weight plot shows the importance and weight of every feature."
]
},
{
@@ -150,10 +148,18 @@
},
"outputs": [],
"source": [
"## Inspect local files\n",
"IPython.display.HTML(f\"<iframe src=./trained_models/{pj['id']}/Predictor0.25.html width=800 height=400></iframe>\"\n",
" f\"<iframe src=./trained_models/{pj['id']}/Predictor47.0.html width=800 height=400></iframe>\"\n",
" f\"<iframe src=./trained_models/{pj['id']}/weight_plot.html width=800 height=400></iframe>\")"
"# Inspect local files\n",
"display(IFrame('./mlflow_artifacts/{}/Predictor0.25.html'.format(pj['id']), width=900, height=400))\n",
"display(IFrame('./mlflow_artifacts/{}/Predictor47.0.html'.format(pj['id']), width=800, height=400))\n",
"display(IFrame('./mlflow_artifacts/{}/weight_plot.html'.format(pj['id']), width=800, height=400))\n",
"\n",
"\n",
"## Visual Studio Code has difficulties with displaying htmls. If you are working with VSC and are not able to inspect the plots, uncomment the code below\n",
"## to open the plots in your browser.\n",
"# import webbrowser\n",
"# webbrowser.open(r'.\\mlflow_artifacts\\{}\\Predictor0.25.html'.format(pj['id']))\n",
"# webbrowser.open(r'.\\mlflow_artifacts\\{}\\Predictor47.0.html'.format(pj['id']))\n",
"# webbrowser.open(r'.\\mlflow_artifacts\\{}\\weight_plot.html'.format(pj['id']))"
]
}
],
@@ -173,7 +179,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
"version": "3.10.13"
},
"toc": {
"base_numbering": 1,
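The updated notebook imports ``create_forecast_pipeline``, but the hunks above only cover training. A hedged sketch of the forecasting step that could follow, reusing the ``pj`` and ``input_data`` defined earlier; the 'load' column name and the keyword arguments are assumptions based on the imports and paths shown in this diff:

```python
from openstef.pipeline.create_forecast import create_forecast_pipeline

# Sketch only: pj and input_data are assumed to be defined as in the cells above.
# Keep the history (needed for lag features) and blank out the load for the
# rows that should be forecast, here the last 200 rows (~48 hours).
to_forecast_data = input_data.copy()
to_forecast_data.iloc[-200:, to_forecast_data.columns.get_loc("load")] = None

forecast = create_forecast_pipeline(
    pj,
    to_forecast_data,
    mlflow_tracking_uri="./mlflow_trained_models",  # same URI used for training
)
forecast.head()
```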
@@ -10,7 +10,8 @@
}
},
"source": [
"# Evaluate Performance of model using Backtest Pipeline"
"# Evaluate the performance of model using Backtest Pipeline\n",
"In this second example notebook, the performance of the model is analysed using ``train_model_and_forecast_back_test``. First, the prediction job is defined, where the properties of the training and forecasting are specified. Thereafter, the backtest is performed using the prediction job and input data (can be found in the 'data' folder). Thereafter, the results are analysed. "
]
},
{
@@ -37,6 +38,15 @@
"pio.renderers.default = \"plotly_mimetype+notebook\""
]
},
{
"cell_type": "markdown",
"id": "b443e6ea",
"metadata": {},
"source": [
"## Prepare for training & backtest\n",
"Before a model can be trained, the specifications and data need to be defined. The specification of the model are defined in the prediction job (pj), where for example the machine learning model, latitude, longtide and forecast horizon are specified. Furthermore, the data has to be retrieved from the csv file containing both load, weather and energy market data. "
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -49,20 +59,20 @@
},
"outputs": [],
"source": [
"# define properties of training/prediction. We call this a 'prediction_job' \n",
"# Define properties of training/prediction. We call this a 'prediction_job' \n",
"pj=PredictionJobDataClass(id=287,\n",
" model='xgb',\n",
" quantiles=[0.10,0.30,0.50,0.70,0.90],\n",
" horizon_minutes=48*60,\n",
" resolution_minutes=15,\n",
" lat = 1, #should become optional\n",
" lon = 1, #should become optional\n",
" lat = 1, \n",
" lon = 1, \n",
" train_components=False,\n",
" name='TestPrediction',\n",
" model_type_group=None, # Note, this should become optional\n",
" hyper_params={}, # Note, this should become optional\n",
" feature_names=None, # Note, this should become optional\n",
" forecast_type=\"demand\", # Note, this should become optional\n",
" model_type_group=None, \n",
" hyper_params={}, \n",
" feature_names=None, \n",
" forecast_type=\"demand\", \n",
" )\n",
"\n",
"modelspecs = ModelSpecificationDataClass(id=pj['id'])\n",
@@ -71,6 +81,17 @@
"input_data = pd.read_csv('data/get_model_input_pid_287.csv', index_col='index', parse_dates=True)\n"
]
},
{
"cell_type": "markdown",
"id": "b1483ccb",
"metadata": {},
"source": [
"## Perform backtest\n",
"Below, the backtest is performed using ``train_model_and_forecast_back_test``, which not only outputs the forecast but also the model, train data, validation data and test data. The availability of both the forecast and realised values, enables you to evaluate the results of the model.\n",
"\n",
"One of the variables in the ``train_model_and_forecast_back_test`` are the ``training_horizons``. This entails how far into the future, the model has to predict. Thus, a value of 0.25 means predicting 15 minutes into the future, where as 47.0 entails predicting 47 hours ahead."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -85,13 +106,15 @@
"source": [
"# Perform the backtest\n",
"n_folds = 2\n",
"\n",
"forecast, model, train_data, validation_data, test_data = train_model_and_forecast_back_test(\n",
" pj,\n",
" modelspecs = modelspecs,\n",
" input_data = input_data,\n",
" training_horizons=[0.25, 47.0],\n",
" n_folds=n_folds,\n",
" )\n",
"\n",
"# If n_folds>1, model is a list of models. In that case, only use the first model\n",
"if n_folds>1:\n",
" model=model[0]"
Expand All @@ -107,7 +130,8 @@
}
},
"source": [
"# Evaluate results"
"## Evaluate results\n",
"Below, the results of the backtest will be evaluated by means of visualisation (plots) and metrics. "
]
},
{
@@ -123,7 +147,7 @@
"outputs": [],
"source": [
"for horizon in set(forecast.horizon):\n",
" fig = forecast.loc[forecast.horizon==0.25,['quantile_P10','quantile_P30',\n",
" fig = forecast.loc[forecast.horizon==horizon,['quantile_P10','quantile_P30',\n",
" 'quantile_P50','quantile_P70','quantile_P90','realised','forecast']].plot(\n",
" title=f\"Horizon: {horizon}\")\n",
" fig.update_traces(\n",
@@ -141,6 +165,14 @@
" fig.show()"
]
},
{
"cell_type": "markdown",
"id": "9e418f36",
"metadata": {},
"source": [
"Evaluate the error of the forecast by subtracting the realised values by the forecasted values. The visualisation can help to analyse the errors."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -155,6 +187,24 @@
"outputs": [],
"source": [
"forecast['err']=forecast['realised']-forecast['forecast']\n",
"forecast['err'].plot()"
]
},
{
"cell_type": "markdown",
"id": "e765f5e1",
"metadata": {},
"source": [
"The mean absolute error (mea) gives insight into the scale of the errors. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8e73f034",
"metadata": {},
"outputs": [],
"source": [
"mea = forecast.pivot_table(index='horizon', values=['err'], aggfunc=lambda x: x.abs().mean())\n",
"mea.index=mea.index.astype(str)\n",
"fig = mea.plot(kind='bar')\n",
@@ -163,6 +213,14 @@
" yaxis=dict(title='MAE [MW]')))"
]
},
{
"cell_type": "markdown",
"id": "e28744af",
"metadata": {},
"source": [
"Lastly, it is of interest too look into the importance of the features the model has used to make the forecast. The larger the block in this plot, the higher the importance of the feature for the forecast."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -177,14 +235,6 @@
"source": [
"plot_feature_importance(model.feature_importance_dataframe)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3d51f7f1",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
@@ -203,7 +253,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
"version": "3.10.13"
},
"toc": {
"base_numbering": 1,
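Beyond the MAE bar chart, the quantiles trained in these notebooks (P10 through P90) can be checked for calibration. A small sketch, assuming only the column naming visible in the backtest output above (``quantile_P10`` through ``quantile_P90`` and ``realised``); for a well-calibrated model roughly 10% of the realised values should fall at or below ``quantile_P10``, 30% at or below ``quantile_P30``, and so on:

```python
import pandas as pd

def empirical_coverage(forecast: pd.DataFrame) -> pd.Series:
    """Fraction of realised values at or below each predicted quantile."""
    quantile_cols = [c for c in forecast.columns if c.startswith("quantile_P")]
    return pd.Series(
        {c: (forecast["realised"] <= forecast[c]).mean() for c in quantile_cols}
    )

# Usage with the backtest output from the notebook above:
# empirical_coverage(forecast)  # quantile_P10 should be close to 0.10, etc.
```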