Merge pull request #12 from OpenSTEF/Update_example_notebooks
Update example notebooks
JonitaRuiter authored Feb 6, 2024
2 parents f84bfd0 + a8851e4 commit cbf3648
Showing 20 changed files with 326 additions and 438 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -129,4 +129,6 @@ dmypy.json
.pyre/

# Trained models
*trained_models*
mlflow_artifacts/
mlflow_trained_models/
trained_models/
4 changes: 4 additions & 0 deletions .gitmodules
@@ -0,0 +1,4 @@
[submodule "openstef"]
path = openstef
url = https://github.com/OpenSTEF/openstef.git
branch = Update-pandas
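The openstef submodule added above is not fetched automatically when this commit is checked out, so it has to be initialised before the example notebooks can import the package. A minimal sketch of doing this from Python; the underlying git commands are standard, and running them via subprocess instead of a shell is only a convenience:

```python
import subprocess

# Fetch the openstef submodule recorded in .gitmodules
subprocess.run(["git", "submodule", "update", "--init"], check=True)

# Optionally pull the branch pinned in .gitmodules (Update-pandas)
subprocess.run(["git", "submodule", "update", "--remote", "openstef"], check=True)
```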
80 changes: 43 additions & 37 deletions examples/01. Train a model using high-level pipelines.ipynb
@@ -10,8 +10,10 @@
}
},
"source": [
"# Example to train a model\n",
"Using the openstf tasks"
"# Train a model\n",
"In this example notebook, a model is trained for a location with id '287'. The data for this location can be found in the 'data' folder. \n",
"First, the prediction job will be defined, which contains the properties of the training and prediction. For example the time horizon, machine learning model and location of the forecast are defined in the prediction job. \n",
"Thereafter, the model can be trained using the input data and prediction job by the ```train_model_pipeline()```. "
]
},
{
@@ -27,12 +29,21 @@
"outputs": [],
"source": [
"import pandas as pd\n",
"import IPython\n",
"from IPython.display import IFrame\n",
"from openstef.pipeline.train_model import train_model_pipeline\n",
"from openstef.pipeline.create_forecast import create_forecast_pipeline\n",
"from openstef.data_classes.prediction_job import PredictionJobDataClass"
]
},
{
"cell_type": "markdown",
"id": "3878fe89",
"metadata": {},
"source": [
"## Prepare for training\n",
"Before a model can be trained, the specifications and data need to be defined. The specification of the model are defined in the prediction job (pj), where for example the machine learning model, latitude, longtide and forecast horizon are specified. Furthermore, the data has to be retrieved from the csv file containing both load, weather and energy market data. "
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -45,28 +56,28 @@
},
"outputs": [],
"source": [
"# define properties of training/prediction. We call this a 'prediction_job'\n",
"# Define properties of training/prediction. We call this a 'prediction_job'\n",
"pj = dict(id=287,\n",
" model='xgb',\n",
" model='xgb', \n",
" quantiles=[10,30,50,70,90],\n",
" forecast_type=\"demand\",\n",
" lat=52.0,\n",
" lon=5.0,\n",
" horizon_minutes=47*60,\n",
" resolution_minutes=15,\n",
" name=\"Example\", \n",
" hyper_params={}, # Note, this should become optional\n",
" feature_names=None, # Note, this should become optional\n",
" hyper_params={}, \n",
" feature_names=None, \n",
" default_modelspecs=None,\n",
" save_train_forecasts=True,\n",
" )\n",
"pj=PredictionJobDataClass(**pj)\n",
"\n",
"# Load input data\n",
"input_data = pd.read_csv('data/get_model_input_pid_287.csv', index_col='index', parse_dates=True)\n",
"\n",
"# Split in training and forecasting data\n",
"train_data = input_data.iloc[:-200,:] # everything except last 200 rows (~ 48 hours)\n",
"to_forecast_data = input_data.iloc[:-200,:] # last 200 rows\n"
"# Split in training and forecasting data. Everything except the last 20 rows will be used for training\n",
"train_data = input_data.iloc[:-200,:] # everything except last 200 rows (~ 48 hours)"
]
},
{
@@ -81,31 +92,18 @@
},
"outputs": [],
"source": [
"train_data.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "09f197a0",
"metadata": {
"ExecuteTime": {
"end_time": "2022-02-09T16:32:18.774539Z",
"start_time": "2022-02-09T16:32:18.740534Z"
}
},
"outputs": [],
"source": [
"to_forecast_data.head()"
"# Print the train data. \n",
"# For every timestamp, bot the load as well as feature data is available. \n",
"display(train_data.head())"
]
},
{
"cell_type": "markdown",
"id": "e0d17724",
"metadata": {},
"source": [
"# Train a model\n",
"Train a model using the high-level pipeline. Store the model and reports on training proces in ./trained_models"
"## Train a model\n",
"Train the model by using the high-level pipeline ```train_model_pipeline```. Store the model and reports on training proces in ./mlflow_artifacts and ./mlflow_trained_models by setting artifact_folder and mlflow_tracking_uri to this respective path. "
]
},
{
@@ -121,12 +119,12 @@
},
"outputs": [],
"source": [
"train_model_pipeline(\n",
"train, val, test=train_model_pipeline(\n",
" pj,\n",
" train_data,\n",
" check_old_model_age=False,\n",
" mlflow_tracking_uri=\"./trained_models\",\n",
" artifact_folder=\"./trained_models\",\n",
" mlflow_tracking_uri=\"./mlflow_trained_models\",\n",
" artifact_folder=\"./mlflow_artifacts\",\n",
" )"
]
},
@@ -135,7 +133,7 @@
"id": "7209dca5",
"metadata": {},
"source": [
"You can find the trained model in ./trained_models, along with reports on the training process"
"Now, you can find the trained model in ./mlflow_trained_models, along with reports on the training process. Below the Predictor0.25 and Predictor47.0 plots are shown, as well as the weight plot. The predictor plots show the prediction of the train, test and validation data. The weight plot shows the importance and weight of every feature."
]
},
{
@@ -150,10 +148,18 @@
},
"outputs": [],
"source": [
"## Inspect local files\n",
"IPython.display.HTML(f\"<iframe src=./trained_models/{pj['id']}/Predictor0.25.html width=800 height=400></iframe>\"\n",
" f\"<iframe src=./trained_models/{pj['id']}/Predictor47.0.html width=800 height=400></iframe>\"\n",
" f\"<iframe src=./trained_models/{pj['id']}/weight_plot.html width=800 height=400></iframe>\")"
"# Inspect local files\n",
"display(IFrame('./mlflow_artifacts/{}/Predictor0.25.html'.format(pj['id']), width=900, height=400))\n",
"display(IFrame('./mlflow_artifacts/{}/Predictor47.0.html'.format(pj['id']), width=800, height=400))\n",
"display(IFrame('./mlflow_artifacts/{}/weight_plot.html'.format(pj['id']), width=800, height=400))\n",
"\n",
"\n",
"## Visual Studio Code has difficulties with displaying htmls. If you are working with VSC and are not able to inspect the plots, uncomment the code below\n",
"## to open the plots in your browser.\n",
"# import webbrowser\n",
"# webbrowser.open(r'.\\mlflow_artifacts\\{}\\Predictor0.25.html'.format(pj['id']))\n",
"# webbrowser.open(r'.\\mlflow_artifacts\\{}\\Predictor47.0.html'.format(pj['id']))\n",
"# webbrowser.open(r'.\\mlflow_artifacts\\{}\\weight_plot.html'.format(pj['id']))"
]
}
],
@@ -173,7 +179,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
"version": "3.10.13"
},
"toc": {
"base_numbering": 1,
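The updated notebook imports ``create_forecast_pipeline``, but the hunks above only cover training. A hedged sketch of the forecasting step that could follow, reusing the ``pj`` and ``input_data`` defined earlier; the 'load' column name and the keyword arguments are assumptions based on the imports and paths shown in this diff:

```python
from openstef.pipeline.create_forecast import create_forecast_pipeline

# Sketch only: pj and input_data are assumed to be defined as in the cells above.
# Keep the history (needed for lag features) and blank out the load for the
# rows that should be forecast, here the last 200 rows (~48 hours).
to_forecast_data = input_data.copy()
to_forecast_data.iloc[-200:, to_forecast_data.columns.get_loc("load")] = None

forecast = create_forecast_pipeline(
    pj,
    to_forecast_data,
    mlflow_tracking_uri="./mlflow_trained_models",  # same URI used for training
)
forecast.head()
```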
@@ -10,7 +10,8 @@
}
},
"source": [
"# Evaluate Performance of model using Backtest Pipeline"
"# Evaluate the performance of model using Backtest Pipeline\n",
"In this second example notebook, the performance of the model is analysed using ``train_model_and_forecast_back_test``. First, the prediction job is defined, where the properties of the training and forecasting are specified. Thereafter, the backtest is performed using the prediction job and input data (can be found in the 'data' folder). Thereafter, the results are analysed. "
]
},
{
@@ -37,6 +38,15 @@
"pio.renderers.default = \"plotly_mimetype+notebook\""
]
},
{
"cell_type": "markdown",
"id": "b443e6ea",
"metadata": {},
"source": [
"## Prepare for training & backtest\n",
"Before a model can be trained, the specifications and data need to be defined. The specification of the model are defined in the prediction job (pj), where for example the machine learning model, latitude, longtide and forecast horizon are specified. Furthermore, the data has to be retrieved from the csv file containing both load, weather and energy market data. "
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -49,20 +59,20 @@
},
"outputs": [],
"source": [
"# define properties of training/prediction. We call this a 'prediction_job' \n",
"# Define properties of training/prediction. We call this a 'prediction_job' \n",
"pj=PredictionJobDataClass(id=287,\n",
" model='xgb',\n",
" quantiles=[0.10,0.30,0.50,0.70,0.90],\n",
" horizon_minutes=48*60,\n",
" resolution_minutes=15,\n",
" lat = 1, #should become optional\n",
" lon = 1, #should become optional\n",
" lat = 1, \n",
" lon = 1, \n",
" train_components=False,\n",
" name='TestPrediction',\n",
" model_type_group=None, # Note, this should become optional\n",
" hyper_params={}, # Note, this should become optional\n",
" feature_names=None, # Note, this should become optional\n",
" forecast_type=\"demand\", # Note, this should become optional\n",
" model_type_group=None, \n",
" hyper_params={}, \n",
" feature_names=None, \n",
" forecast_type=\"demand\", \n",
" )\n",
"\n",
"modelspecs = ModelSpecificationDataClass(id=pj['id'])\n",
@@ -71,6 +81,17 @@
"input_data = pd.read_csv('data/get_model_input_pid_287.csv', index_col='index', parse_dates=True)\n"
]
},
{
"cell_type": "markdown",
"id": "b1483ccb",
"metadata": {},
"source": [
"## Perform backtest\n",
"Below, the backtest is performed using ``train_model_and_forecast_back_test``, which not only outputs the forecast but also the model, train data, validation data and test data. The availability of both the forecast and realised values, enables you to evaluate the results of the model.\n",
"\n",
"One of the variables in the ``train_model_and_forecast_back_test`` are the ``training_horizons``. This entails how far into the future, the model has to predict. Thus, a value of 0.25 means predicting 15 minutes into the future, where as 47.0 entails predicting 47 hours ahead."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -85,13 +106,15 @@
"source": [
"# Perform the backtest\n",
"n_folds = 2\n",
"\n",
"forecast, model, train_data, validation_data, test_data = train_model_and_forecast_back_test(\n",
" pj,\n",
" modelspecs = modelspecs,\n",
" input_data = input_data,\n",
" training_horizons=[0.25, 47.0],\n",
" n_folds=n_folds,\n",
" )\n",
"\n",
"# If n_folds>1, model is a list of models. In that case, only use the first model\n",
"if n_folds>1:\n",
" model=model[0]"
Expand All @@ -107,7 +130,8 @@
}
},
"source": [
"# Evaluate results"
"## Evaluate results\n",
"Below, the results of the backtest will be evaluated by means of visualisation (plots) and metrics. "
]
},
{
@@ -123,7 +147,7 @@
"outputs": [],
"source": [
"for horizon in set(forecast.horizon):\n",
" fig = forecast.loc[forecast.horizon==0.25,['quantile_P10','quantile_P30',\n",
" fig = forecast.loc[forecast.horizon==horizon,['quantile_P10','quantile_P30',\n",
" 'quantile_P50','quantile_P70','quantile_P90','realised','forecast']].plot(\n",
" title=f\"Horizon: {horizon}\")\n",
" fig.update_traces(\n",
@@ -141,6 +165,14 @@
" fig.show()"
]
},
{
"cell_type": "markdown",
"id": "9e418f36",
"metadata": {},
"source": [
"Evaluate the error of the forecast by subtracting the realised values by the forecasted values. The visualisation can help to analyse the errors."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -155,6 +187,24 @@
"outputs": [],
"source": [
"forecast['err']=forecast['realised']-forecast['forecast']\n",
"forecast['err'].plot()"
]
},
{
"cell_type": "markdown",
"id": "e765f5e1",
"metadata": {},
"source": [
"The mean absolute error (mea) gives insight into the scale of the errors. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8e73f034",
"metadata": {},
"outputs": [],
"source": [
"mea = forecast.pivot_table(index='horizon', values=['err'], aggfunc=lambda x: x.abs().mean())\n",
"mea.index=mea.index.astype(str)\n",
"fig = mea.plot(kind='bar')\n",
@@ -163,6 +213,14 @@
" yaxis=dict(title='MAE [MW]')))"
]
},
{
"cell_type": "markdown",
"id": "e28744af",
"metadata": {},
"source": [
"Lastly, it is of interest too look into the importance of the features the model has used to make the forecast. The larger the block in this plot, the higher the importance of the feature for the forecast."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -177,14 +235,6 @@
"source": [
"plot_feature_importance(model.feature_importance_dataframe)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3d51f7f1",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
@@ -203,7 +253,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
"version": "3.10.13"
},
"toc": {
"base_numbering": 1,
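Beyond the MAE bar chart, the quantiles trained in these notebooks (P10 through P90) can be checked for calibration. A small sketch, assuming only the column naming visible in the backtest output above (``quantile_P10`` through ``quantile_P90`` and ``realised``); for a well-calibrated model roughly 10% of the realised values should fall at or below ``quantile_P10``, 30% at or below ``quantile_P30``, and so on:

```python
import pandas as pd

def empirical_coverage(forecast: pd.DataFrame) -> pd.Series:
    """Fraction of realised values at or below each predicted quantile."""
    quantile_cols = [c for c in forecast.columns if c.startswith("quantile_P")]
    return pd.Series(
        {c: (forecast["realised"] <= forecast[c]).mean() for c in quantile_cols}
    )

# Usage with the backtest output from the notebook above:
# empirical_coverage(forecast)  # quantile_P10 should be close to 0.10, etc.
```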