updates on the example notebooks

Signed-off-by: Jonita Ruiter <[email protected]>
OpenSTEF · Feb 5, 2024 · 3841e81
1 parent 9389e66
commit 3841e81
Showing 11 changed files with 336 additions and 620,342 deletions.
diff --git a/.gitmodules b/.gitmodules
@@ -0,0 +1,4 @@
+[submodule "openstef"]
+	path = openstef
+	url = https://github.com/OpenSTEF/openstef.git
+	branch = Update-pandas
diff --git a/examples/01. Train a model using high-level pipelines.ipynb b/examples/01. Train a model using high-level pipelines.ipynb
@@ -94,7 +94,7 @@
    "source": [
     "# Print the train data. \n",
     "# For every timestamp, bot the load as well as feature data is available. \n",
-    "train_data.head()"
+    "display(train_data.head())"
    ]
   },
   {
@@ -149,9 +149,9 @@
    "outputs": [],
    "source": [
     "# Inspect local files\n",
-    "IFrame('./mlflow_artifacts/{}/Predictor0.25.html'.format(pj['id']), width=900, height=400)\n",
-    "IFrame('./mlflow_artifacts/{}/Predictor47.0.html'.format(pj['id']), width=800, height=400)\n",
-    "IFrame('./mlflow_artifacts/{}/weight_plot.html'.format(pj['id']), width=800, height=400)\n",
+    "display(IFrame('./mlflow_artifacts/{}/Predictor0.25.html'.format(pj['id']), width=900, height=400))\n",
+    "display(IFrame('./mlflow_artifacts/{}/Predictor47.0.html'.format(pj['id']), width=800, height=400))\n",
+    "display(IFrame('./mlflow_artifacts/{}/weight_plot.html'.format(pj['id']), width=800, height=400))\n",
     "\n",
     "\n",
     "## Visual Studio Code has difficulties with displaying htmls. If you are working with VSC and are not able to inspect the plots, uncomment the code below\n",

diff --git a/...performance using Backtest Pipeline.ipynb → ... Train a model and perform backtest.ipynb b/...performance using Backtest Pipeline.ipynb → ... Train a model and perform backtest.ipynb
@@ -87,7 +87,9 @@
    "metadata": {},
    "source": [
     "## Perform backtest\n",
-    "Below, the backtest is performed using ``train_model_and_forecast_back_test``, which not only outputs the forecast but also the model, train data, validation data and test data. The availability of both the forecast and realised values, enables you to evaluate the results of the model."
+    "Below, the backtest is performed using ``train_model_and_forecast_back_test``, which outputs not only the forecast but also the model, train data, validation data and test data. Having both the forecast and the realised values available enables you to evaluate the results of the model.\n",
+    "\n",
+    "One of the arguments of ``train_model_and_forecast_back_test`` is ``training_horizons``. It specifies how far into the future the model has to predict, in hours. Thus, a value of 0.25 means predicting 15 minutes ahead, whereas 47.0 means predicting 47 hours ahead."
    ]
   },
   {
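
The horizon values described in the added markdown cell are expressed in hours. A minimal sketch of that arithmetic, using only the standard library: `horizon_to_timedelta` is a hypothetical helper written for this illustration and is not part of OpenSTEF, and the timestamp is an arbitrary example.

```python
from datetime import datetime, timedelta

# Horizons as named in the notebook text, expressed in hours.
training_horizons = [0.25, 47.0]


def horizon_to_timedelta(horizon_hours: float) -> timedelta:
    """Convert a forecast horizon given in hours to a timedelta."""
    return timedelta(hours=horizon_hours)


# Arbitrary example moment at which a forecast is made.
forecast_made_at = datetime(2024, 2, 5, 12, 0)

for horizon in training_horizons:
    lead_time = horizon_to_timedelta(horizon)
    target_time = forecast_made_at + lead_time
    print(f"horizon {horizon} h: lead time {lead_time}, target time {target_time}")
```
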

diff --git a/examples/04. Test_on_difficult_cases.ipynb → ...03. Train a model and make forecast.ipynb b/examples/04. Test_on_difficult_cases.ipynb → ...03. Train a model and make forecast.ipynb
@@ -101,7 +101,7 @@
    },
    "outputs": [],
    "source": [
-    "train_data.head()"
+    "display(train_data.head())"
    ]
   },
   {

diff --git a/examples/04. Split net load into components using DAZLS.ipynb b/examples/04. Split net load into components using DAZLS.ipynb
@@ -0,0 +1,299 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "64765210",
+   "metadata": {},
+   "source": [
+    "# DAZLS model\n",
+    "\n",
+    "DAZLS is the energy splitting method in OpenSTEF. The model splits a forecast into wind (onshore), solar and other components.\n",
+    "\n",
+    "A single splitting model is trained, which can be used for every prediction job. As input it uses data from multiple substations with known components, and it outputs a prediction of solar and wind power for unknown target substations.\n",
+    "\n",
+    "The model consists of two steps, which are applied in sequence:\n",
+    "1. Domain model (any data-driven model can be used);\n",
+    "2. Adaptation model (any data-driven model can be used).\n",
+    "\n",
+    "#### For reference, see: [dazls.rst](https://github.com/OpenSTEF/openstef/tree/main/docs/dazls.rst)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "69f0a54a",
+   "metadata": {},
+   "source": [
+    "## This notebook contains:\n",
+    "\n",
+    "1. Preprocessing of the data which will be used for the model;\n",
+    "\n",
+    "The raw data in the \"combined_data\" folder is preprocessed and saved, together with the derived metadata, in the \"prep_data\" folder. The data in \"prep_data\" is then used to run the DAZLS model.\n",
+    "\n",
+    "2. Train DAZLS and generate a prediction for one out-of-sample CSV file;\n",
+    "\n",
+    "\n",
+    "3. Load and store the model.\n",
+    "\n",
+    "The resulting dazls_stored.sav file is used in OpenSTEF for the component forecast ([create_component_forecast](https://github.com/OpenSTEF/openstef/blob/main/openstef/pipeline/create_component_forecast.py))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "056a49f8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error\n",
+    "import glob\n",
+    "import pandas as pd\n",
+    "import os\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "from sklearn.preprocessing import MinMaxScaler\n",
+    "from sklearn.neighbors import KNeighborsRegressor\n",
+    "import random \n",
+    "from sklearn.utils import shuffle\n",
+    "import joblib\n",
+    "\n",
+    "from openstef.model.regressors.dazls import Dazls\n",
+    "\n",
+    "pd.options.plotting.backend = 'plotly'\n",
+    "\n",
+    "#Seed and preparation\n",
+    "random.seed(999)\n",
+    "np.random.seed(999)\n",
+    "\n",
+    "# Containers for the loaded data frames and the station names\n",
+    "combined_data=[]\n",
+    "station_name=[]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b7b4284a",
+   "metadata": {},
+   "source": [
+    "## Preprocess the data\n",
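+    "Compute the metadata for each station and save the result in the \"prep_data\" folder. The folder must already exist, because ``to_csv`` does not create it."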
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cb4637ed",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#Read, create metadata and save in prep_data folder\n",
+    "for file_name in glob.glob('combined_data/*.csv'):\n",
+    "    \n",
+    "    #Read and fill missing values\n",
+    "    x = pd.read_csv(file_name, low_memory=False,parse_dates=[\"datetime\"],index_col=0)\n",
+    "    x[\"datetime\"]=pd.to_datetime(x[\"datetime\"])\n",
+    "    x=x.set_index('datetime')\n",
+    "    x.replace([np.inf, -np.inf], np.nan, inplace=True)\n",
+    "    x=x.ffill()\n",
+    "    # x=x.interpolate(method='ffill')\n",
+    "\n",
+    "    ## Get variance metadata ####\n",
+    "    var=x.iloc[:,:3].var()\n",
+    "    for i in range(3):\n",
+    "        x.loc[:, 'var'+str(i)] = var.iloc[i]\n",
+    "    ### end get variance ####\n",
+    "\n",
+    "    ## Get sem metadata ####\n",
+    "    sem=x.iloc[:,:3].sem()\n",
+    "    for i in range(3):\n",
+    "        x.loc[:, 'sem'+str(i)] = sem.iloc[i]\n",
+    "    ### end get sem ####\n",
+    "\n",
+    "\n",
+    "    ## Get min-max capacity physical metadata ####\n",
+    "    mini=x.iloc[:,3:5].min()\n",
+    "    maxi=x.iloc[:,3:5].max()\n",
+    "    for i in range(2):\n",
+    "        x.loc[:, 'min'+str(i)] = mini.iloc[i]\n",
+    "    for i in range(2):\n",
+    "        x.loc[:, 'max'+str(i)] = maxi.iloc[i]    \n",
+    "    ### end get min-max ####    \n",
+    "    \n",
+    "    combined_data.append(x)\n",
+    "    sn=os.path.basename(file_name)\n",
+    "    station_name.append(sn[:len(sn)-4])\n",
+    "    x.to_csv(\"prep_data/\"+sn, index=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2d045c67",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "combined_data = []\n",
+    "station_name = []\n",
+    "\n",
+    "# Read prepared data\n",
+    "for file_name in glob.glob('prep_data/*.csv'):\n",
+    "    x = pd.read_csv(file_name, low_memory=False, parse_dates=[\"datetime\"])\n",
+    "    x[\"datetime\"] = pd.to_datetime(x[\"datetime\"])\n",
+    "    x = x.set_index('datetime')\n",
+    "    x.columns = [col.lower() for col in x.columns]\n",
+    "    combined_data.append(x)\n",
+    "    sn = os.path.basename(file_name)\n",
+    "    station_name.append(sn[:len(sn) - 4])\n",
+    "\n",
+    "\n",
+    "# Split data in train and test (the first substation is being used for the testing)\n",
+    "training_data = pd.concat(combined_data[1:])\n",
+    "test_data = combined_data[0]\n",
+    "target_columns =['total_solar_part', 'total_wind_part']\n",
+    "feature_columns = [x for x in test_data.columns if x not in target_columns]\n",
+    "print('Testing station:',station_name[0])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8fca6b90",
+   "metadata": {},
+   "source": [
+    "## Initialize DAZLS model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "375a8b47",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model = Dazls()\n",
+    "# Fit model\n",
+    "model.fit(training_data.loc[:,feature_columns], training_data.loc[:,target_columns])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "75e033fc",
+   "metadata": {},
+   "source": [
+    "## Generate a prediction\n",
+    "In this section, a prediction is generated using the DAZLS model."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "66e7dcdc",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# get predicted y\n",
+    "y = model.predict(test_data.loc[:,feature_columns])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d53da56c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Examine the results, which are stored in the variable y. The other variables are added to these results by copying the test_data.\n",
+    "result = test_data.loc[:,target_columns].copy()\n",
+    "result['DAZLS_wind_split'] = y[:,0]\n",
+    "result['DAZLS_solar_split'] = y[:,1]\n",
+    "result.iloc[60:]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "53c59b68",
+   "metadata": {},
+   "source": [
+    "## Evaluate the results\n",
+    "Examine the results by looking at the built-in metrics and visualisation."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6bb8ce3a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# print prediction performance\n",
+    "RMSE, R2=model.score(test_data.loc[:,target_columns], y)\n",
+    "\n",
+    "print(\"The root-mean-square error (RMSE) is {} \".format(RMSE) + \"and the R2 score is {}\".format(R2))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0cc8e7e0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Visualize the results by plotting the results of the DAZLS model.\n",
+    "\n",
+    "fig_result=result.plot()\n",
+    "fig_result.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "eacf2558",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Compare the wind split results to the actual value\n",
+    "fig_compare_result_wind=pd.concat([result['DAZLS_wind_split'], test_data['total_wind_part']], axis=1).plot()\n",
+    "fig_compare_result_wind.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ba34fcc9",
+   "metadata": {},
+   "source": [
+    "## Store the model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4ac275ad",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Store the dazls file\n",
+    "filename = 'dazls_stored.sav'\n",
+    "joblib.dump(model, filename)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
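
The new notebook announces "Load and store the model", but its last cell only stores it with `joblib.dump`. A minimal sketch of the full round trip follows. It assumes a plain dictionary as a stand-in for the fitted `Dazls` object and writes to a temporary directory; both are choices made for this illustration, not taken from the notebook.

```python
import os
import tempfile

import joblib

# Stand-in for the fitted model (assumption): any picklable object
# can be stored and restored the same way.
model = {"name": "dazls-placeholder", "n_features": 3}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Same file name as in the notebook, placed in a temporary directory.
    filename = os.path.join(tmp_dir, "dazls_stored.sav")

    # Store the object on disk.
    joblib.dump(model, filename)

    # Load it back into a new variable.
    loaded_model = joblib.load(filename)

print(loaded_model == model)
```

joblib files are pickle-based, so loading a real stored model would require the class it was created from to be importable, and only files from a trusted source should be loaded.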
diff --git a/examples/07. Obtain derived features.ipynb → examples/05. Obtain derived features.ipynb b/examples/07. Obtain derived features.ipynb → examples/05. Obtain derived features.ipynb