AutoML: Improve notebooks for easier usage on Google Colab #174

Merged: 5 commits, Dec 4, 2023
30 changes: 25 additions & 5 deletions topic/machine-learning/automl/README.md
and [CrateDB].
performing model. The notebook also shows how to use CrateDB as storage for
both the raw data and the experiment tracking and model registry data.

- Accompanying the Jupyter Notebook files, there are also basic standalone
  program variants of the above examples.
- [automl_timeseries_forecasting_with_pycaret.py](automl_timeseries_forecasting_with_pycaret.py),
- [automl_classification_with_pycaret.py](automl_classification_with_pycaret.py).
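Both the notebooks and the standalone programs configure database connectivity the same way. A minimal sketch of that pattern follows; the `CRATEDB_CONNECTION_STRING` variable and the `testdrive`/`mlflow` schema names follow this PR's changes, while the localhost default shown here is an assumption to adjust for your environment:

```python
import os

# Base connection string; the local default is an assumption -- substitute
# your CrateDB Cloud credentials where applicable.
CONNECTION_STRING = os.environ.get(
    "CRATEDB_CONNECTION_STRING",
    "crate://crate@localhost/?ssl=false",
)

# Derive one URI per schema: raw data for SQLAlchemy, records for MLflow.
DBURI_DATA = f"{CONNECTION_STRING}&schema=testdrive"
DBURI_MLFLOW = f"{CONNECTION_STRING}&schema=mlflow"

# MLflow picks up its tracking backend from this environment variable.
os.environ["MLFLOW_TRACKING_URI"] = DBURI_MLFLOW

print(DBURI_DATA)
print(DBURI_MLFLOW)
```

Keeping data and experiment records in separate schemas lets one CrateDB cluster serve both roles without the tables interleaving.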


## Software Tests

The resources are validated by corresponding software tests on CI. You can
also run them on your workstation. For example, to invoke the test cases
validating the notebook about classification with PyCaret, run:

```shell
pytest -k automl_classification_with_pycaret.ipynb
```

Alternatively, you can validate all resources in this folder by invoking a
test runner program on the top-level folder of this repository. This is the
same code path the CI jobs take.

```shell
pip install -r requirements.txt
ngr test topic/machine-learning/automl
```
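One detail the tests rely on: the notebook trims expensive work when running under pytest by checking the `PYTEST_CURRENT_TEST` environment variable, which pytest sets for the duration of each test. A sketch of that gating pattern, with the argument values taken from the notebook diff in this PR (wrapping them in a helper function is purely illustrative):

```python
import os

def model_selection_args():
    """Choose compare_models() arguments, reduced when running under pytest.

    pytest sets PYTEST_CURRENT_TEST while a test runs, so its presence is a
    convenient switch for cutting runtime on CI.
    """
    if "PYTEST_CURRENT_TEST" in os.environ:
        # Fast CI path: compare only two inexpensive estimators.
        return {"sort": "AUC", "include": ["lr", "knn"], "n_select": 3}
    # Full path: compare all available models except "lightgbm".
    return {"sort": "AUC", "exclude": ["lightgbm"], "n_select": 3}
```

The same technique applies to any notebook step that is too slow for CI but should run in full interactively.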


[PyCaret]: https://github.com/pycaret/pycaret
[CrateDB]: https://github.com/crate/crate
[Introduction to hyperparameter tuning]: https://medium.com/analytics-vidhya/comparison-of-hyperparameter-tuning-algorithms-grid-search-random-search-bayesian-optimization-5326aaef1bd1
106 changes: 75 additions & 31 deletions topic/machine-learning/automl/automl_classification_with_pycaret.ipynb
"source": [
"## Getting started\n",
"\n",
"First, install the required dependencies. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"#!pip install -r requirements.txt\n",
"\n",
"# In an environment like Google Colab, please use the absolute URL to the requirements.txt file.\n",
"# Note: Some dependency conflicts may be reported; they can usually be ignored.\n",
"# Restart the runtime, if asked by Colab.\n",
"#!pip install -r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/automl/requirements.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:** As of the time of this writing, PyCaret requires Python 3.8, 3.9, or 3.10.\n",
"\n",
"Second, you will need a CrateDB instance to store and serve the data. The easiest\n",
"create an `.env` file with the following content:\n",
"create an `.env` file with the following content:\n",
"\n",
"```env\n",
"# use this string for a connection to CrateDB Cloud\n",
"CONNECTION_STRING=crate://username:password@hostname/?ssl=true \n",
"\n",
"# use this string for a local connection to CrateDB\n",
"# CONNECTION_STRING=crate://crate@localhost/?ssl=false\n",
"```\n",
"\n",
"You can find your CrateDB credentials in the [CrateDB Cloud Console].\n",
"\n",
"[CrateDB Cloud Console]: https://cratedb.com/docs/cloud/en/latest/reference/overview.html#cluster\n",
"[deploy a cluster]: https://cratedb.com/docs/cloud/en/latest/tutorials/deploy/stripe.html#deploy-cluster\n",
"[deploy a cluster]: https://cratedb.com/docs/cloud/en/latest/tutorials/deploy/stripe.html#deploy-cluster"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
    "import os\n",
    "\n",
    "# Define database connectivity when connecting to CrateDB Cloud.\n",
    "CONNECTION_STRING = os.environ.get(\n",
    "    \"CRATEDB_CONNECTION_STRING\",\n",
    "    \"crate://username:password@hostname/?ssl=true\",\n",
    ")\n",
    "\n",
    "# Define database connectivity when connecting to CrateDB on localhost.\n",
    "# CONNECTION_STRING = os.environ.get(\n",
    "#     \"CRATEDB_CONNECTION_STRING\",\n",
    "#     \"crate://crate@localhost/?ssl=false\",\n",
    "# )\n",
    "\n",
    "# Compute derived connection strings for SQLAlchemy use vs. MLflow use.\n",
    "DBURI_DATA = f\"{CONNECTION_STRING}&schema=testdrive\"\n",
    "DBURI_MLFLOW = f\"{CONNECTION_STRING}&schema=mlflow\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Creating demo data\n",
"\n",
"For convenience, this notebook comes with an accompanying CSV dataset which you\n",
"can quickly import into the database. Upload the CSV file to your CrateDB cloud\n",
"cluster, as described at [CrateDB Cloud » Import].\n",
"To follow this notebook, choose `pycaret_churn` for your table name.\n",
"\n",
"This will automatically create a new database table and import the data.\n",
"\n",
"[CrateDB Cloud » Import]: https://cratedb.com/docs/cloud/en/latest/reference/overview.html#import\n",
"\n",
"### Alternative data import using code\n",
"\n",
"If you prefer to use code to import your data, please execute the following lines which read the CSV\n",
Expand All @@ -175,13 +218,16 @@
"if os.path.exists(\".env\"):\n",
" dotenv.load_dotenv(\".env\", override=True)\n",
"\n",
"# Connect to database.\n",
"engine = sa.create_engine(DBURI_DATA, echo=bool(os.environ.get('DEBUG')))\n",
"\n",
"# Import data.\n",
"df = pd.read_csv(\"https://github.com/crate/cratedb-datasets/raw/main/machine-learning/automl/churn-dataset.csv\")\n",
"df.to_sql(\"pycaret_churn\", engine, schema=\"testdrive\", index=False, chunksize=1000, if_exists=\"replace\")\n",
"\n",
"# CrateDB is eventually consistent, so synchronize write operations.\n",
"with engine.connect() as conn:\n",
" conn.execute(sa.text(\"REFRESH TABLE pycaret_churn\"))"
]
},
{
"if os.path.exists(\".env\"):\n",
" dotenv.load_dotenv(\".env\", override=True)\n",
"\n",
"engine = sa.create_engine(DBURI_DATA, echo=bool(os.environ.get('DEBUG')))\n",
"\n",
"with engine.connect() as conn:\n",
" with conn.execute(sa.text(\"SELECT * FROM pycaret_churn\")) as cursor:\n",
" data = pd.DataFrame(cursor.fetchall(), columns=cursor.keys())\n",
"\n",
"# Configure MLflow to use CrateDB.\n",
"os.environ[\"MLFLOW_TRACKING_URI\"] = DBURI_MLFLOW"
]
},
{
"# - \"n_select\" defines how many models are selected.\n",
"# - \"exclude\" defines which models are excluded from the comparison.\n",
"\n",
"# Note: This is only relevant if we are executing automated tests\n",
"if \"PYTEST_CURRENT_TEST\" in os.environ:\n",
" best_models = compare_models(sort=\"AUC\", include=[\"lr\", \"knn\"], n_select=3)\n",
"# If we are not in an automated test, compare the available models\n",
"else:\n",
" # For production scenarios, it might be worthwhile to include \"lightgbm\" again.\n",
" best_models = compare_models(sort=\"AUC\", exclude=[\"lightgbm\"], n_select=3)"
"metadata": {},
"outputs": [],
"source": [
"# Configure MLflow to use CrateDB.\n",
"os.environ[\"MLFLOW_TRACKING_URI\"] = DBURI_MLFLOW"
]
},
{
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
dotenv.load_dotenv(".env", override=True)


# Configure connectivity to a CrateDB server on localhost.
CONNECTION_STRING = os.environ.get(
"CRATEDB_CONNECTION_STRING",
"crate://crate@localhost/?ssl=false",
)

# Compute derived connection strings for SQLAlchemy use vs. MLflow use.
DBURI_DATA = f"{CONNECTION_STRING}&schema=testdrive"
DBURI_MLFLOW = f"{CONNECTION_STRING}&schema=mlflow"

# Propagate database connectivity settings.
engine = sa.create_engine(DBURI_DATA, echo=bool(os.environ.get("DEBUG")))
os.environ["MLFLOW_TRACKING_URI"] = DBURI_MLFLOW


def fetch_data():
"""
    Fetch data from CrateDB, using SQL and SQLAlchemy, and wrap the result in a pandas data frame.
"""
engine = sa.create_engine(DBURI_DATA, echo=True)

with engine.connect() as conn:
with conn.execute(sa.text("SELECT * FROM pycaret_churn")) as cursor: