AutoML: Improve notebooks for easier usage on Google Colab #174

Merged: 5 commits, Dec 4, 2023
30 changes: 25 additions & 5 deletions topic/machine-learning/automl/README.md
and [CrateDB].
performing model. The notebook also shows how to use CrateDB as storage for
both the raw data and the experiment tracking and model registry data.

- Accompanying the Jupyter Notebook files, there are also basic standalone
  program variants of the above examples.
- [automl_timeseries_forecasting_with_pycaret.py](automl_timeseries_forecasting_with_pycaret.py),
- [automl_classification_with_pycaret.py](automl_classification_with_pycaret.py).
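Both the notebooks and the standalone programs configure database connectivity the same way. A minimal sketch of that pattern follows; the `CRATEDB_CONNECTION_STRING` variable and the `testdrive`/`mlflow` schema names follow this PR's changes, while the localhost default shown here is an assumption to adjust for your environment:

```python
import os

# Base connection string; the local default is an assumption -- substitute
# your CrateDB Cloud credentials where applicable.
CONNECTION_STRING = os.environ.get(
    "CRATEDB_CONNECTION_STRING",
    "crate://crate@localhost/?ssl=false",
)

# Derive one URI per schema: raw data for SQLAlchemy, records for MLflow.
DBURI_DATA = f"{CONNECTION_STRING}&schema=testdrive"
DBURI_MLFLOW = f"{CONNECTION_STRING}&schema=mlflow"

# MLflow picks up its tracking backend from this environment variable.
os.environ["MLFLOW_TRACKING_URI"] = DBURI_MLFLOW

print(DBURI_DATA)
print(DBURI_MLFLOW)
```

Keeping data and experiment records in separate schemas lets one CrateDB cluster serve both roles without the tables interleaving.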


## Software Tests

The resources are validated by corresponding software tests on CI. You can
also run them on your workstation. For example, to invoke the test cases
validating the notebook about classification with PyCaret, run:

```shell
pytest -k automl_classification_with_pycaret.ipynb
```

Alternatively, you can validate all resources in this folder by invoking a
test runner program on the top-level folder of this repository. This is the
same code path the CI jobs take.

```shell
pip install -r requirements.txt
ngr test topic/machine-learning/automl
```
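One detail the tests rely on: the notebook trims expensive work when running under pytest by checking the `PYTEST_CURRENT_TEST` environment variable, which pytest sets for the duration of each test. A sketch of that gating pattern, with the argument values taken from the notebook diff in this PR (wrapping them in a helper function is purely illustrative):

```python
import os

def model_selection_args():
    """Choose compare_models() arguments, reduced when running under pytest.

    pytest sets PYTEST_CURRENT_TEST while a test runs, so its presence is a
    convenient switch for cutting runtime on CI.
    """
    if "PYTEST_CURRENT_TEST" in os.environ:
        # Fast CI path: compare only two inexpensive estimators.
        return {"sort": "AUC", "include": ["lr", "knn"], "n_select": 3}
    # Full path: compare all available models except "lightgbm".
    return {"sort": "AUC", "exclude": ["lightgbm"], "n_select": 3}
```

The same technique applies to any notebook step that is too slow for CI but should run in full interactively.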


[PyCaret]: https://github.com/pycaret/pycaret
[CrateDB]: https://github.com/crate/crate
[Introduction to hyperparameter tuning]: https://medium.com/analytics-vidhya/comparison-of-hyperparameter-tuning-algorithms-grid-search-random-search-bayesian-optimization-5326aaef1bd1
106 changes: 75 additions & 31 deletions topic/machine-learning/automl/automl_classification_with_pycaret.ipynb
"source": [
"## Getting started\n",
"\n",
"First, install the required dependencies. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"#!pip install -r requirements.txt\n",
"\n",
"# In an environment like Google Colab, please use the absolute URL to the requirements.txt file.\n",
"# Note: Some dependency conflicts may be reported; they can usually be ignored.\n",
"# Restart the runtime, if asked by Colab.\n",
"#!pip install -r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/automl/requirements.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:** As of the time of this writing, PyCaret requires Python 3.8, 3.9, or 3.10.\n",
"\n",
"Second, you will need a CrateDB instance to store and serve the data. The easiest\n",
"create an `.env` file with the following content:\n",
"create an `.env` file with the following content:\n",
"\n",
"```env\n",
"# use this string for a connection to CrateDB Cloud\n",
"CONNECTION_STRING=crate://username:password@hostname/?ssl=true \n",
"\n",
"# use this string for a local connection to CrateDB\n",
"# CONNECTION_STRING=crate://crate@localhost/?ssl=false\n",
"```\n",
"\n",
"You can find your CrateDB credentials in the [CrateDB Cloud Console].\n",
"\n",
"[CrateDB Cloud Console]: https://cratedb.com/docs/cloud/en/latest/reference/overview.html#cluster\n",
"[deploy a cluster]: https://cratedb.com/docs/cloud/en/latest/tutorials/deploy/stripe.html#deploy-cluster\n",
"[deploy a cluster]: https://cratedb.com/docs/cloud/en/latest/tutorials/deploy/stripe.html#deploy-cluster"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
    "import os\n",
    "\n",
    "# Define database connectivity when connecting to CrateDB Cloud.\n",
    "CONNECTION_STRING = os.environ.get(\n",
    "    \"CRATEDB_CONNECTION_STRING\",\n",
    "    \"crate://username:password@hostname/?ssl=true\",\n",
    ")\n",
    "\n",
    "# Define database connectivity when connecting to CrateDB on localhost.\n",
    "# CONNECTION_STRING = os.environ.get(\n",
    "#     \"CRATEDB_CONNECTION_STRING\",\n",
    "#     \"crate://crate@localhost/?ssl=false\",\n",
    "# )\n",
    "\n",
    "# Compute derived connection strings for SQLAlchemy use vs. MLflow use.\n",
    "DBURI_DATA = f\"{CONNECTION_STRING}&schema=testdrive\"\n",
    "DBURI_MLFLOW = f\"{CONNECTION_STRING}&schema=mlflow\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Creating demo data\n",
"\n",
"For convenience, this notebook comes with an accompanying CSV dataset which you\n",
"can quickly import into the database. Upload the CSV file to your CrateDB cloud\n",
"cluster, as described at [CrateDB Cloud » Import].\n",
"To follow this notebook, choose `pycaret_churn` for your table name.\n",
"\n",
"This will automatically create a new database table and import the data.\n",
"\n",
"[CrateDB Cloud » Import]: https://cratedb.com/docs/cloud/en/latest/reference/overview.html#import\n",
"\n",
"### Alternative data import using code\n",
"\n",
"If you prefer to use code to import your data, please execute the following lines which read the CSV\n",
Expand All @@ -175,13 +218,16 @@
"if os.path.exists(\".env\"):\n",
" dotenv.load_dotenv(\".env\", override=True)\n",
"\n",
"# Connect to database.\n",
"engine = sa.create_engine(DBURI_DATA, echo=bool(os.environ.get('DEBUG')))\n",
"\n",
"# Import data.\n",
"df = pd.read_csv(\"https://github.com/crate/cratedb-datasets/raw/main/machine-learning/automl/churn-dataset.csv\")\n",
"df.to_sql(\"pycaret_churn\", engine, schema=\"testdrive\", index=False, chunksize=1000, if_exists=\"replace\")\n",
"\n",
"# CrateDB is eventually consistent, so synchronize write operations.\n",
"with engine.connect() as conn:\n",
" conn.execute(sa.text(\"REFRESH TABLE pycaret_churn\"))"
]
},
{
"if os.path.exists(\".env\"):\n",
" dotenv.load_dotenv(\".env\", override=True)\n",
"\n",
"engine = sa.create_engine(DBURI_DATA, echo=bool(os.environ.get('DEBUG')))\n",
"\n",
"with engine.connect() as conn:\n",
" with conn.execute(sa.text(\"SELECT * FROM pycaret_churn\")) as cursor:\n",
" data = pd.DataFrame(cursor.fetchall(), columns=cursor.keys())\n",
"\n",
"# Configure MLflow to use CrateDB.\n",
"os.environ[\"MLFLOW_TRACKING_URI\"] = DBURI_MLFLOW"
]
},
{
"# - \"n_select\" defines how many models are selected.\n",
"# - \"exclude\" defines which models are excluded from the comparison.\n",
"\n",
"# Note: This is only relevant if we are executing automated tests\n",
"if \"PYTEST_CURRENT_TEST\" in os.environ:\n",
" best_models = compare_models(sort=\"AUC\", include=[\"lr\", \"knn\"], n_select=3)\n",
"# If we are not in an automated test, compare the available models\n",
"else:\n",
" # For production scenarios, it might be worthwhile to include \"lightgbm\" again.\n",
" best_models = compare_models(sort=\"AUC\", exclude=[\"lightgbm\"], n_select=3)"
"metadata": {},
"outputs": [],
"source": [
"# Configure MLflow to use CrateDB.\n",
"os.environ[\"MLFLOW_TRACKING_URI\"] = DBURI_MLFLOW"
]
},
{
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
dotenv.load_dotenv(".env", override=True)


# Configure connectivity to a CrateDB server on localhost.
CONNECTION_STRING = os.environ.get(
"CRATEDB_CONNECTION_STRING",
"crate://crate@localhost/?ssl=false",
)

# Compute derived connection strings for SQLAlchemy use vs. MLflow use.
DBURI_DATA = f"{CONNECTION_STRING}&schema=testdrive"
DBURI_MLFLOW = f"{CONNECTION_STRING}&schema=mlflow"

# Propagate database connectivity settings.
engine = sa.create_engine(DBURI_DATA, echo=bool(os.environ.get("DEBUG")))
os.environ["MLFLOW_TRACKING_URI"] = DBURI_MLFLOW


def fetch_data():
"""
    Fetch data from CrateDB, using SQL and SQLAlchemy, and wrap the result in a pandas data frame.
"""
engine = sa.create_engine(DBURI_DATA, echo=True)

with engine.connect() as conn:
with conn.execute(sa.text("SELECT * FROM pycaret_churn")) as cursor: