Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AutoML: Improve notebooks for easier usage on Google Colab #174

Merged
merged 5 commits into from
Dec 4, 2023

Conversation

ckurze
Copy link
Contributor

@ckurze ckurze commented Nov 30, 2023

About

For an upcoming presentation, this patch sands the AutoML notebooks a bit more, for improved usability. It is specifically about making it easier to run them on the Google Colab computing environment.

Details

  • Improve Python dependency definition by inlining it into the notebook.
  • Improve database connectivity definition by only using a single configuration value.

@ckurze
Copy link
Contributor Author

ckurze commented Nov 30, 2023

@amotl Could you please have a final review on the changes?

ckurze and others added 2 commits December 2, 2023 22:23
FAILED test.py::test_notebook[automl_timeseries_forecasting_with_pycaret.ipynb] -
nbclient.exceptions.CellTimeoutError: A cell timed out while it was
being executed, after 240 seconds.
The message was: Cell execution timed out.

Here is a preview of the cell contents:
-------------------
s = setup(data, fh=15, target="total_sales", index="month", log_experiment=True)
@amotl
Copy link
Member

amotl commented Dec 2, 2023

It does not converge. I already tried with 86e77a7, but still: CellTimeoutError.

E           nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 300 seconds.
E           The message was: Cell execution timed out.
E           Here is a preview of the cell contents:
E           -------------------
E           s = setup(data, fh=15, target="total_sales", index="month", log_experiment=True)
E           -------------------

/opt/hostedtoolcache/Python/3.11.6/x64/lib/python3.11/site-packages/nbclient/client.py:801: CellTimeoutError

-- https://github.com/crate/cratedb-examples/actions/runs/7072661550/job/19251742707?pr=174#step:6:2870

@amotl

This comment was marked as duplicate.

@andnig
Copy link
Contributor

andnig commented Dec 3, 2023

^^ Do you see any chance to make this spot more efficient on CI, @andnig?

The highlighted cell just does the setup of the data to train the models on. It can't run for more than a Minute or so.
As our tests previously were successful at 240s timeout and now even fail at 300s is there a chance that the database connection for mlflow logging was changed? Maybe you can try setting log_experiments to false and check if this cell executes faster. Maybe it tries to get to the logging database but fails to do so.

@amotl
Copy link
Member

amotl commented Dec 3, 2023

The highlighted cell just does the setup of the data to train the models on. It can't run for more than a Minute or so.

So, you think the cell timeout actually isn't the consequence of hogging the CPU too much, but is rather originating from a network timeout? Interesting. I will consider this idea when debugging the problem in more detail.

Maybe you can try setting log_experiments to False?

Sure, will try. Thanks.

@amotl
Copy link
Member

amotl commented Dec 3, 2023

is there a chance that the database connection for mlflow logging was changed?

All the nightly scheduled runs of the "AutoML" job succeed, so it is absolutely likely this has caused the issue. Thanks!

In fact, @ckurze was improving database connection details here, so this patch will most probably need an update to the test runner, similar to how crate/crate-python#176 was needed to bring in crate/crate-python#173. I will take care about it, and add a corresponding commit here.

@amotl
Copy link
Member

amotl commented Dec 3, 2023

[...] improving database connection details here, so this patch will most probably need an update to the test runner [...]

This is the spot where corresponding environment variables for the test suite are being defined.

env = [
"CRATEDB_CONNECTION_STRING=crate://crate@localhost/?schema=testdrive",
"CRATE_USER=crate",
"CRATE_PASSWORD=",
"CRATE_HOST=localhost",
"CRATE_SSL=false",
"PYDEVD_DISABLE_FILE_VALIDATION=1",
]

Most probably, taking into account what @andnig is suggesting, those settings don't harmonize.

# Use CrateDB for MLflow tracking.
os.environ["MLFLOW_TRACKING_URI"] = f"{dburi}&schema=mlflow"

https://github.com/crate/cratedb-examples/blob/c4cad47ad95e331600613cdbef56506e057973b0/topic/machine-learning/automl/automl_timeseries_forecasting_with_pycaret.ipynb?short_path=ff6d6a4#L255-L256

Comment on lines 256 to 292
"os.environ[\"MLFLOW_TRACKING_URI\"] = f\"{dburi}&schema=mlflow\""
"os.environ[\"MLFLOW_TRACKING_URI\"] = f\"{CONNECTION_STRING}&schema=mlflow\""
Copy link
Member

@amotl amotl Dec 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Problem

Previously, dburi contained an SQLAlchemy database address URI / connection string without the schema parameter, so it was good to append one.

Now that it is coming from the CRATEDB_CONNECTION_STRING environment variable, which already includes a schema parameter, things most probably go south when trying to address the correct database location.

Thanks for the guidance into this direction, @andnig. I think it is a bit sad that the process runs into a timeout situation. While I did not yet look into the details why exactly this is happening, a more explicit exception raised by the software would have been nice, for better UX.

Note that this issue is exclusively related to software testing, and does not mean the notebook is incorrect. It is only about that it currently doesn't harmonize with automated testing.

Copy link
Member

@amotl amotl Dec 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation

Now that we know what to look for, it is easy to spot the correct exception. It does not help too much that it is printed in green, and that I am not too familiar with analyzing pytest-notebook stacktraces yet.

2023/12/03 00:03:58 WARNING mlflow.store.db.utils: SQLAlchemy engine could not be created. The following exception is caught.
(crate.client.exceptions.ProgrammingError) expected string or bytes-like object, got 'tuple'
(Background on this error at: https://sqlalche.me/e/20/f405)
Operation will be retried in 0.1 seconds

-- https://github.com/crate/cratedb-examples/actions/runs/7072661550/job/19253059998?pr=174#step:6:952

image

Evaluation

That exception exactly reflects what's happening: The machinery receives two values for the schema parameter, in form of a tuple, and trips correspondingly.

Thoughts

Well, that this operation is retried, might by okay from the perspective of MLflow, because it probably can't judge good enough about the type of the underlying error.

However, regardless of the error, we see MLflow is retrying with increasing delay, using an exponential backoff mechanism. While that is absolutely perfectly fine in production environments, we may want to adjust the maximum delay to a much shorter time when running test cases, both in sandbox and CI environments.

When doing so, the error would have been revealed instantly, and not masked too much by being deferred to so much later in time.

Copy link
Member

@amotl amotl Dec 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On my machine, it yields a slightly different error message about the same error (schema parameter is present two times). Even when configuring nb_exec_timeout = 15 in pyproject.toml, it still attempts to retry for ages.

2023/12/04 00:42:00 WARNING mlflow.store.db.utils: SQLAlchemy engine could not be created. The following exception is caught.
(crate.client.exceptions.ProgrammingError) expected string or bytes-like object
(Background on this error at: https://sqlalche.me/e/20/f405)
Operation will be retried in 0.1 seconds

[...]

2023/12/04 00:45:24 WARNING mlflow.store.db.utils: SQLAlchemy engine could not be created. The following exception is caught.
(crate.client.exceptions.ProgrammingError) expected string or bytes-like object
(Background on this error at: https://sqlalche.me/e/20/f405)
Operation will be retried in 204.7 seconds
2023/12/04 00:48:48 WARNING mlflow.store.db.utils: SQLAlchemy engine could not be created. The following exception is caught.
(crate.client.exceptions.ProgrammingError) expected string or bytes-like object
(Background on this error at: https://sqlalche.me/e/20/f405)
Operation will be retried in 409.5 seconds

Copy link
Member

@amotl amotl Dec 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently, there are no time-based adjustments for the reconnection loop of MLflow.

The maximum number of times to retry is configured as MAX_RETRY_COUNT = 15.
-- https://github.com/mlflow/mlflow/blob/v2.8.1/mlflow/store/db/utils.py#L55

The reconnection loop will retry for up to 55 minutes (3276.7 seconds), if I got it right.
-- https://github.com/mlflow/mlflow/blob/v2.8.1/mlflow/store/db/utils.py#L224-L243

NB: On my machine, the process does not seem to respect the nb_exec_timeout configuration setting at all, already time.sleeping() at 819.1 seconds after launch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concluding all of the observations shared above, I think it will be a good idea to probe the database connection before going into the reconnect loop.

-- https://github.com/crate-workbench/mlflow-cratedb

Copy link
Member

@amotl amotl Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a followup error when naively configuring the database connection like

CRATEDB_CONNECTION_STRING=crate://crate@localhost/

Because &schema=mlflow will appended, the driver will effectively use crate://crate@localhost/&schema=mlflow, which will raise a different error, but will still go into the reconnect loop.

2023/12/04 01:20:53 WARNING mlflow.store.db.utils: SQLAlchemy engine could not be created. The following exception is caught.
Connection.__init__() got an unexpected keyword argument 'database'
Operation will be retried in 102.3 seconds

&schema=mlflow will be interpreted as a database parameter, and the Python driver doesn't accept it, because CrateDB does not understand the concept of databases.

NB: It will be sweet to improve the Python driver to raise an error message which reflects this detail better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, this was successful. After fixing the database connectivity settings propagation, test cases signal "green" status now, and there are three more tasks to work on, in order to improve upstream library and adapter software.

Comment on lines 2126 to 2127
" \"MLFLOW_TRACKING_URI\"\n",
"] = f\"crate://{os.environ['CRATE_USER']}:{os.environ['CRATE_PASSWORD']}@{os.environ['CRATE_HOST']}:4200?ssl={os.environ['CRATE_SSL']}&schema=mlflow\""
"] = f\"{CONNECTION_STRING}&schema=mlflow\""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dito.

Comment on lines 227 to 262
"] = f\"{dburi}&schema=mlflow\""
"] = f\"{CONNECTION_STRING}&schema=mlflow\""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same would apply here, but on the other notebook, the error doesn't happen?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh well, it does.

============= short test summary info =============
FAILED test.py::test_notebook[automl_timeseries_forecasting_with_pycaret.ipynb] - nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 300 seconds.
FAILED test.py::test_notebook[automl_classification_with_pycaret.ipynb] - nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 300 seconds.
============= 2 failed, 2 passed, 62 warnings in 880.39s (0:14:40) =============

Comment on lines 3409 to 3446
"] = f\"crate://{os.environ['CRATE_USER']}:{os.environ['CRATE_PASSWORD']}@{os.environ['CRATE_HOST']}:4200?ssl={os.environ['CRATE_SSL']}&schema=mlflow\""
"] = f\"{CONNECTION_STRING}&schema=mlflow\""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dito.

@amotl amotl changed the title Minor changes on AutoML notebooks for easier usage on Google Colab AutoML: Improve notebooks for easier usage on Google Colab Dec 3, 2023
Concatenating the `schema` query parameter to the SQLAlchemy connection
string correctly is crucial. In order to avoid anomalies or confusion,
this patch makes it so that both types of connection strings (regular
data vs. MLflow tracking) are configured side-by-side now, so it is
easier to understand what is going on.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants