
AutoML: CI trips with CellTimeoutError / ValueError: Input contains NaN. #170

Closed
amotl opened this issue Nov 29, 2023 · 18 comments

amotl (Member) commented on Nov 29, 2023

Dear @andnig,

the CI caught an error from automl_timeseries_forecasting_with_pycaret.py [1].

FAILED test.py::test_file[automl_timeseries_forecasting_with_pycaret.py] - ValueError: Input contains NaN.

Apparently, it started tripping like this only yesterday [2], so it is likely the error is related to changed input data.

However, the result of debugging this error may well converge into a corresponding issue at PyCaret, because it promises to handle so much automatically. On the other hand, the code may just need a particular data cleansing step to accommodate the situation. May I ask you to have a look?

With kind regards,
Andreas.

Footnotes

  1. https://github.com/crate/cratedb-examples/actions/runs/7027445018/job/19121837475#step:6:1492

  2. https://github.com/crate/cratedb-examples/actions/runs/7013620591/job/19080058650

amotl (Member, Author) commented on Nov 29, 2023

It is likely the error is related to changed input data.

Thinking about it once more, it is more likely that some dependency library of PyCaret was not pinned correctly, and that something changed in this area.

amotl changed the title from "AutoML: ValueError: Input contains NaN." to "AutoML: Example program breaks with ValueError: Input contains NaN." on Nov 29, 2023
amotl changed the title from "AutoML: Example program breaks with ValueError: Input contains NaN." to "AutoML: Example program trips with ValueError: Input contains NaN." on Nov 29, 2023
andnig (Contributor) commented on Nov 29, 2023

@amotl
All dependencies are pinned, except the crate sqlalchemy one.
We can assume it's not related to PyCaret itself.
PyCaret automatically interpolates NaN values, except when there are ONLY NaN values (or no values at all), which might indicate an issue with the testing infrastructure, the connection, or the database.
Before I dig out my debug-rod: are there any changes to either the test runner or crate sqlalchemy that come to your mind which might prevent reading data via pandas?
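
For reference, the kind of sanity check I have in mind would look roughly like this (a sketch only; the connection string and table name are placeholders, and only the "total_sales" column is taken from the example):

    import pandas as pd
    import sqlalchemy as sa

    # Placeholder connection string and table name, for illustration only.
    engine = sa.create_engine("crate://localhost:4200")
    data = pd.read_sql("SELECT * FROM sales_data", engine)

    # If the query returns no rows, or the target column is entirely NaN,
    # PyCaret's automatic interpolation cannot help and setup() will fail.
    assert not data.empty, "No data returned from CrateDB"
    assert not data["total_sales"].isna().all(), "Target column is entirely NaN"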

amotl (Member, Author) commented on Nov 29, 2023

All dependencies are pinned except the crate sqlalchemy one.
We can assume it's not related to pycaret itself.

That's true, but I am talking about transitive dependencies of PyCaret. I think that is the most likely reason, but sure, it can also be something different.

andnig (Contributor) commented on Nov 29, 2023

Hours and hours of debugging into the dependencies of PyCaret, googling the term "transitive dependencies", just to find that the test still ran on Python 3.10. The life of a developer is fun 😄
https://github.com/crate/cratedb-examples/actions/runs/7036786822/job/19150177672?pr=171

Can you confirm that the test is green?

To be honest, I'm not sure if this issue is really resolved yet, as the PyCaret time series notebook test was always green, but the script version of it failed. Smells like a flaky test or environment. Let's monitor the situation; as the PR test is green for now, I will not invest more time at the moment. Is that good for you?

amotl (Member, Author) commented on Nov 29, 2023

Thank you very much for your efforts. Sure, let's merge the PR, close this issue, and keep monitoring for similar events in the future.

amotl closed this as completed on Nov 29, 2023
amotl (Member, Author) commented on Nov 29, 2023

I am just re-running the most recently failed https://github.com/crate/cratedb-examples/actions/runs/7027445018, in order to rule out that the failure is related to the time of day when the test is executed.

If it fails again, it is likely that the upgrade to Python 3.11 resolved the situation in one way or another, and that your debugging efforts had a positive outcome.

amotl (Member, Author) commented on Nov 29, 2023

Aha, it is green again, so it was actually just a fluke. However, it is an interesting one which can also easily hit production applications, depending on what the actual root cause was.

andnig (Contributor) commented on Nov 29, 2023

This is related to how the tests in our repo here are designed. The model training pipeline itself is not the concern; see some of the reasons why this error happens above. I know this error quite well from my own projects: it happens if the data is not available as expected.

amotl changed the title from "AutoML: Example program trips with ValueError: Input contains NaN." to "AutoML: Example program trips with CellTimeoutError / ValueError: Input contains NaN." on Dec 2, 2023
amotl (Member, Author) commented on Dec 2, 2023

Hi again.

I think the root cause for this is actually the venerable CellTimeoutError, i.e. the notebook simply creates too much system load; see, for example, [1]:

E           nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 300 seconds.
E           The message was: Cell execution timed out.
E           Here is a preview of the cell contents:
E           -------------------
E           s = setup(data, fh=15, target="total_sales", index="month", log_experiment=True)
E           -------------------

/opt/hostedtoolcache/Python/3.11.6/x64/lib/python3.11/site-packages/nbclient/client.py:801: CellTimeoutError

^^ Do you see any chance to make this spot more efficient on CI, @andnig? As a stopgap, the cell timeout could possibly be relaxed, as sketched below.
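
For reference, the 300-second limit comes from the notebook client. A minimal sketch of raising the per-cell timeout, assuming the harness drives the notebooks through nbclient directly, and using the notebook filename only as an example:

    import nbformat
    from nbclient import NotebookClient

    # Sketch only: raise the per-cell timeout from 300 towards 900 seconds.
    # The actual test harness may configure this via its own options instead.
    nb = nbformat.read("automl_timeseries_forecasting_with_pycaret.ipynb", as_version=4)
    client = NotebookClient(nb, timeout=900, kernel_name="python3")
    client.execute()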

With kind regards,
Andreas.

-- https://github.com/crate/cratedb-examples/actions/runs/7072661550/job/19251742707?pr=174#step:6:2870
-- https://github.com/crate/cratedb-examples/actions/runs/7072661550/job/19253059998?pr=174#step:6:2872

Footnotes

  1. https://github.com/crate/cratedb-examples/pull/174#issuecomment-1837265321

amotl reopened this on Dec 2, 2023
amotl changed the title from "AutoML: Example program trips with CellTimeoutError / ValueError: Input contains NaN." to "AutoML: CI trips with CellTimeoutError / ValueError: Input contains NaN." on Dec 2, 2023
amotl (Member, Author) commented on Dec 3, 2023

Another occurrence of the venerable CellTimeoutError. It also happens on a setup() call, but this time on a different one.

E           nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 300 seconds.
E           The message was: Cell execution timed out.
E           Here is a preview of the cell contents:
E           -------------------
E           from pycaret.classification import setup, compare_models, tune_model, ensemble_model, blend_models, automl, \
E               evaluate_model, finalize_model, save_model, predict_model
E           
E           s = setup(
E               data,
E               target="Churn",
E               ignore_features=["customerID"],
E               log_experiment=True,
E               fix_imbalance=True,
E           )
E           -------------------

/opt/hostedtoolcache/Python/3.11.6/x64/lib/python3.11/site-packages/nbclient/client.py:801: CellTimeoutError

-- https://github.com/crate/cratedb-examples/actions/runs/7072661550/job/19253059998?pr=174#step:6:2746

amotl (Member, Author) commented on Dec 4, 2023

We found that the main reason for this was a misconfiguration of the MLFLOW_TRACKING_URL. It has been fixed as part of GH-174, until further notice. Thanks for your support, @andnig!
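
For the record, a minimal sketch of pointing MLflow at an explicit tracking backend (the URI below is a placeholder, not the value used in GH-174; PyCaret's log_experiment=True logs to whatever MLflow resolves here):

    import os
    import mlflow

    # Placeholder tracking backend, for illustration only.
    os.environ.setdefault("MLFLOW_TRACKING_URI", "http://localhost:5000")
    mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
    print(mlflow.get_tracking_uri())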

amotl closed this as completed on Dec 4, 2023
amotl (Member, Author) commented on Feb 10, 2024

Hi again. This issue is still present and is constantly haunting us, which is unfortunate.

The most recent occurrence, just about two hours ago, happened after we tried to re-schedule the corresponding job to run during daytime, as we figured it would work better. Turns out, it doesn't help.

Now, looking a bit closer at the error output, I am just now also spotting this warning:

  /opt/hostedtoolcache/Python/3.11.7/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
  STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
  
  Increase the number of iterations (max_iter) or scale the data as shown in:
      https://scikit-learn.org/stable/modules/preprocessing.html
  Please also refer to the documentation for alternative solver options:
      https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

-- https://github.com/crate/cratedb-examples/actions/runs/7854158803/job/21434611151#step:6:1155

Could that actually be related to the job occasionally (50/50) stalling/freezing/timing out?
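
For reference, the sklearn-level remedy the warning points to would look roughly like this (a sketch only; whether and how PyCaret forwards these settings in this example is a separate question):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Per the warning text: scale the data and/or raise max_iter.
    X, y = make_classification(n_samples=200, random_state=0)
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X, y)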

amotl reopened this on Feb 10, 2024
andnig (Contributor) commented on Feb 10, 2024

Hey Andreas, happy to chime in. 👋

  1. Please think about separating these two topics for more clarity. CellTimeoutError and the NaN input error are mostly two separate issues, if they both still occur. The cell timeout is more often than not related to the Jupyter test runner (or, well, simply to timeouts), while the NaN input error can mean multiple things; the more common ones are that the data is not there, that your infrastructure is about to get killed, or that training iterators are running amok.

  2. As your test infrastructure is quite limited in terms of CPU power, we added a guard on the PYTEST_CURRENT_TEST environment variable which only runs three models that are also rather fast to train. If I remember correctly, we used two ETS model variants and a naive one.
    From the logs you shared it seems, however, that all the models are trained (also, the non-convergence warning is related to a model which we excluded from test runs).

I would suggest utilizing the PYTEST_CURRENT_TEST environment variable for both the .ipynb and the .py tests, to reduce training time and potentially solve both issues related to how you test these notebooks; a sketch follows below. Please just make sure that the environment variable is also "visible" to the Jupyter notebooks. The exact configuration depends on which Jupyter test runner you use.
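
A minimal sketch of that guard, against a synthetic stand-in dataset (the real example reads "total_sales" per "month" from CrateDB; the model shortlist below is only an assumption based on the discussion above):

    import os
    import numpy as np
    import pandas as pd
    from pycaret.time_series import setup, compare_models

    # Synthetic stand-in data, purely for illustration.
    data = pd.DataFrame({
        "month": pd.period_range("2016-01", periods=96, freq="M"),
        "total_sales": np.random.default_rng(42).uniform(100, 200, 96),
    })
    s = setup(data, fh=15, target="total_sales", index="month")

    if "PYTEST_CURRENT_TEST" in os.environ:
        # On CI, only compare a few lightweight models to keep the runtime bounded.
        best = compare_models(include=["ets", "exp_smooth", "naive"])
    else:
        # Outside pytest, let PyCaret evaluate its full model suite.
        best = compare_models()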

I hope this helps so far; let me know how it goes.


PS: You mentioned that the tests fail 50/50, but at a quick glance I was only able to find two failed tasks. Would you mind checking whether the NaN input failures always occur on notebook tests, or also on .py file tests?

amotl (Member, Author) commented on Feb 10, 2024

Hi Andreas, thanks for your quick reply.

From the logs you shared it seems however that all the models(!) are trained, [while we intended to only run a few of them]. [I can] also [spot] a non-converge error, which is related to a model which we excluded for test runs.
[Most probably, PYTEST_CURRENT_TEST is not getting evaluated properly.] Please just make sure that the env vars are "visible" for the jupyter notebooks as well.

That's to the point. I also had the suspicion that the measures we took last time to bring down the required compute resources did not work well, or had flaws, but I have not yet analyzed the log output regarding this topic. So, if you think this is the issue still tripping us, I now have something to hold on to and investigate. Thank you so much!

With kind regards,
Andreas.

amotl (Member, Author) commented on Feb 13, 2024

Hi again. We've explored the situation, and the outcome is that we can confirm that the call to compare_models works well, including its guard using a corresponding if "PYTEST_CURRENT_TEST" in os.environ clause.

  • The guard works when invoking pytest in the local directory, and it works when invoking ngr test from the repository root directory.
  • The guard also works in both files equally well, the pure .py file, and the .ipynb file, so it is apparently not obstructed by pytest / nbtest runners.

I wouldn't know why it should be different on GHA. So, maybe the selected algorithms ["arima", "ets", "exp_smooth"] / ["ets", "et_cds_dt", "naive"] are still too heavy on CPU and/or memory?

andnig (Contributor) commented on Feb 13, 2024

Hi Andreas, if you look at the logs, it's not a timeout error, it's the NaN input error. As mentioned above, I'd suggest keeping these two issues separate. The timeout issue is most probably related to the Jupyter test runner; this NaN input error, however, is not related to Jupyter.

If I look at the failed run, I see that the ESM (exponential smoothing) model has an incredibly high MASE and RMSSE. This mostly indicates that the model is not very well suited for the data. I suggested it because it is very lightweight, but well, too lightweight as it seems 😓

(Screenshot: model comparison leaderboard from the failed run.)

To go forward, you could:

  1. Use a different model for the test run, one which has a lower MASE. Run the whole PyCaret model suite locally and select one of the top-5 models, instead of the exp_smooth one, for your test run (see the sketch after this list).
  2. If this does not help, can you provide some local reproduction steps? If you can reproduce it locally, I'm better able to help.
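
The sketch referenced in item 1 (to be run locally, not on CI; it assumes setup() has already been called on the full dataset, as in the notebook):

    from pycaret.time_series import compare_models, pull

    # Compare the full model suite and keep the five best models by MASE.
    top_models = compare_models(n_select=5, sort="MASE")

    # Inspect the pulled leaderboard (it includes MASE and RMSSE columns) and
    # pick a lightweight entry from the top rows to replace exp_smooth in the
    # CI shortlist.
    leaderboard = pull()
    print(leaderboard.head())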

amotl (Member, Author) commented on Feb 13, 2024

Thanks, and sorry that I mixed up those two different errors again. I've now split them into separate issues, so this one can be closed after carrying over the relevant information.

seut added and then removed the "bug" label on Feb 20, 2024
amotl (Member, Author) commented on Feb 27, 2024

After splitting the issue up into different tickets, but without applying any other fixes, we are currently not facing any problems on nightly runs of the corresponding CI jobs.

Therefore, I am closing the issue now, for the time being. Thanks again, @andnig!

@amotl amotl closed this as completed Feb 27, 2024