
AutoML: CI trips with CellTimeoutError / ValueError: Input contains NaN. #170

Closed
amotl opened this issue Nov 29, 2023 · 18 comments

amotl (Member) commented on Nov 29, 2023

Dear @andnig,

the CI caught an error from automl_timeseries_forecasting_with_pycaret.py [1].

FAILED test.py::test_file[automl_timeseries_forecasting_with_pycaret.py] - ValueError: Input contains NaN.

Apparently, it started tripping like this only yesterday [2], so it is likely the error is related to changed input data.

However, the result of debugging this error may well converge into a corresponding issue at PyCaret, because it promises to handle so much automatically. On the other hand, the code may just need a particular data cleansing step to accommodate the situation. May I ask you to have a look?

With kind regards,
Andreas.

Footnotes

  1. https://github.com/crate/cratedb-examples/actions/runs/7027445018/job/19121837475#step:6:1492

  2. https://github.com/crate/cratedb-examples/actions/runs/7013620591/job/19080058650

amotl (Member, Author) commented on Nov 29, 2023

It is likely the error is related to changed input data.

Thinking about it once more, it is more likely that some dependency library of PyCaret was not pinned correctly, and that something changed in this area.

amotl changed the title from "AutoML: ValueError: Input contains NaN." to "AutoML: Example program breaks with ValueError: Input contains NaN." on Nov 29, 2023
amotl changed the title from "AutoML: Example program breaks with ValueError: Input contains NaN." to "AutoML: Example program trips with ValueError: Input contains NaN." on Nov 29, 2023
andnig (Contributor) commented on Nov 29, 2023

@amotl
All dependencies are pinned, except the crate sqlalchemy one.
We can assume it's not related to PyCaret itself.
PyCaret automatically interpolates NaN values, except when there are ONLY NaN values (or no values at all), which might indicate an issue with the testing infrastructure, the connection, or the database.
Before I dig out my debug-rod: are there any changes to either the test runner or crate sqlalchemy that come to your mind which might prevent reading data via pandas?
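
For reference, the kind of sanity check I have in mind would look roughly like this (a sketch only; the connection string and table name are placeholders, and only the "total_sales" column is taken from the example):

    import pandas as pd
    import sqlalchemy as sa

    # Placeholder connection string and table name, for illustration only.
    engine = sa.create_engine("crate://localhost:4200")
    data = pd.read_sql("SELECT * FROM sales_data", engine)

    # If the query returns no rows, or the target column is entirely NaN,
    # PyCaret's automatic interpolation cannot help and setup() will fail.
    assert not data.empty, "No data returned from CrateDB"
    assert not data["total_sales"].isna().all(), "Target column is entirely NaN"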

amotl (Member, Author) commented on Nov 29, 2023

All dependencies are pinned except the crate sqlalchemy one.
We can assume it's not related to pycaret itself.

That's true, but I am talking about transitive dependencies of PyCaret. I think that is the most likely reason, but sure, it can also be something different.

andnig (Contributor) commented on Nov 29, 2023

Hours and hours of debugging into the dependencies of PyCaret, googling the term "transitive dependencies", just to find that the test still ran on Python 3.10. The life of a developer is fun 😄
https://github.com/crate/cratedb-examples/actions/runs/7036786822/job/19150177672?pr=171

Can you confirm that the test is green?

To be honest, I'm not sure if this issue is really resolved yet, as the PyCaret time series notebook test was always green, but the script version of it failed. Smells like a flaky test or environment. Let's monitor the situation; as the PR test is green for now, I will not invest more time at the moment. Is that good for you?

amotl (Member, Author) commented on Nov 29, 2023

Thank you very much for your efforts. Sure, let's merge the PR, close this issue, and keep monitoring for similar events in the future.

amotl closed this as completed on Nov 29, 2023
amotl (Member, Author) commented on Nov 29, 2023

I am just re-running the most recently failed https://github.com/crate/cratedb-examples/actions/runs/7027445018, in order to rule out that the failure is related to the time of day when the test is executed.

If it fails again, it is likely that the upgrade to Python 3.11 resolved the situation in one way or another, and that your debugging efforts had a positive outcome.

amotl (Member, Author) commented on Nov 29, 2023

Aha, it is green again, so it was actually just a fluke. However, it is an interesting one which can also easily hit production applications, depending on what the actual root cause was.

andnig (Contributor) commented on Nov 29, 2023

This is related to how the tests in our repo here are designed. The model training pipeline itself is not the concern; see some of the reasons why this error happens above. I know this error quite well from my own projects: it happens if the data is not available as expected.

amotl changed the title from "AutoML: Example program trips with ValueError: Input contains NaN." to "AutoML: Example program trips with CellTimeoutError / ValueError: Input contains NaN." on Dec 2, 2023
amotl (Member, Author) commented on Dec 2, 2023

Hi again.

I think the root cause for this is actually the venerable CellTimeoutError, i.e. the notebook simply creates too much system load; see, for example, [1]:

E           nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 300 seconds.
E           The message was: Cell execution timed out.
E           Here is a preview of the cell contents:
E           -------------------
E           s = setup(data, fh=15, target="total_sales", index="month", log_experiment=True)
E           -------------------

/opt/hostedtoolcache/Python/3.11.6/x64/lib/python3.11/site-packages/nbclient/client.py:801: CellTimeoutError

^^ Do you see any chance to make this spot more efficient on CI, @andnig? As a stopgap, the cell timeout could possibly be relaxed, as sketched below.
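
For reference, the 300-second limit comes from the notebook client. A minimal sketch of raising the per-cell timeout, assuming the harness drives the notebooks through nbclient directly, and using the notebook filename only as an example:

    import nbformat
    from nbclient import NotebookClient

    # Sketch only: raise the per-cell timeout from 300 towards 900 seconds.
    # The actual test harness may configure this via its own options instead.
    nb = nbformat.read("automl_timeseries_forecasting_with_pycaret.ipynb", as_version=4)
    client = NotebookClient(nb, timeout=900, kernel_name="python3")
    client.execute()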

With kind regards,
Andreas.

-- https://github.com/crate/cratedb-examples/actions/runs/7072661550/job/19251742707?pr=174#step:6:2870
-- https://github.com/crate/cratedb-examples/actions/runs/7072661550/job/19253059998?pr=174#step:6:2872

Footnotes

  1. https://github.com/crate/cratedb-examples/pull/174#issuecomment-1837265321

amotl reopened this on Dec 2, 2023
amotl changed the title from "AutoML: Example program trips with CellTimeoutError / ValueError: Input contains NaN." to "AutoML: CI trips with CellTimeoutError / ValueError: Input contains NaN." on Dec 2, 2023
amotl (Member, Author) commented on Dec 3, 2023

Another occurrence of the venerable CellTimeoutError. It also happens on a setup() call, but this time on a different one.

E           nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 300 seconds.
E           The message was: Cell execution timed out.
E           Here is a preview of the cell contents:
E           -------------------
E           from pycaret.classification import setup, compare_models, tune_model, ensemble_model, blend_models, automl, \
E               evaluate_model, finalize_model, save_model, predict_model
E           
E           s = setup(
E               data,
E               target="Churn",
E               ignore_features=["customerID"],
E               log_experiment=True,
E               fix_imbalance=True,
E           )
E           -------------------

/opt/hostedtoolcache/Python/3.11.6/x64/lib/python3.11/site-packages/nbclient/client.py:801: CellTimeoutError

-- https://github.com/crate/cratedb-examples/actions/runs/7072661550/job/19253059998?pr=174#step:6:2746

amotl (Member, Author) commented on Dec 4, 2023

We found that the main reason for this was a misconfiguration of the MLFLOW_TRACKING_URL. It has been fixed as part of GH-174, until further notice. Thanks for your support, @andnig!
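
For the record, a minimal sketch of pointing MLflow at an explicit tracking backend (the URI below is a placeholder, not the value used in GH-174; PyCaret's log_experiment=True logs to whatever MLflow resolves here):

    import os
    import mlflow

    # Placeholder tracking backend, for illustration only.
    os.environ.setdefault("MLFLOW_TRACKING_URI", "http://localhost:5000")
    mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
    print(mlflow.get_tracking_uri())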

amotl closed this as completed on Dec 4, 2023
amotl (Member, Author) commented on Feb 10, 2024

Hi again. This issue is still present and is constantly haunting us, which is unfortunate.

The most recent occurrence, just about two hours ago, happened after we tried to re-schedule the corresponding job to run during daytime, as we figured it would work better. Turns out, it doesn't help.

Now, looking a bit closer at the error output, I am just now also spotting this warning:

  /opt/hostedtoolcache/Python/3.11.7/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
  STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
  
  Increase the number of iterations (max_iter) or scale the data as shown in:
      https://scikit-learn.org/stable/modules/preprocessing.html
  Please also refer to the documentation for alternative solver options:
      https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

-- https://github.com/crate/cratedb-examples/actions/runs/7854158803/job/21434611151#step:6:1155

Could that actually be related to the job occasionally (50/50) stalling/freezing/timing out?
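
For reference, the sklearn-level remedy the warning points to would look roughly like this (a sketch only; whether and how PyCaret forwards these settings in this example is a separate question):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Per the warning text: scale the data and/or raise max_iter.
    X, y = make_classification(n_samples=200, random_state=0)
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X, y)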

amotl reopened this on Feb 10, 2024
andnig (Contributor) commented on Feb 10, 2024

Hey Andreas, happy to chime in. 👋

  1. Please think about separating these two topics for more clarity. CellTimeoutError and the NaN input error are mostly two separate issues, if they both still occur. The cell timeout is more often than not related to the Jupyter test runner (or, well, simply to timeouts), while the NaN input error can mean multiple things; the more common ones are that the data is not there, that your infrastructure is about to get killed, or that training iterators are running amok.

  2. As your test infrastructure is quite limited in terms of CPU power, we added a guard on the PYTEST_CURRENT_TEST environment variable which only runs three models that are also rather fast to train. If I remember correctly, we used two ETS model variants and a naive one.
    From the logs you shared it seems, however, that all the models are trained (also, the non-convergence warning is related to a model which we excluded from test runs).

I would suggest utilizing the PYTEST_CURRENT_TEST environment variable for both the .ipynb and the .py tests, to reduce training time and potentially solve both issues related to how you test these notebooks; a sketch follows below. Please just make sure that the environment variable is also "visible" to the Jupyter notebooks. The exact configuration depends on which Jupyter test runner you use.
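
A minimal sketch of that guard, against a synthetic stand-in dataset (the real example reads "total_sales" per "month" from CrateDB; the model shortlist below is only an assumption based on the discussion above):

    import os
    import numpy as np
    import pandas as pd
    from pycaret.time_series import setup, compare_models

    # Synthetic stand-in data, purely for illustration.
    data = pd.DataFrame({
        "month": pd.period_range("2016-01", periods=96, freq="M"),
        "total_sales": np.random.default_rng(42).uniform(100, 200, 96),
    })
    s = setup(data, fh=15, target="total_sales", index="month")

    if "PYTEST_CURRENT_TEST" in os.environ:
        # On CI, only compare a few lightweight models to keep the runtime bounded.
        best = compare_models(include=["ets", "exp_smooth", "naive"])
    else:
        # Outside pytest, let PyCaret evaluate its full model suite.
        best = compare_models()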

I hope this helps so far; let me know how it goes.


PS: You mentioned that the tests fail 50/50, but at a quick glance I was only able to find two failed tasks. Would you mind checking whether the NaN input failures always occur on notebook tests, or also on .py file tests?

amotl (Member, Author) commented on Feb 10, 2024

Hi Andreas, thanks for your quick reply.

From the logs you shared it seems however that all the models(!) are trained, [while we intended to only run a few of them]. [I can] also [spot] a non-converge error, which is related to a model which we excluded for test runs.
[Most probably, PYTEST_CURRENT_TEST is not getting evaluated properly.] Please just make sure that the env vars are "visible" for the jupyter notebooks as well.

That's to the point. I also had the suspicion that the measures we took last time to bring down the required compute resources did not work well, or had flaws, but I have not yet analyzed the log output regarding this topic. So, if you think this is the issue still tripping us, I now have something to hold on to and investigate. Thank you so much!

With kind regards,
Andreas.

amotl (Member, Author) commented on Feb 13, 2024

Hi again. We've explored the situation, and the outcome is that we can confirm that the call to compare_models works well, including its guard using a corresponding if "PYTEST_CURRENT_TEST" in os.environ clause.

  • The guard works when invoking pytest in the local directory, and it works when invoking ngr test from the repository root directory.
  • The guard also works in both files equally well, the pure .py file, and the .ipynb file, so it is apparently not obstructed by pytest / nbtest runners.

I wouldn't know why it should be different on GHA. So, maybe the selected algorithms ["arima", "ets", "exp_smooth"] / ["ets", "et_cds_dt", "naive"] are still too heavy on CPU and/or memory?

andnig (Contributor) commented on Feb 13, 2024

Hi Andreas, if you look at the logs, it's not a timeout error, it's the NaN input error. As mentioned above, I'd suggest keeping these two issues separate. The timeout issue is most probably related to the Jupyter test runner; this NaN input error, however, is not related to Jupyter.

If I look at the failed run, I see that the ESM (exponential smoothing) model has an incredibly high MASE and RMSSE. This mostly indicates that the model is not very well suited for the data. I suggested it because it is very lightweight, but well, too lightweight as it seems 😓

(Screenshot: model comparison leaderboard from the failed run.)

To go forward, you could:

  1. Use a different model for the test run, one which has a lower MASE. Run the whole PyCaret model suite locally and select one of the top-5 models, instead of the exp_smooth one, for your test run (see the sketch after this list).
  2. If this does not help, can you provide some local reproduction steps? If you can reproduce it locally, I'm better able to help.
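
The sketch referenced in item 1 (to be run locally, not on CI; it assumes setup() has already been called on the full dataset, as in the notebook):

    from pycaret.time_series import compare_models, pull

    # Compare the full model suite and keep the five best models by MASE.
    top_models = compare_models(n_select=5, sort="MASE")

    # Inspect the pulled leaderboard (it includes MASE and RMSSE columns) and
    # pick a lightweight entry from the top rows to replace exp_smooth in the
    # CI shortlist.
    leaderboard = pull()
    print(leaderboard.head())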

amotl (Member, Author) commented on Feb 13, 2024

Thanks, and sorry that I mixed up those two different errors again. I've now split them into separate issues, so this one can be closed after carrying over the relevant information.

seut added and then removed the "bug" label on Feb 20, 2024
amotl (Member, Author) commented on Feb 27, 2024

After splitting the issue up into different tickets, but without applying any other fixes, we are currently not facing any problems on nightly runs of the corresponding CI jobs.

Therefore, I am closing the issue now, for the time being. Thanks again, @andnig!

@amotl amotl closed this as completed Feb 27, 2024