AutoML: CI trips with CellTimeoutError / ValueError: Input contains NaN. #170
Thinking about it once more, it is more likely that some dependency library of PyCaret was not pinned correctly, and that something changed in this area.
@amotl
That's true, but I am talking about transitive dependencies of PyCaret. I think that is the most likely reason, but sure, it can also be something different.
Hours and hours of debugging into dependencies of PyCaret, googling the term "transitive dependencies" - just to find that the test still ran on Python 3.10 - the life of a developer is fun 😄 Can you confirm that the test is green? To be honest, I'm not sure if this issue is really resolved yet, as the PyCaret time series notebook test was always green, but the script version of it failed. Smells like a flaky test or environment. Let's monitor the situation - but as the PR test is green for now, I will not invest more time for now. Is that good for you?
Thank you very much for your efforts. Sure, let's merge the PR, close this issue, and monitor the situation for similar events in the future.
I am just re-running the most recently failed run https://github.com/crate/cratedb-examples/actions/runs/7027445018, in order to rule out that it is related to the time of day when the test is executed. If it fails again, it is likely that the upgrade to Python 3.11 resolved the situation in one way or another, and that your debugging efforts had a positive outcome.
Aha, it is green again, so it was actually just a fluke. However, it is an interesting one which can also easily hit production applications, depending on what the actual root cause was.
This is related to how the tests in our repo here are designed. The model training pipeline itself is not of concern - see some of the reasons why this error happens above. I know this error quite well from my projects - it happens if the data are not available as expected.
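As an illustration of that failure mode, a small guard before handing data to the training pipeline can surface missing values early, instead of the ValueError appearing deep inside the model code. This is only a sketch, assuming the input is a pandas DataFrame named `data`:

```python
import pandas as pd

def assert_no_missing_values(data: pd.DataFrame) -> None:
    """Fail early with a clear message if any column contains NaN values."""
    missing = data.isna().sum()
    missing = missing[missing > 0]
    if not missing.empty:
        raise ValueError(f"Input contains missing values:\n{missing}")

# assert_no_missing_values(data)  # call this before PyCaret's setup()
```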
Hi again. I think the root cause for this is actually the venerable CellTimeoutError:

E nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 300 seconds.
E The message was: Cell execution timed out.
E Here is a preview of the cell contents:
E -------------------
E s = setup(data, fh=15, target="total_sales", index="month", log_experiment=True)
E -------------------
/opt/hostedtoolcache/Python/3.11.6/x64/lib/python3.11/site-packages/nbclient/client.py:801: CellTimeoutError

^^ Do you see any chance to make this spot more efficient on CI, @andnig?

With kind regards,

-- https://github.com/crate/cratedb-examples/actions/runs/7072661550/job/19251742707?pr=174#step:6:2870
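As a side note, the traceback shows the notebook is executed through nbclient with a 300-second per-cell limit. Independent of making the cell itself faster, the per-cell timeout can be raised when driving the notebook programmatically. A minimal sketch, assuming direct use of nbclient; the repository's actual test runner and its configuration may differ, and the notebook filename is only a guess:

```python
import nbformat
from nbclient import NotebookClient

# Execute the notebook with a larger per-cell timeout (in seconds).
nb = nbformat.read("automl_timeseries_forecasting_with_pycaret.ipynb", as_version=4)
client = NotebookClient(nb, timeout=600, kernel_name="python3")
client.execute()
```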
Another occurrence of the venerable CellTimeoutError:

E nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 300 seconds.
E The message was: Cell execution timed out.
E Here is a preview of the cell contents:
E -------------------
E from pycaret.classification import setup, compare_models, tune_model, ensemble_model, blend_models, automl, \
E     evaluate_model, finalize_model, save_model, predict_model
E
E s = setup(
E     data,
E     target="Churn",
E     ignore_features=["customerID"],
E     log_experiment=True,
E     fix_imbalance=True,
E )
E -------------------
/opt/hostedtoolcache/Python/3.11.6/x64/lib/python3.11/site-packages/nbclient/client.py:801: CellTimeoutError

-- https://github.com/crate/cratedb-examples/actions/runs/7072661550/job/19253059998?pr=174#step:6:2746
Hi again. This issue is still present, and is constantly haunting us, which is unfortunate. The most recent occurrence, just about two hours ago, happened after we tried to re-schedule the corresponding job to run at daytime, as we figured it would work better. Turns out, it doesn't help. Now, looking a bit closer at the error output, I am also spotting this warning:

/opt/hostedtoolcache/Python/3.11.7/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

-- https://github.com/crate/cratedb-examples/actions/runs/7854158803/job/21434611151#step:6:1155

Could that actually be related to the job occasionally (50/50) stalling/freezing/timing out?
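For reference, the warning itself suggests two remedies: raising max_iter or scaling the data. In plain scikit-learn terms that would look roughly like the sketch below; PyCaret configures its estimators internally, so this is illustrative rather than a drop-in fix for the notebook:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale the features and allow more lbfgs iterations, addressing both
# causes named in the ConvergenceWarning.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
# model.fit(X_train, y_train)  # X_train / y_train stand in for the real data
```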
Hey Andreas, happy to chime in. 👋
I would suggest utilizing the PYTEST_CURRENT_TEST environment variable for both the ipynb and the py tests, to reduce training time and potentially solve both issues related to how you test these notebooks. Please just make sure that the env vars are "visible" to the Jupyter notebooks as well; the exact configuration depends on which Jupyter test runner you use. I hope this helps so far, let me know how it goes. PS: You mentioned that the tests fail 50/50, but on a quick glance I was only able to find two failed tasks. Would you mind checking whether the "input contains NaN" failures always occur on notebook tests, or also on .py file tests?
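To illustrate the suggestion: pytest exports PYTEST_CURRENT_TEST into its own environment, and a notebook kernel started as a child process normally inherits it, so the notebook code can detect a CI test run and shrink the workload. A sketch under those assumptions; the model lists and fold counts are made up for illustration:

```python
import os

# Detect whether the notebook/script is being executed by pytest.
RUNNING_UNDER_PYTEST = "PYTEST_CURRENT_TEST" in os.environ

if RUNNING_UNDER_PYTEST:
    candidate_models = ["naive", "ets"]  # lightweight estimators only
    folds = 2
else:
    candidate_models = ["arima", "ets", "exp_smooth"]
    folds = 3

# Later, hand these to PyCaret, for example:
# best = compare_models(include=candidate_models, fold=folds)
```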
Hi Andreas, thanks for your quick reply.
That's to the point. I also had the suspicion that the measures we took last time, to bring down the required compute resources, did not work well or had flaws, but I did not analyze the log output on this topic yet. So, if you think this is still the issue tripping us, I now have something to hold on to and investigate. Thank you so much! With kind regards,
Hi again. We've explored the situation, and the outcome is that we can confirm that the call to
I wouldn't know why it should be different on GHA. So, maybe the selected algorithms ["arima", "ets", "exp_smooth"] / ["ets", "et_cds_dt", "naive"] are still too heavy on CPU and/or memory?
Hi Andreas, if you look at the logs, it's not a timeout error, it's the NaN input error. As mentioned above, I'd suggest keeping these two issues separated. The timeout issue is most probably related to the Jupyter test runner. This input NaN error, however, is not related to Jupyter. If I look at the failed run, I see the esm model has an incredibly high MASE and RMSSE. This mostly indicates that the model is not very well suited for the data. I suggested it, as it is very lightweight, but well, too lightweight as it seems 😓 To go forward, you could:
Thanks, and sorry that I mixed up those two different errors again. I've diverted those into separate issues now, so this one can be closed after carrying over the relevant information.
After splitting the issue up into different tickets, but without applying any other fixes, we are currently not facing any problems on nightly runs of the corresponding CI jobs. Therefore, I am closing the issue now, for the time being. Thanks again, @andnig!
Dear @andnig,
the CI caught an error from automl_timeseries_forecasting_with_pycaret.py [1]. Apparently, it started tripping like this only yesterday [2], so it is likely the error is related to changed input data.
However, the result of debugging this error may well converge into a corresponding issue at PyCaret, because its promises are so high. On the other hand, the code may just need a particular data cleansing step to accommodate the situation. May I ask you to have a look?
With kind regards,
Andreas.
Footnotes
1. https://github.com/crate/cratedb-examples/actions/runs/7027445018/job/19121837475#step:6:1492 ↩
2. https://github.com/crate/cratedb-examples/actions/runs/7013620591/job/19080058650 ↩