Programmatic execution of notebooks #2031
Signed-off-by: miguelgfierro <[email protected]>
Weird error. The input is 100k, but the regex parser inputs 10k
Similar error but with a different notebook:
Signed-off-by: miguelgfierro <[email protected]>
@miguelgfierro Sorry to ask a dumb question, as I missed the discussion, but why are we reinventing the wheel here? Couldn't
Signed-off-by: miguelgfierro <[email protected]>
It seems that papermill is also not maintained: https://pypi.org/project/papermill/#history. It hasn't been updated in over a year. MLflow for recording is an interesting idea; the only problem is that it would add another dependency. One of the reasons to do this from scratch is to reduce dependencies. This code doesn't add any new dependency, and it gives us the same functionality we had. If, in the future, there is an appetite to record the data with MLflow instead, we can add it.
Thank you for the changes. These are huge changes and I really appreciate your hard work! I left a few comments that are not critical, so feel free to fix them now or leave them for later.
I think at some point we'll want to split this "notebook util" into a separate project/package, for two reasons: 1) it's not specific to "recommenders", and 2) this utility is super useful for any DS project that has notebook examples, so it would be very beneficial for them to be able to use it.
Signed-off-by: miguelgfierro <[email protected]>
Signed-off-by: Simon Zhao <[email protected]>
@miguelgfierro I think the pattern matching is incorrect. See the example below, which uses the pattern matching from execute_notebook():
>>> import re
>>> pattern = re.compile(rf"\bmy_param\s*=\s*([^#\n]+)(?:#.*$)?", re.MULTILINE)
>>> cell_source = "\"my_param = 'abc'\n\", \"another_param = 'abc'\n\""
>>> matches = re.findall(pattern, "\"my_param = 'abc'\n\", \"another_param = 'abc'\n\"")
>>> matches
["'abc'"]
>>> cell_source.replace(matches[0].strip(), '10')
'"my_param = 10\n", "another_param = 10\n"'
Every parameter whose value is 'abc' is changed above, not only my_param.
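One possible way to avoid this, sketched below, is to substitute the value in place with `re.sub` so that only the line assigning the named parameter is touched. This is not necessarily the fix that was committed in this PR: `replace_parameter` is a hypothetical helper, and the sketch assumes the cell source is plain Python text with one assignment per line.

```python
import re


def replace_parameter(cell_source, name, value):
    """Replace the value assigned to `name`, leaving every other line untouched.

    Hypothetical helper for illustration only. Known limitation: a '#'
    inside a string value is treated as the start of a comment.
    """
    pattern = re.compile(
        # group 1: "name = ", group 2: old value, group 3: trailing comment
        rf"^([ \t]*{re.escape(name)}[ \t]*=(?!=)[ \t]*)([^#\n]+?)([ \t]*(?:#.*)?)$",
        re.MULTILINE,
    )
    # A function replacement avoids backslash escapes in repr(value)
    # being interpreted by re.sub.
    return pattern.sub(
        lambda m: f"{m.group(1)}{value!r}{m.group(3)}", cell_source, count=1
    )


source = "my_param = 'abc'\nanother_param = 'abc'\n"
print(replace_parameter(source, "my_param", 10))
```

Because the substitution is anchored to the assignment itself rather than to the old value, `another_param` keeps its `'abc'` even though it shares the value with `my_param`.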
Signed-off-by: Simon Zhao <[email protected]>
@miguelgfierro I fixed the pattern matching bug. Now a new error is caught. I'll take a look the day after tomorrow.
@SimonYansenZhao can we modularize the parameter pattern matching & replace part to pull out from
Signed-off-by: Simon Zhao <[email protected]>
"import tensorflow as tf\n",
"tf.get_logger().setLevel(\"ERROR\") # only show error messages\n",
"tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)\n",
"\n",
"from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources, prepare_hparams\n",
"from recommenders.models.deeprec.models.dkn import DKN\n",
"from recommenders.models.deeprec.io.dkn_iterator import DKNTextIterator\n",
"from recommenders.utils.notebook_utils import store_metadata\n",
There is a weird error in the DKN notebook. It's a timeout. I have never seen this error.
It might be related to a bad configuration of CUDA (see below)? Let me rerun that test.
    @pytest.mark.notebooks
    @pytest.mark.gpu
    def test_dkn_quickstart(notebooks, output_notebook, kernel_name):
        notebook_path = notebooks["dkn_quickstart"]
>       execute_notebook(
            notebook_path,
            output_notebook,
            kernel_name=kernel_name,
            parameters=dict(EPOCHS=1, BATCH_SIZE=500),
        )
tests/unit/examples/test_notebooks_gpu.py:118:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
recommenders/utils/notebook_utils.py:107: in execute_notebook
executed_notebook, _ = execute_preprocessor.preprocess(
/azureml-envs/azureml_34c56b1c46d7f5ae137d78e9a4192235/lib/python3.8/site-packages/nbconvert/preprocessors/execute.py:102: in preprocess
self.preprocess_cell(cell, resources, index)
/azureml-envs/azureml_34c56b1c46d7f5ae137d78e9a4192235/lib/python3.8/site-packages/nbconvert/preprocessors/execute.py:123: in preprocess_cell
cell = self.execute_cell(cell, index, store_history=True)
/azureml-envs/azureml_34c56b1c46d7f5ae137d78e9a4192235/lib/python3.8/site-packages/jupyter_core/utils/__init__.py:173: in wrapped
return loop.run_until_complete(inner)
/azureml-envs/azureml_34c56b1c46d7f5ae137d78e9a4192235/lib/python3.8/asyncio/base_events.py:616: in run_until_complete
return future.result()
/azureml-envs/azureml_34c56b1c46d7f5ae137d78e9a4192235/lib/python3.8/site-packages/nbclient/client.py:1005: in async_execute_cell
exec_reply = await self.task_poll_for_reply
/azureml-envs/azureml_34c56b1c46d7f5ae137d78e9a4192235/lib/python3.8/site-packages/nbclient/client.py:806: in _async_poll_for_reply
error_on_timeout_execute_reply = await self._async_handle_timeout(timeout, cell)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <nbconvert.preprocessors.execute.ExecutePreprocessor object at 0x152370dcb400>
timeout = 600
cell = {'cell_type': 'code', 'execution_count': 7, 'metadata': {'pycharm': {'is_executing': False}, 'scrolled': True, 'execut...\x1b[49m\x1b[43m)\x1b[49m\n', '\x1b[0;31mKeyboardInterrupt\x1b[0m: ']}], 'source': 'model.fit(train_file, valid_file)'}
    async def _async_handle_timeout(
        self, timeout: int, cell: NotebookNode | None = None
    ) -> None | dict[str, t.Any]:
        self.log.error("Timeout waiting for execute reply (%is)." % timeout)
        if self.interrupt_on_timeout:
            self.log.error("Interrupting kernel")
            assert self.km is not None
            await ensure_async(self.km.interrupt_kernel())
            if self.error_on_timeout:
                execute_reply = {"content": {**self.error_on_timeout, "status": "error"}}
                return execute_reply
            return None
        else:
            assert cell is not None
>           raise CellTimeoutError.error_from_timeout_and_cell(
                "Cell execution timed out", timeout, cell
            )
E nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 600 seconds.
E The message was: Cell execution timed out.
E Here is a preview of the cell contents:
E -------------------
E model.fit(train_file, valid_file)
E -------------------
/azureml-envs/azureml_34c56b1c46d7f5ae137d78e9a4192235/lib/python3.8/site-packages/nbclient/client.py:856: CellTimeoutError
----------------------------- Captured stdout call -----------------------------
ERROR:traitlets:Timeout waiting for execute reply (600s).
----------------------------- Captured stderr call -----------------------------
2023-11-18 07:19:58.273399: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /azureml-envs/azureml_34c56b1c46d7f5ae137d78e9a4192235/lib:
2023-11-18 07:19:58.273444: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-11-18 07:20:01.260672: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /azureml-envs/azureml_34c56b1c46d7f5ae137d78e9a4192235/lib:
2023-11-18 07:20:01.260819: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /azureml-envs/azureml_34c56b1c46d7f5ae137d78e9a4192235/lib:
2023-11-18 07:20:01.260909: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /azureml-envs/azureml_34c56b1c46d7f5ae137d78e9a4192235/lib:
2023-11-18 07:20:01.260994: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /azureml-envs/azureml_34c56b1c46d7f5ae137d78e9a4192235/lib:
2023-11-18 07:20:01.261077: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /azureml-envs/azureml_34c56b1c46d7f5ae137d78e9a4192235/lib:
2023-11-18 07:20:01.261163: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /azureml-envs/azureml_34c56b1c46d7f5ae137d78e9a4192235/lib:
2023-11-18 07:20:01.261246: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /azureml-envs/azureml_34c56b1c46d7f5ae137d78e9a4192235/lib:
2023-11-18 07:20:01.261330: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /azureml-envs/azureml_34c56b1c46d7f5ae137d78e9a4192235/lib:
2023-11-18 07:20:01.261341: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2023-11-18 07:20:02.067266: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-18 07:20:02.068875: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
------------------------------ Captured log call -------------------------------
ERROR traitlets:client.py:845 Timeout waiting for execute reply (600s).
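Since TensorFlow reports "Skipping registering GPU devices", one plausible reading is that training fell back to the CPU and therefore exceeded the 600 s cell timeout. A quick way to check the environment, sketched below under that assumption, is to ask the dynamic loader for each library named in the warnings. `find_missing_libs` is a hypothetical helper; the library names are copied from the log above.

```python
import ctypes

# Shared libraries that TensorFlow failed to load in the log above.
CUDA_LIBS = [
    "libcudart.so.11.0",
    "libcublas.so.11",
    "libcublasLt.so.11",
    "libcufft.so.10",
    "libcurand.so.10",
    "libcusolver.so.11",
    "libcusparse.so.11",
    "libcudnn.so.8",
]


def find_missing_libs(names):
    """Return the entries of `names` that the dynamic loader cannot open."""
    missing = []
    for name in names:
        try:
            ctypes.CDLL(name)  # raises OSError if the library is not found
        except OSError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in find_missing_libs(CUDA_LIBS):
        print(f"missing: {name}")
```

Running this inside the test environment before the GPU notebooks would show whether the CUDA 11 libraries are on the loader path at all, independently of TensorFlow.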
@SimonYansenZhao this is the current error. I believe it is not related to the multiline problem
The pattern matching in the notebook utils currently cannot extract multiline parameter values. For example, in the following test (recommenders/tests/unit/examples/test_notebooks_gpu.py, lines 77 to 94 in b000b78), the value substitution for
RANKING_METRICS = [
    evaluator.ndcg_at_k.__name__,
    evaluator.precision_at_k.__name__,
]
leads to the following result:
So the current solution is to rewrite all multiline parameters on one line. See the commit.
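To make the limitation concrete, here is a small sketch that applies the single-line pattern quoted earlier in this thread to both forms of the parameter. `extract_value` is a hypothetical helper, and interpolating the parameter name into the pattern is an assumption about how execute_notebook builds it.

```python
import re


def extract_value(cell_source, name):
    """Return the text the single-line pattern captures for `name`, or None."""
    pattern = re.compile(rf"\b{name}\s*=\s*([^#\n]+)(?:#.*$)?", re.MULTILINE)
    matches = pattern.findall(cell_source)
    return matches[0].strip() if matches else None


multiline = (
    "RANKING_METRICS = [\n"
    "    evaluator.ndcg_at_k.__name__,\n"
    "    evaluator.precision_at_k.__name__,\n"
    "]\n"
)
one_line = (
    "RANKING_METRICS = "
    "[evaluator.ndcg_at_k.__name__, evaluator.precision_at_k.__name__]\n"
)

# The capture group stops at the first newline, so only "[" is extracted.
print(extract_value(multiline, "RANKING_METRICS"))
# With the value on one line, the whole list expression is extracted.
print(extract_value(one_line, "RANKING_METRICS"))
```

The character class `[^#\n]+` cannot cross a line break, which is why rewriting the parameter on a single line is enough to work around the problem.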
@loomlike Sure, but we need to make all tests pass before refactoring.
There is an error: the system doesn't install CUDA 11, but CUDA 12:
I tried to
I tried to
Tried remove
Tried to comment pytorch -> still getting installed cuda 12 https://github.com/recommenders-team/recommenders/actions/runs/6987799139/job/19014709919
Tried commenting pytorch, fastai, tfslim and leave only
Tried with TF and torch -> Torch is installing cuda12 like
Trying
Trying
Try again without nvidia-nvjitlink-cu11 -> I still get the timeout error. See https://github.com/recommenders-team/recommenders/actions/runs/7007905771/job/19063048904
Try
Installed in local with:
Got the same error:
Signed-off-by: Simon Zhao <[email protected]>
Description
This PR removes papermill and scrapbook and reimplements the same functionality without adding new dependencies.
Related Issues
Fixes #2012
References
Checklist:
staging branch and not to main branch.