Code Hanging At Start of Training #46

Open
AseemGill opened this issue Jul 13, 2023 · 0 comments

Hi, I am running the CDVAE carbon experiment and have run into a strange problem: the code hangs after completing three iterations of the first epoch, with no error message.

I run `python cdvae/run.py data=carbon expname=carbon model.predict_property=True`

The output I see is this:

[2023-07-13 16:57:36,190][hydra.utils][INFO] - Instantiating <cdvae.pl_data.datamodule.CrystDataModule>
[2023-07-13 16:57:37,161][numexpr.utils][INFO] - Note: detected 128 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2023-07-13 16:57:37,161][numexpr.utils][INFO] - Note: NumExpr detected 128 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
 25%|█████████████████████▍                                                                | 1521/6091 [00:25<01:29, 50.81it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 34%|█████████████████████████████▎                                                        | 2080/6091 [00:34<01:05, 61.51it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 46%|███████████████████████████████████████▊                                              | 2820/6091 [00:46<01:02, 52.70it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 50%|██████████████████████████████████████████▋                                           | 3021/6091 [00:49<00:52, 58.26it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 51%|███████████████████████████████████████████▍                                          | 3079/6091 [00:50<00:54, 55.41it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 51%|████████████████████████████████████████████▏                                         | 3132/6091 [00:51<00:40, 72.78it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 52%|████████████████████████████████████████████▎                                         | 3140/6091 [00:51<00:49, 59.77it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 60%|███████████████████████████████████████████████████▊                                  | 3673/6091 [00:59<00:38, 63.39it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 66%|████████████████████████████████████████████████████████▋                             | 4018/6091 [01:05<00:32, 63.75it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 67%|█████████████████████████████████████████████████████████▌                            | 4077/6091 [01:06<00:33, 60.74it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 67%|█████████████████████████████████████████████████████████▊                            | 4098/6091 [01:06<00:29, 67.92it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 69%|███████████████████████████████████████████████████████████▊                          | 4233/6091 [01:08<00:29, 63.53it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 74%|████████████████████████████████████████████████████████████████                      | 4536/6091 [01:13<00:23, 67.50it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 80%|████████████████████████████████████████████████████████████████████▋                 | 4869/6091 [01:18<00:16, 72.17it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 84%|████████████████████████████████████████████████████████████████████████              | 5106/6091 [01:22<00:18, 53.65it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 91%|██████████████████████████████████████████████████████████████████████████████▌       | 5566/6091 [01:29<00:08, 63.96it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 95%|█████████████████████████████████████████████████████████████████████████████████▋    | 5786/6091 [01:33<00:05, 59.95it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 96%|██████████████████████████████████████████████████████████████████████████████████▏   | 5822/6091 [01:33<00:04, 64.72it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 98%|████████████████████████████████████████████████████████████████████████████████████▎ | 5974/6091 [01:36<00:01, 66.91it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
100%|██████████████████████████████████████████████████████████████████████████████████████| 6091/6091 [01:39<00:00, 61.48it/s]
/gpfs/fs1/home/cdvae-old/cdvae/cdvae/common/data_utils.py:644: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at  /data/miniconda3/envs/opence-1.7/conda-bld/pytorch-base_1663986328871/work/torch/csrc/utils/tensor_new.cpp:201.)
  targets = torch.tensor([d[key] for d in data_list])
/gpfs/fs1/home/cdvae-old/cdvae/cdvae/common/data_utils.py:612: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  X = torch.tensor(X, dtype=torch.float)
[2023-07-13 16:59:20,540][hydra.utils][INFO] - Instantiating <cdvae.pl_modules.model.CDVAE>
[2023-07-13 16:59:20,615][torch.distributed.nn.jit.instantiator][INFO] - Created a temporary directory at /tmp/tmpwv1glt9u
[2023-07-13 16:59:20,615][torch.distributed.nn.jit.instantiator][INFO] - Writing /tmp/tmpwv1glt9u/_remote_module_non_scriptable.py
[2023-07-13 16:59:53,346][hydra.utils][INFO] - Passing scaler from datamodule to model <StandardScalerTorch(means: -154.2510223388672, stds: 0.13738815486431122)>
[2023-07-13 16:59:53,348][hydra.utils][INFO] - Adding callback <LearningRateMonitor>
[2023-07-13 16:59:53,349][hydra.utils][INFO] - Adding callback <EarlyStopping>
[2023-07-13 16:59:53,350][hydra.utils][INFO] - Adding callback <ModelCheckpoint>
[2023-07-13 16:59:53,354][hydra.utils][INFO] - Instantiating <WandbLogger>
wandb: Currently logged in as: _. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.15.5
wandb: Run data is saved locally in /home/cdvae-old/cdvae/wabdb/wandb/run-20230713_165954-u04zv43g
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run carbon
wandb: ⭐️ View project at https://wandb.ai/_/crystal_generation_mit
wandb: 🚀 View run at https://wandb.ai/_/crystal_generation_mit/runs/u04zv43g
[2023-07-13 17:00:07,550][hydra.utils][INFO] - W&B is now watching <{cfg.logging.wandb_watch.log}>!
wandb: logging graph, to disable use `wandb.watch(log_graph=False)`
[2023-07-13 17:00:07,588][hydra.utils][INFO] - Instantiating the Trainer
/home/.conda/envs/cdvae/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/callback_connector.py:96: LightningDeprecationWarning: Setting `Trainer(progress_bar_refresh_rate=20)` is deprecated in v1.5 and will be removedin v1.7. Please pass `pytorch_lightning.callbacks.progress.TQDMProgressBar` with `refresh_rate` directly to the Trainer's `callbacks` argument instead. Or, to disable the progress bar pass `enable_progress_bar = False` to the Trainer.
  rank_zero_deprecation(
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[2023-07-13 17:00:07,650][hydra.utils][INFO] - Starting training!
  0%|                                                                                         | 2/6091 [00:00<32:19,  3.14it/s

I am running on MIST HPC, so I have turned off WandB logging.
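
Note that the log above still shows wandb syncing the run online. A minimal sketch of forcing offline mode from Python before the logger is created, assuming the installed wandb version honors the `WANDB_MODE` environment variable; `force_wandb_offline` is a hypothetical helper name and not part of CDVAE or wandb:

```python
import os


def force_wandb_offline(env=None):
    """Set WANDB_MODE=offline in the given environment mapping.

    Hypothetical helper: it must be called before the wandb logger is
    instantiated, otherwise the setting has no effect on the current run.
    """
    if env is None:
        env = os.environ
    env["WANDB_MODE"] = "offline"
    return env
```

Calling `force_wandb_offline()` at the top of the entry script would be one place to try this; exporting `WANDB_MODE=offline` in the job script should be equivalent.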

Environment

Package                  Version           Editable project location
------------------------ ----------------- --------------------------------------------------
absl-py                  1.4.0
aiofiles                 22.1.0
aiohttp                  3.8.4
aiosignal                1.3.1
aiosqlite                0.18.0
altair                   5.0.1
antlr4-python3-runtime   4.8
anyio                    3.5.0
appdirs                  1.4.4
argon2-cffi              21.3.0
argon2-cffi-bindings     21.2.0
ase                      3.22.0
astor                    0.8.1
astroid                  2.14.2
asttokens                2.0.5
async-timeout            4.0.2
attrs                    22.1.0
autopep8                 2.0.2
av                       9.2.0
Babel                    2.11.0
backcall                 0.2.0
backports.zoneinfo       0.2.1
base58                   2.1.1
beautifulsoup4           4.12.2
bleach                   4.1.0
blinker                  1.6.2
Bottleneck               1.3.5
brotlipy                 0.7.0
cachetools               5.3.1
cdvae                    0.0.1             
certifi                  2023.5.7
cffi                     1.15.1
charset-normalizer       2.0.4
click                    8.0.4
colorama                 0.4.6
comm                     0.1.2
configparser             6.0.0
contourpy                1.0.5
coverage                 7.2.2
cryptography             39.0.1
cycler                   0.11.0
debugpy                  1.5.1
decorator                5.1.1
defusedxml               0.7.1
dill                     0.3.6
distlib                  0.3.6
dnspython                2.3.0
docker-pycreds           0.4.0
emmet-core               0.60.1
entrypoints              0.4
exceptiongroup           1.0.4
executing                0.8.3
fastjsonschema           2.16.2
filelock                 3.12.0
fonttools                4.25.0
frozenlist               1.3.3
fsspec                   2023.4.0
future                   0.18.3
gitdb                    4.0.10
GitPython                3.1.32
google-auth              2.22.0
google-auth-oauthlib     1.0.0
googledrivedownloader    0.4
grpcio                   1.48.2
higher                   0.2.1
html5lib                 1.1
hydra-core               1.1.0
hydra-joblib-launcher    1.1.5
idna                     3.4
importlib-metadata       6.0.0
importlib-resources      5.12.0
iniconfig                1.1.1
ipykernel                6.19.2
ipython                  8.12.0
ipython-genutils         0.2.0
ipywidgets               8.0.4
isodate                  0.6.1
isort                    5.9.3
jedi                     0.18.1
Jinja2                   3.1.2
joblib                   1.2.0
json5                    0.9.6
jsonschema               4.17.3
jupyter_client           8.1.0
jupyter_core             5.3.0
jupyter-events           0.6.3
jupyter_server           2.5.0
jupyter_server_fileid    0.9.0
jupyter_server_terminals 0.4.4
jupyter_server_ydoc      0.8.0
jupyter-ydoc             0.2.4
jupyterlab               3.6.3
jupyterlab-pygments      0.1.2
jupyterlab_server        2.22.0
jupyterlab-widgets       3.0.5
kiwisolver               1.4.4
latexcodec               2.0.1
lazy-object-proxy        1.6.0
lightning-utilities      0.7.1
lxml                     4.9.2
Markdown                 3.4.3
MarkupSafe               2.1.1
matminer                 0.7.3
matplotlib               3.7.1
matplotlib-inline        0.1.6
mccabe                   0.7.0
mistune                  0.8.4
monty                    2023.5.8
mp-api                   0.33.3
mpmath                   1.3.0
msgpack                  1.0.5
multidict                6.0.4
multiprocess             0.70.14
munkres                  1.1.4
nbclassic                0.5.5
nbclient                 0.5.13
nbconvert                6.5.4
nbformat                 5.7.0
nest-asyncio             1.5.6
networkx                 2.8.4
nglview                  3.0.6
notebook                 6.5.4
notebook_shim            0.2.2
numexpr                  2.8.4
numpy                    1.23.5
oauthlib                 3.2.2
omegaconf                2.1.2
p-tqdm                   1.3.3
packaging                23.0
palettable               3.3.3
pandas                   1.5.3
pandocfilters            1.5.0
parso                    0.8.3
pathos                   0.3.0
pathtools                0.1.2
pexpect                  4.8.0
pickleshare              0.7.5
Pillow                   9.4.0
Pint                     0.21.1
pip                      23.1.2
pkgutil_resolve_name     1.3.10
platformdirs             3.2.0
plotly                   5.15.0
pluggy                   1.0.0
pox                      0.3.2
ppft                     1.7.6.6
prometheus-client        0.14.1
promise                  2.3
prompt-toolkit           3.0.36
protobuf                 3.19.6
psutil                   5.9.0
ptyprocess               0.7.0
pure-eval                0.2.2
py                       1.11.0
pyarrow                  8.0.0
pyasn1                   0.5.0
pyasn1-modules           0.3.0
pybtex                   0.24.0
pycodestyle              2.10.0
pycparser                2.21
pydantic                 1.10.11
pydeck                   0.8.1b0
pyDeprecate              0.3.1
pyg-nightly              2.4.0.dev20230711
Pygments                 2.15.1
pylint                   2.16.2
pymatgen                 2023.7.11
pymongo                  4.4.0
pyOpenSSL                23.0.0
pyparsing                3.0.9
pyrsistent               0.18.0
PySocks                  1.7.1
pytest                   7.3.1
pytest-cov               4.0.0
python-dateutil          2.8.2
python-dotenv            1.0.0
python-json-logger       2.0.7
python-louvain           0.15
pytorch-lightning        1.6.5
pytz                     2022.7
PyYAML                   5.4.1
pyzmq                    25.1.0
rdflib                   6.1.1
requests                 2.29.0
requests-oauthlib        1.3.1
rfc3339-validator        0.1.4
rfc3986-validator        0.1.1
rsa                      4.9
ruamel.yaml              0.17.32
ruamel.yaml.clib         0.2.7
scikit-learn             1.2.2
scipy                    1.8.1
Send2Trash               1.8.0
sentencepiece            0.1.96
sentry-sdk               1.28.0
setproctitle             1.3.2
setuptools               67.8.0
shortuuid                1.0.11
six                      1.16.0
SMACT                    2.2.1
smmap                    5.0.0
sniffio                  1.2.0
soupsieve                2.4
spglib                   2.0.2
stack-data               0.2.0
streamlit                0.79.0
subprocess32             3.5.4
sympy                    1.12
tabulate                 0.8.10
tenacity                 8.2.2
tensorboard              2.13.0
tensorboard-data-server  0.7.1
terminado                0.17.1
threadpoolctl            2.2.0
tinycss2                 1.2.1
toml                     0.10.2
tomli                    2.0.1
tomlkit                  0.11.1
toolz                    0.12.0
torch                    1.12.1
torch-cluster            1.6.1
torch-geometric          1.7.2
torch-scatter            2.0.8
torch-sparse             0.6.10
torch-spline-conv        1.2.2
torchdiffeq              0.0.1
torchmetrics             1.0.0
torchtext                0.13.1a0+35066f2
torchvision              0.13.1
tornado                  6.2
tqdm                     4.65.0
traitlets                5.7.1
typing_extensions        4.6.3
tzlocal                  5.0.1
uncertainties            3.1.7
urllib3                  1.26.16
validators               0.20.0
virtualenv               20.22.0
wandb                    0.15.5
watchdog                 3.0.0
wcwidth                  0.2.5
webencodings             0.5.1
websocket-client         0.58.0
Werkzeug                 2.3.6
wheel                    0.38.4
widgetsnbextension       4.0.5
wrapt                    1.14.1
y-py                     0.5.9
yacs                     0.1.6
yarl                     1.9.2
ypy-websocket            0.8.2
zipp                     3.11.0

Any suggestions on how to resolve this? I am not very familiar with Hydra and PyTorch Lightning.
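
To help narrow down where the run stalls, one option is a watchdog built on Python's standard `faulthandler` module, which dumps the stack of every thread if a call is still running after a timeout. The sketch below is an illustration only: `run_with_hang_watchdog` is a hypothetical helper, not part of CDVAE, Hydra, or PyTorch Lightning, and wrapping the training call with it is an assumption about where it would be used.

```python
import faulthandler
import sys


def run_with_hang_watchdog(fn, timeout_s=60.0, repeat=True, file=None):
    """Run fn() and dump all thread stacks if it exceeds timeout_s.

    Hypothetical helper. `file` must be a real file object with a file
    descriptor; it defaults to sys.stderr.
    """
    if file is None:
        file = sys.stderr
    # Arm the watchdog: after timeout_s seconds (and again every
    # timeout_s seconds if repeat=True) the tracebacks of all threads
    # are written to `file`. The process is not killed.
    faulthandler.dump_traceback_later(timeout_s, repeat=repeat, file=file)
    try:
        return fn()
    finally:
        # Disarm the watchdog once fn() returns or raises.
        faulthandler.cancel_dump_traceback_later()
```

For example, wrapping the training call as `run_with_hang_watchdog(lambda: trainer.fit(model, datamodule), timeout_s=120)` (names assumed) would print a traceback every two minutes while the process is stuck, showing which call it is blocked in.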
