dpgen run stuck #937
kluophysics started this conversation in General
I figured it out: the conda environment (sourced by default in my .bashrc file) was somehow interfering with the SSH connection. After disabling it, the jobs now start.
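For anyone hitting the same symptom, one way to check whether shell startup files affect a non-interactive SSH session is to run a no-op remote command and look at what it prints and how long it takes. The sketch below is only a diagnostic aid under that assumption; the helper name `probe_noninteractive` is made up for illustration, and the post does not say how exactly the conda setup interfered.

```python
import subprocess
import time


def probe_noninteractive(command, timeout=30):
    """Run a command that should print nothing and report what happened.

    Returns (stdout, stderr, returncode, elapsed_seconds). Any text in
    stdout, or a long delay, hints that a startup file is doing extra work.
    """
    start = time.monotonic()
    result = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout
    )
    elapsed = time.monotonic() - start
    return result.stdout, result.stderr, result.returncode, elapsed


# Example call, using the hostname and username from machine.json.
# BatchMode=yes makes ssh fail instead of waiting for a password prompt.
# out, err, rc, secs = probe_noninteractive(
#     ["ssh", "-o", "BatchMode=yes", "kluo@localhost", "true"]
# )
```

If `out` is not empty for a command like `true`, something in the login scripts is writing to the session, and temporarily commenting out the conda lines in .bashrc is a quick way to confirm whether they are the source.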
After a fresh installation of dpgen, I tried to run the program as follows. It ran fine before the reinstallation, but now it gets stuck at the first step.
The file inside the scratch dir is empty!
The machine.json is
{
"api_version": "1.0",
"train" :[
{
"command": "/.conda/envs/deepmd/bin/dp",
"machine": {
"batch_type": "slurm",
"context_type": "SSHContext",
"local_root" : "./",
"remote_root": "/public/home/kluo/work/calc/dpgen/scratch",
"remote_profile":{
"hostname": "localhost",
"username": "kluo"
}
},
"resources": {
"number_node": 1,
"cpu_per_node": 8,
"gpu_per_node": 0,
"queue_name": "small",
"group_size": 1,
"source_list": ["/public/home/kluo/work/calc/dpgen/Al-liq-vapor/run/dp.env"]
}
}],
"model_devi":
[{
"command": "/.conda/envs/deepmd/bin/lmp",
"machine": {
"batch_type": "slurm",
"context_type": "SSHContext",
"local_root" : "./",
"remote_root": "/public/home/kluo/work/calc/dpgen/scratch",
"remote_profile":{
"hostname": "localhost",
"username": "kluo"
}
},
"resources": {
"number_node": 1,
"cpu_per_node": 36,
"gpu_per_node": 0,
"queue_name": "medium",
"group_size": 20,
"source_list": ["/public/home/kluo/work/calc/dpgen/Al-liq-vapor/run/lmp.env"]
}
}],
"fp":
[{
"command": "mpirun -np $SLURM_NTASKS vasp_std",
"machine": {
"batch_type": "slurm",
"context_type": "SSHContext",
"local_root" : "./",
"remote_root": "/public/home/kluo/work/calc/dpgen/scratch",
"remote_profile":{
"hostname": "localhost",
"username": "kluo"
}
},
"resources": {
"number_node": 1,
"cpu_per_node": 36,
"gpu_per_node": 0,
"queue_name": "medium",
"group_size": 10,
"custom_flags": ["#SBATCH --time=20-00:00:00", "#SBATCH --job-name=vasp_dpgen"],
"module_purge": true,
"module_list": ["mpi/intelmpi/2017.4.239", "compiler/intel/intel-compiler-2017.5.239", "apps/vasp/5.4.1"],
"source_list": [],
"_comment": "if use 2 nodes, then with #SBATCH --exclude=node5,node20. otherwise, do not include it"
}
}]
}
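Since a pasted config can easily lose a quote or a key, a quick structural check before launching a run may save some time. This is a minimal sketch, not dpgen's own validation: the stage names and the three required keys are simply the ones that appear in the file above, and `check_machine_config` is a hypothetical helper.

```python
import json

# Stage names and keys as they appear in the machine.json above;
# this is not the full schema that dpgen accepts.
STAGES = ("train", "model_devi", "fp")
REQUIRED_KEYS = ("command", "machine", "resources")


def check_machine_config(text):
    """Return a list of human-readable problems found in a machine.json string."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} at line {exc.lineno}"]
    if not isinstance(config, dict):
        return ["top level must be a JSON object"]

    problems = []
    for stage in STAGES:
        entries = config.get(stage)
        if not isinstance(entries, list) or not entries:
            problems.append(f"{stage}: expected a non-empty list")
            continue
        for i, entry in enumerate(entries):
            if not isinstance(entry, dict):
                problems.append(f"{stage}[{i}]: expected an object")
                continue
            for key in REQUIRED_KEYS:
                if key not in entry:
                    problems.append(f"{stage}[{i}]: missing '{key}'")
    return problems
```

Calling it as `check_machine_config(open("machine.json").read())` returns an empty list when the file parses and every stage has a command, a machine block and a resources block.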
The param.json is
{
"type_map": ["Fe"],
"mass_map": [55.847],
"use_ele_temp": 1,
}
During the run, a .tgz file was generated somehow, which wasn't seen before.
$ ls 00.train/
000 001 002 003 data.init eacfa1fffd6cec401a47d6ca4318895982d926ad.tgz
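The hash-like name suggests the archive may be a bundle of the task files packed for upload to the remote side, but that is an assumption and not something confirmed here. Listing its members is a harmless way to find out; the sketch below uses only the standard `tarfile` module, and `list_archive` is a made-up helper name.

```python
import tarfile


def list_archive(path):
    """Return the sorted member names of a gzip-compressed tar archive."""
    with tarfile.open(path, "r:gz") as archive:
        return sorted(archive.getnames())


# Example call with the file name from the listing above:
# for name in list_archive(
#     "00.train/eacfa1fffd6cec401a47d6ca4318895982d926ad.tgz"
# ):
#     print(name)
```

If the members match the 000 to 003 task directories, the archive was most likely created locally and the run stalled somewhere after packing, which would fit the empty scratch dir.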
Any idea why it gets stuck? Thanks!
Kai