Are there any detailed steps for training our own data? #27

Open
Oops37 opened this issue Mar 26, 2024 · 10 comments
Comments

@Oops37
Contributor

Oops37 commented Mar 26, 2024

This is great work. We hope to use 4K4D to reconstruct our own scenes. When do you expect the training code to be updated?

@dendenxu
Member

Hi, thanks for the interest!
I'm finalizing some example training scripts and documentation and I should be able to release them around the CVPR camera-ready deadline (should be..)

@dendenxu
Member

Hi @Oops37, I just released the training code and documentation for training on custom datasets. Check it out in our readme.
Feel free to drop in a comment if you encounter any issues and I'll be happy to help.

@Oops37
Contributor Author

Oops37 commented Mar 27, 2024

Thanks for sharing! When I tried to train on the enerf_outdoor dataset on Windows 11, I got the following error message:
[screenshot: 2024-03-27 214536]
[screenshot: 2024-03-27 214630]
But I have installed PyOpenGL 3.1.7.
[screenshot: 2024-03-27 215255]
By the way, no error was reported when I used the GUI to render.

@dendenxu
Member

Hi, unfortunately we don't support training on Windows, only rendering, since I haven't found a way to make screenless rendering work on Windows (as shown by the error). I would suggest an Ubuntu Linux environment or a WSL2 setup.

@Oops37
Contributor Author

Oops37 commented Mar 29, 2024

I have installed 4K4D and its dependencies on autodl. During training, an EGL error still occurs, as follows:
[screenshot]
[screenshot]
Does this project not support training in a headless environment?

@dendenxu
Member

Training should work fine in a headless environment. This looks like a driver issue. Could you share the environment you're running the training on (Python, PyTorch, Linux, and NVIDIA driver versions)? If the driver version is too old, a possible fix is to upgrade your NVIDIA driver.
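
For anyone asked for the same details, here is a minimal sketch that collects them in one go. It is only an illustration: the helper name `collect_env_report` is made up for this example, PyTorch is treated as optional, and the driver version is read from `nvidia-smi` if that tool happens to be on the PATH.

```python
import platform
import shutil
import subprocess
import sys


def collect_env_report():
    """Gather Python, OS, PyTorch, and NVIDIA driver versions into a dict."""
    report = {
        "python": sys.version.split()[0],
        "os": platform.platform(),
    }
    try:
        import torch  # optional: only reported if it is installed
        report["pytorch"] = torch.__version__
        report["cuda (pytorch build)"] = torch.version.cuda
    except ImportError:
        report["pytorch"] = "not installed"

    # nvidia-smi ships with the NVIDIA driver; report its absence instead of failing
    if shutil.which("nvidia-smi"):
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=False,
        )
        report["nvidia driver"] = result.stdout.strip() or "query failed"
    else:
        report["nvidia driver"] = "nvidia-smi not found"
    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

Pasting the printed lines into the issue should cover everything requested here.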

Another known issue is related to docker (I'm not sure whether it's a similar situation on autodl): NVIDIA/nvidia-docker#1520
If that's the case, it can be solved as documented here: https://github.com/zju3dv/4K4D/blob/main/easyvolcap/utils/egl_utils.py#L25
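
As a rough sketch of that workaround, the snippet below writes a `10_nvidia.json` vendor file. Treat it as an assumption-laden illustration: the JSON contents are the ones commonly used for the NVIDIA GLVND vendor file, the default directory is one of the usual GLVND locations, and the helper name `write_nvidia_egl_vendor` is invented for this example. Check the linked comment in `egl_utils.py` for the project's exact recommendation.

```python
import json
from pathlib import Path

# Contents commonly used for the NVIDIA GLVND EGL vendor file (assumed here)
NVIDIA_EGL_VENDOR = {
    "file_format_version": "1.0.0",
    "ICD": {"library_path": "libEGL_nvidia.so.0"},
}


def write_nvidia_egl_vendor(directory="/usr/share/glvnd/egl_vendor.d"):
    """Write 10_nvidia.json into `directory` unless it already exists."""
    target = Path(directory) / "10_nvidia.json"
    if target.exists():
        return target  # leave an existing vendor file untouched
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(NVIDIA_EGL_VENDOR, indent=4) + "\n")
    return target
```

Calling `write_nvidia_egl_vendor()` with the default directory usually needs root privileges. Passing a scratch directory first is an easy way to inspect the output before touching system paths.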

@Oops37
Contributor Author

Oops37 commented Mar 29, 2024

Here is my environment:
Linux: Ubuntu 20.04
Python 3.10.14 + PyTorch 1.13.0 + CUDA 11.6
NVIDIA driver version: 510.60.02
The driver cannot be updated and the 10_nvidia.json file cannot be found because it is a cloud server rented from the autodl website.

@dendenxu
Member

The driver looks new enough.

But I'm curious as to why the 10_nvidia.json file couldn't be created? Is it because of insufficient privilege?

@Oops37
Contributor Author

Oops37 commented Mar 29, 2024

I'm really sorry to bother you... This was my mistake. I tried creating 10_nvidia.json under both "/etc/glvnd/egl_vendor.d" and "/usr/share/glvnd/egl_vendor.d", but a new EGL error occurred. The complete output is as follows:

(easyvolcap) root@autodl-container-45f511a5e8-b65a70b6:~/4K4D# evc -c configs/exps/4k4d/4k4d_actor1_4_r4.yaml,configs/specs/static.yaml,configs/specs/tiny.yaml
2024-03-29 17:43:10.850257 main -> preflight: Starting experiment: 4k4d_actor1_4_r4, command: train main.py:80
2024-03-29 easyvolcap.dataload… Preparing vhulls data/enerf_outdoor/actor1_4/surfs TRAIN 100% ━━━━━━━━━━━━━━━━━━━━━ 1/1 0:00:00 < 0:00:00 ? it/s v…
17:43:11.516507 -> load_vhulls:
2024-03-29 easyvolcap.utils.… Loading mask bytes for data/enerf_outdoor/actor1_4/bgmtv2 TRAIN 100% ━━━━━━━━━━━━━━━━━ 18/18 0:00:00 < 0:00:00 574.7 it/s p…
17:43:11.692238 ->
load_resize_undis…
2024-03-29 easyvolcap.utils.… Loading imgs bytes for data/enerf_outdoor/actor1_4/images TRAIN 100% ━━━━━━━━━━━━━━━━━ 18/18 0:00:00 < 0:00:00 654.0 it/s p…
17:43:12.511301 ->
load_resize_undis…
2024-03-29 easyvolcap.dataloader… Preparing vhulls data/enerf_outdoor/actor1_4/surfs VAL 100% ━━━━━━━━━━━━━━━━━━━━━ 1/1 0:00:00 < 0:00:00 ? it/s v…
17:43:12.718030 -> load_vhulls:
2024-03-29 easyvolcap.utils.… Loading mask bytes for data/enerf_outdoor/actor1_4/bgmtv2 VAL 100% ━━━━━━━━━━━━━━━━━━ 18/18 0:00:00 < 0:00:00 458.5 it/s p…
17:43:12.892547 ->
load_resize_undis…
2024-03-29 easyvolcap.utils.… Loading imgs bytes for data/enerf_outdoor/actor1_4/images VAL 100% ━━━━━━━━━━━━━━━━━━ 18/18 0:00:00 < 0:00:00 400.6 it/s p…
17:43:13.741633 ->
load_resize_undis…
2024-03-29 easyvolcap.models.s… Loading init pcds from data/enerf_outdoor/actor1_4/surfs 100% ━━━━━━━━━━━━━━━━━━━━━ 1/1 0:00:00 < 0:00:00 ? it/s p…
17:43:13.912998 -> init:
2024-03-29 easyvolcap.models.s… Loading init pcds from data/enerf_outdoor/actor1_4/surfs 100% ━━━━━━━━━━━━━━━━━━━━━ 1/1 0:00:00 < 0:00:00 ? it/s p…
17:43:15.255820 -> init:
2024-03-29 17:43:15.269679 easyvolcap.utils.net_utils -> load_pretrained: Pretrained network: does not exist net_utils.py:294
2024-03-29 17:43:15.270982 easyvolcap.utils.net_utils -> load_pretrained: Pretrained network: does not exist net_utils.py:294
2024-03-29 17:43:15.278386 easyvolcap.runners.visualizers.volumetric_video_visualizer -> init: Visualization output: volumetric_video_visualizer.py:80
data/result/4k4d_actor1_4_r4/{RENDER,DEPTH,ALPHA}
2024-03-29 17:43:15.284505 easyvolcap.runners.recorders -> init: Saved config file to recorders.py:105
data/record/4k4d_actor1_4_r4/4k4d_actor1_4_r4_1711705395.yaml
2024-03-29 17:43:15.286848 easyvolcap.runners.optimizers -> ConfigurableOptimizer: Starting learning rate config: {'lr': 0.005, 'eps': 1e-15, optimizers.py:48
'weight_decay': 0.0}
2024-03-29 17:43:15.288581 easyvolcap.runners.optimizers -> ConfigurableOptimizer: Special learning rate config: {'lr': {'pcds': 0.0005, optimizers.py:54
'resd_regressor': 0.0005, 'geo_regressor': 0.0005}}
2024-03-29 17:43:15.299501 main -> train: Number of optimizable parameters: 5644236 (5.64 M) main.py:259
2024-03-29 17:43:15.300802 main -> launcher: Launching runner for experiment: 4k4d_actor1_4_r4 main.py:50
2024-03-29 17:43:25.700940 easyvolcap.utils.gl_utils -> : Could not import EGL related modules. ImportError: eglQueryDeviceAttribEXT is not gl_utils.py:47
available.
2024-03-29 17:43:25.935400 easyvolcap.models.samplers.point_planes_sampler -> prepare_opengl: Init eglctx with h, w: 1080, 1920 point_planes_sampler.py:491
2024-03-29 17:43:27.065016 easyvolcap.utils.console_utils -> inner: Runtime exception: eglQueryDeviceAttribEXT is not available. console_utils.py:391
╭─────────────────────────────────────────────────── Traceback (most recent call last) ────────────────────────────────────────────────────╮
│ /root/4K4D/easyvolcap/utils/egl_utils.py:107 in <module> │
│ │
│ ❱ 107 │ eglQueryDeviceAttribEXT = PFNEGLQUERYDEVICESATTRIBEXTPROC( │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: argument must be callable or integer function address

The above exception was the direct cause of the following exception:

╭─────────────────────────────────────────────────── Traceback (most recent call last) ────────────────────────────────────────────────────╮
│ /root/4K4D/easyvolcap/utils/console_utils.py:388 in inner │
│ │
│ ❱ 388 │ │ │ │ return func(*args, **kwargs) │
│ │
│ /root/4K4D/easyvolcap/scripts/main.py:272 in main │
│ │
│ ❱ 272 │ else: globals()args.type # invoke this (call callable_from_cfg -> call_from_cfg) │
│ │
│ /root/4K4D/easyvolcap/engine/registry.py:56 in inner │
│ │
│ ❱ 56 │ │ return call_from_cfg(func, cfg) │
│ │
│ /root/4K4D/easyvolcap/engine/registry.py:47 in call_from_cfg │
│ │
│ ❱ 47 │ return func(**call_args) │
│ │
│ /root/4K4D/easyvolcap/scripts/main.py:267 in train │
│ │
│ ❱ 267 │ launcher(**kwargs, runner_function=runner.train, runner_object=runner) │
│ │
│ /root/4K4D/easyvolcap/scripts/main.py:52 in launcher │
│ │
│ ❱ 52 │ runner_function() │
│ │
│ /root/4K4D/easyvolcap/runners/volumetric_video_runner.py:288 in train │
│ │
│ ❱ 288 │ │ │ next(train_generator) # avoid reconstruction of the dataloader │
│ │
│ /root/4K4D/easyvolcap/runners/volumetric_video_runner.py:350 in train_generator │
│ │
│ ❱ 350 │ │ │ │ output: dotdict = self.model(batch) # random dict storing various forms of output │
│ │
│ /root/miniconda3/envs/easyvolcap/lib/python3.10/site-packages/torch/nn/modules/module.py:1190 in _call_impl │
│ │
│ ❱ 1190 │ │ │ return forward_call(*input, **kwargs) │
│ │
│ /root/4K4D/easyvolcap/models/volumetric_video_model.py:245 in forward │
│ │
│ ❱ 245 │ │ output = rendering_function(*input, batch=batch) │
│ │
│ /root/4K4D/easyvolcap/models/volumetric_video_model.py:104 in render_rays │
│ │
│ ❱ 104 │ │ xyz, dir, t, dist = self.sampler.sample(ray_o, ray_d, near, far, t, batch) # B, P, S │
│ │
│ /root/4K4D/easyvolcap/utils/net_utils.py:95 in sample │
│ │
│ ❱ 95 │ │ │ self.forward(batch) │
│ │
│ /root/4K4D/easyvolcap/models/samplers/r4dvb_sampler.py:113 in forward │
│ │
│ ❱ 113 │ │ rgb, acc, dpt = self.bg_sampler.render_points(xyz, rgb, rad, occ, batch) # B, HW, C │
│ │
│ /root/4K4D/easyvolcap/models/samplers/r4dv_sampler.py:71 in render_points │
│ │
│ ❱ 71 │ │ │ rgb, acc, dpt = super().render_points(xyz, rgb, rad, occ, batch) # almost always use render_cudagl │
│ │
│ /root/4K4D/easyvolcap/models/samplers/point_planes_sampler.py:411 in render_points │
│ │
│ ❱ 411 │ │ │ elif self.use_diffgl: return self.render_diffgl(*args, **kwargs) │
│ │
│ /root/4K4D/easyvolcap/models/samplers/point_planes_sampler.py:532 in render_diffgl │
│ │
│ ❱ 532 │ │ self.prepare_opengl('diffgl', HardwarePeeling, self.dtype, self.dtype, batch.meta.H[0].item(), batch.meta.W[0].item(), xyz │
│ │
│ /root/4K4D/easyvolcap/models/samplers/point_planes_sampler.py:492 in prepare_opengl │
│ │
│ ❱ 492 │ │ │ from easyvolcap.utils.egl_utils import eglContextManager │
│ │
│ /root/4K4D/easyvolcap/utils/egl_utils.py:111 in <module> │
│ │
│ ❱ 111 │ raise ImportError('eglQueryDeviceAttribEXT is not available.') from e │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
ImportError: eglQueryDeviceAttribEXT is not available.
*** eglQueryDeviceAttribEXT is not available.

> /root/4K4D/easyvolcap/utils/egl_utils.py(111)<module>()
109 )
110 except TypeError as e:
--> 111 raise ImportError('eglQueryDeviceAttribEXT is not available.') from e
112
113 # From the EGL_EXT_platform_device extension.

(Pdbr) exit

@dendenxu
Member

Hi, it looks like the autodl server doesn't support this particular egl function: eglQueryDeviceAttribEXT.

This function is used to match the EGL_DEVICE_ID with the CUDA Context id. If the CUDA_VISIBLE_DEVICES environment variable is unset, this function shouldn't be called. So maybe you could try commenting out the try-except block here: https://github.com/zju3dv/4K4D/blob/main/easyvolcap/utils/egl_utils.py#L106, make sure CUDA_VISIBLE_DEVICES is not set and try running again.
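
To check whether a machine exposes this function at all, a small probe along the following lines may help. This is a sketch, not the project's code: it goes through `ctypes` directly rather than PyOpenGL, and the helper name `egl_function_address` is made up for this example. The `TypeError` in the log appears consistent with the lookup returning NULL, since that is the error ctypes raises when a function prototype is given `None` instead of an address. Note that a non-NULL address does not by itself guarantee the extension works at runtime.

```python
import ctypes
import ctypes.util


def egl_function_address(name, library="EGL"):
    """Return the address eglGetProcAddress reports for `name`.

    Returns None when the library cannot be located or loaded, or when
    the lookup yields NULL.
    """
    path = ctypes.util.find_library(library)
    if path is None:
        return None  # no such library on this machine
    try:
        egl = ctypes.CDLL(path)
    except OSError:
        return None
    get_proc = egl.eglGetProcAddress
    get_proc.restype = ctypes.c_void_p  # a NULL pointer comes back as None
    get_proc.argtypes = [ctypes.c_char_p]
    return get_proc(name.encode("ascii"))


if __name__ == "__main__":
    address = egl_function_address("eglQueryDeviceAttribEXT")
    status = "missing" if address is None else hex(address)
    print("eglQueryDeviceAttribEXT:", status)
```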

PS: We have a test for such obscure EGL errors in tests/headless_opengl_tests.py. It should be easier to run that script using:

python tests/headless_opengl_tests.py

than the training command.
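
A minimal sketch of running that test with `CUDA_VISIBLE_DEVICES` removed from the environment, as suggested above. The helper names are hypothetical and the script path assumes you are in the repository root.

```python
import os
import subprocess
import sys


def env_without_cuda_visible_devices(env=None):
    """Return a copy of `env` (default: os.environ) without CUDA_VISIBLE_DEVICES."""
    cleaned = dict(os.environ if env is None else env)
    cleaned.pop("CUDA_VISIBLE_DEVICES", None)
    return cleaned


def run_headless_test(script="tests/headless_opengl_tests.py"):
    """Run the test script with the current interpreter and return its exit code."""
    completed = subprocess.run(
        [sys.executable, script],
        env=env_without_cuda_visible_devices(),
        check=False,
    )
    return completed.returncode
```

Calling `run_headless_test()` from the repository root runs the script in a child process only, so the variable stays untouched in your own shell.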
