✨ Add Multi-GPU Support to v1.1 #1449
Comments
@samet-akcay Is there any code implementation for using multiple GPUs? |
@lemonbuilder, this has now been added to the roadmap. This task would close the following issues: #930 #1110 #1398 |
@samet-akcay, sorry, I got an error when training with multiple GPUs on v1. How can I use only one GPU, for example ID 3, for training? I'm currently using this code for training:
|
@nguyenanhtuan1008, you could refer to this link. In this case, you could initialize the Engine class as: engine = Engine(accelerator="gpu", devices="3") |
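A note on the `devices` argument, as a minimal sketch rather than a confirmed anomalib recipe: in PyTorch Lightning, an integer or a single numeric string such as "3" is read as a device count, while a list is read as explicit device indices. Assuming Engine forwards `accelerator` and `devices` to the Lightning Trainer unchanged, selecting the GPU with index 3 would use the list form. `single_gpu_kwargs` and `build_engine` are hypothetical helper names introduced here for illustration.

```python
def single_gpu_kwargs(gpu_index: int) -> dict:
    """Hypothetical helper: keyword arguments that select one GPU by index."""
    if gpu_index < 0:
        raise ValueError("gpu_index must be non-negative")
    # A list means "these device indices"; a bare 3 or "3" would mean "three devices".
    return {"accelerator": "gpu", "devices": [gpu_index]}


engine_kwargs = single_gpu_kwargs(3)
print(engine_kwargs)


def build_engine():
    """Create the Engine; requires anomalib to be installed (assumed import path)."""
    from anomalib.engine import Engine

    return Engine(**engine_kwargs)
```

Whether the chosen index is honored depends on the anomalib version, since some releases reset the device setting internally.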
@samet-akcay |
Hello, I wish to take this issue. |
Hi @samet-akcay I would like to work on this issue. Can I take this issue? |
@RitikaxShakya, thanks for your interest. I totally missed this one, but it looks like @Bepitic has already shown interest in this. If he doesn't want to work on it, it could be all yours. How does that sound? |
@Bepitic, are you still interested in this issue? If not @RitikaxShakya can take it? |
Yes, for sure. Since no one confirmed it with me, I also forgot about the multi-GPU one 😅 |
sorry about that |
@RitikaxShakya, all yours then |
.take |
@blaz-r @samet-akcay Hello! I need help with the parts of the code that deal with GPU initialization, data parallelization, and GPU-specific operations, as these are the areas I think I'll need to modify to add multi-GPU support. |
I am not that familiar with these topics within Anomalib. @ashwinvaidya17 could you provide some insight here? |
@ashwinvaidya17 Hello! Please help me with the parts of the code that deal with GPU initialization, data parallelization, and GPU-specific operations, as these are the areas I think I'll need to modify to add multi-GPU support. |
@RitikaxShakya currently we override the number of devices to 1 in Engine and the CLI. To start with, we should remove these lines.
anomalib/src/anomalib/engine/engine.py Line 305 in debdae7
anomalib/src/anomalib/utils/config.py Line 130 in debdae7
Doing this will break a bunch of stuff across the repo.
I might have missed something so feel free to report any difficulties you run into. |
Using latest anomalib 1.1.0 from pip I create the Engine like so:
engine = Engine(
    max_epochs=100,
    task=task_type,
    accelerator="gpu",
    devices=-1,
)
By passing
However, when looking at the output from
Is this a bug, or how can I use all my GPUs for training with anomalib 1.1.0? |
@haimat, we aim to enable multi-gpu support in v1.2 |
@samet-akcay Thanks for your quick reply. |
yeah, that's the plan hopefully :) |
Hello guys, any news on this? Do you have an ETA for 1.2 and multi-GPU training? |
@samet-akcay Hello Samet, can you estimate when multi-GPU training will be available? |
@haimat unfortunately we don't have an exact timeline for this. Currently, we are busy with some other high-priority tasks. |
@samet-akcay
# Create the model and engine
model = Patchcore()
# Train a Patchcore model on the given datamodule
engine.train(datamodule=datamodule, model=model)
What is the default number of epochs? |
@goldwater668, as mentioned above, multi-GPU is not currently supported.
anomalib/src/anomalib/engine/engine.py Lines 327 to 328 in 2bd2842
|
@samet-akcay Can you specify the GPU ID? |
Yes, you could specify the GPU ID.
If you are experiencing out-of-memory issues with Patchcore, your dataset is probably too large to fit in a Patchcore memory bank. You could configure the Patchcore arguments to make it more memory efficient, for example by switching to a more efficient backbone or changing the layers to extract.
anomalib/src/anomalib/models/image/patchcore/lightning_model.py Lines 25 to 48 in 2bd2842
|
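The memory-saving options mentioned above could be collected as keyword arguments, roughly as follows. This is a sketch only: the parameter names (`backbone`, `layers`, `coreset_sampling_ratio`) are assumed to match the Patchcore constructor in the referenced lightning_model.py, and the values are illustrative rather than tuned recommendations. `build_patchcore` is a hypothetical helper name.

```python
# Illustrative settings that aim for a smaller memory bank (assumed parameter names).
patchcore_kwargs = {
    "backbone": "resnet18",            # lighter feature extractor than a wide ResNet
    "layers": ["layer2", "layer3"],    # intermediate layers to extract features from
    "coreset_sampling_ratio": 0.01,    # keep a smaller fraction of patch embeddings
}


def build_patchcore():
    """Create the model; requires anomalib to be installed (assumed import path)."""
    from anomalib.models import Patchcore

    return Patchcore(**patchcore_kwargs)


print(sorted(patchcore_kwargs))
```

A lower sampling ratio shrinks the memory bank at a possible cost in detection quality, so the trade-off is worth checking on a validation set.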
@samet-akcay I specified GPU cards 1 and 2 for training. However, during training, I still trained on card 0. Is there anything wrong with the GPU specified in the above settings? |
@goldwater668, you currently cannot set multiple GPUs as it will be mapped back to a single GPU. With that being said, I noticed that Engine always configures the device to run on the default GPU even when the user explicitly chooses a specific GPU. I've created a PR to fix this |
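Until the fix lands, a common workaround for pinning a process to one physical GPU is the `CUDA_VISIBLE_DEVICES` environment variable, which the CUDA runtime reads when it initializes. The sketch below assumes the variable is set before torch or anomalib is imported. `restrict_visible_gpus` is a hypothetical helper name, not part of anomalib.

```python
import os


def restrict_visible_gpus(*gpu_indices: int) -> str:
    """Hypothetical helper: expose only the given physical GPUs to this process."""
    value = ",".join(str(index) for index in gpu_indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Expose only physical GPU 1. Inside this process it is then addressed as device 0.
restrict_visible_gpus(1)

# Import torch / anomalib only after this point so the setting takes effect.
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

The same effect can be had from the shell by prefixing the training command with `CUDA_VISIBLE_DEVICES=1`.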
I have created an official feature issue here: #2258 I'm closing this one. Those who are interested in this feature can follow the above issue. |
@samet-akcay I used EfficientAd for 10 epochs and single category training to get the following results: Epoch 9/9 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43944/43944 0:23:59 ? 0:00:00 30.44it/s train_st_step: 0.530 train_ae_step: 0.337 train_stae_step: 0.042 train_loss_step: 0.909
image_AUROC: 0.906 image_F1Score: 0.709 train_st_epoch: 0.553 train_ae_epoch: 0.428
Epoch 9/9 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43944/43944 0:23:59 ? 0:00:00 30.44it/s train_st_step: 0.530 train_ae_step: 0.337 train_stae_step: 0.042 train_loss_step: 0.909
image_AUROC: 0.956 image_F1Score: 0.699 train_st_epoch: 0.541 train_ae_epoch: 0.425
train_stae_epoch: 0.056 train_loss_epoch: 1.022
F1Score class exists for backwards compatibility. It will be removed in v1.1. Please use BinaryF1Score from torchmetrics instead
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric ┃ DataLoader 0 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ image_AUROC │ 0.6983320713043213 │
│ image_F1Score │ 0.6690346598625183 │
└───────────────────────────┴───────────────────────────┘
predictions = engine.predict(
    datamodule=datamodule,
    model=model,
    ckpt_path="latest/weights/lightning/model.ckpt",
)
When predicting data, do I have to go through datamodule = Folder()? Is there a way to test the image directly? |
@goldwater668, the post above is not related to this issue. Could you create a Q&A in the Discussions section? |
What is the motivation for this task?
I'm going to train on a custom dataset using the EfficientAd model.
How do I train or test using Multi-GPU?
Please, tell me which command is used.
Describe the solution you'd like
Currently, I'm training using only a single device.
$ python3 tools/train.py --model efficient_ad
Additional context
No response