Not Reproducible Training Result #6

WPCJATH opened this issue Jan 3, 2025 · 1 comment

WPCJATH commented Jan 3, 2025

We attempted to train the model by following the instructions provided in the Train Diffusion Texture Painting guide. However, the results we obtained were significantly lower in quality than those of the pre-trained checkpoints provided in the repository.

Experiment Setup

Here is a detailed description of our training setup:

Observations and Issues

Below are the key issues we encountered during training:

  1. Validation Results: The model struggles to inpaint certain samples effectively. We suspect that the random mask generation algorithm might be a contributing factor, especially as it sometimes produces masks that cover very large or even entire regions in black or white (see the illustrative sketch after this list).
    Final Validation Result

  2. Training Loss: The training loss does not decrease as expected, which suggests potential issues with the training process or configuration.
    Train Loss

  3. Drawing Results: The generated stamps are not connected properly, and the previews fail to reproduce the intended brush strokes.
    Drawing Result
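
To make the suspicion in point 1 concrete, here is a minimal illustrative sketch of the kind of adjustment we are asking about: a random rectangle-mask generator with a cap on total coverage, so the mask never degenerates into an almost entirely masked image. This is not the repository's actual masking code; the function and parameter names are our own.

  import numpy as np

  def random_rect_mask(height, width, max_coverage=0.5, max_rects=4, rng=None):
      """Illustrative random inpainting mask: a union of a few rectangles,
      capped so the masked area never covers (nearly) the whole image."""
      rng = np.random.default_rng() if rng is None else rng
      mask = np.zeros((height, width), dtype=np.float32)
      for _ in range(rng.integers(1, max_rects + 1)):
          h = rng.integers(height // 8, height // 2)
          w = rng.integers(width // 8, width // 2)
          top = rng.integers(0, height - h)
          left = rng.integers(0, width - w)
          mask[top:top + h, left:left + w] = 1.0
          # Stop once the coverage cap is reached so the mask never becomes
          # an (almost) fully masked image.
          if mask.mean() >= max_coverage:
              break
      return mask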

Request for Assistance

Given these issues, we kindly request guidance from the authors or maintainers on the following points:

  • Are there specific considerations or adjustments to the random mask generation algorithm that could improve performance, particularly for large regions of missing pixels?
  • Could the abnormal training loss behavior be related to the dataset, optimizer settings, or any overlooked configuration details?
  • Are there any additional settings, debugging tips, or known limitations of the current implementation that might help us achieve better results?

We would greatly appreciate your insights or suggestions. Thank you for your work on this project, and we look forward to your response.

anita-hu (Collaborator) commented Feb 18, 2025

The training loss behavior is expected for diffusion models. Since the model predicts the noise at randomly sampled timesteps, the loss varies with the sampled timestep: earlier timesteps are harder to denoise (higher loss at that training step), while later timesteps are easier (lower loss at that training step).
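
For intuition, here is a generic epsilon-prediction objective in the DDPM style (a minimal sketch, not this repository's actual training loop): the loss at each step is computed at a randomly drawn timestep, so the logged value fluctuates with the sampled noise level instead of decreasing monotonically.

  import torch
  import torch.nn.functional as F

  def diffusion_step_loss(model, x0, alphas_cumprod):
      """Generic epsilon-prediction objective: the per-step loss depends on the
      randomly sampled timestep t, so it fluctuates rather than decreasing smoothly."""
      t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],), device=x0.device)
      noise = torch.randn_like(x0)
      a = alphas_cumprod[t].view(-1, 1, 1, 1)       # cumulative noise schedule at t
      x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise  # forward (noising) process
      return F.mse_loss(model(x_t, t), noise)       # predict the added noise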

The released model weights were trained for 100 epochs using 8 GPUs with a batch size of 32 per GPU. For a single GPU, you can try training with gradient accumulation to reach the same total training batch size of 256.
This was my setup:

  09/07/2023 02:51:23 - INFO - __main__ - ***** Running training *****
  09/07/2023 02:51:23 - INFO - __main__ -   Num examples = 5640
  09/07/2023 02:51:23 - INFO - __main__ -   Num Epochs = 100
  09/07/2023 02:51:23 - INFO - __main__ -   Instantaneous batch size per device = 32
  09/07/2023 02:51:23 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 256
  09/07/2023 02:51:23 - INFO - __main__ -   Gradient Accumulation steps = 1
  09/07/2023 02:51:23 - INFO - __main__ -   Total optimization steps = 2300
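
As a quick sanity check against the log above (an illustrative calculation, not part of the training code), the numbers are consistent, and the same arithmetic gives the gradient accumulation setting needed to match the effective batch size on a single GPU:

  import math

  # Numbers taken from the training log above.
  num_examples = 5640
  per_device_batch = 32
  num_gpus = 8          # setup used for the released weights
  grad_accum = 1
  epochs = 100

  effective_batch = per_device_batch * num_gpus * grad_accum   # 256
  steps_per_epoch = math.ceil(num_examples / effective_batch)  # 23
  total_steps = steps_per_epoch * epochs                       # 2300, matches the log

  # Single-GPU equivalent: keep the effective batch at 256 via gradient accumulation.
  single_gpu_grad_accum = effective_batch // per_device_batch  # 8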
