Not Reproducible Training Result #6

WPCJATH opened this issue Jan 3, 2025 · 1 comment

WPCJATH commented Jan 3, 2025

We attempted to train the model by following the instructions provided in the Train Diffusion Texture Painting guide. However, the results we obtained were significantly lower in quality than those of the pre-trained checkpoints provided in the repository.

Experiment Setup

Here is a detailed description of our training setup:

Observations and Issues

Below are the key issues we encountered during training:

  1. Validation Results: The model struggles to inpaint certain samples effectively. We suspect that the random mask generation algorithm might be a contributing factor, especially as it sometimes produces masks that cover very large or even entire regions in black or white (see the illustrative sketch after this list).
    Final Validation Result

  2. Training Loss: The training loss does not decrease as expected, which suggests potential issues with the training process or configuration.
    Train Loss

  3. Drawing Results: The generated stamps are not connected properly, and the previews fail to reproduce the intended brush strokes.
    Drawing Result
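
To make the suspicion in point 1 concrete, here is a minimal illustrative sketch of the kind of adjustment we are asking about: a random rectangle-mask generator with a cap on total coverage, so the mask never degenerates into an almost entirely masked image. This is not the repository's actual masking code; the function and parameter names are our own.

  import numpy as np

  def random_rect_mask(height, width, max_coverage=0.5, max_rects=4, rng=None):
      """Illustrative random inpainting mask: a union of a few rectangles,
      capped so the masked area never covers (nearly) the whole image."""
      rng = np.random.default_rng() if rng is None else rng
      mask = np.zeros((height, width), dtype=np.float32)
      for _ in range(rng.integers(1, max_rects + 1)):
          h = rng.integers(height // 8, height // 2)
          w = rng.integers(width // 8, width // 2)
          top = rng.integers(0, height - h)
          left = rng.integers(0, width - w)
          mask[top:top + h, left:left + w] = 1.0
          # Stop once the coverage cap is reached so the mask never becomes
          # an (almost) fully masked image.
          if mask.mean() >= max_coverage:
              break
      return mask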

Request for Assistance

Given these issues, we kindly request guidance from the authors or maintainers on the following points:

  • Are there specific considerations or adjustments to the random mask generation algorithm that could improve performance, particularly for large regions of missing pixels?
  • Could the abnormal training loss behavior be related to the dataset, optimizer settings, or any overlooked configuration details?
  • Are there any additional settings, debugging tips, or known limitations of the current implementation that might help us achieve better results?

We would greatly appreciate your insights or suggestions. Thank you for your work on this project, and we look forward to your response.

anita-hu (Collaborator) commented Feb 18, 2025

The training loss behavior is expected for diffusion models. Since the model predicts the noise at randomly sampled timesteps, the loss varies with the sampled timestep: earlier timesteps are harder to denoise (higher loss at that training step), while later timesteps are easier (lower loss at that training step).
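
For intuition, here is a generic epsilon-prediction objective in the DDPM style (a minimal sketch, not this repository's actual training loop): the loss at each step is computed at a randomly drawn timestep, so the logged value fluctuates with the sampled noise level instead of decreasing monotonically.

  import torch
  import torch.nn.functional as F

  def diffusion_step_loss(model, x0, alphas_cumprod):
      """Generic epsilon-prediction objective: the per-step loss depends on the
      randomly sampled timestep t, so it fluctuates rather than decreasing smoothly."""
      t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],), device=x0.device)
      noise = torch.randn_like(x0)
      a = alphas_cumprod[t].view(-1, 1, 1, 1)       # cumulative noise schedule at t
      x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise  # forward (noising) process
      return F.mse_loss(model(x_t, t), noise)       # predict the added noise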

The released model weights were trained for 100 epochs using 8 GPUs with a batch size of 32 per GPU. For a single GPU, you can try training with gradient accumulation to reach the same total training batch size of 256.
This was my setup:

  09/07/2023 02:51:23 - INFO - __main__ - ***** Running training *****
  09/07/2023 02:51:23 - INFO - __main__ -   Num examples = 5640
  09/07/2023 02:51:23 - INFO - __main__ -   Num Epochs = 100
  09/07/2023 02:51:23 - INFO - __main__ -   Instantaneous batch size per device = 32
  09/07/2023 02:51:23 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 256
  09/07/2023 02:51:23 - INFO - __main__ -   Gradient Accumulation steps = 1
  09/07/2023 02:51:23 - INFO - __main__ -   Total optimization steps = 2300
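
As a quick sanity check against the log above (an illustrative calculation, not part of the training code), the numbers are consistent, and the same arithmetic gives the gradient accumulation setting needed to match the effective batch size on a single GPU:

  import math

  # Numbers taken from the training log above.
  num_examples = 5640
  per_device_batch = 32
  num_gpus = 8          # setup used for the released weights
  grad_accum = 1
  epochs = 100

  effective_batch = per_device_batch * num_gpus * grad_accum   # 256
  steps_per_epoch = math.ceil(num_examples / effective_batch)  # 23
  total_steps = steps_per_epoch * epochs                       # 2300, matches the log

  # Single-GPU equivalent: keep the effective batch at 256 via gradient accumulation.
  single_gpu_grad_accum = effective_batch // per_device_batch  # 8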
