Training Fails #72

agombert · 2024-12-19T18:52:59Z

Hey, I got a strange error during training when running:

python DocLayout-YOLO/train.py --data $DATA_PATH/config --model m-doclayout --epoch $EPOCHS --image-size 1024 --batch-size $BATCH_SIZE --patience $PATIENCE --project $OUTPUT_PATH --optimizer Adam --lr0 0.001 --pretrain $MODEL_PATH --device 0

train: Scanning /home/ubuntu/trocr_handwritten/trocr_handwritten/parse/data/labels/train.cache... 386 images, 10 backgrounds, 0 corrupt: 100%|██████████| 386/386 [00:00<?, ?it/s]
/home/ubuntu/miniconda/envs/doclayout_yolo/lib/python3.10/site-packages/albumentations/core/composition.py:250: UserWarning: Got processor for bboxes, but no transform to process it.
  self._set_keys()
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01, num_output_channels=3, method='weighted_average'), CLAHE(p=0.01, clip_limit=(1.0, 4.0), tile_grid_size=(8, 8))
val: Scanning /home/ubuntu/trocr_handwritten/trocr_handwritten/parse/data/labels/val.cache... 97 images, 3 backgrounds, 0 corrupt: 100%|██████████| 97/97 [00:00<?, ?it/s]
optimizer: Adam(lr=0.001, momentum=0.9) with parameter groups 171 weight(decay=0.0), 184 weight(decay=0.0005), 183 bias(decay=0.0)
Image sizes 1024 train, 1024 val
Using 1 dataloader workers
Logging results to yolo_ft/yolov10m-doclayout_data/config_epoch50_imgsz1024_bs8_pretrain_unknown3
Starting training for 50 epochs...

      Epoch    GPU_mem     box_om     cls_om     dfl_om     box_oo     cls_oo     dfl_oo  Instances       Size
  0%|          | 0/49 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/train.py", line 63, in <module>
    results = model.train(
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/doclayout_yolo/engine/model.py", line 660, in train
    self.trainer.train()
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/doclayout_yolo/engine/trainer.py", line 214, in train
    self._do_train(world_size)
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/doclayout_yolo/engine/trainer.py", line 364, in _do_train
    for i, batch in pbar:
  File "/home/ubuntu/miniconda/envs/doclayout_yolo/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/doclayout_yolo/data/build.py", line 49, in __iter__
    yield next(self.iterator)
  File "/home/ubuntu/miniconda/envs/doclayout_yolo/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 701, in __next__
    data = self._next_data()
  File "/home/ubuntu/miniconda/envs/doclayout_yolo/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1465, in _next_data
    return self._process_data(data)
  File "/home/ubuntu/miniconda/envs/doclayout_yolo/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1491, in _process_data
    data.reraise()
  File "/home/ubuntu/miniconda/envs/doclayout_yolo/lib/python3.10/site-packages/torch/_utils.py", line 715, in reraise
    raise exception
cv2.error: Caught error in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/miniconda/envs/doclayout_yolo/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 351, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/home/ubuntu/miniconda/envs/doclayout_yolo/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/miniconda/envs/doclayout_yolo/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/doclayout_yolo/data/base.py", line 266, in __getitem__
    return self.transforms(self.get_image_and_label(index))
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/doclayout_yolo/data/augment.py", line 74, in __call__
    data = t(data)
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/doclayout_yolo/data/augment.py", line 74, in __call__
    data = t(data)
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/doclayout_yolo/data/augment.py", line 534, in __call__
    img, M, scale = self.affine_transform(img, border)
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/doclayout_yolo/data/augment.py", line 434, in affine_transform
    img = cv2.warpAffine(img, M[:2], dsize=self.size, borderValue=(114, 114, 114))
cv2.error: OpenCV(4.10.0) :-1: error: (-5:Bad argument) in function 'warpAffine'
> Overload resolution failed:
>  - src is not a numpy array, neither a scalar
>  - Expected Ptr<cv::UMat> for argument 'src'

It looks like OpenCV has a problem reading the image, but I tried an old dataset that I previously used to train the model and got the same error. Do you have any idea what's happening?

Best,

Arnault
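The message "src is not a numpy array, neither a scalar" says that OpenCV did not recognize the object it was given as an image. One cheap first step is to log what is actually being passed. The helper below is a minimal sketch: `describe` is a hypothetical name, not part of DocLayout-YOLO, and the idea of calling it temporarily just before the `cv2.warpAffine` call in augment.py is an assumption about how one might debug this, not something the thread confirms.

```python
def describe(obj):
    """Return a short description of an object about to be handed to OpenCV."""
    cls = type(obj)
    parts = [f"type={cls.__module__}.{cls.__qualname__}"]
    # Array-like objects usually expose these attributes; other objects lack them.
    for attr in ("shape", "dtype"):
        if hasattr(obj, attr):
            parts.append(f"{attr}={getattr(obj, attr)}")
    return ", ".join(parts)


print(describe(None))       # type=builtins.NoneType
print(describe([1, 2, 3]))  # type=builtins.list
```

A result such as `type=builtins.NoneType` would point at a failed read (`cv2.imread` returns `None` rather than raising when it cannot load a file), while an unexpected module name in the type would point at the environment rather than the data.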


JulioZhao97 · 2024-12-20T01:56:25Z

@agombert Hello! It seems that the error may be caused by the source image files. Which image format are you using?

agombert · 2024-12-20T08:03:05Z

Yes, that's what I saw, and I tried to debug it with OpenCV. But it's quite weird, as it worked with the same config on a folder I had a month ago, and now it doesn't anymore.

I'm using .jpg. I tried converting the images to RGB and saving them as .png instead, but I still get the same problem. It looks like it comes from augment.py. Any idea?
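To rule out the image files themselves, one option is to check that each file's content matches its extension. The sketch below uses only the standard library and the well-known JPEG and PNG file signatures; the function names are made up for illustration, and it checks only the leading bytes, not whether the whole file decodes.

```python
from pathlib import Path

# Leading bytes that identify the two formats mentioned in this thread.
SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
}


def sniff_format(path):
    """Return 'jpeg', 'png', or None based on the file's first bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    for name, magic in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return None


def mismatched_files(image_dir):
    """Yield (path, detected) for files whose extension disagrees with their content."""
    expected = {".jpg": "jpeg", ".jpeg": "jpeg", ".png": "png"}
    for path in sorted(Path(image_dir).iterdir()):
        want = expected.get(path.suffix.lower())
        if want is None:
            continue  # not an extension this sketch knows about
        got = sniff_format(path)
        if got != want:
            yield path, got
```

Iterating over `mismatched_files(...)` for the training image folder and printing each pair lists any renamed or truncated files. An empty result suggests looking at the environment instead.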

JulioZhao97 · 2024-12-20T09:33:31Z

@agombert Maybe this is due to the update of albumentations? Did you update your albumentations version?

agombert · 2024-12-20T09:46:03Z

I'm using the installation from this repo in a conda environment with "albumentations>=1.4.11", but yes, it could come from there. Let me check in a few hours and I'll let you know if pinning 1.4.11 solves the problem. I see that the latest release (3 days ago) is 1.4.23.

agombert · 2024-12-20T14:36:57Z

@JulioZhao97 I tested with different versions of albumentations and got the same problem. That's quite weird. I looked at other libraries that might be involved, but only albumentations has had recent updates.

I saw this issue that may be related, but I'm unsure, as I'm not an expert in CV.

I can give you a sample of the data if you'd like to test it yourself.

hengrui0516 · 2024-12-21T14:59:51Z

@agombert Can you try the following versions and see whether that solves the problem?
albumentations 1.4.18
opencv-python 4.10.0.84
opencv-python-headless 4.10.0.84
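Before and after pinning, it can help to confirm what the active interpreter actually sees. A minimal sketch using the standard library's `importlib.metadata`; the package list simply mirrors the three names above:

```python
from importlib import metadata

# Distribution names taken from the versions suggested above.
PACKAGES = ["albumentations", "opencv-python", "opencv-python-headless"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

This reports only what pip-style metadata records. It does not show which files ended up providing the `cv2` module, and the opencv-python README advises installing only one of its package variants in a given environment.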

JulioZhao97 · 2024-12-23T06:10:05Z

@agombert Hello! Could you please provide your sample data? You can upload it here or send it via my email: [email protected]

agombert · 2024-12-23T10:10:12Z

Hey @JulioZhao97 👋 I've just sent you the email with a link to the data sample.

@hengrui0516, thanks for your help 🙏. I tried the versions you mentioned (even setting them in the extras of the pyproject.toml), but unfortunately that did not work either.

JulioZhao97 · 2024-12-24T01:49:20Z

@agombert I will look into it today.

JulioZhao97 · 2024-12-24T07:20:18Z

@agombert It turns out that I can train successfully with your sample data.
train_batch0.jpg: (image attachment)

config:

task: detect
mode: train
model: yolov10m-doclayout.yaml
data: data.yaml
epochs: 500
time: null
patience: 100
batch: 1
imgsz: 1120
save: true
save_period: 10
val_period: 1
cache: false
device: '3'
workers: 4
project: public_dataset/data
name: yolov10m-doclayout_data_epoch500_imgsz1120_bs1_pretrain_None
exist_ok: false
pretrained: true
optimizer: SGD
verbose: true
seed: 0
deterministic: true
single_cls: false
rect: false
cos_lr: false
close_mosaic: 10
resume: null
amp: true
fraction: 1.0
profile: false
freeze: null
multi_scale: false
overlap_mask: true
mask_ratio: 4
dropout: 0.0
val: true
split: val
save_json: false
save_hybrid: false
conf: null
iou: 0.7
max_det: 300
half: false
dnn: false
plots: true
source: null
vid_stride: 1
stream_buffer: false
visualize: false
augment: false
agnostic_nms: false
classes: null
retina_masks: false
embed: null
show: false
save_frames: false
save_txt: false
save_conf: false
save_crop: false
show_labels: true
show_conf: true
show_boxes: true
line_width: null
format: torchscript
keras: false
optimize: false
int8: false
dynamic: false
simplify: false
opset: null
workspace: 4
nms: false
lr0: 0.02
lrf: 0.01
momentum: 0.9
weight_decay: 0.0005
warmup_epochs: 3.0
warmup_momentum: 0.8
warmup_bias_lr: 0.1
box: 7.5
cls: 0.5
dfl: 1.5
pose: 12.0
kobj: 1.0
label_smoothing: 0.0
nbs: 64
hsv_h: 0.015
hsv_s: 0.7
hsv_v: 0.4
degrees: 0.0
translate: 0.1
scale: 0.5
shear: 0.0
perspective: 0.0
flipud: 0.0
fliplr: 0.5
bgr: 0.0
mosaic: 1.0
mixup: 0.0
copy_paste: 0.0
auto_augment: randaugment
erasing: 0.4
crop_fraction: 1.0
cfg: null
tracker: botsort.yaml
save_dir: public_dataset/data/yolov10m-doclayout_data_epoch500_imgsz1120_bs1_pretrain_None

Here is my environment for your reference:

name: doclayout_yolo
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - bzip2=1.0.8=h5eee18b_6
  - ca-certificates=2024.3.11=h06a4308_0
  - ld_impl_linux-64=2.38=h1181459_1
  - libffi=3.4.4=h6a678d5_1
  - libgcc-ng=11.2.0=h1234567_1
  - libgomp=11.2.0=h1234567_1
  - libstdcxx-ng=11.2.0=h1234567_1
  - libuuid=1.41.5=h5eee18b_0
  - ncurses=6.4=h6a678d5_0
  - openssl=3.0.14=h5eee18b_0
  - pip=24.0=py310h06a4308_0
  - python=3.10.14=h955ad1f_1
  - readline=8.2=h5eee18b_0
  - sqlite=3.45.3=h5eee18b_0
  - tk=8.6.14=h39e8969_0
  - wheel=0.43.0=py310h06a4308_0
  - xz=5.4.6=h5eee18b_1
  - zlib=1.2.13=h5eee18b_1
  - pip:
      - absl-py==2.1.0
      - aiofiles==23.2.1
      - aiohttp==3.9.5
      - aiosignal==1.3.1
      - albucore==0.0.12
      - albumentations==1.4.11
      - aliyun-python-sdk-core==2.16.0
      - aliyun-python-sdk-kms==2.16.5
      - altair==5.3.0
      - annotated-types==0.7.0
      - anyio==4.4.0
      - astor==0.8.1
      - asttokens==2.4.1
      - async-timeout==4.0.3
      - attrs==23.2.0
      - backports-tarfile==1.2.0
      - beautifulsoup4==4.12.3
      - build==1.2.2.post1
      - certifi==2024.7.4
      - cffi==1.17.1
      - chardet==4.0.0
      - charset-normalizer==3.3.2
      - click==8.1.7
      - cmake==3.30.4
      - comm==0.2.2
      - contourpy==1.2.1
      - crcmod==1.7
      - cryptography==43.0.1
      - cycler==0.12.1
      - cython==3.0.10
      - debugpy==1.8.2
      - decorator==5.1.1
      - dnspython==2.6.1
      - doclayout-yolo==0.0.2
      - docutils==0.21.2
      - email-validator==2.2.0
      - eval-type-backport==0.2.0
      - exceptiongroup==1.2.1
      - executing==2.0.1
      - fastapi==0.111.0
      - fastapi-cli==0.0.4
      - ffmpy==0.3.2
      - filelock==3.14.0
      - filetype==1.2.0
      - fire==0.6.0
      - flatbuffers==24.3.25
      - fonttools==4.53.0
      - frozenlist==1.4.1
      - fsspec==2024.6.1
      - gitdb==4.0.11
      - gitpython==3.1.43
      - grad-cam==1.5.3
      - gradio==4.31.5
      - gradio-client==0.16.4
      - grpcio==1.65.1
      - h11==0.14.0
      - httpcore==1.0.5
      - httptools==0.6.1
      - httpx==0.27.0
      - huggingface-hub==0.23.2
      - idna==3.7
      - imageio==2.34.2
      - imgaug==0.4.0
      - importlib-metadata==8.5.0
      - importlib-resources==6.4.0
      - ipykernel==6.29.5
      - ipython==8.26.0
      - jaraco-classes==3.4.0
      - jaraco-context==6.0.1
      - jaraco-functools==4.1.0
      - jedi==0.19.1
      - jeepney==0.8.0
      - jinja2==3.1.4
      - jmespath==0.10.0
      - joblib==1.4.2
      - jq==1.7.0
      - jsonschema==4.22.0
      - jsonschema-specifications==2023.12.1
      - jupyter-client==8.6.2
      - jupyter-core==5.7.2
      - keyring==25.4.1
      - kiwisolver==1.4.5
      - lapx==0.5.9.post1
      - lazy-loader==0.4
      - lit==18.1.8
      - lmdb==1.5.1
      - loguru==0.7.2
      - lxml==5.2.2
      - markdown==3.6
      - markdown-it-py==3.0.0
      - markupsafe==2.1.5
      - matplotlib==3.9.0
      - matplotlib-inline==0.1.7
      - mdurl==0.1.2
      - more-itertools==10.5.0
      - mpmath==1.3.0
      - multidict==6.0.5
      - nest-asyncio==1.6.0
      - networkx==3.3
      - nh3==0.2.18
      - numpy==1.26.4
      - nvidia-cublas-cu12==12.4.5.8
      - nvidia-cuda-cupti-cu12==12.4.127
      - nvidia-cuda-nvrtc-cu12==12.4.127
      - nvidia-cuda-runtime-cu12==12.4.127
      - nvidia-cudnn-cu12==9.1.0.70
      - nvidia-cufft-cu12==11.2.1.3
      - nvidia-curand-cu12==10.3.5.147
      - nvidia-cusolver-cu12==11.6.1.9
      - nvidia-cusparse-cu12==12.3.1.170
      - nvidia-nccl-cu12==2.21.5
      - nvidia-nvjitlink-cu12==12.4.127
      - nvidia-nvtx-cu12==12.4.127
      - onnx==1.14.0
      - onnxruntime==1.15.1
      - onnxruntime-gpu==1.16.3
      - onnxslim==0.1.31
      - opencv-contrib-python==4.10.0.84
      - opencv-python==4.9.0.80
      - opencv-python-headless==4.10.0.84
      - openxlab==0.1.2
      - opt-einsum==3.3.0
      - orjson==3.10.6
      - oss2==2.17.0
      - packaging==24.1
      - paddleocr==2.8.1
      - paddlepaddle-gpu==2.6.1
      - pandas==2.2.2
      - parso==0.8.4
      - pexpect==4.9.0
      - pillow==10.3.0
      - pkginfo==1.10.0
      - platformdirs==4.2.2
      - prompt-toolkit==3.0.47
      - protobuf==3.20.3
      - psutil==5.9.8
      - ptyprocess==0.7.0
      - pure-eval==0.2.2
      - py-cpuinfo==9.0.0
      - pyarrow==17.0.0
      - pybboxes==0.1.6
      - pyclipper==1.3.0.post5
      - pycocotools==2.0.7
      - pycparser==2.22
      - pycryptodome==3.21.0
      - pydantic==2.8.2
      - pydantic-core==2.20.1
      - pydub==0.25.1
      - pygments==2.18.0
      - pymupdf==1.24.7
      - pymupdfb==1.24.6
      - pyparsing==3.1.2
      - pyproject-hooks==1.2.0
      - pytesseract==0.3.10
      - python-docx==1.1.2
      - python-multipart==0.0.9
      - pytz==2023.4
      - pyyaml==6.0.1
      - pyzmq==26.0.3
      - rapidfuzz==3.9.4
      - readme-renderer==44.0
      - referencing==0.35.1
      - requests==2.28.2
      - requests-toolbelt==1.0.0
      - rfc3986==2.0.0
      - rich==13.4.2
      - roboflow==1.1.36
      - rpds-py==0.18.1
      - ruff==0.5.0
      - safetensors==0.4.3
      - sahi==0.11.18
      - scikit-image==0.24.0
      - scikit-learn==1.5.1
      - scipy==1.13.1
      - seaborn==0.13.2
      - secretstorage==3.3.3
      - semantic-version==2.10.0
      - setuptools==60.2.0
      - shapely==2.0.5
      - shellingham==1.5.4
      - smmap==5.0.1
      - sniffio==1.3.1
      - soupsieve==2.5
      - stack-data==0.6.3
      - starlette==0.37.2
      - sympy==1.13.1
      - tensorboard==2.17.0
      - tensorboard-data-server==0.7.2
      - tensorrt==8.6.1
      - tensorrt-bindings==8.6.1
      - tensorrt-libs==8.6.1
      - termcolor==2.4.0
      - terminaltables==3.1.10
      - thop==0.1.1-2209072238
      - threadpoolctl==3.5.0
      - tifffile==2024.7.2
      - timm==1.0.9
      - tomli==2.0.1
      - tomlkit==0.12.0
      - toolz==0.12.1
      - torch==2.5.0
      - torch-geometric==2.5.3
      - torchvision==0.20.0
      - tornado==6.4.1
      - tqdm==4.65.2
      - traitlets==5.14.3
      - triton==3.1.0
      - ttach==0.0.3
      - twine==5.1.1
      - typer==0.12.3
      - typing-extensions==4.12.2
      - tzdata==2024.1
      - ujson==5.10.0
      - ultralytics==8.1.34
      - urllib3==1.26.20
      - uvicorn==0.30.1
      - uvloop==0.19.0
      - watchfiles==0.22.0
      - wcwidth==0.2.13
      - websockets==11.0.3
      - werkzeug==3.0.3
      - yarl==1.9.4
      - yolov10==0.0.1
      - yolov5==7.0.13
      - zipp==3.20.2
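To compare a local environment against an export like the one above, the pinned `name==version` lines can be parsed and checked against what is installed. This is a rough sketch with hypothetical function names; it compares version strings textually and ignores the conda `name=version=build` lines.

```python
from importlib import metadata


def parse_pins(text):
    """Extract {name: version} from lines of the form '- name==version'."""
    pins = {}
    for line in text.splitlines():
        entry = line.strip().removeprefix("- ")
        if "==" in entry:
            name, _, version = entry.partition("==")
            pins[name.strip()] = version.strip()
    return pins


def diff_against_installed(pins):
    """Return {name: (pinned, installed)} for packages that differ or are missing."""
    diffs = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:  # plain string comparison, not version-aware
            diffs[name] = (pinned, installed)
    return diffs


# A short excerpt of the export above, used here as sample input.
sample = """
  - pip:
      - albumentations==1.4.11
      - numpy==1.26.4
      - opencv-python==4.9.0.80
"""
for name, (pinned, installed) in diff_against_installed(parse_pins(sample)).items():
    print(f"{name}: pinned {pinned}, installed {installed}")
```

Saving the full export to a file and passing its contents to `parse_pins` gives the complete list. Differences in numpy or any of the OpenCV packages are the first ones worth checking for this particular error.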

agombert · 2024-12-24T10:53:27Z

Hey @JulioZhao97, thanks for your help. I'll try a couple of things to see if I can fix the problem and let you know as soon as possible!

agombert · 2024-12-24T15:53:37Z

OK @JulioZhao97, I found the problem! 😌

It came from a 🐍 conda install -c conda-forge datasets -y that I ran after activating the environment, in order to use the HF datasets library before training. It was mixing things up. Adding datasets to the pyproject.toml instead solved the issue! 👍
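For anyone hitting the same symptom, one way to spot this kind of mix is to list which packages conda itself has recorded in the environment. Conda keeps one JSON record per installed package under `conda-meta` in the environment prefix. The sketch below reads those records; the watch list is an assumption chosen for illustration, since the thread does not say which package actually clashed.

```python
import json
import sys
from pathlib import Path


def conda_records(prefix):
    """Return {name: version} for packages conda has recorded in this prefix."""
    records = {}
    meta_dir = Path(prefix) / "conda-meta"
    if not meta_dir.is_dir():
        return records  # not a conda environment, or nothing installed by conda
    for path in sorted(meta_dir.glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        if isinstance(data, dict) and "name" in data and "version" in data:
            records[data["name"]] = data["version"]
    return records


# Hypothetical watch list: packages that image loading and augmentation rely on.
WATCHED = ["numpy", "opencv", "py-opencv", "libopencv", "pillow"]

found = conda_records(sys.prefix)
for name in WATCHED:
    if name in found:
        print(f"conda manages {name} {found[name]}")
```

If a package that the project installs through pip also shows up here, both package managers have written it into the same environment. Declaring the dependency in pyproject.toml, as was done above, keeps a single installer in charge.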

agombert · 2024-12-24T15:53:52Z

Thank you very much for your help! 🙏

JulioZhao97 mentioned this issue on Dec 23, 2024: Enable training with DocLayout-YOLO #35 (Open)

agombert closed this as completed Dec 24, 2024
