Training Fails #72

agombert opened this issue Dec 19, 2024 · 13 comments

Hey, I got a strange error during training when running:

python DocLayout-YOLO/ --data $DATA_PATH/config --model m-doclayout --epoch $EPOCHS --image-size 1024 --batch-size $BATCH_SIZE --patience $PATIENCE --project $OUTPUT_PATH --optimizer Adam --lr0 0.001 --pretrain $MODEL_PATH --device 0
train: Scanning /home/ubuntu/trocr_handwritten/trocr_handwritten/parse/data/labels/train.cache... 386 images, 10 backgrounds, 0 corrupt: 100%|██████████| 386/386 [00:00<?, ?it/s]
/home/ubuntu/miniconda/envs/doclayout_yolo/lib/python3.10/site-packages/albumentations/core/ UserWarning: Got processor for bboxes, but no transform to process it.
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01, num_output_channels=3, method='weighted_average'), CLAHE(p=0.01, clip_limit=(1.0, 4.0), tile_grid_size=(8, 8))
val: Scanning /home/ubuntu/trocr_handwritten/trocr_handwritten/parse/data/labels/val.cache... 97 images, 3 backgrounds, 0 corrupt: 100%|██████████| 97/97 [00:00<?, ?it/s]
optimizer: Adam(lr=0.001, momentum=0.9) with parameter groups 171 weight(decay=0.0), 184 weight(decay=0.0005), 183 bias(decay=0.0)
Image sizes 1024 train, 1024 val
Using 1 dataloader workers
Logging results to yolo_ft/yolov10m-doclayout_data/config_epoch50_imgsz1024_bs8_pretrain_unknown3
Starting training for 50 epochs...

      Epoch    GPU_mem     box_om     cls_om     dfl_om     box_oo     cls_oo     dfl_oo  Instances       Size
  0%|          | 0/49 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/", line 63, in <module>
    results = model.train(
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/doclayout_yolo/engine/", line 660, in train
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/doclayout_yolo/engine/", line 214, in train
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/doclayout_yolo/engine/", line 364, in _do_train
    for i, batch in pbar:
  File "/home/ubuntu/miniconda/envs/doclayout_yolo/lib/python3.10/site-packages/tqdm/", line 1181, in __iter__
    for obj in iterable:
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/doclayout_yolo/data/", line 49, in __iter__
    yield next(self.iterator)
  File "/home/ubuntu/miniconda/envs/doclayout_yolo/lib/python3.10/site-packages/torch/utils/data/", line 701, in __next__
    data = self._next_data()
  File "/home/ubuntu/miniconda/envs/doclayout_yolo/lib/python3.10/site-packages/torch/utils/data/", line 1465, in _next_data
    return self._process_data(data)
  File "/home/ubuntu/miniconda/envs/doclayout_yolo/lib/python3.10/site-packages/torch/utils/data/", line 1491, in _process_data
  File "/home/ubuntu/miniconda/envs/doclayout_yolo/lib/python3.10/site-packages/torch/", line 715, in reraise
    raise exception
cv2.error: Caught error in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/miniconda/envs/doclayout_yolo/lib/python3.10/site-packages/torch/utils/data/_utils/", line 351, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/home/ubuntu/miniconda/envs/doclayout_yolo/lib/python3.10/site-packages/torch/utils/data/_utils/", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/miniconda/envs/doclayout_yolo/lib/python3.10/site-packages/torch/utils/data/_utils/", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/doclayout_yolo/data/", line 266, in __getitem__
    return self.transforms(self.get_image_and_label(index))
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/doclayout_yolo/data/", line 74, in __call__
    data = t(data)
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/doclayout_yolo/data/", line 74, in __call__
    data = t(data)
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/doclayout_yolo/data/", line 534, in __call__
    img, M, scale = self.affine_transform(img, border)
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/doclayout_yolo/data/", line 434, in affine_transform
    img = cv2.warpAffine(img, M[:2], dsize=self.size, borderValue=(114, 114, 114))
cv2.error: OpenCV(4.10.0) :-1: error: (-5:Bad argument) in function 'warpAffine'
> Overload resolution failed:
>  - src is not a numpy array, neither a scalar
>  - Expected Ptr<cv::UMat> for argument 'src'

It looks like OpenCV has a problem reading the image, but I tried an old dataset that I previously used to train the model and got the same error. Do you have any idea what's happening?
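One way to rule out unreadable files before digging into the environment is to scan the image folder and list everything OpenCV cannot decode. The following is only a minimal sketch: `find_unreadable_images` and `IMAGE_SUFFIXES` are hypothetical names, not part of DocLayout-YOLO, and the sketch relies on `cv2.imread` returning `None` rather than raising when a file cannot be read.

```python
from pathlib import Path

# Hypothetical list of extensions to check; adjust to your dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"}


def find_unreadable_images(image_dir, reader=None):
    """Return the image paths under image_dir that the reader fails to decode.

    reader is any callable that takes a path string and returns an array-like
    image or None. It defaults to cv2.imread, which returns None (instead of
    raising) when a file cannot be decoded.
    """
    if reader is None:
        import cv2  # imported lazily so the helper can be used with another reader

        reader = cv2.imread

    unreadable = []
    for path in sorted(Path(image_dir).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        img = reader(str(path))
        # Treat anything without a `shape` attribute like a failed read,
        # since warpAffine needs an array as its source.
        if img is None or not hasattr(img, "shape"):
            unreadable.append(path)
    return unreadable
```

Calling `find_unreadable_images("path/to/images/train")` (placeholder path) and getting an empty list suggests the files themselves are fine and the problem lies elsewhere, for example in the installed packages.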



@agombert Hello! It seems the error comes from the image source file. Which image format are you using?

Yes, that's what I saw, and I tried to debug it with OpenCV. But it's quite weird: the same config worked on a folder I had a month ago, and now it no longer does.

I'm using .jpg. I tried converting to RGB and saving as .png instead, but I still get the same problem. It looks to come from the [...] Any idea?

@agombert Maybe this is due to the update of albumentations? Did you update your albumentations version?

I'm using the installation from this repo in a conda environment with "albumentations>=1.4.11", so yes, it could come from there. Let me check in a few hours and I'll let you know whether pinning 1.4.11 solves the problem. I see that the latest release (3 days ago) is 1.4.23.

@JulioZhao97 I tested with different versions of albumentations and got the same problem. That's quite weird. I looked into other potential libraries but only albumentations got quite recent updates.
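When comparing environments like this, it can help to print the versions that are actually installed rather than the ones requested in the requirements. This is a small sketch using only the standard library; the package list in `PACKAGES` is an assumption about which libraries are relevant here, and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata

# Assumed set of packages worth checking for this kind of error.
PACKAGES = (
    "albumentations",
    "albucore",
    "numpy",
    "opencv-python",
    "opencv-python-headless",
    "opencv-contrib-python",
    "torch",
)


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this in both the working and the failing environment gives two short lists that are easy to compare side by side.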

I saw this issue, which may be related, but I'm unsure since I'm not an expert in CV.

I can give you a sample of the data if you'd like to test it yourself.

@agombert Can you try the following setting and see whether it solves the problem?
albumentations 1.4.18

@agombert Hello! Could you please provide your sample data? You can upload it here or send it via my email: [email protected]

Hey @JulioZhao97 👋 I've just sent you the email with a link to the data sample.

@hengrui0516, thanks for your help 🙏, I tried to go with the versions you mentioned (even in the extra of the pyproject.toml) but unfortunately it did not work either.

@agombert I will look into it today

@agombert It turns out that I can train successfully with your sample data, using the following settings:


task: detect
mode: train
model: yolov10m-doclayout.yaml
data: data.yaml
epochs: 500
time: null
patience: 100
batch: 1
imgsz: 1120
save: true
save_period: 10
val_period: 1
cache: false
device: '3'
workers: 4
project: public_dataset/data
name: yolov10m-doclayout_data_epoch500_imgsz1120_bs1_pretrain_None
exist_ok: false
pretrained: true
optimizer: SGD
verbose: true
seed: 0
deterministic: true
single_cls: false
rect: false
cos_lr: false
close_mosaic: 10
resume: null
amp: true
fraction: 1.0
profile: false
freeze: null
multi_scale: false
overlap_mask: true
mask_ratio: 4
dropout: 0.0
val: true
split: val
save_json: false
save_hybrid: false
conf: null
iou: 0.7
max_det: 300
half: false
dnn: false
plots: true
source: null
vid_stride: 1
stream_buffer: false
visualize: false
augment: false
agnostic_nms: false
classes: null
retina_masks: false
embed: null
show: false
save_frames: false
save_txt: false
save_conf: false
save_crop: false
show_labels: true
show_conf: true
show_boxes: true
line_width: null
format: torchscript
keras: false
optimize: false
int8: false
dynamic: false
simplify: false
opset: null
workspace: 4
nms: false
lr0: 0.02
lrf: 0.01
momentum: 0.9
weight_decay: 0.0005
warmup_epochs: 3.0
warmup_momentum: 0.8
warmup_bias_lr: 0.1
box: 7.5
cls: 0.5
dfl: 1.5
pose: 12.0
kobj: 1.0
label_smoothing: 0.0
nbs: 64
hsv_h: 0.015
hsv_s: 0.7
hsv_v: 0.4
degrees: 0.0
translate: 0.1
scale: 0.5
shear: 0.0
perspective: 0.0
flipud: 0.0
fliplr: 0.5
bgr: 0.0
mosaic: 1.0
mixup: 0.0
copy_paste: 0.0
auto_augment: randaugment
erasing: 0.4
crop_fraction: 1.0
cfg: null
tracker: botsort.yaml
save_dir: public_dataset/data/yolov10m-doclayout_data_epoch500_imgsz1120_bs1_pretrain_None

I provide my environment for your reference:

name: doclayout_yolo
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - bzip2=1.0.8=h5eee18b_6
  - ca-certificates=2024.3.11=h06a4308_0
  - ld_impl_linux-64=2.38=h1181459_1
  - libffi=3.4.4=h6a678d5_1
  - libgcc-ng=11.2.0=h1234567_1
  - libgomp=11.2.0=h1234567_1
  - libstdcxx-ng=11.2.0=h1234567_1
  - libuuid=1.41.5=h5eee18b_0
  - ncurses=6.4=h6a678d5_0
  - openssl=3.0.14=h5eee18b_0
  - pip=24.0=py310h06a4308_0
  - python=3.10.14=h955ad1f_1
  - readline=8.2=h5eee18b_0
  - sqlite=3.45.3=h5eee18b_0
  - tk=8.6.14=h39e8969_0
  - wheel=0.43.0=py310h06a4308_0
  - xz=5.4.6=h5eee18b_1
  - zlib=1.2.13=h5eee18b_1
  - pip:
      - absl-py==2.1.0
      - aiofiles==23.2.1
      - aiohttp==3.9.5
      - aiosignal==1.3.1
      - albucore==0.0.12
      - albumentations==1.4.11
      - aliyun-python-sdk-core==2.16.0
      - aliyun-python-sdk-kms==2.16.5
      - altair==5.3.0
      - annotated-types==0.7.0
      - anyio==4.4.0
      - astor==0.8.1
      - asttokens==2.4.1
      - async-timeout==4.0.3
      - attrs==23.2.0
      - backports-tarfile==1.2.0
      - beautifulsoup4==4.12.3
      - build==1.2.2.post1
      - certifi==2024.7.4
      - cffi==1.17.1
      - chardet==4.0.0
      - charset-normalizer==3.3.2
      - click==8.1.7
      - cmake==3.30.4
      - comm==0.2.2
      - contourpy==1.2.1
      - crcmod==1.7
      - cryptography==43.0.1
      - cycler==0.12.1
      - cython==3.0.10
      - debugpy==1.8.2
      - decorator==5.1.1
      - dnspython==2.6.1
      - doclayout-yolo==0.0.2
      - docutils==0.21.2
      - email-validator==2.2.0
      - eval-type-backport==0.2.0
      - exceptiongroup==1.2.1
      - executing==2.0.1
      - fastapi==0.111.0
      - fastapi-cli==0.0.4
      - ffmpy==0.3.2
      - filelock==3.14.0
      - filetype==1.2.0
      - fire==0.6.0
      - flatbuffers==24.3.25
      - fonttools==4.53.0
      - frozenlist==1.4.1
      - fsspec==2024.6.1
      - gitdb==4.0.11
      - gitpython==3.1.43
      - grad-cam==1.5.3
      - gradio==4.31.5
      - gradio-client==0.16.4
      - grpcio==1.65.1
      - h11==0.14.0
      - httpcore==1.0.5
      - httptools==0.6.1
      - httpx==0.27.0
      - huggingface-hub==0.23.2
      - idna==3.7
      - imageio==2.34.2
      - imgaug==0.4.0
      - importlib-metadata==8.5.0
      - importlib-resources==6.4.0
      - ipykernel==6.29.5
      - ipython==8.26.0
      - jaraco-classes==3.4.0
      - jaraco-context==6.0.1
      - jaraco-functools==4.1.0
      - jedi==0.19.1
      - jeepney==0.8.0
      - jinja2==3.1.4
      - jmespath==0.10.0
      - joblib==1.4.2
      - jq==1.7.0
      - jsonschema==4.22.0
      - jsonschema-specifications==2023.12.1
      - jupyter-client==8.6.2
      - jupyter-core==5.7.2
      - keyring==25.4.1
      - kiwisolver==1.4.5
      - lapx==0.5.9.post1
      - lazy-loader==0.4
      - lit==18.1.8
      - lmdb==1.5.1
      - loguru==0.7.2
      - lxml==5.2.2
      - markdown==3.6
      - markdown-it-py==3.0.0
      - markupsafe==2.1.5
      - matplotlib==3.9.0
      - matplotlib-inline==0.1.7
      - mdurl==0.1.2
      - more-itertools==10.5.0
      - mpmath==1.3.0
      - multidict==6.0.5
      - nest-asyncio==1.6.0
      - networkx==3.3
      - nh3==0.2.18
      - numpy==1.26.4
      - nvidia-cublas-cu12==
      - nvidia-cuda-cupti-cu12==12.4.127
      - nvidia-cuda-nvrtc-cu12==12.4.127
      - nvidia-cuda-runtime-cu12==12.4.127
      - nvidia-cudnn-cu12==
      - nvidia-cufft-cu12==
      - nvidia-curand-cu12==
      - nvidia-cusolver-cu12==
      - nvidia-cusparse-cu12==
      - nvidia-nccl-cu12==2.21.5
      - nvidia-nvjitlink-cu12==12.4.127
      - nvidia-nvtx-cu12==12.4.127
      - onnx==1.14.0
      - onnxruntime==1.15.1
      - onnxruntime-gpu==1.16.3
      - onnxslim==0.1.31
      - opencv-contrib-python==
      - opencv-python==
      - opencv-python-headless==
      - openxlab==0.1.2
      - opt-einsum==3.3.0
      - orjson==3.10.6
      - oss2==2.17.0
      - packaging==24.1
      - paddleocr==2.8.1
      - paddlepaddle-gpu==2.6.1
      - pandas==2.2.2
      - parso==0.8.4
      - pexpect==4.9.0
      - pillow==10.3.0
      - pkginfo==1.10.0
      - platformdirs==4.2.2
      - prompt-toolkit==3.0.47
      - protobuf==3.20.3
      - psutil==5.9.8
      - ptyprocess==0.7.0
      - pure-eval==0.2.2
      - py-cpuinfo==9.0.0
      - pyarrow==17.0.0
      - pybboxes==0.1.6
      - pyclipper==1.3.0.post5
      - pycocotools==2.0.7
      - pycparser==2.22
      - pycryptodome==3.21.0
      - pydantic==2.8.2
      - pydantic-core==2.20.1
      - pydub==0.25.1
      - pygments==2.18.0
      - pymupdf==1.24.7
      - pymupdfb==1.24.6
      - pyparsing==3.1.2
      - pyproject-hooks==1.2.0
      - pytesseract==0.3.10
      - python-docx==1.1.2
      - python-multipart==0.0.9
      - pytz==2023.4
      - pyyaml==6.0.1
      - pyzmq==26.0.3
      - rapidfuzz==3.9.4
      - readme-renderer==44.0
      - referencing==0.35.1
      - requests==2.28.2
      - requests-toolbelt==1.0.0
      - rfc3986==2.0.0
      - rich==13.4.2
      - roboflow==1.1.36
      - rpds-py==0.18.1
      - ruff==0.5.0
      - safetensors==0.4.3
      - sahi==0.11.18
      - scikit-image==0.24.0
      - scikit-learn==1.5.1
      - scipy==1.13.1
      - seaborn==0.13.2
      - secretstorage==3.3.3
      - semantic-version==2.10.0
      - setuptools==60.2.0
      - shapely==2.0.5
      - shellingham==1.5.4
      - smmap==5.0.1
      - sniffio==1.3.1
      - soupsieve==2.5
      - stack-data==0.6.3
      - starlette==0.37.2
      - sympy==1.13.1
      - tensorboard==2.17.0
      - tensorboard-data-server==0.7.2
      - tensorrt==8.6.1
      - tensorrt-bindings==8.6.1
      - tensorrt-libs==8.6.1
      - termcolor==2.4.0
      - terminaltables==3.1.10
      - thop==0.1.1-2209072238
      - threadpoolctl==3.5.0
      - tifffile==2024.7.2
      - timm==1.0.9
      - tomli==2.0.1
      - tomlkit==0.12.0
      - toolz==0.12.1
      - torch==2.5.0
      - torch-geometric==2.5.3
      - torchvision==0.20.0
      - tornado==6.4.1
      - tqdm==4.65.2
      - traitlets==5.14.3
      - triton==3.1.0
      - ttach==0.0.3
      - twine==5.1.1
      - typer==0.12.3
      - typing-extensions==4.12.2
      - tzdata==2024.1
      - ujson==5.10.0
      - ultralytics==8.1.34
      - urllib3==1.26.20
      - uvicorn==0.30.1
      - uvloop==0.19.0
      - watchfiles==0.22.0
      - wcwidth==0.2.13
      - websockets==11.0.3
      - werkzeug==3.0.3
      - yarl==1.9.4
      - yolov10==0.0.1
      - yolov5==7.0.13
      - zipp==3.20.2

Hey @JulioZhao97 thanks for your help. I'll try a couple of things to see if I can handle the problem and let you know asap !

agombert commented Dec 24, 2024

Ok @JulioZhao97 I found the problem !! 😌

It was coming from a 🐍 conda install -c conda-forge datasets -y that I ran after activating the environment, in order to work with the HF datasets library before training. It was mixing things up, but adding datasets to the pyproject.toml instead solved the issue !! 👍
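For anyone hitting something similar, a quick way to see what an extra install changed is to snapshot the installed packages before and after it and diff the two. This is only a sketch with hypothetical helper names (`snapshot_packages`, `diff_snapshots`); the thread does not say which package was actually affected.

```python
from importlib import metadata


def snapshot_packages():
    """Map each installed distribution name (lowercased) to its version."""
    snapshot = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken or missing metadata
            snapshot[name.lower()] = dist.version
    return snapshot


def diff_snapshots(before, after):
    """Return {name: (old_version, new_version)} for every package that changed.

    A version of None means the package was absent in that snapshot.
    """
    changes = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes
```

Since the install happens outside Python, the first snapshot has to be saved (for example with `json.dump`) and loaded again after the install before calling `diff_snapshots`.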

Thank you very much for your help ! 🙏

