Training Fails #72

Closed
agombert opened this issue Dec 19, 2024 · 13 comments

Comments

@agombert

Hey, I got a rather odd error during training when running:

python DocLayout-YOLO/train.py --data $DATA_PATH/config --model m-doclayout --epoch $EPOCHS --image-size 1024 --batch-size $BATCH_SIZE --patience $PATIENCE --project $OUTPUT_PATH --optimizer Adam --lr0 0.001 --pretrain $MODEL_PATH --device 0
train: Scanning /home/ubuntu/trocr_handwritten/trocr_handwritten/parse/data/labels/train.cache... 386 images, 10 backgrounds, 0 corrupt: 100%|██████████| 386/386 [00:00<?, ?it/s]
/home/ubuntu/miniconda/envs/doclayout_yolo/lib/python3.10/site-packages/albumentations/core/composition.py:250: UserWarning: Got processor for bboxes, but no transform to process it.
  self._set_keys()
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01, num_output_channels=3, method='weighted_average'), CLAHE(p=0.01, clip_limit=(1.0, 4.0), tile_grid_size=(8, 8))
val: Scanning /home/ubuntu/trocr_handwritten/trocr_handwritten/parse/data/labels/val.cache... 97 images, 3 backgrounds, 0 corrupt: 100%|██████████| 97/97 [00:00<?, ?it/s]
optimizer: Adam(lr=0.001, momentum=0.9) with parameter groups 171 weight(decay=0.0), 184 weight(decay=0.0005), 183 bias(decay=0.0)
Image sizes 1024 train, 1024 val
Using 1 dataloader workers
Logging results to yolo_ft/yolov10m-doclayout_data/config_epoch50_imgsz1024_bs8_pretrain_unknown3
Starting training for 50 epochs...

      Epoch    GPU_mem     box_om     cls_om     dfl_om     box_oo     cls_oo     dfl_oo  Instances       Size
  0%|          | 0/49 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/train.py", line 63, in <module>
    results = model.train(
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/doclayout_yolo/engine/model.py", line 660, in train
    self.trainer.train()
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/doclayout_yolo/engine/trainer.py", line 214, in train
    self._do_train(world_size)
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/doclayout_yolo/engine/trainer.py", line 364, in _do_train
    for i, batch in pbar:
  File "/home/ubuntu/miniconda/envs/doclayout_yolo/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/doclayout_yolo/data/build.py", line 49, in __iter__
    yield next(self.iterator)
  File "/home/ubuntu/miniconda/envs/doclayout_yolo/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 701, in __next__
    data = self._next_data()
  File "/home/ubuntu/miniconda/envs/doclayout_yolo/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1465, in _next_data
    return self._process_data(data)
  File "/home/ubuntu/miniconda/envs/doclayout_yolo/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1491, in _process_data
    data.reraise()
  File "/home/ubuntu/miniconda/envs/doclayout_yolo/lib/python3.10/site-packages/torch/_utils.py", line 715, in reraise
    raise exception
cv2.error: Caught error in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/miniconda/envs/doclayout_yolo/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 351, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/home/ubuntu/miniconda/envs/doclayout_yolo/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/miniconda/envs/doclayout_yolo/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/doclayout_yolo/data/base.py", line 266, in __getitem__
    return self.transforms(self.get_image_and_label(index))
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/doclayout_yolo/data/augment.py", line 74, in __call__
    data = t(data)
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/doclayout_yolo/data/augment.py", line 74, in __call__
    data = t(data)
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/doclayout_yolo/data/augment.py", line 534, in __call__
    img, M, scale = self.affine_transform(img, border)
  File "/home/ubuntu/trocr_handwritten/trocr_handwritten/parse/DocLayout-YOLO/doclayout_yolo/data/augment.py", line 434, in affine_transform
    img = cv2.warpAffine(img, M[:2], dsize=self.size, borderValue=(114, 114, 114))
cv2.error: OpenCV(4.10.0) :-1: error: (-5:Bad argument) in function 'warpAffine'
> Overload resolution failed:
>  - src is not a numpy array, neither a scalar
>  - Expected Ptr<cv::UMat> for argument 'src'

It looks like OpenCV has a problem reading the image, but I tried an old dataset that I previously used to train the model and got the same error. Do you have any idea what's happening?

Best,

Arnault
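
One way to rule out unreadable files, sketched under assumptions: `cv2.imread` returns `None` instead of raising when it cannot decode a file, so a quick scan of the dataset can show whether the images themselves are at fault. The helper name `find_unreadable_images` and its `reader` parameter are hypothetical and not part of DocLayout-YOLO.

```python
from pathlib import Path


def find_unreadable_images(image_dir, reader=None, suffixes=(".jpg", ".jpeg", ".png")):
    """Return image paths under image_dir that the reader fails to decode.

    Hypothetical helper for debugging, not part of DocLayout-YOLO.
    """
    if reader is None:
        import cv2  # imported lazily; cv2.imread returns None when decoding fails

        reader = cv2.imread

    unreadable = []
    for path in sorted(Path(image_dir).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        img = reader(str(path))
        # A missing or empty result means the file could not be decoded.
        if img is None or getattr(img, "size", 0) == 0:
            unreadable.append(path)
    return unreadable
```

If every file decodes, the function returns an empty list, which would suggest the problem lies in the environment rather than in the data.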

@JulioZhao97
Collaborator

@agombert Hello! It seems the error comes from the source image files. Which image format are you using?

@agombert
Author

Yes, that's what I saw, and I tried to debug it with OpenCV. But it's quite weird: it worked with the same config on a folder I had a month ago, and now it no longer does.

I'm using .jpg. I tried converting to RGB and saving as .png instead, but I still get the same problem. It seems to come from augment.py. Any idea?
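
Since the message says `src` is not a numpy array, a possible next step is to log what the object actually is just before the failing call. This is only a sketch: `describe_image` is a hypothetical helper, and calling it right before `cv2.warpAffine` in augment.py is an assumption about how one might debug this, not something the repo provides.

```python
def describe_image(obj):
    """Summarize the type, dtype and shape of an object that should be an image array."""
    cls = type(obj)
    dtype = getattr(obj, "dtype", None)
    return {
        # Fully qualified type name, e.g. "numpy.ndarray" for a real array.
        "type": f"{cls.__module__}.{cls.__qualname__}",
        "dtype": None if dtype is None else str(dtype),
        "shape": getattr(obj, "shape", None),
    }


if __name__ == "__main__":
    # None is what cv2.imread returns for a file it cannot decode.
    print(describe_image(None))
```

A type other than `numpy.ndarray` would point at the data pipeline, while a genuine `numpy.ndarray` that OpenCV still rejects would point at the installed packages.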

@JulioZhao97
Collaborator

@agombert Maybe this is due to an albumentations update? Did you upgrade your albumentations version?

@agombert
Author

I'm using the installation from this repo in a conda environment with "albumentations>=1.4.11", so yes, it could come from there. Let me check in a few hours and I'll let you know if pinning 1.4.11 solves the problem. I see that the latest release (3 days ago) is 1.4.23.

@agombert
Author

@JulioZhao97 I tested different versions of albumentations and got the same problem, which is quite weird. I looked into other libraries that could be involved, but only albumentations has had recent updates.

I saw this issue, which may be related, but I'm unsure as I'm not an expert in CV.

I can give you a data sample if you'd like to test it yourself.

@hengrui0516
Collaborator

@agombert Can you try the following versions and see whether the problem is solved?
albumentations 1.4.18
opencv-python 4.10.0.84
opencv-python-headless 4.10.0.84
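
To confirm which versions are actually installed in the active environment (as opposed to what the requirements ask for), something like the following can help. It is a sketch using the standard library's `importlib.metadata`; the package list is an assumption based on the names mentioned in this thread, and `installed_versions` is a hypothetical helper.

```python
from importlib import metadata

# Distribution names mentioned in this thread, plus numpy (an assumption).
PACKAGES = (
    "albumentations",
    "opencv-python",
    "opencv-python-headless",
    "opencv-contrib-python",
    "numpy",
)


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running it inside the activated conda environment shows the versions the training script would actually import.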

@JulioZhao97
Collaborator

@agombert Hello! Could you please provide your sample data? You can upload it here or send it to me by email: [email protected]

@agombert
Author

Hey @JulioZhao97 👋 I've just sent you the email with a link to the data sample.

@hengrui0516, thanks for your help 🙏. I tried the versions you mentioned (even setting them in the extras of the pyproject.toml), but unfortunately that did not work either.

@JulioZhao97
Collaborator

@agombert I will look into it today.

@JulioZhao97
Collaborator

@agombert It turns out that I can train on your sample data successfully.
train_batch0.jpg: [image: train_batch0]

config:

task: detect
mode: train
model: yolov10m-doclayout.yaml
data: data.yaml
epochs: 500
time: null
patience: 100
batch: 1
imgsz: 1120
save: true
save_period: 10
val_period: 1
cache: false
device: '3'
workers: 4
project: public_dataset/data
name: yolov10m-doclayout_data_epoch500_imgsz1120_bs1_pretrain_None
exist_ok: false
pretrained: true
optimizer: SGD
verbose: true
seed: 0
deterministic: true
single_cls: false
rect: false
cos_lr: false
close_mosaic: 10
resume: null
amp: true
fraction: 1.0
profile: false
freeze: null
multi_scale: false
overlap_mask: true
mask_ratio: 4
dropout: 0.0
val: true
split: val
save_json: false
save_hybrid: false
conf: null
iou: 0.7
max_det: 300
half: false
dnn: false
plots: true
source: null
vid_stride: 1
stream_buffer: false
visualize: false
augment: false
agnostic_nms: false
classes: null
retina_masks: false
embed: null
show: false
save_frames: false
save_txt: false
save_conf: false
save_crop: false
show_labels: true
show_conf: true
show_boxes: true
line_width: null
format: torchscript
keras: false
optimize: false
int8: false
dynamic: false
simplify: false
opset: null
workspace: 4
nms: false
lr0: 0.02
lrf: 0.01
momentum: 0.9
weight_decay: 0.0005
warmup_epochs: 3.0
warmup_momentum: 0.8
warmup_bias_lr: 0.1
box: 7.5
cls: 0.5
dfl: 1.5
pose: 12.0
kobj: 1.0
label_smoothing: 0.0
nbs: 64
hsv_h: 0.015
hsv_s: 0.7
hsv_v: 0.4
degrees: 0.0
translate: 0.1
scale: 0.5
shear: 0.0
perspective: 0.0
flipud: 0.0
fliplr: 0.5
bgr: 0.0
mosaic: 1.0
mixup: 0.0
copy_paste: 0.0
auto_augment: randaugment
erasing: 0.4
crop_fraction: 1.0
cfg: null
tracker: botsort.yaml
save_dir: public_dataset/data/yolov10m-doclayout_data_epoch500_imgsz1120_bs1_pretrain_None

Here is my environment for your reference:

name: doclayout_yolo
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - bzip2=1.0.8=h5eee18b_6
  - ca-certificates=2024.3.11=h06a4308_0
  - ld_impl_linux-64=2.38=h1181459_1
  - libffi=3.4.4=h6a678d5_1
  - libgcc-ng=11.2.0=h1234567_1
  - libgomp=11.2.0=h1234567_1
  - libstdcxx-ng=11.2.0=h1234567_1
  - libuuid=1.41.5=h5eee18b_0
  - ncurses=6.4=h6a678d5_0
  - openssl=3.0.14=h5eee18b_0
  - pip=24.0=py310h06a4308_0
  - python=3.10.14=h955ad1f_1
  - readline=8.2=h5eee18b_0
  - sqlite=3.45.3=h5eee18b_0
  - tk=8.6.14=h39e8969_0
  - wheel=0.43.0=py310h06a4308_0
  - xz=5.4.6=h5eee18b_1
  - zlib=1.2.13=h5eee18b_1
  - pip:
      - absl-py==2.1.0
      - aiofiles==23.2.1
      - aiohttp==3.9.5
      - aiosignal==1.3.1
      - albucore==0.0.12
      - albumentations==1.4.11
      - aliyun-python-sdk-core==2.16.0
      - aliyun-python-sdk-kms==2.16.5
      - altair==5.3.0
      - annotated-types==0.7.0
      - anyio==4.4.0
      - astor==0.8.1
      - asttokens==2.4.1
      - async-timeout==4.0.3
      - attrs==23.2.0
      - backports-tarfile==1.2.0
      - beautifulsoup4==4.12.3
      - build==1.2.2.post1
      - certifi==2024.7.4
      - cffi==1.17.1
      - chardet==4.0.0
      - charset-normalizer==3.3.2
      - click==8.1.7
      - cmake==3.30.4
      - comm==0.2.2
      - contourpy==1.2.1
      - crcmod==1.7
      - cryptography==43.0.1
      - cycler==0.12.1
      - cython==3.0.10
      - debugpy==1.8.2
      - decorator==5.1.1
      - dnspython==2.6.1
      - doclayout-yolo==0.0.2
      - docutils==0.21.2
      - email-validator==2.2.0
      - eval-type-backport==0.2.0
      - exceptiongroup==1.2.1
      - executing==2.0.1
      - fastapi==0.111.0
      - fastapi-cli==0.0.4
      - ffmpy==0.3.2
      - filelock==3.14.0
      - filetype==1.2.0
      - fire==0.6.0
      - flatbuffers==24.3.25
      - fonttools==4.53.0
      - frozenlist==1.4.1
      - fsspec==2024.6.1
      - gitdb==4.0.11
      - gitpython==3.1.43
      - grad-cam==1.5.3
      - gradio==4.31.5
      - gradio-client==0.16.4
      - grpcio==1.65.1
      - h11==0.14.0
      - httpcore==1.0.5
      - httptools==0.6.1
      - httpx==0.27.0
      - huggingface-hub==0.23.2
      - idna==3.7
      - imageio==2.34.2
      - imgaug==0.4.0
      - importlib-metadata==8.5.0
      - importlib-resources==6.4.0
      - ipykernel==6.29.5
      - ipython==8.26.0
      - jaraco-classes==3.4.0
      - jaraco-context==6.0.1
      - jaraco-functools==4.1.0
      - jedi==0.19.1
      - jeepney==0.8.0
      - jinja2==3.1.4
      - jmespath==0.10.0
      - joblib==1.4.2
      - jq==1.7.0
      - jsonschema==4.22.0
      - jsonschema-specifications==2023.12.1
      - jupyter-client==8.6.2
      - jupyter-core==5.7.2
      - keyring==25.4.1
      - kiwisolver==1.4.5
      - lapx==0.5.9.post1
      - lazy-loader==0.4
      - lit==18.1.8
      - lmdb==1.5.1
      - loguru==0.7.2
      - lxml==5.2.2
      - markdown==3.6
      - markdown-it-py==3.0.0
      - markupsafe==2.1.5
      - matplotlib==3.9.0
      - matplotlib-inline==0.1.7
      - mdurl==0.1.2
      - more-itertools==10.5.0
      - mpmath==1.3.0
      - multidict==6.0.5
      - nest-asyncio==1.6.0
      - networkx==3.3
      - nh3==0.2.18
      - numpy==1.26.4
      - nvidia-cublas-cu12==12.4.5.8
      - nvidia-cuda-cupti-cu12==12.4.127
      - nvidia-cuda-nvrtc-cu12==12.4.127
      - nvidia-cuda-runtime-cu12==12.4.127
      - nvidia-cudnn-cu12==9.1.0.70
      - nvidia-cufft-cu12==11.2.1.3
      - nvidia-curand-cu12==10.3.5.147
      - nvidia-cusolver-cu12==11.6.1.9
      - nvidia-cusparse-cu12==12.3.1.170
      - nvidia-nccl-cu12==2.21.5
      - nvidia-nvjitlink-cu12==12.4.127
      - nvidia-nvtx-cu12==12.4.127
      - onnx==1.14.0
      - onnxruntime==1.15.1
      - onnxruntime-gpu==1.16.3
      - onnxslim==0.1.31
      - opencv-contrib-python==4.10.0.84
      - opencv-python==4.9.0.80
      - opencv-python-headless==4.10.0.84
      - openxlab==0.1.2
      - opt-einsum==3.3.0
      - orjson==3.10.6
      - oss2==2.17.0
      - packaging==24.1
      - paddleocr==2.8.1
      - paddlepaddle-gpu==2.6.1
      - pandas==2.2.2
      - parso==0.8.4
      - pexpect==4.9.0
      - pillow==10.3.0
      - pkginfo==1.10.0
      - platformdirs==4.2.2
      - prompt-toolkit==3.0.47
      - protobuf==3.20.3
      - psutil==5.9.8
      - ptyprocess==0.7.0
      - pure-eval==0.2.2
      - py-cpuinfo==9.0.0
      - pyarrow==17.0.0
      - pybboxes==0.1.6
      - pyclipper==1.3.0.post5
      - pycocotools==2.0.7
      - pycparser==2.22
      - pycryptodome==3.21.0
      - pydantic==2.8.2
      - pydantic-core==2.20.1
      - pydub==0.25.1
      - pygments==2.18.0
      - pymupdf==1.24.7
      - pymupdfb==1.24.6
      - pyparsing==3.1.2
      - pyproject-hooks==1.2.0
      - pytesseract==0.3.10
      - python-docx==1.1.2
      - python-multipart==0.0.9
      - pytz==2023.4
      - pyyaml==6.0.1
      - pyzmq==26.0.3
      - rapidfuzz==3.9.4
      - readme-renderer==44.0
      - referencing==0.35.1
      - requests==2.28.2
      - requests-toolbelt==1.0.0
      - rfc3986==2.0.0
      - rich==13.4.2
      - roboflow==1.1.36
      - rpds-py==0.18.1
      - ruff==0.5.0
      - safetensors==0.4.3
      - sahi==0.11.18
      - scikit-image==0.24.0
      - scikit-learn==1.5.1
      - scipy==1.13.1
      - seaborn==0.13.2
      - secretstorage==3.3.3
      - semantic-version==2.10.0
      - setuptools==60.2.0
      - shapely==2.0.5
      - shellingham==1.5.4
      - smmap==5.0.1
      - sniffio==1.3.1
      - soupsieve==2.5
      - stack-data==0.6.3
      - starlette==0.37.2
      - sympy==1.13.1
      - tensorboard==2.17.0
      - tensorboard-data-server==0.7.2
      - tensorrt==8.6.1
      - tensorrt-bindings==8.6.1
      - tensorrt-libs==8.6.1
      - termcolor==2.4.0
      - terminaltables==3.1.10
      - thop==0.1.1-2209072238
      - threadpoolctl==3.5.0
      - tifffile==2024.7.2
      - timm==1.0.9
      - tomli==2.0.1
      - tomlkit==0.12.0
      - toolz==0.12.1
      - torch==2.5.0
      - torch-geometric==2.5.3
      - torchvision==0.20.0
      - tornado==6.4.1
      - tqdm==4.65.2
      - traitlets==5.14.3
      - triton==3.1.0
      - ttach==0.0.3
      - twine==5.1.1
      - typer==0.12.3
      - typing-extensions==4.12.2
      - tzdata==2024.1
      - ujson==5.10.0
      - ultralytics==8.1.34
      - urllib3==1.26.20
      - uvicorn==0.30.1
      - uvloop==0.19.0
      - watchfiles==0.22.0
      - wcwidth==0.2.13
      - websockets==11.0.3
      - werkzeug==3.0.3
      - yarl==1.9.4
      - yolov10==0.0.1
      - yolov5==7.0.13
      - zipp==3.20.2

@agombert
Author

Hey @JulioZhao97, thanks for your help. I'll try a couple of things to see if I can fix the problem and let you know ASAP!

@agombert
Author

agombert commented Dec 24, 2024

OK @JulioZhao97, I found the problem!! 😌

It came from a 🐍 conda install -c conda-forge datasets -y that I ran after activating the environment, in order to work with the HF datasets library before training. It was mixing things up in the environment. Adding datasets to the pyproject.toml instead solved the issue!! 👍
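
For anyone hitting the same thing, a rough way to spot this kind of conda/pip mix-up is to look for packages that have more than one installed copy and to see which tool installed each. This is a sketch only: it relies on the `INSTALLER` file that pip writes into a package's metadata directory (conda-built packages often, but not always, record `conda` there), and the helper names are hypothetical.

```python
import re
from collections import defaultdict
from importlib import metadata


def installers_by_package(dists=None):
    """Group installed distributions by normalized name.

    Each entry is a set of (version, installer) pairs, one per installed copy.
    """
    if dists is None:
        dists = metadata.distributions()
    found = defaultdict(set)
    for dist in dists:
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if not name:
            continue  # skip broken metadata directories
        installer = (dist.read_text("INSTALLER") or "unknown").strip().lower()
        normalized = re.sub(r"[-_.]+", "-", name).lower()
        found[normalized].add((dist.version, installer))
    return dict(found)


def conflicting_packages(found):
    """Keep only packages with more than one copy (different version or installer)."""
    return {name: sorted(copies) for name, copies in found.items() if len(copies) > 1}
```

Calling `conflicting_packages(installers_by_package())` in the training environment lists any package that shows up twice. An empty result does not prove the environment is clean, since a conda install can also silently replace a pip-installed copy.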

@agombert
Author

Thank you very much for your help! 🙏
