RuntimeError: Caught RuntimeError in pin memory thread for device 0 #642

Open
johnlockejrr opened this issue Sep 19, 2024 · 19 comments

@johnlockejrr

Specs:

kraken, version 5.2.10.dev2
Python 3.11.10
11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
NVIDIA GeForce RTX 3060 12GB

Training on a small amount of data, about 35 pages. It worked with 25 pages.

ketos segtrain -d cuda:0 -f page -t output.txt -q early --min-epochs 100 --resize both -tl -i biblialong02_se3_2_tl.mlmodel -o /home/incognito/kraken-train/teyman_print/teyman_print_blong/teyman_print_blong_tl_v1 --pad 20 20

Error:

(kraken-5.2.9) incognito@DESKTOP-H1BS9PO:~/kraken-train/teyman_print$ ketos segtrain -d cuda:0 -f page -t output.txt -q early --min-epochs 100 --resize both -tl -i biblialong02_se3_2_tl.mlmodel -o /home/incognito/kraken-train/teyman_print/teyman_print_blong/teyman_print_blong_tl_v1 --pad 20 20
Training line types:
  textline      2       368
Training region types:
  textzone      3       35
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
You are using a CUDA device ('NVIDIA GeForce RTX 3060') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
┏━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃    ┃ Name              ┃ Type                     ┃ Params ┃                      In sizes ┃                Out sizes ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 0  │ net               │ MultiParamSequential     │  1.3 M │             [1, 1, 1800, 300] │   [[1, 4, 450, 75], '?'] │
│ 1  │ net.C_0           │ ActConv2D                │  3.2 K │      [[1, 1, 1800, 300], '?'] │ [[1, 64, 900, 150], '?'] │
│ 2  │ net.Gn_1          │ GroupNorm                │    128 │ [[1, 64, 900, 150], '?', '?'] │ [[1, 64, 900, 150], '?'] │
│ 3  │ net.C_2           │ ActConv2D                │ 73.9 K │ [[1, 64, 900, 150], '?', '?'] │ [[1, 128, 450, 75], '?'] │
│ 4  │ net.Gn_3          │ GroupNorm                │    256 │ [[1, 128, 450, 75], '?', '?'] │ [[1, 128, 450, 75], '?'] │
│ 5  │ net.C_4           │ ActConv2D                │  147 K │ [[1, 128, 450, 75], '?', '?'] │ [[1, 128, 450, 75], '?'] │
│ 6  │ net.Gn_5          │ GroupNorm                │    256 │ [[1, 128, 450, 75], '?', '?'] │ [[1, 128, 450, 75], '?'] │
│ 7  │ net.C_6           │ ActConv2D                │  295 K │ [[1, 128, 450, 75], '?', '?'] │ [[1, 256, 450, 75], '?'] │
│ 8  │ net.Gn_7          │ GroupNorm                │    512 │ [[1, 256, 450, 75], '?', '?'] │ [[1, 256, 450, 75], '?'] │
│ 9  │ net.C_8           │ ActConv2D                │  590 K │ [[1, 256, 450, 75], '?', '?'] │ [[1, 256, 450, 75], '?'] │
│ 10 │ net.Gn_9          │ GroupNorm                │    512 │ [[1, 256, 450, 75], '?', '?'] │ [[1, 256, 450, 75], '?'] │
│ 11 │ net.L_10          │ TransposedSummarizingRNN │ 74.2 K │ [[1, 256, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 12 │ net.L_11          │ TransposedSummarizingRNN │ 25.1 K │  [[1, 64, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 13 │ net.C_12          │ ActConv2D                │  2.1 K │  [[1, 64, 450, 75], '?', '?'] │  [[1, 32, 450, 75], '?'] │
│ 14 │ net.Gn_13         │ GroupNorm                │     64 │  [[1, 32, 450, 75], '?', '?'] │  [[1, 32, 450, 75], '?'] │
│ 15 │ net.L_14          │ TransposedSummarizingRNN │ 16.9 K │  [[1, 32, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 16 │ net.L_15          │ TransposedSummarizingRNN │ 25.1 K │  [[1, 64, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 17 │ net.C_16          │ ActConv2D                │ 18.5 K │  [[1, 64, 450, 75], '?', '?'] │  [[1, 32, 450, 75], '?'] │
│ 18 │ net.Gn_17         │ GroupNorm                │     64 │  [[1, 32, 450, 75], '?', '?'] │  [[1, 32, 450, 75], '?'] │
│ 19 │ net.C_18          │ ActConv2D                │ 18.5 K │  [[1, 32, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 20 │ net.Gn_19         │ GroupNorm                │    128 │  [[1, 64, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 21 │ net.l_20          │ ActConv2D                │    260 │  [[1, 64, 450, 75], '?', '?'] │   [[1, 4, 450, 75], '?'] │
│ 22 │ val_px_accuracy   │ MultilabelAccuracy       │      0 │                             ? │                        ? │
│ 23 │ val_mean_accuracy │ MultilabelAccuracy       │      0 │                             ? │                        ? │
│ 24 │ val_mean_iu       │ MultilabelJaccardIndex   │      0 │                             ? │                        ? │
│ 25 │ val_freq_iu       │ MultilabelJaccardIndex   │      0 │                             ? │                        ? │
└────┴───────────────────┴──────────────────────────┴────────┴───────────────────────────────┴──────────────────────────┘
Trainable params: 1.3 M
Non-trainable params: 0
Total params: 1.3 M
Total estimated model params size (MB): 5
stage 0/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━ 29/31 0:00:04 • 0:00:01 7.64it/s  early_stopping: 0/10 -inf
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/incognito/miniconda3/envs/kraken-5.2.9/bin/ketos:8 in <module>                             │
│                                                                                                  │
│   5 from kraken.ketos import cli                                                                 │
│   6 if __name__ == '__main__':                                                                   │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])                         │
│ ❱ 8 │   sys.exit(cli())                                                                          │
│   9                                                                                              │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/click/core.py:1157 in  │
│ __call__                                                                                         │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/click/core.py:1078 in  │
│ main                                                                                             │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/click/core.py:1688 in  │
│ invoke                                                                                           │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/click/core.py:1434 in  │
│ invoke                                                                                           │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/click/core.py:783 in   │
│ invoke                                                                                           │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/click/decorators.py:33 │
│ in new_func                                                                                      │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/kraken/ketos/segmentat │
│ ion.py:366 in segtrain                                                                           │
│                                                                                                  │
│   363 │   │   │   │   │   │   │   **val_check_interval)                                          │
│   364 │                                                                                          │
│   365 │   with threadpool_limits(limits=threads):                                                │
│ ❱ 366 │   │   trainer.fit(model)                                                                 │
│   367 │                                                                                          │
│   368 │   if model.best_epoch == -1:                                                             │
│   369 │   │   logger.warning('Model did not improve during training.')                           │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/kraken/lib/train.py:12 │
│ 9 in fit                                                                                         │
│                                                                                                  │
│    126 │   │   with warnings.catch_warnings():                                                   │
│    127 │   │   │   warnings.filterwarnings(action='ignore', category=UserWarning,                │
│    128 │   │   │   │   │   │   │   │   │   message='The dataloader,')                            │
│ ❱  129 │   │   │   super().fit(*args, **kwargs)                                                  │
│    130                                                                                           │
│    131                                                                                           │
│    132 class KrakenFreezeBackbone(BaseFinetuning):                                               │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/lightning/pytorch/trai │
│ ner/trainer.py:544 in fit                                                                        │
│                                                                                                  │
│    541 │   │   self.state.fn = TrainerFn.FITTING                                                 │
│    542 │   │   self.state.status = TrainerStatus.RUNNING                                         │
│    543 │   │   self.training = True                                                              │
│ ❱  544 │   │   call._call_and_handle_interrupt(                                                  │
│    545 │   │   │   self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule,  │
│    546 │   │   )                                                                                 │
│    547                                                                                           │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/lightning/pytorch/trai │
│ ner/call.py:44 in _call_and_handle_interrupt                                                     │
│                                                                                                  │
│    41 │   try:                                                                                   │
│    42 │   │   if trainer.strategy.launcher is not None:                                          │
│    43 │   │   │   return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer,    │
│ ❱  44 │   │   return trainer_fn(*args, **kwargs)                                                 │
│    45 │                                                                                          │
│    46 │   except _TunerExitException:                                                            │
│    47 │   │   _call_teardown_hook(trainer)                                                       │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/lightning/pytorch/trai │
│ ner/trainer.py:580 in _fit_impl                                                                  │
│                                                                                                  │
│    577 │   │   │   model_provided=True,                                                          │
│    578 │   │   │   model_connected=self.lightning_module is not None,                            │
│    579 │   │   )                                                                                 │
│ ❱  580 │   │   self._run(model, ckpt_path=ckpt_path)                                             │
│    581 │   │                                                                                     │
│    582 │   │   assert self.state.stopped                                                         │
│    583 │   │   self.training = False                                                             │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/lightning/pytorch/trai │
│ ner/trainer.py:987 in _run                                                                       │
│                                                                                                  │
│    984 │   │   # ----------------------------                                                    │
│    985 │   │   # RUN THE TRAINER                                                                 │
│    986 │   │   # ----------------------------                                                    │
│ ❱  987 │   │   results = self._run_stage()                                                       │
│    988 │   │                                                                                     │
│    989 │   │   # ----------------------------                                                    │
│    990 │   │   # POST-Training CLEAN UP                                                          │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/lightning/pytorch/trai │
│ ner/trainer.py:1033 in _run_stage                                                                │
│                                                                                                  │
│   1030 │   │   │   with isolate_rng():                                                           │
│   1031 │   │   │   │   self._run_sanity_check()                                                  │
│   1032 │   │   │   with torch.autograd.set_detect_anomaly(self._detect_anomaly):                 │
│ ❱ 1033 │   │   │   │   self.fit_loop.run()                                                       │
│   1034 │   │   │   return None                                                                   │
│   1035 │   │   raise RuntimeError(f"Unexpected state {self.state}")                              │
│   1036                                                                                           │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/lightning/pytorch/loop │
│ s/fit_loop.py:205 in run                                                                         │
│                                                                                                  │
│   202 │   │   while not self.done:                                                               │
│   203 │   │   │   try:                                                                           │
│   204 │   │   │   │   self.on_advance_start()                                                    │
│ ❱ 205 │   │   │   │   self.advance()                                                             │
│   206 │   │   │   │   self.on_advance_end()                                                      │
│   207 │   │   │   │   self._restarting = False                                                   │
│   208 │   │   │   except StopIteration:                                                          │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/lightning/pytorch/loop │
│ s/fit_loop.py:363 in advance                                                                     │
│                                                                                                  │
│   360 │   │   │   )                                                                              │
│   361 │   │   with self.trainer.profiler.profile("run_training_epoch"):                          │
│   362 │   │   │   assert self._data_fetcher is not None                                          │
│ ❱ 363 │   │   │   self.epoch_loop.run(self._data_fetcher)                                        │
│   364 │                                                                                          │
│   365 │   def on_advance_end(self) -> None:                                                      │
│   366 │   │   trainer = self.trainer                                                             │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/lightning/pytorch/loop │
│ s/training_epoch_loop.py:140 in run                                                              │
│                                                                                                  │
│   137 │   │   self.on_run_start(data_fetcher)                                                    │
│   138 │   │   while not self.done:                                                               │
│   139 │   │   │   try:                                                                           │
│ ❱ 140 │   │   │   │   self.advance(data_fetcher)                                                 │
│   141 │   │   │   │   self.on_advance_end(data_fetcher)                                          │
│   142 │   │   │   │   self._restarting = False                                                   │
│   143 │   │   │   except StopIteration:                                                          │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/lightning/pytorch/loop │
│ s/training_epoch_loop.py:212 in advance                                                          │
│                                                                                                  │
│   209 │   │   │   batch_idx = data_fetcher._batch_idx                                            │
│   210 │   │   else:                                                                              │
│   211 │   │   │   dataloader_iter = None                                                         │
│ ❱ 212 │   │   │   batch, _, __ = next(data_fetcher)                                              │
│   213 │   │   │   # TODO: we should instead use the batch_idx returned by the fetcher, however   │
│   214 │   │   │   # fetcher state so that the batch_idx is correct after restarting              │
│   215 │   │   │   batch_idx = self.batch_idx + 1                                                 │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/lightning/pytorch/loop │
│ s/fetchers.py:133 in __next__                                                                    │
│                                                                                                  │
│   130 │   │   │   │   self.done = not self.batches                                               │
│   131 │   │   elif not self.done:                                                                │
│   132 │   │   │   # this will run only when no pre-fetching was done.                            │
│ ❱ 133 │   │   │   batch = super().__next__()                                                     │
│   134 │   │   else:                                                                              │
│   135 │   │   │   # the iterator is empty                                                        │
│   136 │   │   │   raise StopIteration                                                            │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/lightning/pytorch/loop │
│ s/fetchers.py:60 in __next__                                                                     │
│                                                                                                  │
│    57 │   │   assert self.iterator is not None                                                   │
│    58 │   │   self._start_profiler()                                                             │
│    59 │   │   try:                                                                               │
│ ❱  60 │   │   │   batch = next(self.iterator)                                                    │
│    61 │   │   except StopIteration:                                                              │
│    62 │   │   │   self.done = True                                                               │
│    63 │   │   │   raise                                                                          │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/lightning/pytorch/util │
│ ities/combined_loader.py:341 in __next__                                                         │
│                                                                                                  │
│   338 │                                                                                          │
│   339 │   def __next__(self) -> _ITERATOR_RETURN:                                                │
│   340 │   │   assert self._iterator is not None                                                  │
│ ❱ 341 │   │   out = next(self._iterator)                                                         │
│   342 │   │   if isinstance(self._iterator, _Sequential):                                        │
│   343 │   │   │   return out                                                                     │
│   344 │   │   out, batch_idx, dataloader_idx = out                                               │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/lightning/pytorch/util │
│ ities/combined_loader.py:78 in __next__                                                          │
│                                                                                                  │
│    75 │   │   out = [None] * n  # values per iterator                                            │
│    76 │   │   for i in range(n):                                                                 │
│    77 │   │   │   try:                                                                           │
│ ❱  78 │   │   │   │   out[i] = next(self.iterators[i])                                           │
│    79 │   │   │   except StopIteration:                                                          │
│    80 │   │   │   │   self._consumed[i] = True                                                   │
│    81 │   │   │   │   if all(self._consumed):                                                    │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/torch/utils/data/datal │
│ oader.py:630 in __next__                                                                         │
│                                                                                                  │
│    627 │   │   │   if self._sampler_iter is None:                                                │
│    628 │   │   │   │   # TODO(https://github.com/pytorch/pytorch/issues/76750)                   │
│    629 │   │   │   │   self._reset()  # type: ignore[call-arg]                                   │
│ ❱  630 │   │   │   data = self._next_data()                                                      │
│    631 │   │   │   self._num_yielded += 1                                                        │
│    632 │   │   │   if self._dataset_kind == _DatasetKind.Iterable and \                          │
│    633 │   │   │   │   │   self._IterableDataset_len_called is not None and \                    │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/torch/utils/data/datal │
│ oader.py:1345 in _next_data                                                                      │
│                                                                                                  │
│   1342 │   │   │   │   self._task_info[idx] += (data,)                                           │
│   1343 │   │   │   else:                                                                         │
│   1344 │   │   │   │   del self._task_info[idx]                                                  │
│ ❱ 1345 │   │   │   │   return self._process_data(data)                                           │
│   1346 │                                                                                         │
│   1347 │   def _try_put_index(self):                                                             │
│   1348 │   │   assert self._tasks_outstanding < self._prefetch_factor * self._num_workers        │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/torch/utils/data/datal │
│ oader.py:1371 in _process_data                                                                   │
│                                                                                                  │
│   1368 │   │   self._rcvd_idx += 1                                                               │
│   1369 │   │   self._try_put_index()                                                             │
│   1370 │   │   if isinstance(data, ExceptionWrapper):                                            │
│ ❱ 1371 │   │   │   data.reraise()                                                                │
│   1372 │   │   return data                                                                       │
│   1373 │                                                                                         │
│   1374 │   def _mark_worker_as_unavailable(self, worker_id, shutdown=False):                     │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/torch/_utils.py:694 in │
│ reraise                                                                                          │
│                                                                                                  │
│   691 │   │   │   # If the exception takes multiple arguments, don't try to                      │
│   692 │   │   │   # instantiate since we don't know how to                                       │
│   693 │   │   │   raise RuntimeError(msg) from None                                              │
│ ❱ 694 │   │   raise exception                                                                    │
│   695                                                                                            │
│   696                                                                                            │
│   697 def _get_available_device_type():                                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Caught RuntimeError in pin memory thread for device 0.
Original Traceback (most recent call last):
  File "/home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 37, in do_one_step
    data = pin_memory(data, device)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 63, in pin_memory
    return type(data)({k: pin_memory(sample, device) for k, sample in data.items()})  # type: ignore[call-arg]
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 63, in <dictcomp>
    return type(data)({k: pin_memory(sample, device) for k, sample in data.items()})  # type: ignore[call-arg]
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 58, in pin_memory
    return data.pin_memory(device)
           ^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
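
For context, the failure is raised in the DataLoader's pin-memory worker, which copies each batch into page-locked host memory before it is transferred to the GPU. The sketch below is a minimal, generic PyTorch reproduction of that code path, not kraken's own data pipeline; the dataset and tensor sizes are made up:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for kraken's segmentation samples (sizes made up).
images = torch.rand(64, 1, 1800, 300)   # single-channel page-sized inputs
targets = torch.rand(64, 4, 450, 75)    # per-class target maps
dataset = TensorDataset(images, targets)

# With pin_memory=True and multiprocessing workers, the DataLoader runs a
# dedicated "pin memory thread" that calls .pin_memory() (a page-locked host
# allocation) on every batch before the copy to the GPU; that is the thread
# named in "Caught RuntimeError in pin memory thread for device 0".
loader = DataLoader(dataset, batch_size=1, num_workers=2, pin_memory=True)

if torch.cuda.is_available():
    for batch, target in loader:
        batch = batch.to('cuda:0', non_blocking=True)
        target = target.to('cuda:0', non_blocking=True)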
@johnlockejrr
Author

Exactly the same config and command works on kraken 5.2.9:

┏━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃    ┃ Name              ┃ Type                     ┃ Params ┃                      In sizes ┃                Out sizes ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 0  │ net               │ MultiParamSequential     │  1.3 M │             [1, 1, 1800, 300] │   [[1, 4, 450, 75], '?'] │
│ 1  │ net.C_0           │ ActConv2D                │  3.2 K │      [[1, 1, 1800, 300], '?'] │ [[1, 64, 900, 150], '?'] │
│ 2  │ net.Gn_1          │ GroupNorm                │    128 │ [[1, 64, 900, 150], '?', '?'] │ [[1, 64, 900, 150], '?'] │
│ 3  │ net.C_2           │ ActConv2D                │ 73.9 K │ [[1, 64, 900, 150], '?', '?'] │ [[1, 128, 450, 75], '?'] │
│ 4  │ net.Gn_3          │ GroupNorm                │    256 │ [[1, 128, 450, 75], '?', '?'] │ [[1, 128, 450, 75], '?'] │
│ 5  │ net.C_4           │ ActConv2D                │  147 K │ [[1, 128, 450, 75], '?', '?'] │ [[1, 128, 450, 75], '?'] │
│ 6  │ net.Gn_5          │ GroupNorm                │    256 │ [[1, 128, 450, 75], '?', '?'] │ [[1, 128, 450, 75], '?'] │
│ 7  │ net.C_6           │ ActConv2D                │  295 K │ [[1, 128, 450, 75], '?', '?'] │ [[1, 256, 450, 75], '?'] │
│ 8  │ net.Gn_7          │ GroupNorm                │    512 │ [[1, 256, 450, 75], '?', '?'] │ [[1, 256, 450, 75], '?'] │
│ 9  │ net.C_8           │ ActConv2D                │  590 K │ [[1, 256, 450, 75], '?', '?'] │ [[1, 256, 450, 75], '?'] │
│ 10 │ net.Gn_9          │ GroupNorm                │    512 │ [[1, 256, 450, 75], '?', '?'] │ [[1, 256, 450, 75], '?'] │
│ 11 │ net.L_10          │ TransposedSummarizingRNN │ 74.2 K │ [[1, 256, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 12 │ net.L_11          │ TransposedSummarizingRNN │ 25.1 K │  [[1, 64, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 13 │ net.C_12          │ ActConv2D                │  2.1 K │  [[1, 64, 450, 75], '?', '?'] │  [[1, 32, 450, 75], '?'] │
│ 14 │ net.Gn_13         │ GroupNorm                │     64 │  [[1, 32, 450, 75], '?', '?'] │  [[1, 32, 450, 75], '?'] │
│ 15 │ net.L_14          │ TransposedSummarizingRNN │ 16.9 K │  [[1, 32, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 16 │ net.L_15          │ TransposedSummarizingRNN │ 25.1 K │  [[1, 64, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 17 │ net.C_16          │ ActConv2D                │ 18.5 K │  [[1, 64, 450, 75], '?', '?'] │  [[1, 32, 450, 75], '?'] │
│ 18 │ net.Gn_17         │ GroupNorm                │     64 │  [[1, 32, 450, 75], '?', '?'] │  [[1, 32, 450, 75], '?'] │
│ 19 │ net.C_18          │ ActConv2D                │ 18.5 K │  [[1, 32, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 20 │ net.Gn_19         │ GroupNorm                │    128 │  [[1, 64, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 21 │ net.l_20          │ ActConv2D                │    260 │  [[1, 64, 450, 75], '?', '?'] │   [[1, 4, 450, 75], '?'] │
│ 22 │ val_px_accuracy   │ MultilabelAccuracy       │      0 │                             ? │                        ? │
│ 23 │ val_mean_accuracy │ MultilabelAccuracy       │      0 │                             ? │                        ? │
│ 24 │ val_mean_iu       │ MultilabelJaccardIndex   │      0 │                             ? │                        ? │
│ 25 │ val_freq_iu       │ MultilabelJaccardIndex   │      0 │                             ? │                        ? │
└────┴───────────────────┴──────────────────────────┴────────┴───────────────────────────────┴──────────────────────────┘
Trainable params: 1.3 M
Non-trainable params: 0
Total params: 1.3 M
Total estimated model params size (MB): 5
stage 0/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 0:00:04 • 0:00:00 7.34it/s val_accuracy: 0.977 val_mean_acc: 0.977 val_mean_iu: 0.001 val_freq_iu: 0.001 early_stopping: 0/10 0.00099
stage 1/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 0:00:06 • 0:00:00 4.41it/s val_accuracy: 0.986 val_mean_acc: 0.986 val_mean_iu: 0.000 val_freq_iu: 0.000 early_stopping: 1/10 0.00099
stage 2/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 0:00:07 • 0:00:00 4.41it/s val_accuracy: 0.993 val_mean_acc: 0.993 val_mean_iu: 0.163 val_freq_iu: 0.481 early_stopping: 0/10 0.16270
stage 3/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 0:00:07 • 0:00:00 4.40it/s val_accuracy: 0.995 val_mean_acc: 0.995 val_mean_iu: 0.223 val_freq_iu: 0.655 early_stopping: 0/10 0.22259
stage 4/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 0:00:06 • 0:00:00 4.42it/s val_accuracy: 0.996 val_mean_acc: 0.996 val_mean_iu: 0.226 val_freq_iu: 0.683 early_stopping: 0/10 0.22590
stage 5/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 0:00:06 • 0:00:00 4.42it/s val_accuracy: 0.996 val_mean_acc: 0.996 val_mean_iu: 0.236 val_freq_iu: 0.710 early_stopping: 0/10 0.23637

@johnlockejrr
Author

Really strange, I can't see my GPU being out of memory:

[screenshot: GPU memory usage during training]

@jesusbft

Try these arguments: --device cuda:0 --batch-size 12

Also, check GPU usage with this command: watch -n 1 nvidia-smi

@johnlockejrr
Author

Error: No such option: --batch-size (Possible options: --resize, --step-size)

(kraken-5.2.9-py3.10) incognito@DESKTOP-H1BS9PO:~/kraken-train/teyman_print$ nvidia-smi
Thu Sep 19 22:26:58 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 561.09         CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        On  |   00000000:04:00.0 Off |                  N/A |
|  0%   29C    P8             11W /  170W |       8MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
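
For what it's worth, you can also check what torch itself reports as free device memory from inside the training environment; a minimal sketch, using the same device index as in the command above:

import torch

# Free/total memory on cuda:0 as seen by the torch runtime,
# independently of what nvidia-smi shows on the host side.
free, total = torch.cuda.mem_get_info(0)
print(f"cuda:0 free: {free / 2**20:.0f} MiB of {total / 2**20:.0f} MiB")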

@johnlockejrr
Author

I train bigger YOLOv8 models in the same environment without any problems.

@johnlockejrr
Author

This doesn't happen in kraken 4.3.13, so it's definitely not my GPU:

[screenshot: the same training running under kraken 4.3.13]

@mittagessen
Owner

mittagessen commented Sep 24, 2024

12GB is fairly close to the 10GB that is usually required to train a segmentation model, so it is possible that torch is running out of memory. Could you try training the model in 16-bit mixed precision with the --precision option as a quick fix?

4.3.12 didn't use Lightning yet and was slightly more memory efficient.
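
For example, roughly your original invocation with the precision flag added (just a sketch of the quick fix, all other options unchanged):

ketos segtrain --precision 16 -d cuda:0 -f page -t output.txt -q early --min-epochs 100 --resize both -tl -i biblialong02_se3_2_tl.mlmodel -o /home/incognito/kraken-train/teyman_print/teyman_print_blong/teyman_print_blong_tl_v1 --pad 20 20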

@johnlockejrr
Author

johnlockejrr commented Sep 25, 2024 via email

@johnlockejrr
Author

It doesn't crash now with --precision 16, but val_freq_iu is nan!

(kraken-5.2.9-py3.10) incognito@DESKTOP-H1BS9PO:~/kraken-train/teyman_print$ ketos segtrain --augment --precision 16 -d cuda:0 -f page -t output.txt -q early -cl --min-epochs 100 -w 0 -s '[1,120,0,1 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 S1(1x0)1,3 Lbx200 Do0.1,2 Lbx200 Do.1,2 Lbx200 Do]' -r 0.0001 -o /home/incognito/kraken-train/teyman_print/teyman_print_scr_cl/teyman_print_scr_cl_v1
Training line types:
  textline      2       399
  default       4       21
Training region types:
  textzone      3       40
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
You are using a CUDA device ('NVIDIA GeForce RTX 3060') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
┏━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃    ┃ Name              ┃ Type                     ┃ Params ┃                      In sizes ┃                Out sizes ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 0  │ net               │ MultiParamSequential     │  4.0 M │              [1, 1, 120, 300] │     [[1, 5, 1, 37], '?'] │
│ 1  │ net.C_0           │ ActConv2D                │  1.3 K │       [[1, 1, 120, 300], '?'] │ [[1, 32, 120, 300], '?'] │
│ 2  │ net.Do_1          │ Dropout                  │      0 │ [[1, 32, 120, 300], '?', '?'] │ [[1, 32, 120, 300], '?'] │
│ 3  │ net.Mp_2          │ MaxPool                  │      0 │ [[1, 32, 120, 300], '?', '?'] │  [[1, 32, 60, 150], '?'] │
│ 4  │ net.C_3           │ ActConv2D                │ 40.0 K │  [[1, 32, 60, 150], '?', '?'] │  [[1, 32, 60, 150], '?'] │
│ 5  │ net.Do_4          │ Dropout                  │      0 │  [[1, 32, 60, 150], '?', '?'] │  [[1, 32, 60, 150], '?'] │
│ 6  │ net.Mp_5          │ MaxPool                  │      0 │  [[1, 32, 60, 150], '?', '?'] │   [[1, 32, 30, 75], '?'] │
│ 7  │ net.C_6           │ ActConv2D                │ 55.4 K │   [[1, 32, 30, 75], '?', '?'] │   [[1, 64, 30, 75], '?'] │
│ 8  │ net.Do_7          │ Dropout                  │      0 │   [[1, 64, 30, 75], '?', '?'] │   [[1, 64, 30, 75], '?'] │
│ 9  │ net.Mp_8          │ MaxPool                  │      0 │   [[1, 64, 30, 75], '?', '?'] │   [[1, 64, 15, 37], '?'] │
│ 10 │ net.C_9           │ ActConv2D                │  110 K │   [[1, 64, 15, 37], '?', '?'] │   [[1, 64, 15, 37], '?'] │
│ 11 │ net.Do_10         │ Dropout                  │      0 │   [[1, 64, 15, 37], '?', '?'] │   [[1, 64, 15, 37], '?'] │
│ 12 │ net.S_11          │ Reshape                  │      0 │   [[1, 64, 15, 37], '?', '?'] │   [[1, 960, 1, 37], '?'] │
│ 13 │ net.L_12          │ TransposedSummarizingRNN │  1.9 M │   [[1, 960, 1, 37], '?', '?'] │   [[1, 400, 1, 37], '?'] │
│ 14 │ net.Do_13         │ Dropout                  │      0 │   [[1, 400, 1, 37], '?', '?'] │   [[1, 400, 1, 37], '?'] │
│ 15 │ net.L_14          │ TransposedSummarizingRNN │  963 K │   [[1, 400, 1, 37], '?', '?'] │   [[1, 400, 1, 37], '?'] │
│ 16 │ net.Do_15         │ Dropout                  │      0 │   [[1, 400, 1, 37], '?', '?'] │   [[1, 400, 1, 37], '?'] │
│ 17 │ net.L_16          │ TransposedSummarizingRNN │  963 K │   [[1, 400, 1, 37], '?', '?'] │   [[1, 400, 1, 37], '?'] │
│ 18 │ net.Do_17         │ Dropout                  │      0 │   [[1, 400, 1, 37], '?', '?'] │   [[1, 400, 1, 37], '?'] │
│ 19 │ net.l_18          │ ActConv2D                │  2.0 K │   [[1, 400, 1, 37], '?', '?'] │     [[1, 5, 1, 37], '?'] │
│ 20 │ val_px_accuracy   │ MultilabelAccuracy       │      0 │                             ? │                        ? │
│ 21 │ val_mean_accuracy │ MultilabelAccuracy       │      0 │                             ? │                        ? │
│ 22 │ val_mean_iu       │ MultilabelJaccardIndex   │      0 │                             ? │                        ? │
│ 23 │ val_freq_iu       │ MultilabelJaccardIndex   │      0 │                             ? │                        ? │
└────┴───────────────────┴──────────────────────────┴────────┴───────────────────────────────┴──────────────────────────┘
Trainable params: 4.0 M
Non-trainable params: 0
Total params: 4.0 M
Total estimated model params size (MB): 15
stage 0/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:02 • 0:00:00 18.71it/s val_accuracy: 1.000 val_mean_acc: 1.000 val_mean_iu: 0.000 val_freq_iu: nan early_stopping: 0/10 0.00000
stage 1/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:02 • 0:00:00 17.66it/s val_accuracy: 0.995 val_mean_acc: 0.995 val_mean_iu: 0.000 val_freq_iu: 0.000 early_stopping: 1/10 0.00000
stage 2/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:02 • 0:00:00 17.15it/s val_accuracy: 1.000 val_mean_acc: 1.000 val_mean_iu: 0.000 val_freq_iu: nan early_stopping: 2/10 0.00000
stage 3/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:02 • 0:00:00 17.10it/s val_accuracy: 0.995 val_mean_acc: 0.995 val_mean_iu: 0.000 val_freq_iu: 0.000 early_stopping: 3/10 0.00000
stage 4/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:02 • 0:00:00 17.20it/s val_accuracy: 0.965 val_mean_acc: 0.965 val_mean_iu: 0.000 val_freq_iu: 0.000 early_stopping: 4/10 0.00000
stage 5/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:02 • 0:00:00 16.60it/s val_accuracy: 1.000 val_mean_acc: 1.000 val_mean_iu: 0.000 val_freq_iu: nan early_stopping: 5/10 0.00000
stage 6/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:02 • 0:00:00 16.76it/s val_accuracy: 1.000 val_mean_acc: 1.000 val_mean_iu: 0.000 val_freq_iu: nan early_stopping: 6/10 0.00000
stage 7/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:02 • 0:00:00 17.43it/s val_accuracy: 1.000 val_mean_acc: 1.000 val_mean_iu: 0.000 val_freq_iu: nan early_stopping: 7/10 0.00000
stage 8/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:02 • 0:00:00 17.04it/s val_accuracy: 1.000 val_mean_acc: 1.000 val_mean_iu: 0.000 val_freq_iu: nan early_stopping: 8/10 0.00000
stage 9/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:02 • 0:00:00 16.98it/s val_accuracy: 1.000 val_mean_acc: 1.000 val_mean_iu: 0.000 val_freq_iu: nan early_stopping: 9/10 0.00000
stage 10/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:02 • 0:00:00 16.62it/s val_accuracy: 1.000 val_mean_acc: 1.000 val_mean_iu: 0.000 val_freq_iu: nan early_stopping: 10/10 0.00000
stage 11/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/36 0:00:00 • -:--:-- 0.00it/s  early_stopping: 10/10 0.00000
Trainer was signaled to stop but the required `min_epochs=100` or `min_steps=None` has not been met. Training will continue...
stage 11/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:02 • 0:00:00 15.30it/s val_accuracy: 1.000 val_mean_acc: 1.000 val_mean_iu: 0.000 val_freq_iu: nan early_stopping: 11/10 0.00000
stage 12/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:02 • 0:00:00 14.36it/s val_accuracy: 1.000 val_mean_acc: 1.000 val_mean_iu: 0.000 val_freq_iu: nan early_stopping: 12/10 0.00000
stage 13/∞ ━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━ 22/36 0:00:01 • 0:00:01 17.44it/s val_accuracy: 1.000 val_mean_acc: 1.000 val_mean_iu: 0.000 val_freq_iu: nan early_stopping: 12/10 0.00000

@dstoekl

dstoekl commented Sep 27, 2024

Why do you segtrain with such a small input mask and an architecture that looks more like the recognition model?

@johnlockejrr
Author

johnlockejrr commented Sep 27, 2024

I'm running different tests on a small model. Is that arch for recognition? I want to train from scratch.

@dstoekl

dstoekl commented Sep 27, 2024

Yes, you are trying to train with a recognition architecture. To me that makes no sense.

@johnlockejrr
Author

I've never trained a segmentation model from scratch. What's the arch for segtrain then?

@dstoekl

dstoekl commented Sep 27, 2024

Just don't pass -s and it will train with the default.

@johnlockejrr
Author

OK, found it in the docs: '[1,1200,0,3 Cr7,7,64,2,2 Gn32 Cr3,3,128,2,2 Gn32 Cr3,3,128 Gn32 Cr3,3,256 Gn32]'

@dstoekl

dstoekl commented Sep 27, 2024

The default arch for segtrain is shown if you type ketos segtrain --help. The one you quoted above is outdated.

@johnlockejrr
Author

It's working like this: ketos segtrain --precision 16 -d cuda:0 -f page -t output.txt -q early -tl --min-epochs 100 -o /home/incognito/kraken-train/teyman_print/teyman_print_scr_cl/teyman_print_scr_cl_v1

stage 0/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:07 • 0:00:00 5.02it/s val_accuracy: 0.988 val_mean_acc: 0.988 val_mean_iu: 0.000 val_freq_iu: 0.000 early_stopping: 0/10 0.00000
stage 1/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:07 • 0:00:00 4.94it/s val_accuracy: 0.988 val_mean_acc: 0.988 val_mean_iu: 0.000 val_freq_iu: 0.000 early_stopping: 1/10 0.00000
stage 2/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:07 • 0:00:00 4.97it/s val_accuracy: 0.988 val_mean_acc: 0.988 val_mean_iu: 0.000 val_freq_iu: 0.000 early_stopping: 2/10 0.00000
stage 3/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:07 • 0:00:00 4.86it/s val_accuracy: 0.988 val_mean_acc: 0.988 val_mean_iu: 0.000 val_freq_iu: 0.000 early_stopping: 3/10 0.00000
stage 4/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:07 • 0:00:00 4.93it/s val_accuracy: 0.988 val_mean_acc: 0.988 val_mean_iu: 0.000 val_freq_iu: 0.000 early_stopping: 4/10 0.00000
stage 5/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:07 • 0:00:00 4.85it/s val_accuracy: 0.988 val_mean_acc: 0.988 val_mean_iu: 0.000 val_freq_iu: 0.000 early_stopping: 5/10 0.00000
stage 6/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:07 • 0:00:00 4.89it/s val_accuracy: 0.988 val_mean_acc: 0.988 val_mean_iu: 0.000 val_freq_iu: 0.000 early_stopping: 6/10 0.00000
stage 7/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:07 • 0:00:00 4.78it/s val_accuracy: 0.988 val_mean_acc: 0.988 val_mean_iu: 0.000 val_freq_iu: 0.000 early_stopping: 0/10 0.00001
stage 8/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:08 • 0:00:00 4.55it/s val_accuracy: 0.989 val_mean_acc: 0.989 val_mean_iu: 0.024 val_freq_iu: 0.090 early_stopping: 0/10 0.02368
stage 9/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:08 • 0:00:00 4.42it/s val_accuracy: 0.991 val_mean_acc: 0.991 val_mean_iu: 0.083 val_freq_iu: 0.314 early_stopping: 0/10 0.08272
stage 10/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:07 • 0:00:00 4.88it/s val_accuracy: 0.993 val_mean_acc: 0.993 val_mean_iu: 0.110 val_freq_iu: 0.415 early_stopping: 0/10 0.10959
stage 11/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:07 • 0:00:00 4.94it/s val_accuracy: 0.994 val_mean_acc: 0.994 val_mean_iu: 0.141 val_freq_iu: 0.534 early_stopping: 0/10 0.14092

@dstoekl

dstoekl commented Sep 27, 2024

Training on 36 images is very little data if you are not training on top of an existing model. I do not remember whether ketos segtrain automatically loads the blla base model as a point of departure, but I believe it does not.
Also, I don't think you need to set precision to 16 if you use the default arch for segtrain.
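
For example, something along these lines, i.e. fine-tuning on top of an existing segmentation model with the -i/--resize options already used in the first post (the blla.mlmodel path and output name here are only placeholders):

ketos segtrain -d cuda:0 -f page -t output.txt -q early --min-epochs 100 -i blla.mlmodel --resize both -o teyman_print_ft_v1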

@johnlockejrr
Author

johnlockejrr commented Sep 27, 2024

I pretrain the model because the small dataset was made by hand by me, so I'm training a model to help me transcribe more ground truth.
@mittagessen told me to set precision to 16 because training crashed on one of my GPUs.
