RuntimeError: Caught RuntimeError in pin memory thread for device 0 #642

Open
johnlockejrr opened this issue Sep 19, 2024 · 19 comments

@johnlockejrr

Specs:

kraken, version 5.2.10.dev2
Python 3.11.10
11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
NVIDIA GeForce RTX 3060 12GB

Training on a small amount of data, about 35 pages. It worked with 25 pages.

ketos segtrain -d cuda:0 -f page -t output.txt -q early --min-epochs 100 --resize both -tl -i biblialong02_se3_2_tl.mlmodel -o /home/incognito/kraken-train/teyman_print/teyman_print_blong/teyman_print_blong_tl_v1 --pad 20 20

Error:

(kraken-5.2.9) incognito@DESKTOP-H1BS9PO:~/kraken-train/teyman_print$ ketos segtrain -d cuda:0 -f page -t output.txt -q early --min-epochs 100 --resize both -tl -i biblialong02_se3_2_tl.mlmodel -o /home/incognito/kraken-train/teyman_print/teyman_print_blong/teyman_print_blong_tl_v1 --pad 20 20
Training line types:
  textline      2       368
Training region types:
  textzone      3       35
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
You are using a CUDA device ('NVIDIA GeForce RTX 3060') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
┏━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃    ┃ Name              ┃ Type                     ┃ Params ┃                      In sizes ┃                Out sizes ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 0  │ net               │ MultiParamSequential     │  1.3 M │             [1, 1, 1800, 300] │   [[1, 4, 450, 75], '?'] │
│ 1  │ net.C_0           │ ActConv2D                │  3.2 K │      [[1, 1, 1800, 300], '?'] │ [[1, 64, 900, 150], '?'] │
│ 2  │ net.Gn_1          │ GroupNorm                │    128 │ [[1, 64, 900, 150], '?', '?'] │ [[1, 64, 900, 150], '?'] │
│ 3  │ net.C_2           │ ActConv2D                │ 73.9 K │ [[1, 64, 900, 150], '?', '?'] │ [[1, 128, 450, 75], '?'] │
│ 4  │ net.Gn_3          │ GroupNorm                │    256 │ [[1, 128, 450, 75], '?', '?'] │ [[1, 128, 450, 75], '?'] │
│ 5  │ net.C_4           │ ActConv2D                │  147 K │ [[1, 128, 450, 75], '?', '?'] │ [[1, 128, 450, 75], '?'] │
│ 6  │ net.Gn_5          │ GroupNorm                │    256 │ [[1, 128, 450, 75], '?', '?'] │ [[1, 128, 450, 75], '?'] │
│ 7  │ net.C_6           │ ActConv2D                │  295 K │ [[1, 128, 450, 75], '?', '?'] │ [[1, 256, 450, 75], '?'] │
│ 8  │ net.Gn_7          │ GroupNorm                │    512 │ [[1, 256, 450, 75], '?', '?'] │ [[1, 256, 450, 75], '?'] │
│ 9  │ net.C_8           │ ActConv2D                │  590 K │ [[1, 256, 450, 75], '?', '?'] │ [[1, 256, 450, 75], '?'] │
│ 10 │ net.Gn_9          │ GroupNorm                │    512 │ [[1, 256, 450, 75], '?', '?'] │ [[1, 256, 450, 75], '?'] │
│ 11 │ net.L_10          │ TransposedSummarizingRNN │ 74.2 K │ [[1, 256, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 12 │ net.L_11          │ TransposedSummarizingRNN │ 25.1 K │  [[1, 64, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 13 │ net.C_12          │ ActConv2D                │  2.1 K │  [[1, 64, 450, 75], '?', '?'] │  [[1, 32, 450, 75], '?'] │
│ 14 │ net.Gn_13         │ GroupNorm                │     64 │  [[1, 32, 450, 75], '?', '?'] │  [[1, 32, 450, 75], '?'] │
│ 15 │ net.L_14          │ TransposedSummarizingRNN │ 16.9 K │  [[1, 32, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 16 │ net.L_15          │ TransposedSummarizingRNN │ 25.1 K │  [[1, 64, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 17 │ net.C_16          │ ActConv2D                │ 18.5 K │  [[1, 64, 450, 75], '?', '?'] │  [[1, 32, 450, 75], '?'] │
│ 18 │ net.Gn_17         │ GroupNorm                │     64 │  [[1, 32, 450, 75], '?', '?'] │  [[1, 32, 450, 75], '?'] │
│ 19 │ net.C_18          │ ActConv2D                │ 18.5 K │  [[1, 32, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 20 │ net.Gn_19         │ GroupNorm                │    128 │  [[1, 64, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 21 │ net.l_20          │ ActConv2D                │    260 │  [[1, 64, 450, 75], '?', '?'] │   [[1, 4, 450, 75], '?'] │
│ 22 │ val_px_accuracy   │ MultilabelAccuracy       │      0 │                             ? │                        ? │
│ 23 │ val_mean_accuracy │ MultilabelAccuracy       │      0 │                             ? │                        ? │
│ 24 │ val_mean_iu       │ MultilabelJaccardIndex   │      0 │                             ? │                        ? │
│ 25 │ val_freq_iu       │ MultilabelJaccardIndex   │      0 │                             ? │                        ? │
└────┴───────────────────┴──────────────────────────┴────────┴───────────────────────────────┴──────────────────────────┘
Trainable params: 1.3 M
Non-trainable params: 0
Total params: 1.3 M
Total estimated model params size (MB): 5
stage 0/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━ 29/31 0:00:04 • 0:00:01 7.64it/s  early_stopping: 0/10 -inf
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/incognito/miniconda3/envs/kraken-5.2.9/bin/ketos:8 in <module>                             │
│                                                                                                  │
│   5 from kraken.ketos import cli                                                                 │
│   6 if __name__ == '__main__':                                                                   │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])                         │
│ ❱ 8 │   sys.exit(cli())                                                                          │
│   9                                                                                              │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/click/core.py:1157 in  │
│ __call__                                                                                         │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/click/core.py:1078 in  │
│ main                                                                                             │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/click/core.py:1688 in  │
│ invoke                                                                                           │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/click/core.py:1434 in  │
│ invoke                                                                                           │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/click/core.py:783 in   │
│ invoke                                                                                           │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/click/decorators.py:33 │
│ in new_func                                                                                      │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/kraken/ketos/segmentat │
│ ion.py:366 in segtrain                                                                           │
│                                                                                                  │
│   363 │   │   │   │   │   │   │   **val_check_interval)                                          │
│   364 │                                                                                          │
│   365 │   with threadpool_limits(limits=threads):                                                │
│ ❱ 366 │   │   trainer.fit(model)                                                                 │
│   367 │                                                                                          │
│   368 │   if model.best_epoch == -1:                                                             │
│   369 │   │   logger.warning('Model did not improve during training.')                           │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/kraken/lib/train.py:12 │
│ 9 in fit                                                                                         │
│                                                                                                  │
│    126 │   │   with warnings.catch_warnings():                                                   │
│    127 │   │   │   warnings.filterwarnings(action='ignore', category=UserWarning,                │
│    128 │   │   │   │   │   │   │   │   │   message='The dataloader,')                            │
│ ❱  129 │   │   │   super().fit(*args, **kwargs)                                                  │
│    130                                                                                           │
│    131                                                                                           │
│    132 class KrakenFreezeBackbone(BaseFinetuning):                                               │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/lightning/pytorch/trai │
│ ner/trainer.py:544 in fit                                                                        │
│                                                                                                  │
│    541 │   │   self.state.fn = TrainerFn.FITTING                                                 │
│    542 │   │   self.state.status = TrainerStatus.RUNNING                                         │
│    543 │   │   self.training = True                                                              │
│ ❱  544 │   │   call._call_and_handle_interrupt(                                                  │
│    545 │   │   │   self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule,  │
│    546 │   │   )                                                                                 │
│    547                                                                                           │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/lightning/pytorch/trai │
│ ner/call.py:44 in _call_and_handle_interrupt                                                     │
│                                                                                                  │
│    41 │   try:                                                                                   │
│    42 │   │   if trainer.strategy.launcher is not None:                                          │
│    43 │   │   │   return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer,    │
│ ❱  44 │   │   return trainer_fn(*args, **kwargs)                                                 │
│    45 │                                                                                          │
│    46 │   except _TunerExitException:                                                            │
│    47 │   │   _call_teardown_hook(trainer)                                                       │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/lightning/pytorch/trai │
│ ner/trainer.py:580 in _fit_impl                                                                  │
│                                                                                                  │
│    577 │   │   │   model_provided=True,                                                          │
│    578 │   │   │   model_connected=self.lightning_module is not None,                            │
│    579 │   │   )                                                                                 │
│ ❱  580 │   │   self._run(model, ckpt_path=ckpt_path)                                             │
│    581 │   │                                                                                     │
│    582 │   │   assert self.state.stopped                                                         │
│    583 │   │   self.training = False                                                             │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/lightning/pytorch/trai │
│ ner/trainer.py:987 in _run                                                                       │
│                                                                                                  │
│    984 │   │   # ----------------------------                                                    │
│    985 │   │   # RUN THE TRAINER                                                                 │
│    986 │   │   # ----------------------------                                                    │
│ ❱  987 │   │   results = self._run_stage()                                                       │
│    988 │   │                                                                                     │
│    989 │   │   # ----------------------------                                                    │
│    990 │   │   # POST-Training CLEAN UP                                                          │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/lightning/pytorch/trai │
│ ner/trainer.py:1033 in _run_stage                                                                │
│                                                                                                  │
│   1030 │   │   │   with isolate_rng():                                                           │
│   1031 │   │   │   │   self._run_sanity_check()                                                  │
│   1032 │   │   │   with torch.autograd.set_detect_anomaly(self._detect_anomaly):                 │
│ ❱ 1033 │   │   │   │   self.fit_loop.run()                                                       │
│   1034 │   │   │   return None                                                                   │
│   1035 │   │   raise RuntimeError(f"Unexpected state {self.state}")                              │
│   1036                                                                                           │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/lightning/pytorch/loop │
│ s/fit_loop.py:205 in run                                                                         │
│                                                                                                  │
│   202 │   │   while not self.done:                                                               │
│   203 │   │   │   try:                                                                           │
│   204 │   │   │   │   self.on_advance_start()                                                    │
│ ❱ 205 │   │   │   │   self.advance()                                                             │
│   206 │   │   │   │   self.on_advance_end()                                                      │
│   207 │   │   │   │   self._restarting = False                                                   │
│   208 │   │   │   except StopIteration:                                                          │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/lightning/pytorch/loop │
│ s/fit_loop.py:363 in advance                                                                     │
│                                                                                                  │
│   360 │   │   │   )                                                                              │
│   361 │   │   with self.trainer.profiler.profile("run_training_epoch"):                          │
│   362 │   │   │   assert self._data_fetcher is not None                                          │
│ ❱ 363 │   │   │   self.epoch_loop.run(self._data_fetcher)                                        │
│   364 │                                                                                          │
│   365 │   def on_advance_end(self) -> None:                                                      │
│   366 │   │   trainer = self.trainer                                                             │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/lightning/pytorch/loop │
│ s/training_epoch_loop.py:140 in run                                                              │
│                                                                                                  │
│   137 │   │   self.on_run_start(data_fetcher)                                                    │
│   138 │   │   while not self.done:                                                               │
│   139 │   │   │   try:                                                                           │
│ ❱ 140 │   │   │   │   self.advance(data_fetcher)                                                 │
│   141 │   │   │   │   self.on_advance_end(data_fetcher)                                          │
│   142 │   │   │   │   self._restarting = False                                                   │
│   143 │   │   │   except StopIteration:                                                          │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/lightning/pytorch/loop │
│ s/training_epoch_loop.py:212 in advance                                                          │
│                                                                                                  │
│   209 │   │   │   batch_idx = data_fetcher._batch_idx                                            │
│   210 │   │   else:                                                                              │
│   211 │   │   │   dataloader_iter = None                                                         │
│ ❱ 212 │   │   │   batch, _, __ = next(data_fetcher)                                              │
│   213 │   │   │   # TODO: we should instead use the batch_idx returned by the fetcher, however   │
│   214 │   │   │   # fetcher state so that the batch_idx is correct after restarting              │
│   215 │   │   │   batch_idx = self.batch_idx + 1                                                 │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/lightning/pytorch/loop │
│ s/fetchers.py:133 in __next__                                                                    │
│                                                                                                  │
│   130 │   │   │   │   self.done = not self.batches                                               │
│   131 │   │   elif not self.done:                                                                │
│   132 │   │   │   # this will run only when no pre-fetching was done.                            │
│ ❱ 133 │   │   │   batch = super().__next__()                                                     │
│   134 │   │   else:                                                                              │
│   135 │   │   │   # the iterator is empty                                                        │
│   136 │   │   │   raise StopIteration                                                            │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/lightning/pytorch/loop │
│ s/fetchers.py:60 in __next__                                                                     │
│                                                                                                  │
│    57 │   │   assert self.iterator is not None                                                   │
│    58 │   │   self._start_profiler()                                                             │
│    59 │   │   try:                                                                               │
│ ❱  60 │   │   │   batch = next(self.iterator)                                                    │
│    61 │   │   except StopIteration:                                                              │
│    62 │   │   │   self.done = True                                                               │
│    63 │   │   │   raise                                                                          │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/lightning/pytorch/util │
│ ities/combined_loader.py:341 in __next__                                                         │
│                                                                                                  │
│   338 │                                                                                          │
│   339 │   def __next__(self) -> _ITERATOR_RETURN:                                                │
│   340 │   │   assert self._iterator is not None                                                  │
│ ❱ 341 │   │   out = next(self._iterator)                                                         │
│   342 │   │   if isinstance(self._iterator, _Sequential):                                        │
│   343 │   │   │   return out                                                                     │
│   344 │   │   out, batch_idx, dataloader_idx = out                                               │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/lightning/pytorch/util │
│ ities/combined_loader.py:78 in __next__                                                          │
│                                                                                                  │
│    75 │   │   out = [None] * n  # values per iterator                                            │
│    76 │   │   for i in range(n):                                                                 │
│    77 │   │   │   try:                                                                           │
│ ❱  78 │   │   │   │   out[i] = next(self.iterators[i])                                           │
│    79 │   │   │   except StopIteration:                                                          │
│    80 │   │   │   │   self._consumed[i] = True                                                   │
│    81 │   │   │   │   if all(self._consumed):                                                    │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/torch/utils/data/datal │
│ oader.py:630 in __next__                                                                         │
│                                                                                                  │
│    627 │   │   │   if self._sampler_iter is None:                                                │
│    628 │   │   │   │   # TODO(https://github.com/pytorch/pytorch/issues/76750)                   │
│    629 │   │   │   │   self._reset()  # type: ignore[call-arg]                                   │
│ ❱  630 │   │   │   data = self._next_data()                                                      │
│    631 │   │   │   self._num_yielded += 1                                                        │
│    632 │   │   │   if self._dataset_kind == _DatasetKind.Iterable and \                          │
│    633 │   │   │   │   │   self._IterableDataset_len_called is not None and \                    │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/torch/utils/data/datal │
│ oader.py:1345 in _next_data                                                                      │
│                                                                                                  │
│   1342 │   │   │   │   self._task_info[idx] += (data,)                                           │
│   1343 │   │   │   else:                                                                         │
│   1344 │   │   │   │   del self._task_info[idx]                                                  │
│ ❱ 1345 │   │   │   │   return self._process_data(data)                                           │
│   1346 │                                                                                         │
│   1347 │   def _try_put_index(self):                                                             │
│   1348 │   │   assert self._tasks_outstanding < self._prefetch_factor * self._num_workers        │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/torch/utils/data/datal │
│ oader.py:1371 in _process_data                                                                   │
│                                                                                                  │
│   1368 │   │   self._rcvd_idx += 1                                                               │
│   1369 │   │   self._try_put_index()                                                             │
│   1370 │   │   if isinstance(data, ExceptionWrapper):                                            │
│ ❱ 1371 │   │   │   data.reraise()                                                                │
│   1372 │   │   return data                                                                       │
│   1373 │                                                                                         │
│   1374 │   def _mark_worker_as_unavailable(self, worker_id, shutdown=False):                     │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/torch/_utils.py:694 in │
│ reraise                                                                                          │
│                                                                                                  │
│   691 │   │   │   # If the exception takes multiple arguments, don't try to                      │
│   692 │   │   │   # instantiate since we don't know how to                                       │
│   693 │   │   │   raise RuntimeError(msg) from None                                              │
│ ❱ 694 │   │   raise exception                                                                    │
│   695                                                                                            │
│   696                                                                                            │
│   697 def _get_available_device_type():                                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Caught RuntimeError in pin memory thread for device 0.
Original Traceback (most recent call last):
  File "/home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 37, in do_one_step
    data = pin_memory(data, device)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 63, in pin_memory
    return type(data)({k: pin_memory(sample, device) for k, sample in data.items()})  # type: ignore[call-arg]
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 63, in <dictcomp>
    return type(data)({k: pin_memory(sample, device) for k, sample in data.items()})  # type: ignore[call-arg]
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/miniconda3/envs/kraken-5.2.9/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 58, in pin_memory
    return data.pin_memory(device)
           ^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
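
For context, the failure is raised in the DataLoader's pin-memory worker, which copies each batch into page-locked host memory before it is transferred to the GPU. The sketch below is a minimal, generic PyTorch reproduction of that code path, not kraken's own data pipeline; the dataset and tensor sizes are made up:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for kraken's segmentation samples (sizes made up).
images = torch.rand(64, 1, 1800, 300)   # single-channel page-sized inputs
targets = torch.rand(64, 4, 450, 75)    # per-class target maps
dataset = TensorDataset(images, targets)

# With pin_memory=True and multiprocessing workers, the DataLoader runs a
# dedicated "pin memory thread" that calls .pin_memory() (a page-locked host
# allocation) on every batch before the copy to the GPU; that is the thread
# named in "Caught RuntimeError in pin memory thread for device 0".
loader = DataLoader(dataset, batch_size=1, num_workers=2, pin_memory=True)

if torch.cuda.is_available():
    for batch, target in loader:
        batch = batch.to('cuda:0', non_blocking=True)
        target = target.to('cuda:0', non_blocking=True)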
@johnlockejrr
Author

Exactly the same config and command works on kraken 5.2.9:

┏━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃    ┃ Name              ┃ Type                     ┃ Params ┃                      In sizes ┃                Out sizes ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 0  │ net               │ MultiParamSequential     │  1.3 M │             [1, 1, 1800, 300] │   [[1, 4, 450, 75], '?'] │
│ 1  │ net.C_0           │ ActConv2D                │  3.2 K │      [[1, 1, 1800, 300], '?'] │ [[1, 64, 900, 150], '?'] │
│ 2  │ net.Gn_1          │ GroupNorm                │    128 │ [[1, 64, 900, 150], '?', '?'] │ [[1, 64, 900, 150], '?'] │
│ 3  │ net.C_2           │ ActConv2D                │ 73.9 K │ [[1, 64, 900, 150], '?', '?'] │ [[1, 128, 450, 75], '?'] │
│ 4  │ net.Gn_3          │ GroupNorm                │    256 │ [[1, 128, 450, 75], '?', '?'] │ [[1, 128, 450, 75], '?'] │
│ 5  │ net.C_4           │ ActConv2D                │  147 K │ [[1, 128, 450, 75], '?', '?'] │ [[1, 128, 450, 75], '?'] │
│ 6  │ net.Gn_5          │ GroupNorm                │    256 │ [[1, 128, 450, 75], '?', '?'] │ [[1, 128, 450, 75], '?'] │
│ 7  │ net.C_6           │ ActConv2D                │  295 K │ [[1, 128, 450, 75], '?', '?'] │ [[1, 256, 450, 75], '?'] │
│ 8  │ net.Gn_7          │ GroupNorm                │    512 │ [[1, 256, 450, 75], '?', '?'] │ [[1, 256, 450, 75], '?'] │
│ 9  │ net.C_8           │ ActConv2D                │  590 K │ [[1, 256, 450, 75], '?', '?'] │ [[1, 256, 450, 75], '?'] │
│ 10 │ net.Gn_9          │ GroupNorm                │    512 │ [[1, 256, 450, 75], '?', '?'] │ [[1, 256, 450, 75], '?'] │
│ 11 │ net.L_10          │ TransposedSummarizingRNN │ 74.2 K │ [[1, 256, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 12 │ net.L_11          │ TransposedSummarizingRNN │ 25.1 K │  [[1, 64, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 13 │ net.C_12          │ ActConv2D                │  2.1 K │  [[1, 64, 450, 75], '?', '?'] │  [[1, 32, 450, 75], '?'] │
│ 14 │ net.Gn_13         │ GroupNorm                │     64 │  [[1, 32, 450, 75], '?', '?'] │  [[1, 32, 450, 75], '?'] │
│ 15 │ net.L_14          │ TransposedSummarizingRNN │ 16.9 K │  [[1, 32, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 16 │ net.L_15          │ TransposedSummarizingRNN │ 25.1 K │  [[1, 64, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 17 │ net.C_16          │ ActConv2D                │ 18.5 K │  [[1, 64, 450, 75], '?', '?'] │  [[1, 32, 450, 75], '?'] │
│ 18 │ net.Gn_17         │ GroupNorm                │     64 │  [[1, 32, 450, 75], '?', '?'] │  [[1, 32, 450, 75], '?'] │
│ 19 │ net.C_18          │ ActConv2D                │ 18.5 K │  [[1, 32, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 20 │ net.Gn_19         │ GroupNorm                │    128 │  [[1, 64, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 21 │ net.l_20          │ ActConv2D                │    260 │  [[1, 64, 450, 75], '?', '?'] │   [[1, 4, 450, 75], '?'] │
│ 22 │ val_px_accuracy   │ MultilabelAccuracy       │      0 │                             ? │                        ? │
│ 23 │ val_mean_accuracy │ MultilabelAccuracy       │      0 │                             ? │                        ? │
│ 24 │ val_mean_iu       │ MultilabelJaccardIndex   │      0 │                             ? │                        ? │
│ 25 │ val_freq_iu       │ MultilabelJaccardIndex   │      0 │                             ? │                        ? │
└────┴───────────────────┴──────────────────────────┴────────┴───────────────────────────────┴──────────────────────────┘
Trainable params: 1.3 M
Non-trainable params: 0
Total params: 1.3 M
Total estimated model params size (MB): 5
stage 0/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 0:00:04 • 0:00:00 7.34it/s val_accuracy: 0.977 val_mean_acc: 0.977 val_mean_iu: 0.001 val_freq_iu: 0.001 early_stopping: 0/10 0.00099
stage 1/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 0:00:06 • 0:00:00 4.41it/s val_accuracy: 0.986 val_mean_acc: 0.986 val_mean_iu: 0.000 val_freq_iu: 0.000 early_stopping: 1/10 0.00099
stage 2/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 0:00:07 • 0:00:00 4.41it/s val_accuracy: 0.993 val_mean_acc: 0.993 val_mean_iu: 0.163 val_freq_iu: 0.481 early_stopping: 0/10 0.16270
stage 3/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 0:00:07 • 0:00:00 4.40it/s val_accuracy: 0.995 val_mean_acc: 0.995 val_mean_iu: 0.223 val_freq_iu: 0.655 early_stopping: 0/10 0.22259
stage 4/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 0:00:06 • 0:00:00 4.42it/s val_accuracy: 0.996 val_mean_acc: 0.996 val_mean_iu: 0.226 val_freq_iu: 0.683 early_stopping: 0/10 0.22590
stage 5/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 0:00:06 • 0:00:00 4.42it/s val_accuracy: 0.996 val_mean_acc: 0.996 val_mean_iu: 0.236 val_freq_iu: 0.710 early_stopping: 0/10 0.23637

@johnlockejrr
Author

Really strange, I can't see my GPU being out of memory:

[screenshot: GPU memory usage during training]

@jesusbft

Try these arguments: --device cuda:0 --batch-size 12

Also, check GPU usage with this command: watch -n 1 nvidia-smi

@johnlockejrr
Author

Error: No such option: --batch-size (Possible options: --resize, --step-size)

(kraken-5.2.9-py3.10) incognito@DESKTOP-H1BS9PO:~/kraken-train/teyman_print$ nvidia-smi
Thu Sep 19 22:26:58 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 561.09         CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        On  |   00000000:04:00.0 Off |                  N/A |
|  0%   29C    P8             11W /  170W |       8MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
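
For what it's worth, you can also check what torch itself reports as free device memory from inside the training environment; a minimal sketch, using the same device index as in the command above:

import torch

# Free/total memory on cuda:0 as seen by the torch runtime,
# independently of what nvidia-smi shows on the host side.
free, total = torch.cuda.mem_get_info(0)
print(f"cuda:0 free: {free / 2**20:.0f} MiB of {total / 2**20:.0f} MiB")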

@johnlockejrr
Author

I train bigger YOLOv8 models in the same environment without any problems.

@johnlockejrr
Author

This doesn't happen in kraken 4.3.13, so it's definitely not my GPU:

[screenshot: the same training running under kraken 4.3.13]

@mittagessen
Owner

mittagessen commented Sep 24, 2024

12GB is fairly close to the 10GB that is usually required to train a segmentation model, so it is possible that torch is running out of memory. Could you try training the model in 16-bit mixed precision with the --precision option as a quick fix?

4.3.12 didn't use Lightning yet and was slightly more memory efficient.
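
For example, roughly your original invocation with the precision flag added (just a sketch of the quick fix, all other options unchanged):

ketos segtrain --precision 16 -d cuda:0 -f page -t output.txt -q early --min-epochs 100 --resize both -tl -i biblialong02_se3_2_tl.mlmodel -o /home/incognito/kraken-train/teyman_print/teyman_print_blong/teyman_print_blong_tl_v1 --pad 20 20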

@johnlockejrr
Author

johnlockejrr commented Sep 25, 2024 via email

@johnlockejrr
Author

It doesn't crash now with --precision 16, but val_freq_iu is nan!

(kraken-5.2.9-py3.10) incognito@DESKTOP-H1BS9PO:~/kraken-train/teyman_print$ ketos segtrain --augment --precision 16 -d cuda:0 -f page -t output.txt -q early -cl --min-epochs 100 -w 0 -s '[1,120,0,1 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 S1(1x0)1,3 Lbx200 Do0.1,2 Lbx200 Do.1,2 Lbx200 Do]' -r 0.0001 -o /home/incognito/kraken-train/teyman_print/teyman_print_scr_cl/teyman_print_scr_cl_v1
Training line types:
  textline      2       399
  default       4       21
Training region types:
  textzone      3       40
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
You are using a CUDA device ('NVIDIA GeForce RTX 3060') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
┏━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃    ┃ Name              ┃ Type                     ┃ Params ┃                      In sizes ┃                Out sizes ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 0  │ net               │ MultiParamSequential     │  4.0 M │              [1, 1, 120, 300] │     [[1, 5, 1, 37], '?'] │
│ 1  │ net.C_0           │ ActConv2D                │  1.3 K │       [[1, 1, 120, 300], '?'] │ [[1, 32, 120, 300], '?'] │
│ 2  │ net.Do_1          │ Dropout                  │      0 │ [[1, 32, 120, 300], '?', '?'] │ [[1, 32, 120, 300], '?'] │
│ 3  │ net.Mp_2          │ MaxPool                  │      0 │ [[1, 32, 120, 300], '?', '?'] │  [[1, 32, 60, 150], '?'] │
│ 4  │ net.C_3           │ ActConv2D                │ 40.0 K │  [[1, 32, 60, 150], '?', '?'] │  [[1, 32, 60, 150], '?'] │
│ 5  │ net.Do_4          │ Dropout                  │      0 │  [[1, 32, 60, 150], '?', '?'] │  [[1, 32, 60, 150], '?'] │
│ 6  │ net.Mp_5          │ MaxPool                  │      0 │  [[1, 32, 60, 150], '?', '?'] │   [[1, 32, 30, 75], '?'] │
│ 7  │ net.C_6           │ ActConv2D                │ 55.4 K │   [[1, 32, 30, 75], '?', '?'] │   [[1, 64, 30, 75], '?'] │
│ 8  │ net.Do_7          │ Dropout                  │      0 │   [[1, 64, 30, 75], '?', '?'] │   [[1, 64, 30, 75], '?'] │
│ 9  │ net.Mp_8          │ MaxPool                  │      0 │   [[1, 64, 30, 75], '?', '?'] │   [[1, 64, 15, 37], '?'] │
│ 10 │ net.C_9           │ ActConv2D                │  110 K │   [[1, 64, 15, 37], '?', '?'] │   [[1, 64, 15, 37], '?'] │
│ 11 │ net.Do_10         │ Dropout                  │      0 │   [[1, 64, 15, 37], '?', '?'] │   [[1, 64, 15, 37], '?'] │
│ 12 │ net.S_11          │ Reshape                  │      0 │   [[1, 64, 15, 37], '?', '?'] │   [[1, 960, 1, 37], '?'] │
│ 13 │ net.L_12          │ TransposedSummarizingRNN │  1.9 M │   [[1, 960, 1, 37], '?', '?'] │   [[1, 400, 1, 37], '?'] │
│ 14 │ net.Do_13         │ Dropout                  │      0 │   [[1, 400, 1, 37], '?', '?'] │   [[1, 400, 1, 37], '?'] │
│ 15 │ net.L_14          │ TransposedSummarizingRNN │  963 K │   [[1, 400, 1, 37], '?', '?'] │   [[1, 400, 1, 37], '?'] │
│ 16 │ net.Do_15         │ Dropout                  │      0 │   [[1, 400, 1, 37], '?', '?'] │   [[1, 400, 1, 37], '?'] │
│ 17 │ net.L_16          │ TransposedSummarizingRNN │  963 K │   [[1, 400, 1, 37], '?', '?'] │   [[1, 400, 1, 37], '?'] │
│ 18 │ net.Do_17         │ Dropout                  │      0 │   [[1, 400, 1, 37], '?', '?'] │   [[1, 400, 1, 37], '?'] │
│ 19 │ net.l_18          │ ActConv2D                │  2.0 K │   [[1, 400, 1, 37], '?', '?'] │     [[1, 5, 1, 37], '?'] │
│ 20 │ val_px_accuracy   │ MultilabelAccuracy       │      0 │                             ? │                        ? │
│ 21 │ val_mean_accuracy │ MultilabelAccuracy       │      0 │                             ? │                        ? │
│ 22 │ val_mean_iu       │ MultilabelJaccardIndex   │      0 │                             ? │                        ? │
│ 23 │ val_freq_iu       │ MultilabelJaccardIndex   │      0 │                             ? │                        ? │
└────┴───────────────────┴──────────────────────────┴────────┴───────────────────────────────┴──────────────────────────┘
Trainable params: 4.0 M
Non-trainable params: 0
Total params: 4.0 M
Total estimated model params size (MB): 15
stage 0/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:02 • 0:00:00 18.71it/s val_accuracy: 1.000 val_mean_acc: 1.000 val_mean_iu: 0.000 val_freq_iu: nan early_stopping: 0/10 0.00000
stage 1/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:02 • 0:00:00 17.66it/s val_accuracy: 0.995 val_mean_acc: 0.995 val_mean_iu: 0.000 val_freq_iu: 0.000 early_stopping: 1/10 0.00000
stage 2/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:02 • 0:00:00 17.15it/s val_accuracy: 1.000 val_mean_acc: 1.000 val_mean_iu: 0.000 val_freq_iu: nan early_stopping: 2/10 0.00000
stage 3/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:02 • 0:00:00 17.10it/s val_accuracy: 0.995 val_mean_acc: 0.995 val_mean_iu: 0.000 val_freq_iu: 0.000 early_stopping: 3/10 0.00000
stage 4/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:02 • 0:00:00 17.20it/s val_accuracy: 0.965 val_mean_acc: 0.965 val_mean_iu: 0.000 val_freq_iu: 0.000 early_stopping: 4/10 0.00000
stage 5/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:02 • 0:00:00 16.60it/s val_accuracy: 1.000 val_mean_acc: 1.000 val_mean_iu: 0.000 val_freq_iu: nan early_stopping: 5/10 0.00000
stage 6/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:02 • 0:00:00 16.76it/s val_accuracy: 1.000 val_mean_acc: 1.000 val_mean_iu: 0.000 val_freq_iu: nan early_stopping: 6/10 0.00000
stage 7/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:02 • 0:00:00 17.43it/s val_accuracy: 1.000 val_mean_acc: 1.000 val_mean_iu: 0.000 val_freq_iu: nan early_stopping: 7/10 0.00000
stage 8/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:02 • 0:00:00 17.04it/s val_accuracy: 1.000 val_mean_acc: 1.000 val_mean_iu: 0.000 val_freq_iu: nan early_stopping: 8/10 0.00000
stage 9/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:02 • 0:00:00 16.98it/s val_accuracy: 1.000 val_mean_acc: 1.000 val_mean_iu: 0.000 val_freq_iu: nan early_stopping: 9/10 0.00000
stage 10/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:02 • 0:00:00 16.62it/s val_accuracy: 1.000 val_mean_acc: 1.000 val_mean_iu: 0.000 val_freq_iu: nan early_stopping: 10/10 0.00000
stage 11/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/36 0:00:00 • -:--:-- 0.00it/s  early_stopping: 10/10 0.00000
Trainer was signaled to stop but the required `min_epochs=100` or `min_steps=None` has not been met. Training will continue...
stage 11/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:02 • 0:00:00 15.30it/s val_accuracy: 1.000 val_mean_acc: 1.000 val_mean_iu: 0.000 val_freq_iu: nan early_stopping: 11/10 0.00000
stage 12/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:02 • 0:00:00 14.36it/s val_accuracy: 1.000 val_mean_acc: 1.000 val_mean_iu: 0.000 val_freq_iu: nan early_stopping: 12/10 0.00000
stage 13/∞ ━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━ 22/36 0:00:01 • 0:00:01 17.44it/s val_accuracy: 1.000 val_mean_acc: 1.000 val_mean_iu: 0.000 val_freq_iu: nan early_stopping: 12/10 0.00000

@dstoekl

dstoekl commented Sep 27, 2024

Why do you segtrain with such a small input mask and an architecture that looks more like the recognition model?

@johnlockejrr
Author

johnlockejrr commented Sep 27, 2024

I'm running different tests on a small model. Is that arch for recognition? I want to train from scratch.

@dstoekl

dstoekl commented Sep 27, 2024

Yes, you are trying to train with a recognition architecture. To me that makes no sense.

@johnlockejrr
Author

I've never trained a segmentation model from scratch. What's the arch for segtrain then?

@dstoekl

dstoekl commented Sep 27, 2024

Just don't pass -s and it will train with the default.

@johnlockejrr
Author

OK, found it in the docs: '[1,1200,0,3 Cr7,7,64,2,2 Gn32 Cr3,3,128,2,2 Gn32 Cr3,3,128 Gn32 Cr3,3,256 Gn32]'

@dstoekl

dstoekl commented Sep 27, 2024

The default arch for segtrain is shown if you type ketos segtrain --help. The one you quoted above is outdated.

@johnlockejrr
Author

It's working like this: ketos segtrain --precision 16 -d cuda:0 -f page -t output.txt -q early -tl --min-epochs 100 -o /home/incognito/kraken-train/teyman_print/teyman_print_scr_cl/teyman_print_scr_cl_v1

stage 0/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:07 • 0:00:00 5.02it/s val_accuracy: 0.988 val_mean_acc: 0.988 val_mean_iu: 0.000 val_freq_iu: 0.000 early_stopping: 0/10 0.00000
stage 1/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:07 • 0:00:00 4.94it/s val_accuracy: 0.988 val_mean_acc: 0.988 val_mean_iu: 0.000 val_freq_iu: 0.000 early_stopping: 1/10 0.00000
stage 2/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:07 • 0:00:00 4.97it/s val_accuracy: 0.988 val_mean_acc: 0.988 val_mean_iu: 0.000 val_freq_iu: 0.000 early_stopping: 2/10 0.00000
stage 3/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:07 • 0:00:00 4.86it/s val_accuracy: 0.988 val_mean_acc: 0.988 val_mean_iu: 0.000 val_freq_iu: 0.000 early_stopping: 3/10 0.00000
stage 4/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:07 • 0:00:00 4.93it/s val_accuracy: 0.988 val_mean_acc: 0.988 val_mean_iu: 0.000 val_freq_iu: 0.000 early_stopping: 4/10 0.00000
stage 5/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:07 • 0:00:00 4.85it/s val_accuracy: 0.988 val_mean_acc: 0.988 val_mean_iu: 0.000 val_freq_iu: 0.000 early_stopping: 5/10 0.00000
stage 6/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:07 • 0:00:00 4.89it/s val_accuracy: 0.988 val_mean_acc: 0.988 val_mean_iu: 0.000 val_freq_iu: 0.000 early_stopping: 6/10 0.00000
stage 7/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:07 • 0:00:00 4.78it/s val_accuracy: 0.988 val_mean_acc: 0.988 val_mean_iu: 0.000 val_freq_iu: 0.000 early_stopping: 0/10 0.00001
stage 8/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:08 • 0:00:00 4.55it/s val_accuracy: 0.989 val_mean_acc: 0.989 val_mean_iu: 0.024 val_freq_iu: 0.090 early_stopping: 0/10 0.02368
stage 9/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:08 • 0:00:00 4.42it/s val_accuracy: 0.991 val_mean_acc: 0.991 val_mean_iu: 0.083 val_freq_iu: 0.314 early_stopping: 0/10 0.08272
stage 10/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:07 • 0:00:00 4.88it/s val_accuracy: 0.993 val_mean_acc: 0.993 val_mean_iu: 0.110 val_freq_iu: 0.415 early_stopping: 0/10 0.10959
stage 11/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0:00:07 • 0:00:00 4.94it/s val_accuracy: 0.994 val_mean_acc: 0.994 val_mean_iu: 0.141 val_freq_iu: 0.534 early_stopping: 0/10 0.14092

@dstoekl

dstoekl commented Sep 27, 2024

Training on 36 images is very little data if you are not training on top of an existing model. I do not remember whether ketos segtrain automatically loads the blla base model as a point of departure, but I believe it does not.
Also, I don't think you need to set precision to 16 if you use the default arch for segtrain.
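
For example, something along these lines, i.e. fine-tuning on top of an existing segmentation model with the -i/--resize options already used in the first post (the blla.mlmodel path and output name here are only placeholders):

ketos segtrain -d cuda:0 -f page -t output.txt -q early --min-epochs 100 -i blla.mlmodel --resize both -o teyman_print_ft_v1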

@johnlockejrr
Author

johnlockejrr commented Sep 27, 2024

I pretrain the model because the small dataset was made by hand by me, so I'm training a model to help me transcribe more ground truth.
@mittagessen told me to set precision to 16 because training crashed on one of my GPUs.
