
# Tokenizers trained on Stanford notes


Note: it takes ~28 hours to run a single tokenizer over all 159M Stanford notes.

## ClinicalBERT tokenizer

- Algorithm: WordPiece
- Training data: MIMIC-III notes
- Location: `/local-scratch/nigam/projects/clmbr_text_assets/models/Bio_ClinicalBERT/`
- Vocab size: 28,996
- Avg tokenized note length on all 159M Stanford notes: 587 tokens
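
As a rough sketch of how these per-note averages can be computed (and why the ~28 hour runtime above adds up), the loop below loads the tokenizer from the directory listed above and averages token counts over batches of notes. This assumes the directory is in HuggingFace `transformers` format; `iter_note_batches()` is a placeholder for however you pull notes out of STARR-OMOP, not a real function in this repo.

```python
from transformers import AutoTokenizer

# Assumption: the "Location" directories on this page are HuggingFace-format
# tokenizer directories, loadable with AutoTokenizer.
tokenizer = AutoTokenizer.from_pretrained(
    "/local-scratch/nigam/projects/clmbr_text_assets/models/Bio_ClinicalBERT/"
)
print(len(tokenizer))  # vocab size; 28,996 per the listing above

def iter_note_batches():
    """Placeholder: yield lists of note strings pulled from STARR-OMOP."""
    yield ["Pt c/o SOB x3 days.", "CXR shows no acute cardiopulmonary process."]

total_tokens = total_notes = 0
for batch in iter_note_batches():
    encodings = tokenizer(batch, add_special_tokens=False)
    total_tokens += sum(len(ids) for ids in encodings["input_ids"])
    total_notes += len(batch)

print(f"avg tokens per note: {total_tokens / total_notes:.1f}")
```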

## ClinicalBERT tokenizer, trained from scratch on Stanford notes

- Algorithm: WordPiece
- Training data: All 159M Stanford notes in STARR-OMOP
- Location: `/local-scratch/nigam/projects/clmbr_text_assets/models/refit_Bio_ClinicalBERT/`
- Vocab size: 28,996
- Avg tokenized note length on all 159M Stanford notes: 479 tokens
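
The actual refit training script isn't included on this page, but a minimal sketch of training a WordPiece tokenizer from scratch with the HuggingFace `tokenizers` library would look roughly like this (`stanford_notes.txt` is a hypothetical one-note-per-line dump of the 159M notes):

```python
from tokenizers import BertWordPieceTokenizer

# Hypothetical sketch, not the actual refit script.
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(
    files=["stanford_notes.txt"],  # placeholder: one note per line
    vocab_size=28996,              # match the original ClinicalBERT vocab size
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model("refit_Bio_ClinicalBERT")
```

Refitting the vocabulary on Stanford notes is what drives the average note length down from 587 to 479 tokens: terms common in Stanford notes get their own wordpieces instead of being split into many subwords.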


## Clinical-Longformer tokenizer

- Algorithm: Byte-Pair Encoding (BPE)
- Training data: MIMIC-III notes
- Location: `/local-scratch/nigam/projects/clmbr_text_assets/models/Clinical-Longformer/`
- Vocab size: 50,265
- Avg tokenized note length on all 159M Stanford notes: 594 tokens
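
For intuition on the two algorithms: WordPiece marks word-internal pieces with `##`, while byte-level BPE (as used by Longformer/RoBERTa-style tokenizers) marks word-initial pieces with `Ġ`, an encoded leading space. A quick way to eyeball how a note gets split, assuming the same HuggingFace-format directory layout as above:

```python
from transformers import AutoTokenizer

bpe = AutoTokenizer.from_pretrained(
    "/local-scratch/nigam/projects/clmbr_text_assets/models/Clinical-Longformer/"
)
# Print the raw subword pieces for a sample sentence; word-initial
# pieces carry a "Ġ" prefix under byte-level BPE.
print(bpe.tokenize("Patient presents with dyspnea on exertion."))
```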

## Clinical-Longformer tokenizer, trained from scratch on Stanford notes

- Algorithm: Byte-Pair Encoding (BPE)
- Training data: All 159M Stanford notes in STARR-OMOP
- Location: `/local-scratch/nigam/projects/clmbr_text_assets/models/refit_Clinical-Longformer/`
- Vocab size: 50,265
- Avg tokenized note length on all 159M Stanford notes: 544 tokens
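
As with the WordPiece refit above, a hedged sketch of retraining a byte-level BPE tokenizer with the `tokenizers` library (again, `stanford_notes.txt` is a placeholder note dump, not the actual refit script):

```python
from tokenizers import ByteLevelBPETokenizer

# Hypothetical sketch, not the actual refit script.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["stanford_notes.txt"],  # placeholder: one note per line
    vocab_size=50265,              # match the Longformer/RoBERTa vocab size
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("refit_Clinical-Longformer")
```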

## Stats

- Total number of Stanford STARR-OMOP notes: 159,558,363
- Source: `som-rit-phi-starr-prod.starr_omop_cdm5_deid_2022_12_03`
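
The note count can presumably be reproduced against the source dataset; a sketch using the `google-cloud-bigquery` client, assuming the notes live in the standard OMOP CDM `note` table:

```python
from google.cloud import bigquery

# Assumption: notes are stored in the standard OMOP CDM `note` table
# of the deid dataset listed above.
client = bigquery.Client(project="som-rit-phi-starr-prod")
query = """
    SELECT COUNT(*) AS n_notes
    FROM `som-rit-phi-starr-prod.starr_omop_cdm5_deid_2022_12_03.note`
"""
row = next(iter(client.query(query).result()))
print(row.n_notes)  # expected: 159,558,363 per the stats above
```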