
# Tokenizers trained on Stanford notes

Michael Wornow edited this page Jan 11, 2023 · 10 revisions

## ClinicalBERT tokenizer

- Algorithm: WordPiece
- Training data: MIMIC-III notes
- Location: `/local-scratch/nigam/projects/clmbr_text_assets/models/Bio_ClinicalBERT/`
- Vocab size: 28,996
- Average tokenized note length across all 159M Stanford notes: 587 tokens
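The average tokenized note length reported above can be computed by encoding each note and averaging the token counts. A minimal sketch using the Hugging Face `tokenizers` library, with a hypothetical miniature vocab standing in for the real 28,996-entry Bio_ClinicalBERT vocabulary (in practice you would load the files under the path listed above, e.g. via `transformers.AutoTokenizer.from_pretrained`):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace

# Hypothetical toy vocab for illustration only; the real tokenizer has
# 28,996 entries loaded from the Bio_ClinicalBERT directory.
vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "patient": 3, "was": 4,
         "admitted": 5, "with": 6, "chest": 7, "pain": 8, "##s": 9}
tokenizer = Tokenizer(WordPiece(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

def avg_tokenized_length(tok, notes):
    """Mean token count per note (the statistic reported on this page)."""
    return sum(len(tok.encode(n).tokens) for n in notes) / len(notes)

notes = ["patient was admitted with chest pains", "patient admitted"]
print(avg_tokenized_length(tokenizer, notes))  # mean tokens per note
```

Running this over all 159M notes is the same computation at scale, just streamed over the note table instead of an in-memory list.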

## Tokenizer trained from scratch on Stanford notes

- Algorithm: WordPiece
- Training data: all 159M Stanford notes in STARR-OMOP
- Location: `/local-scratch/nigam/projects/clmbr_text_assets/models/refit_Bio_ClinicalBERT/`
- Vocab size: 28,996
- Average tokenized note length across all 159M Stanford notes: 479 tokens
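Training a WordPiece tokenizer from scratch like this one can be sketched with the Hugging Face `tokenizers` library. The toy corpus below stands in for the 159M STARR-OMOP notes; `vocab_size=28996` matches the value above, and the special-token list is an assumption (the standard BERT set):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Toy corpus standing in for the 159M STARR-OMOP notes.
corpus = [
    "patient admitted with chest pain",
    "patient discharged home in stable condition",
    "chest x-ray showed no acute process",
]

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# vocab_size matches the page; a real run would stream all notes
# from the source table rather than hold them in memory.
trainer = WordPieceTrainer(
    vocab_size=28996,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

enc = tokenizer.encode("patient admitted with chest pain")
print(enc.tokens)
```

Refitting the vocabulary to Stanford notes is what drives the shorter average note length (479 vs. 587 tokens): words common in Stanford notes but rare in MIMIC-III get whole-word entries instead of being split into subwords.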

Total number of Stanford notes: **159,558,363**

Source: `som-rit-phi-starr-prod.starr_omop_cdm5_deid_2022_12_03`