
# Tokenizers trained on Stanford notes


Note: it takes ~28 hours to run a single tokenizer over all 159M Stanford notes.

## ClinicalBERT tokenizer

- Algorithm: WordPiece
- Training data: MIMIC-III notes
- Location: `/local-scratch/nigam/projects/clmbr_text_assets/models/Bio_ClinicalBERT/`
- Vocab size: 28,996
- Avg tokenized note length on all 159M Stanford notes: 587 tokens
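
As a rough sketch of how these per-note averages can be computed (and why the ~28 hour runtime above adds up), the loop below loads the tokenizer from the directory listed above and averages token counts over batches of notes. This assumes the directory is in HuggingFace `transformers` format; `iter_note_batches()` is a placeholder for however you pull notes out of STARR-OMOP, not a real function in this repo.

```python
from transformers import AutoTokenizer

# Assumption: the "Location" directories on this page are HuggingFace-format
# tokenizer directories, loadable with AutoTokenizer.
tokenizer = AutoTokenizer.from_pretrained(
    "/local-scratch/nigam/projects/clmbr_text_assets/models/Bio_ClinicalBERT/"
)
print(len(tokenizer))  # vocab size; 28,996 per the listing above

def iter_note_batches():
    """Placeholder: yield lists of note strings pulled from STARR-OMOP."""
    yield ["Pt c/o SOB x3 days.", "CXR shows no acute cardiopulmonary process."]

total_tokens = total_notes = 0
for batch in iter_note_batches():
    encodings = tokenizer(batch, add_special_tokens=False)
    total_tokens += sum(len(ids) for ids in encodings["input_ids"])
    total_notes += len(batch)

print(f"avg tokens per note: {total_tokens / total_notes:.1f}")
```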

## ClinicalBERT tokenizer, trained from scratch on Stanford notes

- Algorithm: WordPiece
- Training data: All 159M Stanford notes in STARR-OMOP
- Location: `/local-scratch/nigam/projects/clmbr_text_assets/models/refit_Bio_ClinicalBERT/`
- Vocab size: 28,996
- Avg tokenized note length on all 159M Stanford notes: 479 tokens
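
The actual refit training script isn't included on this page, but a minimal sketch of training a WordPiece tokenizer from scratch with the HuggingFace `tokenizers` library would look roughly like this (`stanford_notes.txt` is a hypothetical one-note-per-line dump of the 159M notes):

```python
from tokenizers import BertWordPieceTokenizer

# Hypothetical sketch, not the actual refit script.
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(
    files=["stanford_notes.txt"],  # placeholder: one note per line
    vocab_size=28996,              # match the original ClinicalBERT vocab size
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model("refit_Bio_ClinicalBERT")
```

Refitting the vocabulary on Stanford notes is what drives the average note length down from 587 to 479 tokens: terms common in Stanford notes get their own wordpieces instead of being split into many subwords.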


## Clinical-Longformer tokenizer

- Algorithm: Byte-Pair Encoding (BPE)
- Training data: MIMIC-III notes
- Location: `/local-scratch/nigam/projects/clmbr_text_assets/models/Clinical-Longformer/`
- Vocab size: 50,265
- Avg tokenized note length on all 159M Stanford notes: 594 tokens
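
For intuition on the two algorithms: WordPiece marks word-internal pieces with `##`, while byte-level BPE (as used by Longformer/RoBERTa-style tokenizers) marks word-initial pieces with `Ġ`, an encoded leading space. A quick way to eyeball how a note gets split, assuming the same HuggingFace-format directory layout as above:

```python
from transformers import AutoTokenizer

bpe = AutoTokenizer.from_pretrained(
    "/local-scratch/nigam/projects/clmbr_text_assets/models/Clinical-Longformer/"
)
# Print the raw subword pieces for a sample sentence; word-initial
# pieces carry a "Ġ" prefix under byte-level BPE.
print(bpe.tokenize("Patient presents with dyspnea on exertion."))
```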

## Clinical-Longformer tokenizer, trained from scratch on Stanford notes

- Algorithm: Byte-Pair Encoding (BPE)
- Training data: All 159M Stanford notes in STARR-OMOP
- Location: `/local-scratch/nigam/projects/clmbr_text_assets/models/refit_Clinical-Longformer/`
- Vocab size: 50,265
- Avg tokenized note length on all 159M Stanford notes: 544 tokens
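
As with the WordPiece refit above, a hedged sketch of retraining a byte-level BPE tokenizer with the `tokenizers` library (again, `stanford_notes.txt` is a placeholder note dump, not the actual refit script):

```python
from tokenizers import ByteLevelBPETokenizer

# Hypothetical sketch, not the actual refit script.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["stanford_notes.txt"],  # placeholder: one note per line
    vocab_size=50265,              # match the Longformer/RoBERTa vocab size
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("refit_Clinical-Longformer")
```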

## Stats

- Total number of Stanford STARR-OMOP notes: 159,558,363
- Source: `som-rit-phi-starr-prod.starr_omop_cdm5_deid_2022_12_03`
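
The note count can presumably be reproduced against the source dataset; a sketch using the `google-cloud-bigquery` client, assuming the notes live in the standard OMOP CDM `note` table:

```python
from google.cloud import bigquery

# Assumption: notes are stored in the standard OMOP CDM `note` table
# of the deid dataset listed above.
client = bigquery.Client(project="som-rit-phi-starr-prod")
query = """
    SELECT COUNT(*) AS n_notes
    FROM `som-rit-phi-starr-prod.starr_omop_cdm5_deid_2022_12_03.note`
"""
row = next(iter(client.query(query).result()))
print(row.n_notes)  # expected: 159,558,363 per the stats above
```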