
# Tokenizers trained on Stanford notes

Michael Wornow edited this page Jan 11, 2023 · 10 revisions

## ClinicalBERT tokenizer

- Algorithm: WordPiece
- Training data: MIMIC-III notes
- Location: `/local-scratch/nigam/projects/clmbr_text_assets/models/Bio_ClinicalBERT/`
- Vocab size: 28,996
- Average tokenized note length across all 159M Stanford notes: 587 tokens
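The average tokenized note length reported above can be computed by encoding each note and averaging the token counts. A minimal sketch using the Hugging Face `tokenizers` library, with a hypothetical miniature vocab standing in for the real 28,996-entry Bio_ClinicalBERT vocabulary (in practice you would load the files under the path listed above, e.g. via `transformers.AutoTokenizer.from_pretrained`):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace

# Hypothetical toy vocab for illustration only; the real tokenizer has
# 28,996 entries loaded from the Bio_ClinicalBERT directory.
vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "patient": 3, "was": 4,
         "admitted": 5, "with": 6, "chest": 7, "pain": 8, "##s": 9}
tokenizer = Tokenizer(WordPiece(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

def avg_tokenized_length(tok, notes):
    """Mean token count per note (the statistic reported on this page)."""
    return sum(len(tok.encode(n).tokens) for n in notes) / len(notes)

notes = ["patient was admitted with chest pains", "patient admitted"]
print(avg_tokenized_length(tokenizer, notes))  # mean tokens per note
```

Running this over all 159M notes is the same computation at scale, just streamed over the note table instead of an in-memory list.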

## Tokenizer trained from scratch on Stanford notes

- Algorithm: WordPiece
- Training data: all 159M Stanford notes in STARR-OMOP
- Location: `/local-scratch/nigam/projects/clmbr_text_assets/models/refit_Bio_ClinicalBERT/`
- Vocab size: 28,996
- Average tokenized note length across all 159M Stanford notes: 479 tokens
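Training a WordPiece tokenizer from scratch like this one can be sketched with the Hugging Face `tokenizers` library. The toy corpus below stands in for the 159M STARR-OMOP notes; `vocab_size=28996` matches the value above, and the special-token list is an assumption (the standard BERT set):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Toy corpus standing in for the 159M STARR-OMOP notes.
corpus = [
    "patient admitted with chest pain",
    "patient discharged home in stable condition",
    "chest x-ray showed no acute process",
]

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# vocab_size matches the page; a real run would stream all notes
# from the source table rather than hold them in memory.
trainer = WordPieceTrainer(
    vocab_size=28996,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

enc = tokenizer.encode("patient admitted with chest pain")
print(enc.tokens)
```

Refitting the vocabulary to Stanford notes is what drives the shorter average note length (479 vs. 587 tokens): words common in Stanford notes but rare in MIMIC-III get whole-word entries instead of being split into subwords.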

Total number of Stanford notes: **159,558,363**

Source: `som-rit-phi-starr-prod.starr_omop_cdm5_deid_2022_12_03`