# Tokenizers trained on Stanford notes
## ClinicalBERT tokenizer

- Algorithm: WordPiece
- Training data: MIMIC-III notes
- Location: `/local-scratch/nigam/projects/clmbr_text_assets/models/Bio_ClinicalBERT/`
- Vocab size: 28,996
- Avg tokenized note length on all 159M Stanford notes: 587 tokens
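This tokenizer loads as a standard Hugging Face asset. A minimal sketch of measuring a note's tokenized length, assuming the directory above contains a Hugging Face-compatible tokenizer layout (true for Bio_ClinicalBERT); the example note text is invented:

```python
from transformers import AutoTokenizer

# Path taken from the "Location" field above; assumes vocab.txt and a
# tokenizer config are present in that directory.
tokenizer = AutoTokenizer.from_pretrained(
    "/local-scratch/nigam/projects/clmbr_text_assets/models/Bio_ClinicalBERT/"
)

# Hypothetical note text, for illustration only.
note = "Patient presents with shortness of breath and chest pain."
tokens = tokenizer.tokenize(note)
print(f"{len(tokens)} tokens: {tokens[:10]}")
```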
## Tokenizer trained from scratch on Stanford notes

- Algorithm: WordPiece
- Training data: All 159M Stanford notes in STARR-OMOP
- Location: `/local-scratch/nigam/projects/clmbr_text_assets/models/refit_Bio_ClinicalBERT/`
- Vocab size: 28,996
- Avg tokenized note length on all 159M Stanford notes: 479 tokens
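This page does not record the exact training script. A minimal sketch of training a WordPiece tokenizer from scratch with the Hugging Face `tokenizers` library is below; the input file `stanford_notes.txt` (one note per line) and all settings other than `vocab_size` are assumptions, not confirmed here. The vocab size of 28,996 matches the `bert-base-cased` vocabulary that Bio_ClinicalBERT inherits, hence the cased (`lowercase=False`) setting in the sketch.

```python
from tokenizers import BertWordPieceTokenizer

# Train a cased WordPiece tokenizer from scratch. "stanford_notes.txt"
# is a hypothetical plain-text export of the STARR-OMOP notes.
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(
    files=["stanford_notes.txt"],
    vocab_size=28996,  # matches the vocab size reported above
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Writes vocab.txt into the target directory.
tokenizer.save_model("refit_Bio_ClinicalBERT")
```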
Total number of Stanford notes: **159,558,363**

Source: `som-rit-phi-starr-prod.starr_omop_cdm5_deid_2022_12_03`