Tokenizers trained on Stanford notes
Note: it takes ~28 hours to run a tokenizer over all Stanford notes.
Bio_ClinicalBERT
- Algorithm: WordPiece
- Training data: MIMIC-III notes
- Location: /local-scratch/nigam/projects/clmbr_text_assets/models/Bio_ClinicalBERT/
- Vocab size: 28,996
- Avg tokenized note length on all 159M Stanford notes: 587 tokens
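If the directory holds a Hugging Face-format tokenizer (e.g. `vocab.txt` plus config files, which this page does not confirm), it can be loaded and applied to a note roughly as in this minimal sketch; the example note text is made up:

```python
# Minimal sketch: load the MIMIC-trained Bio_ClinicalBERT WordPiece tokenizer
# from its local path and count tokens for a single (made-up) note.
from transformers import AutoTokenizer

TOKENIZER_PATH = "/local-scratch/nigam/projects/clmbr_text_assets/models/Bio_ClinicalBERT/"
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH)

note = "Patient admitted with community-acquired pneumonia, started on ceftriaxone."
tokens = tokenizer.tokenize(note)
print(len(tokens), tokens[:10])
```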
refit_Bio_ClinicalBERT
- Algorithm: WordPiece
- Training data: All 159M Stanford notes in STARR-OMOP
- Location: /local-scratch/nigam/projects/clmbr_text_assets/models/refit_Bio_ClinicalBERT/
- Vocab size: 28,996
- Avg tokenized note length on all 159M Stanford notes: 479 tokens
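A refit tokenizer like this one can be produced by retraining the base WordPiece tokenizer on new text while keeping its vocab size. The sketch below uses `train_new_from_iterator` from `transformers` with a placeholder note iterator and output path; it illustrates the approach and is not necessarily the script used to build these assets:

```python
# Hedged sketch: "refit" the Bio_ClinicalBERT WordPiece tokenizer on Stanford
# notes while keeping its vocab size of 28,996. The note iterator and output
# directory are placeholders.
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained(
    "/local-scratch/nigam/projects/clmbr_text_assets/models/Bio_ClinicalBERT/"
)

def note_iterator():
    # Placeholder: yield raw note_text strings from STARR-OMOP here.
    yield "Patient seen in clinic for follow-up of type 2 diabetes."

refit = base.train_new_from_iterator(note_iterator(), vocab_size=28996)
refit.save_pretrained("/tmp/refit_Bio_ClinicalBERT")  # hypothetical output path
```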
Clinical-Longformer
- Algorithm: Byte Pair Encoding
- Training data: MIMIC-III notes
- Location: /local-scratch/nigam/projects/clmbr_text_assets/models/Clinical-Longformer/
- Vocab size: 50,265
- Avg tokenized note length on all 159M Stanford notes: 594 tokens
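For reference, a byte-pair-encoding tokenizer of this kind can also be trained from scratch with the `tokenizers` library; the sketch below is purely illustrative (placeholder corpus and output directory) and is not how the Clinical-Longformer tokenizer itself was built:

```python
# Hedged sketch: train a byte-level BPE tokenizer (the algorithm used by
# RoBERTa-style models such as Clinical-Longformer) on clinical notes.
import os
from tokenizers import ByteLevelBPETokenizer

def note_iterator():
    # Placeholder: yield raw note_text strings here.
    yield "CT chest without contrast demonstrates no acute abnormality."

bpe = ByteLevelBPETokenizer()
bpe.train_from_iterator(note_iterator(), vocab_size=50265, min_frequency=2)

out_dir = "/tmp/clinical_bpe"  # hypothetical output directory
os.makedirs(out_dir, exist_ok=True)
bpe.save_model(out_dir)  # writes vocab.json and merges.txt
```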
refit_Clinical-Longformer
- Algorithm: Byte Pair Encoding
- Training data: All 159M Stanford notes in STARR-OMOP
- Location: /local-scratch/nigam/projects/clmbr_text_assets/models/refit_Clinical-Longformer/
- Vocab size: 50,265
- Avg tokenized note length on all 159M Stanford notes: 544 tokens
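The per-tokenizer "avg tokenized note length" numbers above come from tokenizing every note and averaging the lengths; a single pass like the sketch below (with a hypothetical one-note-per-line dump of the corpus) is the kind of job the ~28-hour note at the top refers to:

```python
# Minimal sketch: compute average tokenized note length for one of the
# tokenizers above. "stanford_notes.txt" (one note per line) is a hypothetical
# dump of the corpus; over all 159M notes a pass like this takes ~28 hours.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/local-scratch/nigam/projects/clmbr_text_assets/models/refit_Clinical-Longformer/"
)

total_tokens = 0
total_notes = 0
with open("stanford_notes.txt") as f:
    for line in f:
        note = line.rstrip("\n")
        total_tokens += len(tokenizer(note)["input_ids"])
        total_notes += 1

print("avg tokens per note:", total_tokens / max(total_notes, 1))
```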
- Total number of Stanford STARR-OMOP notes: 159,558,363
- Source: som-rit-phi-starr-prod.starr_omop_cdm5_deid_2022_12_03
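Assuming the notes live in the standard OMOP CDM `note` table of that BigQuery dataset and the caller has the required access (neither of which this page confirms), the note count could be reproduced with a query like:

```python
# Hedged sketch: count notes in the STARR-OMOP source dataset via BigQuery.
from google.cloud import bigquery

client = bigquery.Client(project="som-rit-phi-starr-prod")
query = """
    SELECT COUNT(*) AS n_notes
    FROM `som-rit-phi-starr-prod.starr_omop_cdm5_deid_2022_12_03.note`
"""
n_notes = list(client.query(query).result())[0]["n_notes"]
print(n_notes)  # expected: 159,558,363 per the count above
```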