This library comes with zero guarantees. If you want to see improvements, please open an issue or contribute with a pull request. If you want to collaborate scientifically, please send an email.
Repository for public-article extraction and mining.
Multiple components:
- Select (S) using APIs to connect with 3rd party data
- Retrieve (S) text data from Arxiv/Biorxiv/Medrxiv or Pubmed/PMC
- Parse (S) process ingress XML/JSON/HTML/PDF/CSV into desired format
- Identify (S) relevant text from generic corpora
- Deduplicate (S) remove exact and mark approximate duplicates
- Clean the XML/JSON/.. etc. from the previous step and output cleaned text
- Translate the pruned/cleaned text to any target language
- Anonymize replace PII-information by placeholder terms
- Share make shareable through e.g. Huggingface
- Augment Add paraphrasing
- Synonimize identify and replace typos
- Deabbreviate identify and deabbreviate abbreviations
- Stats extract corpus statistics
Here the (S) indicates that these functions should be callable in streaming mode. Especially for smaller domains, with limited storage capacity, we may not want to download terabytes of corpora before we start our higher-level processing functions.
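As an illustration of what streaming mode can mean in practice, the sketch below chains plain Python generators so that only one record is held in memory at a time. The function names and the `text` field are hypothetical and not part of this library.

```python
import json
from typing import Iterable, Iterator


def stream_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Lazily parse one JSON record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def keep_long_texts(records: Iterable[dict], min_chars: int) -> Iterator[dict]:
    """Lazily drop records whose 'text' field (assumed name) is too short."""
    for record in records:
        if len(record.get("text", "")) >= min_chars:
            yield record


# Any iterable of lines works here, e.g. an open file handle or an HTTP stream.
raw_lines = ['{"text": "short"}', "", '{"text": "a considerably longer document"}']
kept = list(keep_long_texts(stream_jsonl(raw_lines), min_chars=10))
print(kept)
```

Because every stage is a generator, nothing is downloaded or parsed until the final consumer asks for the next record.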
Status (minimum working example):
| Task | In progress | Completed |
|---|---|---|
| Select & Retrieve | [ ] | [ ] |
| Parse | [x] | [ ] |
| Identify | [ ] | [ ] |
| Deduplicate | [x] | [ ] |
| Clean | [x] | [ ] |
| Translate | [x] | [ ] |
| Anonymise | [x] | [ ] |
| Share | [ ] | [ ] |
| Augment | [ ] | [ ] |
| Synonimize | [ ] | [ ] |
| Deabbreviate | [ ] | [ ] |
| Stats | [ ] | [ ] |
Here we give a bit more detail on the components.
Select & Retrieve: interfaces with APIs for S2ORC/PubMed/PMC/arXiv/medRxiv/bioRxiv/OAI and Huggingface.
The select function must be able to pull in data in streaming mode.
For Huggingface datasets this might be easy:
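from datasets import load_dataset
# params: any extra positional arguments, e.g. a configuration name
# streaming=True returns an iterable dataset instead of downloading everything first
datasets = load_dataset('some/dataset', *params, streaming=True)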
Parse: parser to normalise incoming data in JSON/YAML or HuggingFace dataset formats
Identify: functionality to identify medical texts in general corpora using supervised and self-supervised models
- Use pre-trained supervised models to identify relevant documents or text sections
- Use LLMs to identify relevant texts using in-context-learning
- Use seed-texts in combination with bi-encoder and cross-encoder models to find texts that are semantically near the seed-texts
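The seed-text approach can be sketched as follows. To keep the example self-contained, a toy bag-of-words vector stands in for a real bi-encoder embedding; that substitution, and all names below, are assumptions for illustration only.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lower-cased word counts (placeholder for a bi-encoder)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_against_seeds(candidate: str, seeds: list[str]) -> float:
    """Similarity of a candidate to its nearest seed text."""
    cand_vec = embed(candidate)
    return max(cosine(cand_vec, embed(seed)) for seed in seeds)


seeds = ["the patient was admitted with chest pain and tachycardia"]
candidates = [
    "the football match ended in a draw",
    "patient presented with chest pain",
]
ranked = sorted(candidates, key=lambda c: score_against_seeds(c, seeds), reverse=True)
print(ranked[0])
```

With a real bi-encoder only `embed` would change; a cross-encoder could then re-score the top-ranked candidates.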
The core function ab initio is to ease the creation and dissemination of Dutch clinical NLP work (including corpora) but in principle this code is not limited to the Dutch language or the medical domain.
Deduplicate: remove exact duplicates and mark approximate duplicates. Following the Llama3.1 recipe we use
- MinHash (see Broder)
- Line-level deduplication; line-level frequency determination with cut-off, and selective removal
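A minimal sketch of both ideas, using only the standard library. The shingle size, number of hash functions and frequency cut-off are illustrative assumptions, not the settings used by this library or by the cited recipe.

```python
import hashlib
from collections import Counter


def shingles(text: str, n: int = 3) -> set[str]:
    """Word n-grams (shingles) of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(text: str, num_perm: int = 64) -> list[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(), "big")
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def drop_frequent_lines(docs: list[str], max_count: int) -> list[str]:
    """Remove lines that occur more than max_count times across all documents."""
    counts = Counter(line for doc in docs for line in doc.splitlines())
    return ["\n".join(line for line in doc.splitlines() if counts[line] <= max_count)
            for doc in docs]


sig_a = minhash_signature("the patient was admitted with acute chest pain yesterday")
sig_b = minhash_signature("the patient was admitted with acute chest pain today")
sig_c = minhash_signature("stock markets closed higher after the central bank decision")
print(estimated_jaccard(sig_a, sig_b), estimated_jaccard(sig_a, sig_c))
```

Pairs whose estimated similarity exceeds a chosen threshold would be marked as approximate duplicates rather than removed, in line with the description above.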
Clean: remove noise, code/format artifacts, escape/remove quotes
- duplicated n-gram coverage ratio (see Rae et al.) to identify error logs
- Encoding degarbling
- file-format headers/endings
- using fastText-based language detectors, remove text sections in which the fraction of other-language lines, determined on a per-line basis, exceeds a pre-set threshold, e.g. if >50% of the paragraph or document is non-English we remove that paragraph or document
The core function here is to extract the text intended to be read.
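The duplicated n-gram coverage ratio listed above can be sketched as below. This is a word-level simplification written for illustration; the cited definition and the threshold used to flag a text may differ.

```python
from collections import Counter


def duplicated_ngram_coverage(text: str, n: int = 3) -> float:
    """Fraction of words that lie inside a word n-gram occurring more than once."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    covered = [False] * len(words)
    for i, gram in enumerate(ngrams):
        if counts[gram] > 1:
            # mark every word position spanned by this repeated n-gram
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / len(words)


log_like = "error at line error at line error at line"
prose = "the patient was admitted with chest pain"
print(duplicated_ngram_coverage(log_like), duplicated_ngram_coverage(prose))
```

Texts with a high ratio, such as repeated log or error messages, are candidates for removal.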
Translate: using NMT and translation APIs, optionally in combination with glossaries, translate corpora to a target language.
Anonymize: replace PII-information by placeholder terms using deidentification libraries and optional custom patterns.
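The "optional custom patterns" part can be illustrated with a few regular expressions. The patterns and placeholder names below are hypothetical examples, deliberately simplistic, and no substitute for a proper deidentification library.

```python
import re

# Hypothetical custom patterns: placeholder -> compiled regular expression.
CUSTOM_PATTERNS = {
    "<EMAIL>": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "<DATE>": re.compile(r"\b\d{1,2}-\d{1,2}-\d{4}\b"),
    "<PHONE>": re.compile(r"\b\d{2,4}[- ]?\d{6,8}\b"),
}


def replace_pii(text: str, patterns: dict = CUSTOM_PATTERNS) -> str:
    """Replace every match of each pattern by its placeholder term."""
    for placeholder, pattern in patterns.items():
        text = pattern.sub(placeholder, text)
    return text


example = "Mail j.jansen@example.org or call 020-1234567 before 12-03-2021."
print(replace_pii(example))
```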
Share: turn translated dataset into shared datasets including a dataset-card, license, etc.
Augment: code to use paraphrasing for text generation
Synonimize: identify and replace typos, normalise variations of the same word
Deabbreviate: identify and deabbreviate abbreviations to reduce the ambiguity
Stats: extract stats from corpora, specifically: number of tokens, number of sentences, number of documents, vocab size
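A minimal sketch of these four statistics, assuming naive regex-based tokenisation and sentence splitting; a real implementation would likely use a proper tokenizer.

```python
import re


def corpus_stats(documents: list[str]) -> dict:
    """Count documents, sentences, tokens and distinct lower-cased tokens."""
    tokens = [tok for doc in documents for tok in re.findall(r"\w+", doc.lower())]
    sentences = [s for doc in documents for s in re.split(r"[.!?]+", doc) if s.strip()]
    return {
        "documents": len(documents),
        "sentences": len(sentences),
        "tokens": len(tokens),
        "vocab_size": len(set(tokens)),
    }


stats = corpus_stats(["The heart beats. The heart rests.", "Blood flows!"])
print(stats)
```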
Basic operation:
from pubscience import clean, deduplicate, anonymise
from pubscience.utils import Pipeline

# configure each step with its own keyword arguments
Cleaner = clean(**clean_kwargs)
Deduplicate = deduplicate(**dedup_kwargs)
Deid = anonymise(**deid_kwargs)

# chain the steps; they run in the order listed
TextPipe = Pipeline([('Cleaner', Cleaner),
                     ('Deduplicate', Deduplicate),
                     ('Deid', Deid)])

# df is assumed to be a DataFrame with a 'raw_text' column
# here Deduplicate adds a column to indicate the duplication degree
df['processed_text'] = TextPipe.fit_transform(df['raw_text'])
import os

from pubscience.translate import llm

# input and output locations are read from environment variables
json_example = os.getenv('PMC_Patients')
OUTPUT_LOC = os.getenv('PMC_Patients_output')

TEXT_IDS = ['title', 'patient']
ID_COLS = ['patient_id', 'patient_uid', 'PMID', 'file_path', 'pub_date']
META_COLS = ['age', 'gender']
MAX_LENGTH = 10_000
MAX_NUM_LINES = 250_293

SYSTEM_PROMPT = """You are a faithful and truthful translator in the medical/clinical domain.
The user query is formatted as a dictionary {'source_language':..,'target_language':.., 'text_to_translate':..},
your response should ONLY consist of your translation"""

# renamed from `vars` to avoid shadowing the Python builtin
translator_kwargs = {
    'model': 'gemini-1.5-flash',
    'provider': 'google',
    'source_lang': 'english',
    'target_lang': 'dutch',
    'max_tokens': MAX_LENGTH,
    'system_prompt': SYSTEM_PROMPT,
    'temperature': 0.15,
    'env_loc': '../../.run.env',
}

translator = llm.TranslationLLM(**translator_kwargs)

# get_batches is not defined in this snippet; it is assumed to yield batches of records
batches = get_batches(json_example)
for batch in batches:
    translated_batch = translator.translate_batch(batch)
- camelot
- pdfminer
- pdftotext
- fitz
- beautifulsoup
- scrapy
- html2text
- Compact language detector
- justText
- Fixes Text For You
Language: This is primarily interesting because large-scale text processing can in principle be parallelized in an embarrassingly simple way, which means we should prefer natively heterogeneous languages such as
- Use the APIs to pull .pdf, .xml or .json files.
- Pull directly from
- Parse from local files (parquet/csv.gzip).
Based on
- keyword lookup, using e.g. FlashText
- relevant document embedders (bi-encoders/cross-encoders) or
- topic models, or
- supervised models, trained to distinguish between domain specific texts and generic/other texts
A simple recipe could be (1) use command-line string manipulation tools such as grep, awk and cat for the initial pruning, so for instance
grep "cardiale\|hartziekte\|vasculair\|tachycardie\|hartritme\|angina pectoris\|vaatlijden" nl_clean_0000.jsonl > nl_clean_cardiale.jsonl
This is then followed by (2) a bi-encoder to check whether documents are 'near' medical texts or (3) a supervised model to identify medical texts.
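The keyword pruning of step (1) can also be done in Python, which makes it easy to match on a single field instead of the raw line. This is a sketch under assumptions: the `text` field name is hypothetical, and unlike the grep command the match here is case-insensitive.

```python
import json
import re
from typing import Iterable, Iterator

KEYWORDS = ["cardiale", "hartziekte", "vasculair", "tachycardie",
            "hartritme", "angina pectoris", "vaatlijden"]
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), flags=re.IGNORECASE)


def filter_jsonl_lines(lines: Iterable[str], text_field: str = "text") -> Iterator[dict]:
    """Yield only the records whose text field mentions one of the keywords."""
    for line in lines:
        if not line.strip():
            continue  # skip empty lines
        record = json.loads(line)
        if PATTERN.search(record.get(text_field, "")):
            yield record


sample = [
    '{"id": 1, "text": "Patient met tachycardie en vaatlijden."}',
    '{"id": 2, "text": "Het weer was zonnig."}',
]
matches = list(filter_jsonl_lines(sample))
print([m["id"] for m in matches])
```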
We want to be able to do this as part of the select process. E.g. in case of the PubMed fulltext articles we can use the abstract for semantic search to identify the relevant PubMed identifiers, which we can then selectively parse from the fulltext.
Fix broken XML/JSON, select text sections using BeautifulSoup and other Python libraries, and strip non-word characters and e.g. formatting spans.
Use bulk Google Translate/DeepL/LLMs (GPT4/Gemini/etc.) or open-source translation models in combination with UMLS-based glossaries to translate the cleaned text to Dutch.
- External LLM APIs:
- Google Gemini
- OpenAI GPT4
- Anthropic Claude
- Groq (Llama, Mistral etc.)
- External translation APIs:
- Google Translate
- DeepL
- pre-trained NLMs (in principle all models that are available through Huggingface):
- Maria NMT
- NLLB200
- M2M100
- T5
- pre-trained local LLMs (assuming quantized models):
- Llama
- Mistral
Key features:
- A domain-specific glossary, and related,
- a domain-specific vocabulary.
- A functionality to reduce translation cost, i.e. a dynamically programmed wrapper
- Medical span alignment
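One way to read the cost-reducing wrapper in the feature list above is as memoisation: each distinct text is sent to the translation backend only once. That reading is an assumption, and the class below is a hypothetical sketch, not this library's implementation.

```python
from typing import Callable


class CachedTranslator:
    """Wrap a translate function so each distinct text is translated only once."""

    def __init__(self, translate_fn: Callable[[str], str]):
        self._translate_fn = translate_fn
        self._cache: dict[str, str] = {}
        self.backend_calls = 0  # how often the (paid) backend was actually called

    def __call__(self, text: str) -> str:
        if text not in self._cache:
            self.backend_calls += 1
            self._cache[text] = self._translate_fn(text)
        return self._cache[text]


def fake_translate(text: str) -> str:
    """Stand-in for a real translation API call."""
    return text.upper()


cached = CachedTranslator(fake_translate)
outputs = [cached(sentence) for sentence in ["hart", "long", "hart"]]
print(outputs, cached.backend_calls)
```

In corpora with many repeated sentences or boilerplate lines, such a cache can cut the number of paid API calls considerably; persisting it to disk would extend the saving across runs.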
When we translate annotated corpora we need to make sure that the labeled spans are correctly translated and spanned. We identify three approaches: (1) span-preserving translation, (2) span-inference of translation, (3) translate-then-align
An example approach is given by Seinen et al., who inject the span information directly into the original text prior to translation. Even though this might, arguably, negatively affect the translation quality, the models trained on the translated corpora showed similar accuracy to the model trained on the original English corpora.
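The general idea of injecting span information before translation can be sketched as follows. The `[[0]]...[[/0]]` marker format and the fake translator are assumptions made for illustration; they are not the format used by Seinen et al., and a real translator may drop or alter markers, which this sketch does not handle.

```python
import re

MARKER = re.compile(r"\[\[(/?)(\d+)\]\]")


def inject_markers(text: str, spans: list[tuple[int, int, str]]) -> tuple[str, list[str]]:
    """Wrap each (start, end, label) span in numbered markers; return text and labels."""
    ordered = sorted(spans)
    parts, cursor = [], 0
    for idx, (start, end, _label) in enumerate(ordered):
        parts.append(text[cursor:start])
        parts.append(f"[[{idx}]]{text[start:end]}[[/{idx}]]")
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts), [label for _, _, label in ordered]


def extract_spans(marked: str, labels: list[str]) -> tuple[str, list[tuple[int, int, str]]]:
    """Strip the markers and recover span offsets in the marker-free text."""
    clean, starts, spans, cursor, length = [], {}, [], 0, 0
    for match in MARKER.finditer(marked):
        chunk = marked[cursor:match.start()]
        clean.append(chunk)
        length += len(chunk)
        idx = int(match.group(2))
        if match.group(1):  # closing marker
            spans.append((starts.pop(idx), length, labels[idx]))
        else:               # opening marker
            starts[idx] = length
        cursor = match.end()
    clean.append(marked[cursor:])
    return "".join(clean), spans


def fake_translate(text: str) -> str:
    """Stand-in for a translator that leaves the markers untouched."""
    return text.replace("the fox jumps over the", "de vos springt over het").replace("fence", "hek")


source = "the fox jumps over the fence"
start = source.index("fence")
marked, labels = inject_markers(source, [(start, start + len("fence"), "OBJECT")])
translated, new_spans = extract_spans(fake_translate(marked), labels)
print(translated, new_spans)
```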
In principle we are able to create a training set with span-to-span information, e.g. as part of existing collective translation efforts (such as datatools4heart).
We translate a text as is, e.g. the fox jumps over the fence -> de vos springt over het hek, then we identify the spans in the translated sentence.
One possible solution is to perform semantic similarity matching using multi-lingual (or at least bilingual) bi- or cross-encoders.
A more lexical/syntactic approach is followed by Soares and Krallinger, who use the Aligner tool.
Text extraction pipelines:
- download pdf, extract body text, translate, clean, store
- download XML, fix broken XML, extract body text, translate, clean, store
- download pdf, extract Dutch section, clean, store
As part of Dutch generic corpora
- SoNaR. Raw: ~5GB
- OSCAR. Raw: 41.5GB
- COW14. Raw: 5.3GB
- TnwC: ask permission to share with AMC. Raw: 3.1GB
- CC100. Raw: 31.5GB
- mC4. Raw: 151GB
- Gigacorpus. Raw: 234GB
- MADLAD-400, see paper. Raw: 118.2GB
- PleIAs, Common Corpus. Raw: 180GB
Here we have to note that CC100, mC4, GigaCorpus and MADLAD-400 all consist primarily (if not solely) of Common Crawl (CC) text. The mC4 corpus is "filtered" for profanities and is therefore unsuitable as a basis for medical corpora. If you use multiple extraction versions of CC, be aware of the considerable effort required to deduplicate the text.
As part of English corpora that we can filter, clean, then translate
- eICU: 0.32GB
- PMC Patients: 160k patient records
- PMC OA COMM: 54GB compressed, 150GB uncompressed
- PMC OA NON COMM: 16GB compressed, 50GB uncompressed, PMC OA represent more than 3M articles
- Pubmed abstracts
- S2ORC: 81M abstracts, 8.1M fulltext, estimated 500GB
- Biorxiv/Medrxiv: 0.22M fulltext documents, estimated 20GB
- Clinical guidelines
- Medical PhD-theses
- Apollo corpora.
- UFAL multilingual corpora
We have Italian corpora:
And in principle we are able to identify medical texts in non-Dutch generic corpora followed by a translation.
As part of Dutch clinical texts
- NtvG journals
- Dutch medical protocols
- medical health records from participating medical centers.
Sentence similarity
- WikiMedical sentence similarity
- MedSTS
- MedNLI
- SciTail
- Medical Question Pairs
- Mediqa RQE
- Mediqa NLI
- EHR Rel
Term similarity
- MultiCardioNER
- PharmaCoNER
- Cantemist
- CodiEsp
- LivingNER
- MedMentions
- NCBI disease
- JNLPBA, this seems to be larger
- Flambe
- S800
- tmVar
- GENIA term
- SCAI Disease
- SCAI Chemical
- n2c2
- MuchMore
- GeneTag
- DDI ner
- Chia
- AnEm
- AnatEm
Entity classification
Relationship extraction
- DDI re
- ChemProt
- GAD
- DrugProt
- BioRed
- GENIA relation
- BioRelex
Document classification
- BioASQ
- BioInstructQA
- PubMedQA
- SCIq
- SciFact
- Mediqa QA
- MedHop
- MedDialog
- Evidence Inference
- BioMRC
- BioHowWhy
- BioInfer
- AskAPatient
Entity Classification
Document Classification
In principle all the English corpora can be used given an appropriate translation method.