PipelineIE is an information extraction pipeline, built primarily on spaCy, that extracts information from free text. It can run either a general-purpose pipeline or a domain-specific one, such as the biomedical pipeline.
The pipeline currently extracts information as triplets and consists of four stages: Coreference Resolution (Stanford CoreNLP / neuralcoref) >> Sentence Simplification, which decomposes complex sentences into simple ones >> Entity Linking (spaCy / ScispaCy / custom spaCy model) >> Triplet Extraction (currently a Subject - Verb - Object rule using textaCy).
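The order of the stages matters, because each one prepares the text for the next. The following is a minimal, self-contained sketch of that chaining idea only. It does not use PipelineIE, spaCy, or textaCy; every function name is hypothetical, each stage is a deliberately naive string-based stand-in for the real component, and the entity linking stage is left out for brevity.

```python
from typing import Callable, List, Sequence, Tuple

Stage = Callable[[List[str]], List[str]]
Triplet = Tuple[str, str, str]


def resolve_coreferences(sentences: List[str]) -> List[str]:
    """Toy stand-in: replace a sentence-initial 'It' with the first word
    of the previous (already resolved) sentence."""
    resolved: List[str] = []
    for sentence in sentences:
        words = sentence.split()
        if resolved and words and words[0] == "It":
            words[0] = resolved[-1].split()[0]
        resolved.append(" ".join(words))
    return resolved


def simplify_sentences(sentences: List[str]) -> List[str]:
    """Toy stand-in: split on ' and ' and repeat the subject (first word)
    in every later clause."""
    simple: List[str] = []
    for sentence in sentences:
        if not sentence.strip():
            continue  # nothing to split in an empty sentence
        clauses = sentence.rstrip(".").split(" and ")
        subject = clauses[0].split()[0]
        simple.append(clauses[0] + ".")
        for clause in clauses[1:]:
            simple.append(f"{subject} {clause}.")
    return simple


def extract_triplets(sentences: List[str]) -> List[Triplet]:
    """Toy stand-in: treat the first word as subject, the second as verb,
    and the remainder as object."""
    triplets: List[Triplet] = []
    for sentence in sentences:
        words = sentence.rstrip(".").split()
        if len(words) >= 3:
            triplets.append((words[0], words[1], " ".join(words[2:])))
    return triplets


def run_pipeline(
    sentences: List[str],
    stages: Sequence[Stage] = (resolve_coreferences, simplify_sentences),
) -> List[Triplet]:
    """Apply each stage in order, then extract triplets from the result."""
    for stage in stages:
        sentences = stage(sentences)
    return extract_triplets(sentences)


example = ["Aspirin reduces fever.", "It thins blood and relieves pain."]
for triplet in run_pipeline(example):
    print(triplet)
# ('Aspirin', 'reduces', 'fever')
# ('Aspirin', 'thins', 'blood')
# ('Aspirin', 'relieves', 'pain')
```

Running the extraction step alone on the same input would yield a triplet whose subject is "It", which is the situation the earlier stages are meant to prevent.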
How does it help? / What problem does it solve?
- It is important to resolve coreferences in the text before entities and triplets can be extracted so that they contain the original entities rather than pronouns.
- Usually, the subject and object do not represent the complete entity (which can be a sequence of many words) and might only represent a substring of the original entity. The Entity Linker in the pipeline helps to solve this problem while extracting triplets.
- Complex sentences make it difficult to extract information from text. This pipeline solves this problem by decomposing complex sentences into simple sentences.
- Finally, anyone can extract triplets from text in a few lines using the default or biomedical pipeline, which handles the problems above. A custom pipeline can also be used, making it easy to try different options on the input data.
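To illustrate the entity problem from the list above, here is a small sketch of expanding a subject or object head token to the full entity span that contains it. This is not PipelineIE's actual implementation: the function name `expand_to_entity` is hypothetical, and the tokens and entity spans are written by hand rather than produced by a spaCy or ScispaCy model.

```python
from typing import List, Tuple

# An entity span is a (start, end) pair of token indices, end exclusive.
Span = Tuple[int, int]


def expand_to_entity(tokens: List[str], entity_spans: List[Span], head_index: int) -> str:
    """Return the full entity text covering the head token, or the token
    itself when no entity span contains it."""
    for start, end in entity_spans:
        if start <= head_index < end:
            return " ".join(tokens[start:end])
    return tokens[head_index]


# Hand-written example data (assumed, not model output).
tokens = ["NK", "cells", "enhanced", "promoter", "activity"]
entity_spans = [(0, 2), (3, 5)]

# Suppose a Subject - Verb - Object rule picked only the head tokens:
subject_head, verb_index, object_head = 1, 2, 4

triplet = (
    expand_to_entity(tokens, entity_spans, subject_head),
    tokens[verb_index],
    expand_to_entity(tokens, entity_spans, object_head),
)
print(triplet)  # ('NK cells', 'enhanced', 'promoter activity')
```

Without the expansion step, the same rule would return ('cells', 'enhanced', 'activity'), which loses the words that identify the entities.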
Install neuralcoref from source as described below (taken from their GitHub repo)
python -m venv .env
source .env/bin/activate
git clone https://github.com/huggingface/neuralcoref.git
cd neuralcoref
pip install -r requirements.txt
pip install -e .
Optional: Download and unzip CoreNLP 4.2.0 if CoreNLP has to be used for coreference resolution.
Install PipelineIE
git clone https://github.com/vj1494/PipelineIE.git
cd PipelineIE
pip install -r requirements.txt
pip install -e .
Biomedical Pipeline
from pipeline_ie.pipeline_ie import PipelineIE
text = "Co-culture of NK cells with transfected EC enhanced E-selectin, IL-8, and NF-kappaB-dependent promoter activity."
# Biomedical PipelineIE
# The default biomedical pipeline uses the ScispaCy en_core_sci_lg model
# The same model is used for neuralcoref, entity linking and triplet extraction
# pipeline="default" uses the spaCy en model
# Sentence simplification is enabled by default. To disable it, pass sentence_simplify=False
pie = PipelineIE(text, pipeline="biomedical")
# Returns a dataframe
df = pie.pipeline_triplet()
Please refer to the example for additional usage.
Sentence Simplification - (https://github.com/freyamehta99/Sentence-Simplification)