Merge branch '157-add_minidocs_from_segments' into 'main'

Resolve "Create minidocs from an annotated corpus"

Closes #157

See merge request heka/medkit!174

changelog: Resolve "Create minidocs from an annotated corpus"
Showing 10 changed files with 637 additions and 3 deletions.
@@ -0,0 +1,114 @@
---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.14.5
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

# Document splitter

+++

This tutorial shows an example of how to split a document using its sections as a reference.

```{seealso}
We combine several operations, such as the **section tokenizer**, the **regexp matcher** and a **custom operation**. Please see the other examples for more information about each of them.
```

+++

## Adding annotations in a document

Let's detect the sections and add some annotations using medkit operations.

```{code-cell} ipython3
# You can download the file used in this example from the medkit repository:
# !wget https://raw.githubusercontent.com/TeamHeka/medkit/main/docs/data/text/1.txt
from pathlib import Path

from medkit.core.text import TextDocument

doc = TextDocument.from_file(Path("../../data/text/1.txt"))
print(doc.text)
```
**Defining the operations**

```{code-cell} ipython3
from medkit.text.ner import RegexpMatcher, RegexpMatcherRule
from medkit.text.segmentation import SectionTokenizer

# Define a section tokenizer
# The section tokenizer uses a dictionary of keywords to identify sections
section_dict = {
    "patient": ["SUBJECTIF"],
    "traitement": ["MÉDICAMENTS", "PLAN"],
    "allergies": ["ALLERGIES"],
    "examen clinique": ["EXAMEN PHYSIQUE"],
    "diagnostique": ["EVALUATION"],
}
section_tokenizer = SectionTokenizer(section_dict=section_dict)

# Define a NER operation to create 'problem' and 'treatment' entities
regexp_rules = [
    RegexpMatcherRule(regexp=r"\ballergies\b", label="problem"),
    RegexpMatcherRule(regexp=r"\basthme\b", label="problem"),
    RegexpMatcherRule(regexp=r"\ballegra\b", label="treatment", case_sensitive=False),
    RegexpMatcherRule(regexp=r"\bvaporisateurs\b", label="treatment"),
    RegexpMatcherRule(regexp=r"\bloratadine\b", label="treatment", case_sensitive=False),
    RegexpMatcherRule(regexp=r"\bnasonex\b", label="treatment", case_sensitive=False),
]
regexp_matcher = RegexpMatcher(rules=regexp_rules)
```

We can now annotate the document:

```{code-cell} ipython3
# Detect annotations
sections = section_tokenizer.run([doc.raw_segment])
entities = regexp_matcher.run([doc.raw_segment])

# Annotate
for ann in sections + entities:
    doc.anns.add(ann)

print(f"The document contains {len(sections)} sections and {len(entities)} entities\n")
```
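
Before splitting, it can be useful to check what the section tokenizer produced. The next cell is a small inspection sketch; it assumes each section segment stores its section name under the `name` metadata key (the same key used later when naming the new documents) and prints `None` otherwise.

```{code-cell} ipython3
# Inspect the detected sections: label, section name (if present) and start of the text
for section in sections:
    print(section.label, "|", section.metadata.get("name"), "|", repr(section.text[:40]))
```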

## Split the document by sections

Once the document is annotated, we can use the medkit operation {class}`~medkit.text.postprocessing.DocumentSplitter` to create smaller documents from its sections.

By default, its `entity_labels`, `attr_labels` and `relation_labels` parameters are set to `None`, so all annotations are propagated to the resulting documents. To copy only some of them, pass the desired labels, as in the cells below.

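For comparison, a splitter built with only `segment_label` relies on those defaults and keeps everything; this is a minimal sketch of the default behaviour, not used in the rest of the tutorial.

```{code-cell} ipython3
from medkit.text.postprocessing import DocumentSplitter

# With the defaults (entity_labels=None, attr_labels=None, relation_labels=None),
# every entity, attribute and relation attached to a section is copied to its new document
keep_all_splitter = DocumentSplitter(segment_label="section")
```

Here, we instead keep only the `treatment` and `problem` entities and drop attributes and relations:
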
```{code-cell} ipython3
from medkit.text.postprocessing import DocumentSplitter

doc_splitter = DocumentSplitter(
    segment_label="section",                 # segments to use as references
    entity_labels=["treatment", "problem"],  # entities to include
    attr_labels=[],                          # do not copy attributes
    relation_labels=[],                      # do not copy relations
)
new_docs = doc_splitter.run([doc])
print(f"The document was divided into {len(new_docs)} documents\n")
```
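
Each new document corresponds to one section. As a quick check (a minimal sketch relying only on the `metadata` and `text` attributes already used in this tutorial), we can print the section name and the beginning of each new document:

```{code-cell} ipython3
# Show which section each new document comes from and how its text starts
for new_doc in new_docs:
    print(new_doc.metadata.get("name"), "->", repr(new_doc.text[:60]))
```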

The selected entities of each source segment are carried over to the corresponding document. Below, we visualize the new documents using the displacy utilities.

```{code-cell} ipython3
from spacy import displacy

from medkit.text.spacy.displacy_utils import medkit_doc_to_displacy

options_displacy = dict(colors={"treatment": "#85C1E9", "problem": "#ff6961"})

for new_doc in new_docs:
    print(f"New document from the section called '{new_doc.metadata['name']}'")
    # convert new document to displacy
    displacy_data = medkit_doc_to_displacy(new_doc)
    displacy.render(displacy_data, manual=True, style="ent", options=options_displacy)
```
@@ -1,9 +1,11 @@
__all__ = [
    "AttributeDuplicator",
    "compute_nested_segments",
    "DocumentSplitter",
    "filter_overlapping_entities",
]

from .alignment_utils import compute_nested_segments
from .attribute_duplicator import AttributeDuplicator
from .document_splitter import DocumentSplitter
from .overlapping import filter_overlapping_entities