15 Adding CoreferenceProcessor #41 (Open)

KiaLAN wants to merge 65 commits into asyml:master from KiaLAN:15_coreference_processor
Changes from all commits (65 commits):
All commits authored by KiaLAN:

880329f add empty coref processor
d9b3f71 add process method. TODO: correct span
da36372 fix output span
6dafabe add more default configs and comment
0e18a8f change nlp pipeline
2d28246 Merge branch 'master' of https://github.com/asyml/ForteHealth into 15…
fcafb22 add more entries to MedicalArticle; add comment
0b675aa fixed some comments
d0812ee fix comments and format files
0769453 update requirements.txt
5a55b41 update setup.py
fdc343e add unit test
85bf9bb remove duplicated definition
9579923 fix ddt
4e41b34 fix import
443f886 formatting
5f3cf34 remove long lines
a11f209 remove unused import
07dbb02 add cython to dependency
008eb1b delay the installation of neuralcoref
ce2fd6f put installation of neuralcoref in workflow
e832289 fix typo
c18f704 skip mypy's None is not callable bug
a9cf38d black format
2c346fb add spacy
037f33b add cython and pytest
5856d15 remove commented code
03b6d69 fix unit test data
9e55211 fix unit test data 2
2730b88 remove the dependency of SpacyProcessor
5f6d024 update unit test
c9d56a4 remove TODO
09114ad add load_lang_model
cb676c3 remove spacy from requirements and setup
e5ef675 change import order
8b18e67 use ddt data and unpack
299d6d3 update config structure
1a0e239 add comment for lang
36bcaba fix set() bug
3a20b8d add offset calculation assertion
26efabd formatting
f7db024 udpate test
f453e60 shorten comment
1d006fd remove store_scores
dcec89f fix assertion
465341d remove store_scores in test
3ee2f7b rename document to entry
1d44ff7 fix cfg_inference kwargs
0ad95d8 add conv_dict test
87b0996 black reformat
148e918 fix pylint
2fefe18 update comment
662b324 update comment
6bcc58c try one
ae31363 use subprocess to install cython
74ffad4 use subprocess to install cython and spacy
f3dc9e3 add extras_require for icd and coref
2581edb remove spacy and neuralcoref stage from main.yml
e09dfe7 replace load_module with get_class
c925f5b remove 'model' argument
f6c1a8b fix rebundunt import and args
035a2c8 fix merge conflict
7b31beb fix merge conflict
5ef774a fix merge conflict
ffe88f6 fix merge conflict: restore coref test
New file, the coreference processor module (@@ -0,0 +1,209 @@):

```python
# Copyright 2022 The Forte Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Coreference Processor
"""
from typing import Dict, Set

import neuralcoref

from forte.common import Resources, ProcessExecutionException
from forte.common.configuration import Config
from forte.data.data_pack import DataPack
from forte.processors.base import PackProcessor
from forte.utils import get_class

from ft.onto.base_ontology import CoreferenceGroup

from fortex.spacy.spacy_processors import load_lang_model

__all__ = [
    "CoreferenceProcessor",
]


class CoreferenceProcessor(PackProcessor):
    r"""
    The implementation of this CoreferenceProcessor is based on
    huggingface NeuralCoref. You can find more details in the original
    repository.

    Note that the NeuralCoref package from PyPI uses a dated spaCy
    version (2.1), which can cause a segmentation fault with the spaCy
    version we use (2.3). Please install NeuralCoref by building it
    from source.

    Referred repository link:
    https://github.com/huggingface/neuralcoref
    """

    def __init__(self):
        super().__init__()
        self.spacy_nlp = None

    def set_up(self, configs: Config):
        self.spacy_nlp = load_lang_model(configs.lang)

        if self.spacy_nlp is None:
            raise ProcessExecutionException(
                "The SpaCy pipeline is not initialized, maybe you "
                "haven't called the initialization function."
            )

        cfg_inference = configs.cfg_inference
        neuralcoref.add_to_pipe(self.spacy_nlp, model=True, **cfg_inference)

    def initialize(self, resources: Resources, configs: Config):
        super().initialize(resources, configs)
        self.set_up(configs)

    def _process(self, input_pack: DataPack):
        r"""
        Coreference resolution is done by a spaCy pipeline with
        `NeuralCoref` added. The output is then translated into
        `CoreferenceGroup` entries.
        """
        # Default: Document
        entry_type = get_class(self.configs.entry_type)

        # Default: MedicalEntityMention
        mention_type = get_class(self.configs.mention_type)

        for entry_specified in input_pack.get(entry_type=entry_type):
            result = self.spacy_nlp(entry_specified.text)

            if result._.has_coref:
                for cluster in result._.coref_clusters:
                    mentions = []
                    for mention in cluster.mentions:
                        mention_text = mention.text
                        mention = mention_type(
                            input_pack,
                            mention.start_char + entry_specified.begin,
                            mention.end_char + entry_specified.begin,
                        )

                        # TODO: remove assertion?
                        assert mention.text == mention_text, (
                            f"The processor extracted mention {mention.text}"
                            " which is different from the original mention"
                            f" {mention_text}. The offset calculation is wrong."
                        )
                        mentions.append(mention)

                    group = CoreferenceGroup(input_pack)
                    group.add_members(mentions)

    @classmethod
    def default_configs(cls):
        r"""
        This defines a basic config structure for `CoreferenceProcessor`.

        Following are the keys for this dictionary:

        - `entry_type`: Input entry type. You can change the context of
          coreference resolution by setting this parameter. For example,
          if you want to do coreference resolution within documents, set
          it to `"ft.onto.base_ontology.Document"`. If you want to do
          coreference resolution within sentences, set it to
          `"ft.onto.base_ontology.Sentence"`.
          Default: `"ft.onto.base_ontology.Document"`.
        - `mention_type`: The type of members in `CoreferenceGroup`.
          It can be set to `"ft.onto.base_ontology.EntityMention"` or
          its subclasses.
          Default: `"ftx.medical.clinical_ontology.MedicalEntityMention"`.
        - `lang`: The SpaCy pipeline to be used. The pipeline does the
          preprocessing steps for NeuralCoref.
          Default: `"en_core_web_sm"`.
        - `cfg_inference`: A dict containing the inference configs of
          NeuralCoref. See `get_default_cfg_inference` for default
          values, and see
          https://github.com/huggingface/neuralcoref/blob/master/README.md#parameters
          for the meaning of these parameters.

        Returns: A dictionary with the default config for this processor.
        """
        return {
            "entry_type": "ft.onto.base_ontology.Document",
            "mention_type": "ftx.medical.clinical_ontology.MedicalEntityMention",
            "lang": "en_core_web_sm",
            "cfg_inference": cls.get_default_cfg_inference(),
        }

    @classmethod
    def get_default_cfg_inference(cls):
        """
        This defines the default inference config of NeuralCoref.

        Following are the keys for this dictionary:

        - `greedyness` (`float`): A number between 0 and 1 determining
          how greedy the model is about making coreference decisions
          (more greedy means more coreference links).
          Default: `0.5`.
        - `max_dist` (`int`): How many mentions back to look when
          considering possible antecedents of the current mention.
          Decreasing the value will cause the system to run faster
          but less accurately.
          Default: `50`.
        - `max_dist_match` (`int`): The system will consider linking
          the current mention to a preceding one further than `max_dist`
          away if they share a noun or proper noun. In this case, it
          looks `max_dist_match` away instead.
          Default: `500`.
        - `blacklist` (`bool`): Should the system resolve coreferences
          for pronouns in the following list:
          ["i", "me", "my", "you", "your"].
          Default: `True`.
        - `conv_dict` (`dict(str, list(str))`): A conversion dictionary
          that you can use to replace the embeddings of rare words
          (keys) by an average of the embeddings of a list of common
          words (values). Ex: `conv_dict={"Angela": ["woman", "girl"]}`
          will help resolve coreferences for Angela by using the
          embeddings of the more common "woman" and "girl" instead of
          the embedding of Angela.
          This currently only works for single words (not word groups).
          Default: `None`.

        Returns: A dictionary with the default inference config of
        NeuralCoref.
        """
        return {
            "greedyness": 0.5,
            "max_dist": 50,
            "max_dist_match": 500,
            "blacklist": True,
            "conv_dict": None,
        }

    def expected_types_and_attributes(self):
        r"""
        Method to add user specified expected types which would be
        checked before running the processor if the pipeline is
        initialized with `enforce_consistency=True` or
        :meth:`~forte.pipeline.Pipeline.enforce_consistency` was enabled
        for the pipeline.
        """
        # return {self.configs.entry_type: {"text"}}  # TODO: fix this
        return {self.configs.entry_type: set()}

    def record(self, record_meta: Dict[str, Set[str]]):
        r"""
        Method to add the output type record of `CoreferenceProcessor`,
        which is `"ft.onto.base_ontology.CoreferenceGroup"` with
        attribute `members`, to :attr:`forte.data.data_pack.Meta.record`.

        Args:
            record_meta: the field in the datapack for the type record
                that needs to be filled in for consistency checking.
        """
        record_meta["ft.onto.base_ontology.CoreferenceGroup"] = {"members"}
```
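The span arithmetic in `_process` above, which the `assert` guards, translates NeuralCoref's entry-relative character offsets into pack-level offsets by adding `entry_specified.begin`. A minimal sketch of that translation, independent of Forte and NeuralCoref (function and variable names here are illustrative, not part of the PR):

```python
def to_pack_span(entry_begin, mention_start_char, mention_end_char):
    """Translate a mention span relative to an entry (e.g. a Sentence)
    into a span relative to the whole DataPack text, mirroring
    `mention.start_char + entry_specified.begin` in `_process`."""
    return (entry_begin + mention_start_char, entry_begin + mention_end_char)


pack_text = "Irrelevant preamble. Angela said she was fine."
entry_begin = pack_text.index("Angela")  # the entry starts at "Angela"
entry_text = pack_text[entry_begin:]

# Suppose the coref model reports "she" at these entry-relative offsets.
start = entry_text.index("she")
end = start + len("she")

begin, finish = to_pack_span(entry_begin, start, end)
# The assertion in `_process` checks exactly this round-trip:
assert pack_text[begin:finish] == entry_text[start:end] == "she"
```

If the two slices ever differ, the pack-level offsets are wrong, which is what the processor's assertion message reports.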
Changes to requirements.txt:
```
@@ -26,3 +26,11 @@ git+https://[email protected]/asyml/forte-wrappers.git#egg=forte.huggingface&subdir
dataclasses~=0.8; python_version < '3.7'
setuptools~=57.0.0
transformers~=4.2.2

# spacy>=2.3.0, <=2.3.5 # will be installed by forte.spacy
cython>=0.25
pytest
# It is annoying that if we install neuralcoref and spacy at the same
# time, neuralcoref will throw "Cython failed" during building.
# Therefore, we must install neuralcoref after spacy is installed.
# git+https://[email protected]/huggingface/[email protected]#egg=neuralcoref
```
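The comment in requirements.txt above describes an ordering constraint: cython and spaCy must already be installed before neuralcoref builds. A hedged sketch of that pattern as a small helper (the helper name and the injectable `run` hook are illustrative, not part of the PR; by default it shells out to pip exactly like the `subprocess` call in this PR's setup.py):

```python
import subprocess
import sys


def install_in_order(package_groups, run=None):
    """Install each group of requirements in sequence, failing fast.

    neuralcoref's build requires cython and spacy to be present
    already, so it must go in a *later* group than they do.
    `run` is injectable for testing; by default it invokes pip.
    """
    if run is None:
        run = lambda args: subprocess.call(
            [sys.executable, "-m", "pip", "install", *args]
        )
    for group in package_groups:
        if run(list(group)) != 0:
            raise RuntimeError(f"Installing {group} failed.")


# Example ordering (not executed here) that avoids the
# "Cython failed" error described in requirements.txt.
ORDERED_GROUPS = [
    ["cython>=0.25", "forte.spacy"],
    ["git+https://[email protected]/huggingface/[email protected]#egg=neuralcoref"],
]
```

The injectable `run` hook is only there so the ordering logic can be exercised without actually installing anything.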
Review comment on this file: We can deal with this through
Changes to setup.py (@@ -1,37 +1,68 @@), shown here as the resulting file:
```python
import sys
from pathlib import Path
import setuptools
import subprocess
import os


long_description = (Path(__file__).parent / "README.md").read_text()

if sys.version_info < (3, 6):
    sys.exit("Python>=3.6 is required by forte-medical.")

# If we install neuralcoref and spacy at the same
# time, neuralcoref will throw "Cython failed" during building,
# which is because neuralcoref does not set them as dependencies
# properly.
# Therefore, we must install neuralcoref after cython and spacy
# are installed.
p = subprocess.call(
    [
        sys.executable,
        "-m",
        "pip",
        "install",
        "forte.spacy",  # TODO: version
        "cython>=0.25",
    ],
    env=os.environ,
)
if p != 0:
    raise RuntimeError("Installing NeuralCoref dependencies failed.")

setuptools.setup(
    name="forte.health",
    version="0.1.0",
    url="https://github.com/asyml/ForteHealth",
    description="NLP pipeline framework for biomedical and clinical domains",
    long_description=long_description,
    long_description_content_type="text/markdown",
    license="Apache License Version 2.0",
    packages=setuptools.find_namespace_packages(
        include=["fortex.health", "ftx.*"],
        exclude=["scripts*", "examples*", "tests*"],
    ),
    namespace_packages=["fortex"],
    install_requires=[
        "forte~=0.2.0",
        "forte.spacy",  # TODO: version
        "cython>=0.25",
    ],
    extras_require={
        "test": [
            "ddt",
            "testfixtures",
            "transformers==4.2.2",
            "protobuf==3.19.4",
            "pytest",
            "neuralcoref @ git+https://[email protected]/huggingface/[email protected]#egg=neuralcoref",
        ],
        "icd_coding": [
            "transformers",
        ],
        "coreference": [
            "neuralcoref @ git+https://[email protected]/huggingface/[email protected]#egg=neuralcoref",
        ],
    },
    include_package_data=True,
    python_requires=">=3.6",
)
```

Review comment on the `subprocess.call`: Can we use more standard way in
Review discussion:

- I think we shouldn't remove the installation directly from the workflow; let's actually use the workflow matrix setup to run different dependencies as different runs. Here is an example: https://github.com/asyml/forte-wrappers/blob/main/.github/workflows/main.yml#L84
- Do you mean in the future we will have different versions of forte-wrapper, and you want to define a matrix to test them all?
- Not different versions of `forte-wrapper`; I show you the `forte-wrapper` repository workflow as an example of how to set up a workflow matrix. Use the `extra_install` method in `setup.py` to specify different sets of dependencies. You may wonder how we make them work together in the future; that's another issue (we have solutions such as RemoteProcessor or Docker images). Right now we can focus on making them each testable.
- I still have a question: @Piyush13y seems to want to set forte.spacy as an extra_install. If so, forte.spacy and neuralcoref will be in the same extra_install list. I think pip won't make sure that forte.spacy is installed ahead of neuralcoref.
- Emmm... It's a little bit confusing, because right now forte.spacy is not even in the setup.py on the master branch: https://github.com/asyml/ForteHealth/blob/master/setup.py
- Installing a spacy version compatible with neuralcoref causes the failure of `SpacyProcessor`. Since `SpacyProcessor` is needed in the unit tests, this seems to be a dead end until we can manage to run the modules separately.
- Now we come back to my initial suggestion: use the `matrix` setup to run the modules separately.
- Bad news: a spacy version compatible with neuralcoref is not compatible with my RTX 3060. I checked the issues in neuralcoref regarding the spacy version problem; almost everyone is suggesting building neuralcoref from source.
- The latest release of neuralcoref is from 2019. Maybe it is worth considering switching to a better package.
- Huh, actually for now we are simply looking for a coreference module that works. Maybe we can directly add this function to AllenNLPProcessor, or another package you find easy.
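The workflow-matrix idea discussed above could be sketched roughly as follows. This is a hypothetical fragment, not the actual asyml workflow: the job name, Python version, extras names, and test path are illustrative, and it assumes the `extras_require` sets from this PR's setup.py.

```yaml
# Hypothetical sketch: one CI run per optional dependency set declared
# in extras_require, so NeuralCoref's dated spaCy pin never has to
# coexist with the other modules' dependencies in a single environment.
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        extras: ["icd_coding", "coreference"]
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.8"
      - name: Install one dependency set
        run: |
          pip install "cython>=0.25" forte.spacy
          pip install ".[${{ matrix.extras }},test]"
      - name: Run tests for this dependency set
        run: pytest tests/
```

Each matrix entry gets a fresh environment, which sidesteps the install-order and version-conflict problems debated in the thread without forcing all modules to share one dependency resolution.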