Adding CoreferenceProcessor #41

Open · wants to merge 65 commits into base: master

Commits (65)
880329f
add empty coref processor
KiaLAN Jun 15, 2022
d9b3f71
add process method. TODO: correct span
KiaLAN Jun 15, 2022
da36372
fix output span
KiaLAN Jun 16, 2022
6dafabe
add more default configs and comment
KiaLAN Jun 16, 2022
0e18a8f
change nlp pipeline
KiaLAN Jun 17, 2022
2d28246
Merge branch 'master' of https://github.com/asyml/ForteHealth into 15…
KiaLAN Jun 17, 2022
fcafb22
add more entries to MedicalArticle; add comment
KiaLAN Jun 17, 2022
0b675aa
fixed some comments
KiaLAN Jun 17, 2022
d0812ee
fix comments and format files
KiaLAN Jun 17, 2022
0769453
update requirements.txt
KiaLAN Jun 17, 2022
5a55b41
update setup.py
KiaLAN Jun 17, 2022
fdc343e
add unit test
KiaLAN Jun 17, 2022
85bf9bb
remove duplicated definition
KiaLAN Jun 17, 2022
9579923
fix ddt
KiaLAN Jun 17, 2022
4e41b34
fix import
KiaLAN Jun 17, 2022
443f886
formatting
KiaLAN Jun 17, 2022
5f3cf34
remove long lines
KiaLAN Jun 17, 2022
a11f209
remove unused import
KiaLAN Jun 17, 2022
07dbb02
add cython to dependency
KiaLAN Jun 17, 2022
008eb1b
delay the installation of neuralcoref
KiaLAN Jun 17, 2022
ce2fd6f
put installation of neuralcoref in workflow
KiaLAN Jun 17, 2022
e832289
fix typo
KiaLAN Jun 17, 2022
c18f704
skip mypy's None is not callable bug
KiaLAN Jun 17, 2022
a9cf38d
black format
KiaLAN Jun 17, 2022
2c346fb
add spacy
KiaLAN Jun 17, 2022
037f33b
add cython and pytest
KiaLAN Jun 17, 2022
5856d15
remove commented code
KiaLAN Jun 17, 2022
03b6d69
fix unit test data
KiaLAN Jun 17, 2022
9e55211
fix unit test data 2
KiaLAN Jun 17, 2022
2730b88
remove the dependency of SpacyProcessor
KiaLAN Jun 17, 2022
5f6d024
update unit test
KiaLAN Jun 17, 2022
c9d56a4
remove TODO
KiaLAN Jun 17, 2022
09114ad
add load_lang_model
KiaLAN Jun 17, 2022
cb676c3
remove spacy from requirements and setup
KiaLAN Jun 17, 2022
e5ef675
change import order
KiaLAN Jun 17, 2022
8b18e67
use ddt data and unpack
KiaLAN Jul 1, 2022
299d6d3
update config structure
KiaLAN Jul 1, 2022
1a0e239
add comment for lang
KiaLAN Jul 1, 2022
36bcaba
fix set() bug
KiaLAN Jul 1, 2022
3a20b8d
add offset calculation assertion
KiaLAN Jul 1, 2022
26efabd
formatting
KiaLAN Jul 1, 2022
f7db024
udpate test
KiaLAN Jul 1, 2022
f453e60
shorten comment
KiaLAN Jul 1, 2022
1d006fd
remove store_scores
KiaLAN Jul 1, 2022
dcec89f
fix assertion
KiaLAN Jul 1, 2022
465341d
remove store_scores in test
KiaLAN Jul 1, 2022
3ee2f7b
rename document to entry
KiaLAN Jul 1, 2022
1d44ff7
fix cfg_inference kwargs
KiaLAN Jul 1, 2022
0ad95d8
add conv_dict test
KiaLAN Jul 1, 2022
87b0996
black reformat
KiaLAN Jul 1, 2022
148e918
fix pylint
KiaLAN Jul 1, 2022
2fefe18
update comment
KiaLAN Jul 1, 2022
662b324
update comment
KiaLAN Jul 1, 2022
6bcc58c
try one
KiaLAN Jul 9, 2022
ae31363
use subprocess to install cython
KiaLAN Jul 9, 2022
74ffad4
use subprocess to install cython and spacy
KiaLAN Jul 9, 2022
f3dc9e3
add extras_require for icd and coref
KiaLAN Jul 9, 2022
2581edb
remove spacy and neuralcoref stage from main.yml
KiaLAN Jul 9, 2022
e09dfe7
replace load_module with get_class
KiaLAN Jul 9, 2022
c925f5b
remove 'model' argument
KiaLAN Jul 9, 2022
f6c1a8b
fix rebundunt import and args
KiaLAN Jul 9, 2022
035a2c8
fix merge conflict
KiaLAN Jul 16, 2022
7b31beb
fix merge conflict
KiaLAN Jul 16, 2022
5ef774a
fix merge conflict
KiaLAN Jul 16, 2022
ffe88f6
fix merge conflict: restore coref test
KiaLAN Jul 16, 2022
6 changes: 0 additions & 6 deletions .github/workflows/main.yml
@@ -83,12 +83,6 @@ jobs:
run: |
pip install --use-feature=2020-resolver --progress-bar off .[test]

- name: Install Forte-wrappers-spacy
hunterhector (Member):

I think we shouldn't remove the installation directly from the workflow, let's actually use the workflow matrix set to run different dependencies as different runs.

Here is an example: https://github.com/asyml/forte-wrappers/blob/main/.github/workflows/main.yml#L84
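The matrix approach in the linked example could be sketched roughly as follows for this repo's main.yml. This is a hypothetical fragment, not the actual configuration: the `modules` axis, the `extra` and `test-dir` fields, and the extras names are placeholders for illustration.

```yaml
# Hypothetical sketch: test each processor's dependency set in a
# separate matrix run, so incompatible dependencies never coexist.
jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: [3.7]
        modules:
          - { extra: "coreference", test-dir: "tests/coreference" }
          - { extra: "icd_coding", test-dir: "tests/icd" }
    steps:
      - uses: actions/checkout@v2
      - name: Install the module under test
        run: |
          pip install --progress-bar off .[${{ matrix.modules.extra }}]
      - name: Test this module only
        run: |
          coverage run -m pytest ${{ matrix.modules.test-dir }}
```

Each matrix entry expands into its own job, so each processor is installed and tested in isolation.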

KiaLAN (Collaborator Author), Jul 20, 2022:

"use the workflow matrix set to run different dependencies as different runs."

Do you mean in the future we will have different versions of forte-wrapper, and you want to define a matrix to test them all?

hunterhector (Member), Jul 20, 2022:

  1. No, not different versions of forte-wrapper. I show you the forte-wrapper repository workflow as an example of how to set up a workflow matrix.
  2. In this repo, ForteHealth, we can set up a similar matrix, since not all processors in this repo can be installed and run at the same time, they can be tested individually. Note that you could use the extra_install method in setup.py to specify different sets of dependencies.

You may wonder how we make them work together in the future, that's another issue (we have solutions such as RemoteProcessor or Docker images). Right now we can focus on making them each testable.

KiaLAN (Collaborator Author):

I still have a question: @Piyush13y seems to want to set forte.spacy as an extra_install. If so, forte.spacy and neuralcoref will be in the same extras_require list, for example:

setuptools.setup(
....
    install_requires=[
        "forte~=0.2.0",
    ],
    extras_require={
        .....
        "coreference": [
            "forte.spacy",  # TODO: version
            "cython>=0.25",
            "neuralcoref @ git+https://[email protected]/huggingface/[email protected]#egg=neuralcoref",
        ],
    },
    include_package_data=True,
    python_requires=">=3.6",
)

I think pip won't make sure that forte.spacy is installed ahead of neuralcoref.

KiaLAN (Collaborator Author):

@Piyush13y seems to want to set forte.spacy as an extra_install.

Emmm... It's a little bit confusing because now forte.spacy is not even in the setup.py in master branch: https://github.com/asyml/ForteHealth/blob/master/setup.py

KiaLAN (Collaborator Author):

Installing a spacy version compatible with neuralcoref causes SpacyProcessor to fail. Since SpacyProcessor is needed in the unit tests, this seems to be a dead end until we can manage to run the modules separately.

hunterhector (Member):

Now we come back to my initial suggestion: use the matrix setup to run the modules separately.

KiaLAN (Collaborator Author):

Bad news. A spacy version compatible with neuralcoref is not compatible with my RTX 3060.

I checked the issues in neuralcoref regarding the spacy version problem; almost everyone suggests building neuralcoref from source.

KiaLAN (Collaborator Author):

The latest release of neuralcoref was in 2019. Maybe it is worth considering switching to a better-maintained package.

hunterhector (Member):

Huh, actually for now we are simply looking for a coreference module that works. Maybe we can directly add this function to AllenNLPProcessor, or another package you find easy to work with.

run: |
git clone https://github.com/asyml/forte-wrappers.git
cd forte-wrappers
pip install src/spacy

- name: Test with pytest and run coverage
run: |
coverage run -m pytest tests/
209 changes: 209 additions & 0 deletions fortex/health/processors/coreference_processor.py
@@ -0,0 +1,209 @@
# Copyright 2022 The Forte Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Coreference Processor
"""
from typing import Dict, Set

import neuralcoref

from forte.common import Resources, ProcessExecutionException
from forte.common.configuration import Config
from forte.data.data_pack import DataPack
from forte.processors.base import PackProcessor
from forte.utils import get_class

from ft.onto.base_ontology import CoreferenceGroup

from fortex.spacy.spacy_processors import load_lang_model

__all__ = [
"CoreferenceProcessor",
]


class CoreferenceProcessor(PackProcessor):
    r"""
    Implementation of this `CoreferenceProcessor` is based on the
    huggingface NeuralCoref package. You can find more details in the
    original repository.

    Note that the NeuralCoref package from PyPI uses a dated spaCy
    version (2.1), which can cause a segmentation fault with the spaCy
    version we use (2.3). Please install NeuralCoref by building it
    from source.

    Referred repository link:
    https://github.com/huggingface/neuralcoref
    """

    def __init__(self):
        super().__init__()
        self.spacy_nlp = None

    def set_up(self, configs: Config):
        self.spacy_nlp = load_lang_model(configs.lang)

        if self.spacy_nlp is None:
            raise ProcessExecutionException(
                "The SpaCy pipeline is not initialized, maybe you "
                "haven't called the initialization function."
            )

        cfg_inference = configs.cfg_inference
        neuralcoref.add_to_pipe(self.spacy_nlp, model=True, **cfg_inference)

    def initialize(self, resources: Resources, configs: Config):
        super().initialize(resources, configs)
        self.set_up(configs)

    def _process(self, input_pack: DataPack):
        r"""
        Coreference resolution is done by a spaCy pipeline with
        `NeuralCoref` added.

        Then we translate the output to `CoreferenceGroup`.
        """
        # Default: Document
        entry_type = get_class(self.configs.entry_type)

        # Default: MedicalEntityMention
        mention_type = get_class(self.configs.mention_type)

        for entry_specified in input_pack.get(entry_type=entry_type):
            result = self.spacy_nlp(entry_specified.text)

            if result._.has_coref:
                for cluster in result._.coref_clusters:
                    mentions = []
                    for mention in cluster.mentions:
                        mention_text = mention.text
                        mention = mention_type(
                            input_pack,
                            mention.start_char + entry_specified.begin,
                            mention.end_char + entry_specified.begin,
                        )

                        # TODO: remove assertion?
                        assert mention.text == mention_text, (
                            f"The processor extracted mention {mention.text}"
                            " which is different from the original mention"
                            f" {mention_text}. The offset calculation is wrong."
                        )
                        mentions.append(mention)

                    group = CoreferenceGroup(input_pack)
                    group.add_members(mentions)
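The span arithmetic in `_process` (adding `entry_specified.begin` to NeuralCoref's entry-local character offsets) can be illustrated with plain strings. This is a standalone sketch of the idea with a made-up `to_pack_span` helper, not the processor's actual code:

```python
# Sketch: mapping a mention's entry-local character span back to
# pack-level offsets, as _process does for each mention.
def to_pack_span(entry_begin: int, start_char: int, end_char: int):
    """Shift a span local to an entry by the entry's begin offset."""
    return entry_begin + start_char, entry_begin + end_char

pack_text = "Prefix text. Angela said she was fine."
entry_begin = pack_text.index("Angela")  # entry covers the 2nd sentence
entry_text = pack_text[entry_begin:]

# Suppose the coref model reports "she" at these entry-local offsets:
start_char = entry_text.index("she")
end_char = start_char + len("she")

begin, end = to_pack_span(entry_begin, start_char, end_char)
# The pack-level slice recovers the original mention text, which is
# exactly what the assertion in _process checks.
assert pack_text[begin:end] == "she"
```

If the shift were forgotten, the pack-level slice would point at the wrong characters, which is the bug the assertion above guards against.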

    @classmethod
    def default_configs(cls):
        r"""
        This defines a basic config structure for `CoreferenceProcessor`.

        Following are the keys for this dictionary:

        - `entry_type`: Input entry type. You can change the context of
          coreference resolution by setting this parameter. For example,
          if you want to do coreference resolution within documents, set
          it to `"ft.onto.base_ontology.Document"`. If you want to do
          coreference resolution within sentences, set it to
          `"ft.onto.base_ontology.Sentence"`.
          Default: `"ft.onto.base_ontology.Document"`.
        - `mention_type`: The type of members in `CoreferenceGroup`.
          It can be set to `"ft.onto.base_ontology.EntityMention"` or
          its subclasses.
          Default: `"ftx.medical.clinical_ontology.MedicalEntityMention"`.
        - `lang`: The SpaCy pipeline to be used. The pipeline does the
          preprocessing steps for NeuralCoref.
          Default: `"en_core_web_sm"`.
        - `cfg_inference`: A dict containing the inference configs of
          NeuralCoref. See `get_default_cfg_inference` for default
          values, and see
          https://github.com/huggingface/neuralcoref/blob/master/README.md#parameters
          for the meaning of these parameters.

        Returns: A dictionary with the default config for this processor.
        """
        return {
            "entry_type": "ft.onto.base_ontology.Document",
            "mention_type": "ftx.medical.clinical_ontology.MedicalEntityMention",
            "lang": "en_core_web_sm",
            "cfg_inference": cls.get_default_cfg_inference(),
        }
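Since `default_configs` returns a plain nested dict, a user-supplied config only needs to override the keys it cares about. The sketch below illustrates the resulting merge behavior with a hypothetical `merge_configs` helper; Forte performs this merge internally, and this is not its actual implementation:

```python
# Sketch: how user overrides combine with the defaults above.
def merge_configs(default: dict, override: dict) -> dict:
    """Recursively overlay `override` onto a copy of `default`."""
    merged = dict(default)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = value
    return merged

defaults = {
    "entry_type": "ft.onto.base_ontology.Document",
    "lang": "en_core_web_sm",
    "cfg_inference": {"greedyness": 0.5, "max_dist": 50},
}
user = {"cfg_inference": {"greedyness": 0.75}}

configs = merge_configs(defaults, user)
assert configs["cfg_inference"] == {"greedyness": 0.75, "max_dist": 50}
assert configs["lang"] == "en_core_web_sm"
```

Only `greedyness` changes; all other defaults, including nested ones, are preserved.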

    @classmethod
    def get_default_cfg_inference(cls):
        """
        This defines the default inference config of NeuralCoref.

        Following are the keys for this dictionary:

        - `greedyness` (`float`): A number between 0 and 1 determining
          how greedy the model is about making coreference decisions
          (more greedy means more coreference links).
          Default: `0.5`.
        - `max_dist` (`int`): How many mentions back to look when
          considering possible antecedents of the current mention.
          Decreasing the value will cause the system to run faster
          but less accurately.
          Default: `50`.
        - `max_dist_match` (`int`): The system will consider linking
          the current mention to a preceding one further than `max_dist`
          away if they share a noun or proper noun. In this case, it
          looks `max_dist_match` away instead.
          Default: `500`.
        - `blacklist` (`bool`): Whether the system should resolve
          coreferences for pronouns in the following list:
          ["i", "me", "my", "you", "your"].
          Default: `True`.
        - `conv_dict` (`dict(str, list(str))`): A conversion dictionary
          that you can use to replace the embeddings of rare words (keys)
          by an average of the embeddings of a list of common words
          (values). Ex: `conv_dict={"Angela": ["woman", "girl"]}`
          will help resolve coreferences for "Angela" by using the
          embeddings of the more common "woman" and "girl" instead of
          the embedding of "Angela". This currently only works for
          single words (not for word groups).
          Default: `None`.

        Returns: A dictionary with the default inference config of
        NeuralCoref.
        """
        return {
            "greedyness": 0.5,
            "max_dist": 50,
            "max_dist_match": 500,
            "blacklist": True,
            "conv_dict": None,
        }
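The `conv_dict` mechanism boils down to averaging word vectors. Below is a self-contained illustration with toy 3-dimensional embeddings; the vectors are made up for the example, and real NeuralCoref operates on spaCy's embedding table rather than a plain dict:

```python
# Sketch: what conv_dict={"Angela": ["woman", "girl"]} effectively
# does - the rare word's embedding is replaced by the mean of the
# common words' embeddings. Toy vectors, for illustration only.
def averaged_embedding(words, table):
    """Element-wise mean of the vectors for `words` in `table`."""
    vectors = [table[w] for w in words]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

embeddings = {
    "woman": [0.2, 0.4, 0.6],
    "girl": [0.4, 0.2, 0.8],
}
# "Angela" is rare, so stand in its embedding with the average:
embeddings["Angela"] = averaged_embedding(["woman", "girl"], embeddings)
# embeddings["Angela"] is now approximately [0.3, 0.3, 0.7]
```

The model then resolves mentions of "Angela" using this averaged vector, which sits in a well-populated region of the embedding space.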

    def expected_types_and_attributes(self):
        r"""
        Method to add user specified expected type which would be checked
        before running the processor if the pipeline is initialized with
        `enforce_consistency=True` or
        :meth:`~forte.pipeline.Pipeline.enforce_consistency` was enabled
        for the pipeline.
        """
        # return {self.configs.entry_type: {"text"}}  # TODO: fix this
        return {self.configs.entry_type: set()}

    def record(self, record_meta: Dict[str, Set[str]]):
        r"""
        Method to add the output type record of `CoreferenceProcessor`,
        which is `"ft.onto.base_ontology.CoreferenceGroup"` with attribute
        `members`, to :attr:`forte.data.data_pack.Meta.record`.

        Args:
            record_meta: the field in the datapack for the type record
                that needs to be filled in for consistency checking.
        """
        record_meta["ft.onto.base_ontology.CoreferenceGroup"] = {"members"}
8 changes: 8 additions & 0 deletions requirements.txt
@@ -26,3 +26,11 @@ git+https://[email protected]/asyml/forte-wrappers.git#egg=forte.huggingface&subdir
dataclasses~=0.8; python_version < '3.7'
setuptools~=57.0.0
transformers~=4.2.2

# spacy>=2.3.0, <=2.3.5 # will be installed by forte.spacy
cython>=0.25
pytest
# It is annoying that if we install neuralcoref and spacy at the same
# time, neuralcoref will throw "Cython failed" during building.
# Therefore, we must install neuralcoref after spacy is installed.
# git+https://[email protected]/huggingface/[email protected]#egg=neuralcoref
Member:

We can deal with this through setup.py using extras_require.

45 changes: 38 additions & 7 deletions setup.py
@@ -1,37 +1,68 @@
import sys
from pathlib import Path
import setuptools
import subprocess
import os


long_description = (Path(__file__).parent / "README.md").read_text()

if sys.version_info < (3, 6):
sys.exit('Python>=3.6 is required by forte-medical.')
sys.exit("Python>=3.6 is required by forte-medical.")

# If we install neuralcoref and spacy at the same
# time, neuralcoref will throw "Cython failed" during building,
# which is because neuralcoref does not set them as dependencies
# properly.
# Therefore, we must install neuralcoref after cython and spacy
# are installed.
p = subprocess.call(
Member: Can we use a more standard way in setup.py?

[
sys.executable,
"-m",
"pip",
"install",
"forte.spacy", # TODO: version
"cython>=0.25",
],
env=os.environ,
)
if p != 0:
raise RuntimeError("Installing NeuralCoref dependencies failed.")

setuptools.setup(
name="forte.health",
version='0.1.0',
version="0.1.0",
url="https://github.com/asyml/ForteHealth",
description="NLP pipeline framework for biomedical and clinical domains",
long_description=long_description,
long_description_content_type="text/markdown",
license='Apache License Version 2.0',
license="Apache License Version 2.0",
packages=setuptools.find_namespace_packages(
include=['fortex.health', 'ftx.*'],
exclude=["scripts*", "examples*", "tests*"]
include=["fortex.health", "ftx.*"], exclude=["scripts*", "examples*", "tests*"]
),
namespace_packages=["fortex"],
install_requires=[
'forte~=0.2.0',
"forte~=0.2.0",
"forte.spacy", # TODO: version
"cython>=0.25",
],
extras_require={
"test": [
"ddt",
"testfixtures",
"transformers==4.2.2",
"protobuf==3.19.4",
"pytest",
"neuralcoref @ git+https://[email protected]/huggingface/[email protected]#egg=neuralcoref",
],
"icd_coding": [
"transformers",
],
"coreference": [
"neuralcoref @ git+https://[email protected]/huggingface/[email protected]#egg=neuralcoref",
],
},
include_package_data=True,
python_requires='>=3.6'
python_requires=">=3.6",
)