Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

15 Adding CoreferenceProcessor #41

Open
wants to merge 65 commits into
base: master
Choose a base branch
from

Conversation

KiaLAN
Copy link
Collaborator

@KiaLAN KiaLAN commented Jun 17, 2022

This PR fixes #15.

Description of changes

Added CoreferenceProcessor and its unit test. Updated setup.py, requirements.txt, and ci workflow.

  • CoreferenceProcessor: a wrapper for NeuralCoref. It translates the original output to CoreferenceGroup and MedicalEntityMention and puts them into data pack.
  • Unit test: tests on daily language and medical notes.
  • setup.py and requirements.txt: added cython, and pytest, which are dependencies of neuralcoref.
  • CI workflow: added a stage called "Install NeuralCoref", after the installation of fortex.spacy. Since neuralcoref needs to be built from source. See the reason in NeuralCoref Installation Issue.

Possible influences of this PR.

  • Be careful with the "Install NeuralCoref" workflow stage. It needs to be placed after the installation of cython and fortex.spacy.

Test Conducted

Tested on new unit test.

NeuralCoref Installation Issue

Installing neuralcoref from PyPI causes segmentation fault, since the binary file is built with spacy 2.1, while the spacy used by forte.spacy is 2.3. We need to build from source to use customized spacy version.

Building from source needs spacy and cython being already installed. Putting neuralcoref together with spacy and cython in requirements.txt and setup.py can cause "Cython failed" exception, since building binary files comes ahead of the installation of cython and spacy. Therefore we need to delay the installation of neuralcoref.

But I don't know how to do this in an elegant way. requirements.txt seems not to support this kind of two stage installation. setup.py seems to support, but I don't know how to. So I added a new stage in CI after the installation of requirements.txt and fortex.spacy, treating neuralcoref as an optional dependency like fortex.spacy.

@KiaLAN
Copy link
Collaborator Author

KiaLAN commented Jul 10, 2022

Please review the latest commits.

@hunterhector hunterhector requested a review from Piyush13y July 16, 2022 04:24
@KiaLAN
Copy link
Collaborator Author

KiaLAN commented Jul 16, 2022

Merge conflict for setup.py and tests folder fixed.

@@ -83,12 +83,6 @@ jobs:
run: |
pip install --use-feature=2020-resolver --progress-bar off .[test]

- name: Install Forte-wrappers-spacy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we shouldn't remove the installation directly from the workflow, let's actually use the workflow matrix set to run different dependencies as different runs.

Here is an example: https://github.com/asyml/forte-wrappers/blob/main/.github/workflows/main.yml#L84

Copy link
Collaborator Author

@KiaLAN KiaLAN Jul 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"use the workflow matrix set to run different dependencies as different runs."

Do you mean in the future we will have different versions of forte-wrapper, and you want to define a matrix to test them all?

Copy link
Member

@hunterhector hunterhector Jul 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. No, not different versions of forte-wrapper. I show you the forte-wrapper repository workflow as an example of how to set up a workflow matrix.
  2. In this repo, ForteHealth, we can set up a similar matrix, since not all processors in this repo can be installed and run at the same time, they can be tested individually. Note that you could use the extra_install method in setup.py to specify different sets of dependencies.

You may wonder how we make them work together in the future, that's another issue (we have solutions such as RemoteProcessor or Docker images). Right now we can focus on making them each testable.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still have a question: @Piyush13y seems to want to set forte.spacy as an extra_install. If so, forte.spacy and neurocoref will be in the same extra_install list, example:

setuptools.setup(
....
    install_requires=[
        "forte~=0.2.0",
    ],
    extras_require={
        .....
        "coreference": [
            "forte.spacy",  # TODO: version
            "cython>=0.25",
            "neuralcoref @ git+https://[email protected]/huggingface/[email protected]#egg=neuralcoref",
        ],
    },
    include_package_data=True,
    python_requires=">=3.6",
)

I think pip won't make sure that forte.spacy is install ahead of neuralcoref.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Piyush13y seems to want to set forte.spacy as an extra_install.

Emmm... It's a little bit confusing because now forte.spacy is not even in the setup.py in master branch: https://github.com/asyml/ForteHealth/blob/master/setup.py

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Installing a spacy version compatible with neuralcoref causes the failure of SpacyProcessor. Since SpacyProcessor is needed in unit test, this seems to be a dead end before we can manage to run the modules separately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now we come back to my initial suggestion, use the matrix set up to run the module separately

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bad news. A spacy version compatible with neuralcoref is not compatible with my RTX 3060.

I checked the issues in neuralcoref regarding the spacy version problem, almost everyone is suggesting building neuralcoref from source.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest release of neuralcoref is in 2019. Maybe it is worth considering changing a better package.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh, actually for now we are simply looking for a coreference module that works. Maybe we can directly add this function to AllenNLPProcessor, or other packages you find easy

# It is annoying that if we install neuralcoref and spacy at the same
# time, neuralcoref will throw "Cython failed" during building.
# Therefore, we must install neuralcoref after spacy is installed.
# git+https://[email protected]/huggingface/[email protected]#egg=neuralcoref
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can deal with this through setup.py using extra requires.

# properly.
# Therefore, we must install neuralcoref after cython and spacy
# are installed.
p = subprocess.call(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use more standard way in setup.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for Coreference Resolution
3 participants