Chunking seems to not be working properly #326
Comments
I think this is due to differences in the dependency parse. Our dependency parser is more accurate on biomedical data, but its output differs from spaCy's, and spaCy's noun chunker is defined here (https://github.com/explosion/spaCy/blob/a59f3fcf5dab3acf5570483cc314b47cc5833f39/spacy/lang/en/syntax_iterators.py#L8) in terms of specific dependency relations. See an example of the difference for your sentence below. Perhaps we should write our own noun chunker based on our dependency parser, but I am really not an expert in linguistics. You might get some mileage from adapting spaCy's noun chunker to the patterns you observe in our dependency parser's output. Also, @DeNeutoy, do you have any thoughts about this?
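For anyone attempting the adaptation suggested above, here is a minimal pure-Python sketch of how a dependency-based noun chunker of this kind works. It is not spaCy's or scispacy's actual code. The `Token` class, the helper names, the label sets, and the parse itself are all hand-written assumptions for illustration, and the `obj` label is a hypothetical stand-in for whatever label a different parser might emit. The idea is that a nominal token only starts a chunk if its dependency label is in an allowed set, so a parser that uses different labels silently yields fewer chunks until the set is extended.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    pos: str   # coarse part-of-speech tag
    dep: str   # dependency label linking this token to its head
    head: int  # index of the head token (the root points to itself)


def left_edge(tokens, i):
    """Return the index of the leftmost token in the subtree rooted at i."""
    for j in range(i + 1):
        k = j
        # Walk up the tree from j until we hit i or the root.
        while k != i and tokens[k].head != k:
            k = tokens[k].head
        if k == i:
            return j
    return i


def noun_chunks(tokens, np_deps):
    """Collect chunks headed by nominals whose label is in np_deps."""
    chunks = []
    prev_end = -1
    for i, tok in enumerate(tokens):
        if tok.pos not in {"NOUN", "PROPN", "PRON"}:
            continue
        if tok.dep not in np_deps:
            continue
        start = left_edge(tokens, i)
        if start <= prev_end:  # skip chunks overlapping the previous one
            continue
        prev_end = i
        chunks.append(" ".join(t.text for t in tokens[start:i + 1]))
    return chunks


# Hand-written toy parse of "T cells express their ligands".
# This is NOT output from any real model; the labels are assumptions.
tokens = [
    Token("T", "NOUN", "compound", 1),
    Token("cells", "NOUN", "nsubj", 2),
    Token("express", "VERB", "ROOT", 2),
    Token("their", "PRON", "poss", 4),
    Token("ligands", "NOUN", "obj", 2),  # hypothetical label
]

# Assumed label set in the style of an English noun chunker.
BASE_DEPS = {"nsubj", "nsubjpass", "dobj", "pobj", "attr", "ROOT"}

print(noun_chunks(tokens, BASE_DEPS))            # ['T cells']
print(noun_chunks(tokens, BASE_DEPS | {"obj"}))  # ['T cells', 'their ligands']
```

In practice you would first print `token.dep_` for each token in a few parsed sentences to see which labels the scispacy parser actually assigns to noun heads, then extend the label set accordingly.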
Did you have any luck adapting the noun chunker?
Sorry for my late response. Your answer was very helpful! I decided to try a different approach since I did not have enough time in my project to look for these patterns. Thank you very much!
Has anyone worked on a scispacy noun chunker? Thanks!
Hello and thank you for creating this tool!
I have been trying to use noun_chunks with your pipelines, but it does not seem to be working correctly. I have tried en_core_sci_sm, en_core_sci_md and en_core_sci_lg. For example, when I input the sentence "CCR5(+) and CXCR3(+) T cells are increased in multiple sclerosis and their ligands MIP-1alpha and IP-10 are expressed in demyelinating brain lesions.", I only get "CCR5(+", "CXCR3(+" and "T cells" as chunks, and I would expect more. Using spaCy's en_core_web_trf, for instance, I get "CCR5(+) and CXCR3(+) T cells", "multiple sclerosis", "their ligands", "MIP-1alpha", "IP-10" and "brain lesions".
Is the chunking supposed to work in a similar way to spaCy's pipelines, or have I misinterpreted something?
Thank you in advance!
Best regards