PubTatorCorpusReader.load_corpus() can't handle multiple/composite mentions #4

Open
DennisMinn opened this issue Aug 8, 2022 · 0 comments

DennisMinn commented Aug 8, 2022

I'm trying to parse the NCBI Disease Corpus train set, but load_corpus() raises an error on mentions that map to multiple MeSH terms (e.g. "colon and some other cancers" -> D003110|D009369). Any suggestions on how to handle this, aside from removing the lines that include "CompositeMention"?

Dataset

10192393|t|A common human skin tumour is caused by activating mutations in beta-catenin.
10192393|a|WNT signalling orchestrates... but a small percentage of colon and some other cancers harbour...
10192393        15      26      skin tumour     DiseaseClass    D012878
10192393        443     449     cancer  DiseaseClass    D009369
10192393        483     496     colon cancers   DiseaseClass    D003110
10192393        539     565     adenomatous polyposis coli      SpecificDisease D011125
10192393        567     570     APC     SpecificDisease D011125
10192393        670     698     colon and some other cancers    CompositeMention        D003110|D009369
10192393        855     867     skin tumours    DiseaseClass    D012878
10192393        879     893     pilomatricomas  SpecificDisease D018296
10192393        1021    1035    pilomatricomas  SpecificDisease D018296
10192393        1210    1221    skin tumour     DiseaseClass    D012878
10192393        1262    1268    tumour  Modifier        D009369
10192393        1312    1326    pilomatricomas  SpecificDisease D018296
10192393        1385    1392    tumours DiseaseClass    D009369
10192393        1615    1622    tumours DiseaseClass    D009369

Error

     77         prev_line_type = curr_line_type
     78     except Exception as e:
---> 79         raise Exception('ERROR occured when parsing line'
     80                         f' #{line_number}. Exception {e}')
     82 if self.__document_being_read is not None:
     83     self.corpus.append(self.__document_being_read)

Exception: ERROR occured when parsing line #8. Exception Unexpected content received on line #8, the line/data may have been corrupted. Content: '10192393	670	698	colon and some other cancers	CompositeMention	D003110|D009369
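
Possible workaround

One option might be to split the concept ID field on "|" so that a composite mention carries a list of MeSH IDs instead of a single one. The following is only a minimal sketch of that idea, assuming annotation lines have six tab-separated fields as in the error message above. The `Annotation` dataclass and `parse_annotation_line` function are hypothetical names for illustration and are not part of PubTatorCorpusReader.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Annotation:
    """Hypothetical container for one PubTator annotation line."""
    pmid: str
    start: int
    end: int
    text: str
    mention_type: str
    concept_ids: List[str]  # one entry per MeSH ID; several for composite mentions


def parse_annotation_line(line: str) -> Annotation:
    """Parse one tab-separated annotation line, splitting composite IDs on '|'."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(
            f"Expected 6 tab-separated fields, got {len(fields)}: {line!r}"
        )
    pmid, start, end, text, mention_type, ids = fields
    return Annotation(pmid, int(start), int(end), text, mention_type, ids.split("|"))


# The line that triggers the error above
line = "10192393\t670\t698\tcolon and some other cancers\tCompositeMention\tD003110|D009369"
ann = parse_annotation_line(line)
print(ann.concept_ids)  # ['D003110', 'D009369']
```

Single-ID lines go through the same path and simply yield a one-element list, so downstream code would not need a special case for CompositeMention.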