CPA not consistently lemmatized #571

Open
AngledLuffa opened this issue Jan 6, 2025 · 4 comments


@AngledLuffa
Contributor

Examples from the dev set (a couple others in train):

# sent_id = email-enronsent23_08-0002
# text = jill allen finishes her cpa today and she and her friends are going to party.
5       cpa     CPA     NOUN    NN      Number=Sing     3       obj     3:obj   _
# sent_id = email-enronsent00_02-0016
# text = As such, we have a CPA, Larry Lewis, working with us to audit and set up transition files.
7       CPA     cpa     NOUN    NN      Number=Sing     5       obj     5:obj|12:nsubj:xsubj    SpaceAfter=No

There's also the Iraq CPA, but that seems orthogonal to the accounting job title.
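A minimal sketch of how mismatches like the cpa/CPA pair could be surfaced, assuming plain tab-separated CoNLL-U input. The helper names `parse_conllu_tokens` and `case_insensitive_lemma_variants` are hypothetical and not part of any existing script in this repo; the idea is simply to group tokens by lowercased form plus XPOS and report groups with more than one lemma.

```python
from collections import defaultdict

def parse_conllu_tokens(text):
    """Yield (sent_id, form, lemma, upos, xpos) for each regular token line."""
    sent_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            if line.startswith("# sent_id = "):
                sent_id = line[len("# sent_id = "):]
            continue
        cols = line.split("\t")
        # Skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        yield sent_id, cols[1], cols[2], cols[3], cols[4]

def case_insensitive_lemma_variants(text):
    """Map (lowercased form, xpos) -> {lemma: [sent_ids]} when several lemmas occur."""
    seen = defaultdict(lambda: defaultdict(list))
    for sent_id, form, lemma, _upos, xpos in parse_conllu_tokens(text):
        seen[(form.lower(), xpos)][lemma].append(sent_id)
    return {key: dict(lemmas) for key, lemmas in seen.items() if len(lemmas) > 1}

# The two token lines quoted above, rebuilt with explicit tabs.
sample = "\n".join([
    "# sent_id = email-enronsent23_08-0002",
    "\t".join(["5", "cpa", "CPA", "NOUN", "NN", "Number=Sing",
               "3", "obj", "3:obj", "_"]),
    "# sent_id = email-enronsent00_02-0016",
    "\t".join(["7", "CPA", "cpa", "NOUN", "NN", "Number=Sing",
               "5", "obj", "5:obj|12:nsubj:xsubj", "SpaceAfter=No"]),
])

variants = case_insensitive_lemma_variants(sample)
for (form, xpos), lemmas in variants.items():
    print(form, xpos, lemmas)
```

Lowercasing the form is a deliberate choice here: it catches the case where the form and lemma casing are swapped between sentences, which an exact-form comparison would miss.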

nschneid added a commit that referenced this issue Jan 6, 2025
@nschneid
Contributor

nschneid commented Jan 6, 2025

Thanks! Our English validation script flags a few other potential lemma inconsistencies—do any of these look like they should be fixed? The abbreviations are OK I guess.

! rare lemma America for A/NNP in weblog-blogspot.com_rigorousintuition_20050518101500_ENG_20050518_101500-0070 (majority: A)
! rare lemma central for Central/NNP in reviews-334388-0004 (majority: Central)
! rare lemma McDonald for Mcdonald/NNP in answers-20111019100027AAdxgXV_ans-0005 (majority: Mcdonald)
! rare lemma President for President/NN in newsgroup-groups.google.com_alt.animals_0084bdc731bfc8d8_ENG_20040905_212000-0071, newsgroup-groups.google.com_alt.animals_0084bdc731bfc8d8_ENG_20040905_212000-0166 (majority: president)
! rare lemma Securities for Securities/NNPS in email-enronsent36_01-0030 (majority: Security)
! rare lemma south for South/JJ in reviews-369210-0007 (majority: South)
! rare lemma West for West/JJ in reviews-342807-0001, reviews-342807-0002, reviews-342807-0004 (majority: west)
! rare lemma b for b/NN in email-enronsent44_01-0080, reviews-010433-0003 (majority: benefit)
! rare lemma building for b/NN in email-enronsent35_01-0010 (majority: benefit)
! rare lemma care for c/NN in newsgroup-groups.google.com_eHolistic_2dd76f31ceb6bfe8_ENG_20050513_224200-0056 (majority: c)
! rare lemma class for c/NN in reviews-225632-0001 (majority: c)
! rare lemma Deli for deli/NNP in answers-20111108104228AA6z9uZ_ans-0002 (majority: deli)
! rare lemma glass for glasses/NNS in reviews-363685-0026 (majority: glasses)
! rare lemma Inn for inn/NNP in reviews-159371-0004 (majority: inn)
! rare lemma respects for respects/NNS in answers-20111107144339AA0qw5S_ans-0005 (majority: respect)
! rare lemma science for science/NNP in answers-20111108094740AA1lbom_ans-0003 (majority: Science)
! rare lemma Southwest for southwest/NNP in answers-20111107193044AAvUYBv_ans-0004 (majority: southwest)
! rare lemma versus for vs./IN in email-enronsent27_01-0013, weblog-blogspot.com_rigorousintuition_20060511134300_ENG_20060511_134300-0258 (majority: vs.)
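For readers unfamiliar with this kind of check, here is a rough sketch of how a "rare lemma" report could be produced. It is an assumption about the general approach, not the actual validation script: the function `rare_lemmas`, the exact grouping key (form plus XPOS), and the toy sentence IDs are all hypothetical.

```python
from collections import Counter, defaultdict

def rare_lemmas(tokens):
    """tokens: iterable of (form, xpos, lemma, sent_id).

    For every (form, xpos) pair, find the most frequent lemma and report
    any other lemma that occurs strictly less often than it.
    """
    counts = defaultdict(Counter)
    where = defaultdict(list)
    for form, xpos, lemma, sent_id in tokens:
        counts[(form, xpos)][lemma] += 1
        where[(form, xpos, lemma)].append(sent_id)

    reports = []
    for (form, xpos), lemma_counts in sorted(counts.items()):
        majority, majority_n = lemma_counts.most_common(1)[0]
        for lemma, n in sorted(lemma_counts.items()):
            if lemma != majority and n < majority_n:
                ids = ", ".join(where[(form, xpos, lemma)])
                reports.append(
                    f"! rare lemma {lemma} for {form}/{xpos} in {ids} "
                    f"(majority: {majority})"
                )
    return reports

# Toy data with made-up sentence IDs; the counts are illustrative only.
toy_tokens = [
    ("vs.", "IN", "vs.", "toy-0001"),
    ("vs.", "IN", "vs.", "toy-0002"),
    ("vs.", "IN", "vs.", "toy-0003"),
    ("vs.", "IN", "versus", "toy-0004"),
    ("the", "DT", "the", "toy-0001"),
]

reports = rare_lemmas(toy_tokens)
for line in reports:
    print(line)
```

Exact ties between lemmas are not reported in this sketch, so a 1-vs-1 split such as the cpa/CPA pair would need a separate check.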

@AngledLuffa
Contributor Author

Actually yes, the President lemma looks suspect in a couple of places:

# sent_id = newsgroup-groups.google.com_alt.animals_0084bdc731bfc8d8_ENG_20040905_212000-0071
14      had     have    AUX     VBD     Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin   15      aux     15:aux  _
15      called  call    VERB    VBN     Tense=Past|VerbForm=Part        12      acl:relcl       12:acl:relcl    Cxn=rc-wh-nsubj
16      the     the     DET     DT      Definite=Def|PronType=Art       17      det     17:det  _
17      President       President       NOUN    NN      Number=Sing     15      obj     15:obj|20:nsubj:xsubj   _
# sent_id = newsgroup-groups.google.com_alt.animals_0084bdc731bfc8d8_ENG_20040905_212000-0166
15      whom    whom    PRON    WP      PronType=Rel    18      obj     13:ref  _
16      the     the     DET     DT      Definite=Def|PronType=Art       17      det     17:det  _
17      President       President       NOUN    NN      Number=Sing     18      nsubj   18:nsubj        _
18      called  call    VERB    VBD     Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin   13      acl:relcl       13:acl:relcl    Cxn=rc-wh-obj
19      "       "       PUNCT   ``      _       21      punct   21:punct        SpaceAfter=No
20      Kenny   Kenny   PROPN   NNP     Number=Sing     21      compound        21:compound     _
21      Boy     Boy     PROPN   NNP     Number=Sing     18      xcomp   18:xcomp        SpaceAfter=No
22      "       "       PUNCT   ''      _       21      punct   21:punct        SpaceAfter=No

Also, vs. is inconsistently lemmatized as either vs. or versus.

@AngledLuffa
Contributor Author

Might I suggest having the script print out line numbers and filenames as well?
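A small sketch of what that could look like, assuming the script reads each CoNLL-U file line by line. The names `iter_tokens_with_location` and `format_location`, and the filename `dev.conllu`, are hypothetical placeholders rather than anything from the real script.

```python
import io

def iter_tokens_with_location(lines, filename):
    """Yield (filename, line_no, sent_id, form, lemma, xpos) per token line."""
    sent_id = None
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):]
        elif line and not line.startswith("#"):
            cols = line.split("\t")
            # Only regular tokens: 10 columns and a plain integer ID.
            if len(cols) == 10 and cols[0].isdigit():
                yield filename, line_no, sent_id, cols[1], cols[2], cols[4]

def format_location(filename, line_no, sent_id):
    """Render a location in the common file:line style, plus the sent_id."""
    return f"{filename}:{line_no} ({sent_id})"

# One sentence from the dev-set example above; the filename is assumed.
text = (
    "# sent_id = email-enronsent23_08-0002\n"
    "# text = jill allen finishes her cpa today and she and her friends "
    "are going to party.\n"
    + "\t".join(["5", "cpa", "CPA", "NOUN", "NN", "Number=Sing",
                 "3", "obj", "3:obj", "_"])
    + "\n\n"
)

locations = [
    format_location(fname, line_no, sent_id)
    for fname, line_no, sent_id, *_ in
    iter_tokens_with_location(io.StringIO(text), "dev.conllu")
]
print(locations)
```

The `file:line` form is handy because many editors and terminals can jump straight to it.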

@nschneid
Contributor

nschneid commented Jan 6, 2025
