WIP: inspire-matcher #1

Open: ksachs wants to merge 3 commits into master

ksachs commented Sep 7, 2017

  • slightly re-organize code
  • another variant to confirm authors
  • bug-fix confirm titles (scores < 0.5 possible)
  • confirm titles: exclude frequent words

Signed-off-by: Kirsten Sachs [email protected]
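
For readers following along, here is a minimal sketch of what title confirmation with frequent words excluded could look like. It is not the code from this PR: the word list `FREQUENT_WORDS`, the helpers `title_tokens` and `title_score`, and the use of Jaccard overlap as the score are all assumptions made for illustration.

```python
import re

# Hypothetical list of frequent words; the PR does not say which words it excludes.
FREQUENT_WORDS = {"a", "an", "and", "at", "for", "in", "of", "on", "the", "to", "with"}


def title_tokens(title):
    """Lower-case the title and keep alphanumeric words that are not frequent."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return {word for word in words if word not in FREQUENT_WORDS}


def title_score(title_a, title_b):
    """Jaccard overlap of the two token sets, between 0.0 and 1.0."""
    tokens_a = title_tokens(title_a)
    tokens_b = title_tokens(title_b)
    if not tokens_a or not tokens_b:
        # Nothing left to compare once frequent words are removed.
        return 0.0
    return len(tokens_a & tokens_b) / float(len(tokens_a | tokens_b))
```

With this scoring, two titles that differ only in articles and prepositions get the same score as identical titles, which is the effect that excluding frequent words is meant to have.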

ksachs commented Sep 7, 2017

Just to let you know where I'm heading with this.

How can I print a demo record if I have the recid? I would like to print the true match in case of false negatives.

jacquerie commented

> How can I print a demo record if I have the recid?

If I understood the question correctly,

>>> from inspirehep.utils.record_getter import get_db_record
>>> record = get_db_record('lit', recid)
>>> print(record)

should be what you want.

ksachs commented Sep 13, 2017

It's not working the way I expected.
To test it, I took a record that was matched and tried to fetch the same record both from the match result and directly from the demo database.
Starting at `if len(matched_exact_records) == 1:`, I have

                if len(matched_exact_records) == 1:
                    matched_recid = matched_exact_records[0].record.get('control_number')
                    if is_good_match(doi_match_map, recid_match_map, dois, control_number, matched_recid):
                        true_positives += 1
                        print '++ Got a good match! with recid: ', matched_recid
                        print '++ for record ', filename
                        print json.dumps(matched_exact_records[0].record)
                        true_record = get_db_record('lit', matched_recid)
                        print true_record

and get

++ Got a good match! with recid:  41205
++ for record  test_files/41314.xml
{"preprint_date": "1998", "_collections": ["Literature"], ....."legacy_creation_date": "1998-04-22", "texkeys": [":1900cic"], 
"self_recid": 41205, "facet_inspire_doc_type": ["proceedings"], "earliest_date": "1998-04-22"}
--------------------------------------------------------------------------------
ERROR in record_getter [/virtualenv/src/inspirehep/inspirehep/utils/record_getter.py:59]:
Can't load recid ('lit', 41205)

So I can't get the correct information from PersistentIdentifier.
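
Until the lookup problem is understood, one way to keep the evaluation script running is to wrap the lookup and fall back to the copy of the record that the match already returned. This is only a sketch: `print_true_record` is a hypothetical helper, the getter is passed in so that `get_db_record` can be swapped for anything else, and it catches a plain `Exception` because the exact exception class raised by `record_getter` is not shown in this thread.

```python
def print_true_record(recid, getter, fallback_record=None):
    """Print the record for recid, falling back to an already loaded copy.

    getter is any callable taking (pid_type, recid), for example get_db_record.
    """
    try:
        record = getter('lit', recid)
    except Exception as error:  # assumption: the real exception class is unknown here
        print('Could not load recid %s from the database: %s' % (recid, error))
        record = fallback_record
    if record is not None:
        print(record)
    return record
```

In the loop above this would be called as `print_true_record(matched_recid, get_db_record, matched_exact_records[0].record)`. It does not explain why the PID lookup fails, it only avoids losing the output.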

* more work in progress to understand how the confirmation works
* title confirmation on alpha-numeric words only
* AuthorComparator is too picky,
  most likely not using variants, looking for exact match

Signed-off-by: Kirsten Sachs <[email protected]>
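
To illustrate the point about name variants, here is a rough sketch of a more lenient author comparison that accepts a match on last name plus first initial. It is not how `AuthorComparator` works. The helpers `name_key` and `authors_match` are hypothetical, and the sketch assumes names are written as "Last, First".

```python
def name_key(full_name):
    """Reduce 'Last, First Middle' to (last name, first initial), lower-cased."""
    last, _, first = full_name.partition(',')
    first = first.strip()
    return last.strip().lower(), first[:1].lower()


def authors_match(name_a, name_b):
    """Compare two author names by last name and first initial."""
    last_a, initial_a = name_key(name_a)
    last_b, initial_b = name_key(name_b)
    if last_a != last_b:
        return False
    # A missing first name is treated as compatible with any first name.
    return not initial_a or not initial_b or initial_a == initial_b
```

With this rule "Sachs, Kirsten" and "Sachs, K." count as the same author, while an exact string comparison would reject them.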
* proposal to check on pages and years

Signed-off-by: Kirsten Sachs <[email protected]>
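
As a rough idea of what a check on pages and years could look like, the sketch below rejects a candidate only when both records carry a value and the values disagree. The helper names and the flat `year` and `page_start` keys are assumptions for illustration, not the layout used in this PR.

```python
def same_or_missing(value_a, value_b):
    """True when the values agree or when either side is missing."""
    if value_a is None or value_b is None:
        return True
    return str(value_a).strip() == str(value_b).strip()


def confirm_pages_and_year(candidate, matched):
    """Confirm a match unless the year or the first page clearly disagree."""
    return (same_or_missing(candidate.get('year'), matched.get('year')) and
            same_or_missing(candidate.get('page_start'), matched.get('page_start')))
```

Treating a missing value as compatible keeps records with incomplete metadata from being rejected outright. Whether that is the right trade-off depends on how many false positives it lets through.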