Create "fill-in" task for OpenAlex #194

Open
peetucket opened this issue Feb 27, 2025 · 0 comments

peetucket commented Feb 27, 2025

After the initial harvesting of publications by ORCID is complete, we want to examine all publications that have a DOI but lack OpenAlex metadata, to see whether OpenAlex metadata is actually available when querying by DOI. The earlier rationale for this was that some publications may be in OpenAlex but not associated with an ORCID.

We had some code for this, which took advantage of the fact that you can query for more than one DOI at a time:

import logging
import time
from itertools import batched  # assumes Python 3.12+; more_itertools.batched is an equivalent

from pyalex import Works, api

# normalize_publication is defined elsewhere in the same module


def publications_from_dois(dois: list):
    """
    Look up works by DOI in batches that fit within OpenAlex request size limits
    """
    for doi_batch in batched(dois, 50):
        # Batch size is 50 to avoid 400 errors from the OpenAlex API when the GET query string is greater than 4096 characters.
        # Based on experimentation, 75 is too high. With at most 50 DOIs per request and per_page=200, the results
        # should normally fit on one page, so we could consider removing pagination in the future.
        # TODO: do we need this to stay within 100,000 requests / day API quota?
        time.sleep(1)
        doi_list = "|".join(doi_batch)
        try:
            for page in Works().filter(doi=doi_list).paginate(per_page=200):
                for pub in page:
                    yield normalize_publication(pub)
        except api.QueryError:
            # The batch query failed, so try the DOIs individually
            for doi in doi_batch:
                try:
                    pubs = Works().filter(doi=doi).get()
                    if len(pubs) > 1:
                        logging.warning(f"Found multiple publications for DOI {doi}")
                    if len(pubs) > 0:
                        yield normalize_publication(pubs[0])
                except api.QueryError as e:
                    logging.error(f"OpenAlex QueryError for {doi}: {e}")
                    continue
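The fixed batch size of 50 is a stand-in for the real constraint, which is the length of the query string. As a rough sketch of an alternative (not what the existing code does), DOIs could be grouped by the length of the joined filter value. The helper name `batch_by_length` and the 3000-character budget are assumptions for illustration; the budget would need to leave room for the rest of the URL and for any URL-encoding of the DOIs.

```python
from typing import Iterable, Iterator


def batch_by_length(
    dois: Iterable[str], max_chars: int = 3000, max_items: int = 50
) -> Iterator[list[str]]:
    """Yield lists of DOIs whose "|"-joined form stays within max_chars.

    A single DOI longer than max_chars is still yielded, in a batch of its own.
    """
    batch: list[str] = []
    length = 0
    for doi in dois:
        # Joining adds one "|" separator before every DOI except the first.
        added = len(doi) + (1 if batch else 0)
        if batch and (length + added > max_chars or len(batch) >= max_items):
            yield batch
            batch, length = [], 0
            added = len(doi)
        batch.append(doi)
        length += added
    if batch:
        yield batch


# Small demonstration with made-up DOIs and a tiny character budget.
demo_batches = list(batch_by_length(["10.1/a", "10.1/bb", "10.1/ccc"], max_chars=14))
```

Each yielded list could then be joined with "|" and passed to the DOI filter in the same way as `doi_batch` above.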

We will want to query the database in a way that doesn't pull all the results back into memory at once for SELECT * FROM publications WHERE sulpub_json IS NULL AND doi IS NOT NULL.
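One way to avoid loading every row is to read the cursor in fixed-size chunks and yield DOIs one at a time. The sketch below uses the standard-library `sqlite3` module only so that it is self-contained; the function name `stream_dois`, the chunk size, and the two-column demo table are assumptions, and it selects only `doi` rather than `*`. With PostgreSQL, a server-side cursor (for example a psycopg2 named cursor) would play the same role.

```python
import sqlite3
from typing import Iterator


def stream_dois(conn: sqlite3.Connection, chunk_size: int = 1000) -> Iterator[str]:
    """Yield DOIs of rows that still lack metadata, chunk_size rows at a time."""
    cursor = conn.execute(
        "SELECT doi FROM publications WHERE sulpub_json IS NULL AND doi IS NOT NULL"
    )
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        for (doi,) in rows:
            yield doi


# Demo against an in-memory table; the schema here is a hypothetical stand-in.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE publications (doi TEXT, sulpub_json TEXT)")
conn.executemany(
    "INSERT INTO publications VALUES (?, ?)",
    [("10.1/a", None), ("10.1/b", "{}"), (None, None), ("10.1/c", None)],
)
missing = sorted(stream_dois(conn, chunk_size=1))
```

Because `stream_dois` is a generator, its output can be passed straight to a batching lookup without building the full list first.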

It might be interesting to keep a count of how often we find metadata this way, and log it at the end, to help evaluate how important this is.
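A minimal sketch of that bookkeeping, assuming a hypothetical `lookup` callable that returns metadata for a DOI or `None` when nothing is found. The function name `fill_in` and the fake lookup data are illustrative only; in the real task the lookup would be the OpenAlex query.

```python
import logging
from collections import Counter
from typing import Callable, Iterable, Optional


def fill_in(dois: Iterable[str], lookup: Callable[[str], Optional[dict]]) -> Counter:
    """Look up each DOI, count checked vs. found, and log the totals at the end."""
    stats: Counter = Counter()
    for doi in dois:
        stats["checked"] += 1
        if lookup(doi) is not None:
            stats["found"] += 1
    logging.info(
        "OpenAlex fill-in: found metadata for %d of %d DOIs",
        stats["found"],
        stats["checked"],
    )
    return stats


# Demo with a fake lookup table standing in for OpenAlex.
fake_index = {"10.1/a": {"title": "A"}, "10.1/c": {"title": "C"}}
stats = fill_in(["10.1/a", "10.1/b", "10.1/c"], fake_index.get)
```

Returning the counter as well as logging it would let the DAG task report the hit rate to whatever consumes the task's result.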

edsu changed the title from "Create a new DAG task to enhance metadata by DOI for Dimensions" to "Create "fill-in" task for OpenAlex" on Mar 3, 2025
lwrubel self-assigned this on Mar 4, 2025