feat: allow harvesting citations of OpenAlex reference papers #1306

Merged
merged 1 commit into from
Oct 2, 2024

ewan-escience
Collaborator

Scrape citations from OpenAlex reference papers

Changes proposed in this pull request

  • Drop the external_id column of the mention table and add the openalex_id column (note: existing data will need to be migrated)
  • Let the citations scraper also scrape citations of reference papers that have an OpenAlex ID
  • Refactor the DOI scrapers (including new dedicated Doi and OpenalexId classes)
  • Small changes to the documentation
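
As an illustration of the second bullet, the sketch below shows how citations of a reference paper could be looked up by OpenAlex ID. It is not the scraper code from this PR (the scrapers are Java); it assumes the public OpenAlex works endpoint with its `cites` filter and `per-page` parameter, and the names `shortOpenalexId` and `citationsUrl` are hypothetical.

```typescript
// Hypothetical helpers for illustration only, not part of the RSD codebase.
const OPENALEX_PREFIX = "https://openalex.org/";

// Turn "https://openalex.org/W3159002838" into "W3159002838".
function shortOpenalexId(openalexUrl: string): string {
  if (!openalexUrl.startsWith(OPENALEX_PREFIX)) {
    throw new Error(`Not an OpenAlex URL: ${openalexUrl}`);
  }
  return openalexUrl.slice(OPENALEX_PREFIX.length);
}

// Build a request URL that lists the works citing the given reference paper.
// Assumption: the OpenAlex API `cites` filter and `per-page` parameter.
function citationsUrl(openalexUrl: string, perPage: number = 200): URL {
  const url = new URL("https://api.openalex.org/works");
  url.searchParams.set("filter", `cites:${shortOpenalexId(openalexUrl)}`);
  url.searchParams.set("per-page", String(perPage));
  return url;
}

// Usage: print the request URL for the OpenAlex ID used in the test steps.
console.log(citationsUrl("https://openalex.org/W3159002838").toString());
```

A real scraper would also need to page through the results and handle rate limits; both are left out of this sketch.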

How to test

  • docker compose down --volumes && docker compose build --parallel && docker compose up --scale data-generation=0
  • Sign in, create a project page
  • Add an output with OpenAlex ID https://openalex.org/W3159002838
  • Wait for the citation scraper to run, or run it manually with docker compose exec scrapers java -cp /usr/myjava/scrapers.jar nl.esciencecenter.rsd.scraper.doi.MainCitations
  • This shouldn't produce errors (check the Docker logs and the error logs as admin) and should result in 39 citations
  • Create a software page, add reference paper with DOI 10.1016/j.future.2018.08.004
  • Same as above, should yield 62 citations
  • docker compose down --volumes && docker compose up --scale data-generation=1
  • Wait until all scrapers have run for a while; no unexpected errors should have been generated. You can run the relevant scrapers with the following commands:
  • docker compose exec scrapers java -cp /usr/myjava/scrapers.jar nl.esciencecenter.rsd.scraper.doi.MainMention
  • docker compose exec scrapers java -cp /usr/myjava/scrapers.jar nl.esciencecenter.rsd.scraper.doi.MainCitations
  • docker compose exec scrapers java -cp /usr/myjava/scrapers.jar nl.esciencecenter.rsd.scraper.doi.MainReleases
  • Test out your own (edge) cases?
  • Check that the search functionality of the mentions overview for admins still works
  • Check that other functionality regarding mentions still works
  • Check the changes in the documentation

To do

  • Allow for adding manual reference papers to software (to allow for reference papers with an OpenAlex ID)
  • Better OpenAlex support, e.g. allow searching for an OpenAlex ID the same way as searching for a DOI and add a scraper for mentions with an OpenAlex ID

Migration:

Before dropping the external_id column and after adding the openalex_id column, the following (untested) query should be executed:

UPDATE mention SET openalex_id = external_id WHERE external_id ~ '^https://openalex\.org/[WwAaSsIiCcPpFf]\d{3,13}$';

The following was tested in production, yielding a result of 5629:

SELECT COUNT(*) FROM mention WHERE external_id ~ '^https://openalex\.org/[WwAaSsIiCcPpFf]\d{3,13}$';

The following gave the same result of 5629:

SELECT COUNT(*) FROM mention WHERE external_id IS NOT NULL;

To check for unique entries, run

SELECT COUNT(DISTINCT(LOWER(external_id))) FROM mention WHERE external_id ~ '^https://openalex\.org/[WwAaSsIiCcPpFf]\d{3,13}$';

which again yielded 5629.
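
The reasoning behind the three counts can be sketched in TypeScript on an in-memory list. If the number of matching rows equals the number of non-null rows, every stored external_id is an OpenAlex ID, so the UPDATE loses nothing; if the case-insensitive distinct count equals the matching count, there are no duplicates. The `MentionRow` type and `countForMigration` function are hypothetical stand-ins for the mention table, not code from this PR.

```typescript
// Same pattern as in the SQL queries above.
const OPENALEX_ID_PATTERN = /^https:\/\/openalex\.org\/[WwAaSsIiCcPpFf]\d{3,13}$/;

// Hypothetical in-memory stand-in for a row of the mention table.
interface MentionRow {
  external_id: string | null;
}

interface MigrationCounts {
  matching: number; // rows the UPDATE would touch
  nonNull: number; // rows with any external_id at all
  distinctMatching: number; // matching IDs, compared case-insensitively
}

function countForMigration(rows: MentionRow[]): MigrationCounts {
  const matchingIds = rows
    .map((row) => row.external_id)
    .filter((id): id is string => id !== null && OPENALEX_ID_PATTERN.test(id));
  const nonNull = rows.filter((row) => row.external_id !== null).length;
  const distinct = new Set(matchingIds.map((id) => id.toLowerCase()));
  return {
    matching: matchingIds.length,
    nonNull,
    distinctMatching: distinct.size,
  };
}
```

When all three numbers agree, as they did in production, the migration is a plain copy from external_id to openalex_id.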

If you do have duplicate entries, you can get them with:

SELECT LOWER(external_id), COUNT(LOWER(external_id)) FROM mention WHERE external_id ~ '^https://openalex\.org/[WwAaSsIiCcPpFf]\d{3,13}$' GROUP BY LOWER(external_id) HAVING COUNT(LOWER(external_id)) > 1;
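
For readers less familiar with GROUP BY and HAVING, the same duplicate check can be sketched in TypeScript. The function name `findDuplicateIds` is hypothetical, and the input is assumed to be the list of external_id values that already match the OpenAlex pattern.

```typescript
// Group IDs case-insensitively and keep only the groups with more than one
// entry, mirroring GROUP BY LOWER(external_id) HAVING COUNT(...) > 1.
function findDuplicateIds(ids: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const id of ids) {
    const key = id.toLowerCase();
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  const duplicates = new Map<string, number>();
  counts.forEach((count, key) => {
    if (count > 1) {
      duplicates.set(key, count);
    }
  });
  return duplicates;
}
```

An empty result corresponds to the SQL query returning no rows.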

Closes #1291

PR Checklist:

  • Increase version numbers in docker-compose.yml
  • Link to a GitHub issue
  • Update documentation
  • Tests

sonarcloud bot commented Sep 26, 2024

Quality Gate failed for 'scrapers'

Failed conditions
C Reliability Rating on New Code (required ≥ A)

message: 'Maximum length is 500'
pattern: {
value: /^https:\/\/openalex\.org\/[WwAaSsIiCcPpFf]\d{3,13}$/,
message: 'e.g. https://openalex.org/W3160330321'
Contributor

@dmijatovic dmijatovic Sep 26, 2024

Can we make the error message a bit more explanatory?
"Incorrect OpenAlex id. Correct input: https://openalex.org/W3160330321"

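One way the suggested message could be surfaced is sketched below. It assumes a plain validator function rather than the form configuration shown in the diff; `validateOpenalexId` is a hypothetical name, and the pattern and the length limit are taken from the snippet under review.

```typescript
const OPENALEX_URL_PATTERN = /^https:\/\/openalex\.org\/[WwAaSsIiCcPpFf]\d{3,13}$/;

// Returns null when the input is valid, otherwise a message for the user.
function validateOpenalexId(input: string): string | null {
  if (input.length > 500) {
    return "Maximum length is 500";
  }
  if (!OPENALEX_URL_PATTERN.test(input)) {
    return "Incorrect OpenAlex id. Correct input: https://openalex.org/W3160330321";
  }
  return null;
}
```

A bare ID such as W3160330321 is rejected by this check, which is why the message shows the complete URL form.
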
help: 'An ID used by e.g. OpenAlex',
openalex_id: {
label: 'OpenAlex ID',
help: 'The OpenAlex ID',
Contributor

Can we make this help message something like this?
"Provide complete url of OpenAlex ID. For example https://openalex.org/W3160330321"

@jmaassen
Member

jmaassen commented Oct 2, 2024

Works as expected.

One question: given that we can get most of the information for https://openalex.org/W3159002838 after scraping, couldn't we also use this identifier to import the mention in the first place? The "Search for DOI or title" box could add the OpenAlexID?

@ewan-escience
Collaborator Author

One question: given that we can get most of the information for https://openalex.org/W3159002838 after scraping, couldn't we also use this identifier to import the mention in the first place? The "Search for DOI or title" box could add the OpenAlexID?

Yes, that's what I meant by the second TODO in the PR description. 🙂 I will open issues for the TODOs.

@ewan-escience ewan-escience merged commit ee47d07 into main Oct 2, 2024
7 of 8 checks passed
@ewan-escience ewan-escience deleted the 1291-scrape-openalex-citations branch October 14, 2024 12:01
Successfully merging this pull request may close these issues.

Scrape citations of reference papers with an OpenAlex ID
3 participants