feat: allow harvesting citations of OpenAlex reference papers #1306
Quality Gate passed for 'rsd-database'.
Quality Gate failed for 'scrapers'. See analysis details on SonarCloud.
Quality Gate passed for 'rsd-frontend'.
message: 'Maximum length is 500'
pattern: {
  value: /^https:\/\/openalex\.org\/[WwAaSsIiCcPpFf]\d{3,13}$/,
  message: 'e.g. https://openalex.org/W3160330321'
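The pattern in the snippet above can be exercised in isolation. The following is a minimal sketch, not the project's actual validation code: the regular expression and the message are copied from the diff, while the helper name `validateOpenalexId` and its return shape are assumptions made for illustration.

```typescript
// Pattern and message copied from the diff; everything else is hypothetical.
const OPENALEX_ID_PATTERN = /^https:\/\/openalex\.org\/[WwAaSsIiCcPpFf]\d{3,13}$/
const OPENALEX_ID_MESSAGE = 'e.g. https://openalex.org/W3160330321'

// Returns null when the value matches the pattern, otherwise the message to show.
function validateOpenalexId(value: string): string | null {
  return OPENALEX_ID_PATTERN.test(value) ? null : OPENALEX_ID_MESSAGE
}
```

Under this sketch, a full URL such as `https://openalex.org/W3160330321` passes, while a bare ID such as `W3160330321` or a URL with fewer than three digits after the prefix letter is rejected.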
Can we make the error message a bit more explanatory?
"Incorrect OpenAlex id. Correct input: https://openalex.org/W3160330321"
help: 'An ID used by e.g. OpenAlex',
openalex_id: {
  label: 'OpenAlex ID',
  help: 'The OpenAlex ID',
Can we make this help message something like this?
"Provide complete url of OpenAlex ID. For example https://openalex.org/W3160330321"
Works as expected. One question: given that we can get most of the information for https://openalex.org/W3159002838 after scraping, couldn't we also use this identifier to import the mention in the first place? The "Search for DOI or title" box could also accept the OpenAlex ID.
Yes, that's what I meant with the second TODO in the PR description. 🙂 I will open issues for the TODOs.
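As a rough illustration of the idea discussed above, a search box could first decide what kind of input it received before choosing where to look it up. This is only a sketch under stated assumptions: `classifySearchInput` and `SearchInputKind` are hypothetical names, the DOI check is a simplified shape test rather than the project's real validation, and only the OpenAlex pattern is taken from this pull request.

```typescript
type SearchInputKind = 'doi' | 'openalex' | 'title'

// Assumption: a simplified DOI shape check (prefix "10.", registrant code, suffix).
const DOI_PATTERN = /^10\.\d{4,9}\/\S+$/
// Pattern taken from the frontend diff in this pull request.
const OPENALEX_ID_PATTERN = /^https:\/\/openalex\.org\/[WwAaSsIiCcPpFf]\d{3,13}$/

// Hypothetical helper: anything that is neither a DOI nor an OpenAlex ID
// is treated as a title search.
function classifySearchInput(input: string): SearchInputKind {
  const trimmed = input.trim()
  if (DOI_PATTERN.test(trimmed)) return 'doi'
  if (OPENALEX_ID_PATTERN.test(trimmed)) return 'openalex'
  return 'title'
}
```

Falling back to a title search keeps the existing behaviour for any input that does not look like an identifier.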
Scrape citations from OpenAlex reference papers
Changes proposed in this pull request
- … external_id column of the mention table and add the openalex_id column (note: existing data will need to be migrated)
- … Doi and OpenalexId classes)
How to test
docker compose down --volumes && docker compose build --parallel && docker compose up --scale data-generation=0
https://openalex.org/W3159002838
docker compose exec scrapers java -cp /usr/myjava/scrapers.jar nl.esciencecenter.rsd.scraper.doi.MainCitations
10.1016/j.future.2018.08.004
docker compose down --volumes && docker compose up --scale data-generation=1
docker compose exec scrapers java -cp /usr/myjava/scrapers.jar nl.esciencecenter.rsd.scraper.doi.MainMention
docker compose exec scrapers java -cp /usr/myjava/scrapers.jar nl.esciencecenter.rsd.scraper.doi.MainCitations
docker compose exec scrapers java -cp /usr/myjava/scrapers.jar nl.esciencecenter.rsd.scraper.doi.MainReleases
To do
Migration:
Before dropping the external_id column and after adding the openalex_id column, the following (untested) query should be executed:
The following was tested in production, yielding a result of 5629:
The following gave the same result of 5629:
To check for unique entries, run
which again yielded 5629.
If you do have duplicate entries, you can get them with:
Closes #1291
PR Checklist:
docker-compose.yml