Update Schema.org exports of all datasets so they appear in Google Dataset Search #222
Comments
The schema.org export only gets updated with a reExport, but the json-ld in the page is only cached as an
Ah okay. When you say that "the page info is version specific whereas the export is only cached for the latest version," this makes me think that for each of a dataset's published versions, the page source code should contain schema.org json-ld metadata. Should that be the case?

What I'm seeing is that only the latest published version has schema.org json-ld metadata in its page source code. So for https://doi.org/10.7910/DVN/FZOVRC, which has two published versions, the source code on the page for version 1 has an empty
For version 2, there's the Schema.org metadata and it refers to version 2.
I see the same thing in a couple of other Dataverse repositories I've been able to check.

It also sounds like a reExport wouldn't update the metadata in the page. And since that's the metadata that I think Google Dataset Search is using to index datasets in Dataverse repositories, a reExport all wouldn't result in more datasets being discoverable through Google Dataset Search.

Does that all make sense? And should I open an issue in the Dataverse GitHub repo about figuring out how to update the json-ld on dataset pages?
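The per-version check described above can be done by script rather than by reading page source by hand. The following is a minimal sketch, assuming the dataset page embeds its JSON-LD in a `<script type="application/ld+json">` element, which is the usual convention for Schema.org metadata. The class and function names are hypothetical, and the HTML strings in the usage note are made-up stand-ins, not real Dataverse page source.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # Start buffering when a JSON-LD script element opens.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Store the buffered text (possibly empty) when the element closes.
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._buffer).strip())


def extract_jsonld(page_source):
    """Return parsed JSON-LD objects; empty script elements are skipped."""
    parser = JsonLdExtractor()
    parser.feed(page_source)
    parser.close()
    return [json.loads(block) for block in parser.blocks if block]
```

With this helper, a page whose JSON-LD element is empty yields an empty list, while a page with populated metadata yields the parsed objects, for example `extract_jsonld('<script type="application/ld+json">{"@type": "Dataset"}</script>')`.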
I was referring to the underlying code. I see now that https://github.com/IQSS/dataverse/blob/4903e9f0277105ea6a8c59a2f962dac8bcf715f2/src/main/java/edu/harvard/iq/dataverse/DatasetPage.java#L5535 only displays the json-ld for the latest version (even though the underlying code could do it for earlier versions).

As for the latest version not being up to date, I don't know why that is. If it is because the generated or osgi-cache dirs weren't cleared, we should see which one. The release notes already say to delete things in generated, so if that wasn't done, it's not an issue with the code or the release notes. If it is the osgi-cache dir, we should probably open an issue to add that to the release notes. If the reason still isn't clear, we should perhaps treat either this issue or a new one in the dataverse repo as a spike to investigate.

reExportAll would still be a good thing to do, in order to get the schema.org export files up to date.
For datasets whose latest versions were published after Dataverse v5.13 was applied to Harvard Dataverse, those datasets' Schema.org exports have been updated with the creator @type updates made in the pull request at #9089. These changes were made largely to improve the odds that Google Dataset Search would index the datasets.
I think v5.13 was applied to Harvard's repository on Feb 15, 2023, so datasets with versions published after that date have the updated Schema.org metadata exports.
For example, in the Schema.org export of the dataset at https://doi.org/10.7910/DVN/ZJ8MC0 published on Feb. 15, 2023, we see the "Creator" metadata and its @type property saying that the creator is a person.
Datasets with versions published before v5.13 was applied have Schema.org exports that don't include the creator @type updates.
For example, in the Schema.org export of the dataset at https://doi.org/10.7910/DVN/VOPK0E published on Feb. 14, 2023, we see the "Creator" metadata doesn't have an @type property saying whether the creator is a person or an organization. The same is true if we look at the JSON-LD in the page source.
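To make the difference between the two exports concrete, here is a rough sketch of how one might check an export for creators that lack @type. It assumes the installation exposes the standard dataset export endpoint with the `schema.org` exporter. The function names are hypothetical, and the sample records in the usage note are invented for illustration rather than copied from either dataset.

```python
from urllib.parse import urlencode


def schema_org_export_url(base_url, persistent_id):
    """Build the export URL for one dataset (assumes the standard export API)."""
    query = urlencode({"exporter": "schema.org", "persistentId": persistent_id})
    return base_url.rstrip("/") + "/api/datasets/export?" + query


def creators_missing_type(schema_org):
    """Return the names of creators that have no @type property."""
    creators = schema_org.get("creator", [])
    if isinstance(creators, dict):
        # A single creator may be given as an object rather than a list.
        creators = [creators]
    missing = []
    for creator in creators:
        if isinstance(creator, dict):
            if "@type" not in creator:
                missing.append(creator.get("name", "<unnamed>"))
        else:
            # A bare string carries no @type information at all.
            missing.append(str(creator))
    return missing
```

For a hypothetical post-v5.13 style record such as `{"creator": [{"@type": "Person", "name": "Doe, Jane"}]}` the check returns an empty list, while a record whose creator is only `{"name": "Doe, Jane"}` is reported by name.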
In a conversation in an unrelated pull request, @qqmyers wrote that installations will need to do a reExportAll() so that all datasets include the Schema.org export updates.
Definition of done:
Do a reExportAll() so that the Schema.org metadata exports of all datasets in the Harvard Dataverse include the updates made in the v5.13 pull request at #9089.
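For reference, the re-export is normally triggered through the admin API on the application server. The sketch below assumes the admin endpoint is reachable at `http://localhost:8080`, the default in the Dataverse guides, and that it is called from a host allowed to use the admin API; the function names are hypothetical. As noted in the comments, this refreshes the cached export files and may not change the JSON-LD shown on dataset pages.

```python
import urllib.request


def reexport_all_url(admin_base="http://localhost:8080"):
    """Build the admin API URL that asks for a re-export of all datasets."""
    return admin_base.rstrip("/") + "/api/admin/metadata/reExportAll"


def trigger_reexport_all(admin_base="http://localhost:8080", timeout=30):
    """Send the request and return the HTTP status and response body.

    Assumption: the admin API is unblocked for the calling host.
    """
    url = reexport_all_url(admin_base)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")
```

Calling `trigger_reexport_all()` on the application server would start the job; spot-checking a few exports afterwards is a reasonable way to confirm that the creator @type updates are present.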