Update Schema.org exports of all datasets so they appear in Google Dataset Search #222

Open
jggautier opened this issue Apr 26, 2023 · 3 comments

Comments

@jggautier
Collaborator

jggautier commented Apr 26, 2023

Datasets whose latest versions were published after Dataverse v5.13 was applied to Harvard Dataverse have Schema.org exports that include the creator @type updates made in the pull request at #9089.

These changes were made largely to improve the odds that Google Dataset Search would index the datasets.

I think v5.13 was applied to Harvard's repository on Feb 15, 2023, so datasets with versions published after that date have the updated Schema.org metadata exports.

For example, in the Schema.org export of the dataset at https://doi.org/10.7910/DVN/ZJ8MC0 published on Feb. 15, 2023, we see the "Creator" metadata and its @type property saying that the creator is a person:

Screenshot 2023-04-26 at 9 36 31 AM

Datasets with versions published before v5.13 was applied have Schema.org exports that don't include the creator @type updates.

For example, in the Schema.org export of the dataset at https://doi.org/10.7910/DVN/VOPK0E published on Feb. 14, 2023, we see that the "Creator" metadata doesn't have an @type property saying whether the creator is a person or an organization:

Screenshot 2023-04-26 at 9 46 31 AM

The same is true if we look at the JSON-LD in the page source.
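The difference described above can be checked programmatically. The following is a minimal Python sketch, not Dataverse code: the helper name `creators_missing_type` and both sample records are hypothetical, and it assumes only that the export is JSON-LD with a `creator` entry holding one or more objects.

```python
import json


def creators_missing_type(schema_org_export: dict) -> list:
    """Return the names of creators that have no @type property."""
    creators = schema_org_export.get("creator", [])
    # A single creator may be serialized as one object instead of a list.
    if isinstance(creators, dict):
        creators = [creators]
    return [c.get("name", "<unnamed>") for c in creators if "@type" not in c]


# Hypothetical sample exports, for illustration only.
updated_export = json.loads(
    '{"@type": "Dataset", "creator": [{"@type": "Person", "name": "Doe, Jane"}]}'
)
older_export = json.loads(
    '{"@type": "Dataset", "creator": [{"name": "Doe, Jane"}]}'
)

print(creators_missing_type(updated_export))  # []
print(creators_missing_type(older_export))    # ['Doe, Jane']
```

An empty list would suggest the export already carries the creator @type updates; any names returned point to creators that still lack them.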

In a conversation in an unrelated pull request, @qqmyers wrote that installations will need to do a reExportAll() so that all datasets include the Schema.org export updates.

Definition of done:
Do a reExportAll() so that the Schema.org metadata exports of all datasets in the Harvard Dataverse include the updates made in v5.13 pull request at #9089
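As a rough illustration of what triggering the re-export could look like, here is a minimal Python sketch. It assumes the admin API endpoint `/api/admin/metadata/reExportAll` described in the Dataverse Admin Guide, and it assumes the admin API is reachable only from the server itself at `http://localhost:8080`. The function names are hypothetical, and the actual call is left commented out so the sketch does not contact any server.

```python
import urllib.request


def build_reexport_request(base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build (but do not send) a GET request for the assumed reExportAll endpoint."""
    url = f"{base_url.rstrip('/')}/api/admin/metadata/reExportAll"
    return urllib.request.Request(url)


def trigger_reexport(base_url: str = "http://localhost:8080") -> str:
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(build_reexport_request(base_url)) as response:
        return response.read().decode("utf-8")


request = build_reexport_request()
print(request.get_method(), request.full_url)
# To actually start the re-export, run this on the Dataverse server:
# print(trigger_reexport())
```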

@qqmyers
Member

qqmyers commented Apr 26, 2023

The schema.org export only gets updated with a reExport, but the json-ld in the page is only cached as an @transient value in the DatasetVersion object (unless I'm missing something - the page info is version specific whereas the export is only cached for the latest version which is one reason why the page doesn't just load the cached export). So I'm not sure why it wouldn't be updated without a re-export. Are @transient values getting cached in ./generated or ./osgi-cache ?

@jggautier
Collaborator Author

jggautier commented Apr 26, 2023

Ah okay. When you say that "the page info is version specific whereas the export is only cached for the latest version," this makes me think that for each of a dataset's published versions, in the page source code there should be schema.org json-ld metadata.

Should that be the case? What I'm seeing is that only the latest published version has schema.org json-ld metadata in its page source code.

So for https://doi.org/10.7910/DVN/FZOVRC that has two published versions, the source code on the page for version 1 has an empty <script> tag:

Screenshot 2023-04-26 at 2 34 09 PM

For version 2, there's the Schema.org metadata and it refers to version 2

Screenshot 2023-04-26 at 2 34 28 PM

I see the same thing in a couple other Dataverse repositories I've been able to check.
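The kind of check described above can be sketched in Python. This is a minimal illustration, not Dataverse code: it assumes the metadata sits in a `<script type="application/ld+json">` tag, and the class name, function name, and both sample pages are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer).strip())
            self._in_jsonld = False


def extract_jsonld(page_source: str) -> list:
    """Return parsed JSON-LD per script tag, or None where the tag is empty."""
    parser = JsonLdExtractor()
    parser.feed(page_source)
    parser.close()
    return [json.loads(block) if block else None for block in parser.blocks]


# Hypothetical page sources mimicking the two cases described above.
version_1_page = '<html><head><script type="application/ld+json"></script></head></html>'
version_2_page = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "Dataset", "version": "2"}'
    "</script></head></html>"
)

print(extract_jsonld(version_1_page))  # [None]
print(extract_jsonld(version_2_page))  # [{'@type': 'Dataset', 'version': '2'}]
```

A `None` entry corresponds to the empty `<script>` tag seen on the version 1 page.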

It also sounds like a reExport wouldn't update the metadata in the page. And since that's the metadata that I think Google Dataset Search is using to index datasets in Dataverse repositories, a reExportAll wouldn't result in more datasets being discoverable through Google Dataset Search.

Does that all make sense?

And should I open an issue in the Dataverse GitHub repo about figuring out how to update the json-ld on dataset pages?

@qqmyers
Member

qqmyers commented Apr 26, 2023

I was referring to the underlying code. I see now that https://github.com/IQSS/dataverse/blob/4903e9f0277105ea6a8c59a2f962dac8bcf715f2/src/main/java/edu/harvard/iq/dataverse/DatasetPage.java#L5535 only displays the json-ld for the latest version (even though the underlying code could do it for earlier versions).

As for the latest version not being up-to-date - I don't know why that is. If it is because the generated or osgi-cache dirs weren't cleared, we should see which one. The release notes already say to delete things in generated, so if that wasn't done, it's not an issue with the code or release notes. If it is the osgi-cache dir, we should probably open an issue to add that to the release notes. If the reason still isn't clear, we should perhaps treat either this issue or a new one in the dataverse repo as a spike to investigate.

reExportAll would still be a good thing to do - in order to get the schema.org export files up-to-date.
