expose links to all export formats via Signposting #11045

Open
wants to merge 9 commits into
base: develop

pdurbin
Member

@pdurbin pdurbin commented Nov 22, 2024

What this PR does / why we need it:

Especially for Croissant but really all export formats, we're interested in exposing the URLs for each format via Signposting so that crawlers (and API users) can efficiently get just the HEAD of a page (or linkset API) to get the URLs.

If Google and others adopt Signposting, it will mean they can do a HEAD, get the Croissant URL (for example), and download the Croissant file, which has the potential to be large. See also discussion at mlcommons/croissant#530 (comment) and especially this URL:

Which issue(s) this PR closes:

Special notes for your reviewer:

Please see the comment about how I changed the mimetype for our schema.org format.

I also did a fair amount of doc improvement. Feedback welcome. Heads up to @julian-schneider that I tweaked your docs in #10739 (just merged).

Suggestions on how to test this:

Try HEAD and GET on a published dataset, looking for the "Link" header.

Try the linkset API
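One way to script the HEAD check is sketched below, using the JDK's built-in `java.net.http` client. The class name `SignpostingHeadCheck` is made up for illustration, and the site URL and PID in `main` are assumptions borrowed from the test output quoted later in this conversation; substitute a published dataset on your own installation.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

// Illustrative helper, not part of Dataverse.
final class SignpostingHeadCheck {

    // Build a HEAD request for a dataset landing page. The URL pattern
    // mirrors the dataset.xhtml URLs quoted in this conversation.
    static HttpRequest headRequest(String siteUrl, String persistentId) {
        URI uri = URI.create(siteUrl + "/dataset.xhtml?persistentId=" + persistentId);
        // method("HEAD", noBody()) is used because it works on Java 11 and later.
        return HttpRequest.newBuilder(uri)
                .method("HEAD", HttpRequest.BodyPublishers.noBody())
                .build();
    }

    // Send the request and return every "Link" header value (possibly none).
    static List<String> fetchLinkHeaders(HttpRequest request) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<Void> response =
                client.send(request, HttpResponse.BodyHandlers.discarding());
        return response.headers().allValues("Link");
    }

    public static void main(String[] args) throws Exception {
        // Assumed values: a local installation and a placeholder PID.
        HttpRequest request = headRequest("http://localhost:8080", "doi:10.5072/FK2/6A3292");
        for (String value : fetchLinkHeaders(request)) {
            System.out.println("Link: " + value);
        }
    }
}
```

Because the request is a HEAD, no page body is transferred; only the headers come back, which is the efficiency argument made in the PR description.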

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

No.

Is there a release notes update needed for this change?:

Yes, included.

Additional documentation:

A good entry point for doc changes: https://dataverse-guide--11045.org.readthedocs.build/en/11045/api/native-api.html#retrieve-signposting-information

@pdurbin pdurbin self-assigned this Nov 22, 2024
@coveralls

coveralls commented Nov 22, 2024

Coverage Status

coverage: 22.69% (-0.005%) from 22.695%
when pulling 76c347b on 10542-signposting
into 825ab15 on develop.

@pdurbin pdurbin marked this pull request as ready for review November 22, 2024 21:16
@pdurbin pdurbin removed their assignment Nov 22, 2024

Member

@qqmyers qqmyers left a comment

Looks good. Nice to have all our formats in Signposting - we should make sure Herbert vdS knows. I suggested one change to avoid problems with permalinks.

try {
exporter = ExportService.getInstance().getExporter(formatName);
describedby += ",<" + systemConfig.getDataverseSiteUrl() + "/api/datasets/export?exporter=" + formatName + "&persistentId="
+ ds.getProtocol() + ":" + ds.getAuthority() + "/" + ds.getIdentifier() + ">;rel=\"describedby\"" + ";type=\"" + exporter.getMediaType() + "\"";
Member

@qqmyers qqmyers Nov 22, 2024

These (this and line 137) won't work for all permalinks since they don't necessarily have / as a separator. I think you can just use ds.getGlobalId().asString() instead. For a real dataset, I don't think you can ever have a null GlobalId, so I'm not sure you even need to check for that.
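To illustrate the concern, here is a minimal sketch of why a hardcoded "/" can produce the wrong identifier. The `Pid` class is a hypothetical stand-in, not Dataverse's `GlobalId`, and the `perma:EXAMPLE-ABC123` identifier is invented purely to show a non-slash separator.

```java
// Illustrative only; none of these names exist in Dataverse.
final class DescribedByLinkSketch {

    // Hypothetical persistent identifier whose separator between
    // authority and identifier is not always "/".
    static final class Pid {
        final String protocol;
        final String authority;
        final String separator;
        final String identifier;

        Pid(String protocol, String authority, String separator, String identifier) {
            this.protocol = protocol;
            this.authority = authority;
            this.separator = separator;
            this.identifier = identifier;
        }

        // The PID renders itself, so callers never guess the separator.
        String asString() {
            return protocol + ":" + authority + separator + identifier;
        }
    }

    // Fragile: assumes "/" always separates authority and identifier.
    static String hardcodedSlash(Pid pid) {
        return pid.protocol + ":" + pid.authority + "/" + pid.identifier;
    }

    // Build one describedby entry from the PID's own string form.
    static String describedBy(String siteUrl, String formatName, String pid, String mediaType) {
        return "<" + siteUrl + "/api/datasets/export?exporter=" + formatName
                + "&persistentId=" + pid + ">;rel=\"describedby\";type=\"" + mediaType + "\"";
    }

    public static void main(String[] args) {
        Pid doi = new Pid("doi", "10.5072", "/", "FK2/YD5QDG");
        Pid perma = new Pid("perma", "EXAMPLE", "-", "ABC123"); // invented permalink
        System.out.println(hardcodedSlash(perma)); // separator guessed by the caller
        System.out.println(perma.asString());      // separator supplied by the PID
        System.out.println(describedBy("https://demo.dataverse.org", "croissant",
                doi.asString(), "application/json"));
    }
}
```

For the DOI the two approaches agree, which is why the problem is easy to miss; for the invented permalink the hardcoded form yields `perma:EXAMPLE/ABC123` while the PID's own rendering yields `perma:EXAMPLE-ABC123`.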

Member Author

@qqmyers thanks, in ca93d60 I corrected the two you found plus two more.

@cmbz cmbz added the FY25 Sprint 11 FY25 Sprint 11 (2024-11-20 - 2024-12-04) label Nov 23, 2024
@pdurbin pdurbin self-assigned this Nov 24, 2024
@pdurbin pdurbin assigned qqmyers and unassigned pdurbin Nov 25, 2024
Member

@qqmyers qqmyers left a comment

Looks good.

@qqmyers qqmyers removed their assignment Nov 25, 2024

@pdurbin
Member Author

pdurbin commented Nov 25, 2024

@pdurbin pdurbin self-assigned this Nov 25, 2024
…0542

The test file is used in InfoIT#testGetExportFormats
Before this PR...

In development:

Expected: is "http://localhost:8080/dataset.xhtml?persistentId=doi:10.5072/FK2/6A3292"
  Actual: is "http://localhost:8080/dataset.xhtml?persistentId=doi:10.5072/FK2/6A3292"

On Jenkins

Expected: is "http://localhost:8080/dataset.xhtml?persistentId=doi:10.5072/FK2/6A3292"
  Actual: http://ec2-3-225-221-142.compute-1.amazonaws.com/dataset.xhtml?persistentId=doi:10.5072/FK2/6A3292

So we'll change to just "endsWith" since we aren't actually testing the baseurl,
just the datasetPid which we fixed up in ca93d60.
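
The reasoning behind the "endsWith" change can be sketched in plain Java. This uses `String.endsWith` as a stand-in for whatever matcher the real test uses; the class and method names are hypothetical, and the two URLs are the ones from the output above.

```java
// Illustrative only; not the actual InfoIT test code.
final class EndsWithAssertionSketch {

    // True when the URL ends with the dataset page path and PID,
    // regardless of which host served it.
    static boolean matchesDataset(String actualUrl, String datasetPid) {
        return actualUrl.endsWith("/dataset.xhtml?persistentId=" + datasetPid);
    }

    public static void main(String[] args) {
        String pid = "doi:10.5072/FK2/6A3292";
        String dev = "http://localhost:8080/dataset.xhtml?persistentId=" + pid;
        String jenkins = "http://ec2-3-225-221-142.compute-1.amazonaws.com/dataset.xhtml?persistentId=" + pid;

        // Exact equality depends on the base URL of the environment.
        System.out.println("equal: " + dev.equals(jenkins));
        // A suffix check only looks at the part the test cares about.
        System.out.println("dev matches: " + matchesDataset(dev, pid));
        System.out.println("jenkins matches: " + matchesDataset(jenkins, pid));
    }
}
```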
@pdurbin pdurbin assigned qqmyers and unassigned pdurbin Nov 25, 2024
@pdurbin
Member Author

pdurbin commented Nov 25, 2024

I pushed some commits to fix the broken tests, update the API changelog and release note, and fix a broken header in the docs.

This test was failing in Jenkins: mvn test -Dtest=FilesIT#testDeleteFile. I ran it locally and it seems fine. I think it's unrelated to this PR. Here's the failure...

JSON path data.files[0].dataFile.filename doesn't match.
Expected: cc0.png
  Actual: orcid_16x16.png

... on this line:

.body("data.files[0].dataFile.filename", equalTo("cc0.png"))

Again, I don't think it has anything to do with changes in the PR but I'm just mentioning it in case we start seeing it elsewhere.

@qqmyers if you would take another look I'd appreciate it!

@pdurbin
Member Author

pdurbin commented Nov 25, 2024

By the way, @4tikhonov pointed out to me that Dataverse shows up in this report of who has implemented Signposting: https://s11.no/2024/signposting-report/

Member

@qqmyers qqmyers left a comment

Updates look fine. I agree the one test failure is probably unrelated to the PR - possibly a timing issue again.

@qqmyers qqmyers removed their assignment Nov 25, 2024
@cmbz cmbz added the FY25 Sprint 12 FY25 Sprint 12 (2024-12-04 - 2024-12-18) label Dec 5, 2024
Conflicts:
doc/sphinx-guides/source/api/changelog.rst (updated to 6.6)

@cmbz cmbz added the FY25 Sprint 14 FY25 Sprint 14 (2025-01-02 - 2025-01-15) label Jan 2, 2025
@ofahimIQSS
Contributor

branch has conflicts, can you please resolve?

Conflicts:
doc/sphinx-guides/source/api/changelog.rst
@pdurbin
Member Author

pdurbin commented Jan 7, 2025

@ofahimIQSS yes, done in 76c347b


github-actions bot commented Jan 7, 2025

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:10542-signposting
ghcr.io/gdcc/configbaker:10542-signposting

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

@pdurbin
Member Author

pdurbin commented Jan 13, 2025

As discussed at a couple standups and in Slack, I pulled this PR out of "ready for QA" in order to think about the JSON-LD "profile" parameter I learned about last week in a presentation by @stain about Signposting to the Croissant working group.

As of this writing, "profile" information is not present. For Croissant, for example, the following should appear in the Signposting output:

<https://demo.dataverse.org/api/datasets/export?exporter=croissant&persistentId=doi:10.5072/FK2/YD5QDG>;rel="describedby";type="application/json"

In the future, we plan to establish a pattern where exporters can put "profile" information in their getMediaType() method. I created a new issue to track that work:

I've already created gdcc/exporter-croissant#7 to update the Croissant exporter such that once it has been merged and a new version has been released, we should expect to see Signposting output like this for Croissant:

<https://demo.dataverse.org/api/datasets/export?exporter=croissant&persistentId=doi:10.5072/FK2/YD5QDG>;rel="describedby";type="application/ld+json"; profile="http://mlcommons.org/croissant/1.0"

(Yes, in that PR we are also making the type more specific by changing json to ld+json.)
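
From a consumer's point of view, the "profile" parameter lets a crawler pick out the Croissant link without guessing from the URL. The following is a minimal sketch of parsing a Link header value like the one above. The class name `LinkHeaderSketch` is made up, and the parser is deliberately simplified: it is an assumption-laden illustration, not a complete RFC 8288 implementation (escaped quotes, for instance, are not handled).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative only; not part of Dataverse or any Signposting library.
final class LinkHeaderSketch {

    // One parsed entry: the target URL plus its parameters (rel, type, profile, ...).
    static final class Link {
        final String target;
        final Map<String, String> params;

        Link(String target, Map<String, String> params) {
            this.target = target;
            this.params = params;
        }
    }

    // Split a header value into <url>;name="value" entries separated by commas.
    static List<Link> parse(String header) {
        List<Link> links = new ArrayList<>();
        int pos = 0;
        while (true) {
            int open = header.indexOf('<', pos);
            if (open < 0) break;
            int close = header.indexOf('>', open);
            if (close < 0) break;
            // The entry ends at the next comma that is outside double quotes.
            int end = close + 1;
            boolean quoted = false;
            while (end < header.length()) {
                char c = header.charAt(end);
                if (c == '"') quoted = !quoted;
                else if (c == ',' && !quoted) break;
                end++;
            }
            links.add(new Link(header.substring(open + 1, close),
                    parseParams(header.substring(close + 1, end))));
            pos = end;
        }
        return links;
    }

    // Split the parameter section on semicolons that are outside double quotes.
    static Map<String, String> parseParams(String section) {
        Map<String, String> params = new LinkedHashMap<>();
        StringBuilder current = new StringBuilder();
        boolean quoted = false;
        for (int i = 0; i <= section.length(); i++) {
            boolean atEnd = i == section.length();
            char c = atEnd ? ';' : section.charAt(i);
            if (c == '"') quoted = !quoted;
            if (atEnd || (c == ';' && !quoted)) {
                addParam(current.toString(), params);
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        return params;
    }

    // Store name=value, lowercasing the name and stripping surrounding quotes.
    private static void addParam(String raw, Map<String, String> params) {
        int eq = raw.indexOf('=');
        if (eq < 0) return;
        String name = raw.substring(0, eq).trim().toLowerCase(Locale.ROOT);
        String value = raw.substring(eq + 1).trim();
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1);
        }
        params.put(name, value);
    }

    public static void main(String[] args) {
        String header = "<https://demo.dataverse.org/api/datasets/export?exporter=croissant"
                + "&persistentId=doi:10.5072/FK2/YD5QDG>;rel=\"describedby\""
                + ";type=\"application/ld+json\"; profile=\"http://mlcommons.org/croissant/1.0\"";
        for (Link link : parse(header)) {
            String profile = link.params.get("profile");
            if ("describedby".equals(link.params.get("rel")) && profile != null
                    && profile.contains("croissant")) {
                System.out.println(link.target);
            }
        }
    }
}
```

Without "profile", a consumer would see only `type="application/json"` (or `application/ld+json`) and could not tell Croissant apart from other JSON-based exports except by inspecting the `exporter` query parameter.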

Likewise, I've made a PR for the RO-Crate exporter at gdcc/exporter-ro-crate#4 to add the "profile" information.

Now that we have #11151 to track future work, I don't see any reason to hold up this PR (#11045), so I'm moving it back to "ready for QA".

@pdurbin pdurbin removed their assignment Jan 13, 2025
Labels
FY25 Sprint 11 FY25 Sprint 11 (2024-11-20 - 2024-12-04) FY25 Sprint 12 FY25 Sprint 12 (2024-12-04 - 2024-12-18) FY25 Sprint 14 FY25 Sprint 14 (2025-01-02 - 2025-01-15)
Projects
Status: Ready for QA ⏩
Development

Successfully merging this pull request may close these issues.

Add Croissant to Signposting "describedby" output
5 participants