Support Dataverse files without a persistentID #355

santisoler · 2023-03-13T18:15:12Z

Add support for downloading Dataverse files that don't have a persistent ID. Use the file ID instead.

Relevant issues/PRs:

Fixes #354

TODO

Add tests that download files that don't have a persistentID

santisoler · 2023-03-13T18:18:15Z

We should add tests for this bugfix that download a file from a Dataverse repository that doesn't provide persistent IDs for its files. I would avoid using preexisting repositories: we want very small files, since these tests will be run many times. Creating a copy of the Pooch test data in another Dataverse repository would be nice.

pdurbin

More details in the comment I left but I'd suggest always downloading files using the (database) ID.

Please feel free to hit me up on https://chat.dataverse.org if you have any questions. Thanks for teaching Pooch to download from Dataverse! 🐶 ❤️

pdurbin · 2023-03-17T18:47:45Z

pooch/downloaders.py

+        persistent_id = files[file_name]["persistentId"]
+        if persistent_id:


The file ID will always be there. It's the primary key in the database.

The persistent ID (DOI or Handle) is optional so you can't rely on it being there. I would simply avoid even checking for it if all you want to do is download the file.

I see. I thought persistentId was always there, just empty if the file doesn't have one. Sorry for the misunderstanding.

If that's the case, you're right, we shouldn't assume that persistentId will always be there. I'll change the if statement then.

BTW, do you have any example where the persistentId is not even included in the response (I'm thinking for testing purposes)?

Well, from a quick test it looks like persistentId is always present but can be an empty string. I'll put an example below.

I think we're saying the same thing. Always there. Sometimes an empty string. So I'd suggest checking for id instead which will always be there and always be a number. I hope this helps! 😄

curl -s 'https://dataverse.unc.edu/api/datasets/:persistentId?persistentId=doi:10.15139/S3/TRSZ3X' | jq '.data.latestVersion.files[0]'

{
  "description": "summary data file",
  "label": "CureTB data summary and statistics.tab",
  "restricted": false,
  "version": 3,
  "datasetVersionId": 32878,
  "dataFile": {
    "id": 7527010,
    "persistentId": "",
    "pidURL": "",
    "filename": "CureTB data summary and statistics.tab",
    "contentType": "text/tab-separated-values",
    "filesize": 3660,
    "description": "summary data file",
    "storageIdentifier": "s3://unc-dataverse-prod:18704bbebb7-8e1788510d33",
    "originalFileFormat": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "originalFormatLabel": "MS Excel Spreadsheet",
    "originalFileSize": 335887,
    "originalFileName": "CureTB data summary and statistics.xlsx",
    "UNF": "UNF:6:P0vi0QJZpFxCwM3pcX9YJw==",
    "rootDataFileId": -1,
    "md5": "4804fa6347742d850b0e1753c1668882",
    "checksum": {
      "type": "MD5",
      "value": "4804fa6347742d850b0e1753c1668882"
    },
    "creationDate": "2023-03-21"
  }
}
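
For illustration, here is a minimal sketch of how a response like the one above could be indexed by file name to look up the numeric file ID. The sample dictionary is a trimmed copy of the output above, and `index_files_by_name` is a hypothetical helper written for this example, not necessarily how Pooch does it.

```python
# Trimmed copy of the dataset response shown above (only the relevant keys).
sample_response = {
    "data": {
        "latestVersion": {
            "files": [
                {
                    "label": "CureTB data summary and statistics.tab",
                    "dataFile": {
                        "id": 7527010,
                        "persistentId": "",
                        "filename": "CureTB data summary and statistics.tab",
                    },
                }
            ]
        }
    }
}


def index_files_by_name(response):
    """Map each file name to its ``dataFile`` entry (hypothetical helper)."""
    return {
        entry["dataFile"]["filename"]: entry["dataFile"]
        for entry in response["data"]["latestVersion"]["files"]
    }


files = index_files_by_name(sample_response)
data_file = files["CureTB data summary and statistics.tab"]
print(data_file["id"])  # numeric database ID of the file
print(repr(data_file["persistentId"]))  # empty string for this file
```

In this sample the `id` is a number while `persistentId` is an empty string, which is the case that triggered the original bug report.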

Well, from a quick test it looks like persistentId is always present but can be an empty string.

Great then! The following two lines work both when persistentId is absent and when it is an empty string. So I think we could leave them as they are, just in case persistentId is dropped from any Dataverse API response in the future.

pooch/pooch/downloaders.py

Lines 1049 to 1050 in 1e670cf

persistent_id = files[file_name].get("persistentId")
if persistent_id:
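
To make the behavior of those two lines concrete, here is a small sketch assuming each file entry is a plain dictionary. The `choose_identifier` function and the sample entries (including the made-up DOI) are hypothetical and only illustrate the truthiness check.

```python
def choose_identifier(data_file):
    """Prefer a non-empty persistentId, otherwise fall back to the file id."""
    # dict.get returns None when the key is absent, and both None and ""
    # are falsy, so a single truthiness check covers both cases.
    persistent_id = data_file.get("persistentId")
    if persistent_id:
        return ("persistentId", persistent_id)
    return ("id", data_file["id"])


missing_key = {"id": 1}
empty_string = {"id": 2, "persistentId": ""}
with_pid = {"id": 3, "persistentId": "doi:10.1234/EXAMPLE"}  # made-up DOI

for entry in (missing_key, empty_string, with_pid):
    print(choose_identifier(entry))
```

The first two entries fall back to the numeric `id`, and only the third one uses the persistent ID.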

I think we're saying the same thing. Always there. Sometimes an empty string. So I'd suggest checking for id instead which will always be there and always be a number.

My strategy is to check for the id only if the persistentId is missing. This follows from what I commented above about defaulting to persistentId, since it is the first option offered in the Dataverse docs.

Maybe I'm being too conservative about it... I'm trying to keep the chances of breaking backward compatibility as low as possible, while still supporting the cases where persistentId is missing.

I would suggest changing your strategy to this:

Only check for id

Simple! 😄

leouieda

Since the API docs say:

Basic access URI:
/api/access/datafile/$id

and only after that they have a box saying you can also use the persistentID, I think we can go with @pdurbin's advice of only using the ID instead of the PID. That would simplify the testing and the code, and it would only break if Dataverse were to break their API. That should be fine: if we assume they can break things by removing the ID, then they could break in so many other ways that we have no way to predict. If it does happen, we can always issue a patch. But I think it's unlikely that they will without bumping the API version. The case of Zenodo from #373 seems like a good example that things may break, but unintentionally. In that case, we probably only have to report the issue.
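
As a rough sketch of the ID-only approach, a download URL can be built from the basic access URI quoted above. The `datafile_url` function is a hypothetical helper for illustration. The server and file ID are the ones from the example response earlier in this thread.

```python
def datafile_url(base_url, file_id):
    """Build a download URL following the basic access URI /api/access/datafile/$id."""
    # Strip a trailing slash so the path is not joined with a double slash.
    return f"{base_url.rstrip('/')}/api/access/datafile/{file_id}"


url = datafile_url("https://dataverse.unc.edu", 7527010)
print(url)  # https://dataverse.unc.edu/api/access/datafile/7527010
```

With this approach there is no branching on whether a file has a persistent ID, which is what keeps both the code and the tests simple.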

leouieda · 2024-02-19T19:48:22Z

I can merge in main and make the changes since @santisoler is on vacation.

leouieda · 2024-02-19T19:53:31Z

Plus, finding a Dataverse instance without persistent IDs that would be willing to host the Pooch test data would be non-trivial. And they could always enable PIDs and break our tests without any warning.

pdurbin · 2024-02-22T20:13:29Z

15f7536 looks like a nice simplification.

By the way we have a new changelog for breaking changes to the Dataverse API, which we hope to keep short! Here's how it looks as of Dataverse 6.1: https://guides.dataverse.org/en/6.1/api/changelog.html

leouieda · 2024-02-22T21:53:26Z

Thanks for sharing @pdurbin!

Support Dataverse files without a persistentID

0321a08

Add support for downloading Dataverse files that don't have a persistent ID. Use the file ID instead.

santisoler mentioned this pull request Mar 13, 2023

Unable to download files in Dataverse repositories when files don't have PIDs #354

Closed

pdurbin reviewed Mar 17, 2023

santisoler added 3 commits March 20, 2023 14:29

Don't assume that persistentId is always present

7ad5073

The `persistentId` key might be missing in the API response, while the `ID` is always there. So, don't assume it exists when deciding which id should be used to download the files.

Simplify the if statement

1e670cf

Both for a persistent_id as a None or as an empty string, we can evaluate them with `if persistent_id:`.

Merge branch 'main' into dataverse-without-persistentid

23d2d02

santisoler requested a review from leouieda April 13, 2023 17:37

leouieda reviewed Feb 19, 2024

Merge branch 'main' into dataverse-without-persistentid

93d9835

Only rely on the file ID, not PID

15f7536

leouieda merged commit c256699 into main Feb 19, 2024
19 checks passed

leouieda deleted the dataverse-without-persistentid branch February 19, 2024 20:12
