Support Dataverse files without a persistentID #355

Merged
merged 6 commits into from
Feb 19, 2024

santisoler
Member

Add support for downloading Dataverse files that don't have a persistent ID. Use the file ID instead.

Relevant issues/PRs:

Fixes #354

TODO

  • Add tests that download files that don't have a persistentID

Add support for downloading Dataverse files that don't have a persistent
ID. Use the file ID instead.
@santisoler
Member Author

We should add tests for this bugfix that download a file from a Dataverse repository that doesn't provide persistent IDs for its files. I would avoid using preexisting repositories: we want very small files, since these tests will be run many times. Creating a copy of the Pooch test data in another Dataverse repository would be nice.

Contributor

@pdurbin pdurbin left a comment

More details in the comment I left but I'd suggest always downloading files using the (database) ID.

Please feel free to hit me up on https://chat.dataverse.org if you have any questions. Thanks for teaching Pooch to download from Dataverse! 🐶 ❤️

Comment on lines 1049 to 1050
persistent_id = files[file_name]["persistentId"]
if persistent_id:
Contributor

The file ID will always be there. It's the primary key in the database.

The persistent ID (DOI or Handle) is optional so you can't rely on it being there. I would simply avoid even checking for it if all you want to do is download the file.

Member Author

I see. I thought the persistentId is always there, but it's empty if the file doesn't have one. Sorry for the misunderstanding.

If that's the case, you're right, we shouldn't assume that persistentId will always be there. I'll change the if statement then.

BTW, do you have any example where the persistentId is not even included in the response (I'm thinking for testing purposes)?

Contributor

Well, from a quick test it looks like persistentId is always present but can be an empty string. I'll put an example below.

I think we're saying the same thing. Always there. Sometimes an empty string. So I'd suggest checking for id instead which will always be there and always be a number. I hope this helps! 😄

curl -s 'https://dataverse.unc.edu/api/datasets/:persistentId?persistentId=doi:10.15139/S3/TRSZ3X' | jq '.data.latestVersion.files[0]'

{
  "description": "summary data file",
  "label": "CureTB data summary and statistics.tab",
  "restricted": false,
  "version": 3,
  "datasetVersionId": 32878,
  "dataFile": {
    "id": 7527010,
    "persistentId": "",
    "pidURL": "",
    "filename": "CureTB data summary and statistics.tab",
    "contentType": "text/tab-separated-values",
    "filesize": 3660,
    "description": "summary data file",
    "storageIdentifier": "s3://unc-dataverse-prod:18704bbebb7-8e1788510d33",
    "originalFileFormat": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "originalFormatLabel": "MS Excel Spreadsheet",
    "originalFileSize": 335887,
    "originalFileName": "CureTB data summary and statistics.xlsx",
    "UNF": "UNF:6:P0vi0QJZpFxCwM3pcX9YJw==",
    "rootDataFileId": -1,
    "md5": "4804fa6347742d850b0e1753c1668882",
    "checksum": {
      "type": "MD5",
      "value": "4804fa6347742d850b0e1753c1668882"
    },
    "creationDate": "2023-03-21"
  }
}
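
A minimal Python sketch of what this response implies for client code: `id` can be read directly, while `persistentId` has to be treated as optional. The helper name `describe_ids` is hypothetical and not part of Pooch or Dataverse, and the dictionary is a trimmed copy of the response above.

```python
# Trimmed copy of the response above; only the fields used here are kept.
file_entry = {
    "label": "CureTB data summary and statistics.tab",
    "dataFile": {
        "id": 7527010,
        "persistentId": "",
        "filename": "CureTB data summary and statistics.tab",
    },
}


def describe_ids(entry):
    """Return (file id, persistent ID or None) for one file entry.

    Hypothetical helper, for illustration only.
    """
    data_file = entry["dataFile"]
    # A missing key and an empty string both collapse to None here.
    persistent_id = data_file.get("persistentId") or None
    return data_file["id"], persistent_id


file_id, persistent_id = describe_ids(file_entry)
print(file_id, persistent_id)
```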

Member Author

Well, from a quick test it looks like persistentId is always present but can be an empty string.

Great then! The following two lines work both when persistentId is missing and when it is an empty string. So I think we could leave them as they are, in case persistentId is ever dropped from a Dataverse API.

pooch/pooch/downloaders.py

Lines 1049 to 1050 in 1e670cf

persistent_id = files[file_name].get("persistentId")
if persistent_id:

I think we're saying the same thing. Always there. Sometimes an empty string. So I'd suggest checking for id instead which will always be there and always be a number.

My strategy is to check for the id only if the persistentId is missing. This follows from what I commented above about defaulting to persistentId, since it is the first option offered in the Dataverse docs.

Maybe I'm being too conservative about it... I'm trying to keep the chances of breaking backward compatibility as low as possible, while still supporting the cases where persistentId is missing.
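
The fallback strategy described in this comment could be sketched roughly as below. This is an illustration, not the code in the PR: `build_download_url` and the `base_url` argument are hypothetical names, the DOI in the usage lines is made up, and the `:persistentId?persistentId=` URL form is assumed from the Dataverse data access API.

```python
def build_download_url(base_url, data_file):
    """Prefer the persistent ID when present, otherwise use the file id.

    Hypothetical helper; `data_file` is the "dataFile" dictionary of one
    file entry in a Dataverse dataset response.
    """
    # .get() covers a missing key; the truthiness test covers "".
    persistent_id = data_file.get("persistentId")
    if persistent_id:
        return (
            f"{base_url}/api/access/datafile/"
            f":persistentId?persistentId={persistent_id}"
        )
    # The file id is the database primary key, so it is always present.
    return f"{base_url}/api/access/datafile/{data_file['id']}"


base_url = "https://dataverse.example.org"  # placeholder host
print(build_download_url(base_url, {"id": 7527010, "persistentId": ""}))
print(build_download_url(base_url, {"id": 1, "persistentId": "doi:10.5072/FK2/EXAMPLE"}))
```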

Contributor

I would suggest changing your strategy to this:

  • Only check for id

Simple! 😄

The `persistentId` key might be missing in the API response, while the
`ID` is always there. So, don't assume it exists when deciding which id
should be used to download the files.
Whether `persistent_id` is None or an empty string, `if persistent_id:`
evaluates to false in both cases.
Member

@leouieda leouieda left a comment

Since the API docs say:

Basic access URI:
/api/access/datafile/$id

and only after that do they have a box saying you can also use the persistentID, I think we can go with @pdurbin's advice of only using the ID instead of the PID. That would simplify the testing and the code, and it would only break if Dataverse were to break their API. That should be fine: if we assume they could remove the ID, then they could break things in so many other ways that we have no way to predict. If it does happen, we can always issue a patch. But I think it's unlikely that they would do so without bumping the API version. The case of Zenodo from #373 seems like a good example that things may break, but unintentionally, in which case we probably only have to report the issue.
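
For illustration, the ID-only approach applied to a whole dataset response might look like the sketch below. It assumes the response layout shown earlier in the thread (`data.latestVersion.files[*].dataFile`) and the basic access URI quoted above; `id_download_urls` and the host name are hypothetical, and this is not necessarily how the merged code is written.

```python
def id_download_urls(base_url, dataset_response):
    """Map each file name to its /api/access/datafile/$id URL.

    Hypothetical helper; persistentId is never consulted.
    """
    files = dataset_response["data"]["latestVersion"]["files"]
    return {
        entry["dataFile"]["filename"]: (
            f"{base_url}/api/access/datafile/{entry['dataFile']['id']}"
        )
        for entry in files
    }


# Trimmed stand-in for a dataset response, keeping only the fields used.
response = {
    "data": {
        "latestVersion": {
            "files": [
                {
                    "dataFile": {
                        "id": 7527010,
                        "persistentId": "",
                        "filename": "CureTB data summary and statistics.tab",
                    }
                }
            ]
        }
    }
}

urls = id_download_urls("https://dataverse.example.org", response)
print(urls["CureTB data summary and statistics.tab"])
```

Because every file has an `id`, there is no branch to test for files with and without persistent IDs.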

@leouieda
Member

I can merge in main and make the changes since @santisoler is on vacation.

@leouieda
Member

Plus, finding a Dataverse instance that doesn't have persistent IDs and would be willing to host the Pooch test data would be non-trivial. And they could always enable the PIDs and break our tests without any warning.

@leouieda leouieda merged commit c256699 into main Feb 19, 2024
19 checks passed
@leouieda leouieda deleted the dataverse-without-persistentid branch February 19, 2024 20:12
@pdurbin
Contributor

pdurbin commented Feb 22, 2024

15f7536 looks like a nice simplification.

By the way we have a new changelog for breaking changes to the Dataverse API, which we hope to keep short! Here's how it looks as of Dataverse 6.1: https://guides.dataverse.org/en/6.1/api/changelog.html

@leouieda
Member

Thanks for sharing @pdurbin!

Development

Successfully merging this pull request may close these issues.

Unable to download files in Dataverse repositories when files don't have PIDs
3 participants