Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Dataverse files without a persistentID #355

Merged
merged 6 commits into from
Feb 19, 2024
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
37 changes: 24 additions & 13 deletions pooch/downloaders.py
Original file line number Diff line number Diff line change
Expand Up @@ -1034,20 +1034,31 @@ def download_url(self, file_name):
download_url : str
The HTTP URL that can be used to download the file.
"""

parsed = parse_url(self.archive_url)

# Iterate over the given files until we find one of the requested name
for filedata in self.api_response.json()["data"]["latestVersion"]["files"]:
if file_name == filedata["dataFile"]["filename"]:
return (
f"{parsed['protocol']}://{parsed['netloc']}/api/access/datafile/"
f":persistentId?persistentId={filedata['dataFile']['persistentId']}"
)

raise ValueError(
f"File '{file_name}' not found in data archive {self.archive_url} (doi:{self.doi})."
)
response = self.api_response.json()
files = {
file["dataFile"]["filename"]: file["dataFile"]
for file in response["data"]["latestVersion"]["files"]
}
if file_name not in files:
raise ValueError(
f"File '{file_name}' not found in data archive "
f"{self.archive_url} (doi:{self.doi})."
)
# Generate download_url using persistentId or file id
persistent_id = files[file_name]["persistentId"]
if persistent_id:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file ID will always be there. It's the primary key in the database.

The persistent ID (DOI or Handle) is optional so you can't rely on it being there. I would simply avoid even checking for it if all you want to do is download the file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I thought the persistentId is always there, but it's empty if the files doesn't have one. Sorry for the misunderstanding.

If that's the case, you're right, we shouldn't assume that persistentId will always be there. I'll change the if statement then.

BTW, do you have any example where the persistentId is not even included in the response (I'm thinking for testing purposes)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, from a quick test it looks like persistentId is always present but can be an empty string. I'll put an example below.

I think we're saying the same thing. Always there. Sometimes an empty string. So I'd suggest checking for id instead which will always be there and always be a number. I hope this helps! 😄

curl -s 'https://dataverse.unc.edu/api/datasets/:persistentId?persistentId=doi:10.15139/S3/TRSZ3X' | jq '.data.latestVersion.files[0]'

{
  "description": "summary data file",
  "label": "CureTB data summary and statistics.tab",
  "restricted": false,
  "version": 3,
  "datasetVersionId": 32878,
  "dataFile": {
    "id": 7527010,
    "persistentId": "",
    "pidURL": "",
    "filename": "CureTB data summary and statistics.tab",
    "contentType": "text/tab-separated-values",
    "filesize": 3660,
    "description": "summary data file",
    "storageIdentifier": "s3://unc-dataverse-prod:18704bbebb7-8e1788510d33",
    "originalFileFormat": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "originalFormatLabel": "MS Excel Spreadsheet",
    "originalFileSize": 335887,
    "originalFileName": "CureTB data summary and statistics.xlsx",
    "UNF": "UNF:6:P0vi0QJZpFxCwM3pcX9YJw==",
    "rootDataFileId": -1,
    "md5": "4804fa6347742d850b0e1753c1668882",
    "checksum": {
      "type": "MD5",
      "value": "4804fa6347742d850b0e1753c1668882"
    },
    "creationDate": "2023-03-21"
  }
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, from a quick test it looks like persistentId is always present but can be an empty string.

Great then! The following two lines actually work both with non-existing persistentId and if persistentId is an empty string. So I think we could leave them as they are, just in case in the future persistentId is dropped from any Dataverse API.

pooch/pooch/downloaders.py

Lines 1049 to 1050 in 1e670cf

persistent_id = files[file_name].get("persistentId")
if persistent_id:

I think we're saying the same thing. Always there. Sometimes an empty string. So I'd suggest checking for id instead which will always be there and always be a number.

My strategy is to check for the id only if the persistentId is missing. This is due to what I commented above regarding defaulting to persistentId, being it the first option offered in Dataverse docs.

Maybe I'm being too conservative about it... I'm trying to keep the chances of breaking backward compatibility as low as possible, while still supporting the cases where persistentId is missing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest changing your strategy to this:

  • Only check for id

Simple! 😄

download_url = (
f"{parsed['protocol']}://{parsed['netloc']}/api/access/datafile/"
f":persistentId?persistentId={persistent_id}"
)
else:
file_id = files[file_name]["id"]
download_url = (
f"{parsed['protocol']}://{parsed['netloc']}/api/access/datafile/"
f"{file_id}"
)
return download_url

def populate_registry(self, pooch):
"""
Expand Down