Merge pull request IQSS#11016 from IQSS/220-audit-physical-files
API for auditing physical files and file metadata
ofahimIQSS authored Dec 2, 2024
2 parents 4624bb6 + 8c79f67 commit 5d7d942
Showing 6 changed files with 1,220 additions and 936 deletions.
16 changes: 16 additions & 0 deletions doc/release-notes/220-harvard-edu-audit-files.md
@@ -0,0 +1,16 @@
### New API to Audit Datafiles across the database

This is a superuser-only API endpoint that audits Datasets with DataFiles where the physical files or the file metadata are missing.
The Datasets scanned can be limited by optional firstId and lastId query parameters, or a given CSV list of Dataset Identifiers.
Once the audit report is generated, a superuser can either delete the missing file(s) from the Dataset or contact the author to re-upload the missing file(s).

The JSON response includes:
- List of files in each Dataset where the DataFile exists in the database but the physical file is not in the file store.
- List of DataFiles where the FileMetadata is missing.
- Other failures found when trying to process the Datasets.

curl -H "X-Dataverse-key:$API_TOKEN" "http://localhost:8080/api/admin/datafiles/auditFiles"
curl -H "X-Dataverse-key:$API_TOKEN" "http://localhost:8080/api/admin/datafiles/auditFiles?firstId=0&lastId=1000"
curl -H "X-Dataverse-key:$API_TOKEN" "http://localhost:8080/api/admin/datafiles/auditFiles?datasetIdentifierList=doi:10.5072/FK2/RVNT9Q,doi:10.5072/FK2/JXYBJS"
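
The three calls above differ only in their query string. As a minimal sketch, the URL could be assembled in Python as follows. `build_audit_url` is a hypothetical helper, not part of Dataverse; only the endpoint path and the parameter names `firstId`, `lastId` and `datasetIdentifierList` come from this note.

```python
from urllib.parse import urlencode


def build_audit_url(server_url, first_id=None, last_id=None, dataset_identifiers=None):
    """Return the auditFiles URL containing only the filters that were supplied.

    Hypothetical helper for illustration; it performs no request itself.
    """
    params = {}
    if first_id is not None:
        params["firstId"] = first_id
    if last_id is not None:
        params["lastId"] = last_id
    if dataset_identifiers:
        # The endpoint takes one comma-separated value, not repeated parameters.
        params["datasetIdentifierList"] = ",".join(dataset_identifiers)

    url = server_url.rstrip("/") + "/api/admin/datafiles/auditFiles"
    if params:
        # Leave ':', '/' and ',' unescaped so the result matches the curl examples.
        url += "?" + urlencode(params, safe=":/,")
    return url
```

The resulting URL would still need to be sent with the `X-Dataverse-key` header of a superuser, as in the curl calls.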

For more information, see [the docs](https://dataverse-guide--11016.org.readthedocs.build/en/11016/api/native-api.html#datafile-audit), #11016, and [#220](https://github.com/IQSS/dataverse.harvard.edu/issues/220)
66 changes: 66 additions & 0 deletions doc/sphinx-guides/source/api/native-api.rst
@@ -6300,6 +6300,72 @@ Note that if you are attempting to validate a very large number of datasets in y
asadmin set server-config.network-config.protocols.protocol.http-listener-1.http.request-timeout-seconds=3600
Datafile Audit
~~~~~~~~~~~~~~
Produce an audit report of missing files and FileMetadata for Datasets.
Scans the Datasets in the database and verifies that the stored files exist. If the files are missing or if the FileMetadata is missing, this information is returned in a JSON response.
The call returns a status code of 200 if the report was generated successfully. Issues found are documented in the report and do not produce a failure status code; a failure status is returned only if the report could not be generated::
curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/admin/datafiles/auditFiles"
Optional parameters are available for filtering the Datasets scanned.
For auditing the Datasets in a paged manner (firstId and lastId)::
curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/admin/datafiles/auditFiles?firstId=0&lastId=1000"
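
A large installation would typically be audited as a series of such calls. The following is a minimal Python sketch of how the ``firstId``/``lastId`` pairs could be generated. ``id_windows`` and ``max_id`` are hypothetical names, and the caller is assumed to know the highest Dataset id. This guide does not state whether ``lastId`` is inclusive, so consecutive windows share their boundary id; that avoids skipping a Dataset, at the cost that a Dataset on a boundary may be reported twice.

```python
def id_windows(max_id, page_size=1000):
    """Yield (firstId, lastId) pairs that together cover ids 0..max_id.

    Hypothetical helper; max_id (the highest Dataset id) is supplied by the caller.
    """
    if max_id < 0:
        raise ValueError("max_id must not be negative")
    if page_size <= 0:
        raise ValueError("page_size must be positive")

    first = 0
    while True:
        last = min(first + page_size, max_id)
        yield first, last
        if last >= max_id:
            break
        # Reuse the boundary id as the next firstId (inclusivity is an open assumption).
        first = last
```

Each pair can then be substituted into the ``firstId`` and ``lastId`` query parameters of the call above.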
Auditing specific Datasets (comma-separated list)::
curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/admin/datafiles/auditFiles?datasetIdentifierList=doi:10.5072/FK2/JXYBJS,doi:10.7910/DVN/MPU019"
Sample JSON Audit Response::
{
"status": "OK",
"data": {
"firstId": 0,
"lastId": 100,
"datasetIdentifierList": [
"doi:10.5072/FK2/XXXXXX",
"doi:10.5072/FK2/JXYBJS",
"doi:10.7910/DVN/MPU019"
],
"datasetsChecked": 100,
"datasets": [
{
"id": 6,
"pid": "doi:10.5072/FK2/JXYBJS",
"persistentURL": "https://doi.org/10.5072/FK2/JXYBJS",
"missingFileMetadata": [
{
"storageIdentifier": "local://1930cce4f2d-855ccc51fcbb",
"dataFileId": "7"
}
]
},
{
"id": 47731,
"pid": "doi:10.7910/DVN/MPU019",
"persistentURL": "https://doi.org/10.7910/DVN/MPU019",
"missingFiles": [
{
"storageIdentifier": "s3://dvn-cloud:298910",
"directoryLabel": "trees",
"label": "trees.png"
}
]
}
],
"failures": [
{
"datasetIdentifier": "doi:10.5072/FK2/XXXXXX",
"reason": "Not Found"
}
]
}
}
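
Because problems are reported inside a 200 response rather than through the status code, a client has to inspect the body. The following is a minimal Python sketch that condenses an already parsed response shaped like the sample above. ``summarize_audit`` is a hypothetical helper, and the field names are taken from that sample only; keys that are absent are treated as empty.

```python
def summarize_audit(report):
    """Condense a parsed auditFiles response into per-Dataset counts.

    Hypothetical helper; expects the dict obtained from json.loads() on the body.
    """
    data = report.get("data", {})

    datasets = []
    for ds in data.get("datasets", []):
        datasets.append({
            "pid": ds.get("pid"),
            # Either list may be absent when a Dataset has only one kind of problem.
            "missingFiles": len(ds.get("missingFiles", [])),
            "missingFileMetadata": len(ds.get("missingFileMetadata", [])),
        })

    # Datasets that could not be processed at all, with the reported reason.
    failures = [
        (f.get("datasetIdentifier"), f.get("reason"))
        for f in data.get("failures", [])
    ]

    return {
        "datasetsChecked": data.get("datasetsChecked"),
        "datasets": datasets,
        "failures": failures,
    }
```

A summary of this kind could help a superuser decide which Datasets need files deleted or re-uploaded.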
Workflows
~~~~~~~~~