pydataverse upgrade broke OnlineDataverseDataset.upload_file()
#320
What might cause the problem here is a "fix" made for gdcc/pyDataverse#171 that came with v0.3.2 in gdcc/pyDataverse#174.

I am not sure what is happening here. I thought the API of `replace_file` changed, but according to the documentation the identifier it takes is not the identifier of the dataset, but of the file.
I cannot determine what …
And something is out-of-sync with the underlying dataverse API. Here is the validation error I get when intentionally butchering the metadata record:
However, https://guides.dataverse.org/en/latest/api/native-api.html#replacing-files shows that a metadata record like this is possible: `{"description":"My description.","categories":["Data"],"forceReplace":false}`. It has neither of the properties that the pydataverse schema requires. The tests added for the "fix" …
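For reference, the guide's example record parses as ordinary JSON; a quick standalone check, using only the record quoted above:

```python
import json

# The metadata record quoted from the API Guide above; note that JSON
# booleans parse to Python bools, and no other properties are present.
record = json.loads(
    '{"description":"My description.","categories":["Data"],"forceReplace":false}'
)
```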
Indeed, the documentation is just confusing, because it says "... of the dataset" when it is actually of the file. Regarding the rest: trying to reproduce locally now and will get back soon.
I found the bug! Dataverse requires the metadata to be sent as … I will verify that this change does not cause any problems in other areas and will then prepare a pull request to fix this as soon as possible. 😊
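As a hedged illustration of the kind of fix described (the helper name and sample payload are assumptions, not the actual pyDataverse code): Dataverse's native API expects the per-file metadata serialized as a JSON string in a multipart form field named `jsonData`, sent alongside the file content itself.

```python
import json

def build_upload_payload(file_name, file_bytes, metadata):
    # Hypothetical helper: serialize the metadata into the "jsonData" form
    # field and pair it with the file part, as Dataverse's native API expects.
    data = {"jsonData": json.dumps(metadata)}
    files = {"file": (file_name, file_bytes)}
    return data, files

# e.g. requests.post(url, data=data, files=files) would send this as multipart
data, files = build_upload_payload(
    "nonunique1.txt", b"some content", {"description": "demo", "categories": ["Data"]}
)
```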
I've created a draft pull request. Can you verify that it is working correctly now?
With the tentative fix in pydataverse we arrive at 5 remaining failures:
It looks a bit as if the next blocker may be in the present implementation of file removal. Needs more investigation.
Can you provide details of the failed tests? I will be on vacation from today until next week Monday. Happy to get these fixed ASAP when back 😊
Thank you. I will, once I know more and cannot figure it out myself. Happy vacation!
I plan to look at the failing tests this afternoon |
I'll comment here for now, but I'm happy to re-file anything of relevance in more appropriate places. I found an issue with the test setup in `datalad_dataverse/tests/test_pydataverse.py`, lines 161 to 164 (at 0621e2d).
The signature of `upload_datafile` has an optional `json_str` parameter, and calls without it now fail.

(I was also a bit confused about the 200 response, as the API Guide says "You should expect a 201 ("CREATED") response and JSON indicating the database id that has been assigned to your newly uploaded file.", but the datalad-dataverse code was already adjusted to this and checks for a 200.)

Simply adding a random JSON payload to the `upload_datafile` calls makes the tests pass:

diff --git a/datalad_dataverse/tests/test_pydataverse.py b/datalad_dataverse/tests/test_pydataverse.py
index 93e1719..a6a94fd 100644
--- a/datalad_dataverse/tests/test_pydataverse.py
+++ b/datalad_dataverse/tests/test_pydataverse.py
@@ -83,14 +83,16 @@ def check_duplicate_file_deposition(api, dsid, tmp_path):
response = api.upload_datafile(
identifier=dsid,
- filename=tmp_path / 'nonunique1.txt'
+ filename=tmp_path / 'nonunique1.txt',
+ json_str='{"some":"thing"}'
)
# we do not expect issues here
response.raise_for_status()
# now upload the second file with the same content
response = api.upload_datafile(
identifier=dsid,
- filename=tmp_path / 'nonunique2.txt'
+ filename=tmp_path / 'nonunique2.txt',
+ json_str='{"some":"thing"}'
)
response.raise_for_status()
@@ -109,6 +111,7 @@ def check_upload(api, dsid, fcontent, fpath, src_md5):
response = api.upload_datafile(
identifier=dsid,
filename=fpath,
+ json_str='{"some":"thing"}'
)
# worked
assert response.status_code == 200
@@ -161,6 +164,7 @@ def test_file_removal(
response = dataverse_admin_api.upload_datafile(
identifier=dataverse_dataset,
filename=fpath,
+ json_str='{"some":"thing"}'
)
# worked
assert response.status_code == 200, \
@@ -184,6 +188,7 @@ def test_file_removal(
response = dataverse_admin_api.upload_datafile(
identifier=dataverse_dataset,
filename=fpath,
+ json_str='{"some":"thing"}'
)
assert response.status_code == 200, \
f"failed to upload file {response.status_code}: {response.json()}"
The backstory is in #320 (comment). A change outside of datalad-dataverse resulted in a 400 Bad Request whenever pydataverse's upload_file() function was called without passing anything as `json_str`, so that its default value `None` was used. In the pydataverse tests we perform a simplistic upload where this parameter hasn't been used yet. This change adds empty JSON to make the API call succeed.
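A defensive pattern for callers, given the behavior described above, could look like this (the wrapper, the placeholder payload mirroring the `{"some":"thing"}` from the test patch, and the example DOI are all hypothetical):

```python
import json

def upload_kwargs(identifier, filename, metadata=None):
    # Hypothetical wrapper (not part of pyDataverse): always serialize *some*
    # JSON for json_str, since leaving the default None now yields a
    # 400 Bad Request from the server.
    payload = metadata if metadata is not None else {"some": "thing"}
    return {
        "identifier": identifier,
        "filename": str(filename),
        "json_str": json.dumps(payload),
    }

# the DOI below is a made-up example identifier
kwargs = upload_kwargs("doi:10.5072/FK2/EXAMPLE", "nonunique1.txt")
```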
The failure in `test_asdv_invalid_calls` is about the expected error message. What it gets, currently, is a CommandError from `git annex initremote`:
This in turn comes from `datalad_dataverse/dataset.py`, lines 97 to 98 (at 0621e2d).
The status code is 404 Not Found, as should be expected. I'm not sure what changed in between, or where the DOI was previously reported. I'm not sure if I'm making it too easy for myself with this patch:

diff --git a/datalad_dataverse/tests/test_add_sibling_dataverse.py b/datalad_dataverse/tests/test_add_sibling_dataverse.py
index d5b4ed3..bdbac15 100644
--- a/datalad_dataverse/tests/test_add_sibling_dataverse.py
+++ b/datalad_dataverse/tests/test_add_sibling_dataverse.py
@@ -25,7 +25,7 @@ def test_asdv_invalid_calls(
credential="dataverse",
**ckwa
)
- assert 'doi:no-ffing-datalad-way-this-exists not found' in str(ve.value)
+ assert 'Cannot find dataset' in str(ve.value)
@pytest.mark.parametrize("mode", ["annex", "filetree"])
The failing tests in git-annex's testremote are … (or the following, when I use gdcc/pyDataverse#201, where the authentication is moved to the header: …). The relevant code is in `datalad_dataverse/dataset.py`, lines 216 to 223 (at 0621e2d). The API Guide's curl call with an ID succeeds, so maybe the way that pydataverse calls `replace_datafile` is the issue, but I haven't yet figured out how.
The docker compose setup as used in pyDataverse uses a different storage backend than demo.dataverse.org, so the storageIdentifier reports a non-s3:// address. This commit detects the two most common ways of specifying localhost (leaving out 127.1 and the like) to allow for `local://` storageIdentifiers. Relates to datalad#320.
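A sketch of the detection logic described in that commit message (the function name and exact checks are assumptions based on the description, not the actual commit):

```python
from urllib.parse import urlparse

def storage_is_supported(api_url, storage_identifier):
    # Sketch: accept s3:// identifiers everywhere, and local:// identifiers
    # only when the instance runs on localhost -- covering the two most
    # common spellings; variants like 127.1 are deliberately left out, as
    # the commit message notes.
    if storage_identifier.startswith("s3://"):
        return True
    host = urlparse(api_url).hostname
    return (
        storage_identifier.startswith("local://")
        and host in ("localhost", "127.0.0.1")
    )
```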
Upload fix

I used the main branch (0621e2d) and installed pyDataverse from @JR-1991's branch (gdcc/pyDataverse#203).

Local docker compose setup fix

(see #323)

@adswa's observed 400 Bad Request

I checked the error @adswa is facing now, by printing the response text if the error is a 400: …

So it should be better to somehow get this error message into git-annex to debug such issues.

Using

json_str=json.dumps(json.loads(datafile.json()) | {"forceReplace": "true"}),  # yes, the API docs have a "true" string, not a bool

does not seem to solve it and still causes the duplicate content error. I think this is something worth mentioning upstream at Dataverse – a 409 Conflict might be a better suited response code here. Plus, the docs should mention this. This piece of code has not changed in the last four years, so I guess there is something else which now causes this to be treated as an error – in fact, the comment above it even says that this would be a warning, not an error. Do you roughly know since when the test (test_remote.py::test_remote[yes]) has been failing? We might be able to pin it down further by going through the Dataverse code. Otherwise, the best idea might be to either detect this error and delete the file, then upload a new one, or simply not upload a file for which the md5 already matches.

Sidenote

(Unrelated: I also attempted to merge or rebase #280 onto main, but git would refuse to do so due to unrelated histories or various conflicts on the initial commit, and I didn't dig deeper.)
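The dict-union merge used in the `json_str` attempt above can be reproduced standalone (Python 3.9+ `|` operator on dicts; the sample metadata here is made up):

```python
import json

# Sample metadata (made up); datafile.json() in the attempt above returns a
# JSON string, which is deserialized, merged, and re-serialized.
datafile_json = '{"description": "demo", "categories": ["Data"]}'

# Python 3.9+ dict-union: the right-hand side wins on key collisions
merged = json.loads(datafile_json) | {"forceReplace": "true"}

json_str = json.dumps(merged)
```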
Thanks A LOT for all this digging, @shoeffner!
I think I found a mention in the API docs, finally: https://guides.dataverse.org/en/latest/user/dataset-management.html#duplicate-files
In practice, though, this shouldn't affect real use cases. Sadly, there is not an elegant option to alter the tests – the …
This may be small comfort but @shoeffner @JR-1991 and I talked about this issue on Wednesday. It was first on the agenda if you want to try watching some of the recording. I'm honestly having a little trouble following what changed and why the test stopped working. Please feel free to reach out at https://chat.dataverse.org if you want to talk! Also, gdcc/pyDataverse#203 was merged yesterday. Hope it helps!
If I understand it correctly, it appears that since DV 5 it is no longer possible to upload identical file contents for an existing file, but the failing test does just that. And, again if I understand it correctly, it is not easily fixed, as git-annex expects this to work.
Yes, exactly. And the test that fails is git-annex's testremote, which we call in one of our tests. It is git-annex's test suite for special remotes, and four out of 125 tests in this suite try replacing files with identical files and fail. However, my current thinking is that despite these tests failing, it does not affect real-world use. It just needs a workaround in our tests.
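The md5-based workaround suggested earlier (skip the upload when the content is already identical on the remote) could be sketched as follows; the helper names are hypothetical:

```python
import hashlib
import tempfile
from pathlib import Path

def md5_of(path):
    # hash the file content in chunks so large files are not loaded into memory
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def needs_upload(path, remote_md5):
    # skip the upload entirely when the remote already holds a file
    # with the same md5, avoiding the duplicate-content error
    return md5_of(path) != remote_md5

# tiny self-contained demo with a temporary file
tmp = Path(tempfile.mkdtemp()) / "f.txt"
tmp.write_bytes(b"hello")
```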
Closing this, thanks for everyone's work at fixing things from various angles! Everything is released and functioning. |
Thanks for making us aware of the problem! :)
We pass `replace_file()` the following instructions: … It appears that `directoryLabel` gets ignored with `pyDataverse==0.3.3`.