
GGUS129103: UberFTP (a UMD dependency) may destroy data after a failed overwrite attempt #3

Closed
msalle opened this issue May 12, 2020 · 27 comments · Fixed by #5

Comments

@msalle
Member

msalle commented May 12, 2020

From the GGUS ticket; see also JasonAlt#11.
Zdenek mentioned that a fix is easy; Steve offered to put it in EPEL if a patch was provided.

We've noticed that UberFTP deletes a file after an unsuccessful attempt to overwrite the file. In our (SARA-MATRIX) dCache SRM, overwrites have been disabled. Some users try to overwrite anyway, and may thus end up deleting their own data. This only happens with UberFTP. Detailed log follows.

Now, we've considered removing UberFTP, but it appears it's a dependency of the UMD UI packages:

uberftp x86_64 2.8-2.el6 @epel 196 k
Removing for dependencies:
emi-ui x86_64 3.1.0-1.el6 @UMD-3-updates 0.0
glite-ce-cream-cli x86_64 1.15.3-2.el6 @UMD-3-updates 10 M

So that's not an option. I'm worried that UMD depends on software with the potential to accidentally destroy data.

I've posted an issue on GitHub, but I'm not sure whether it is actively maintained. JasonAlt#11

Here's a log:

$ uberftp -debug 9 file:////home/onno/test-onno gsiftp://gridftp.grid.sara.nl:2811/pnfs/grid.sara.nl/data/lsgrid/onnotest2017-06
Debug level set at 3 (ON)
220 GSI FTP door ready
.....
227 OK (145,100,32,147,85,112)
STOR /pnfs/grid.sara.nl/data/lsgrid/onnotest2017-06
Encoded cmd: ENC FwMDAE3/hDwhz9zEHLKvYjOQ3xXkv93C4+Bt1N4KZFmMC24sRmeqwz3ArHnC2vHaVHUMF/7AVMwZDMaBdXcsrie0Jx9emi8zobjFDmQUy0Xp+w==
Encoded resp: 633 FwMDACkAAAAAAAAAC50TX5x1j+iWnBnCNkZoo2wHsiUxayldBlUSjvuARxJVnA==
550 File exists
/pnfs/grid.sara.nl/data/lsgrid/onnotest2017-06: Error with remote service during transfer.
550 File exists
MLST /pnfs/grid.sara.nl/data/lsgrid/onnotest2017-06
Encoded cmd: ENC FwMDAE3/hDwhz9zEHWMRXtjOwYLf6KRjO0wdT0VXat5BD5lU/hZzbwBSUr4ztkdC2AEOfQEY89AZnkm3y0/o75cGX3PYuCaJdwOGoGL9z1PhNw==
Encoded resp: 633 FwMDAT0AAAAAAAAADKiT2FNw2urfbjD0Zrnfkw6c40cjheEcAgMUiJqc5c//hm3PI/Wqx878jvMYGAipYB2pXvkYdEndIiaZYlu3K4zvOJOoMDPyOhHpB3tPawTCba1Afz+DKiZ1bH8HiFTDjKlag0hyU5r4lMX0PEDXkFkqyQcsT+kUAPw0+1PLn3K69H+AOAeNDiO37592oJYw4UwbZtGEyuqDIeKC1BIexXPm8iUzC9SDkEaYJPhuznYeRUc+baNihx7SzkXl/ILPAG789gBUu/RcjfEgJdfCu6rolgZVSGv615cXZCFQPJrc1tpHslNWCIogMFHdTAZzm/TCrNwZBhU1b3KPqt3vl9Z6TLgdk893ZA1+hfbrGIJoSb4dXJA4jjZ0E1MAAeIup1OScGoxXtA7/cHCUem3GjQ+8rt7ww==
250- Listing /pnfs/grid.sara.nl/data/lsgrid/onnotest2017-06
Perm=dr;UNIX.group=lsgrid;Modify=20170613133017;UNIX.owner=lsgrid;UNIX.gid=31883;Unique=0000C90F1929011A4B87B21F748848EF353B;Size=12;UNIX.uid=36494;Type=file;UNIX.mode=0644; /pnfs/grid.sara.nl/data/lsgrid/onnotest2017-06
250 End
DELE /pnfs/grid.sara.nl/data/lsgrid/onnotest2017-06
Encoded cmd: ENC FwMDAE3/hDwhz9zEHhepf0zuK9jolLqo+xKDcly0F/h7fJVLkJVepA0Yod0ATmqNljLbvd3Ua+jvbdr8tffEs2FBsxD20DGuqv+BXfQxPpDobg==
Encoded resp: 633 FwMDACAAAAAAAAAADfTjvoGq5uXp5P0GMT/8oHwx0VZxS3YpHw==
250 OK 
@fscheiner
Member

@msalle
Relevant code according to the info from the GGUS ticket (timestamp: 2017-06-22 | 15:35) is in https://github.com/gridcf/UberFTP/blob/master/cmds.c#L3552..L3574.

The assumption seems to be that if the "assumed to be transferred" file exists remotely after a transfer error, it is a partially transferred file and should be removed.

UberFTP should maybe check whether a file with a name identical to the "to be transferred" file already exists remotely prior to a transfer, and if that's the case, not remove it after a transfer error.

This could work (1) for the configuration mentioned in the GGUS ticket (where files cannot be overwritten, but can be deleted) and (2) in other cases would leave a partially transferred file behind at the remote location. The latter could be useful because a user could then continue the transfer at an offset if that offset is known, though I don't know if the UberFTP UI allows partial transfers or transfers from offset positions.
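For illustration, the proposed check could look roughly like this (a minimal C sketch; remote_file_exists(), do_transfer() and remote_delete() are hypothetical helpers, not actual UberFTP functions):

extern int remote_file_exists(const char *path);   /* hypothetical */
extern int do_transfer(const char *path);          /* hypothetical */
extern int remote_delete(const char *path);        /* hypothetical */

int transfer_with_safe_cleanup(const char *remote_path)
{
    /* Remember whether the target already existed before we start. */
    int preexisting = remote_file_exists(remote_path);

    if (do_transfer(remote_path) != 0) {
        /* Only clean up files we created ourselves; a pre-existing
         * file is the user's data and must be left alone. */
        if (!preexisting)
            remote_delete(remote_path);
        return -1;
    }
    return 0;
}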

@msalle
Member Author

msalle commented May 20, 2020

I'm a bit confused. Looking at cmds.c lines 3442-3451 I would think that a failure to write would directly jump to the cleanup, without setting delfile to 1 in line 3454. As far as I can see there is no other place where delfile is set to 1.
I guess this will take a bit more debugging...

@msalle
Member Author

msalle commented May 20, 2020

Small update: I just tried it, and indeed it does do the delete, but the problem should be fixed in a different place: it's the write in line 3485 that results in the "550 File exists". So in lines 3486-3497, we need to reset delfile to 0.
I'll create a 1-line pull request.
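(Schematically, the intended one-liner would look like this; delfile is the actual variable in cmds.c, but the surrounding code is paraphrased rather than verbatim:)

rc = l_write(...);      /* this is what fails with "550 File exists" */
if (rc != 0) {
    delfile = 0;        /* proposed fix: the STOR never overwrote
                           anything, so do not DELE the existing file */
    goto cleanup;
}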

@msalle
Member Author

msalle commented May 20, 2020

Unfortunately it turns out to be quite a bit more complicated. The error is not coming back directly from the l_write() unless we're running very slowly and in a debugger; instead we see it in the l_close() in line 3520 or in a later l_write(). For big files it would typically be in ftp.c line 780, which then returns the 550 (code is 550 and resp is "550 File exists") and creates an errcode_t containing just the latter (resp) in lines 789-792.
That basically leaves us with the only option of doing a strcmp on the resp to see whether it matches "550 File exists\r\n". I thought of doing a stat beforehand and not deleting when the file already existed, but in case we've written half a file, it should still clean up.

msalle added a commit to msalle/UberFTP that referenced this issue May 20, 2020
This is a simple and not very clean fix for
gridcf/issues/3
Either one of the l_write() calls or the l_close() can fail if the remote file
already exists, since the data goes over a different channel than the control
messages. For large files it will typically be one of the later l_write()
calls, for small files probably the l_close(). Hence we effectively only
have the errmsg to find out what happened, and therefore do a simple str(n)cmp
on this errmsg. For this we also need to add a function to get it from the
opaque errcode_t.
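(A minimal sketch of that approach; ec_errmsg() stands in for the accessor the commit adds to the opaque errcode_t and may not match the patch verbatim:)

#include <string.h>

const char *ec_errmsg(errcode_t ec);   /* assumed accessor added by the patch */

static int is_file_exists_error(errcode_t ec)
{
    const char *msg = ec_errmsg(ec);

    /* The stored response looks like "550 File exists\r\n"; comparing
     * only the prefix keeps the trailing CRLF out of the comparison. */
    return msg != NULL && strncmp(msg, "550 File exists", 15) == 0;
}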
@msalle
Member Author

msalle commented May 22, 2020

Involving @paulmillar
It seems there is no way to get information on whether the server allows overwriting files. Is that correct? I tried both gridftp.grid.sara.nl, which does not allow overwrite, and prometheus.desy.de, which does allow it, and both respond identically to quote FEAT.
Also, the question is whether we should skip the delete only when the server returns 550: are there other codes for which we should not delete the already-written data? Or should we, conversely, delete only on certain codes and not by default?

@fscheiner
Member

fscheiner commented May 22, 2020

@msalle
How does globus-url-copy behave? I assume it keeps the remote file, which is useful for retries from offset positions (especially if the file is huge). Also, in a situation where the control connection also breaks, there's no way to send that "remove (existing) file" command anyhow. So better not try to remove files/data on the remote side at all, but only if explicitly requested by "rm" or "mv".

EDIT: So better not try to remove files/data on the remote side unexpectedly, but only if explicitly requested by "rm" or "mv".

@msalle
Member Author

msalle commented May 22, 2020

Hi @fscheiner, I think globus-url-copy is also deleting, see e.g. globus_ftp_client_register_write() in particular around line 464. On the other hand the flow there seems to be much more subtle, keeping track of blocks written etc.
Not deleting the data for any action is not per se such a good idea. You might end up with half-written files, e.g. when you reach an out-of-space condition.

@paulmillar

It seems there is no way to get information on whether the server allows overwriting files. Is that correct?

Correct.

I tried both gridftp.grid.sara.nl, which does not allow overwrite, and prometheus.desy.de, which does allow it, and both respond identically to quote FEAT.

Correct. FEAT does not describe whether overwrite is disabled.

Also, the question is whether we should skip the delete only when the server returns 550: are there other codes for which we should not delete the already-written data? Or should we, conversely, delete only on certain codes and not by default?

Here are my thoughts on how to fix this.

  1. Disallow DELE in dCache.

The disallow overwrite option in dCache is an (IMHO misguided) attempt to prevent clients from deleting data by mistake by forcing them to "explicitly" delete the data. However, nothing stops clients from doing what UberFTP does: falling back to issuing DELE if the upload fails, and doing this automatically. Therefore, the current approach is flawed.

IMHO, it's fundamentally flawed: this disallow overwrite feature should be removed.

However, the self-consistent approach would be for dCache to disallow DELE, too.

  2. Remove the DELE fall-back in the client.

If the upload fails (for whatever reason) the client should just fail. Attempting to DELE the file is (at best) a work-around for something broken. At worst, it's an attempt by the client to "game" the permission model. Instead, the client should simply fail if the initial upload is unsuccessful.

This is the closest match to dCache's intention with the disallow overwrite: that clients should simply fail if the (initial) upload fails.

  3. The client retries the upload.

The client is attempting to overwrite an existing file. Because of dCache's (somewhat questionable) permission model, the initial upload fails. The client reacts by deleting the file. Since the client is attempting to upload a file and overwrite any existing content, this is fine. However, the client should then complete the task and retry uploading the file.

@msalle
Member Author

msalle commented May 25, 2020

Hi Paul,
thanks for the answer!
Concerning 3: currently the code retries for codes in the 400 range and fails for those in the 500 range, see https://github.com/gridcf/UberFTP/blob/master/ftp.c#L70_L73, which is not unreasonable looking at RFC 959 sections 4.2.1 and 4.2.2. If the server returned a 450 instead, it would make sense.
Instead just never deleting (your option 2) is certainly easier to implement, but then the question is whether we could get stuck with a half-written file, e.g. in the case of a no-space-left error.
I think what we have implemented now is effectively option 1 on the client side. The question is then whether doing this only for the 550 is sufficient.
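(For reference, the retry classification in ftp.c described above boils down to something like the following; this paraphrases the logic, it is not the verbatim source:)

/* 4xy responses are transient: retrying the same command may succeed.
 * 5xy responses are permanent: fail without retrying. */
static int should_retry(int code)
{
    return code >= 400 && code <= 499;
}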

@paulmillar

Hi Mischa,

I think you and I have different understandings of 4xy vs 5xy return codes.

My understanding of a 4xy code is that repeating the operation could succeed even without this client doing anything. Therefore, it would be reasonable for the client to simply retry the request (i.e., without making any other changes), perhaps after waiting a short period. The server being overloaded (e.g., too many concurrent transfers) or the storage node (containing the file's data) being restarted are examples where simply retrying could lead to the request succeeding.

In contrast, a 5xy code is permanent in the sense that the outcome will not change without the client doing something differently. In some cases, there are no circumstances in which the request can succeed (e.g., the command is not implemented or the invocation has a syntax error). In other cases, admin- or perhaps even user interaction can result in a subsequent invocation of the command succeeding.

To give a couple of examples, consider a client attempting to delete a file without authenticating.

DELE /path/to/file
530 Not logged in
USER paul
331 Password required for paul
PASS TooManySecrets
230 User paul logged in
DELE /path/to/file
250 OK

Here the DELE command initially returns 530. It doesn't matter how many times the client repeats the DELE command, all unauthenticated DELE requests will fail -- this is a "permanent negative response". However, when the client authenticates, the situation is different and the DELE request now succeeds.

Another example, attempting to download a file that does not exist.

RETR /path/to/file
550 File not found
STOR /path/to/file
150 Opening BINARY data connection for /path/to/file
226 Transfer complete.
RETR /path/to/file
150 Opening BINARY data connection for /path/to/file
226 Transfer complete.

Without there being a file at the requested path, the initial RETR command fails and will always fail (permanent negative response). However, after this client (or, indeed, any other client) has successfully uploaded a file at that path, the RETR command succeeds.

So, I think it is reasonable to return 550 when the client is attempting to overwrite the file since the failure is not transitory -- direct repeating of the command will not help. However, the client is free to try to change the environment (in this case, by deleting the file) and then retry the upload.

Instead just never deleting (your option 2) is certainly easier to implement, but then the question is whether we could get stuck with a half-written file, e.g. in the case of a no-space-left error.

This is certainly a legitimate concern.

My suggestion would be that if the request fails before receiving the 150 intermediate response, then the client should not delete the file. If the client receives a 150 intermediate response, then it is free to delete the file on an upload failure.

(As an aside: dCache has a feature where it can automatically delete partially uploaded files. This feature is enabled by default, so this isn't such a concern when uploading to dCache.)

A client could also take the view that removal of partial uploads requires a human decision. A partial upload could be useful as-is. Additionally, FTP (unlike HTTP) allows the client to append to an existing file's content, rather than sending the entire file again.

HTH,
Paul.

@msalle
Member Author

msalle commented May 26, 2020

Hi Mischa,

I think you and I have different understandings of 4xy vs 5xy return codes.

My understanding of a 4xy code is that repeating the operation could succeed even without this client doing anything. Therefore, it would be reasonable for the client to simply retry the request (i.e., without making any other changes), perhaps after waiting a short period. The server being overloaded (e.g., too many concurrent transfers) or the storage node (containing the file's data) being restarted are examples where simply retrying could lead to the request succeeding.

In contrast, a 5xy code is permanent in the sense that the outcome will not change without the client doing something differently. In some cases, there are no circumstances in which the request can succeed (e.g., the command is not implemented or the invocation has a syntax error). In other cases, admin- or perhaps even user interaction can result in a subsequent invocation of the command succeeding.

I agree that is the intention of the split between the 4xy and 5xy codes, and also that a 5xy code is the correct return code from the perspective of the server. However, from the point of view of the code, I would effectively need to treat the 550 as a non-permanent error once I've deleted the (pre-existing) file. Also, since I'm thereby bypassing the server's non-overwrite setting without any questions or interaction with the user, I think this would not really be what we should be doing.

[...]

So, I think it is reasonable to return 550 when the client is attempting to overwrite the file since the failure is not transitory -- direct repeating of the command will not help. However, the client is free to try to change the environment (in this case, by deleting the file) and then retry the upload.

I agree with your first statement, but - as above - I don't think I agree a client tool should do this on its own. It's like I chmod a file a-w, and then someone tries to overwrite it with a tool that says, "Oh, I can't, let's just silently chmod it back to a+w..." (-;

Instead just never deleting (your option 2) is certainly easier to implement, but then the question is whether we could get stuck with a half-written file, e.g. in the case of a no-space-left error.

This is certainly a legitimate concern.

My suggestion would be that if the request fails before receiving the 150 intermediate response, then the client should not delete the file. If the client receives a 150 intermediate response, then it is free to delete the file on an upload failure.

This is a little bit tricky, since there are quite a few places where the response is handled, and in some cases several are handled in a row in the l_close() (in _f_get_final_resp() to be precise). Keeping track of whether or not a 150 has come by will not be a very simple or clean patch, since we would need to have it flow all the way back up to _c_xfer_file() via some new EC_FLAG_ and a retrieval function similar to ec_retry().
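(Sketched out, that plumbing would look something like this; EC_FLAG_GOT_150 and ec_got_150() are hypothetical names modeled on the existing EC_FLAG_/ec_retry() pattern, and the flag bit is arbitrary:)

#define EC_FLAG_GOT_150 (1 << 3)   /* hypothetical: a 150 intermediate
                                      response was seen for this transfer */

/* Wherever responses are parsed (e.g. in _f_get_final_resp()): */
if (code == 150)
    ec->flags |= EC_FLAG_GOT_150;

/* Hypothetical retrieval function, analogous to ec_retry(): */
int ec_got_150(errcode_t ec)
{
    return (ec->flags & EC_FLAG_GOT_150) != 0;
}

/* Back up in _c_xfer_file(): only delete when the upload actually
 * started, i.e. a 150 was received before the failure. */
if (transfer_failed && ec_got_150(ec))
    delfile = 1;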

(As an aside: dCache has a feature where it can automatically delete partially uploaded files. This feature is enabled by default, so this isn't such a concern when uploading to dCache.)

A client could also take the view that removal of partial uploads requires a human decision. A partial upload could be useful as-is. Additionally, FTP (unlike HTTP) allows the client to append to an existing file's content, rather than sending the entire file again.

Right, so that would be a good excuse for just never removing the file, which certainly would be the easiest solution; it would just mean removing the block https://github.com/gridcf/UberFTP/blob/master/cmds.c#L3552_L3574 plus the definition and setting of delfile.
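(Schematically, the change is then just a deletion; paraphrased, the exact code is at the cmds.c link above:)

-    int delfile = 0;                /* remove the flag definition */
...
-    delfile = 1;                    /* remove the place(s) setting it */
...
-    if (delfile) {
-        /* remove the whole block that issues DELE on the destination */
-    }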

msalle added a commit to msalle/UberFTP that referenced this issue May 26, 2020
This is an even simpler fix for gridcf/issues/3. It just disables the deletion
of the destination file when writing has failed.
@paulmillar

OK, I think we're converging on the client only deleting a file if the user explicitly requests that file be deleted.

I think that should be fine.

@fscheiner
Member

@msalle

sorry, my response might be late but still relevant

Hi @fscheiner, I think globus-url-copy is also deleting, see e.g. globus_ftp_client_register_write() in particular around line 464.

Just had a look into that: for me it doesn't look like globus_l_ftp_client_data_delete(data) actually deletes a remote file; looking at its definition, it seems to just free the local buffer - unless I misunderstood the code here.

Not deleting the data for any action is not per se such a good idea.

I am just reluctant to let the tools do something unexpected. And deleting a remote file during a PUT is highly unexpected IMHO unless a user wants to explicitly overwrite an existing file.

You might end up with half-written files, e.g. when you reach an out-of-space condition.

Doesn't a (Grid)FTP server check whether there's enough disk space left before accepting a PUT (if the client sent an ALLO with the file size in bytes beforehand), and shouldn't it then also reserve that space somehow?

Oh, apparently not:

[root@gridftp-5 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_gridftp4-lv_root
                      6.5G  1.1G  5.1G  18% /
tmpfs                 939M     0  939M   0% /dev/shm
/dev/vda1             477M  108M  344M  24% /boot

[johndoe@gridftp-client-2 ~]$ dd if=/dev/zero of=10G-sparse bs=1G seek=10 count=0
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000460967 s, 0.0 kB/s

[johndoe@gridftp-client-2 ~]$ ls -lah
total 4.0M
drwx------. 11 johndoe johndoe 4.0K Jun 18 12:42 .
drwxr-xr-x.  3 root    root    4.0K Feb  2  2017 ..
-rw-rw-r--.  1 johndoe johndoe  10G Jun 18 12:42 10G-sparse
[...]

[johndoe@gridftp-client-2 ~]$ globus-url-copy -dbg file:///$PWD/10G-sparse gsiftp://gridftp-5/~/
[...]
debug: sending command to gsiftp://gridftp-5/~/10G-sparse:
ALLO 10737418240

debug: response from gsiftp://gridftp-5/~/10G-sparse:
200 ALLO command successful.

debug: sending command to gsiftp://gridftp-5/~/10G-sparse:
STOR ~/10G-sparse

debug: response from gsiftp://gridftp-5/~/10G-sparse:
150 Beginning transfer.

debug: writing buffer 0x7f01df2d0010, length 1048576, offset=0, eof=false
debug: data callback, no error, buffer 0x7f01df2d0010, length 1048576, offset=0, eof=false
debug: writing buffer 0x7f01d9aeb010, length 1048576, offset=1048576, eof=false
debug: data callback, no error, buffer 0x7f01d9aeb010, length 1048576, offset=1048576, eof=false
[...]

...not sure why. :-/

@fscheiner
Member

@paulmillar

(As an aside: dCache has a feature where it can automatically delete partially uploaded files. This feature is enabled by default, so this isn't such a concern when uploading to dCache.)

Does this also work for intended partial transfers (i.e. transfers from offset positions)? How does dCache detect that?

@paulmillar

Hi @fscheiner

Does [automatically delete partially uploaded files] also work for intended partial transfers (i.e. transfers from offset positions)? How does dCache detect that?

We have a very simple solution: partial uploads are not supported.

dCache is an immutable storage system, in the sense that, once the initial writing of a file is complete, the file cannot be modified. So (for example) a client cannot append the missing data to an interrupted upload, but must start again and upload the complete file.

Therefore, it doesn't really make any sense to allow FTP clients to upload part of a file, since the client wouldn't be able to upload the remaining data.

In concrete terms, dCache will reject ESTO commands with non-zero offsets and it doesn't have any support for the RESTART command.

@fscheiner
Member

@paulmillar
Interesting approach. Now very huge files (let's say 100 GiB and more, though not sure if such files are practical or in use) and interrupted transfers (at 99% of the transfer process or so ;-)) on slow connections come to mind. Is that only a theoretical problem?

@msalle
Member Author

msalle commented Jun 18, 2020

@msalle

sorry, my response might be late but still relevant

Indeed, I basically put this work on hold to wait for you to return (-:

Hi @fscheiner, I think globus-url-copy is also deleting, see e.g. globus_ftp_client_register_write() in particular around line 464.

Just had a look into that: for me it doesn't look like globus_l_ftp_client_data_delete(data) actually deletes a remote file; looking at its definition, it seems to just free the local buffer - unless I misunderstood the code here.

I've certainly seen it delete a file, but it depends on the server settings:

  • put a file
  • put a different file with the same name
    -> error is returned
    -> file is gone

Not deleting the data for any action is not per se such a good idea.

I am just reluctant to let the tools do something unexpected. And deleting a remote file during a PUT is highly unexpected IMHO unless a user wants to explicitly overwrite an existing file.

I agree, in particular since the server tries NOT to overwrite, so deleting is even worse...

I think we should go for plan C and never delete anything. Only remaining question: should we delete the relevant blocks or comment them out using #if 0?

@fscheiner
Member

@msalle
sorry, my response might be late but still relevant

Indeed, I basically put this work on hold to wait for you to return (-:

I see, sorry for delaying this then. :-)

Hi @fscheiner, I think globus-url-copy is also deleting, see e.g. globus_ftp_client_register_write() in particular around line 464.

Just had a look into that: for me it doesn't look like globus_l_ftp_client_data_delete(data) actually deletes a remote file; looking at its definition, it seems to just free the local buffer - unless I misunderstood the code here.

I've certainly seen it delete a file, but it depends on the server settings:

* put a file

* put a different file with the same name
  -> error is returned
  -> file is gone

That happens with a globus-gridftp-server? I need to check that then...

Not deleting the data for any action is not per se such a good idea.

I am just reluctant to let the tools do something unexpected. And deleting a remote file during a PUT is highly unexpected IMHO unless a user wants to explicitly overwrite an existing file.

I agree, in particular since the server tries NOT to overwrite, so deleting is even worse...

I think we should go for plan C and never delete anything. Only remaining question: should we delete the relevant blocks or comment them out using #if 0?

Ah, I answered or actually asked for that in the corresponding PR itself, before reading this. :-) For the record, I'd just remove the relevant code, as it will still be kept in the history.

@msalle
Member Author

msalle commented Jun 18, 2020

I've certainly seen it delete a file, but it depends on the server settings:

* put a file

* put a different file with the same name
  -> error is returned
  -> file is gone

That happens with a globus-gridftp-server? I need to check that then...

I don't think so; the original GGUS ticket says "In our (SARA-MATRIX) dCache SRM, overwrites have been disabled." I presume it's still a dCache SRM.

Ah, I answered or actually asked for that in the corresponding PR itself, before reading this. :-) For the record, I'd just remove the relevant code, as it will still be kept in the history.

Makes sense, but will await response from the others...

@fscheiner
Member

I've certainly seen it delete a file, but it depends on the server settings:

* put a file

* put a different file with the same name
  -> error is returned
  -> file is gone

That happens with a globus-gridftp-server? I need to check that then...

I don't think so; the original GGUS ticket says "In our (SARA-MATRIX) dCache SRM, overwrites have been disabled." I presume it's still a dCache SRM.

Then maybe it happens somewhere else in guc.

Ah, I answered or actually asked for that in the corresponding PR itself, before reading this. :-) For the record, I'd just remove the relevant code, as it will still be kept in the history.

Makes sense, but will await response from the others...

Sure, that was also my intention.

@paulmillar

Interesting approach. Now very huge files (let's say 100 GiB and more, though not sure if such files are practical or in use) and interrupted transfers (at 99% of the transfer process or so ;-)) on slow connections come to mind. Is that only a theoretical problem?

There are large files being sent, since many files are stored on tape and tape works best with files in the multi-gigabyte range. Particle physics is in the enviable position that they are largely free to choose how big their files are.

I agree this is potentially a problem. However, it's not a problem in practice; at least, not for WLCG.

WLCG has invested in good networking connections. There is LHCOPN, plus the many national and international networks within LHCONE.

There are still problems; however, they tend to result in either zero bytes transferred (e.g., firewall problems), the entire file transferred but the result is rejected (e.g., software bug), or the transfer is too slow and killed by the agent requesting the transfer for not making sufficient progress (e.g., faulty network hardware). In the last case, the infrastructure can try a different source or destination; in effect, routing around the bad network. In all these cases, resuming an upload doesn't help.

However, this is a problem when data is transferred over the more unreliable, public networks; for example, uploads with sync-n-share clients (ownCloud/nextCloud) and volunteer computing (BOINC). These have solved this problem for HTTP uploads in broadly the same way, but (unfortunately) with incompatible solutions.

@paulmillar

@msalle,

I don't think so; the original GGUS ticket says "In our (SARA-MATRIX) dCache SRM, overwrites have been disabled." I presume it's still a dCache SRM.

I think I understand this comment.

dCache used to support a configuration option where it would outright reject (or veto) attempts by an SRM client to overwrite existing files (like with GridFTP). For reference, this is the srmPrepareToPut call. This behaviour (whether overwriting is allowed) may be controlled independently of the (Grid)FTP overwrite protection, although it was done in a way that made it easy to have dCache react in the same way to overwrite requests from either FTP or SRM clients.

This SRM overwrite option was (IMHO) even more stupid than the FTP option. In the SRM protocol, the srmPrepareToPut call has an option where the SRM client says explicitly whether or not the upload should overwrite any existing file. Overwriting only happens if the SRM client says (explicitly) that it wants this to happen (the default is not to overwrite). Therefore it makes absolutely no sense for dCache to veto that statement: it just forces the client to implement a hacky work-around, like checking if the file exists, deleting it and then uploading the file.

Therefore, I simply removed that option: dCache honours the SRM client's decision on whether an srmPrepareToPut call should overwrite an existing file.

All of this is just to say that SRM behaviour has no impact on how dCache reacts to an FTP client.

@msalle
Member Author

msalle commented Jun 19, 2020

Ah, @fscheiner, I just realise I misread your comment: no, I actually have not tried it with globus-url-copy as far as I remember, but I did have a look at the code. Will look at that again. For UberFTP I did what I said before: upload a file, then upload a different file with the same name, and then see the original one being deleted. Those tests were against two different dCache instances: prometheus.desy.de and the one from SurfSARA mentioned in the ticket.
And @paulmillar, I'm actually a bit confused: the original GGUS ticket is exclusively about GridFTP, right? So where do the SRM and srmPrepareToPut come in?

@msalle
Member Author

msalle commented Jun 19, 2020

Hi @fscheiner, I think globus-url-copy is also deleting, see e.g. globus_ftp_client_register_write() in particular around line 464.

Just had a look into that: for me it doesn't look like globus_l_ftp_client_data_delete(data) actually deletes a remote file; looking at its definition, it seems to just free the local buffer - unless I misunderstood the code here.

Yes, you're right, that's indeed not doing any deleting itself. I guess the best thing is to try it out with globus-url-copy, which I have now just done, and I can confirm it does not delete on the SurfSARA dCache (and returns a

error: globus_ftp_client: the server responded with an error
550 File exists

error instead), while on prometheus.desy.de it just silently overwrites.

@paulmillar

@msalle

And @paulmillar, I'm actually a bit confused: the original GGUS ticket is exclusively about GridFTP, right? So where do the SRM and srmPrepareToPut come in?

Yes.

Sorry, I picked up the wrong end of the stick. My point was that "SRM" is only a distraction: this issue is only about (Grid)FTP. Please forget everything I mentioned about SRM.

@msalle
Member Author

msalle commented Jun 19, 2020

Ah, clear, thanks! So the conclusion so far:

  • we fix UberFTP to never delete on failure
  • globus-url-copy already does that

Question remaining: should we remove the delete code altogether or just comment it out?

msalle added a commit to msalle/UberFTP that referenced this issue Jun 22, 2020
Never delete the destination file when writing has failed. That is the safest
and easiest solution and matches the behaviour of other tools such as
globus-url-copy.
@msalle
Member Author

msalle commented Jun 22, 2020

OK, so I decided to make a new PR #5 that indeed just removes the code: it's so little code that it's easy to find again. I think we should just merge it.

@msalle msalle linked a pull request Jun 22, 2020 that will close this issue
fscheiner added a commit that referenced this issue Jun 22, 2020
Simple clean fix for github issue #3
msalle added a commit to msalle/UberFTP that referenced this issue Jun 23, 2020
Never delete the destination file when writing has failed. That is the safest
and easiest solution and matches the behaviour of other tools such as
globus-url-copy.
msalle added a commit to msalle/UberFTP that referenced this issue Jul 2, 2020
Never delete the destination file when writing has failed. That is the safest
and easiest solution and matches the behaviour of other tools such as
globus-url-copy.