GGUS129103: UberFTP (a UMD dependency) may destroy data after a failed overwrite attempt #3
@msalle The assumption seems to be that if the "assumed to be transferred" file exists remotely after a transfer error, it is a partially transferred file and should be removed. UberFTP should perhaps check whether a file with a name identical to the "to be transferred" file exists remotely prior to a transfer and, if so, not remove it after a transfer error. This would work (1) for the configuration mentioned in the GGUS ticket (where files cannot be overwritten but can be deleted) and (2) in other cases would leave a partially transferred file behind at the remote location. The latter could be useful because a user could then continue the transfer at an offset, if that offset is known, though I don't know if the UberFTP UI allows partial transfers or transfers from offset positions.
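A minimal sketch of that pre-transfer check. The helper names (`remote_file_exists`, `do_transfer`, `remote_delete`) are hypothetical stand-ins, not UberFTP functions:

```c
#include <stdbool.h>

/* Hypothetical helpers standing in for UberFTP internals. */
bool remote_file_exists(const char *path);           /* e.g. via SIZE/MLST */
int  do_transfer(const char *src, const char *dst);  /* 0 on success */
int  remote_delete(const char *path);                /* issues DELE */

int
transfer_with_safe_cleanup(const char *src, const char *dst)
{
    /* Remember whether the destination already existed. */
    bool preexisting = remote_file_exists(dst);

    int rc = do_transfer(src, dst);
    if (rc != 0 && !preexisting) {
        /* Only clean up a partial file that we ourselves created;
         * never remove a file that was there before the transfer. */
        remote_delete(dst);
    }
    return rc;
}
```

Note that when the destination pre-existed, this leaves any partial upload in place, which is exactly case (2) above.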
I'm a bit confused. Looking at cmds.c lines 3442-3451, I would think that a failure to write would directly jump to the cleanup, without setting …
Small update: I just tried it, and indeed it does do the delete, but the problem should be fixed in a different place: it's the write in line 3485 that results in the …
Unfortunately it turns out to be quite a bit more complicated. The error is not coming back directly from the …
This is a simple and not very clean fix for gridcf/issues/3. Either the l_write() or the l_close() can fail if the remote file already exists, since the data goes over a different channel than the control messages. For large files, it will typically be one of the later l_write() calls; for small files, probably the l_close(). Hence we effectively only have the errmsg to find out what happened, and therefore do a simple str(n)cmp against this errmsg. For this we also need to add a function to get it from the opaque errcode_t.
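A sketch of that string comparison. The accessor name `ec_errmsg()`, the shape of the typedef, and the exact reply text matched are all assumptions; the real patch adds a similar getter for the opaque errcode_t:

```c
#include <string.h>

typedef struct errcode_s *errcode_t;   /* assumed shape of the opaque handle */
const char *ec_errmsg(errcode_t ec);   /* assumed new accessor for the message */

/* Returns non-zero if the failure looks like a refused overwrite. */
static int
is_overwrite_refusal(errcode_t ec)
{
    const char *msg = ec_errmsg(ec);

    /* The refusal is only visible in the 550 reply text, because the
     * data travels over a different channel than the control replies. */
    return msg != NULL && strncmp(msg, "550", 3) == 0
                       && strstr(msg, "exists") != NULL;
}
```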
Involving @paulmillar
@msalle EDIT: So better not to try to remove files/data on the remote side unexpectedly, but only when explicitly requested by "rm" or "mv".
Hi @fscheiner, I think globus-url-copy is also deleting, see e.g. …
Correct.
Correct. FEAT does not describe whether overwrite is disabled.
Here are my thoughts on how to fix this.
The disallow-overwrite option in dCache is a (IMHO misguided) attempt to prevent clients from deleting data by mistake, by forcing them to "explicitly" delete the data. However, nothing stops clients from doing what UberFTP does: falling back to issuing DELE if the upload fails, and doing so automatically. Therefore, the current approach is flawed. IMHO, it's fundamentally flawed: this disallow-overwrite feature should be removed. Failing that, the self-consistent approach would be for dCache to disallow DELE, too.
If the upload fails (for whatever reason), the client should just fail. Attempting to DELE the file is (at best) a work-around for something broken. At worst, it's an attempt by the client to "game" the permission model. Instead, the client should simply fail if the initial upload is unsuccessful. This is the closest match to dCache's intention with the disallow-overwrite: that clients should simply fail if the (initial) upload fails.
The client is attempting to overwrite an existing file. Because of dCache's (somewhat questionable) permission model, the initial upload fails. The client reacts by deleting the file. Since the client is attempting to upload a file and overwrite any existing content, this is fine. However, the client should then complete the task and retry uploading the file.
Hi Paul, …
Hi Mischa, I think you and I have different understandings of the negative FTP reply codes. My understanding is that a 5xy reply is a "permanent negative response": repeating the exact same command, with nothing else changing, will fail in the same way every time. In contrast, a 4xy reply is a transient negative response: repeating the same command may later succeed. To give a couple of examples, consider a client attempting to delete a file without authenticating.
Here the DELE command initially returns 530. It doesn't matter how many times the client repeats the DELE command; all unauthenticated DELE requests will fail -- this is a "permanent negative response". However, when the client authenticates, the situation is different and the DELE request now succeeds. Another example: attempting to download a file that does not exist.
Without there being a file at the requested path, the initial download fails; once such a file exists, the same request succeeds. So, I think it is reasonable to return a 5xy code (550) when an upload is refused because overwriting is disallowed: repeating the same upload will keep failing for as long as the existing file is in place.
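As an illustration of that 4xy/5xy split (a minimal sketch, not code from this discussion), a client-side classifier might look like this:

```c
/* Whether repeating the exact same command can ever help. */
enum reply_class {
    REPLY_SUCCESS,     /* 2xy: command completed */
    REPLY_TRANSIENT,   /* 4xy: retrying the same command may succeed */
    REPLY_PERMANENT,   /* 5xy: fails until the server-side state changes */
    REPLY_OTHER        /* 1xy preliminary, 3xy intermediate replies */
};

static enum reply_class
classify_reply(int code)
{
    switch (code / 100) {
    case 2:  return REPLY_SUCCESS;
    case 4:  return REPLY_TRANSIENT;
    case 5:  return REPLY_PERMANENT;  /* e.g. 530 unauthenticated, 550 */
    default: return REPLY_OTHER;
    }
}
```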
This is certainly a legitimate concern. My suggestion would be that if the request fails before receiving the … (As an aside: dCache has a feature where it can automatically delete partially uploaded files. This feature is enabled by default, so this isn't such a concern when uploading to dCache.) A client could also take the view that removal of partial uploads requires a human decision. A partial upload could be useful as-is. Additionally, FTP (unlike HTTP) allows the client to append to an existing file's content, rather than sending the entire file again. HTH,
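To illustrate the append point above: in FTP, a client can resume an interrupted upload with REST (RFC 3659) followed by STOR. This sketch uses hypothetical `send_cmd()`/`send_data_from_offset()` helpers for the control and data channels; they are not real UberFTP or Globus functions:

```c
/* Hypothetical helpers: send a control-channel command and return the
 * numeric reply code; stream the rest of a local file over the data
 * channel starting at `offset`. */
int send_cmd(const char *fmt, ...);
int send_data_from_offset(const char *local_path, long offset);

/* Resume an upload at `offset` bytes (the size already stored remotely). */
int
resume_upload(const char *local_path, const char *remote_path, long offset)
{
    /* REST sets the server-side restart marker; the following STOR
     * then writes from that offset instead of truncating the file. */
    if (send_cmd("REST %ld", offset) / 100 != 3)      /* expect 350 */
        return -1;
    if (send_cmd("STOR %s", remote_path) / 100 != 1)  /* expect 150 */
        return -1;
    return send_data_from_offset(local_path, offset);
}
```

(As noted later in this thread, dCache rejects non-zero restart offsets, so this only works against servers that support restartable stores.)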
I agree that is the intention of the split between the 4xy and 5xy codes, and also that a 5xy code is the correct return code from the perspective of the server. However, from the point of view of the client code, I would effectively need to treat the 550 as a non-permanent error once I've deleted the (pre-existing) file. Also, since I'm thereby bypassing the server's non-overwrite setting without any questions or interaction with the user, I think this would not really be what we should be doing. [...]
I agree with your first statement, but -- as above -- I don't think a client tool should do this on its own. It's as if I chmod a file a-w, and then someone tries overwriting it with a tool which says, "Oh, I can't, let's just silently chmod it back to a+w..." (-;
This is a little bit tricky, since there are quite a few places where the response is handled, and in some cases it handles a few in a row in the …
Right, so that would be a good excuse for just never removing the file, which would certainly be the easiest solution. It would just mean removing the block https://github.com/gridcf/UberFTP/blob/master/cmds.c#L3552_L3574 plus the definition and setting of …
This is an even simpler fix for gridcf/issues/3. It just disables the deletion of the destination file when writing has failed.
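The shape of that change, sketched with illustrative names only (the real edit removes the deletion block around cmds.c lines 3552-3574):

```c
typedef struct errcode_s *errcode_t;   /* assumed shape of the opaque handle */

/* Illustrative store routine: on error, report and leave the
 * destination alone. */
static errcode_t
store_file(const char *src, const char *dst)
{
    errcode_t ec = NULL;

    /* ... open destination, l_write() loop, l_close() ... */

    if (ec != NULL) {
        /* Previously, a DELE was issued for dst here on the assumption
         * that it must be a partial upload; that could destroy a
         * pre-existing file. Now the error is simply propagated, and
         * deletion happens only on an explicit "rm". */
        goto cleanup;
    }

cleanup:
    return ec;
}
```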
OK, I think we're converging on the client only deleting a file if the user explicitly requests that file be deleted. I think that should be fine.
Sorry, my response might be late but it's still relevant.
Just had a look into that: to me it doesn't look like globus_l_ftp_client_data_delete(data) actually deletes a remote file; looking at its definition, it seems to just free the local buffer -- unless I misunderstood the code here.
I am just reluctant to let the tools do something unexpected. And deleting a remote file during a PUT is highly unexpected IMHO, unless a user explicitly wants to overwrite an existing file.
Doesn't a (Grid)FTP server check whether there's enough disk space left before accepting a PUT (if the client sent an ALLO with the file size in bytes beforehand), and shouldn't it then also reserve that space somehow? Oh, apparently not:
...not sure why. :-/
Does this also work for intended partial transfers (i.e. transfers from offset positions)? How does dCache detect that?
Hi @fscheiner
We have a very simple solution: partial uploads are not supported. dCache is an immutable storage system, in the sense that, once the initial writing of a file is complete, the file cannot be modified. So (for example) a client cannot append the missing data to an interrupted upload, but must start again and upload the complete file. Therefore, it doesn't really make any sense to allow FTP clients to upload part of a file, since the client wouldn't be able to upload the remaining data. In concrete terms, dCache will reject ESTO commands with non-zero offsets, and it doesn't have any support for the RESTART command.
@paulmillar
Indeed, I basically put this work on hold to wait for you to return (-:
I've certainly seen it delete a file, but it depends on the server settings:
I agree, in particular since the server tries NOT to overwrite, so deleting is even worse... I think we should go for plan C and never delete anything. Only remaining question: should we delete the relevant blocks or comment them out using …
I see, sorry for delaying this then. :-)
That happens with a …
Ah, I answered, or actually asked, about that in the corresponding PR itself, before reading this. :-) For the record, I'd just remove the relevant code, as it will still be kept in the history.
I don't think so; the original GGUS ticket says "In our (SARA-MATRIX) dCache SRM, overwrites have been disabled." I presume it's still a dCache SRM.
Makes sense, but I will await responses from the others...
Then maybe it happens somewhere else in …
Sure, that was also my intention.
There are large files being sent, since many files are stored on tape, and tape works best with files in the multi-gigabyte range. Particle physics is in the enviable position that they are largely free to choose how big their files are.

I agree this is potentially a problem. However, it's not a problem in practice; at least, not for WLCG. WLCG has invested in good networking connections. There is LHCOPN, plus the many national and international networks within LHCONE. There are still problems; however, they tend to result in either zero bytes transferred (e.g., firewall problems), the entire file transferred but the result rejected (e.g., software bug), or the transfer being too slow and killed by the agent requesting the transfer for not making sufficient progress (e.g., faulty network hardware). In the last case, the infrastructure can try a different source or destination; in effect, routing around the bad network. In all these cases, resuming an upload doesn't help.

However, this is a problem when data is transferred over the more unreliable, public networks. For example, uploads with sync-n-share clients (ownCloud/nextCloud) and volunteer computing (BOINC). These have solved this problem for HTTP uploads in broadly the same way, but (unfortunately) with incompatible solutions.
I think I understand this comment.

dCache used to support a configuration option where it would outright reject (or veto) attempts by an SRM client to overwrite existing files (as with GridFTP). For reference, this is the srmPrepareToPut call. This behaviour (whether overwriting is allowed) could be controlled independently from the (Grid)FTP overwrite protection, although it was done in a way that made it easy to have dCache react in the same way to overwrite requests from either FTP or SRM clients.

This SRM overwrite option was (IMHO) even more stupid than the FTP option. In the SRM protocol, the srmPrepareToPut call has an option where the SRM client says explicitly whether or not the upload should overwrite any existing file. Overwriting only happens if the SRM client says (explicitly) that it wants this to happen (the default is not to overwrite). Therefore it makes absolutely no sense for dCache to veto that statement: it just forces the client to implement a hacky work-around, like checking if the file exists, deleting it and then uploading the file. Therefore, I simply removed that option: dCache honours the SRM client's decision on whether an srmPrepareToPut call should overwrite an existing file.

All of this is just to say that SRM behaviour has no impact on how dCache reacts to an FTP client.
Ah, @fscheiner, I just realised I misread your comment. No, I actually have not tried it with globus-url-copy as far as I remember, but I did have a look at the code. Will look at that again. For UberFTP I did what I said before: upload a file, then upload a different file with the same name, and then see the original one being deleted. Those tests were against two different dCache instances: prometheus.desy.de and the one from SurfSARA mentioned in the ticket.
Yes, you're right, that's indeed not doing any deleting itself. I guess the best thing is to try it out with globus-url-copy, which I have now just done, and I can confirm it does not delete on the SurfSARA dCache (and returns a 550 error instead), while on prometheus.desy.de it just silently overwrites.
Yes. Sorry, I picked up the wrong end of the stick. My point was that "SRM" is only a distraction: this issue is only about (Grid)FTP. Please forget everything I mentioned about SRM.
Ah, clear, thanks! So, conclusion so far: …
Question remaining: should we remove the delete code altogether or just comment out …
Never delete the destination file when writing has failed. That is the safest and easiest solution and matches the behaviour of other tools such as globus-url-copy.
OK, so I decided to make a new PR #5 that indeed just removes the code. It's so little code that it's easy to find again later. I think we should just merge it.
Simple clean fix for github issue #3
Never delete the destination file when writing has failed. That is the safest and easiest solution and matches the behaviour of other tools such as globus-url-copy.
From the GGUS ticket, see also JasonAlt#11
Zdenek mentioned that a fix is easy; Steve offered to put it in EPEL if a patch was provided.