halt transfer on multiple failures #201

Open · dsschult opened this issue Jun 18, 2021 · 5 comments
Labels: enhancement (New feature or request)

Comments

@dsschult (Contributor)

NERSC sometimes goes offline and all transfers will fail. We should just halt all transfers for a while, instead of failing every transfer into quarantine.

So if we see multiple transfer failures, that may not be an issue with that file, but a site issue.

dsschult added the enhancement (New feature or request) label on Jun 18, 2021
@blinkdog (Contributor)

> NERSC sometimes goes offline and all transfers will fail. We should just halt all transfers for a while, instead of failing every transfer into quarantine.
>
> So if we see multiple transfer failures, that may not be an issue with that file, but a site issue.

I'll gently push back on this and see if you agree with my reasoning. 😸

Moving work units to quarantine on site failure is the system operating as intended.

The software should try to deal with anything that is par for the course: an issue that (semi-)regularly crops up and can be readily dealt with in an automated fashion, or dealt with simply by waiting it out and trying again.

Some site problems fall into that category; if they take the DTN node down for a few hours every week, then we can just wait it out if we teach the software to be patient.

Other site problems don't fall into that category: for example, credentials that expire (e.g. an expired token) and need to be manually requested and updated by an operator. This is something we want to know about as soon as possible so an operator can fix it. We'd also like to estimate the impact of the problem if we can.

This is the nice bit about sending work units to quarantine: it indicates a problem the software doesn't know how to handle, and the count/volume gives a nice estimate of the impact. One transfer not going is a curious anomaly the operator can fix quickly. A hundred transfers not going may impact a timeline, and the operator will want to report that to management so they can adjust plans.

Having things in quarantine means they can show up on dashboards, remind both operators and management that something is amiss and that a human is required, get some conversations going, etc.

If the impact of sending to quarantine were large, like transferring 1 TB of data only to lose the work, or running a sha512sum on 1 TB only to lose the work, or if pulling the work unit out of quarantine required a similar level of work, then we'd have a problem. As it stands, though, if the transfer fails fast we lose very little work, and putting the transfer back in the pipeline is just a metadata update. Putting even 1000 bundles back in the pipeline is scriptable and shouldn't take more than 5 minutes to run.
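
For a sense of how small that fix is, here's a minimal sketch of such a re-queue script. The REST URL, endpoints, field names, status values, and auth header are assumptions for illustration, not the actual LTA API:

```python
# Hypothetical sketch of re-queuing quarantined bundles with a metadata
# update. The REST URL, endpoints, field names, and status values are
# assumptions for illustration, not the real LTA API.
import requests

LTA_REST = "https://lta.example.org"      # assumed REST server URL
AUTH = {"Authorization": "Bearer TOKEN"}  # assumed auth header

def requeue_quarantined(new_status="waiting"):
    """Flip every quarantined bundle's status back; return how many."""
    resp = requests.get(f"{LTA_REST}/bundles?status=quarantined", headers=AUTH)
    resp.raise_for_status()
    bundle_ids = resp.json().get("results", [])
    for bundle_id in bundle_ids:
        # The "fix" is just a metadata update on the bundle record.
        requests.patch(
            f"{LTA_REST}/bundles/{bundle_id}",
            json={"status": new_status, "reason": ""},
            headers=AUTH,
        ).raise_for_status()
    return len(bundle_ids)

if __name__ == "__main__":
    print(f"re-queued {requeue_quarantined()} bundles")
```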

There are alternatives, but there are some downsides too:

  • Adding some kind of exponential backoff, and alerting over a certain threshold (a minimal sketch follows this list):

    • Pros: This is nice because it'll fix all problems that can be waited out.
    • Cons: Our time-to-discover for a problem that requires human intervention is however long it takes to hit the threshold, and false positives may still creep in and encourage us to push the threshold yet higher.
  • Turning off a component that we know can't work:

    • Pros: This is nice because it doesn't throw things in quarantine, and things can 'just work' when we turn it back on.
    • Cons: We need to know the trouble is coming (i.e.: works best for planned outages), and the operator has to remember to turn the component back on.
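
For concreteness, here's the sketch promised above for the backoff-plus-threshold alternative; do_transfer() and alert_operator() are hypothetical placeholders, not existing LTA functions:

```python
# Sketch of exponential backoff with an alert threshold.
# do_transfer() and alert_operator() are hypothetical placeholders.
import time

def transfer_with_backoff(bundle, do_transfer, alert_operator,
                          base_delay=60, max_delay=3600, alert_after=5):
    """Retry a single transfer with exponential backoff, alerting once
    the number of consecutive failures reaches alert_after."""
    failures = 0
    while True:
        try:
            return do_transfer(bundle)
        except Exception as err:
            failures += 1
            if failures == alert_after:
                # Time-to-discover is bounded by how long it takes to
                # rack up alert_after consecutive failures.
                alert_operator(f"{bundle} failed {failures} times: {err}")
            time.sleep(min(base_delay * 2 ** (failures - 1), max_delay))
```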

I'm of the opinion that letting the transfers fall to quarantine is a good option, because it alerts us quickly, gives us a sense of how often problems happen and the scope/impact of those problems, and is relatively easy to fix (i.e. run a script) after the root cause is addressed.

On the other hand, I think it's perfectly reasonable to say that not distracting a human (the most limited resource) is preferable to quickly discovering faults. If transfers are held up for a week because the credentials expired and we didn't know until we hit the threshold, that may be perfectly acceptable if it also means the operator was not wasting time on 10 different false positives that all resolved themselves without intervention.

@dsschult (Contributor, Author)

Just a short reply for now, but @barnetspa (the originator of the issue) may have a longer one.

My thought is that the assumption that an operator exists and can respond to LTA every day is a flawed one. The goal should be <1 intervention per month, or for the software to basically run itself if it possibly can. One of the successes of IceProd and Pyglidein is that we don't babysit them, and can leave them alone for months without problems.

@barnetspa

As I've been thinking about this, I think it comes down to decoupling different failure modes and different error handling:

  1. Transfer pipeline failures (NERSC offline, peeps with backhoes, etc) - stall the pipeline
  2. Bundle corruption - quarantine the bundles

The distinction I see is how the system operates through the failures. So in the case of stalling the pipeline, work stops until the issue is resolved, and picks up where it left off after resolution. Human intervention might be required to resolve the problem (please plug the network cable back in to the DTN), but should not be needed for the pipeline to resume work.

For quarantine, there is a problem identified with the bundles themselves - bad checksum, truncated ... - in this case, even though the transfer pipeline may be working, we do not want to push them through any further. So getting them out of the stream makes sense. And as long as there are other bundles that have no issues, they can continue to move through the system. I would expect these issues to be where most human effort is spent since there are a lot of safeguards in the system to keep this from happening.

These could feasibly have some overlap (transfer pipeline fails mid-copy, so there is a corrupt/truncated bundle). Depending on when/where the problem happened, it might be possible to trivially recover (we will resend the bundle to NERSC), in other cases maybe not. If not, we quarantine, and let work resume on the good bundles while we figure out what to do about the quarantined bundles.
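
As a rough sketch of that split, here's a work loop that treats the two failure modes differently; the exception classes and the transfer/quarantine/notify callables are hypothetical placeholders, not existing LTA code:

```python
# Sketch of the two error-handling paths: stall on site problems,
# quarantine on bundle problems. The exception classes and the
# transfer/quarantine/notify callables are hypothetical placeholders.
import time

class SiteError(Exception):
    """Transfer pipeline failure: NERSC offline, network down, etc."""

class BundleError(Exception):
    """Problem with the bundle itself: bad checksum, truncated file, etc."""

def work_cycle(bundles, transfer, quarantine, notify, stall_seconds=1800):
    for bundle in bundles:
        try:
            transfer(bundle)
        except SiteError as err:
            # Site problem: stall the whole pipeline and pick up where we
            # left off on a later cycle; nothing goes to quarantine.
            notify(f"site problem, stalling pipeline: {err}")
            time.sleep(stall_seconds)
            return
        except BundleError as err:
            # Bundle problem: pull this one out of the stream and let the
            # good bundles keep moving.
            quarantine(bundle, reason=str(err))
```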

I can see a couple different approaches to error detection that would work with these that would give us timely notification when problems happen. The biggest is for components to propagate errors to our normal alerting systems. So at the point where a component stalls, that is picked up in our normal monitoring (check_mk or whatever the new hotness is).

The other big case is capacity/performance (congested pipes may not cause outright failure, but may stall the system if not addressed). That seems like more of a data collection/threshold problem.

Any of these can show up in dashboards or monitoring systems pretty easily, so I think it's more about how the pipeline handles and recovers from errors. And early on, we will not likely know the appropriate way to identify and handle every failure automatically so stalls will be frequent and new code and alerts will be needed. But hopefully as we iterate, the number of things a human needs to do gets pretty small.

@jnbellinger (Contributor)

Just as a for-example, yesterday 4 jobs were picked up for verifying at NERSC--and they seem to have all timed out. Other nersc-verify jobs succeeded. The pipeline can keep flowing--but something needs tweaking.

@blinkdog (Contributor)

> Any of these can show up in dashboards or monitoring systems pretty easily, so I think it's more about how the pipeline handles and recovers from errors. And early on, we will not likely know the appropriate way to identify and handle every failure automatically so stalls will be frequent and new code and alerts will be needed. But hopefully as we iterate, the number of things a human needs to do gets pretty small.

Okay, I think we're talking about roughly the same thing but we may be coming at it with different schema/terminology.

The general evolution of the components is:

  • Do work, and if there are any failure conditions, just fail fast/hard (i.e.: dump the work unit in quarantine)
  • See which errors crop up often and code something to work around it if possible; retries, stalls, etc.
  • Rinse, Repeat until the errors are rare enough or strange enough that it's not worth coding for

I think an example of what you're looking for would be the Uploader component at SPS. If it can't contact ASC's SFTP server, it doesn't dump the satellite bundles in quarantine, but rather notifies the operators that the SFTP server is unreachable, then tries again on a later cycle (hoping ASC fixed the problem).

For the transfer to NERSC, we could code a "canary transfer": lead off the work cycle by uploading some tiny (1 MB?) file of random/junk data we don't care about. If that succeeds, we can allow the component to proceed with real work. If that fails, we notify the operator / update the dashboard / etc. that our canary died, and then have the component sleep until the next work cycle.
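
A minimal sketch of such a canary check, assuming a hypothetical upload(local_path, remote_path) callable rather than the real NERSC transfer code:

```python
# Sketch of a "canary transfer". The upload(local_path, remote_path)
# callable and the destination path are hypothetical placeholders.
import os
import tempfile

def canary_alive(upload, dest="canary/lta-canary.bin", size=1024 * 1024):
    """Upload ~1 MB of junk data; return True only if the transfer works."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(size))
        local_path = tmp.name
    try:
        upload(local_path, dest)
        return True
    except Exception:
        return False
    finally:
        os.unlink(local_path)
```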

The next step would be to see how often the canaries die, and start counting them. A single dead canary might not be worth informing the operator / updating the dashboard, if we see that happen occasionally. But if we never see 5+ dead canaries in a row unless something really went wrong, now we've got a threshold worth bringing to human attention.
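
And a minimal sketch of that counting step; the alert_operator callback and the threshold of 5 are assumptions:

```python
# Sketch of counting consecutive dead canaries before alerting.
# The alert_operator callback and the threshold are assumptions.
class CanaryCounter:
    def __init__(self, alert_operator, threshold=5):
        self.alert_operator = alert_operator
        self.threshold = threshold
        self.dead_in_a_row = 0

    def record(self, alive: bool) -> bool:
        """Record one canary result; return True if real work may proceed."""
        if alive:
            self.dead_in_a_row = 0
            return True
        self.dead_in_a_row += 1
        if self.dead_in_a_row >= self.threshold:
            self.alert_operator(
                f"{self.dead_in_a_row} dead canaries in a row at NERSC")
        return False
```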

It would have the effect of stalling the pipeline when transfers to NERSC don't work, and prevent otherwise good work from falling into quarantine for no reason.

Does this sound like it would meet your requirements @barnetspa ?
