[draft; not-ready] receive: Sloppy quorum implementation #7106
Conversation
Signed-off-by: Douglas Camata <[email protected]>
Signed-off-by: Douglas Camata <[email protected]>
Maybe you could add a small paragraph to the docs outlining how this works? It would make reviewing easier.
@GiedriusS sure thing! I will write it when the implementation is ready. It'll take a while still. So far I am experimenting with different approaches to see what's the simplest way to achieve sloppy quorum that I'm happy with and that passes tests (current and new sloppy-quorum-related tests). The PR is open only to make it easier for me to ask some people's opinion on a few things.
Signed-off-by: Douglas Camata <[email protected]>
When we reuse an endpoint for the same replica index we get out-of-order errors.
Signed-off-by: Douglas Camata <[email protected]>
Signed-off-by: Douglas Camata <[email protected]>
After investigating a bunch of different approaches to implement sloppy quorum in Thanos, I learned that the per-series replication makes things difficult. Unfortunately I will pause my efforts towards this one more time. Let's see if the 3rd time will be the charm. I'm pushing my latest attempt to have it public. 😄 Below you can find a few things I tried and why they failed. Hopefully it can inspire and/or help someone else that wants to work or collaborate on this.

Send the request as-is to another hashring member

When I try to send a request that failed to another member of the hashring, that member is very likely to have already received part of the series from another replicated request. Why is it so likely? Due to improvements in the spread of data done to all hashring algorithms (ketama and hashmod).

Redistribute the remaining series between hashring members

Has the same problems as described above.

Redistribute the series to a subslice of the original hashring

In this approach I used a denylist to create subslices of the original hashring where all the members that had already received at least one request were ignored. Very often this results in an empty subslice due to the great spread of data mentioned above. When it doesn't, it has the same issues as the point above: the new node chosen to get a given set of series very likely already has some series from that request.

Conclusion

Implementing sloppy quorum in Thanos will be high-effort work. It might require that we intercalate some small related features (e.g. allowing forwarded requests to be retried through configuration) and refactors to get to a point where sloppy quorum can be safely introduced with an implementation that is as simple as we can come up with.
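For illustration, a minimal Go sketch of the "denylist subslice" idea described in the comment above; the Endpoint type and the function name are hypothetical stand-ins, not the actual Thanos receive types:

```go
package main

import "fmt"

// Endpoint is a simplified stand-in for a hashring member.
type Endpoint string

// subsliceHashring returns the members of the original hashring that have not
// yet received any part of the request, so they could serve as fallback targets.
func subsliceHashring(members []Endpoint, alreadyUsed map[Endpoint]bool) []Endpoint {
	var candidates []Endpoint
	for _, m := range members {
		if !alreadyUsed[m] {
			candidates = append(candidates, m)
		}
	}
	return candidates
}

func main() {
	members := []Endpoint{"receive-0", "receive-1", "receive-2", "receive-3"}
	used := map[Endpoint]bool{"receive-0": true, "receive-1": true, "receive-2": true}
	// With a well-spread hashring, the remaining candidate list is often empty
	// or tiny, which is the failure mode described in the comment above.
	fmt.Println(subsliceHashring(members, used)) // [receive-3]
}
```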
@douglascamata Could you clarify something for me? You mentioned this:
Could you elaborate on why this was a problem? AFAIK this should not be a problem at all. Assume a situation with 3 receivers and replication factor 3. Any piece of data coming into the system would want to end up on all 3 of these receivers, so regardless of which receiver it lands on initially, that receiver will send it off to the other 2 and write it down to its own TSDB locally. If both of the other receivers happen to be unavailable long enough to fail all internal retries, the quorum of 2 will not be achieved and a 5xx response will be returned, which will cause an eventual retry from the client with the exact same data (or possibly with new data added on top). If in the meantime at least one of the previously unavailable receivers comes back up, the retry will be processed appropriately with a 200 response, having reached the quorum of 2, despite the data already being (fully or partially) available in the receiver that processed it on the first try. This is standard operation which may or may not cause a warning to be printed in the logs. What problems did you experience with this during your testing?
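As a rough illustration of the quorum math referred to in this comment (a simplified sketch, not the actual Thanos code), with replication factor 3 the write quorum is 2:

```go
package main

import "fmt"

// writeQuorum returns the number of successful replica writes needed for a
// write to be acknowledged, e.g. 2 when the replication factor is 3.
func writeQuorum(replicationFactor int) int {
	return replicationFactor/2 + 1
}

func quorumReached(successes, replicationFactor int) bool {
	return successes >= writeQuorum(replicationFactor)
}

func main() {
	fmt.Println(writeQuorum(3))      // 2
	fmt.Println(quorumReached(1, 3)) // false: client gets a 5xx and retries
	fmt.Println(quorumReached(2, 3)) // true: 200 even if one receiver is down
}
```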
@mfoldenyi it's been a few months since I worked on this, so let's see if I remember something about it. 🤣 The complication is that we are replicating series, not requests. Let's name the 3 receivers in the example A, B, and C. Because we are hashing series, each series in a request gets its own set of replica targets, so every replica write is tied to specific receivers. When one of those replica writes fails and we try to send it to another member instead, that member has very likely already received some of the same series from another replicated request, and the write fails with out-of-order errors. The more nodes you have down in your hashring, even when you still have 2 up to achieve quorum, the more likely this is to happen. For instance, if you had 2 receivers up out of 4 total, a given series might get one successful write while its other 2 replica writes land on the 2 nodes that are down. I hope I'm remembering correctly, but I might be wrong.
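To illustrate the per-series replication being described, here is a simplified, hypothetical sketch (an FNV hash stands in for ketama/hashmod, and the function name is made up): each (series, replica index) pair maps to a specific receiver, which is why a failed replica write cannot simply be redirected to an arbitrary node.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// replicaTarget is a simplified stand-in for ketama/hashmod: every
// (series, replica index) pair is deterministically mapped to one member.
func replicaTarget(members []string, series string, replica uint64) string {
	h := fnv.New64a()
	h.Write([]byte(series))
	return members[(h.Sum64()+replica)%uint64(len(members))]
}

func main() {
	members := []string{"A", "B", "C"}
	for _, s := range []string{"series-1", "series-2"} {
		for r := uint64(0); r < 3; r++ {
			// Each series has its own fixed set of replica targets, so a
			// "slipped" write is likely to hit a node that already received
			// some of the request's series via another replica.
			fmt.Printf("%s replica %d -> %s\n", s, r, replicaTarget(members, s, r))
		}
	}
}
```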
Ok, so I may be understanding what you meant, so let me rephrase and you can tell me if I got it or not. For now let's assume we have 4 receivers, A, B, C, D, and replication factor 3. Assume B, C, and D are unavailable (in this state, no requests should ever succeed). With sloppy quorum, the failed replica writes could all fall back to A, so we end up getting 3 successful writes, when in reality all we did was just 1: the local write on A.

So, with the above said, the problem you mean is that we cannot identify which writes are "real writes" that do count towards the quorum, and multiple replication attempts could pick the same "fallback target" and report the same write as multiple successes?

If this is so, can we not just refactor the quorum check to collect target names instead of counts? We could then check uniqueness in the list to check for quorum.
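A minimal sketch of the suggestion above (illustrative names only, not the existing quorum code): count distinct successful targets instead of raw success counts, so repeated fallback writes to the same node only count once.

```go
package main

import "fmt"

// quorumByUniqueTargets counts distinct receivers that acknowledged a write,
// so multiple fallback writes landing on the same node count as one.
func quorumByUniqueTargets(successfulTargets []string, replicationFactor int) bool {
	unique := map[string]struct{}{}
	for _, t := range successfulTargets {
		unique[t] = struct{}{}
	}
	return len(unique) >= replicationFactor/2+1
}

func main() {
	// Three "successes" that all landed on receiver A do not reach quorum.
	fmt.Println(quorumByUniqueTargets([]string{"A", "A", "A"}, 3)) // false
	// Two distinct receivers do.
	fmt.Println(quorumByUniqueTargets([]string{"A", "B", "A"}, 3)) // true
}
```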
@mfoldenyi this scenario that you just showed is one of the few additional complications -- a very good one, I would say. If we are just thinking out loud, it's an easy problem to solve: you have to ensure that every single series is written to at least 2 different nodes. It might not be as easy in code, but it's definitely doable. But let me draw your attention to some other complications. You said this:
There are lots of decisions and tradeoffs to be made. The code starts to become complicated and things get difficult. There's a lot of work to do, and many of these details only come up as you start to see tests behaving weirdly.
So there is one aspect that I have assumed works one way, and you are saying otherwise, which I so far believe not to be true. Specifically this part:
My understanding so far was that this is not a problem: S1-R3 would successfully write to both B and C if attempted, despite them already having the samples. If this were not so, we would almost always end up in a "not possible to recover from" scenario whenever we are processing a retry.

Take a request R1 with 2 series:

R1 gets resubmitted by the client:

If resending the same data to the same node was a problem, then this use case would not work. Moreover, any request containing S1 arriving in the future would by default be rejected, which most of the time would completely kneecap retries when they arrive, since in my experience most of the time requests are rejected due to only a small subset of the series within not reaching quorum (e.g. 2 nodes out of 90 are down, 90% of requests get rejected, but retries work fine right after with 90 nodes).

Am I not seeing something that makes this situation somehow different to what you are talking about? If this is indeed a problem, we could still limit the solution to the AZ-aware ketama hashring, where we would have the ability to find nodes that are definitely not normal targets of the data (since at most 1 request would land in one AZ, so all other nodes in the AZ are good fallback candidates).
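For the AZ-aware idea, a hypothetical sketch (the node type and fields are made up for illustration): if an AZ-aware ketama hashring places at most one replica per AZ, the other members of a failed target's AZ could serve as fallback candidates.

```go
package main

import "fmt"

// node is a made-up representation of a hashring member with its AZ.
type node struct {
	name string
	az   string
}

// fallbackCandidates returns the other members of the failed node's AZ.
// Assuming at most one replica lands in each AZ, these nodes are not
// normal targets for the same data.
func fallbackCandidates(all []node, failed node) []node {
	var out []node
	for _, n := range all {
		if n.az == failed.az && n.name != failed.name {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	nodes := []node{
		{"recv-a0", "az-a"}, {"recv-a1", "az-a"},
		{"recv-b0", "az-b"}, {"recv-b1", "az-b"},
	}
	fmt.Println(fallbackCandidates(nodes, node{name: "recv-a0", az: "az-a"})) // [{recv-a1 az-a}]
}
```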
Resending the same series, with the same timestamp and the same value, to the same node will 100% fail. This happens today when a request is partially written. Often it results in Prometheus retrying the request non-stop because there will always be an error, as part of it was already written. It only stops when Prometheus decides to drop the request from its retry queue.
Changes
This adds a sloppy quorum mode to Receive, gated behind the --receive.sloppy-quorum flag. The number of times the algorithm looks for a new node to replicate the request to is controlled by the --receive.sloppy-retries-limit flag.
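Assuming the flags behave as described in this draft (exact syntax and defaults may differ, and the retry limit value below is arbitrary), enabling the feature might look something like this:

```shell
thanos receive \
  <usual receive flags...> \
  --receive.sloppy-quorum \
  --receive.sloppy-retries-limit=2
```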
Currently the sloppy quorum logic happens in two places: when finding a peer connection and when writing, with independent retry counters.

One important detail is that there's no implementation of what's known as "hinted handoff". This means that writes that end up "slipping" will never be sent back to the original place where they should be. The reasons for this are:
Verification