INVALID_TOO_LATE errors #299
The way this works is that the time counts starting from the Signage Point origin and stops when the pool gets a partial. Previous investigations seem to show that it's the delay of the signage point propagating through the blockchain/nodes to your local node. It seems to happen when your local node has localized peers (not spread across the globe). Some pools increase the timeout to above 25 seconds, but that also increases the chances of bad farmers (long lookup times) not being flagged as invalid partials.
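For what it's worth, here is a minimal sketch of the lateness check as I read the description above; the timeout value, function, and variable names are assumptions for illustration, not the pool's actual code:

```python
# Assumed timeout; the description above mentions ~25 seconds, with some pools raising it.
PARTIAL_TIMEOUT_SECONDS = 25.0

def check_partial_lateness(sp_origin_time: float, partial_receive_time: float) -> str | None:
    """Return an error label if the partial arrived too long after the
    signage point origin, as described above; otherwise None."""
    delay = partial_receive_time - sp_origin_time
    if delay > PARTIAL_TIMEOUT_SECONDS:
        # Raising the timeout reduces these errors but also lets farmers with
        # long lookup times slip through unflagged (the trade-off noted above).
        return "INVALID_TOO_LATE"
    return None

# Example: a partial arriving 28 s after the SP origin would be rejected.
print(check_partial_lateness(sp_origin_time=0.0, partial_receive_time=28.0))
```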
It looks like a great catch for those signage points! Although maybe not exactly as you described it. Here is the relevant signage point data:
The interesting part is that all those signage points for those three minutes are coming roughly 10 secs apart (which is as expected). However, I don't understand why there is such a repeat there (where I put that break line). So, there were basically no Chia network propagation delays, no double signage points, just some really odd behavior. I have never looked at those signage points before, so I am not sure how often that happens, and/or whether this looks like some kind of bug.
Interesting... the CC is the same, but the RC is different in some cases. I don't really know what that means. It makes it sound like some sort of reorg, or peers disagreeing/stepping on each other...
It looks kind of like a reorg, as from when the first repeat started (at 56) until the last one (63), all the timing is correct. My take is that if peers were disagreeing, the timing would be a mess. Although, it took 20 secs to start the new batch (starting at 1). However, when the second sequence (starting at 1) started, it was kind of a mess, as the first three have the same time, and #4 is missing. Then the timing goes back to normal. By the way, who is generating those signage points (timelords)? It looks more like the problem is in the generation of those signage points, rather than in network / node propagation. Also, maybe you could update the error tip output ("Partial error details") to mention that those errors may have nothing to do with the farmer, but rather with "messed up" signage points.
I put those harvester logs together with the signage point logs, and in both cases it looks like the same signage point (59) was processed. Here is the output:
Knowing that the second submission was against a duplicate signage point, all the timing makes sense now. The pool responded with an error to the second submission, and it took only about 200 msec to get that response. So, we know that there is nothing wrong with the timings. That said, do you know why the pool took the base time for the second signage point from the first SP issued (59), not the second one (the duplicate)? Maybe there is something in the submitted info that could identify the exact signage point, so the pool could squash such reports? (I really don't know how that part of the protocol works.)
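Just to illustrate the question (not the pool's actual logic; all names here are made up): if the pool keys the first-seen time by SP index rather than by SP hash, a duplicate SP with the same index inherits the old base time, and a prompt partial against the duplicate can still look late:

```python
# Illustrative only: two ways a pool might remember when an SP was first seen.
first_seen_by_index: dict[int, float] = {}
first_seen_by_hash: dict[str, float] = {}

def base_time(sp_index: int, sp_hash: str, receive_time: float) -> float:
    first_seen_by_index.setdefault(sp_index, receive_time)
    first_seen_by_hash.setdefault(sp_hash, receive_time)
    # Keyed by index: a duplicate SP 59 reuses the original 59's base time, so
    # a partial submitted promptly against the duplicate still measures late.
    # Keyed by hash (first_seen_by_hash[sp_hash]) it would get its own base.
    return first_seen_by_index[sp_index]

base_time(59, "sp59-original", receive_time=0.0)
print(base_time(59, "sp59-duplicate", receive_time=40.0))  # -> 0.0, not 40.0
```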
How "the Signage Point origin" is defined? Is it based on the initiation of the SP batch (i.e., SP 0 plus 10x current), or rather from the current SP? |
Honestly, I don't know. I will need to research that. I noticed the CC on that is the same, but the RC is different. I don't know yet what RC means...
Chia's GitHub? I guess we see two problems that maybe they can explain. The first is the logic behind those duplicate SPs. The second is how the pool should react to them. If you want, I can mark up those logs better and post them on their GH, and maybe you could chime in to narrow down what we expect them to explain. Or you can take my logs and post them there. I have no preference. UPDATE: I guess I am at my limits with this problem right now, so I will let you work on it. If you need some additional info, let me know.
Thanks, very helpful so far. I will see what I can find, but it may take a while since I am on vacation.
Thinking about it more, and trying to put it all together, I think that we really have two cases here:
The first one (chain self-correction) is when at some point we get SPs coming on schedule but with indexes that were already processed before. Those have different CCs and RCs and are flagged by the pool server as INVALID_TOO_LATE. However, maybe those SPs/proofs still need to be processed, as we don't know what caused the chain reorganization, and whether the chain considers the first (not likely) or rather the second (there has to be a reason to push the second) to be valid; plus, those come on schedule. So, I would assume that those should be processed as normal SPs (pushed to the network, and not locally marked as TOO_LATE). Although, the big question is what happens if the chain already recognized the previously submitted proof(s) as the winning ones, and this new batch also has winning proofs. Still, I would say it is better to submit those extra proofs and let the chain figure out what to do, rather than miss an opportunity.

On the other hand, if those are peer burps (some spurious extra SP), they will not preserve SP timings. Therefore, depending on whether the server stores who submitted partials previously, those should not be marked as DOUBLE_SIGNAGE_POINTS. Sure, this is an error condition on the farmer, as the farmer should not be processing those burps. However, as those only indicate that the farmer screwed up (by processing the same thing once more), there is no reason for the pool server to count it as a farmer error; it could rather quietly sweep those under the rug. Although this is a bit tricky. The reason I am saying that those should not be marked as doubles is that those are responses to challenges, and if the farm has duplicated plots, it should respond with dups in a single submission. On the other hand, if the farm replies with two found proofs in the same submission, that is a clear indication that there are duplicated plots. Again, I don't know much about how those are processed by the code, so all of that is only based on what the logs show.
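A rough sketch of how the two cases above might be told apart, purely to illustrate the reasoning (the thresholds and the use of the ~10 sec cadence are my assumptions, not anything the pool actually does):

```python
def classify_repeated_sp(seconds_since_previous_sp: float,
                         chain_hashes_differ: bool) -> str:
    """Classify a signage point whose index was already processed before,
    following the two cases described above."""
    # Roughly on the expected ~10 s cadence?
    on_schedule = 8.0 <= seconds_since_previous_sp <= 12.0
    if on_schedule and chain_hashes_differ:
        # Case 1: chain self-correction / reorg-like; arguably still worth
        # processing rather than marking TOO_LATE.
        return "process_as_normal_sp"
    # Case 2: a peer burp (spurious re-broadcast); arguably swept under the
    # rug rather than counted as a farmer error.
    return "ignore_quietly"
```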
There is no such thing as two found proofs per submission. It's one request per proof.
Sorry, maybe that should be partials? What is triggering DOUBLE_SIGNAGE_POINT errors, then? |
Two different requests/partials for the same Signage Point? |
Assuming that the harvester found two proofs while processing a single challenge (e.g., because a duplicate plot was hit): will it report both (identical) proofs, or will it internally squash the duplicate?
I think the pool will get two distinct partials. |
I guess this is the whole point of this discussion. Seeing the logs on the farmer, we have a better understanding of the patterns in each case. Can logs on the server side also be scrutinized to eventually squash those errors, or, put differently, is there enough info in the submitted data to classify those errors better?
The request the pool receives only contains the partial proof and the signage point hash. I don't see what else we could gather to squash anything, but I will keep thinking.
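For reference, here is my rough mental model of what that request carries, written as a sketch; the two fields named in the comment above (the proof and the SP hash) come from the comment itself, while the other fields and all the names are assumptions and may not match the real pool protocol:

```python
from dataclasses import dataclass

@dataclass
class PartialSubmissionSketch:
    proof_of_space: bytes   # the partial proof (mentioned above)
    sp_hash: bytes          # hash of the signage point (mentioned above)
    launcher_id: bytes      # assumed: identifies the submitting farm
    harvester_id: bytes     # assumed: which harvester produced the proof
    # As the comment above says, the proof and the SP hash are essentially
    # all the pool has to go on when trying to tell these cases apart.
```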
Yeah, that place has that class really well documented. I would rather go to either farmer.py or farmer_api.py to see how those members are filled. I think more info is on your pool side. Not just what you get from all farmers and the blockchain, but also what you store in the db (without killing performance at the same time, though). It would be nice if you could provide a JSON string that comes from a submission POST (even better if it were already preprocessed by the server, so the structure would be more visible). That said, I would consider the following cases for classifying proofs (a rough sketch follows after the cases):
Of course, proofs A, B, D, F should be processed normally, as those are basically the first proofs for a given SP idx, and they just serve as the base for the next proof.

Proof C, coming from a duplicate plot, should arrive right on the heels of proof B (we can ignore farmer processing time, but expect some network jitter). Maybe if such a proof comes within 1-2 sec (maybe just 100-200 ms would do) or so of the previous one, we could assume that this is a real plot duplicate. This is clearly an error on the harvester / farmer side that is not handled right by the Chia team. If there are duplicates on one harvester, then the harvester should be squashing them and flagging them as such. On the other hand, if different harvesters have duplicate plots, squashing those would be the farmer's job, as the proofs should be identical.

Proof E is basically identical to proof C, except that the timing difference could be slightly bigger, so maybe if that difference is bigger than 2-3 secs, we could assume that those are network burps. It looks like the farmer is sending the SP idx, although I am not sure whether those proofs would be identical (pool logs could help here).

Proof G. This one is kind of a wild card. Although, in the logs that I got for this case, those SPs were coming on schedule (exactly in their 10 sec intervals), plus RC/CC were different. So, the first thing is that we would have the same SP idx, but everything else should be different (most likely). Currently, those are marked as TOO_LATE. However, the pool should already store proof F, so this cannot be too late, as it is clearly at worst a duplicate for that proof F SP idx. Also, depending on how deep the reorg is, those should be coming with 10/20/... sec latency compared to the original one. As mentioned, if such proofs could be identified, my understanding is that they should still be submitted, as we have no clue what the reason for the reorg was, and based on that, which proof will be validated. I guess that explanation is a bit simplistic. It may also be the case that there were no prior submissions for SPs with indexes equal to or higher than this one, in which case it will classify as a late one (if no chain reference is taken with respect to the reorg). Although, if there were any submissions for a higher (not equal) index, maybe that indicates that the drive that holds the plot went to sleep, and this SP was processed late.

Proof H will clearly be late if it is the very first one for that SP idx (this is the main difference between this and proof G) and is coming late (based on the current server logic).

Agreed, all of the above is mostly based on timings and previous proofs, and possibly not that stable. Although, maybe if even some of those can be identified, that would be a good enough improvement? I have never seen what a submitted proof looks like (i.e., that JSON string), nor do I know how much data you store on your side, so I cannot comment on that part. No need to comment. I think that I have all that off my mind now.
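To make the proposal above concrete, here is a hedged sketch of the timing-based classification; the thresholds (1-2 s, 2-3 s, the ~10 s cadence, the 25 s timeout) come from the discussion, while the function shape, names, and inputs are invented for illustration and are not the pool's actual logic:

```python
from typing import Optional

def classify_partial(seconds_since_prev_partial_same_index: Optional[float],
                     same_sp_hash_as_previous: bool,
                     arrived_on_sp_schedule: bool,
                     delay_from_sp_origin: float,
                     timeout: float = 25.0) -> str:
    if seconds_since_prev_partial_same_index is None:
        # First partial for this SP index: proofs A/B/D/F, or proof H if it
        # simply arrived past the timeout.
        return "too_late" if delay_from_sp_origin > timeout else "normal"
    if same_sp_hash_as_previous and seconds_since_prev_partial_same_index <= 2.0:
        # Proof C: right on the heels of the previous one -> duplicate plot.
        return "duplicate_plot"
    if same_sp_hash_as_previous:
        # Proof E: same SP but noticeably later -> likely a network burp.
        return "network_burp"
    if arrived_on_sp_schedule:
        # Proof G: same index, different SP, on the ~10 s cadence -> reorg-like,
        # arguably still worth processing instead of marking TOO_LATE.
        return "reorg_like"
    return "too_late" if delay_from_sp_origin > timeout else "normal"
```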
I am getting those errors once every other day or so (not that many). Recently, I made a few changes on my farm, changed my ChiaDog reports a bit, and today got lucky with one such error. Here is the data:
The above shows two partials found, where it looks like the first one is the offending one. In both cases, lookup times were well below 1 sec, and the submission followed within 300-400 msec.
As trying to nail down the issue is a bit tricky due to networking lags, I would still assume that 45 seconds is rather too long for a round trip (pinging pool.openchia.io gives me around 80 msec averages, assuming your server is East Coast located; I am West Coast). Although, we all know that if those round trip results are not measured at the time of the problem, they are rather worthless for telling what might have happened at a different time.
Here is where I got really lucky. I checked the top three farms in your pool, and all three have the same error at roughly the same time (a 15 sec spread). (Sure, it is possible that those three farms are also in California, but grasping at straws, this is a pattern that points to a potential problem on your side.)
Could you check the logs on your side to see whether you can find something that could potentially be addressed? Maybe you could run some reports against those late responses (for the whole pool), and that could shed some light?
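In case it helps, here is a sketch of the kind of report I have in mind: bucket the INVALID_TOO_LATE partials by minute and look for buckets that hit several unrelated farms at once, which would point at the pool/network side rather than individual farmers. The table and column names here are invented; I have no idea what your actual schema looks like:

```python
import sqlite3

# Invented schema: partials(timestamp REAL, launcher_id TEXT, error TEXT).
conn = sqlite3.connect("pool.db")
rows = conn.execute(
    "SELECT timestamp, launcher_id FROM partials WHERE error = 'INVALID_TOO_LATE'"
).fetchall()

# Count distinct farms hit per minute.
farms_per_minute: dict[int, set] = {}
for ts, launcher in rows:
    farms_per_minute.setdefault(int(ts // 60), set()).add(launcher)

for minute, farms in sorted(farms_per_minute.items()):
    if len(farms) >= 3:
        print(minute * 60, "->", len(farms), "farms with INVALID_TOO_LATE")
```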