activerecord: can produce a dirty read #237
Comments
Hi @otan, please add branch-* labels to identify which branch(es) this release-blocker affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
removing blocker for now as i haven't been able to reproduce a third time and questioning my sanity
ruby code:
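For concreteness, here is a minimal hypothetical harness for driving the adapter outside the Rails suite -- not the snippet otan shared. The adapter name (`cockroachdb`) is real; the host, port, user, database, pool size, `Developer` model, and seed values are assumptions:

```ruby
# Hypothetical harness (not the original snippet): connect ActiveRecord to a local
# CockroachDB node via the cockroachdb adapter and seed one row to update and read.
# Host, port, user, database, table, and values are assumptions.
require "active_record" # assumes the activerecord-cockroachdb-adapter gem is installed

ActiveRecord::Base.establish_connection(
  adapter:  "cockroachdb",
  host:     "localhost",
  port:     26257,
  user:     "root",
  database: "defaultdb",
  pool:     10 # enough connections for the writer/reader threads sketched later
)

ActiveRecord::Schema.define do
  create_table :developers, force: true do |t|
    t.integer :salary
  end
end

class Developer < ActiveRecord::Base; end

Developer.create!(id: 1, salary: 100_000)
```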
have not been able to reproduce with a go equivalent, you can try https://github.com/otan-cockroach/repro71652 if you'd like
Within the test, it happens in about 1/3 of the runs I've done, and we've seen it in CI. I was able to reproduce it in a shell on the first try, but I just had a single node running instead of a cluster, so maybe that makes it easier to reproduce?
I was able to reproduce on version 21.2.0-beta.4
@keithdoggett have you been able to reproduce while running just this single test, or does it require running the entire test suite? Have you been able to reproduce while running the test locally (not in CI)?
@nvanbenschoten I can reproduce it as a single test, both locally and in CI.
Hi @dt, please add branch-* labels to identify which branch(es) this release-blocker affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
Hi @nvanbenschoten, please add branch-* labels to identify which branch(es) this release-blocker affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
Thanks @keithdoggett, that's very helpful information! @rafiss has a Ruby env set up, so he's going to see how easily he can get this to fail when stressing the single test. If the cycle time is short enough, he's going to help bisect this back to a single commit.
That sounds good, let me know if you need any help with it. It was failing often enough when running locally, so hopefully it won't take too long to find the commit.
Update: still working on this, but my local ruby environment had broken since the last time I used it. I should have things working locally soon.
I was able to reproduce this in 1 of 40 runs of the test. However, I have been unable to reproduce in a shell or while running with @otan's Go script: https://github.com/otan-cockroach/repro71652. I'm wondering whether that script performs txn retries in the same way as activerecord.
if you need to speed up the AR roach test, you can run just this single test directly. Apparently activerecord rarely hits the ProtoRefresh error :\
Even when using https://pkg.go.dev/github.com/cockroachdb/cockroach-go/v2/crdb, I'm still not able to reproduce in Go.
I have the tests running locally; I used the test file below. Now that I have at least some sort of baseline I'll try to get this running in a script. Code to repro:
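A hedged sketch of what such a repro could look like, reusing the hypothetical harness sketched earlier in the thread (this is not the file rafiss actually ran). The thread counts mirror the "3 write threads, 5 read threads" mentioned later in the thread, and 200000 is the transient value a reader must never observe:

```ruby
# Hypothetical repro loop; assumes the connection/model setup sketched earlier.
# Writers pass salary through 200_000 inside a transaction but always restore
# 100_000 before COMMIT, so under SERIALIZABLE isolation a reader must never
# observe 200_000. Contention between writers can produce retryable errors,
# which the adapter's retry logic is expected to handle.
writers = 3.times.map do
  Thread.new do
    100.times do
      Developer.transaction do
        Developer.where(id: 1).update_all(salary: 200_000) # transient, never the committed result
        Developer.where(id: 1).update_all(salary: 100_000) # restored before commit
      end
    end
  end
end

readers = 5.times.map do
  Thread.new do
    500.times do
      salary = Developer.where(id: 1).pick(:salary)
      raise "dirty read: saw #{salary}" if salary == 200_000
    end
  end
end

(writers + readers).each(&:join)
puts "no dirty read observed"
```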
RIP i wish i knew about that
@rafiss if you can get a reproduction with a crdb binary that includes cockroachdb/cockroach@247332b and is running with the extra logging enabled, can you grab a debug.zip?
also one thing i didn't check is whether it's a dirty read, or we actually commit 200000 to be the result (which is also bad)
Whoops.... I ran against the wrong commit
OK, here's the debug.zip from cockroachdb/cockroach@247332b
I see something strange in those logs. Searching for locking Get operations on what I believe to be this key, I see the following logs:
These logs include 3 transactions. The first two read the key twice, which aligns with the two UPDATE statements. The third only reads the key once, which is surprising. Searching then for the outcome of these three txns, we see:
The first two txns look like we would expect. Notice that they commit with seq num 3, meaning that they performed 2 writes (plus the EndTxn's seq num). The third txn commits with a seq num of 2, so it only performed a single write! That indicates that something strange is going on. Are we sure we're issuing two UPDATE statements for this third txn? And there's more. Notice that the third txn has two EndTxn requests while the first two only have one. That's because the third commit is performing a parallel commit, because its Put is still in flight by the time the txn commits. This is not the case for the first two txns, because they read their own write before committing, so no write was still in flight at commit time.
Would someone mind adding in a sanity check that we're actually reaching and then making it past the second UPDATE statement?
Also, as another sanity check, when people are saying that they reproduced in the shell, are they seeing a salary of 200000?
I saw 200000.
I think I found the first half of this transaction:
Notice that it hits a WriteTooOld error on the first UPDATE statement's locking Get, can't refresh, and then gets rolled back. I wonder if this somehow (on the client or server) broke the txn context and then allowed the UPDATE statement to retry and commit in a separate txn.
This timeline appears to be consistent with the first debug.zip @rafiss sent over.
@keithdoggett i'm having trouble reproducing this on shell again. i did see 200000. can you try, and paste the output of the shell as you are running the commands?
huh, i misspoke. looks like this is a "problem" on 21.1 as well. cockroachdb/cockroach#69080 is from Aug 24; master first hit it Aug 12.
Sharing something strange I see in my repro from running ActiveRecord tests locally. The full logs are long and pasted below, but the strange part is what shows up in them. I'm using the code I shared in https://github.com/cockroachdb/cockroach/issues/71652#issuecomment-946118074 -- and both line 50 and line 76 are from that same file.
After @otan pointed out that this reproduces on 21.1 as well, I feel comfortable calling this not a release-blocker. (And an update on my bisecting attempts: they've been inconclusive since, as per above, this issue goes back further than the commits I've been bisecting.) But let's keep the issue open as a tracking issue for SQL Experience so we can keep debugging and then ultimately decide if there is a CRDB bug or an issue in the test infra. Update: repro'd on v21.1.0 also.
This reminds me of a bug I saw in another life (10 years ago, in a python/mysql app): some confused exception handling left a connection in the pool in a bad state. I started this comment in response to Nathan's analysis, and then Rafi's comment showing cross-test effects also suggests the connection pool as a possible vector (assuming the ActiveRecord test framework doesn't clear the connection pool between every test).
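A hypothetical illustration of that vector using ActiveRecord's public pool API; whether a checked-in connection can actually come back with a transaction still open depends on the Rails and adapter versions, which is exactly the question being raised here:

```ruby
# Hypothetical sketch of the suspected vector: a pooled connection goes back to the
# pool with a transaction still open, and a later checkout silently runs inside it.
pool = ActiveRecord::Base.connection_pool

conn = pool.checkout
conn.begin_db_transaction        # BEGIN is sent to the server
# ... an exception is rescued somewhere without a COMMIT or ROLLBACK ...
pool.checkin(conn)               # if nothing rolls the transaction back here, the
                                 # connection returns to the pool mid-transaction

later = pool.checkout            # a later test may receive the same physical connection
later.execute("SELECT 1")        # and now executes inside the stale, uncommitted transaction
```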
One theory I floated in Slack that could explain these "shell reproductions" is that they were run on the same DB instance that had previously been corrupted (i.e. a salary of 200000 had already been committed by an earlier run). I'm curious whether this explanation matches @otan and @keithdoggett's reproductions. Has anyone ever reproduced this in a shell on a fresh instance where they actually had to run the UPDATE statements themselves?
I first thought I got it on a shell (AR was not set up), but I am really wondering whether my manual testing accidentally made a commit instead. Curious how @keithdoggett got it on his end. As they say, pics or it didn't happen (and I don't have pics).
I guess it is also suspicious that in my repro using Go, I ended up with TxnProtoRefreshError frequently, but activerecord never hits that error. The AR test should, in theory, hit it as regularly as I did (3 write threads, 5 read threads).
That's a good point @nvanbenschoten, when I reproduced it in a shell, I had connected to the same database that the AR tests are run on. It was after the tests ran, but if it's some sort of data corruption issue, that would explain why I was able to see it on the first try. To @bdarnell's point, the connection pool is not reset between tests, and we've seen issues in the past where a connection will have an issue during a transaction and then every following test would raise a TxnProtoRefreshError. So it's possible that it's related to that, although there's usually an error that gets raised when executing in an invalid state, which we're not seeing here.
This upstream change seems very relevant to what's being discussed. @keithdoggett can we think of a way to make sure the test is using this logic correctly?
Ah nvm, rails/rails@a2c1168 is only in rails v7.0.0.alpha1. @keithdoggett can you determine if the rails project will get that change backported? Based on the discussion in rails/rails#40541, I feel pretty confident that this is the problem. Also, let's make sure the adapter's retry logic is still correct (lines 11 to 29 in f492592).
@rafiss this is on the stable branch we run against -- that change was backported. I'll take a look at the retry logic. Might have to add a check to confirm the connection hasn't been disconnected during the retry.
Ah thanks for pointing me at that backport - I hadn't seen that it was merged. Though it looks like that commit has already been released:
So I think it likely means one of two things:
A next step is for us to write an activerecord-cockroachdb-adapter test that intentionally causes transaction retry errors. If we're running against v22.1, we can use the retry-error injection support added there.
Gotcha, so we've been running against this for a little while. We're not currently running against 22.1, but I could make a separate branch to test that, or try to come up with some code that would trigger a transaction retry error.
well I wrote this simple test and tried locally with v22.1, and it seems to work correctly
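For reference, a hypothetical test along those lines (not the test rafiss ran) that works on released versions by using CockroachDB's built-in `crdb_internal.force_retry` to inject a retryable error; the `Developer` model and salary column are the assumptions carried over from the earlier sketches:

```ruby
# Hypothetical retry-error test (not rafiss's actual test). crdb_internal.force_retry
# raises a retryable (40001) error inside an explicit transaction; if the adapter's
# retry logic is working, the block is rerun and the write commits exactly once.
attempts = 0

Developer.transaction do
  attempts += 1
  if attempts == 1
    Developer.connection.execute("SELECT crdb_internal.force_retry('100ms'::INTERVAL)")
  end
  Developer.where(id: 1).update_all(salary: 100_000)
end

puts "transaction block ran #{attempts} time(s); salary=#{Developer.find(1).salary}"
```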
I think we can probably hold off on merging in a test to activerecord-cockroachdb-adapter right now. I guess I no longer have a great theory for where to look next. Maybe there are other issues that could happen with the retry logic. I'll try out other tests.
it could be working correctly but still using a poisoned connection holding an uncommitted txn though? would a fairer test be:
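One hedged guess at the shape of such a check, again using the assumed `Developer` model: force a retry error the adapter cannot ultimately recover from, then verify the connection is not left holding the aborted, uncommitted transaction:

```ruby
# Hypothetical "fairer" check (a guess at the shape being suggested): after a retry
# error aborts a transaction, the same connection must not still be inside that
# transaction, and the transient value must not be visible afterwards.
conn = Developer.connection

begin
  Developer.transaction do
    Developer.where(id: 1).update_all(salary: 200_000)  # uncommitted transient write
    # force_retry with a long interval keeps failing, so even if the adapter retries
    # the block (assuming it bounds its retries) the error should eventually surface
    conn.execute("SELECT crdb_internal.force_retry('1h'::INTERVAL)")
  end
rescue ActiveRecord::StatementInvalid
  # expected: the transaction should have been rolled back
end

raise "connection still inside a transaction" if conn.transaction_open?

salary = conn.select_value("SELECT salary FROM developers WHERE id = 1").to_i
raise "transient value leaked: #{salary}" if salary == 200_000
```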
I've moved this issue to the activerecord-cockroachdb-adapter repo. It seems like an issue specific to the adapter. @keithdoggett when you have time to get into the backlog, I think it would be valuable to look into this more.
Refs: cockroachdb/cockroach#68363 (ConcurrentTransactionTest#test_transaction_isolation__read_committed, first hit Aug 12)

Set up a table:
In shell 1, type:
In shell 2, run:
Back to shell 1:
Now in shell 2, we see a salary 200000, which is a dirty read (we never committed a salary of 200000):
This can take a bit of time to reproduce (i've had to run it a few times for each reproduction), but an external ORM's test suite, the contributor to that ORM, and I have all hit it multiple times, so it is not a fluke.
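A hedged reconstruction of the two-session flow described in the steps above, written with the `pg` gem; the table name, column, and values are assumptions -- the discussion above only establishes that two UPDATE statements run inside one transaction and that 200000 is never the committed result:

```ruby
# Hypothetical two-session reconstruction of the shell repro (names/values assumed).
require "pg"

connect = -> { PG.connect(host: "localhost", port: 26257, dbname: "defaultdb", user: "root") }

setup = connect.call
setup.exec("CREATE TABLE IF NOT EXISTS employees (id INT PRIMARY KEY, salary INT)")
setup.exec("UPSERT INTO employees VALUES (1, 100000)")

shell1 = connect.call # plays the role of "shell 1"
shell2 = connect.call # plays the role of "shell 2"

# Shell 1: a transaction that passes through 200000 but never commits it as the result.
shell1.exec("BEGIN")
shell1.exec("UPDATE employees SET salary = 200000 WHERE id = 1")
shell1.exec("UPDATE employees SET salary = 100000 WHERE id = 1")

# Shell 2: start a read while shell 1's transaction is still open; it may block on the
# write intent until that transaction finishes.
reader = Thread.new { shell2.exec("SELECT salary FROM employees WHERE id = 1").getvalue(0, 0) }

# Back to shell 1: finish the transaction.
shell1.exec("COMMIT")

salary = reader.value
puts "shell 2 saw salary=#{salary}" # 200000 here would be the dirty read described above
```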