config.py.in: use SERIALIZABLE isolation level for the db #93
Conversation
By default postgres uses the READ COMMITTED isolation level, which allows transactions to read data that may be modified concurrently by another transaction [0].

When teuthology queues a job, it adds it to beanstalkd, then reports that new job id to paddles. In handling this request, paddles:

1) checks for a job with the same id
2) adds a new row in the jobs table for this job

On the server side, when teuthology-dispatcher reads from the queue, it also updates the job status in paddles, going through the same request handling as above.

When both requests are run simultaneously, both may do step (1) and find no existing job, then both proceed to step (2) and each adds a new row.

With SERIALIZABLE isolation, this is impossible - the 2nd transaction will be aborted when the job is added by the 1st transaction. This makes DB operations easier to reason about, and since this db is not performance-sensitive, it is a reasonable default for all transactions.

This may also address some races we've seen with multiple jobs thinking they locked the same machine. I don't have evidence of that race, though it seems possible based on the implementation of the nodes model.

Both the node locking and job updating APIs retry on the client side, so a transaction hitting a conflict and returning an error shouldn't cause any issues.

The result of this race adding jobs is duplicate DB records for each job, like these:

    SELECT job_id, posted, updated, status FROM jobs
    WHERE run_id = X
      AND job_id IN (SELECT job_id FROM jobs WHERE run_id = X AND status = 'queued')
    ORDER BY posted ASC;

     job_id  |           posted           |          updated           | status
    ---------+----------------------------+----------------------------+--------
     5931205 | 2021-03-03 20:53:52.124034 | 2021-03-03 20:53:52.124047 | queued
     5931205 | 2021-03-03 20:53:52.131244 | 2021-03-03 22:01:12.236465 | pass
     5931209 | 2021-03-03 20:53:54.146957 | 2021-03-03 20:53:54.146968 | queued
     5931209 | 2021-03-03 20:53:54.147755 | 2021-03-03 21:52:56.254025 | pass
     5931223 | 2021-03-03 20:54:00.96613  | 2021-03-03 20:54:00.966142 | queued
     5931223 | 2021-03-03 20:54:00.971739 | 2021-03-03 22:03:12.814133 | pass
     5931236 | 2021-03-03 20:54:07.313768 | 2021-03-03 20:54:07.313778 | queued
     5931236 | 2021-03-03 20:54:07.319185 | 2021-03-04 04:54:00.689333 | pass
     5931264 | 2021-03-03 20:54:20.714723 | 2021-03-03 20:54:20.714736 | queued
     5931264 | 2021-03-03 20:54:20.715809 | 2021-03-03 21:28:38.805917 | pass
     5931273 | 2021-03-03 20:54:25.00497  | 2021-03-03 20:54:25.004981 | queued
     5931273 | 2021-03-03 20:54:25.019827 | 2021-03-03 21:29:41.890552 | pass
     5931277 | 2021-03-03 20:54:26.963148 | 2021-03-03 20:54:26.96316  | queued
     5931277 | 2021-03-03 20:54:26.965546 | 2021-03-03 21:20:27.454767 | pass
     5931286 | 2021-03-03 20:54:31.254743 | 2021-03-03 20:54:31.254755 | queued
     5931286 | 2021-03-03 20:54:31.258285 | 2021-03-03 21:45:30.325423 | pass
     5931304 | 2021-03-03 20:54:39.934984 | 2021-03-03 20:54:39.934996 | queued
     5931304 | 2021-03-03 20:54:39.935474 | 2021-03-03 22:41:53.914094 | pass
     5931317 | 2021-03-03 20:54:46.290555 | 2021-03-03 20:54:46.290566 | queued
     5931317 | 2021-03-03 20:54:46.294983 | 2021-03-03 22:12:38.405483 | pass
     5931322 | 2021-03-03 20:54:48.68503  | 2021-03-03 20:54:48.685043 | queued
     5931322 | 2021-03-03 20:54:48.694535 | 2021-03-03 22:01:12.71814  | pass
     5931326 | 2021-03-03 20:54:51.643081 | 2021-03-03 20:54:51.643105 | queued
     5931326 | 2021-03-03 20:54:51.670604 | 2021-03-03 22:09:21.44187  | pass
     5931331 | 2021-03-03 20:54:54.922704 | 2021-03-03 20:54:54.922716 | queued
     5931331 | 2021-03-03 20:54:54.925522 | 2021-03-03 22:00:25.523793 | pass
     5931347 | 2021-03-03 20:55:03.562209 | 2021-03-03 20:55:03.56222  | queued
     5931347 | 2021-03-03 20:55:03.564115 | 2021-03-03 21:58:32.887666 | pass
     5931356 | 2021-03-03 20:55:07.922281 | 2021-03-03 20:55:07.922293 | queued
     5931356 | 2021-03-03 20:55:07.922722 | 2021-03-03 21:57:17.223546 | pass
     5931365 | 2021-03-03 20:55:12.282366 | 2021-03-03 21:52:59.552419 | fail
     5931365 | 2021-03-03 20:55:12.31297  | 2021-03-03 20:55:12.31298  | queued
     5931379 | 2021-03-03 20:55:19.008472 | 2021-03-03 22:04:57.608941 | pass
     5931379 | 2021-03-03 20:55:19.019315 | 2021-03-03 20:55:19.019328 | queued
     5931380 | 2021-03-03 20:55:19.505256 | 2021-03-03 20:55:19.505267 | queued
     5931380 | 2021-03-03 20:55:19.525025 | 2021-03-03 21:53:02.772712 | pass

Note that sometimes the queued job is posted first, and sometimes the job with the correctly updated status is first, demonstrating the race between teuthology-schedule and teuthology-dispatcher both adding the job to paddles. There is a secondary race that needs to be fixed in teuthology, to prevent teuthology-schedule from updating the status to 'queued' after the job has begun.

Pulpito displays these jobs as queued on the queue page [1] and the run view [2], but as passed or failed on the individual job page [3], causing confusion about the actual state of the queue.

[0] https://www.postgresql.org/docs/9.5/transaction-iso.html
[1] https://pulpito.ceph.com/queue/
[2] https://pulpito.ceph.com/jdillaman-2021-03-03_15:53:41-rbd-wip-jd-testing-pacific-testing-basic-smithi/
[3] https://pulpito.ceph.com/jdillaman-2021-03-03_15:53:41-rbd-wip-jd-testing-pacific-testing-basic-smithi/5931205/

Signed-off-by: Josh Durgin <[email protected]>
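For reference, a minimal sketch of what pinning the isolation level amounts to in SQLAlchemy. The URL and dict keys below are illustrative assumptions, not the literal contents of config.py.in; the real change is in this PR's diff.

    # Sketch only: SQLAlchemy can pin an isolation level for every connection
    # the engine hands out, either directly via create_engine()...
    from sqlalchemy import create_engine

    engine = create_engine(
        'postgresql://paddles@localhost/paddles',  # placeholder URL
        isolation_level='SERIALIZABLE',            # postgres defaults to READ COMMITTED
    )

    # ...or, in a pecan-style app that passes its engine settings through to
    # create_engine, by adding a single key to the config dict (key names here
    # are assumptions):
    sqlalchemy = {
        'url': 'postgresql://paddles@localhost/paddles',  # placeholder
        'isolation_level': 'SERIALIZABLE',
    }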
Oddly enough, we had a similar race that I fixed incorrectly in the original lock server - ceph/teuthology@b1a0c1a - that may have made the issue rarer, but there was still no guarantee the reads and writes implied by that single update were atomic, due to the lower isolation level.
Did you test this by manually implementing it?
@djgalloway I added it to the paddles config in sepia but didn't want to restart it last night when no one was around.
Yeah, good call. We'll need to pause the queue and wait for things to quiet down before we can test. Do you want me to schedule a downtime?
@djgalloway Yeah, if it wouldn't take too much time. Don't want to disrupt all the pacific + stable release testing going on through the rest of the month.
How big of an issue is this? We could wait till April. Kind of up to @yuriw.
I think it can wait - the queued jobs are cosmetic (just causing some confusion), and the locked machines aren't going wrong too frequently.
It would be great if we found a reproducer for this, so we can test in an environment isolated from Sepia.
The finished jobs masquerading as queued reproduce pretty frequently in pulpito, so I think we'll see whether this works pretty quickly once it's deployed in sepia. We may want to think about a way to redeploy paddles without interrupting service - there are a few other bugs popping up (500 errors updating jobs) and more changes coming in the future.
This requires using postgres (or another DB that defaults to non-serializable isolation), running paddles via gunicorn with multiple workers, and keeping postgres in non-serializable mode. This setup reproduces multiple rows for the same job every time for me. Inspired by test_nodes_race.py.

Signed-off-by: Josh Durgin <[email protected]>
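For context, a rough sketch of the kind of reproducer being described (this is not the test added in this PR; the paddles URL, run name, endpoint path, and payload are assumptions, so adjust them to the real API):

    # Fire the same job report at a paddles instance (running under gunicorn
    # with several workers) from two threads at once, then inspect the jobs
    # table for duplicate rows.
    import concurrent.futures

    import requests

    PADDLES = 'http://localhost:8080'   # assumed local paddles instance
    RUN = 'example-run'                 # hypothetical run name

    def report(status):
        # teuthology-schedule and teuthology-dispatcher effectively both do
        # this for the same job_id at nearly the same time.
        resp = requests.post('%s/runs/%s/jobs/' % (PADDLES, RUN),
                             json={'job_id': '1', 'status': status})
        return resp.status_code

    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        print(list(pool.map(report, ['queued', 'running'])))

    # Under READ COMMITTED both inserts can succeed, leaving two rows for job 1;
    # under SERIALIZABLE one transaction is aborted and its client must retry.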
This turned out to be much easier to reproduce than I thought - added a test based on @zmc's test_nodes_race. I tried a few approaches to handle the conflicts by retrying within paddles, using a decorator or manually handling the whole transaction within a retry loop; however, the structure of a pecan-based app makes this very complex, effectively requiring a rewrite to be able to retry all of the db reads and writes together in one place. Instead, it's much simpler to set the db engine default to serializable to avoid the duplicate rows, and have the paddles clients retry requests that fail due to conflict. The test does this, adding a random delay to avoid conflicts the 2nd time. In practice, paddles doesn't have highly-contended updates, so this approach shouldn't cause too many delays. Errors due to conflicts like these are causing https://tracker.ceph.com/issues/49864 - adding retries to teuthology's writes to paddles is the 2nd half of the fix: ceph/teuthology#1633
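As a rough illustration of the client-side retry strategy described above (not teuthology's actual implementation, which lives in ceph/teuthology#1633; the status-code handling and backoff range here are assumptions):

    import random
    import time

    import requests

    def post_with_retry(url, payload, attempts=5):
        # Retry a paddles write a few times; a serialization conflict under
        # SERIALIZABLE isolation surfaces to the client as a server error.
        for _ in range(attempts):
            resp = requests.post(url, json=payload)
            if resp.status_code < 500:
                return resp
            # Sleep a small random amount so two racing writers are unlikely
            # to collide again on the next attempt.
            time.sleep(random.uniform(0.5, 2.0))
        return resp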
Here's an example test run without serializable isolation:

    select id, job_id, status from jobs where run_id = 63 ORDER BY job_id;

      id  | job_id | status
    ------+--------+---------
     1942 |      1 | running
     1943 |      1 | running
     1940 |      1 | queued
     1941 |      1 | queued
     1945 |      2 | queued
     1971 |      2 | queued
     1947 |      2 | running
     1955 |      2 | running
     1966 |      3 | running
     1951 |      3 | queued
     1962 |      3 | queued
     1961 |      3 | running
     1946 |      4 | running
     1969 |      4 | running
     1965 |      4 | queued
     1958 |      4 | queued
     1957 |      5 | queued
     1973 |      5 | queued
     1970 |      5 | running
     1959 |      5 | running
     1954 |      6 | queued
     1963 |      6 | running
     1964 |      6 | queued
     1972 |      6 | running
     1944 |      7 | queued
     1967 |      7 | running
     1960 |      8 | queued
     1956 |      8 | queued
     1949 |      8 | running
     1952 |      8 | running
     1968 |      9 | queued
     1953 |      9 | running
     1950 |      9 | running
     1948 |      9 | queued
    (34 rows)

and with serializable isolation (everything else equal):

    select id, job_id, status from jobs where run_id = 59 ORDER BY job_id;

      id  | job_id | status
    ------+--------+---------
     1877 |      1 | queued
     1880 |      2 | queued
     1882 |      3 | queued
     1885 |      4 | running
     1884 |      5 | running
     1878 |      6 | running
     1879 |      7 | running
     1883 |      8 | queued
     1881 |      9 | queued
    (9 rows)
Now that ceph/teuthology#1642 has been deployed as well, this is looking good with multiple gunicorn workers. No more dispatcher errors or failures from paddles in runs that include that change.