
config.py.in: use SERIALIZABLE isolation level for the db #93

Merged
merged 2 commits on May 5, 2021

Conversation

jdurgin
Member

@jdurgin jdurgin commented Mar 16, 2021

By default postgres uses the READ COMMITTED isolation level,
which allows transactions to read data that may be modified
concurrently by another transaction [0].

When teuthology queues a job, it adds it to beanstalkd, then
reports that new job id to paddles. In handling this request,
paddles:

  1. checks for a job with the same id
  2. adds a new row in the jobs table for this job

On the server side, when teuthology-dispatcher reads from the
queue, it also updates the job status in paddles, going through
the same request handling as above.

When both requests are run simultaneously, both requests may do
step (1) and record no existing job, then both proceed to
step (2) and add a new row.
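For illustration, a minimal sketch of that check-then-insert pattern in Python with SQLAlchemy; the table definition, connection URL, and function here are hypothetical stand-ins, not paddles' actual code:

from sqlalchemy import (MetaData, Table, Column, Integer, String,
                        create_engine, insert, select)

engine = create_engine("postgresql://localhost/paddles_test")  # hypothetical URL
metadata = MetaData()
jobs = Table(
    "jobs", metadata,
    Column("id", Integer, primary_key=True),
    Column("run_id", Integer),
    Column("job_id", String),
    Column("status", String),
)

def report_job(run_id, job_id, status):
    # Step (1): look for an existing row; step (2): insert one if none was found.
    # Under READ COMMITTED, two concurrent callers can both see "no row" in
    # step (1) and then both insert in step (2), leaving duplicate rows.
    with engine.begin() as conn:
        existing = conn.execute(
            select(jobs.c.id).where(jobs.c.run_id == run_id,
                                    jobs.c.job_id == job_id)
        ).first()
        if existing is None:
            conn.execute(insert(jobs).values(
                run_id=run_id, job_id=job_id, status=status))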

With SERIALIZABLE isolation, this is impossible - the 2nd
transaction will be aborted when the job is added by the 1st
transaction. This makes DB operations easier to reason about, and
since this db is not performance-sensitive, is a reasonable
default for all transactions.
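As a sketch of what this implies, assuming the sqlalchemy settings from config.py.in are passed straight through to SQLAlchemy's create_engine (the exact config keys may differ):

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://localhost/paddles",   # hypothetical connection URL
    isolation_level="SERIALIZABLE",     # make every transaction SERIALIZABLE
)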

This may also address some races we've seen with multiple jobs
thinking they locked the same machine. I don't have evidence of that
race, though it seems possible based on the implementation of the
nodes model.

Both node locking and job updating APIs retry on the client side, so a
transaction hitting a conflict and returning an error shouldn't cause
any issues.

The result of this race adding jobs is duplicate DB records for each
job like these:

SELECT job_id, posted, updated, status FROM jobs WHERE run_id = X AND job_id IN (SELECT job_id FROM jobs WHERE run_id = X AND status = 'queued') ORDER BY posted ASC;
 job_id  |           posted           |          updated           | status
---------+----------------------------+----------------------------+--------
 5931205 | 2021-03-03 20:53:52.124034 | 2021-03-03 20:53:52.124047 | queued
 5931205 | 2021-03-03 20:53:52.131244 | 2021-03-03 22:01:12.236465 | pass
 5931209 | 2021-03-03 20:53:54.146957 | 2021-03-03 20:53:54.146968 | queued
 5931209 | 2021-03-03 20:53:54.147755 | 2021-03-03 21:52:56.254025 | pass
 5931223 | 2021-03-03 20:54:00.96613  | 2021-03-03 20:54:00.966142 | queued
 5931223 | 2021-03-03 20:54:00.971739 | 2021-03-03 22:03:12.814133 | pass
 5931236 | 2021-03-03 20:54:07.313768 | 2021-03-03 20:54:07.313778 | queued
 5931236 | 2021-03-03 20:54:07.319185 | 2021-03-04 04:54:00.689333 | pass
 5931264 | 2021-03-03 20:54:20.714723 | 2021-03-03 20:54:20.714736 | queued
 5931264 | 2021-03-03 20:54:20.715809 | 2021-03-03 21:28:38.805917 | pass
 5931273 | 2021-03-03 20:54:25.00497  | 2021-03-03 20:54:25.004981 | queued
 5931273 | 2021-03-03 20:54:25.019827 | 2021-03-03 21:29:41.890552 | pass
 5931277 | 2021-03-03 20:54:26.963148 | 2021-03-03 20:54:26.96316  | queued
 5931277 | 2021-03-03 20:54:26.965546 | 2021-03-03 21:20:27.454767 | pass
 5931286 | 2021-03-03 20:54:31.254743 | 2021-03-03 20:54:31.254755 | queued
 5931286 | 2021-03-03 20:54:31.258285 | 2021-03-03 21:45:30.325423 | pass
 5931304 | 2021-03-03 20:54:39.934984 | 2021-03-03 20:54:39.934996 | queued
 5931304 | 2021-03-03 20:54:39.935474 | 2021-03-03 22:41:53.914094 | pass
 5931317 | 2021-03-03 20:54:46.290555 | 2021-03-03 20:54:46.290566 | queued
 5931317 | 2021-03-03 20:54:46.294983 | 2021-03-03 22:12:38.405483 | pass
 5931322 | 2021-03-03 20:54:48.68503  | 2021-03-03 20:54:48.685043 | queued
 5931322 | 2021-03-03 20:54:48.694535 | 2021-03-03 22:01:12.71814  | pass
 5931326 | 2021-03-03 20:54:51.643081 | 2021-03-03 20:54:51.643105 | queued
 5931326 | 2021-03-03 20:54:51.670604 | 2021-03-03 22:09:21.44187  | pass
 5931331 | 2021-03-03 20:54:54.922704 | 2021-03-03 20:54:54.922716 | queued
 5931331 | 2021-03-03 20:54:54.925522 | 2021-03-03 22:00:25.523793 | pass
 5931347 | 2021-03-03 20:55:03.562209 | 2021-03-03 20:55:03.56222  | queued
 5931347 | 2021-03-03 20:55:03.564115 | 2021-03-03 21:58:32.887666 | pass
 5931356 | 2021-03-03 20:55:07.922281 | 2021-03-03 20:55:07.922293 | queued
 5931356 | 2021-03-03 20:55:07.922722 | 2021-03-03 21:57:17.223546 | pass
 5931365 | 2021-03-03 20:55:12.282366 | 2021-03-03 21:52:59.552419 | fail
 5931365 | 2021-03-03 20:55:12.31297  | 2021-03-03 20:55:12.31298  | queued
 5931379 | 2021-03-03 20:55:19.008472 | 2021-03-03 22:04:57.608941 | pass
 5931379 | 2021-03-03 20:55:19.019315 | 2021-03-03 20:55:19.019328 | queued
 5931380 | 2021-03-03 20:55:19.505256 | 2021-03-03 20:55:19.505267 | queued
 5931380 | 2021-03-03 20:55:19.525025 | 2021-03-03 21:53:02.772712 | pass

Note that sometimes the queued job is posted first, and sometimes the
job with the correctly updated status is first, demonstrating the race
between teuthology-schedule and teuthology-dispatcher both adding the
job to paddles. There is a secondary race that needs to be fixed in
teuthology, to prevent teuthology-schedule from updating the status to
'queued' after the job has begun.

Pulpito displays these jobs as queued on the queue page [1], and the
run view [2], but as passed or failed on the individual job page [3],
causing confusion about the actual state of the queue.

[0] https://www.postgresql.org/docs/9.5/transaction-iso.html
[1] https://pulpito.ceph.com/queue/
[2] https://pulpito.ceph.com/jdillaman-2021-03-03_15:53:41-rbd-wip-jd-testing-pacific-testing-basic-smithi/
[3] https://pulpito.ceph.com/jdillaman-2021-03-03_15:53:41-rbd-wip-jd-testing-pacific-testing-basic-smithi/5931205/

Signed-off-by: Josh Durgin <[email protected]>

@jdurgin
Member Author

jdurgin commented Mar 16, 2021

Oddly enough we had a similar race that I fixed incorrectly in the original lock server (ceph/teuthology@b1a0c1a). That may have made the issue rarer, but there was still no guarantee that the reads and writes implied by that single update were atomic, given the lower isolation level.

@djgalloway

Did you test this by manually implementing it?

@jdurgin
Member Author

jdurgin commented Mar 16, 2021

@djgalloway I added it to the paddles config in sepia but didn't want to restart it last night when no one was around

@djgalloway

@djgalloway I added it to the paddles config in sepia but didn't want to restart it last night when no one was around

Yeah good call. We'll need to pause the queue and wait for things to quiet down before we can test. Do you want me to schedule a downtime?

@jdurgin
Member Author

jdurgin commented Mar 16, 2021

@djgalloway Yeah, if it wouldn't take too much time. Don't want to disrupt all the pacific + stable release testing going on through the rest of the month.

@djgalloway

How big of an issue is this? We could wait till April. Kind of up to @yuriw

@jdurgin
Member Author

jdurgin commented Mar 16, 2021

I think it can wait; the queued jobs are cosmetic (just causing some confusion), and the duplicate machine locking isn't happening too frequently.

@kshtsk
Contributor

kshtsk commented Mar 16, 2021

It would be great if we found a reproducer for this, so we can test in an environment isolated from Sepia.
Other than that, it's a great finding.

@jdurgin
Member Author

jdurgin commented Mar 17, 2021

The finished jobs masquerading as queued reproduce pretty frequently in pulpito, so I think we'll see pretty quickly whether this works once it's deployed in sepia.

We may want to think about a way to redeploy paddles without interrupting service - there are a few other bugs popping up (500 errors updating jobs) and more changes coming in the future.

This requires using postgres or another DB that defaults to
non-serializable isolation, and running paddles via gunicorn with
multiple workers and keeping postgres in non-serializable mode. This
setup reproduces multiple rows for the same job every time for me.

Inspired by test_nodes_race.py.

Signed-off-by: Josh Durgin <[email protected]>
@jdurgin
Member Author

jdurgin commented Mar 21, 2021

This turned out to be much easier to reproduce than I thought - added a test based on @zmc's test_nodes_race.
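Roughly, a reproducer of that shape just fires the same job reports at a multi-worker paddles from several threads; the base URL, endpoint path, and payload fields below are assumptions for illustration, not the test's actual code:

import threading
import requests

BASE = "http://localhost:8080"   # hypothetical paddles instance behind gunicorn
RUN = "race-test-run"

def report(job_id, status):
    # Each request goes through the check-then-insert path described above.
    requests.post("%s/runs/%s/jobs/" % (BASE, RUN),
                  json={"job_id": job_id, "status": status})

threads = []
for status in ("queued", "running"):
    for job_id in range(1, 10):
        t = threading.Thread(target=report, args=(str(job_id), status))
        threads.append(t)
        t.start()
for t in threads:
    t.join()
# Without SERIALIZABLE isolation, the jobs table ends up with multiple rows
# per job_id, as in the query output further down.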

I tried a few approaches to handle the conflicts by retrying within paddles, using a decorator or manually wrapping the whole transaction in a retry loop. However, the structure of a pecan-based app makes this very complex, effectively requiring a rewrite to be able to retry all of the db reads and writes together in one place.

Instead, it's much simpler to set the db engine default to serializable to avoid the duplicate rows, and have the paddles clients retry requests that fail due to a conflict. The test does this, adding a random delay before retrying so the second attempt doesn't conflict again. In practice, paddles doesn't have highly-contended updates, so this approach shouldn't cause too many delays.
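The client-side retry amounts to something like the following sketch (the URL, payload, and the decision to retry on 5xx responses are assumptions, not teuthology's actual code):

import random
import time
import requests

def update_with_retry(url, payload, attempts=5):
    # Retry a paddles update that fails due to a serialization conflict,
    # sleeping a random amount first so the retries are unlikely to collide.
    for _ in range(attempts):
        resp = requests.put(url, json=payload)
        if resp.status_code < 500:      # success, or an error a retry won't fix
            return resp
        time.sleep(random.uniform(0, 1))
    return resp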

Errors due to conflicts like these are causing https://tracker.ceph.com/issues/49864 - adding retries to teuthology's writes to paddles is the 2nd half of the fix: ceph/teuthology#1633

@jdurgin
Member Author

jdurgin commented Mar 21, 2021

Here's an example test run without serializable isolation:

select id, job_id, status from jobs where run_id = 63 ORDER BY job_id ;
  id  | job_id | status
------+--------+---------
 1942 | 1      | running
 1943 | 1      | running
 1940 | 1      | queued
 1941 | 1      | queued
 1945 | 2      | queued
 1971 | 2      | queued
 1947 | 2      | running
 1955 | 2      | running
 1966 | 3      | running
 1951 | 3      | queued
 1962 | 3      | queued
 1961 | 3      | running
 1946 | 4      | running
 1969 | 4      | running
 1965 | 4      | queued
 1958 | 4      | queued
 1957 | 5      | queued
 1973 | 5      | queued
 1970 | 5      | running
 1959 | 5      | running
 1954 | 6      | queued
 1963 | 6      | running
 1964 | 6      | queued
 1972 | 6      | running
 1944 | 7      | queued
 1967 | 7      | running
 1960 | 8      | queued
 1956 | 8      | queued
 1949 | 8      | running
 1952 | 8      | running
 1968 | 9      | queued
 1953 | 9      | running
 1950 | 9      | running
 1948 | 9      | queued
(34 rows)

and with serializable isolation (everything else equal):

select id, job_id, status from jobs where run_id = 59 ORDER BY job_id ;
  id  | job_id | status  
------+--------+---------
 1877 | 1      | queued
 1880 | 2      | queued
 1882 | 3      | queued
 1885 | 4      | running
 1884 | 5      | running
 1878 | 6      | running
 1879 | 7      | running
 1883 | 8      | queued
 1881 | 9      | queued
(9 rows)

@jdurgin jdurgin requested a review from liewegas March 21, 2021 22:47
@jdurgin
Member Author

jdurgin commented May 5, 2021

Now that ceph/teuthology#1642 has been deployed as well, this is looking good with multiple gunicorn workers. No more dispatcher errors or failures from paddles in runs that include that change.
