
config.py.in: use SERIALIZABLE isolation level for the db #93

Merged
merged 2 commits on May 5, 2021

Conversation

jdurgin
Member

@jdurgin jdurgin commented Mar 16, 2021

By default postgres uses the READ COMMITTED isolation level,
which allows transactions to read data that may be modified
concurrently by another transaction [0].

When teuthology queues a job, it adds it to beanstalkd, then
reports that new job id to paddles. In handling this request,
paddles:

  1. checks for a job with the same id
  2. adds a new row in the jobs table for this job

On the server side, when teuthology-dispatcher reads from the
queue, it also updates the job status in paddles, going through
the same request handling as above.

When both requests are run simultaneously, both requests may do
step (1) and record no existing job, then both proceed to
step (2) and add a new row.
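For illustration, a minimal sketch of that check-then-insert pattern in Python with SQLAlchemy; the table definition, connection URL, and function here are hypothetical stand-ins, not paddles' actual code:

from sqlalchemy import (MetaData, Table, Column, Integer, String,
                        create_engine, insert, select)

engine = create_engine("postgresql://localhost/paddles_test")  # hypothetical URL
metadata = MetaData()
jobs = Table(
    "jobs", metadata,
    Column("id", Integer, primary_key=True),
    Column("run_id", Integer),
    Column("job_id", String),
    Column("status", String),
)

def report_job(run_id, job_id, status):
    # Step (1): look for an existing row; step (2): insert one if none was found.
    # Under READ COMMITTED, two concurrent callers can both see "no row" in
    # step (1) and then both insert in step (2), leaving duplicate rows.
    with engine.begin() as conn:
        existing = conn.execute(
            select(jobs.c.id).where(jobs.c.run_id == run_id,
                                    jobs.c.job_id == job_id)
        ).first()
        if existing is None:
            conn.execute(insert(jobs).values(
                run_id=run_id, job_id=job_id, status=status))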

With SERIALIZABLE isolation, this is impossible - the 2nd
transaction will be aborted when the job is added by the 1st
transaction. This makes DB operations easier to reason about, and
since this db is not performance-sensitive, is a reasonable
default for all transactions.
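As a sketch of what this implies, assuming the sqlalchemy settings from config.py.in are passed straight through to SQLAlchemy's create_engine (the exact config keys may differ):

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://localhost/paddles",   # hypothetical connection URL
    isolation_level="SERIALIZABLE",     # make every transaction SERIALIZABLE
)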

This may also address some races we've seen with multiple jobs
thinking they locked the same machine. I don't have evidence of that
race, though it seems possible based on the implementation of the
nodes model.

Both node locking and job updating APIs retry on the client side, so a
transaction hitting a conflict and returning an error shouldn't cause
any issues.

The result of this race adding jobs is duplicate DB records for each
job like these:

SELECT job_id, posted, updated, status FROM jobs WHERE run_id = X AND job_id IN (SELECT job_id FROM jobs WHERE run_id = X AND status = 'queued') ORDER BY posted ASC;
 job_id  |           posted           |          updated           | status
---------+----------------------------+----------------------------+--------
 5931205 | 2021-03-03 20:53:52.124034 | 2021-03-03 20:53:52.124047 | queued
 5931205 | 2021-03-03 20:53:52.131244 | 2021-03-03 22:01:12.236465 | pass
 5931209 | 2021-03-03 20:53:54.146957 | 2021-03-03 20:53:54.146968 | queued
 5931209 | 2021-03-03 20:53:54.147755 | 2021-03-03 21:52:56.254025 | pass
 5931223 | 2021-03-03 20:54:00.96613  | 2021-03-03 20:54:00.966142 | queued
 5931223 | 2021-03-03 20:54:00.971739 | 2021-03-03 22:03:12.814133 | pass
 5931236 | 2021-03-03 20:54:07.313768 | 2021-03-03 20:54:07.313778 | queued
 5931236 | 2021-03-03 20:54:07.319185 | 2021-03-04 04:54:00.689333 | pass
 5931264 | 2021-03-03 20:54:20.714723 | 2021-03-03 20:54:20.714736 | queued
 5931264 | 2021-03-03 20:54:20.715809 | 2021-03-03 21:28:38.805917 | pass
 5931273 | 2021-03-03 20:54:25.00497  | 2021-03-03 20:54:25.004981 | queued
 5931273 | 2021-03-03 20:54:25.019827 | 2021-03-03 21:29:41.890552 | pass
 5931277 | 2021-03-03 20:54:26.963148 | 2021-03-03 20:54:26.96316  | queued
 5931277 | 2021-03-03 20:54:26.965546 | 2021-03-03 21:20:27.454767 | pass
 5931286 | 2021-03-03 20:54:31.254743 | 2021-03-03 20:54:31.254755 | queued
 5931286 | 2021-03-03 20:54:31.258285 | 2021-03-03 21:45:30.325423 | pass
 5931304 | 2021-03-03 20:54:39.934984 | 2021-03-03 20:54:39.934996 | queued
 5931304 | 2021-03-03 20:54:39.935474 | 2021-03-03 22:41:53.914094 | pass
 5931317 | 2021-03-03 20:54:46.290555 | 2021-03-03 20:54:46.290566 | queued
 5931317 | 2021-03-03 20:54:46.294983 | 2021-03-03 22:12:38.405483 | pass
 5931322 | 2021-03-03 20:54:48.68503  | 2021-03-03 20:54:48.685043 | queued
 5931322 | 2021-03-03 20:54:48.694535 | 2021-03-03 22:01:12.71814  | pass
 5931326 | 2021-03-03 20:54:51.643081 | 2021-03-03 20:54:51.643105 | queued
 5931326 | 2021-03-03 20:54:51.670604 | 2021-03-03 22:09:21.44187  | pass
 5931331 | 2021-03-03 20:54:54.922704 | 2021-03-03 20:54:54.922716 | queued
 5931331 | 2021-03-03 20:54:54.925522 | 2021-03-03 22:00:25.523793 | pass
 5931347 | 2021-03-03 20:55:03.562209 | 2021-03-03 20:55:03.56222  | queued
 5931347 | 2021-03-03 20:55:03.564115 | 2021-03-03 21:58:32.887666 | pass
 5931356 | 2021-03-03 20:55:07.922281 | 2021-03-03 20:55:07.922293 | queued
 5931356 | 2021-03-03 20:55:07.922722 | 2021-03-03 21:57:17.223546 | pass
 5931365 | 2021-03-03 20:55:12.282366 | 2021-03-03 21:52:59.552419 | fail
 5931365 | 2021-03-03 20:55:12.31297  | 2021-03-03 20:55:12.31298  | queued
 5931379 | 2021-03-03 20:55:19.008472 | 2021-03-03 22:04:57.608941 | pass
 5931379 | 2021-03-03 20:55:19.019315 | 2021-03-03 20:55:19.019328 | queued
 5931380 | 2021-03-03 20:55:19.505256 | 2021-03-03 20:55:19.505267 | queued
 5931380 | 2021-03-03 20:55:19.525025 | 2021-03-03 21:53:02.772712 | pass

Note that sometimes the queued job is posted first, and sometimes the
job with the correctly updated status is first, demonstrating the race
between teuthology-schedule and teuthology-dispatcher both adding the
job to paddles. There is a secondary race that needs to be fixed in
teuthology, to prevent teuthology-schedule from updating the status to
'queued' after the job has begun.

Pulpito displays these jobs as queued on the queue page [1], and the
run view [2], but as passed or failed on the individual job page [3],
causing confusion about the actual state of the queue.

[0] https://www.postgresql.org/docs/9.5/transaction-iso.html
[1] https://pulpito.ceph.com/queue/
[2] https://pulpito.ceph.com/jdillaman-2021-03-03_15:53:41-rbd-wip-jd-testing-pacific-testing-basic-smithi/
[3] https://pulpito.ceph.com/jdillaman-2021-03-03_15:53:41-rbd-wip-jd-testing-pacific-testing-basic-smithi/5931205/

Signed-off-by: Josh Durgin <[email protected]>

@jdurgin
Member Author

jdurgin commented Mar 16, 2021

Oddly enough we had a similar race that I fixed incorrectly in the original lock server (ceph/teuthology@b1a0c1a). That may have made the issue rarer, but there was still no guarantee that the reads and writes implied by that single update were atomic, given the lower isolation level.

@djgalloway

Did you test this by manually implementing it?

@jdurgin
Member Author

jdurgin commented Mar 16, 2021

@djgalloway I added it to the paddles config in sepia but didn't want to restart it last night when no one was around

@djgalloway

@djgalloway I added it to the paddles config in sepia but didn't want to restart it last night when no one was around

Yeah good call. We'll need to pause the queue and wait for things to quiet down before we can test. Do you want me to schedule a downtime?

@jdurgin
Member Author

jdurgin commented Mar 16, 2021

@djgalloway Yeah, if it wouldn't take too much time. Don't want to disrupt all the pacific + stable release testing going on through the rest of the month.

@djgalloway

How big of an issue is this? We could wait till April. Kind of up to @yuriw

@jdurgin
Member Author

jdurgin commented Mar 16, 2021

I think it can wait; the queued jobs are cosmetic (just causing some confusion), and the duplicate machine locking isn't happening too frequently.

@kshtsk
Contributor

kshtsk commented Mar 16, 2021

It would be great if we found a reproducer for this, so we can test in an environment isolated from Sepia.
Other than that, it's a great finding.

@jdurgin
Member Author

jdurgin commented Mar 17, 2021

The finished jobs masquerading as queued reproduce pretty frequently in pulpito, so I think we'll see pretty quickly whether this works once it's deployed in sepia.

We may want to think about a way to redeploy paddles without interrupting service - there are a few other bugs popping up (500 errors updating jobs) and more changes coming in the future.

This requires using postgres or another DB that defaults to
non-serializable isolation, and running paddles via gunicorn with
multiple workers and keeping postgres in non-serializable mode. This
setup reproduces multiple rows for the same job every time for me.

Inspired by test_nodes_race.py.

Signed-off-by: Josh Durgin <[email protected]>
@jdurgin
Member Author

jdurgin commented Mar 21, 2021

This turned out to be much easier to reproduce than I thought - added a test based on @zmc's test_nodes_race.
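Roughly, a reproducer of that shape just fires the same job reports at a multi-worker paddles from several threads; the base URL, endpoint path, and payload fields below are assumptions for illustration, not the test's actual code:

import threading
import requests

BASE = "http://localhost:8080"   # hypothetical paddles instance behind gunicorn
RUN = "race-test-run"

def report(job_id, status):
    # Each request goes through the check-then-insert path described above.
    requests.post("%s/runs/%s/jobs/" % (BASE, RUN),
                  json={"job_id": job_id, "status": status})

threads = []
for status in ("queued", "running"):
    for job_id in range(1, 10):
        t = threading.Thread(target=report, args=(str(job_id), status))
        threads.append(t)
        t.start()
for t in threads:
    t.join()
# Without SERIALIZABLE isolation, the jobs table ends up with multiple rows
# per job_id, as in the query output further down.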

I tried a few approaches to handle the conflicts by retrying within paddles, using a decorator or manually wrapping the whole transaction in a retry loop. However, the structure of a pecan-based app makes this very complex, effectively requiring a rewrite to be able to retry all of the db reads and writes together in one place.

Instead, it's much simpler to set the db engine default to serializable to avoid the duplicate rows, and have the paddles clients retry requests that fail due to a conflict. The test does this, adding a random delay before retrying so the second attempt doesn't conflict again. In practice, paddles doesn't have highly-contended updates, so this approach shouldn't cause too many delays.
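The client-side retry amounts to something like the following sketch (the URL, payload, and the decision to retry on 5xx responses are assumptions, not teuthology's actual code):

import random
import time
import requests

def update_with_retry(url, payload, attempts=5):
    # Retry a paddles update that fails due to a serialization conflict,
    # sleeping a random amount first so the retries are unlikely to collide.
    for _ in range(attempts):
        resp = requests.put(url, json=payload)
        if resp.status_code < 500:      # success, or an error a retry won't fix
            return resp
        time.sleep(random.uniform(0, 1))
    return resp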

Errors due to conflicts like these are causing https://tracker.ceph.com/issues/49864 - adding retries to teuthology's writes to paddles is the 2nd half of the fix: ceph/teuthology#1633

@jdurgin
Member Author

jdurgin commented Mar 21, 2021

Here's an example test run without serializable isolation:

select id, job_id, status from jobs where run_id = 63 ORDER BY job_id ;
  id  | job_id | status
------+--------+---------
 1942 | 1      | running
 1943 | 1      | running
 1940 | 1      | queued
 1941 | 1      | queued
 1945 | 2      | queued
 1971 | 2      | queued
 1947 | 2      | running
 1955 | 2      | running
 1966 | 3      | running
 1951 | 3      | queued
 1962 | 3      | queued
 1961 | 3      | running
 1946 | 4      | running
 1969 | 4      | running
 1965 | 4      | queued
 1958 | 4      | queued
 1957 | 5      | queued
 1973 | 5      | queued
 1970 | 5      | running
 1959 | 5      | running
 1954 | 6      | queued
 1963 | 6      | running
 1964 | 6      | queued
 1972 | 6      | running
 1944 | 7      | queued
 1967 | 7      | running
 1960 | 8      | queued
 1956 | 8      | queued
 1949 | 8      | running
 1952 | 8      | running
 1968 | 9      | queued
 1953 | 9      | running
 1950 | 9      | running
 1948 | 9      | queued
(34 rows)

and with serializable isolation (everything else equal):

select id, job_id, status from jobs where run_id = 59 ORDER BY job_id ;
  id  | job_id | status  
------+--------+---------
 1877 | 1      | queued
 1880 | 2      | queued
 1882 | 3      | queued
 1885 | 4      | running
 1884 | 5      | running
 1878 | 6      | running
 1879 | 7      | running
 1883 | 8      | queued
 1881 | 9      | queued
(9 rows)

@jdurgin jdurgin requested a review from liewegas March 21, 2021 22:47
@jdurgin
Member Author

jdurgin commented May 5, 2021

Now that ceph/teuthology#1642 has been deployed as well, this is looking good with multiple gunicorn workers. No more dispatcher errors or failures from paddles in runs that include that change.
