Deprecate BatchOver in favor of in_batches(use_ranges: true) #136


@maximerety maximerety commented Feb 22, 2024

Starting from ActiveRecord 7.1, there's a built-in helper equivalent to what BatchOver does. Let's use it instead of maintaining our own implementation forever.

We need to keep BatchOver as long as compatibility with ActiveRecord < 7.1 is maintained though.

See:

If using ActiveRecord 7.1 or later, we would use the recommended built-in method in_batches with the use_ranges: true option, e.g.

User.in_batches(of: 100, use_ranges: true).each { |batch| ... }

Otherwise, we would still use BatchOver as a fallback:

SafePgMigrations::Helpers::BatchOver.new(User, of: 100).each_batch { |batch| ... }
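
The version switch described above could be wrapped in a single helper. This is only a sketch: `each_backfill_batch`, its `ar_version:` parameter and the `USE_RANGES_MIN_VERSION` constant are hypothetical names, not part of the gem. The two batching calls are the ones shown above.

```ruby
require 'rubygems' # provides Gem::Version

# First ActiveRecord version where in_batches accepts use_ranges: true.
USE_RANGES_MIN_VERSION = Gem::Version.new('7.1')

# Hypothetical helper: yields each batch of `relation` to the block.
# `ar_version` is injectable so both branches can be exercised without Rails;
# the default is only evaluated when the argument is omitted.
def each_backfill_batch(relation, of:, ar_version: ActiveRecord.version, &block)
  if ar_version >= USE_RANGES_MIN_VERSION
    # Built-in helper, ActiveRecord 7.1 or later.
    relation.in_batches(of: of, use_ranges: true).each(&block)
  else
    # Fallback for ActiveRecord < 7.1.
    SafePgMigrations::Helpers::BatchOver.new(relation, of: of).each_batch(&block)
  end
end
```

Usage would then look like `each_backfill_batch(User, of: 100) { |batch| ... }` regardless of the ActiveRecord version in use.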

Note that although both helpers are almost equivalent, there are small differences in the queries generated.

With the example code above, and assuming the users table contains 250 records, BatchOver would generate:

/* Get batch #1 */
SELECT "users".* FROM "users" ORDER BY "users"."id" ASC LIMIT 1
SELECT "users".* FROM "users" ORDER BY "users"."id" ASC LIMIT 1 OFFSET 100
SELECT "users".* FROM "users" WHERE "users"."id" >= 1 AND "users"."id" < 101 ORDER BY "users"."id" ASC
/* Do something with result */

/* Get batch #2 */
SELECT "users".* FROM "users" WHERE "users"."id" >= 101 ORDER BY "users"."id" ASC LIMIT 1
SELECT "users".* FROM "users" WHERE "users"."id" >= 101 ORDER BY "users"."id" ASC LIMIT 1 OFFSET 100
SELECT "users".* FROM "users" WHERE "users"."id" >= 101 AND "users"."id" < 201 ORDER BY "users"."id" ASC
/* Do something with result */

/* Get batch #3 */
SELECT "users".* FROM "users" WHERE "users"."id" >= 201 ORDER BY "users"."id" ASC LIMIT 1
SELECT "users".* FROM "users" WHERE "users"."id" >= 201 ORDER BY "users"."id" ASC LIMIT 1 OFFSET 100
SELECT "users".* FROM "users" WHERE "users"."id" >= 201 ORDER BY "users"."id" ASC
/* Do something with result */

/* No more batches */

Whereas in_batches(of: 100, use_ranges: true) would give:

/* Get batch #1 */
SELECT "users"."id" FROM "users" ORDER BY "users"."id" ASC LIMIT 100
SELECT "users".* FROM "users" WHERE "users"."id" <= 100 
/* Do something with result */

/* Get batch #2 */
SELECT "users"."id" FROM "users" WHERE "users"."id" > 100 ORDER BY "users"."id" ASC LIMIT 100 
SELECT "users".* FROM "users" WHERE "users"."id" > 100 AND "users"."id" <= 200
/* Do something with result */

/* Get batch #3 */
SELECT "users"."id" FROM "users" WHERE "users"."id" > 200 ORDER BY "users"."id" ASC LIMIT 100
SELECT "users".* FROM "users" WHERE "users"."id" > 200 AND "users"."id" <= 250
/* Do something with result */

/* No more batches */

This works exactly the same when passing any ActiveRecord::Relation object in place of the model User, e.g. User.where(condition: 'something'): the additional condition then appears in the WHERE clause of each query.

To select the id range of a batch (min id / max id), BatchOver generates two queries per batch and returns full records (not only ids), but retrieves only 2 records. Conversely, in_batches(use_ranges: true) generates a single query per batch and returns only ids (not full records), but returns the full list of ids in the batch instead of only the min/max.

I believe that trade-off is acceptable for our purpose, but that is debatable.
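
To make the trade-off concrete, here is a small in-memory simulation of the two boundary-selection strategies on the 250-record example. It is a sketch only: the method names are hypothetical, the table is modeled as a sorted array of ids, and it mirrors the queries listed above rather than the actual implementation of either helper.

```ruby
# In-memory stand-in for the "users" table: ids 1..250, as in the example.
IDS = (1..250).to_a.freeze

# BatchOver-style boundaries: two single-row lookups per batch.
def batch_over_ranges(ids, of:)
  ranges = []
  queries = 0
  rows = 0
  lower = nil
  loop do
    scope = lower ? ids.select { |id| id >= lower } : ids
    first = scope.first            # ... ORDER BY id ASC LIMIT 1
    queries += 1
    break if first.nil?
    rows += 1
    next_first = scope[of]         # ... ORDER BY id ASC LIMIT 1 OFFSET of
    queries += 1
    rows += 1 if next_first
    ranges << [first, next_first]  # id >= first AND id < next_first (open-ended if nil)
    break if next_first.nil?
    lower = next_first
  end
  { ranges: ranges, queries: queries, rows: rows }
end

# in_batches(use_ranges: true)-style boundaries: one id-list query per batch.
def in_batches_ranges(ids, of:)
  ranges = []
  queries = 0
  rows = 0
  cursor = nil
  loop do
    scope = cursor ? ids.select { |id| id > cursor } : ids
    batch_ids = scope.first(of)    # SELECT id ... ORDER BY id ASC LIMIT of
    queries += 1
    rows += batch_ids.size
    break if batch_ids.empty?
    ranges << [cursor, batch_ids.last] # id > cursor AND id <= last
    break if batch_ids.size < of
    cursor = batch_ids.last
  end
  { ranges: ranges, queries: queries, rows: rows }
end

p batch_over_ranges(IDS, of: 100)
# ranges [[1, 101], [101, 201], [201, nil]], 6 queries, 5 rows returned
p in_batches_ranges(IDS, of: 100)
# ranges [[nil, 100], [100, 200], [200, 250]], 3 queries, 250 ids returned
```

In this model BatchOver issues twice as many boundary queries but only ever receives one row per query, while the built-in strategy halves the query count at the cost of receiving every id in the table once.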

@maximerety maximerety marked this pull request as ready for review February 23, 2024 10:05
@maximerety maximerety requested a review from a team as a code owner February 23, 2024 10:05
@maximerety maximerety force-pushed the prefer-in-batches-use-ranges branch from 92b4694 to af7d561 Compare February 25, 2024 20:42
@maximerety

Benchmark

Scenario

The benchmark iterates over a table having 30M records, in batches of 10k records.

The table has an auto-incrementing index starting from 1 and 65 columns (so returning full records is at least a little costly).

Measurements

The benchmark is executed on a single machine, with a round-trip-time < 1ms.

Network (I/O) stats obtained with:

docker stats --no-stream postgres --format 'table {{.NetIO}}'

Postgres I/O stats obtained with:

SELECT
  heap_blks_read, heap_blks_hit, idx_blks_read, idx_blks_hit
FROM
  pg_statio_user_tables
WHERE
  relname = '<the-table>';

The metrics below only account for the time spent generating the scopes, not using them. For example, had we used the scopes generated with use_ranges: false, the duration and network usage would have been even worse, because the generated queries include long lists of ids.

Results

| Batching method | Duration | Network (I/O) | heap_blks_read/hit | idx_blks_read/hit |
| --- | --- | --- | --- | --- |
| BatchOver | 22.4 s | 16 MB / 16 MB | 372k / 6k | 96k / 1743k |
| BatchOver + optim | 11.9 s | 3 MB / 1 MB | 0k / 6k | 96k / 1743k |
| in_batches(use_ranges: false) | 21.7 s | 4 MB / 560 MB | 0k / 3k | 96k / 872k |
| in_batches(use_ranges: true) | 17.5 s | 4 MB / 560 MB | 0k / 3k | 96k / 872k |
| in_batches(use_ranges: true) + optim | 6.6 s | 1 MB / <1 MB | 0k / 3k | 96k / 872k |

(*) + optim: see below
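
As a rough sanity check on these figures, the arithmetic below derives the number of batches and of boundary queries from the scenario. Attributing the index-hit ratio to the query count is my reading of the numbers, not something measured separately, and the variable names are illustrative only.

```ruby
records    = 30_000_000
batch_size = 10_000
batches    = records / batch_size # 3_000 batches

# Boundary queries needed to generate the scopes (not to use them).
boundary_queries = {
  'BatchOver'                    => batches * 2, # first id + id at OFFSET
  'in_batches(use_ranges: true)' => batches * 1  # one id-list query
}

# idx_blks_hit values reported in the table, in thousands.
idx_hit_ratio = 1743.0 / 872

puts "batches: #{batches}"
boundary_queries.each { |name, count| puts "#{name}: #{count} boundary queries" }
puts "idx_blks_hit ratio: #{idx_hit_ratio.round(2)}"
```

The ratio of index block hits (about 2) matches the ratio of boundary queries per batch, which is consistent with BatchOver walking the primary key index twice per batch.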

Additional optimizations

In BatchOver + optim, we reduce the number of database block reads and the network traffic by requesting only record ids instead of full records. This optimization is proposed in #138. In the present benchmark, we are able to use an Index Only Scan on the primary key instead of an Index Scan, which is the case where the optimization produces the greatest gains.

In in_batches(use_ranges: true) + optim, the optimization consists of querying only the last id of the range (the LIMIT + OFFSET strategy, actually taken from BatchOver) instead of returning the list of all ids in the range. So we get the best of both worlds: a single query and a single id returned. I'm preparing a fix to submit upstream to https://github.com/rails/rails.
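
The "single query + single id" idea can be sketched with the same kind of in-memory model. This is an assumption-laden illustration, not the upstream patch: `last_id_only_ranges` is a hypothetical name, and ending with an open-ended range when fewer than `of` rows remain is my own simplification (that final range may turn out to be empty).

```ruby
# In-memory stand-in for a table with ids 1..250.
SAMPLE_IDS = (1..250).to_a.freeze

# One boundary query per batch, returning at most one id:
#   SELECT id ... WHERE id > cursor ORDER BY id ASC LIMIT 1 OFFSET of - 1
def last_id_only_ranges(ids, of:)
  ranges = []
  queries = 0
  ids_returned = 0
  cursor = nil
  loop do
    scope = cursor ? ids.select { |id| id > cursor } : ids
    last_id = scope[of - 1] # nil when fewer than `of` rows remain
    queries += 1
    if last_id.nil?
      # Assumption: close with an open-ended range (id > cursor).
      ranges << [cursor, nil]
      break
    end
    ids_returned += 1
    ranges << [cursor, last_id] # id > cursor AND id <= last_id
    cursor = last_id
  end
  { ranges: ranges, queries: queries, ids_returned: ids_returned }
end

p last_id_only_ranges(SAMPLE_IDS, of: 100)
# ranges [[nil, 100], [100, 200], [200, nil]], 3 queries, 2 ids returned
```

Compared with returning the full id list, the query count stays at one per batch while the number of ids sent over the network drops from one per record to one per batch.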

@frederic-martin-doctolib

Did you restart pg between each run? It's weird to see that we read data from disk only in the first run ("BatchOver").

@frederic-martin-doctolib

frederic-martin-doctolib commented Feb 26, 2024

forget what i wrote, i understood my mistake ;)


backfill_batch_size = SafePgMigrations.config.backfill_batch_size

if ActiveRecord.version >= Gem::Version.new('7.1')

Until the Rails patch is provided, approved, merged, and released, I think it is too early to switch to the Rails implementation. WDYT?

Agreed, let's keep this PR in draft and keep the new small optim from #138.

@maximerety maximerety marked this pull request as draft February 26, 2024 17:02
Starting from ActiveRecord 7.1, there's a built-in helper equivalent
to what BatchOver does, let's use it instead of maintaining our own
implementation forever.

We keep BatchOver for compatibility with ActiveRecord < 7.1.
@maximerety maximerety force-pushed the prefer-in-batches-use-ranges branch from af7d561 to 8d89fba Compare February 28, 2024 14:28