
feat: Modify recon storage tables, trait and sqlite config to improve throughput #243

Merged
merged 5 commits into main from feat/ws1-1432-recon-table on Jan 24, 2024

Conversation

dav1do
Contributor

@dav1do dav1do commented Jan 24, 2024

Linear WS1-1432.

Before making any changes, I was seeing about 400/100 rps keys added/synced. After these changes, we're able to push the incoming rate over 1k rps, and a steady workload targeting 300 rps completed with ~310/240 keys exchanged/sec.

Changes:

  • I moved the recon values into their own table: recon_value. This doesn't have any real benefit up front, but it will help as the table grows. Eventually a single table would make reads pay, since scans would load a lot more data: the value is about 10x bigger than the key and would explode row sizes (a key is ~80 bytes, the value ~1KB). The initial change moved writes up and syncs down: 450/65.
    • In a POC, I left things in a single table and hit around 800/100 rps just by changing the events_post insert to use a single statement. I expect we could push things higher than we are with that format, but eventually I would expect it to slow down, while our multiple-table approach should stay relatively consistent.
    • All that to say, I think this is the right choice even though it reduces pure throughput potential initially.
  • I refactored the recon traits (particularly storage) to handle inserting a key and value simultaneously. This gives us more control at the database layer over how we handle these writes (in a transaction, batched, etc.). Re-using a single transaction for all the writes in a request improved throughput substantially (36%) to 600/85.
  • I modified some sqlite config settings. Changing to pragma synchronous = NORMAL had a huge impact after the other changes; when I tried it originally it made no noticeable difference, but after the batched write/transaction usage it sped things up another 80% to 1120/150. A sketch of both changes follows this list.
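
Here's a minimal sketch of the two changes that mattered most: opening the pool with WAL + synchronous = NORMAL, and doing all the writes for a request inside one transaction. It assumes sqlx and hypothetical recon_key / recon_value tables with BLOB key/value columns; it is not the exact schema or trait code from this PR.

use sqlx::sqlite::{
    SqliteConnectOptions, SqliteJournalMode, SqlitePoolOptions, SqliteSynchronous,
};
use sqlx::SqlitePool;

// Open the pool in WAL mode with synchronous = NORMAL, which skips the
// per-transaction fsync of FULL while remaining safe under WAL.
async fn connect(path: &str) -> Result<SqlitePool, sqlx::Error> {
    let opts = SqliteConnectOptions::new()
        .filename(path)
        .create_if_missing(true)
        .journal_mode(SqliteJournalMode::Wal)
        .synchronous(SqliteSynchronous::Normal);
    SqlitePoolOptions::new().connect_with(opts).await
}

// Apply every key/value write for a request in a single transaction rather
// than paying an implicit transaction per statement.
async fn insert_many(pool: &SqlitePool, items: &[(Vec<u8>, Vec<u8>)]) -> Result<(), sqlx::Error> {
    let mut tx = pool.begin().await?;
    for (key, value) in items {
        sqlx::query("INSERT OR IGNORE INTO recon_key (key) VALUES (?)")
            .bind(key.as_slice())
            .execute(&mut *tx)
            .await?;
        sqlx::query("INSERT OR IGNORE INTO recon_value (key, value) VALUES (?, ?)")
            .bind(key.as_slice())
            .bind(value.as_slice())
            .execute(&mut *tx)
            .await?;
    }
    tx.commit().await?;
    Ok(())
}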

Follow ups:

  • Something seems wrong with the hash_range query (or rather, the indexes/table layout); after a certain point it begins to increase in duration linearly (it seems like it's potentially a table scan as we add more rows).
  • It's possible using BEGIN IMMEDIATE for transactions would be faster (it's not supported in sqlx currently, so it would likely require a wrapper to make sure drop is called and things are rolled back, or issuing it as a single statement).
  • Other settings like mmap_size or temp_store = memory might be worth changing, and possibly other pragma options.
  • We could also INSERT multiple rows in one statement. We currently insert a single key/value or at most 2 keys and values on all paths, but 2-4 round trips is probably quite a bit slower than one, even if it is a local file (see the sketch after this list).
  • I think our next bottlenecks may be in the recon client/server protocol loops. I need to prove that, but it appears we prioritize incoming writes and starve syncs (potentially a result of the select!(biased) usage). It's not necessarily a bad thing, but we may want to look at it.
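
As a sketch of the multi-row INSERT idea above, sqlx's QueryBuilder can bind all the rows into one statement so a batch costs a single round trip. The recon_key table and column names are placeholders:

use sqlx::{QueryBuilder, Sqlite, SqlitePool};

// Build one INSERT ... VALUES (?), (?), ... statement for the whole batch
// instead of issuing one INSERT per key.
async fn insert_keys(pool: &SqlitePool, keys: &[Vec<u8>]) -> Result<(), sqlx::Error> {
    if keys.is_empty() {
        return Ok(());
    }
    let mut qb: QueryBuilder<Sqlite> =
        QueryBuilder::new("INSERT OR IGNORE INTO recon_key (key) ");
    qb.push_values(keys, |mut row, key| {
        row.push_bind(key.as_slice());
    });
    qb.build().execute(pool).await?;
    Ok(())
}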

Steady simulation results:

I ran a simulation with the following spec and it almost hit 300 rps on both sides. One side was 280314 over 15m for 311 rps. The sync target was 217945 new keys for a rate of 242 rps. It took about 5 minutes for the recv to catch up, but it ended at 380313 keys (one short) 🤔 This seems like it may be a result of the hanging hash_range queries.

apiVersion: "keramik.3box.io/v1alpha1"
kind: Simulation
metadata:
  name: event-ids
  # Must be the same namespace as the network to test
  namespace: keramik-small
spec:
  scenario: recon-event-sync
  devMode: true
  throttleRequests: 350
  successRequestTarget: 280
  users: 1
  runTime: 15

The SQL queries were quite consistent throughout and much more performant. Before these changes, the best I was seeing was about 1.5 ms on inserts and upwards of 2.5 ms depending on the query setup. Here things stay around 0.7 ms on the high end, while most are much faster.

[image]

[image: Difference between keys on both sides (logarithmic)]

[image: Simulation sync rate]


Stress test results:

When adding keys much faster to one node, we see the middle and hash_range queries on the write side taking the longest. This is 5x what we were seeing above, but they do level out pretty quickly. The second spike resulted in a substantial dip in sync rate. Unfortunately, the result was 1008588 keys on one side and only 883593 on the other; after 30 minutes nothing further was happening and the hash_range duration kept increasing (queries over 1s in the logs).

[image]


Botched metrics:

Unfortunately, all of the values below are inflated because value updates (for existing keys) were counted as new keys. Compared to the corrected results it's still interesting: it shows that syncing the values for keys we already have is what consumes most of the remaining gap between the new keys on the writer and our side. One side was 280395 requests in 15 minutes for 311 rps and the other was 282165 (counting keys and new_keys || values), which shows the key + value follow-up is going faster than the incoming requests, but fetching just the keys is not.

[image]

[image]

dav1do and others added 3 commits January 23, 2024 10:21
insert and insert_many have been modified to perform better in sqlite.
We grab a transaction and do all our writes, quitting early if we've done anything before.
This appears to have roughly doubled throughput compared to the shared pool and multiple write operations we had before.
synchronous = NORMAL is typically sufficient in WAL journal mode, and showed about a 2x increase in writes in my benchmarks.

linear bot commented Jan 24, 2024

@dav1do dav1do changed the title Feat: Modify recon storage tables, trait and sqlite config to improve throughput feat: Modify recon storage tables, trait and sqlite config to improve throughput Jan 24, 2024
@nathanielc
Collaborator

Something seems wrong with the hash_range query (or rather, the indexes/table layout); after a certain point it begins to increase in duration linearly (it seems like it's potentially a table scan as we add more rows).

It is a table scan. In the synchronized state two nodes will be doing a hash_range query for an entire model range (i.e. a single interest). If they both get the same hash then they are synchronized and they stop the synchronization process. This means we often do hash_range queries over large ranges, which requires that we do the sum for that entire range. I would expect this query to grow as data size grows. (We have some medium-term plans for how to address this, but for now hopefully we can keep it fast enough.)
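
Roughly, the query has the shape of an aggregate over every row between the fenceposts, which is why its cost tracks the number of keys in the range. A sketch (table and column names are placeholders, and the real hash is combined differently):

// Count plus a hash aggregate over all keys strictly between the fenceposts;
// SQLite has to visit every row in the range to answer it.
async fn hash_range_sketch(
    pool: &sqlx::SqlitePool,
    start: &[u8],
    end: &[u8],
) -> Result<(i64, i64), sqlx::Error> {
    sqlx::query_as(
        "SELECT COUNT(*), COALESCE(SUM(hash_word), 0)
           FROM recon_key
          WHERE key > ? AND key < ?",
    )
    .bind(start)
    .bind(end)
    .fetch_one(pool)
    .await
}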

I think our next bottlenecks may be in the recon client/server protocol loops. I need to prove that, but it appears we prioritize incoming writes and starve syncs (potentially a result of the select!(biased) usage). It's not necessarily a bad thing, but we may want to look at it.

FWIW the biased select usage is intentional to avoid potential deadlocks in the protocol. Happy to explore its usage more in depth.

When adding keys much faster to one node, we see the middle and hash_range queries on the write side taking the longest

I discovered an easy optimization we can make to the middle query last time I was in the code, but didn't have the time to add it. In fact, we may not even need middle anymore. The intent is that we use middle to split the key space. It does this by counting the rows. However, hash_range also counts the rows now, and we always run hash_range right before we run middle. So we can just reuse that count to split instead of calling middle. I'll create an issue for this so we can track it.
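
A rough sketch of that idea, reusing the count hash_range already returns to pick the split key with one OFFSET query (names are placeholders, not the actual trait or schema):

// Given the row count hash_range already computed for (start, end), fetch the
// key at the midpoint directly instead of running a separate middle query.
async fn split_key(
    pool: &sqlx::SqlitePool,
    start: &[u8],
    end: &[u8],
    count: i64,
) -> Result<Option<Vec<u8>>, sqlx::Error> {
    let row: Option<(Vec<u8>,)> = sqlx::query_as(
        "SELECT key FROM recon_key WHERE key > ? AND key < ? ORDER BY key LIMIT 1 OFFSET ?",
    )
    .bind(start)
    .bind(end)
    .bind(count / 2)
    .fetch_optional(pool)
    .await?;
    Ok(row.map(|(k,)| k))
}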

Still reviewing the code... Thanks for the in-depth analysis

Collaborator

@nathanielc nathanielc left a comment


Change looks great

recon/src/recon.rs (review thread, outdated and resolved)
@dav1do
Contributor Author

dav1do commented Jan 24, 2024

Something seems wrong with the hash_range query (or rather, the indexes/table layout); after a certain point it begins to increase in duration linearly (it seems like it's potentially a table scan as we add more rows).

It is a table scan. In the synchronized state two nodes will be doing a hash_range query for an entire model range (i.e. a single interest). If they both get the same hash then they are synchronized and they stop the synchronization process. This means we often do hash_range queries over large ranges, which requires that we do the sum for that entire range. I would expect this query to grow as data size grows. (We have some medium-term plans for how to address this, but for now hopefully we can keep it fast enough.)

Yep, makes sense. I looked more closely and figured out what it's doing, and it clicked once I saw the min/max fenceposts being used. Since the hash should be associative and we don't support deletion, we can probably create a new table, updated by a trigger, that keeps the hash (a total, over some ranges, etc). I think this is worth a discussion.
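
As a rough sketch of that trigger idea (entirely hypothetical: the bucketing, column names, and hash representation are placeholders, and real code would combine hashes the way Recon actually does):

// Maintain a per-bucket running count and hash total so hash_range over a
// large range can be answered from a small summary table instead of scanning
// every key. Assumes each recon_key row carries a bucket id and an integer
// hash word, which is not the actual schema.
async fn create_hash_totals(pool: &sqlx::SqlitePool) -> Result<(), sqlx::Error> {
    sqlx::query(
        "CREATE TABLE IF NOT EXISTS recon_hash_total (
             bucket INTEGER PRIMARY KEY,
             key_count INTEGER NOT NULL DEFAULT 0,
             hash_word INTEGER NOT NULL DEFAULT 0
         )",
    )
    .execute(pool)
    .await?;
    sqlx::query(
        "CREATE TRIGGER IF NOT EXISTS recon_key_hash_total
          AFTER INSERT ON recon_key
          BEGIN
              INSERT OR IGNORE INTO recon_hash_total (bucket) VALUES (NEW.bucket);
              UPDATE recon_hash_total
                 SET key_count = key_count + 1,
                     hash_word = hash_word + NEW.hash_word
               WHERE bucket = NEW.bucket;
          END",
    )
    .execute(pool)
    .await?;
    Ok(())
}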

I think our next bottlenecks may be in the recon client/server protocol loops. I need to prove that, but it appears we prioritize incoming writes and starve syncs (potentially a result of the select!(biased) usage). It's not necessarily a bad thing, but we may want to look at it.

FWIW the biased select usage is intentional to avoid potential deadlocks in the protocol. Happy to explore its usage more in depth.

Yeah, that makes sense; I realized that was probably the reason. What I was seeing, and want to figure out, is why it can exchange 1000+ keys/sec but stays below the write speed when things are steady. I was assuming it might be the loop and message-exchange frequency, but it could also be key discovery: when there are huge differences, the gap shrinks faster.
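
For reference, the pattern in question looks roughly like this (a simplified sketch, not the actual protocol loop): with biased;, the branches are polled top to bottom, so a steady stream of incoming writes can keep the sync branch from running.

use tokio::sync::mpsc;

// Simplified biased select loop: the incoming-write branch is always checked
// first, so under sustained write load the sync branch can be starved.
async fn run(mut writes: mpsc::Receiver<Vec<u8>>, mut sync_msgs: mpsc::Receiver<Vec<u8>>) {
    loop {
        tokio::select! {
            biased;
            Some(key) = writes.recv() => {
                // handle a locally submitted key
                let _ = key;
            }
            Some(msg) = sync_msgs.recv() => {
                // only reached when no incoming write was immediately ready
                let _ = msg;
            }
            else => break,
        }
    }
}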

When adding keys much faster to one node, we see the middle and hash_range queries on the write side taking the longest

I discovered an easy optimization we can make to the middle query last time I was in the code, but didn't have the time to add it. In fact, we may not even need middle anymore. The intent is that we use middle to split the key space. It does this by counting the rows. However, hash_range also counts the rows now, and we always run hash_range right before we run middle. So we can just reuse that count to split instead of calling middle. I'll create an issue for this so we can track it.

Awesome, we can discuss this with the hash_range optimizations and tackle in a follow up.

Still reviewing the code... Thanks for the in-depth analysis

we need to figure out how to make these deterministic. currently the exact order of exchange depends on the speed at which the other side takes certain actions. speeding up the inserts appears to have caused a resp before a second req in some cases.
@dav1do dav1do force-pushed the feat/ws1-1432-recon-table branch from febf9a4 to 88aac7b on January 24, 2024 21:10
@dav1do dav1do added this pull request to the merge queue Jan 24, 2024
Merged via the queue into main with commit 1958656 Jan 24, 2024
4 checks passed
@dav1do dav1do deleted the feat/ws1-1432-recon-table branch January 24, 2024 22:57
@github-actions github-actions bot mentioned this pull request Feb 6, 2024