Sync server scalability for live collaboration #207

mweidner037 · 2023-10-18T23:09:27Z

I've been running benchmarks for rich-text editors with live collaboration. Using automerge with a modified automerge-repo-sync-server, I noticed performance issues starting at ~4 simultaneous users.

In each benchmark, the users connect to the server using BrowserWebSocketClientAdapter, wait for the server to share the document, and then start typing at 6 chars/sec (plus occasional formatting ops). They also record "remote latency": the time from when one user types some text until it shows up for the other users.
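To make the "remote latency" metric concrete, here is a minimal sketch of how it could be computed from logged events. The record shapes (`SendEvent`, `ReceiveEvent`) and the function name are hypothetical and are not taken from the benchmark harness; the sketch also assumes all clients share a synchronized clock.

```typescript
// Hypothetical log records; the real harness may record different fields.
interface SendEvent { editId: string; sender: string; sentAtMs: number }
interface ReceiveEvent { editId: string; receiver: string; shownAtMs: number }

// For every edit that shows up at a *different* user, return the delay
// between the sender typing it and the receiver displaying it.
// Assumes sentAtMs and shownAtMs come from synchronized clocks.
function remoteLatencies(sends: SendEvent[], receives: ReceiveEvent[]): number[] {
  const byId = new Map<string, SendEvent>();
  for (const s of sends) byId.set(s.editId, s);

  const latencies: number[] = [];
  for (const r of receives) {
    const s = byId.get(r.editId);
    // Skip unknown edits and the sender's own local echo.
    if (s === undefined || s.sender === r.receiver) continue;
    latencies.push(r.shownAtMs - s.sentAtMs);
  }
  return latencies;
}
```

With N users, each edit contributes up to N - 1 remote-latency samples, one per receiving user.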

With 4 users, after about 90 sec of (continuous) typing, the server's CPU usage hits 100% and the remote latencies spike to ~20 seconds. With 8 users, the spike occurs within 30 seconds. I believe the onset of the spike depends on (# users) * (size of doc). Full time-series data: allActive-4-automergeRepo.csv, allActive-8-automergeRepo.csv
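As a rough sanity check on the (# users) * (size of doc) guess, the numbers above can be plugged into a back-of-the-envelope calculation. This is only a sketch: it assumes the document starts empty, counts only typed characters (ignoring the occasional formatting ops), and the function names are made up for illustration.

```typescript
const CHARS_PER_SEC = 6; // per-user typing rate from the benchmark description

// Approximate document size (in characters) after `seconds` of continuous
// typing by `users` users, assuming an initially empty document.
function docSizeChars(users: number, seconds: number): number {
  return users * CHARS_PER_SEC * seconds;
}

// The hypothesized load proxy: (# users) * (size of doc).
function loadProxy(users: number, seconds: number): number {
  return users * docSizeChars(users, seconds);
}

console.log(docSizeChars(4, 90), loadProxy(4, 90)); // 2160 8640
console.log(docSizeChars(8, 30), loadProxy(8, 30)); // 1440 11520
```

Under these assumptions the proxy has the same order of magnitude at both reported spike times (8640 vs. 11520). The 8-user figure is an upper bound, since that spike occurred within 30 seconds.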

Variations:

- If I use a basic WebSocket echo server instead of automerge-repo-sync-server, it scales to at least 16 simultaneous users.
- When I ran similar benchmarks in August (automerge-repo v1.0.2), even 3 users would see a spike in remote latency, but now 3 users seem okay. So there has been recent improvement.

Reproduce

More details about the benchmarks are in Section 7.1 of this preprint. Server code we're using is here, client code is here.

You can easily run the benchmarks locally, although the data above is from running it on AWS (with clients + server in different regions).

1. Clone https://github.com/composablesys/collabs-rich-text-benchmarks
2. In server/package.json, update the @automerge/... dependencies to the latest versions. (The existing versions are the ones used in the preprint.)
3. Install and build:

npm ci && npm run build

4. Run a local experiment (7 minutes):

cd client
bash local_exp.sh ../data allActive 4 automergeRepo trial0

"4" is the number of users; replace "automergeRepo" with "automerge" to use the basic WebSocket echo server.

5. Analyze the data:

cd ../analysis
npm start ../results ../data/allActive-004-automergeRepo/

CSVs will be in results/. There are also client CPU profiles but not server CPU profiles (sorry).

Versions:

"@automerge/automerge": "2.1.5"
"@automerge/automerge-repo": "1.0.12"
"@automerge/automerge-repo-network-websocket": "1.0.12"
Node v18.17.1
Ubuntu 22.04.3


pvh · 2023-10-19T03:54:41Z

Thanks for the report, @mweidner037, this is really helpful. Your results seem vaguely plausible given the performance traces I've seen lately in the browser. (As an aside, the Chrome Dev Tools have really excellent performance analysis tools.)

We've been looking at similar problems and have a few patches in the works, but getting independent performance testing is always very welcome!

orionz · 2023-10-26T19:22:01Z

I've been looking into this. There are two issues related to marks that cause you to call Automerge.marks() when you shouldn't need to. I will have a PR soon that fixes this. As for the automerge-repo falling down - I saw the same behavior but just by updating to a more recent version this seemed to go away. I was hoping you could confirm:

     "server": {
       "dependencies": {
-        "@automerge/automerge": "2.1.2-alpha.0",
-        "@automerge/automerge-repo": "1.0.2",
-        "@automerge/automerge-repo-network-websocket": "1.0.2",
+        "@automerge/automerge": "2.1.5",
+        "@automerge/automerge-repo": "1.0.12",
+        "@automerge/automerge-repo-network-websocket": "1.0.12",
         "@collabs/collabs": "0.13.4",

orionz · 2023-10-26T19:24:55Z

Also I made an even more lightweight version of the automerge-server.js and haven't seen it go over 1% CPU usage. Not very useful as it's just an echo server but happy to share if you like.

mweidner037 · 2023-10-26T21:27:09Z

> Also I made an even more lightweight version of the automerge-server.js and haven't seen it go over 1% CPU usage. Not very useful as it's just an echo server but happy to share if you like.

Sure, I can change automerge mode to use that.

> As for the automerge-repo falling down - I saw the same behavior but just by updating to a more recent version this seemed to go away.

I'll try this out when I can.

Aside: When you are running locally, you can make it more stressful (as if there were more users) by increasing the rate of edits. Here is the relevant constant: https://github.com/composablesys/collabs-nsdi/blob/master/client/src/scenarios/all_active.ts#L16

Probably 30 edits/second (value 33 ms) with 3-4 users will make things interesting. Remote latency P95 should be the most sensitive indicator.
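For reference, here is a small sketch of the two calculations mentioned above: converting an edit rate into the interval constant, and computing a P95 over a set of remote-latency samples. Both helper names are hypothetical, and the percentile uses the nearest-rank definition, which may differ from what the analysis scripts actually use.

```typescript
// Convert an edit rate into the delay between edits, in whole milliseconds.
function intervalMsForRate(editsPerSec: number): number {
  return Math.round(1000 / editsPerSec);
}

// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

console.log(intervalMsForRate(30)); // 33
console.log(percentile([10, 20, 30, 40], 95)); // 40
```

Because P95 tracks the slowest few percent of samples, it moves as soon as a small fraction of edits are delayed, which is why it reacts earlier than a mean or median would.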

orionz · 2023-10-27T17:41:52Z

Made a PR that adds a fast marksAt() method so you don't need to walk the whole text field on every insert:

automerge/automerge#785

orionz · 2023-10-27T17:43:15Z

Something you could add to the benchmark that would be hugely valuable (and I was having trouble doing while reading your harness code) would be to have a server flamegraph added to the stats gathered maybe via the 0x package.

mweidner037 · 2023-11-07T23:53:30Z

I updated the code to use Automerge.marksAt and also upgraded to the latest versions (automerge 2.1.6, automerge-repo-* 1.0.15). However, I'm seeing similar performance - e.g. with 8 users on AWS, latency spikes around 30 sec and does not recover.

(The updates are in a dev copy of the repo - I can invite you if you like.)

> Something you could add to the benchmark that would be hugely valuable (and I was having trouble doing while reading your harness code) would be to have a server flamegraph added to the stats gathered maybe via the 0x package.

You should be able to change "node" to "0x" on this line. I tried it a few times but it was flaky - it would give me a flamegraph if I killed the process after a few seconds, but not after running a whole experiment (even shortened to 30 sec).
