Sync server scalability for live collaboration #207

Open
mweidner037 opened this issue Oct 18, 2023 · 7 comments

@mweidner037

I've been running benchmarks for rich-text editors with live collaboration. Using automerge with a modified automerge-repo-sync-server, I noticed performance issues starting at ~4 simultaneous users.

In each benchmark, the users connect to the server using BrowserWebSocketClientAdapter, wait for the server to share the document, and then start typing at 6 chars/sec (plus occasional formatting ops). They also record "remote latency": the time from when one user types some text until it shows up for the other users.
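To make the "remote latency" metric concrete, here is a minimal sketch of how one might record it. All names (`RemoteLatencyRecorder`, `recordSend`, `recordReceive`, the edit IDs) are hypothetical and not taken from the benchmark's client code, and the sketch assumes send and receive timestamps come from comparable clocks.

```typescript
// Hypothetical recorder; the benchmark's real client code may differ.
class RemoteLatencyRecorder {
  // Local send time of each edit, keyed by an assumed unique edit ID.
  private readonly sentAt = new Map<string, number>();
  // One sample per (edit, receiving user) pair, in milliseconds.
  readonly samplesMs: number[] = [];

  // Called when a user types some text locally.
  recordSend(editId: string, nowMs: number): void {
    this.sentAt.set(editId, nowMs);
  }

  // Called when that text shows up in another user's editor.
  recordReceive(editId: string, nowMs: number): void {
    const start = this.sentAt.get(editId);
    if (start === undefined) return; // edit we never saw being sent
    this.samplesMs.push(nowMs - start);
  }
}

const recorder = new RemoteLatencyRecorder();
recorder.recordSend("edit-1", 1000);
recorder.recordReceive("edit-1", 1250); // first remote user sees it
recorder.recordReceive("edit-1", 1400); // second remote user sees it
console.log(recorder.samplesMs); // [250, 400]
```

The send time is kept in the map after the first receive because every other user produces their own sample for the same edit.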

With 4 users, after about 90 sec of (continuous) typing, the server's CPU usage hits 100% and the remote latencies spike to ~20 seconds. With 8 users, the spike occurs within 30 seconds. I believe it depends on (# users) * (size of doc). Full time-series data: allActive-4-automergeRepo.csv, allActive-8-automergeRepo.csv
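As a rough sanity check on the (# users) * (size of doc) guess, the sketch below plugs in the numbers from this report. It assumes the document starts empty and grows only by typed characters (formatting ops and deletions are ignored), and `loadProxy` is a made-up name for the product, not a measured quantity.

```typescript
const CHARS_PER_SEC_PER_USER = 6; // typing rate from the benchmark description

// Approximate doc size after `seconds` of continuous typing.
// Assumption: empty initial doc, no deletions, formatting ops ignored.
function approxDocSize(users: number, seconds: number): number {
  return users * CHARS_PER_SEC_PER_USER * seconds;
}

// Hypothetical load proxy: (# users) * (size of doc).
function loadProxy(users: number, seconds: number): number {
  return users * approxDocSize(users, seconds);
}

console.log(loadProxy(4, 90)); // 4 users * 2160 chars = 8640
console.log(loadProxy(8, 30)); // 8 users * 1440 chars = 11520
```

Since the 8-user spike happens within 30 seconds, 11520 is an upper bound. The two values are the same order of magnitude, which is loosely consistent with the guess but does not confirm it.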

Variations:

  • If I use a basic WebSocket echo server instead of automerge-repo-sync-server, it scales to at least 16 simultaneous users.
  • When I ran similar benchmarks in August (v1.0.2), even 3 users would see a spike in remote latency, but now 3 users seem okay. So there has been recent improvement.
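For comparison, the core of a basic echo-style relay can be sketched as below. This is not the benchmark's server code: the `Peer` interface and `relay` function are hypothetical, and forwarding to every peer except the sender is an assumption about what "echo" means here. The point is that the per-message work depends only on the number of peers, not on the size of the document.

```typescript
// Minimal structural stand-in for a WebSocket connection (hypothetical).
interface Peer {
  isOpen: boolean;
  send(data: Uint8Array): void;
}

// Forward one message, untouched, to every other open peer.
// Returns how many peers received it.
function relay(sender: Peer, peers: Iterable<Peer>, data: Uint8Array): number {
  let delivered = 0;
  for (const peer of peers) {
    if (peer === sender || !peer.isOpen) continue;
    peer.send(data);
    delivered++;
  }
  return delivered;
}
```

With the `ws` package, one could plausibly call `relay` from each connection's `message` handler, passing `wss.clients` wrapped in objects that satisfy `Peer`; that wiring is left out so the sketch stays self-contained.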

Reproduce

More details about the benchmarks are in Section 7.1 of this preprint. Server code we're using is here, client code is here.

You can easily run the benchmarks locally, although the data above is from running it on AWS (with clients + server in different regions).

  1. Clone https://github.com/composablesys/collabs-rich-text-benchmarks
  2. In server/package.json, update @automerge/... dependencies to the latest versions. (The existing versions are what's used in the preprint.)
  3. Install and build:
     npm ci && npm run build
  4. Run a local experiment (7 minutes):
     cd client
     bash local_exp.sh ../data allActive 4 automergeRepo trial0
     "4" is the number of users; replace "automergeRepo" with "automerge" to use the basic WebSocket echo server.
  5. Analyze the data:
     cd ../analysis
     npm start ../results ../data/allActive-004-automergeRepo/

CSVs will be in results/. There are also client CPU profiles but not server CPU profiles (sorry).

Versions:

  • "@automerge/automerge": "2.1.5"
  • "@automerge/automerge-repo": "1.0.12"
  • "@automerge/automerge-repo-network-websocket": "1.0.12"
  • Node v18.17.1
  • Ubuntu 22.04.3
@pvh
Member

pvh commented Oct 19, 2023

Thanks for the report, @mweidner037, this is really helpful. Your results seem vaguely plausible given the performance traces I've seen lately in the browser. (As an aside, the Chrome DevTools have really excellent performance analysis tools.)

We've been looking at similar problems and have a few patches in the works, but getting independent performance testing is always very welcome!

@orionz
Collaborator

orionz commented Oct 26, 2023

I've been looking into this. There are two issues related to marks that cause you to call Automerge.marks() when you shouldn't need to; I will have a PR soon that fixes this. As for automerge-repo falling down: I saw the same behavior, but it seemed to go away just by updating to a more recent version. I was hoping you could confirm:

     "server": {
       "dependencies": {
-        "@automerge/automerge": "2.1.2-alpha.0",
-        "@automerge/automerge-repo": "1.0.2",
-        "@automerge/automerge-repo-network-websocket": "1.0.2",
+        "@automerge/automerge": "2.1.5",
+        "@automerge/automerge-repo": "1.0.12",
+        "@automerge/automerge-repo-network-websocket": "1.0.12",
         "@collabs/collabs": "0.13.4",

@orionz
Collaborator

orionz commented Oct 26, 2023

Also I made an even more lightweight version of the automerge-server.js and haven't seen it go over 1% CPU usage. Not very useful as it's just an echo server but happy to share if you like.

@mweidner037
Author

> Also I made an even more lightweight version of the automerge-server.js and haven't seen it go over 1% CPU usage. Not very useful as it's just an echo server but happy to share if you like.

Sure, I can change automerge mode to use that.

> As for the automerge-repo falling down - I saw the same behavior but just by updating to a more recent version this seemed to go away.

I'll try this out when I can.


Aside: When you are running locally, you can make it more stressful (as if there were more users) by increasing the rate of edits. Here is the relevant constant: https://github.com/composablesys/collabs-nsdi/blob/master/client/src/scenarios/all_active.ts#L16

Probably 30 edits/second (value 33 ms) with 3-4 users will make things interesting. Remote latency P95 should be the most sensitive indicator.
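To relate the interval constant to an edit rate, and to show why P95 reacts to spikes sooner than the median, here is a small sketch. The helper names are hypothetical, and the nearest-rank percentile is one common definition that may not match what the analysis scripts use.

```typescript
// Convert a per-edit interval (ms) into edits per second.
function editsPerSecond(intervalMs: number): number {
  return 1000 / intervalMs;
}

// Nearest-rank percentile: the smallest sample such that at least
// p percent of the samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

console.log(editsPerSecond(33)); // about 30.3 edits/second

// Made-up latencies (ms): mostly fast, with a single spike.
const latenciesMs = [10, 12, 11, 13, 5000];
console.log(percentile(latenciesMs, 50)); // 12
console.log(percentile(latenciesMs, 95)); // 5000
```

In this made-up sample the median barely moves, while P95 lands on the spike.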

@orionz
Collaborator

orionz commented Oct 27, 2023

Made a PR that adds a fast marksAt() method so you don't need to walk the whole text field on every insert:

automerge/automerge#785
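To illustrate the kind of per-insert work a point lookup avoids, here is a sketch of finding the marks at one index by scanning a full list of mark spans. It does not use Automerge or reflect its internals: `MarkSpan` and `marksAtByScan` are hypothetical, and treating spans as half-open `[start, end)` ranges is an assumption.

```typescript
type MarkValue = string | number | boolean | null;

// Hypothetical span shape: a named value covering [start, end).
interface MarkSpan {
  name: string;
  value: MarkValue;
  start: number;
  end: number;
}

// Scan every span in the text to find those covering `index`.
// Cost grows with the number of spans in the whole document, and a
// client doing this on every insert pays that cost once per keystroke.
function marksAtByScan(spans: MarkSpan[], index: number): Record<string, MarkValue> {
  const result: Record<string, MarkValue> = {};
  for (const span of spans) {
    if (span.start <= index && index < span.end) {
      result[span.name] = span.value; // later spans win on name clashes
    }
  }
  return result;
}

const spans: MarkSpan[] = [
  { name: "bold", value: true, start: 0, end: 5 },
  { name: "italic", value: true, start: 3, end: 10 },
];
console.log(marksAtByScan(spans, 4)); // { bold: true, italic: true }
console.log(marksAtByScan(spans, 7)); // { italic: true }
```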

@orionz
Collaborator

orionz commented Oct 27, 2023

Something you could add to the benchmark that would be hugely valuable (and that I was having trouble doing while reading your harness code) would be a server flamegraph added to the stats gathered, maybe via the 0x package.

@mweidner037
Author

I updated the code to use Automerge.marksAt and also upgraded to the latest versions (automerge 2.1.6, automerge-repo-* 1.0.15). However, I'm seeing similar performance - e.g. with 8 users on AWS, latency spikes around 30 sec and does not recover.

(The updates are in a dev copy of the repo - I can invite you if you like.)

> Something you could add to the benchmark that would be hugely valuable (and I was having trouble doing while reading your harness code) would be to have a server flamegraph added to the stats gathered maybe via the 0x package.

You should be able to change "node" to "0x" on this line. I tried it a few times but it was flaky - it would give me a flamegraph if I killed the process after a few seconds, but not after running a whole experiment (even shortened to 30 sec).


3 participants