
"Maximum call stack size exceeded" on Safari 18.2 #2

Open
rhashimoto opened this issue Jan 24, 2025 · 21 comments

Comments

@rhashimoto
Collaborator

@maccman @vojto @ocavue

After many queries, Safari 18.2 can hang or produce a "Maximum call stack size exceeded" WebAssembly error from wa-sqlite.

Reproduce with this test page. For me it fails after ~4,000-10,000 queries (depending on whether you use the dist or debug build of wa-sqlite) on Safari 18.2, while it has not been observed to fail on Safari Technology Preview, Chrome 132, or Firefox 134 even after 1 million queries.

It is possible this is a stack leak similar to this one; that can be investigated as was done here.

I think a browser bug is also a strong possibility. Since the failure doesn't appear in Safari Technology Preview, we can look at recently closed WebKit bugs to see if anything there looks relevant.
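A loop like the one the test page runs can be sketched as a small harness (illustrative only; `exec` stands in for whatever runs one wa-sqlite query):

```typescript
// Hypothetical stress harness, not the test page's actual code.
// Runs `exec` repeatedly and reports how many iterations complete
// before the first failure.
async function stress(
  exec: () => Promise<void>,
  maxIterations = 1_000_000
): Promise<number> {
  for (let i = 0; i < maxIterations; i++) {
    try {
      await exec();
    } catch {
      return i; // iterations that succeeded before the error
    }
  }
  return maxIterations;
}
```

On Safari 18.2 the equivalent loop fails after a few thousand iterations; on the other browsers it runs to completion.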

@rhashimoto
Collaborator Author

rhashimoto commented Jan 24, 2025

There's one recent WebKit bug that could be related. The report specifies iOS, but it doesn't actually say the crash doesn't happen on desktop; it could be an ARM issue, in which case it would show up on any Apple Silicon machine.

The reporter says that the crash appears with Emscripten optimization levels "O2 O3 Os OZ". Our bug matches that behavior exactly:

| Opt flag | .wasm size | Result | Timing relative to -Oz |
| --- | --- | --- | --- |
| O0 | 4506281 | 5.101 | 1.08 |
| O1 | 4595948 | 5.072 | 1.07 |
| O2 | 2180592 | 4.801 | 1.02 |
| O3 | 2292367 | 4.761 | 1.01 |
| Os | 1595028 | 4.770 | 1.01 |
| Oz | 1463181 | 4.730 | 1.0 |

My working theory is that we're running into the same problem, and its absence from Safari Technology Preview may mean it is already being addressed. I'll still take a look at the stack leak possibility, though.

@rhashimoto
Collaborator Author

I tried building with Emscripten 4.0.1 (instead of 3.1.61) on the chance that a difference in code generated by the compiler might sidestep the problem. Unfortunately it didn't.

@rhashimoto
Collaborator Author

I injected this printf into the SQLite C source:

```c
SQLITE_API int sqlite3_step(sqlite3_stmt *pStmt){
  int rc = SQLITE_OK;      /* Result from sqlite3Step() */
  Vdbe *v = (Vdbe*)pStmt;  /* the prepared statement */
  int cnt = 0;             /* Counter to prevent infinite loop of reprepares */
  sqlite3 *db;             /* The database connection */

  printf("sqlite3_step %p\n", &rc);
  /* ...rest of sqlite3_step() unchanged... */
```

That produced this on the Safari console:

[Screenshot: Safari console output showing repeated sqlite3_step addresses]

This shows thousands of calls to sqlite3_step(), all at the identical stack address, so we do not appear to be leaking stack.

@rhashimoto
Collaborator Author

rhashimoto commented Jan 24, 2025

If this is a Safari bug, then I'm wondering why many more developers haven't reported it; there are quite a few WebAssembly apps out there now.

My speculation is that it is related to Asyncify, which would really narrow the space. I tried running the test page with a synchronous build (i.e. no Asyncify) and OPFSCoopSyncVFS, and that ran successfully, which is consistent with this theory.

@vojto

vojto commented Jan 24, 2025

Well, we know that Notion uses SQLite. I just tried it in Safari, and it looks like it created a SQLite file in OPFS.

And based on this article, my guess is they also use an async build.

Another pretty mature implementation is Livestore (example project). If we could somehow stress-test those by injecting our code, that would tell us whether they have the same problem.

I know the author of Livestore, so maybe we could get him on a call and ask whether he encountered the same problem while building the library.

@rhashimoto
Collaborator Author

rhashimoto commented Jan 24, 2025

Notion uses the official SQLite WASM, which only has a synchronous build.

I added a comment on that WebKit bug to ask the reporter if they use Asyncify. Update: the response was that they do use Asyncify.

@rhashimoto
Collaborator Author

I put a simplified test online (source), and added a comment to the WebKit bug in hopes that might aid a complete and timely fix.

@vojto

vojto commented Jan 28, 2025

I just tried branch https://github.com/team-reflect/reflect/pull/3177 and got the maximum call stack size error even in Google Chrome.

This happened while I was trying to sync my account with 10k notes. The sync itself appears to have succeeded, and the app works normally after the error.

Edit: I can reproduce this every time (tried 3 times) with the new build. I'm going to try the old build multiple times to see if it also happens there and we just missed it before.

[Screenshot: the Chrome error]

@rhashimoto
Collaborator Author

My local Reflect app with the -O1 build is working fine for me on both Safari and Chrome. Maybe I don't have enough notes or they don't have the right content to trigger the problem.

I wonder also if your app build might somehow be picking up inconsistent pieces of wa-sqlite, e.g. because of build or browser caching. Mismatched Emscripten JavaScript and WebAssembly can cause all kinds of strange failures. I don't think that's the problem here, because it's usually more catastrophic than what you're seeing, but forcing a full rebuild and reload might be worth trying.

I also created another wa-sqlite branch, rhashimoto/wa-sqlite-01-debug, that has debug symbols (the WebAssembly file is 12MB!). Try reproducing your Chrome stack overflow with that to see which function gets blamed for the infinite recursion. I don't know if this will tell us anything meaningful, so it's not high priority; I'm just searching for any bit of information from cases we can actually reproduce.

@vojto

vojto commented Jan 29, 2025

@rhashimoto I'm gonna try your debug build, and in the meantime, here's an export with 10k notes that should reproduce the issue:

generated_export.json.zip

After importing, sign out and sign back in so that the full sync runs.

@vojto

vojto commented Jan 29, 2025

With your wa-sqlite-01-debug build I'm getting `no such table xx` errors, and I can't get any further.

@rhashimoto
Collaborator Author

> I'm gonna try your debug build, and in the meantime, here's an export with 10k notes that should reproduce the issue:

Thanks, that does in fact reproduce in my environment.

The failure of the O1 build in Chrome looks like a real stack overflow. Emscripten uses "far more [local variables]" for unoptimized (and presumably lightly optimized) code. My theory is that in trying to avoid a false stack overflow by reducing the optimization, we ironically caused a true stack overflow.

I was hoping to fix this by changing the `"id" IN (SELECT "value" FROM ?)` here to use JSON_EACH(), as has been done elsewhere. That's probably still a good idea, but it didn't fix the crash. I'm guessing that the IN operator may be the recursive piece, written with the assumption that collections aren't that large.

Next I tried changing pageSize on this line: 100 didn't work, but 20 did! So one workaround may be to limit the size of IN expressions (or avoid them). Could this query be done with a CTE instead?

The other path is to crank up the stack size. I think that should also work if we don't run into any hard limits. I'll try that tomorrow.
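Limiting IN sizes amounts to chunking the ID list and running one query per chunk; a minimal sketch (the helper name is illustrative, not Reflect's actual code):

```typescript
// Illustrative helper: split a large ID list into fixed-size chunks so
// each IN (...) or JSON_EACH(...) expression stays under a safe size
// (e.g. 20, per the experiment above).
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}
```

Each chunk then becomes one query, and the per-chunk results are concatenated.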

@vojto

vojto commented Jan 30, 2025

@rhashimoto I switched to jsonEach here: https://github.com/team-reflect/reflect/pull/3191. But that won't actually help, right?

The SearchStore reindex happens with thousands of IDs after the initial sync. We could stop using IN like this:

  1. during initial sync, disable the notes observer in SearchStore (so reindexSome won't be enqueued)
  2. after it finishes, enqueue reindexAll
  3. that way reindexing doesn't have to use an IN query, and can instead use standard pagination over all of the notes

However, there is still an instance where a large IN query is necessary, and that is during the initial sync:

  1. we ask Firebase for all docs updated after a given timestamp; it might respond with 10k notes
  2. then we use an IN query to get local timestamps for all of these notes
  3. we update local notes if the remote timestamp is newer

There's no way to stop using the IN query here.


Do you like the solution where we change the application code to stop using certain features, like IN? Would that be your suggestion for us?

@vojto

vojto commented Jan 30, 2025

I suppose we could limit the size of IN queries in both the reindexing code and the initial sync.

A page size of 20, however, feels too small; I feel uneasy about crashing at only 100 rows per page.

@rhashimoto
Collaborator Author

> There's no way to stop using the IN query here.

I'm thinking that if IN is the operation causing recursion (still a conjecture), we might be able to write an equivalent query with a CTE. So instead of this:

```sql
SELECT * FROM "notes"
 WHERE "graphId" == ?
  AND "deletedAt" IS NULL
  AND ("subject" IS NOT NULL OR "isBlank" = ?)
  AND "id" IN (SELECT "value" FROM JSON_EACH(?))
 ORDER BY "id"
```

Replace with something like this (untested even for syntax):

```sql
WITH "key" ("graphId", "id") AS
 (SELECT ?, "value" FROM JSON_EACH(?))
SELECT * FROM "notes" NATURAL JOIN "key"
 WHERE "deletedAt" IS NULL
  AND ("subject" IS NOT NULL OR "isBlank" = ?)
 ORDER BY "id"
```

I think the CTE form might be faster as well.

@vojto

vojto commented Jan 30, 2025

@rhashimoto so do you want me to update our query and try that form? It'll take some messing around with Kysely, but I should be able to do that.

@rhashimoto
Collaborator Author

rhashimoto commented Jan 30, 2025

Yes, I think trying a CTE is worth a shot. So far I'm not having much success with cranking up the stack size.

I would run both queries while you're developing so you can verify that they return identical results.
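One way to check that mechanically is to compare the two result sets while ignoring row and key order (a hypothetical helper, assuming rows come back as plain objects):

```typescript
// Hypothetical helper for verifying that the IN-based and CTE-based
// queries agree. Serializes each row with sorted keys, then sorts the
// serialized rows, so neither key order nor row order affects the result.
function sameResults(a: object[], b: object[]): boolean {
  const normalize = (rows: object[]) =>
    rows.map((row) => JSON.stringify(row, Object.keys(row).sort())).sort();
  const na = normalize(a);
  const nb = normalize(b);
  return na.length === nb.length && na.every((v, i) => v === nb[i]);
}
```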

For a first pass it might be easiest to use the "sql" raw query builder just to see if it works. I think that would be something like:

```ts
await sql`WITH key (graphId, id) AS
  (SELECT ${sql.lit(graphId)}, value FROM ${jsonEach(chunkIds)})
  SELECT * FROM notes NATURAL JOIN key
    WHERE deletedAt IS NULL
    AND (subject IS NOT NULL OR isBlank = 0)
    ORDER BY id
  `.execute(db)
```

@vojto

vojto commented Jan 30, 2025

@ocavue I'm not gonna have enough time to wrap this up today, so could you please try @rhashimoto's suggestion?

And in the meantime, @maccman agreed that we should restore the dialog warning about Safari.

@rhashimoto
Collaborator Author

> So far I'm not having much success with cranking up the stack size.

Apparently there are two stacks in WebAssembly: the C stack and the VM stack. The "RangeError" in

    RangeError: Maximum call stack size exceeded

indicates that it is the VM stack that is overflowing. Unfortunately, that one is built into the browser, and Emscripten build flags don't affect it. The generally recommended fix is to use an optimizing build to reduce stack usage by eliminating unnecessary locals. 😖
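The same VM stack limit is observable from plain JavaScript, outside WebAssembly entirely (an illustrative demo, unrelated to wa-sqlite itself):

```typescript
// Unbounded, non-tail recursion overflows the engine-managed call stack,
// producing the same RangeError the issue reports.
function recurse(n: number): number {
  return 1 + recurse(n + 1); // the "1 +" keeps this from being a tail call
}

let error: unknown;
try {
  recurse(0);
} catch (e) {
  error = e;
}
console.log(error instanceof RangeError); // true: "Maximum call stack size exceeded"
```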

So we need to decrease optimization to avoid the Safari bug, but doing so causes certain queries to overflow the Chrome VM stack. This is looking like a dead end for fixing the problem.

The pathways still open are:

  • Reduce page size to 99, or
  • See if avoiding large IN tests works

@rhashimoto
Collaborator Author

If the CTE is hard to express with Kysely, maybe a subquery would be easier:

```sql
SELECT * FROM notes
  NATURAL JOIN (SELECT 'myGraph' AS graphId, value AS id FROM JSON_EACH('[...]'))
  WHERE ...
  ORDER BY ...
```

@rhashimoto
Collaborator Author

rhashimoto commented Jan 31, 2025

This is close:

```ts
const q = this.noteTable.query()
  .innerJoin(
    sql<{ graphId: string, id: string }>`(SELECT ${sql.val(this.graphId).as('graphId')}, ${sql.ref('value').as('id')} FROM JSON_EACH(${sql.val(JSON.stringify(chunkIds))}))`.as('key'),
    (join) => join
      .onRef('notes.graphId', '=', 'key.graphId')
      .onRef('notes.id', '=', 'key.id'))
  .selectAll()
  .where('deletedAt', 'is', null)
  .where((eb) => eb.or([eb('subject', 'is not', null), eb('isBlank', '=', 0)]))
  .orderBy('key.id')
  // .compile()
```

But it produces this:

```sql
select * from "notes" inner join
  (SELECT ? as "graphId", "value" as "id" FROM JSON_EACH(?)) as "key"
    on "notes"."graphId" = "key"."graphId" and "notes"."id" = "key"."id"
  where "graphId" == ? and "deletedAt" is null and ("subject" is not null or "isBlank" = ?)
  order by "id"
```

That has an extra `"graphId" == ?` right after `where`, and I don't know how it got there or how to get rid of it (it's unnecessary, and ambiguous without a table prefix).

In any case, although this might eventually replace IN, I think this will be a partial solution at best. I see other queries without IN that also crash Chrome. I'm not feeling optimistic about it.
