json.dumps takes significant CPU time in data heavy workloads like ETL or streaming #689

widmogrod · 2025-01-15T22:03:11Z

Hi Team,

I use the crate-python driver in a CDC process that writes data to CrateDB.
While benchmarking and optimizing the process, I recorded CPU flamegraphs showing that a lot of CPU time is spent on JSON serialization:

crate-python/src/crate/client/http.py

Line 337 in c6892d5

return json.dumps(data, cls=CrateJsonEncoder)

There are publicly available benchmarks reporting that json.dumps is very slow, and changing the library to ujson or orjson can make a huge difference. I can confirm that swapping json to ujson reduces time spent on serialization by around 30-40%.
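
A comparison like this can be reproduced outside the driver with a small timing script. The following is only a sketch: the payload shape and the `time_call` helper are made up for illustration and are not taken from crate-python, and `orjson` is treated as an optional third-party dependency that is skipped when it is not installed.

```python
import json
import timeit

try:
    import orjson  # third-party; optional for this sketch
except ImportError:
    orjson = None

# Hypothetical payload, loosely shaped like a bulk insert request.
rows = [[i, f"name-{i}", i * 0.5, {"tags": ["a", "b"]}] for i in range(1_000)]
payload = {"stmt": "INSERT INTO demo VALUES (?, ?, ?, ?)", "bulk_args": rows}


def time_call(func, repeat=5, number=20):
    """Return the best observed time per call, in seconds."""
    timings = timeit.repeat(lambda: func(payload), repeat=repeat, number=number)
    return min(timings) / number


results = {"json": time_call(json.dumps)}
if orjson is not None:
    results["orjson"] = time_call(orjson.dumps)

for name, seconds in results.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```

The numbers depend heavily on the payload and the machine, so a synthetic run like this only complements, and does not replace, profiling the real workload.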

I didn't check how this impacts correctness, since the current implementation uses json.dumps(cls=...) to provide custom transformation logic.
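
One way to approach the correctness question is to express the custom logic as a plain fallback function and compare the output of both serializers. This is a sketch under assumptions: `SketchEncoder` and `to_jsonable` are hypothetical stand-ins and not the actual `CrateJsonEncoder`, and `orjson` is optional here. `orjson.dumps` accepts a `default=` callable instead of an encoder class and returns `bytes`.

```python
import json
from decimal import Decimal
from uuid import UUID

try:
    import orjson  # third-party; optional for this sketch
except ImportError:
    orjson = None


def to_jsonable(obj):
    """Hypothetical fallback for types the serializer cannot handle itself."""
    if isinstance(obj, (Decimal, UUID)):
        return str(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


class SketchEncoder(json.JSONEncoder):
    """Stand-in for a custom encoder passed via `cls=` (not CrateJsonEncoder)."""

    def default(self, o):
        return to_jsonable(o)


def dumps_stdlib(data):
    return json.dumps(data, cls=SketchEncoder)


def dumps_fast(data):
    # Fall back to the stdlib path when orjson is not installed.
    if orjson is None:
        return dumps_stdlib(data)
    return orjson.dumps(data, default=to_jsonable).decode("utf-8")


payload = {
    "stmt": "INSERT INTO t (id, price) VALUES (?, ?)",
    "args": [UUID(int=1), Decimal("9.99")],
}

# Compare parsed results rather than raw strings, because whitespace differs.
same = json.loads(dumps_stdlib(payload)) == json.loads(dumps_fast(payload))
print("outputs match:", same)
```

Note that orjson serializes some types natively (for example `UUID` and `datetime`), so the fallback is never called for them. Those are the types where output differences are most likely and worth covering in tests.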

Cheers,
Gabriel

amotl · 2025-01-15T23:03:02Z

Thanks for your suggestion. Switching to orjson seems to work well without much ado, as it doesn't break the test suite.

Use orjson to improve JSON marshalling performance #691

You can test the package in your application by using crate==2.0.0.dev0.

https://pypi.org/project/crate/2.0.0.dev0/

amotl · 2025-01-16T00:04:39Z

If you are using SQLAlchemy, please also check this patch, which will unlock a communication path not using JSON marshalling at all.

Dialect: Add support for asyncpg and psycopg3 drivers sqlalchemy-cratedb#11

Just yesterday, @seut emphasized in another conversation that communicating over the PostgreSQL wire protocol has many benefits compared to HTTP/JSON-based communication, in particular with regard to efficiency, and especially when streaming large query results from the database cluster to the client.

/cc @karynzv, @surister, @wierdvanderhaar, @hlcianfagna, @hammerhead, @simonprickett, @kneth

amotl · 2025-01-16T09:00:59Z

@seut: Do you agree to include GH-691 into a release version 2.0.0, as proposed?

@widmogrod: Because we received approval of GH-691, may I kindly ask you to test crate>=2.0.0.dev0 already in a downstream application (actually, as many as possible would be excellent), in order to validate that nothing goes south?

widmogrod · 2025-01-16T09:24:44Z

Thank you for your fast response 🚀 .
I'm looking at the change.
I will let you know how things work, when I re-run benchmarks with the current and new versions.

widmogrod · 2025-01-16T11:11:10Z

I used poetry to fetch crate = "^2.0.0.dev1" and ran the benchmark.
This is what a section of the flame chart looks like:

left - crate = "^1.0.0"
right - crate = "^2.0.0.dev1"

This part of the code is much faster:

amotl · 2025-01-17T09:48:09Z

Update 2.0.0.dev5

Hi again. The improvements in GH-691 have been shipped as pre-release packages. Version 2.0.0.dev5 is the most recent one, and it is available on PyPI, together with an SQLAlchemy dialect package that permits installing it.

They can be slotted into downstream applications and frameworks, both ours and third-party ones, in order to give them more eyeballs.

amotl · 2025-01-18T12:43:46Z

Update 2.0.0.dev6

Hi again. The improvements in GH-691 have been updated and shipped as pre-release package version 2.0.0.dev6, again together with a corresponding SQLAlchemy dialect package.

Please use them in your downstream applications and frameworks. We will be happy to hear back about any outcomes.

widmogrod · 2025-01-20T13:39:32Z

I can confirm that on version dev6, the performance benefits remain consistent.

amotl mentioned this issue Jan 15, 2025

Use orjson to improve JSON marshalling performance #691

Open

amotl mentioned this issue Jan 17, 2025

[2025] Releasing version 2.0.0 #693

Open
