json.dumps takes significant CPU time in data heavy workloads like ETL or streaming #689
Comments
Thanks for your suggestion. Switching to orjson seems to work well without much ado, and it doesn't break the test suite. You can test the package in your application by using …
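As a rough illustration of what such a swap could look like, here is a minimal sketch of a str-returning `dumps` shim. The fallback branch and the `dumps` wrapper name are assumptions for this example, not the driver's actual code; note that `orjson.dumps` returns `bytes`, unlike the stdlib:

```python
import json

try:
    import orjson

    def dumps(obj):
        # orjson.dumps returns bytes; decode to keep a str-based API
        return orjson.dumps(obj).decode("utf-8")

except ImportError:
    def dumps(obj):
        # Fall back to the stdlib when orjson is not installed
        return json.dumps(obj, separators=(",", ":"))

payload = dumps({"id": 1, "name": "crate"})
```

Because both branches produce a plain `str`, callers such as an HTTP client can keep passing the result along unchanged.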
If you are using SQLAlchemy, please also check this patch, which unlocks a communication path that avoids JSON marshalling entirely. Just yesterday, @seut emphasized in another conversation that the communication path using the PostgreSQL wire protocol has many benefits compared to the HTTP/JSON-based communication, specifically regarding efficiency, especially when shuffling large amounts of data around while streaming query results from the database cluster to the client. /cc @karynzv, @surister, @wierdvanderhaar, @hlcianfagna, @hammerhead, @simonprickett, @kneth
@seut: Do you agree to include GH-691 in a release version 2.0.0, as proposed? @widmogrod: Because we received approval of GH-691, may I kindly ask you to test …
Thank you for your fast response 🚀.
Update 2.0.0.dev5
Hi again. The improvements from GH-691 have been shipped as pre-release packages. Version 2.0.0.dev5 is the most recent one and is available on PyPI, together with an SQLAlchemy dialect package that permits installing it. They can be slotted into downstream applications and frameworks, both ours and third-party, in order to get more eyeballs on them.
Update 2.0.0.dev6
Hi again. The improvements from GH-691 have been updated and shipped as pre-release package version 2.0.0.dev6, again together with a corresponding SQLAlchemy dialect package. Please use them in your downstream applications and frameworks; we will be happy to hear about any outcomes.
Hi Team,
I use the crate-python driver in a CDC process that writes data to CrateDB.
While benchmarking and optimizing the process, I recorded CPU flame graphs which show that a lot of CPU time is spent on JSON serialization:
crate-python/src/crate/client/http.py
Line 337 in c6892d5
There are publicly available benchmarks reporting that json.dumps is very slow, and switching to ujson or orjson can make a huge difference. I can confirm that swapping json for ujson reduces time spent on serialization by around 30-40%.
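To reproduce such a comparison locally, a small harness along these lines can time any `dumps` callable over a synthetic batch of rows. The row shape and iteration counts here are made up for illustration; actual CDC payloads will differ:

```python
import json
import timeit

# Synthetic batch standing in for a CDC/ETL payload
rows = [{"id": i, "value": i * 0.5, "tag": f"row-{i}"} for i in range(1000)]

def bench(dumps, number=100):
    """Time serializing the batch `number` times with the given dumps callable."""
    return timeit.timeit(lambda: dumps(rows), number=number)

stdlib_time = bench(json.dumps)
# To compare alternatives, install them and call e.g. bench(ujson.dumps)
# or bench(lambda obj: orjson.dumps(obj)) in the same way.
print(f"stdlib json: {stdlib_time:.4f}s")
```

Running the same harness against each candidate library on your real payloads gives a more trustworthy number than generic published benchmarks.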
I didn't check how this impacts correctness, since the current implementation uses json.dumps(cls=...) to provide custom transformation logic.
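The correctness concern is that `cls=`-based encoders hook into serialization via `JSONEncoder.default`, while libraries like orjson and ujson instead accept a plain `default` callable. The encoder below is hypothetical, only illustrating the shape of the problem, not the driver's actual transformation logic:

```python
import json
from datetime import datetime, timezone
from decimal import Decimal

class ExampleEncoder(json.JSONEncoder):
    # Hypothetical encoder standing in for custom transformation logic
    def default(self, o):
        if isinstance(o, Decimal):
            return float(o)
        if isinstance(o, datetime):
            # Milliseconds since epoch
            return int(o.timestamp() * 1000)
        return super().default(o)

def default(o):
    # Equivalent hook for libraries that take a `default` callable instead of `cls`
    if isinstance(o, Decimal):
        return float(o)
    if isinstance(o, datetime):
        return int(o.timestamp() * 1000)
    raise TypeError(f"Type not serializable: {type(o)}")

doc = {"price": Decimal("1.5"), "ts": datetime(2024, 1, 1, tzinfo=timezone.utc)}
encoded = json.dumps(doc, cls=ExampleEncoder)
```

Any migration would need to verify that the replacement library's `default` hook is invoked for exactly the same types the existing encoder handles, and produces identical output.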
Cheers,
Gabriel