ConnectorX should take less time to execute SQL queries but it is taking more time than the sqlite3 module #251
Hi @deepakpunia20 , ConnectorX mainly targets the scenario of fetching large query results. It speeds up the process by optimizing the client-side execution and saturating both network and machine resources through parallelism. When query execution is the bottleneck (for example, the result size is small as in your case, or the query is very complex), there will be overhead coming from metadata fetching. In ConnectorX, up to three pieces of metadata are fetched before issuing the query to the database:
1. The MIN/MAX values of the partition column (only when partitioning is enabled), used to split the query.
2. The number of rows in the query result (a COUNT query), used to pre-allocate the Pandas destination.
3. The schema of the query result, used to determine the type of each column.
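For reference, a minimal sketch of a partitioned read that triggers metadata fetch 1; the connection URI, table, and partition column here are just placeholders:

```python
import connectorx as cx

# Placeholder connection URI and query -- adjust to your own database.
db_uri = "postgresql://username:password@localhost:5432/mydb"
query = "SELECT * FROM lineitem"

# Splitting the query on a numeric column makes ConnectorX first issue a
# MIN/MAX query on that column (metadata fetch 1 above), then run the
# resulting partitions in parallel.
df = cx.read_sql(db_uri, query, partition_on="l_orderkey", partition_num=4)
```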
In your situation the overhead comes from 2 and 3. In order to avoid the potentially costly COUNT query, we suggest using Arrow as an intermediate destination for ConnectorX and converting it into Pandas with Arrow's to_pandas API. For example:

```python
import connectorx as cx

table = cx.read_sql(db_uri, query, return_type="arrow")
df = table.to_pandas(split_blocks=False, date_as_object=False)
```

Please feel free to give it a try. It may reduce the time a bit. But since the query result in your case is very small, the overhead from 3 may still affect the end-to-end time.
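A quick way to see the difference is to time both paths side by side. The sketch below assumes a db_uri and query like the ones in your snippet (the SQLite path and table name are placeholders) and is only meant as a rough measurement:

```python
import time
import connectorx as cx

db_uri = "sqlite:///home/user/example.db"   # placeholder database path
query = "SELECT * FROM some_table"          # placeholder query

# Default pandas destination: issues the COUNT query (2) plus the schema fetch (3).
t0 = time.perf_counter()
df_pandas = cx.read_sql(db_uri, query)
t1 = time.perf_counter()

# Arrow destination: skips the COUNT query, then converts to pandas.
table = cx.read_sql(db_uri, query, return_type="arrow")
df_arrow = table.to_pandas(split_blocks=False, date_as_object=False)
t2 = time.perf_counter()

print(f"pandas destination : {t1 - t0:.3f}s")
print(f"arrow -> to_pandas : {t2 - t1:.3f}s")
```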
Hi @wangxiaoying ,
ConnectorX should take less time to execute SQL queries, but it is taking more time than the sqlite3 module. Below is the example: