
Delta+Azure is slow #104

Open
nfoerster opened this issue Oct 2, 2024 · 8 comments
Comments

nfoerster commented Oct 2, 2024

We are doing some tests now that the Azure and Delta plugins work together; however, we are having serious trouble writing performant queries against our Delta Lake test table.

The Delta table has around 70 columns and 1.5 billion rows and is partitioned on two levels: the first, on serial number, has around 270 partitions, and the second has around 10-20 partitions per serial number based on year-month. All files are Parquet, we have only one Delta version, the data is compacted and vacuumed, and the metadata history is almost clean.

We are running the queries with DuckDB 1.1.1 from an Azure VM inside the same VNet as the blob store.

This is the setup:

LOAD azure;
LOAD delta;

set azure_transport_option_type = 'curl'; -- needed because we otherwise get SSL exceptions, as also noted in the Azure extension section of the DuckDB docs

-- two partition selection 
SELECT uuid FROM delta_scan('abfss://q2deltalake/delta/k8s/partition_sn_yyymm_i5m_v15-3') WHERE SN='XXXX0005' and yyyymm='202304';

-- 717931 rows in 2m30s

-- one partition selection
SELECT uuid FROM delta_scan('abfss://q2deltalake/delta/k8s/partition_sn_yyymm_i5m_v15-3') WHERE SN='XXXX0005';

-- 5493265 rows in 2m35s

As you can see, there is not much of a difference between one and two partition clauses, although the second query touches far less data. I think it scans the whole Delta table instead of pushing the filters down to the partitions.
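The behaviour above is consistent with the engine listing every file instead of pruning by the hive partition values encoded in the paths. A minimal stdlib sketch of what that pruning step should do (paths and values below are made-up examples, not from the real table):

```python
# Sketch of hive-style partition pruning: given a file list and equality
# filters on partition columns, keep only files whose path segments match.
# Paths and values are hypothetical, for illustration only.

def prune_files(paths, filters):
    """Keep paths whose 'key=value' segments satisfy all equality filters."""
    kept = []
    for path in paths:
        parts = dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)
        if all(parts.get(k) == v for k, v in filters.items()):
            kept.append(path)
    return kept

paths = [
    "delta/SN=XXXX0005/yyyymm=202304/part-0.parquet",
    "delta/SN=XXXX0005/yyyymm=202305/part-0.parquet",
    "delta/SN=XXXX0006/yyyymm=202304/part-0.parquet",
]

# With both partition filters applied, only one file should be scanned.
print(prune_files(paths, {"SN": "XXXX0005", "yyyymm": "202304"}))
```

If pruning works, the two-filter query should read roughly one tenth of the data of the one-filter query; near-identical timings suggest it does not happen here.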

Have you ever made similar observations? Any hints would be appreciated.

@nfoerster changed the title Delta+Azure is incredible slow → Delta+Azure is slow on Oct 2, 2024
@nfoerster
Author

This is something else we tried; the query time is around 4-5 s. It includes a range predicate, so it takes much more data into consideration.

import fsspec
from azure.identity.aio import DefaultAzureCredential
import pyarrow.parquet as pq
import time

credentials = DefaultAzureCredential()
fs = fsspec.filesystem("abfs", credential=credentials, account_name="saweud")
filters = []
filter_ = [("SN", "==", "XXXX2987"), ("yyyymm", ">=", 202107), ("yyyymm", "<=", 202204)]
filters.append(filter_)
start = time.time()
# filters.append(('billing_date', '<=', de.strftime('%Y-%m-%d')))
dt = pq.read_table(
    "q2deltalake/delta/k8s/partition_sn_yyymm_i5m_v15-3",
    partitioning="hive",
    filters=filters,
    filesystem=fs,
    columns=[
        "SN",
        "Date",
        "OperatingMode",
        "OP"
    ],
).to_pandas()
print(dt)
print(time.time() - start)
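If an engine only prunes on equality predicates, a range such as `yyyymm >= 202107 AND yyyymm <= 202204` can be expanded into an explicit list of partition values instead. A small hypothetical helper for that (the function name is our own, not part of any library):

```python
def months_between(start, end):
    """Inclusive list of yyyymm ints between two months, handling year rollover."""
    out = []
    y, m = divmod(start, 100)
    while y * 100 + m <= end:
        out.append(y * 100 + m)
        m += 1
        if m > 12:
            y, m = y + 1, 1
    return out

# Expands the range used in the pyarrow filter above into 10 partition values.
print(months_between(202107, 202204))
```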

@kyrre

kyrre commented Oct 23, 2024

How does the transaction log look? We have experienced similar performance issues for tables where the transaction log had a lot of entries.
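One quick way to inspect this is to count how many commit files sit after the last checkpoint; a reader must replay all of them before it can scan any data. A hedged stdlib sketch against a synthetic `_delta_log` layout (the real log lives in the table's `_delta_log` folder on the blob store):

```python
import json
import os
import tempfile

# Sketch: count Delta commit files newer than the last checkpoint. A long
# tail of post-checkpoint JSON commits is a common cause of slow table loads.
# The directory contents below are synthetic, for illustration only.

def commits_after_checkpoint(log_dir):
    """Return commit file names with a version above the last checkpoint's."""
    with open(os.path.join(log_dir, "_last_checkpoint")) as f:
        version = json.load(f)["version"]
    return sorted(
        name for name in os.listdir(log_dir)
        if name.endswith(".json") and int(name.split(".")[0]) > version
    )

with tempfile.TemporaryDirectory() as log_dir:
    for v in range(5):  # commits 0..4
        open(os.path.join(log_dir, f"{v:020d}.json"), "w").close()
    with open(os.path.join(log_dir, "_last_checkpoint"), "w") as f:
        json.dump({"version": 2}, f)  # checkpoint covers commits 0..2
    newer = commits_after_checkpoint(log_dir)

print(newer)  # commits 3 and 4 must be replayed on every table load
```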

@samansmink
Collaborator

@nfoerster thanks for reporting, I will investigate. This sounds like filter pushdown on partitions is not working properly.

@kyrre I'm working on an optimization that should improve that: #110

@nfoerster2
Contributor

> How does the transaction log look? We have experienced similar performance issues for tables where the transaction log had a lot of entries.

How can I record the logs?

@kyrre

kyrre commented Nov 22, 2024

> How does the transaction log look? We have experienced similar performance issues for tables where the transaction log had a lot of entries.
>
> How can I record the logs?

I am referring to the logs in the _delta_log folder.

Afaik this issue is due to limitations with the Azure API: microsoft/AzureStorageExplorer#134 (comment), which means that you can't easily retrieve the entries that are newer than the checkpoint!

@nfoerster2
Contributor

nfoerster2 commented Nov 26, 2024

We know about this limitation; if you don't care about the Delta log, you will see a continuous increase in append or merge operation time. Therefore we frequently compact, vacuum, and checkpoint the Delta table and keep a very short metadata history. This is all that is in the _delta_log folder:

[screenshot: contents of the _delta_log folder]
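The cleanup this maintenance routine relies on can be sketched as a retention filter over commit files, mimicking what a low delta.logRetentionDuration achieves at cleanup time; this is an illustrative stdlib model, not delta-rs's actual implementation:

```python
import os
import tempfile
import time

# Sketch: find commit files whose modification time falls outside a
# retention window, the way a short delta.logRetentionDuration lets log
# cleanup drop old entries. Files and timestamps below are synthetic.

def expired_log_files(log_dir, retention_seconds, now=None):
    """Return commit file names older than the retention window."""
    now = time.time() if now is None else now
    return sorted(
        name for name in os.listdir(log_dir)
        if name.endswith(".json")
        and now - os.path.getmtime(os.path.join(log_dir, name)) > retention_seconds
    )

with tempfile.TemporaryDirectory() as log_dir:
    old = os.path.join(log_dir, "00000000000000000001.json")
    new = os.path.join(log_dir, "00000000000000000002.json")
    for p in (old, new):
        open(p, "w").close()
    os.utime(old, (time.time() - 600, time.time() - 600))  # 10 minutes old
    expired = expired_log_files(log_dir, retention_seconds=300)  # 5 min window

print(expired)  # only the 10-minute-old commit exceeds the 5-minute window
```

Note that entries are only actually removed when a checkpoint-triggered cleanup runs, which is why the frequent checkpointing described above matters.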

@kyrre

kyrre commented Nov 26, 2024

> We know about this limitation; if you don't care about the Delta log, you will see a continuous increase in append or merge operation time. Therefore we frequently compact, vacuum, and checkpoint the Delta table and keep a very short metadata history....

Are you setting delta.logRetentionDuration to a really low value to achieve this?

@nfoerster2
Contributor

Yes, for that test we set the value to 5 min.

[image attachment]

4 participants