Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

updates duckdb/motherduck load job, adds full ci for motherduck and updates docs #1674

Merged
merged 4 commits into from
Aug 12, 2024

Conversation

rudolfix
Copy link
Collaborator

@rudolfix rudolfix commented Aug 8, 2024

Description

  1. adds full CI for Motherduck
  2. updates docs
  3. enables multi statement txs and sets parallelism to 8 threads for Motherduck
  4. matches parquet by column name to table columns in load job. fixes Fix: handle column order gracefully when loading into ddb from pqt #1553
  5. loads jsonl with read_json without using COPY FROM which allows skipping of the fields
  6. drops internal locks for loading parquet files to the same table from multiple threads

Copy link

netlify bot commented Aug 8, 2024

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit 1112687
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/66b5ff666a3c150008e8f085

@rudolfix rudolfix self-assigned this Aug 8, 2024
@@ -1,11 +1,10 @@
---
title: 🧪 MotherDuck
title: MotherDuck
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@rudolfix rudolfix changed the title adds full ci for motherduck and updates docs updates duckdb/motherduck load job, adds full ci for motherduck and updates docs Aug 9, 2024
qualified_table_name, threading.Lock()
)
source_format = "read_parquet"
options = ", union_by_name=true"
elif self._file_path.endswith("jsonl"):
# NOTE: loading JSON does not work in practice on duckdb: the missing keys fail the load instead of being interpreted as NULL
Copy link
Collaborator

@sh-rp sh-rp Aug 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Judging from the work on my datasets PR this is not or no longer true. I have a test there that migrates a table and it still works in duckdb with json and parquet.

# we will use a different pipeline with a separate schema but writing to the same dataset and to the same table
# the table schema is identical to the previous one with a single field ("time") added
# this will create a different order of columns than in the destination database ("time" will map to "_dlt_id")
# duckdb copies columns by column index so that will fail
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the last line, afaik you are testing the union_by_name here and this should NOT fail.

Copy link
Collaborator

@sh-rp sh-rp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rudolfix rudolfix merged commit 250c2e2 into devel Aug 12, 2024
56 checks passed
@rudolfix rudolfix deleted the feat/runs-motherduck-ci branch August 12, 2024 07:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants