Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for CREATE TABLE AS #130

Merged
merged 5 commits into from
Aug 16, 2024
Merged

Add support for CREATE TABLE AS #130

merged 5 commits into from
Aug 16, 2024

Conversation

JelteF
Copy link
Collaborator

@JelteF JelteF commented Aug 16, 2024

This adds support for queries like:

CREATE TABLE mytable AS SELECT column00 from read_csv("my_data.csv") AS (column00 int)

The table will be then be created in Postgres (not in DuckDB). The main
change that supports this is that we now start "deparsing" the parsed
Query type from Postgres to an actual query string, instead of sending
the original string that Postgres received to DuckDB.

This also accidentally fixes some differences such as the DuckDB
count(*) aggregate creating a count_star vs Postgres its count
column. So it reduces the need for #36 a bit. The reason this happens is
because it is deparsed very explicitly like count(*) AS count.

For similar reasons this also changes the behaviour of the following
query to only return the id and name columns, instead of all the
columns. It's arguable whether that's a good thing or not: it's more
verbose if you want those columns but it's also less surprising.

SELECT * FROM read_parquet(...) AS (id int, name text);

One especially surprising thing with the above query is that
if you used a column name that was not in the parquet file,
then your query would succeed, but you would not actually
get that column back in the result (only the actual columns
in the file).

@JelteF JelteF requested a review from Tishj August 16, 2024 12:38
Comment on lines +30 to +36
WARNING: (DuckDB) Binder Error: Referenced table "t" not found!
Candidate tables: "s.t"
LINE 3: FROM ( SELECT t."?column?"
^
?column? | ?column?
----------+----------
21 | 21
21 | 42
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous DuckDB pushdown was actually returning the wrong results here. So the DuckDB failure that this introduces is better than the alternative imho. Because now it warns and automatically falls back to Postgres, instead of returning a wrong result.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I noticed that as well, the postgres catalog extension solved the same problem

Comment on lines +1 to +2
CREATE TABLE t(a INT, b VARCHAR);
INSERT INTO t SELECT g % 1000, MD5(g::VARCHAR) FROM generate_series(1,20000) g;
Copy link
Collaborator Author

@JelteF JelteF Aug 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SELECT in the CREATE MATERIALIZED VIEW command now also goes through DuckDB. This is a good thing imho, because that means that we can use read_csv and read_parquet in those queries too.

However, on PG16 our type mapping results in the following error:

 CREATE MATERIALIZED VIEW tv AS SELECT * FROM t WHERE a % 100 = 0;
+ERROR:  SELECT rule's target entry 2 has different type from column "b"
+DETAIL:  SELECT target entry has type text, but column has type character varying.

Surprisingly in PG17 this works just fine.

While suboptimal that we now need to use VARCHAR, I think this is an improvement overall. And probably this means that we need to improve our type mapping a bit and actually return TEXT instead of VARCHAR from queries where the original table uses TEXT.

mkaruza
mkaruza previously approved these changes Aug 16, 2024
return true;
return false;
/* We don't support modifying statements yet */
if (query->commandType != CMD_SELECT) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already check this condition in DuckdbPlannerHook if statement but this is better place to contain this check commandType == SELECT.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved a few more checks from the DuckDBPlannerHook to IsAllowedStatement now.

Tishj
Tishj previously approved these changes Aug 16, 2024
Copy link
Collaborator

@Tishj Tishj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this sounds like a good change to me 👍

Base automatically changed from disallow-modifying-ctes to main August 16, 2024 14:01
@JelteF JelteF dismissed stale reviews from Tishj and mkaruza August 16, 2024 14:01

The base branch was changed.

An error occurred while trying to automatically change base from disallow-modifying-ctes to main August 16, 2024 14:01
JelteF added 5 commits August 16, 2024 16:01
This adds support for queries like:
```
CREATE TABLE mytable AS SELECT column00 from read_csv("my_data.csv");
```

The table will be then be created in Postgres (not in DuckDB). The main
change that supports this is that we now start "deparsing" the parsed
`Query` type from Postgres to an actual query string, instead of sending
the original string that Postgres received to DuckDB.

This also accidentally fixes some differences such as the DuckDB
`count(*)` aggregate creating a `count_star` vs Postgres its `count`
column. So it reduces the need for #36 a bit. The reason this happens is
because it is deparsed very explicitly like `count(*) AS count`.

For similar reasons this also changes the behaviour of the following
query to only return the `id` and `name` columns, instead of all the
columns. It's arguable whether that's a good thing or not.
```
SELECT * FROM read_parquet(...) AS (id int, name text);
```
@mkaruza mkaruza self-requested a review August 16, 2024 14:02
@JelteF JelteF merged commit 8dfa127 into main Aug 16, 2024
3 checks passed
@JelteF JelteF deleted the create-table-as branch August 16, 2024 14:03
JelteF added a commit that referenced this pull request Aug 20, 2024
The issue in #119 was already fixed by #130, but this PR adds a test for
it.

Fixes #119
mkaruza pushed a commit that referenced this pull request Sep 11, 2024
Postgres contains code in the `src/backend/utils/adt/ruleutils.c` file
to build a query string from a `Query *`. We also need to do this to be
able to send a query string to DuckDB, and we started using this
functionality in #130. Sadly this part of the Postgres code is not
extendable at all, and we do need to make some small changes.

So this vendors in the files, one for each supported postgres version.
This commit does not make any changes yet.
JelteF added a commit that referenced this pull request Sep 11, 2024
Postgres contains code in the `src/backend/utils/adt/ruleutils.c` file
to build a query string from a `Query *`. We also need to do this to be
able to send a query string to DuckDB, and we started using this
functionality in #130. Sadly this part of the Postgres code is not
extendable at all, and we do need to make some small changes.

So this vendors in the files, one for each supported postgres version.
This commit does not make any changes yet.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants