Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dask-SQL not wrapping partition value in quotes #179

Closed
gallamine opened this issue May 18, 2021 · 6 comments · Fixed by #834
Closed

Dask-SQL not wrapping partition value in quotes #179

gallamine opened this issue May 18, 2021 · 6 comments · Fixed by #834

Comments

@gallamine
Copy link

gallamine commented May 18, 2021

Hello, I am querying a Hive table that is partitioned using string values that look like a date -YYYY-MM-DD. My create_table command fails with a HiveSQLException because the - isn't being escaped.
For example:

[SQL: DESCRIBE FORMATTED schema.table PARTITION (partition_col=2019-01-21)]

this should instead read partition_col="2019-01-21".

I am using dask-sql==0.3.6

@nils-braun
Copy link
Collaborator

Thanks for the bug report @gallamine - that definitely is a bug.
I think all that we would need to fix is this SQL string to also include proper escaping.
Would you like to do a PR, as you have already the Hive data available for testing?
You can follow the description here for how to setup dask-sql in development. I would love to see your PR!

@gallamine
Copy link
Author

@nils-braun I'm interested. Trying to think of an elegant way of knowing whether the partition value should be enclosed in quotations or not. Is there a good way you know of for determining what the type of partition_col is?

@Robert-Christensen-visa

@gallamine is it a problem to enclose all partition values in quotes? In our setup, it looks like if a partition has an int type, when running the query DESCRIBE FORMATTED {table_name} PARTITION (...) it can either be enclosed in quotes or not. Both will work. It seems like it would not be a problems to enclose all in quotes.

Only putting quotes around strings that contain at least one non-digit, such as what is in #180 would be problematic for my case. There exists a table that has a partition value 01. The column is type string, but often contains numeric only values. For the case of 01 if I try to query partition information without quoting the string it removes the leading 0 and fails because it can't find a partition with the value 1. Putting quotes around everything seems to work for me, so it is not clear me why you would need to be selective about what you enclose in quotes and what you don't. Just do it for everything.

@nils-braun
Copy link
Collaborator

Hi @Robert-Christensen-visa , @gallamine - that would be very nice and would simplify our lives a lot!

Sorry @gallamine for not continuing the discussion on the PRs and here. Can we pick this up again?

@quasiben
Copy link
Contributor

quasiben commented Oct 7, 2021

@Robert-Christensen-visa or @gallamine would it be possible to build a small reproducer ? I'm having some trouble generating the data partitioned in the way you've described

@Robert-Christensen-visa

Hi @quasiben. Here is a short script that will do something that causes an error for me:

HDFS_BASE_PATH="..." # place the path to put the table
hdfs dfs -mkdir ${HDFS_BASE_PATH}/month01
hdfs dfs -mkdir ${HDFS_BASE_PATH}/month02
echo "hello world,1234" | hdfs dfs -put - ${HDFS_BASE_PATH}/data0001
echo "foo bar,4321" | hdfs dfs -put - ${HDFS_BASE_PATH}/data0001

hive -e "CREATE EXTERNAL TABLE partitiontest (name STRING, number INT) \
         PARTITIONED BY (month STRING) \
         ROW FORMAT DELIMITED \
          FIELDS TERMINATED BY ',' \
          LINES TERMINATED BY '\n' \
      LOCATION ${HDFS_BASE_PATH}
hive -e "ALTER TABLE partitiontest ADD PARTITION (month='01') LOCATION ${HDFS_BASE_PATH}/month01"
hive -e "ALTER TABLE partitiontest ADD PARTITION (month='02') LOCATION ${HDFS_BASE_PATH}/month02"

When I create a table like the one above and try to query it using dask-sql I get an error:

from dask_sql import Context
c = Context()
c.create_table("partitiontest", get_hive_connection().cursor(), hive_table_name="partitiontest")

The error returned is long, but here is a small portion that would be most relevant here:

...sqlState='42000', errorCode=10006, errorMessage='Error while compiling statement: FAILED: SemanticException [Error 10006]: Partition not found {month=1}'), operationHandle=None)

When it collects information about the partitions it will find two partitions, month=01 and month=02. When it tries to get additional information about the partition it is not quoted, so it submits a request to get the partition month=1, removing the 0. In the example from @gallamine it might try to do subtraction since the dates are not quoted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants