
Error trying to convert Parquet date into Pandas datatype #181

Open
gallamine opened this issue May 19, 2021 · 2 comments

@gallamine

When a Parquet file contains a date type, dask-sql tries to convert the corresponding Pandas dataframe column into a date type that Pandas doesn't recognize. The issue seems to arise in https://github.com/nils-braun/dask-sql/blob/main/dask_sql/mappings.py#L273. Example:

import pandas as pd
import dask_sql
from datetime import datetime
df = pd.DataFrame({'date_col':[datetime.today()]})
dask_sql.mappings.cast_column_to_type(df, 'date_col', 'date')

results in TypeError: data type 'date' not understood.
This happens even though PyArrow has already decided that the Parquet date should be a datetime64[ns] type in Pandas.
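For reference, NumPy itself rejects "date" as a dtype name, which is presumably where the underlying astype call fails (a minimal illustration, not dask-sql code):

import numpy as np

np.dtype("datetime64[ns]")  # valid: the dtype PyArrow maps Parquet dates to
np.dtype("date")            # raises TypeError: data type 'date' not understood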

I'm using Python 3.7, pyarrow==4.0.0, and pandas==1.2.4.

@nils-braun
Collaborator

Hi @gallamine!
Thanks again for the bug report, and I am sorry that you seem to be the one ironing out all the shortcomings :-( Thank you for sticking with it and reporting these issues back!

I can reproduce the error message with your code (in the newest version of dask-sql the function call is a bit different, but that does not change anything). However, I am wondering why we end up with this function call at all during "normal" operation: in principle, all calls to cast_column_to_type should only receive a type that was already converted with sql_to_python_type (and is therefore a valid Python/Pandas type). I think that is where the problem lies.
(Background: cast_column_to_type should only do the casting; the conversion from a SQL type such as DATE to the corresponding Python/Pandas type should happen beforehand.)
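To illustrate the intended flow (a minimal sketch; the exact value returned by sql_to_python_type may differ between versions):

import pandas as pd
from datetime import datetime
from dask_sql.mappings import sql_to_python_type, cast_column_to_type

df = pd.DataFrame({"date_col": [datetime.today()]})

# First convert the SQL type into a Python/Pandas type ...
python_type = sql_to_python_type("DATE")

# ... and only then cast the column to it
df = cast_column_to_type(df, "date_col", python_type)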

I tried to reproduce it with a Parquet file:

import pandas as pd
import dask.dataframe as dd
from datetime import datetime
from dask_sql import Context

# Write a Parquet file containing a datetime column
df = pd.DataFrame({"date_col": [datetime.today()]})
df.to_parquet("181.parquet")

# Read it back with Dask and register it with dask-sql
df = dd.read_parquet("181.parquet")

c = Context()
c.create_table("df", df)

print(c.sql("SELECT *, CAST(date_col AS DATE), EXTRACT(DOW FROM date_col) FROM df").compute())

but that works.

Relating to your other open issue #179, I assume this issue came up during interaction with Hive, and in particular with a partitioned Hive table?

Looking back at the code of the Hive input, I wonder if the call to sql_to_python_type is actually missing there. I see that I implemented it for non-partitioned tables, but I cannot find it for the partition columns (it should be around https://github.com/nils-braun/dask-sql/blob/main/dask_sql/input_utils/hive.py#L132). Well, that is clearly a bug!
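The fix could look roughly like the following hypothetical sketch (partition_columns and sql_type are placeholder names, not the actual variables in hive.py):

from dask_sql.mappings import sql_to_python_type, cast_column_to_type

# Hypothetical sketch: mirror the non-partitioned code path and convert
# the SQL type of every partition column before casting
for partition_column, sql_type in partition_columns.items():
    python_type = sql_to_python_type(sql_type.upper())
    df = cast_column_to_type(df, partition_column, python_type)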

If you want, I can try to reproduce it using my local Hive setup and come up with a bugfix. If you already have a setup for testing this (and my assumption is correct that your bug came up during Hive usage), I am also happy to work on it together with you!
(Are you currently participating in the Dask Summit?)

@gallamine
Author

I'm testing your suggested fix now. I was OOO the past few days and (sadly) was not attending the Dask Summit.
