
Error trying to convert Parquet date into Pandas datatype #181

Open
gallamine opened this issue May 19, 2021 · 2 comments

@gallamine

When a Parquet file contains a date type, dask-sql tries to convert the corresponding Pandas dataframe column into a date type that Pandas doesn't recognize. The issue seems to arise in https://github.com/nils-braun/dask-sql/blob/main/dask_sql/mappings.py#L273. Example:

import pandas as pd
import dask_sql
from datetime import datetime
df = pd.DataFrame({'date_col':[datetime.today()]})
dask_sql.mappings.cast_column_to_type(df, 'date_col', 'date')

results in TypeError: data type 'date' not understood.
This happens even though PyArrow has already decided that the Parquet date should be a datetime64[ns] type in Pandas.
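For reference, NumPy itself rejects "date" as a dtype name, which is presumably where the underlying astype call fails (a minimal illustration, not dask-sql code):

import numpy as np

np.dtype("datetime64[ns]")  # valid: the dtype PyArrow maps Parquet dates to
np.dtype("date")            # raises TypeError: data type 'date' not understood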

I'm using Python 3.7, pyarrow==4.0.0, and pandas==1.2.4.

@nils-braun
Collaborator

Hi @gallamine!
Thanks again for the bug report, and I am sorry that you seem to be the one ironing out all the shortcomings :-( Thank you for sticking with it and reporting these issues back!

I can reproduce the error message with your code (in the newest version of dask-sql the function call is a bit different, but that does not change anything). However, I am wondering why we end up with this function call at all during "normal" operation: in principle, all calls to cast_column_to_type should only receive a type that was already converted with sql_to_python_type (and is therefore a valid Python/Pandas type). I think that is where the problem lies.
(Background: cast_column_to_type should only do the casting; the conversion from a SQL type such as DATE to the corresponding Python/Pandas type should happen beforehand.)
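To illustrate the intended flow (a minimal sketch; the exact value returned by sql_to_python_type may differ between versions):

import pandas as pd
from datetime import datetime
from dask_sql.mappings import sql_to_python_type, cast_column_to_type

df = pd.DataFrame({"date_col": [datetime.today()]})

# First convert the SQL type into a Python/Pandas type ...
python_type = sql_to_python_type("DATE")

# ... and only then cast the column to it
df = cast_column_to_type(df, "date_col", python_type)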

I tried to reproduce it with a Parquet file:

import pandas as pd
import dask.dataframe as dd
from datetime import datetime
from dask_sql import Context

# Write a Parquet file containing a datetime column
df = pd.DataFrame({"date_col": [datetime.today()]})
df.to_parquet("181.parquet")

# Read it back with Dask and register it with dask-sql
df = dd.read_parquet("181.parquet")

c = Context()
c.create_table("df", df)

print(c.sql("SELECT *, CAST(date_col AS DATE), EXTRACT(DOW FROM date_col) FROM df").compute())

but that works.

Relating to your other open issue #179, I assume this issue came up during interaction with Hive, and in particular with a partitioned Hive table?

Looking back at the code of the Hive input, I wonder if the call to sql_to_python_type is actually missing there. I see that I implemented it for non-partitioned tables, but I cannot find it for the partition columns (it should be around https://github.com/nils-braun/dask-sql/blob/main/dask_sql/input_utils/hive.py#L132). Well, that is clearly a bug!
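The fix could look roughly like the following hypothetical sketch (partition_columns and sql_type are placeholder names, not the actual variables in hive.py):

from dask_sql.mappings import sql_to_python_type, cast_column_to_type

# Hypothetical sketch: mirror the non-partitioned code path and convert
# the SQL type of every partition column before casting
for partition_column, sql_type in partition_columns.items():
    python_type = sql_to_python_type(sql_type.upper())
    df = cast_column_to_type(df, partition_column, python_type)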

If you want, I can try to reproduce it using my local Hive setup and come up with a bugfix. If you already have a setup for testing this (and my assumption is correct that your bug came up during Hive usage), I am also happy to work on it together with you!
(Are you currently participating in the Dask Summit?)

@gallamine
Author

I'm testing your suggested fix now. I was OOO the past few days and (sadly) was not attending the Dask Summit.
