Support all parquet types? #9

Open
mcaceresb opened this issue Oct 31, 2018 · 14 comments

Comments

@mcaceresb
Owner

See arrow/cpp/src/arrow/builder.cc. For instance,

      BUILDER_CASE(UINT8, UInt8Builder);
      BUILDER_CASE(INT8, Int8Builder);
      BUILDER_CASE(UINT16, UInt16Builder);
      BUILDER_CASE(INT16, Int16Builder);
      BUILDER_CASE(UINT32, UInt32Builder);
      BUILDER_CASE(INT32, Int32Builder);
      BUILDER_CASE(UINT64, UInt64Builder);
      BUILDER_CASE(INT64, Int64Builder);
      BUILDER_CASE(DATE32, Date32Builder);
      BUILDER_CASE(DATE64, Date64Builder);
      BUILDER_CASE(TIME32, Time32Builder);
      BUILDER_CASE(TIME64, Time64Builder);
      BUILDER_CASE(TIMESTAMP, TimestampBuilder);
      BUILDER_CASE(BOOL, BooleanBuilder);
      BUILDER_CASE(HALF_FLOAT, HalfFloatBuilder);
      BUILDER_CASE(FLOAT, FloatBuilder);
      BUILDER_CASE(DOUBLE, DoubleBuilder);
      BUILDER_CASE(STRING, StringBuilder);
      BUILDER_CASE(BINARY, BinaryBuilder);
      BUILDER_CASE(FIXED_SIZE_BINARY, FixedSizeBinaryBuilder);
      BUILDER_CASE(DECIMAL, Decimal128Builder);

@kylebarron
Contributor

The question is what to do with data types that Stata doesn't support natively. These include:

  • uint32
  • uint64
  • int64
  • date64
  • time64

The options are

  1. Raise an error. Not great, because some of these types are written by default by, say, pandas. Datetimes are written as INT64 by default:

      In [32]: df['time'] = pd.to_datetime('2018-01-02')

      In [33]: df
      Out[33]: 
         a       time
      0  a 2018-01-02
      1  b 2018-01-02
      2  c 2018-01-02
      3  d 2018-01-02

      In [34]: df.to_parquet('test.parquet')

      In [35]: pf = pq.ParquetFile('test.parquet')

      In [36]: pf.schema
      Out[36]: 
      <pyarrow._parquet.ParquetSchema object at 0x7f824ac50a08>
      a: BYTE_ARRAY UTF8
      time: INT64 TIMESTAMP_MILLIS
      __index_level_0__: INT64

  2. Try to coerce to a Stata-capable format (double?). Doubles still represent integers exactly up to 2^53.
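
To make the precision point concrete, here is a minimal sketch of a check for whether an integer survives coercion to a double. The helper name `survives_double` and the constant `MAX_EXACT_INT` are made up for illustration; this is not how the plugin does it.

```python
# Every integer with magnitude up to 2**53 is exactly representable in an
# IEEE 754 double (53-bit significand). Beyond that, gaps appear.
MAX_EXACT_INT = 2**53


def survives_double(value: int) -> bool:
    """Return True if an integer round-trips through a double unchanged.

    Hypothetical helper for illustration only.
    """
    return int(float(value)) == value


# Millisecond timestamp for 2018-01-02 00:00:00 UTC: far below 2**53.
print(survives_double(1_514_851_200_000))  # True
print(survives_double(MAX_EXACT_INT))      # True
print(survives_double(MAX_EXACT_INT + 1))  # False: rounds to 2**53
```

So millisecond timestamps like the one pandas wrote above fit comfortably, while arbitrary int64 or uint64 values near the top of their range would not.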

@mcaceresb
Owner Author

I think that making them doubles is the way to go.
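
A rough sketch of what "fall back to double" could look like if the storage type were chosen from the observed value range of a column. The ranges are Stata's documented limits for byte, int and long; the function name `stata_type_for_range` is hypothetical and not taken from the plugin.

```python
# Stata's integer storage types and their valid ranges. The top of each
# underlying range is reserved for missing-value codes, so the maxima are
# smaller than the corresponding C integer limits.
STATA_INT_RANGES = [
    ("byte", -127, 100),
    ("int", -32_767, 32_740),
    ("long", -2_147_483_647, 2_147_483_620),
]


def stata_type_for_range(lo: int, hi: int) -> str:
    """Pick the smallest Stata type holding every value in [lo, hi].

    Hypothetical helper: anything that does not fit in a long falls
    back to double, as suggested in this thread.
    """
    for name, tmin, tmax in STATA_INT_RANGES:
        if tmin <= lo and hi <= tmax:
            return name
    return "double"


print(stata_type_for_range(0, 100))        # byte
print(stata_type_for_range(0, 255))        # int (uint8 overflows byte)
print(stata_type_for_range(0, 2**32 - 1))  # double (uint32 overflows long)
```

Choosing by value range rather than by declared type is only one possible design; it needs a pass over the data (or column statistics) before the Stata variable is created.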

@mcaceresb
Owner Author

Brownie points if it parses dates into a Stata date format.

@kylebarron
Contributor

> Brownie points if it parses dates into a Stata date format.

I think the dates are counted from January 1, 1970, whereas in Stata they're counted from January 1, 1960, so a recomputation might be needed...
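
The recomputation is a constant shift. A minimal sketch, assuming day counts map to Stata's %td (days since 01jan1960) and millisecond timestamps map to %tc (milliseconds since 01jan1960 00:00:00); the function names are made up for illustration.

```python
from datetime import date

UNIX_EPOCH = date(1970, 1, 1)
STATA_EPOCH = date(1960, 1, 1)

# 3653 days: ten years, including the leap years 1960, 1964 and 1968.
EPOCH_OFFSET_DAYS = (UNIX_EPOCH - STATA_EPOCH).days
MS_PER_DAY = 86_400_000
EPOCH_OFFSET_MS = EPOCH_OFFSET_DAYS * MS_PER_DAY


def unix_days_to_stata_td(days: int) -> int:
    """Shift days since 1970-01-01 to days since 1960-01-01."""
    return days + EPOCH_OFFSET_DAYS


def unix_ms_to_stata_tc(ms: int) -> int:
    """Shift milliseconds since the Unix epoch to Stata's 1960 epoch."""
    return ms + EPOCH_OFFSET_MS


print(EPOCH_OFFSET_DAYS)               # 3653
print(unix_ms_to_stata_tc(0))          # 315619200000
print(unix_days_to_stata_td(17533))    # 21186, i.e. 2018-01-02
```

Timestamps stored in other units (seconds, microseconds, nanoseconds) would first need scaling to milliseconds before the shift.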

@mcaceresb
Owner Author

It seems that Unix time uses 1970 as its epoch, but for whatever reason Stata uses 1960 (following SAS?)

@kylebarron
Contributor

I suppose so. I didn't know SAS also had 1960 as epoch.

@mcaceresb
Owner Author

It seems that Parquet has 8 primitive types; the types above are built on those primitives, so the plugin should already be able to read all of them.

What ought to happen is that we should keep the formats somehow...

@kylebarron
Contributor

> keep the formats somehow

What do you mean?

@mcaceresb
Owner Author

I mean that it would be ideal to keep the display format.

@kylebarron
Contributor

When writing or reading?

@mcaceresb
Owner Author

Both. Not sure if it's automagic when writing if the date type is declared, but for sure that is not the case when reading. Atm it's treated as long or double.

@kylebarron
Contributor

I don't know how you would keep the display format in either.

@mcaceresb
Owner Author

I think when reading it would be more relevant than when writing. For the latter if the target is a date type I suppose that's enough.

When reading, Stata could format the variable after the data is in memory. Maybe not super crucial the more I think about it, but possibly convenient.
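
One way this could work on read is a lookup from the column's logical type to a default Stata display format, applied after the data is loaded. The sketch below is an assumption, not existing plugin behavior: the type-name strings and the function `stata_format_for` are hypothetical, and only %td and %tc are real Stata formats.

```python
from typing import Optional

# Assumed mapping from a logical type name to a Stata display format.
# Values must already be shifted to Stata's 1960 epoch for these to
# display correctly.
DEFAULT_FORMATS = {
    "date32": "%td",     # daily dates
    "timestamp": "%tc",  # millisecond datetimes
}


def stata_format_for(logical_type: str) -> Optional[str]:
    """Return a display format to apply, or None to keep Stata's default."""
    return DEFAULT_FORMATS.get(logical_type)


print(stata_format_for("date32"))     # %td
print(stata_format_for("timestamp"))  # %tc
print(stata_format_for("int64"))      # None
```

Returning None for everything else leaves plain numeric columns with whatever format Stata assigns by default.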

@kylebarron
Contributor

Well, dates should be possible because there's a date type in Parquet... But you won't be able to retrieve a display format like %8.0g or whatever
