
Is there any plan to support MultiIndex DataFrames in Parquet I/O in the future? #223

Open
Roger-Liang opened this issue Feb 7, 2025 · 4 comments


@Roger-Liang

As the title describes.

@ehsantn
Collaborator

ehsantn commented Feb 7, 2025

@Roger-Liang Thanks for the feature request. This is definitely something we can look into. Could you share more about your use case and why this is important to you? Example code illustrating your workflow would also be appreciated.

@Roger-Liang
Author

Hi @ehsantn ,

Thanks for the quick response!

I frequently work with MultiIndex DataFrames where one of the levels represents dates, and I rely on the pyarrow engine to handle Parquet I/O with partitioning based on the Date column. Partitioning by date is essential for efficiently managing and querying large, time-series datasets.

Currently, my workflow requires resetting the MultiIndex to promote the Date level to a column so that I can partition the data during the write process. When reading the data back, I need to manually reconstruct the MultiIndex. This workaround not only adds extra code but also increases the risk of errors, especially as the complexity and size of the data grow.

Below is an example of my current approach:

import numpy as np
import pandas as pd

# Create a sample MultiIndex DataFrame with 'Date' and 'Sensor' as index levels
arrays = [
    ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02'],
    ['Sensor1', 'Sensor2', 'Sensor1', 'Sensor2']
]
index = pd.MultiIndex.from_arrays(arrays, names=('Date', 'Sensor'))
df = pd.DataFrame({'Value': np.random.randn(4)}, index=index)

# Reset the index so the levels become plain columns; writing with index=False
# avoids relying on index metadata surviving the partitioned write
df_reset = df.reset_index()

# Save the DataFrame using the pyarrow engine, partitioned by the 'Date' column
df_reset.to_parquet('multiindex_data', engine='pyarrow',
                    partition_cols=['Date'], index=False)

# Read the partitioned Parquet files back into a DataFrame
df_loaded = pd.read_parquet('multiindex_data', engine='pyarrow')

# Manually reconstruct the MultiIndex; the 'Date' partition column comes
# back as a categorical, so cast it to str to match the original values
df_loaded['Date'] = df_loaded['Date'].astype(str)
df_loaded = df_loaded.set_index(['Date', 'Sensor'])
print(df_loaded)

Native support for MultiIndex DataFrames in Parquet I/O would greatly simplify this process by preserving the full hierarchical index automatically—even when partitioning by Date. This enhancement would not only streamline my workflow but also improve data integrity and reduce the overhead of manual index management.

Looking forward to your thoughts on this!

@ehsantn
Collaborator

ehsantn commented Feb 10, 2025

Thank you @Roger-Liang for the detailed example! We will look into it and prioritize soon.

@Roger-Liang
Author

Thank you @ehsantn !!!!!
