
Is there any plan to support MultiIndex DataFrames in Parquet I/O in the future? #223

Open
Roger-Liang opened this issue Feb 7, 2025 · 4 comments


@Roger-Liang

As the title describes.

@ehsantn
Collaborator

ehsantn commented Feb 7, 2025

@Roger-Liang Thanks for the feature request. This is definitely something we can look into. Could you share more about your use case and why this is important to you? Example code illustrating your workflow would also be appreciated.

@Roger-Liang
Author

Hi @ehsantn ,

Thanks for the quick response!

I frequently work with MultiIndex DataFrames where one of the levels represents dates, and I rely on the pyarrow engine to handle Parquet I/O with partitioning based on the Date column. Partitioning by date is essential for efficiently managing and querying large, time-series datasets.

Currently, my workflow requires resetting the MultiIndex to promote the Date level to a column so that I can partition the data during the write process. When reading the data back, I need to manually reconstruct the MultiIndex. This workaround not only adds extra code but also increases the risk of errors, especially as the complexity and size of the data grow.

Below is an example of my current approach:

import numpy as np
import pandas as pd

# Create a sample MultiIndex DataFrame with 'Date' and 'Sensor' as index levels
arrays = [
    ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02'],
    ['Sensor1', 'Sensor2', 'Sensor1', 'Sensor2']
]
index = pd.MultiIndex.from_arrays(arrays, names=('Date', 'Sensor'))
df = pd.DataFrame({'Value': np.random.randn(4)}, index=index)

# Reset the index so the levels become plain columns; writing with index=False
# avoids relying on index metadata surviving the partitioned write
df_reset = df.reset_index()

# Save the DataFrame using the pyarrow engine, partitioned by the 'Date' column
df_reset.to_parquet('multiindex_data', engine='pyarrow',
                    partition_cols=['Date'], index=False)

# Read the partitioned Parquet files back into a DataFrame
df_loaded = pd.read_parquet('multiindex_data', engine='pyarrow')

# Manually reconstruct the MultiIndex; the 'Date' partition column comes
# back as a categorical, so cast it to str to match the original values
df_loaded['Date'] = df_loaded['Date'].astype(str)
df_loaded = df_loaded.set_index(['Date', 'Sensor'])
print(df_loaded)

Native support for MultiIndex DataFrames in Parquet I/O would greatly simplify this process by preserving the full hierarchical index automatically—even when partitioning by Date. This enhancement would not only streamline my workflow but also improve data integrity and reduce the overhead of manual index management.

Looking forward to your thoughts on this!

@ehsantn
Collaborator

ehsantn commented Feb 10, 2025

Thank you @Roger-Liang for the detailed example! We will look into it and prioritize soon.

@Roger-Liang
Author

Thank you @ehsantn !!!!!
