PDEP-15: Reject PDEP-10 #58623

Open
wants to merge 12 commits into
base: main
4 changes: 4 additions & 0 deletions web/pandas/pdeps/0010-required-pyarrow-dependency.md
@@ -8,6 +8,10 @@
[Patrick Hoefler](https://github.com/phofl)
- Revision: 1

# Note

This PDEP is superseded by PDEP-15.

## Abstract

This PDEP proposes that:
79 changes: 79 additions & 0 deletions web/pandas/pdeps/0015-do-not-require-pyarrow.md
@@ -0,0 +1,79 @@
# PDEP-15: Do not require PyArrow as a required dependency (for pandas 3.0)

- Created: 8 May 2024
- Status: Under discussion
- Discussion: [#58623](https://github.com/pandas-dev/pandas/pull/58623)
[#52711](https://github.com/pandas-dev/pandas/pull/52711)
[#52509](https://github.com/pandas-dev/pandas/issues/52509)
[#54466](https://github.com/pandas-dev/pandas/issues/54466)
- Author: [Thomas Li](https://github.com/lithomas1)
- Revision: 1

## Abstract

This PDEP supersedes PDEP-10, which stipulated that PyArrow should become a required dependency
for pandas 3.0. After reviewing the feedback posted
on the feedback issue [#54466](https://github.com/pandas-dev/pandas/issues/54466), we've
decided against moving forward with PDEP-10 for pandas 3.0.

The primary reasons for rejecting PDEP-10 are twofold:

1) Requiring pyarrow as a dependency can cause installation problems for a significant portion of users.

- Pyarrow does not fit or has a hard time fitting in space-constrained environments
Member

@WillAyd WillAyd Aug 27, 2024

I think this really exaggerates the problem on AWS. AWS has long distributed its own AWS SDK for pandas library (formerly called awswrangler) which uses pyarrow to better integrate with many of its services (ex: pyarrow is used for high performance data exports to/from AWS Redshift, rather than using a traditional ODBC driver)

The issue here is really just scoped to users that don't want to use the AWS Lambda Managed Layer, but instead want to build their environment from scratch, assumedly without the AWS SDK for pandas. Even then, it may not be a current issue given the drastic reductions in the binary size of pyarrow through both pip and conda

Member

I think it's only conda that's been drastically reduced - install pyarrow and pandas in a fresh venv and it already hits 295 MB. And the wheel size on PyPI is still ~40MB, so we'd be noticeably increasing the load on PyPI by making PyArrow required

My current stance is: if the desire was to make PyArrow dtypes the default for all dtypes, then ok, maybe that'd be justified. But probably not if it's just for the sake of strings, for which I think that

  1. recommending pip install pandas[pyarrow] in all instructions
  2. auto-inferring pyarrow strings if pyarrow is installed

should be enough

Member Author

Sounds good to me.

I'll add this to the PDEP.

Member

> And the wheel size on PyPI is still ~40MB, so we'd be noticeably increasing the load on PyPI by making PyArrow required

Out of curiosity - is this something that the PyPI maintainers have mentioned as a problem?

If so, the plot twist is then that we should really be pushing AWS users towards using the pre-provided image, rather than building their own from scratch. I believe that would forgo hitting PyPI altogether, and even if it doesn't, it is still smaller than what people are building themselves (see #54466 (comment))

Contributor

> If so, the plot twist is then that we should really be pushing AWS users towards using the pre-provided image, rather than building their own from scratch. I believe that would forgo hitting PyPi altogether, and even if it doesn't, it is still smaller than what people are building themselves (see #54466 (comment))

Wouldn't we then be asking our users to depend on the AWS people updating their pre-provided image whenever we create a new release of pandas or the Arrow team releases a new version of pyarrow?

Member

Yea I think that is generally how lambda works, even with the overall Python version. It's not quite an "anything goes" type of execution environment

Member

There's this post from 2021, in which they write about the monthly PyPI bill being almost 2 million US dollars

https://dustingram.com/articles/2021/04/14/powering-the-python-package-index-in-2021/

Member Author

> There's this post from 2021, in which they write about the monthly PyPI bill being almost 2 million US dollars
>
> https://dustingram.com/articles/2021/04/14/powering-the-python-package-index-in-2021/

I'm not too worried about this - PyPI's bill is subsidized anyway. Of course, it is nice to reduce our load on PyPI, but I don't think we are close to being the worst offenders here (those would probably be something like tensorflow and pytorch), and it's important to keep in mind that a lot of people have pyarrow installed for whatever reason anyway.

I would mostly only be concerned about an increase in size in our own pandas package (since PyPI does limit the total size of all packages uploaded by a project, and raising the limit is a manual and annoying process)

Member Author

I've made another go at clarifying point 1, incorporating the feedback here.

PTAL @WillAyd @MarcoGorelli

Member

Thanks for the feedback, but I still disagree with calling the AWS-provided layer a "workaround" - essentially it is the canonical approach to solve an issue that has existed for years with pandas and lambda functions.

A quick Google search for something like "how to run pandas on aws lambda" yields a slew of conflicting results on how to get this to work. If the AWS-provided layer is a workaround, then what are we calling the proper approach?

https://stackoverflow.com/questions/36054976/pandas-aws-lambda
https://stackoverflow.com/questions/53824556/how-to-install-numpy-and-pandas-for-aws-lambdas
https://medium.com/swlh/how-to-add-python-pandas-layer-to-aws-lambda-bab5ea7ced4f
https://medium.com/@johnnymao/how-to-use-pandas-in-your-aws-lambda-function-c3ce29f6f189
https://medium.com/@shimo164/lambda-layer-to-use-numpy-and-pandas-in-aws-lambda-function-8a0e040faa18
https://www.youtube.com/watch?v=1UDEp90S9h8

such as AWS Lambda, due to its large size of roughly 40 MB for a compiled wheel
(which is larger than pandas' own wheel sizes)
- This can also cause problems for downstream libraries that use pandas as a dependency:
while pandas + pyarrow can potentially fit in an AWS Lambda environment, the combination of
pandas, pyarrow, and the downstream library may not fit.
- While it may be possible to work around this issue by using the AWS Lambda Layer from
the [AWS SDK for pandas](https://aws-sdk-pandas.readthedocs.io/en/stable/install.html#aws-lambda-layer),
the primary benefit of pyarrow strings is not enough to force users to make a disruptive change.

- Installation of pyarrow is not possible on some platforms. We provide support for some
WillAyd marked this conversation as resolved.
less widely used platforms such as Alpine Linux, which pyarrow does not provide wheels for.
- While pyarrow has made great strides towards supporting most platforms that pandas is installable on
(e.g. the recent addition of pyodide support in pyarrow), we would still have to drop support for some
platforms like musllinux (the feature request is tracked [here](https://github.com/apache/arrow/issues/18036)) if pyarrow were to be required.

While installation issues are mentioned in the drawbacks section of PDEP-10, when that PDEP was
written we underestimated the impact this would have on users and on downstream developers.
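
To make the size concern above concrete, the following is a minimal sketch of how one might measure the installed footprint of a set of packages and compare it against a size budget. It uses only the standard library. The helper names (`installed_size_mb`, `footprint_mb`, `fits_budget`) are hypothetical, and the 250 MB default is an assumption based on the commonly cited unzipped deployment limit for AWS Lambda; it should be checked against current AWS documentation.

```python
from importlib import metadata
from pathlib import Path

# Assumption: 250 MB is the unzipped deployment size limit commonly cited for
# AWS Lambda. Verify against current AWS documentation before relying on it.
ASSUMED_BUDGET_MB = 250.0


def installed_size_mb(dist_name: str) -> float:
    """Return the on-disk size (in MB) of the files recorded for a distribution."""
    files = metadata.files(dist_name) or []  # None if no file record exists
    total_bytes = 0
    for entry in files:
        path = Path(entry.locate())
        if path.is_file():
            total_bytes += path.stat().st_size
    return total_bytes / 1_000_000


def footprint_mb(dist_names):
    """Map each installed distribution to its size, skipping missing ones."""
    sizes = {}
    for name in dist_names:
        try:
            sizes[name] = installed_size_mb(name)
        except metadata.PackageNotFoundError:
            continue  # not installed in this environment
    return sizes


def fits_budget(sizes, budget_mb=ASSUMED_BUDGET_MB):
    """Return True if the combined size stays within the budget."""
    return sum(sizes.values()) <= budget_mb


if __name__ == "__main__":
    sizes = footprint_mb(["pandas", "pyarrow", "numpy"])
    for name, size in sizes.items():
        print(f"{name}: {size:.1f} MB")
    print("fits within budget:", fits_budget(sizes))
```

Note that this only counts files recorded for the distributions passed in, so any further dependencies of a downstream library would need to be listed explicitly.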

2) Many of the benefits presented in PDEP-10 can be materialized for users that have pyarrow installed, without
forcing a pyarrow requirement on other users.
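
As a rough illustration of this idea, the sketch below shows one way a library could pick a pyarrow-backed code path only when pyarrow is importable, and fall back otherwise. It is not pandas' actual implementation; the function names and the `"pyarrow"` / `"python"` labels are hypothetical and used only for illustration.

```python
import importlib.util
from typing import Optional


def pyarrow_available() -> bool:
    """Return True if pyarrow can be found, without actually importing it."""
    return importlib.util.find_spec("pyarrow") is not None


def choose_string_storage(has_pyarrow: Optional[bool] = None) -> str:
    """Pick a storage label: pyarrow-backed if available, else a Python fallback.

    `has_pyarrow` can be passed explicitly (e.g. in tests); when omitted, the
    current environment is inspected.
    """
    if has_pyarrow is None:
        has_pyarrow = pyarrow_available()
    return "pyarrow" if has_pyarrow else "python"


if __name__ == "__main__":
    print("string storage in this environment:", choose_string_storage())
```

Users who want the pyarrow-backed path would still install pyarrow themselves; users who do not are left on the fallback without any installation change.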

In PDEP-10, there are three primary benefits listed:

- First class support for strings.

- PDEP-14 enables a new string dtype by default for pandas 3.0.
It will be backed by a pyarrow string array for users who have pyarrow installed,
and will use a Python object based fallback for users who don't.
This allows all users to experience the usability
benefits of a string dtype by default, and users with pyarrow to experience the performance
benefits of a pyarrow-backed string array.

- Support for dtypes not present in pandas.
- There are some types in pyarrow that don't have a corresponding pandas/numpy dtype, for example
the nested pyarrow types (e.g. lists and structs) and decimal types.
- Currently, users can already create arrays with these dtypes if they have pyarrow installed, but we cannot infer
arrays to those dtypes by default without forcing a pyarrow requirement on users,
as there is no Python/numpy equivalent for these dtypes.

- Interoperability
Member

@WillAyd WillAyd Sep 3, 2024

I think @MarcoGorelli's point was to remove the Interoperability section entirely, but if that's not true then I don't understand the point this is trying to make in its current form.

The beneficiary of the Arrow C Data interface is not just other dataframe libraries - a decent listing can be found here:

apache/arrow#39195 (comment)

For a direct benefit to pandas, it helps with I/O to boost performance, ensure proper data types, and reduce the amount of code burden. We have already seen this benefit with ADBC drivers, and from the link above, it looks like there is some near-term potential for it to help Excel I/O via fastexcel

Member Author

Sure, will take it out the next go.

Member

The fact that we don't require PyArrow might put us in a bind for downstream libraries that want interchange with pandas, but themselves probably aren't in a position to require PyArrow. In particular, this conversation is happening with seaborn:

mwaskom/seaborn#3782 (comment)

Somewhat unfortunately, this may mean that we are asked to put more maintenance work into the interchange protocol

- The Arrow C Data Interface would allow us to import/export pandas DataFrames to and from other libraries
that support Arrow in a zero-copy manner.

- While several libraries have adopted the Arrow C Data Interface (e.g. polars, xgboost, and duckdb), the main
beneficiaries of the Arrow C Data Interface are other dataframe libraries, as most downstream libraries tend to
already support using pandas dataframes as input.
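
For readers unfamiliar with how a consumer can accept Arrow data without depending on pyarrow, here is a minimal sketch of duck-typed detection. It assumes the dunder method names defined by the Arrow PyCapsule Interface (`__arrow_c_stream__`, `__arrow_c_array__`), which the text above does not name explicitly. `arrow_export_method` and `FakeArrowTable` are hypothetical names, and the stand-in class does not produce a real capsule.

```python
from typing import Any, Optional

# Assumption: these method names come from the Arrow PyCapsule Interface,
# which builds on the Arrow C Data Interface mentioned in the text.
_EXPORT_METHODS = ("__arrow_c_stream__", "__arrow_c_array__")


def arrow_export_method(obj: Any) -> Optional[str]:
    """Return the name of the first Arrow export method `obj` exposes, if any."""
    for name in _EXPORT_METHODS:
        if callable(getattr(obj, name, None)):
            return name
    return None


class FakeArrowTable:
    """Hypothetical stand-in for an Arrow-compatible object (illustration only)."""

    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError("illustration only; no real capsule is produced")


if __name__ == "__main__":
    print(arrow_export_method(FakeArrowTable()))  # an Arrow-style producer
    print(arrow_export_method([1, 2, 3]))         # a plain list is not one
```

The check itself needs no Arrow library at import time, which is why this style of protocol suits libraries that cannot require pyarrow.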

Although this PDEP recommends not adopting pyarrow as a required dependency in pandas 3.0, this does not mean that we are
abandoning pyarrow support and integration in pandas. Adopting support for pyarrow arrays
and data types in more of pandas will lead to greater interoperability with the
ecosystem and better performance for users. Furthermore, many of the drawbacks, such as the large installation size of
pyarrow and the lack of support for certain platforms, can be solved (as shown by the recent addition of pyarrow to the pyodide
distribution), allowing us to potentially revisit this decision in the future.

However, at this point in time, it is clear that we are not ready to require pyarrow
as a dependency in pandas.