-
Notifications
You must be signed in to change notification settings - Fork 794
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
wip-feat: pandas as soft dependency #3384
Conversation
…nd index attributes over type
Great to get the ball rolling on this, thank you @mattijn! I did not yet have time to review but just wanted to say that I'm happy to have a look at the types once I get to it. As long as the package works, I'm optimistic that we can make mypy happy. |
Thanks @binste! No rush! Maybe something for version 5.4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some first comments. I haven't had the chance yet to run mypy on this PR (reviewed it in the browser) but I have some ideas how to make it work which I want to try out depending on the errors it throws.
|
||
|
||
def import_pandas() -> ModuleType: | ||
min_version = "0.25" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment in pyproject.toml
? Next to the pandas requirement that if the pandas version is updated, it also needs to be changed here. Although I'm realizing now that that file needs to be changed anyway to make pandas optional
return curried.pipe(data, data_transformers.get()) | ||
elif isinstance(data, str): | ||
return {"url": data} | ||
elif _is_pandas_dataframe(data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is my understanding correct that this line is only reached if it's an old Pandas version which does not support the dataframe interchange protocol? Else it would already stop at line 43, right?
If yes, could you add a comment about this?
@@ -53,6 +52,11 @@ def __dataframe__( | |||
) -> DfiDataFrame: ... | |||
|
|||
|
|||
def _is_pandas_dataframe(obj: Any) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this function be a simple isinstance(obj, pd.DataFrame)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for start reviewing this PR @binste! I don't think I can do this without importing pandas first.
I tried setting up a function on which I can do some duck typing
def instance(obj):
return type(obj).__name__
But found out that both polars and pandas are using the instance type DataFrame
for their dataframe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm missing something but couldn't we call the pandas import function you created in here and if it raises an importerror, we know it's not a pandas dataframe anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's pragmatic, I admit. But that would be an unnecessary import of pandas if it is available in the environment, but if the data object is something else.
I wish we could sniff the type without importing modules first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the optional import logic I added to plotly.py a while back: https://github.com/plotly/plotly.py/blob/master/packages/python/plotly/_plotly_utils/optional_imports.py if should_load
is False
then it won't perform the import even if the library is installed. This was used with isinstance
checks, because if pandas
hasn't been loaded yet, you know the object you're dealing with isn't a pandas DataFrame, even if pandas is installed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A trick I learned from scikit-learn is to check if pandas
is in sys.modules
before doing the isinstance check, something like
if (pd := sys.modules.get('pandas')) is not None and isinstance(df, pd.DataFrame):
...
If pandas was never imported, then df
is definitely not pandas
(this is also what we do in Narwhals, were pandas/polars/etc are never explicitly imported)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just see your response here @MarcoGorelli! I also made this observation recently, see the comment I just added #3384 (comment)...
return pd | ||
except ImportError as err: | ||
raise ImportError( | ||
f"Serialization of the DataFrame requires\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f"Serialization of the DataFrame requires\n" | |
f"Serialization of this data requires\n" |
It can also be a dict
as in data.py: _data_to_csv_string
. Furthermore, if it's a dataframe, it's already given that Pandas is installed.
if TYPE_CHECKING: | ||
pass | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if TYPE_CHECKING: | |
pass |
Aware that it's just a wip PR, thought I'd just note it anyway :)
class _PandasTimestamp: | ||
def isoformat(self): | ||
return "dummy_isoformat" # Return a dummy ISO format string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should inherit from a Protocol as a pd.Timestamp is not an instance of _PandasTimestamp
. You'll then also need to add the @runtime_checkable
decorator from typing
. Also, we could directly test for a pandas timestamp in a similar function to is_pandas_dataframe
to keep these approaches consistent?
@@ -4,11 +4,11 @@ | |||
|
|||
import numpy as np | |||
import pandas as pd | |||
from pandas.api.types import infer_dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make the tests also run without pandas installed so that we can run the whole test suite once with pandas installed and once without. Prevents us from accidentally reintroducing a hard dependency again in the future
@mattijn just throwing in as a suggestion, have you considered
Even if you were not to go down this route; they collected a range of Issues/PRs in narwhals-dev/narwhals#62 from projects interested in the same topic as this PR - which could prove to be a great resource regardless. Side note: Related |
Thanks And if not, no worries, I'm still happy to see that you're going down this route, thanks for all your work here 🙌 |
@mattijn if you arrive here confused, I was the one who summoned @MarcoGorelli 😄 |
Interesting! Thanks for sharing! This is possibly both interesting for type inference of columns and serialization of the dataframe? See also related historical issues/PRs:
I think @jonmmease is also able to understand better if this is of interest for altair. What do you think? |
No problem
NoteWhile I was writing this up, @MarcoGorelli opened #3445 so I'm not covering the OriginalFrom my understanding so far, part of this would be solved with translate_dtype. Lines 600 to 651 in 62ab14d
For the
@MarcoGorelli was this restriction intentional? import narwhals
import pandas
pandas.DataFrame.convert_dtypes
pandas.Series.convert_dtypes
narwhals.maybe_convert_dtypes # seems to only apply for DataFrame These altair/tests/utils/test_core.py Lines 73 to 85 in 76a9ce1
Overall, these seem like minor, solvable issues to me |
as in, the restriction of
could you clarify please? when is the dtype not known? Perhaps we should have a separate thread to discuss this so as to not risk losing focus on this PR too much. I think Narwhals support is related but orthogonal, and that the simplest way to go about things might be:
|
Yeah that was what I meant.
Apologies, maybe that makes more sense expanding to the code prior to block I linked.
That would be fine with me, @mattijn what are your thoughts on this plan? |
Make sense to open a new issue with the suggestion of utilizing narwhals. Thanks! Regarding this issue, the recent work within vegafusion to make imports lazy, might also be of interested here. See vega/vegafusion#491. Especially the approach as such: pd = sys.modules.get("pandas", None)
pl = sys.modules.get("polars", None)
if pd is not None and isinstance(value, pd.DataFrame):
...
if pl is not None and isinstance(value, pl.DataFrame):
... |
Great to see all the activity on this topic and thanks to everyone chiming in! :) Regarding narwhals, not sure how it relates to https://github.com/data-apis/dataframe-api but as mentioned by others, best to continue this discussion separately and first strip Pandas out as a hard dependency. The approach of scikit-learn/vegafusion with |
Superseded by #3452 |
This PR is an attempt to make pandas a soft dependency. I hope it can be used as inspiration, as I was not able to make the types happy. I've no real idea how it should be done, but I've been trying a few things, some with success and others without.
I also made an attempt to prioritize the DataFrameLike approach over the pandas routine, but decided to not do this as otherwise usage of a pandas DataFrame within Altair will require pyarrow to infer/serialize. My current feeling is that usage of pandas to infer and serialize the data is still preferred as it is not yet depending on pyarrow.