ENH: Allow third-party packages to register IO engines #61642
@@ -90,6 +90,7 @@ Other enhancements
- Support passing a :class:`Iterable[Hashable]` input to :meth:`DataFrame.drop_duplicates` (:issue:`59237`)
- Support reading Stata 102-format (Stata 1) dta files (:issue:`58978`)
- Support reading Stata 110-format (Stata 7) dta files (:issue:`47176`)
- Third-party packages can now register engines that can be used in pandas I/O operations :func:`read_iceberg` and :meth:`DataFrame.to_iceberg` (:issue:`61584`)
Review comment: This sentence makes it seem that it only applies to Iceberg.

Reply: Good point. This PR creates the new system for third-party engines in a generic way, and the idea is to use it everywhere, but the PR only applies it to iceberg for now. The reason is to make reviewing easier, as adding the engine keyword to many connectors will make the PR significantly bigger. My idea is to add the whatsnew note for what's delivered in this PR, and in the follow-up PR update it to what you suggest.

Review comment: Actually, since …

Reply: I think it's a good practice to make PRs atomic, and not assume things about other PRs. If we were going to release just after this commit, things would be correct. As said, the follow-up PR will update the whatsnew.
.. ---------------------------------------------------------------------------
.. _whatsnew_300.notable_bug_fixes:
@@ -9,13 +9,15 @@
import codecs
from collections import defaultdict
from collections.abc import (
    Callable,
    Hashable,
    Mapping,
    Sequence,
)
import dataclasses
import functools
import gzip
from importlib.metadata import entry_points
from io import (
    BufferedIOBase,
    BytesIO,
@@ -90,6 +92,10 @@

from pandas import MultiIndex

# registry of I/O engines. It is populated the first time a non-core
# pandas engine is used
_io_engines: dict[str, Any] | None = None


@dataclasses.dataclass
class IOArgs:
@@ -1282,3 +1288,149 @@ def dedup_names(
        counts[col] = cur_count + 1

    return names


def _get_io_engine(name: str) -> Any:
    """
    Return an I/O engine by its name.

    pandas I/O engines can be registered via entry points. The first time this
    function is called it will register all the entry points of the
    "pandas.io_engine" group and cache them in the global `_io_engines` variable.

    Engines are implemented as classes with `read_<format>` and `to_<format>`
    methods (classmethods) for the formats they wish to provide. This function
    will return the method from the engine and format being requested.

    Parameters
    ----------
    name : str
        The engine name provided by the user in `engine=<value>`.

    Examples
    --------
    An engine is implemented with a class like:

    >>> class DummyEngine:
    ...     @classmethod
    ...     def read_csv(cls, filepath_or_buffer, **kwargs):
    ...         # the engine signature must match the pandas method signature
    ...         return pd.DataFrame()

    It must be registered as an entry point with the engine name:

    ```
    [project.entry-points."pandas.io_engine"]
    dummy = "pandas.io.dummy:DummyEngine"
    ```

    Then the `read_csv` method of the engine can be used with:

    >>> _get_io_engine("dummy").read_csv("myfile.csv")  # doctest: +SKIP

    This is used internally to dispatch the next pandas call to the engine caller:

    >>> df = read_csv("myfile.csv", engine="dummy")  # doctest: +SKIP
    """
    global _io_engines

    if _io_engines is None:
        _io_engines = {}
        for entry_point in entry_points().select(group="pandas.io_engine"):
            if entry_point.dist:
                package_name = entry_point.dist.metadata["Name"]
            else:
                package_name = None
            if entry_point.name in _io_engines:
                _io_engines[entry_point.name]._packages.append(package_name)
            else:
                _io_engines[entry_point.name] = entry_point.load()
                _io_engines[entry_point.name]._packages = [package_name]
Comment on lines +1342 to +1349

Review comment: I have to wonder if it is better to just get the entry points here but NOT load them until an engine is actually requested.

Reply: Sounds like a good idea, I didn't think about it before. I think it'll make the code slightly more complex, but not loading the code of unused connectors would be nice, in case a package takes a long time to run. I won't be updating this PR, as I don't think it's likely that it'll be merged, so not worth the effort. But I'd be happy to implement it in a follow-up.
    try:
        engine = _io_engines[name]
    except KeyError as err:
        raise ValueError(
            f"'{name}' is not a known engine. Some engines are only available "
            "after installing the package that provides them."
        ) from err

    if len(engine._packages) > 1:
        msg = (
            f"The engine '{name}' has been registered by the package "
            f"'{engine._packages[0]}' and will be used. "
        )
        if len(engine._packages) == 2:
            msg += (
                f"The package '{engine._packages[1]}' also tried to register "
                "the engine, but it couldn't because it was already registered."
            )
        else:
            msg += (
                f"The packages {str(engine._packages[1:])[1:-1]} also tried "
                "to register the engine, but they couldn't because it was "
                "already registered."
            )
        warnings.warn(msg, RuntimeWarning, stacklevel=find_stack_level())

    return engine
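The `str(packages[1:])[1:-1]` expression in the warning message is easy to misread: it renders the list of extra packages and then strips the surrounding brackets. A quick check of the idiom in isolation (the `packages` list here is made up for illustration):

```python
packages = ["pkg_a", "pkg_b", "pkg_c"]

# str(packages[1:]) -> "['pkg_b', 'pkg_c']"; slicing [1:-1] strips the brackets
extras = str(packages[1:])[1:-1]
msg = f"The packages {extras} also tried to register the engine"
```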
def allow_third_party_engines(
    skip_engines: list[str] | Callable | None = None,
) -> Callable:
    """
    Decorator to avoid boilerplate code when allowing readers and writers to
    use third-party engines.

    The decorator will introspect the function to know which format should be
    obtained, and to know if it's a reader or a writer. Then it will check if
    the engine has been registered, and if it has, it will dispatch the
    execution to the engine with the arguments provided by the user.

    Parameters
    ----------
    skip_engines : list of str, optional
        Engines that are implemented in pandas itself and should therefore be
        skipped by this engine dispatching system.

    Examples
    --------
    The decorator works both with the `skip_engines` parameter, or without:

    >>> class DataFrame:
    ...     @allow_third_party_engines(["python", "c", "pyarrow"])
    ...     def read_csv(filepath_or_buffer, **kwargs):
    ...         pass
    ...
    ...     @allow_third_party_engines
    ...     def read_sas(filepath_or_buffer, **kwargs):
    ...         pass
    """

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            if callable(skip_engines) or skip_engines is None:
                skip_engine = False
            else:
                # use .get() so a call without an explicit engine keyword
                # does not raise a KeyError
                skip_engine = kwargs.get("engine") in skip_engines

            if "engine" in kwargs and not skip_engine:
                engine_name = kwargs.pop("engine")
                engine = _get_io_engine(engine_name)
                try:
                    return getattr(engine, func.__name__)(*args, **kwargs)
                except AttributeError as err:
                    raise ValueError(
                        f"The engine '{engine_name}' does not provide a "
                        f"'{func.__name__}' function"
                    ) from err
            else:
                return func(*args, **kwargs)

        return wrapper

    if callable(skip_engines):
        return decorator(skip_engines)
    return decorator
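The dispatch mechanism above can be reduced to a small, self-contained sketch. The registry, decorator name, and `read_fancy` function below are hypothetical stand-ins (real engines would come from entry points, and the real decorator also handles `skip_engines`):

```python
import functools
from typing import Any, Callable

# stand-in for the entry-point-backed registry in pandas.io.common
_ENGINES: dict[str, Any] = {}


def dispatch_to_engine(func: Callable) -> Callable:
    """Route the call to a registered engine when engine=<name> is passed."""
    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        engine_name = kwargs.pop("engine", None)
        if engine_name is not None:
            try:
                engine = _ENGINES[engine_name]
            except KeyError as err:
                raise ValueError(f"'{engine_name}' is not a known engine") from err
            # dispatch to the engine method named after the pandas function
            return getattr(engine, func.__name__)(*args, **kwargs)
        return func(*args, **kwargs)
    return wrapper


class FancyEngine:
    @classmethod
    def read_fancy(cls, path, **kwargs):
        return f"FancyEngine read {path}"


_ENGINES["fancy"] = FancyEngine


@dispatch_to_engine
def read_fancy(path, engine=None):
    return f"pandas default read {path}"
```

Calling `read_fancy("t.bin")` uses the default implementation, while `read_fancy("t.bin", engine="fancy")` is routed to `FancyEngine.read_fancy` because the method shares the wrapped function's `__name__`.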
@@ -6,7 +6,10 @@

from pandas import DataFrame

from pandas.io.common import allow_third_party_engines


@allow_third_party_engines
def read_iceberg(
    table_identifier: str,
    catalog_name: str | None = None,

@@ -18,6 +21,7 @@ def read_iceberg(
    snapshot_id: int | None = None,
    limit: int | None = None,
    scan_properties: dict[str, Any] | None = None,
    engine: str | None = None,
) -> DataFrame:
    """
    Read an Apache Iceberg table into a pandas DataFrame.

@@ -52,6 +56,10 @@ def read_iceberg(
    scan_properties : dict of {str: obj}, optional
        Additional Table properties as a dictionary of string key value pairs to use
        for this scan.
    engine : str, optional
Review comment: Should the …

Reply: Very good point. In … I didn't want to add the engine to all connectors in this PR to keep it simpler, but I'm planning to follow up with another PR that adds it, and adds …

Review comment: If engine-specific kwargs are needed, isn't that a good reason to use …

Reply: This is a good point. Thinking about readers we don't care about, I think what you propose is the best choice. And this PR doesn't really prevent that from happening anyway. But for readers we cared enough to include in pandas, I think this new interface offers an advantage. For example, there was some discussion on whether we should move the fastparquet engine out of pandas; Patrick suggested it. I think this interface allows moving the fastparquet engine to the fastparquet package: users with fastparquet installed will still have it available in the same way as it is now, but we can forget about it. Of course discussions about moving readers out of pandas will have to happen later. But this interface seems quite useful and it's very simple, so in my opinion it's a good deal.
        The engine to use. Engines can be installed via third-party packages.
        For an updated list of existing pandas I/O engines, check the I/O
        engines section of our Ecosystem page.

    Returns
    -------
Review comment: I'm not 100% sure if this can happen, but what if the project isn't using pyproject.toml for some reason? Is there another way to do the configuration, or is using pyproject.toml required?

Reply: Entry points existed before pyproject.toml, and can also be added to setup.py. It makes no difference how the package defines them: pip or conda will add the entry point to the environment registry, and pandas will be able to find them regardless of how the project created them.

Review comment: The language here suggests that the only way to add the entry point is via pyproject.toml. If this is the recommended way, we can say that. Or if other ways are supported, we should show that too.

Reply: pyproject.toml is the way to do it; setup.py is how it was done in the past. I'm sure people reading this will be able to figure out how this was done in the past if their code is still using setup.py.