-
Notifications
You must be signed in to change notification settings - Fork 95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Possible nbstripout-fast integration #179
Comments
A few projects have optional dependencies on accelerator libraries. For example GeoPandas will use pygeos if available, otherwise fallback to alternative implementations (with configuration overrides). I think this serves as a good model for how this speedup can be made available to users who need it, without fracturing the ecosystem. # _compat.py
INSTALL_PYGEOS_ERROR = "To use PyGEOS within GeoPandas, you need to install PyGEOS: \
'conda install pygeos' or 'pip install pygeos'"
try:
import pygeos # noqa
# only automatically use pygeos if version is high enough
if Version(pygeos.__version__) >= Version("0.8"):
HAS_PYGEOS = True
PYGEOS_GE_09 = Version(pygeos.__version__) >= Version("0.9")
PYGEOS_GE_010 = Version(pygeos.__version__) >= Version("0.10")
else:
warnings.warn(
"The installed version of PyGEOS is too old ({0} installed, 0.8 required),"
" and thus GeoPandas will not use PyGEOS.".format(pygeos.__version__),
UserWarning,
)
HAS_PYGEOS = False
except ImportError:
HAS_PYGEOS = False
def set_use_pygeos(val=None):
"""
Set the global configuration on whether to use PyGEOS or not.
The default is use PyGEOS if it is installed. This can be overridden
with an environment variable USE_PYGEOS (this is only checked at
first import, cannot be changed during interactive session).
Alternatively, pass a value here to force a True/False value.
"""
global USE_PYGEOS
global USE_SHAPELY_20
global PYGEOS_SHAPELY_COMPAT
env_use_pygeos = os.getenv("USE_PYGEOS", None)
if val is not None:
USE_PYGEOS = bool(val)
else:
if USE_PYGEOS is None:
USE_PYGEOS = HAS_PYGEOS
if env_use_pygeos is not None:
USE_PYGEOS = bool(int(env_use_pygeos))
# validate the pygeos version
if USE_PYGEOS:
try:
import pygeos # noqa
# validate the pygeos version
if not Version(pygeos.__version__) >= Version("0.8"):
if SHAPELY_GE_20:
USE_PYGEOS = False
warnings.warn(
"The PyGEOS version is too old, and Shapely >= 2 is installed, "
"thus using Shapely by default and not PyGEOS."
)
else:
raise ImportError(
"PyGEOS >= 0.8 is required, version {0} is installed".format(
pygeos.__version__
)
)
# Check whether Shapely and PyGEOS use the same GEOS version.
# Based on PyGEOS from_shapely implementation.
from shapely.geos import geos_version_string as shapely_geos_version
from pygeos import geos_capi_version_string
# shapely has something like: "3.6.2-CAPI-1.10.2 4d2925d6"
# pygeos has something like: "3.6.2-CAPI-1.10.2"
if not shapely_geos_version.startswith(geos_capi_version_string):
warnings.warn(
"The Shapely GEOS version ({}) is incompatible with the GEOS "
"version PyGEOS was compiled with ({}). Conversions between both "
"will be slow.".format(
shapely_geos_version, geos_capi_version_string
)
)
PYGEOS_SHAPELY_COMPAT = False
else:
PYGEOS_SHAPELY_COMPAT = True
except ImportError:
raise ImportError(INSTALL_PYGEOS_ERROR)
if USE_PYGEOS and env_use_pygeos is None and SHAPELY_GE_20:
warnings.warn(
"Shapely 2.0 is installed, but because PyGEOS is also installed, "
"GeoPandas will still use PyGEOS by default for now. To force to use and "
"test Shapely 2.0, you have to set the environment variable USE_PYGEOS=0. "
"You can do this before starting the Python process, or in your code "
"before importing geopandas:"
"\n\nimport os\nos.environ['USE_PYGEOS'] = '0'\nimport geopandas\n\n"
"In a future release, GeoPandas will switch to using Shapely by default. "
"If you are using PyGEOS directly (calling PyGEOS functions on geometries "
"from GeoPandas), this will then stop working and you are encouraged to "
"migrate from PyGEOS to Shapely 2.0 "
"(https://shapely.readthedocs.io/en/latest/migration_pygeos.html).",
stacklevel=6,
)
USE_SHAPELY_20 = (not USE_PYGEOS) and SHAPELY_GE_20
set_use_pygeos() I think that the (having not done any profiling of my own....) core slowdowns are likely happening in the file read/write, and in the strip_output() method (along with the equivalent zeppelin method. |
My testing showed even starting python was too slow if it needed to be done 100s of times (once per file). That being said, there is also a git protocol to start the binary once and reuse that proces; I could see this python overhead time being negligible then*. Doing this easily depends on maturin support for shipping both python and bin (PyO3/maturin#368), calling nbstripout-fast from python is possible. Today it's setup to use python bindings during unit testing and creating a bin with wheels for release. *This may speedup nbstripout without nbstripout-fast |
I think to automatically select I'd prefer to go about this the other way around: how about we refactor the installation commands to be modular enough so |
That seems like a good starting point given where the projects are at now |
If you want to contribute to this refactoring let me know. Might be able to find some time over the holiday break otherwise. |
Happy to review if you do find some time to work on this! |
Coming from #33 (comment).
@kynan
Rust is not ideal for everyone because 1) we can't have wheels for every OS and 2) it's harder to tinker with. I was thinking something along the lines of if this project added nbstripout-fast as an optional dependency that can gracefully fail then IFF nbstripout-fast installed and the options you want to use overlap (most do), it calls nbstripout-fast, if not it continues to call the python code (of course we can add a flag to force one or the other).
Mostly I just wanted to put the thought out there and see if there was any traction. I totally understand if this does not seem like something you want to do for this project as it's not a super clean design. Open to other ideas as well.
The text was updated successfully, but these errors were encountered: