Unable to Install Ludwig Package in Dataproc Cluster Jupyter Notebook #4030

Open
ayyappagundu opened this issue Oct 21, 2024 · 3 comments
Labels
bug Something isn't working dependency

Comments

@ayyappagundu

Description
I'm encountering an error when installing ludwig[distributed] in a Jupyter Notebook environment running on a Dataproc cluster. The installation seems to proceed normally until it attempts to install scikit-learn, at which point the process fails.

Steps to Reproduce
Launch a Dataproc cluster with a Jupyter Notebook environment.
Open a new Jupyter Notebook within the cluster.
Execute the command:
!pip install ludwig[distributed]

Error
Collecting scikit-learn (from ludwig[distributed])
Using cached https://nexus.onedev.neustar.biz/repository/ds-pypi-group/packages/scikit-learn/1.5.2/scikit_learn-1.5.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.3 MB)
Using cached https://nexus.onedev.neustar.biz/repository/ds-pypi-group/packages/scikit-learn/1.5.1/scikit_learn-1.5.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.3 MB)
Using cached https://nexus.onedev.neustar.biz/repository/ds-pypi-group/packages/scikit-learn/1.2.0/scikit_learn-1.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.5 MB)
Using cached https://nexus.onedev.neustar.biz/repository/ds-pypi-group/packages/scikit-learn/1.1.3/scikit_learn-1.1.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (32.0 MB)
Using cached https://nexus.onedev.xxxx.biz/repository/ds-pypi-group/packages/scikit-learn/1.1.2/scikit-learn-1.1.2.tar.gz (7.0 MB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... error
error: subprocess-exited-with-error

× Preparing metadata (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [2269 lines of output]
Partial import of sklearn during the build process.
setup.py:128: DeprecationWarning:

    `numpy.distutils` is deprecated since NumPy 1.23.0, as a result
    of the deprecation of `distutils` itself. It will be removed for
    Python >= 3.12. For older Python versions it will remain present.
    It is recommended to use `setuptools < 60.0` for those Python versions.
    For more details, see:
      https://numpy.org/devdocs/reference/distutils_status_migration.html
  1. Declare '_subtract_histograms' as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
    2. Use an 'int' return type on '_subtract_histograms' to allow an error code to be returned.

    Error compiling Cython file:

    ...
    if n_used_bins <= 1:
    free(cat_infos)
    return

           qsort(cat_infos, n_used_bins, sizeof(categorical_info),
                 compare_cat_infos)
                 ^
    

    sklearn/ensemble/_hist_gradient_boosting/splitting.pyx:920:14: Cannot assign type 'int (const void *, const void *) except? -1 nogil' to 'int (*)(const void *, const void *) noexcept nogil'. Exception values are incompatible. Suggest adding 'noexcept' to the type of 'compare_cat_infos'.
    Traceback (most recent call last):
    File "/tmp/pip-build-env-357_itq6/overlay/lib/python3.11/site-packages/Cython/Build/Dependencies.py", line 1345, in cythonize_one_helper

     File "/opt/conda/miniconda3/lib/python3.11/multiprocessing/pool.py", line 774, in get
       raise self._value
    

    Cython.Compiler.Errors.CompileError: sklearn/ensemble/_hist_gradient_boosting/splitting.pyx
    [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.
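
Reading the log above, pip appears to step back through cached scikit-learn wheels (1.5.2 down to 1.1.3) and then select a source archive for 1.1.2; the Cython failure occurs while that archive is being built. A minimal sketch for checking which candidates in a pip log are prebuilt wheels and which are source archives is shown here. The function name `classify_candidates` and the host `pypi.example.invalid` are hypothetical placeholders, not part of pip or Ludwig, and the parsing assumes the `Using cached <url> (<size>)` line shape seen in this report.

```python
import re

# Captures the artifact filename at the end of a "Using cached <url> (<size>)" line.
CACHED_RE = re.compile(r"Using cached \S*/(?P<file>[^/\s]+)\s+\(")


def classify_candidates(log_text):
    """Return (version, kind) pairs, where kind is 'wheel' or 'sdist'."""
    results = []
    for line in log_text.splitlines():
        match = CACHED_RE.search(line)
        if not match:
            continue
        filename = match.group("file")
        if filename.endswith(".whl"):
            # Wheel filenames: name-version-pytag-abitag-platform.whl
            results.append((filename.split("-")[1], "wheel"))
        elif filename.endswith((".tar.gz", ".zip")):
            # Source archive filenames: name-version.tar.gz (or .zip)
            stem = re.sub(r"\.(tar\.gz|zip)$", "", filename)
            results.append((stem.rsplit("-", 1)[1], "sdist"))
    return results


# Two lines shaped like the ones in the report, with a placeholder host.
sample = """\
Using cached https://pypi.example.invalid/packages/scikit-learn/1.1.3/scikit_learn-1.1.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (32.0 MB)
Using cached https://pypi.example.invalid/packages/scikit-learn/1.1.2/scikit-learn-1.1.2.tar.gz (7.0 MB)
"""

for version, kind in classify_candidates(sample):
    print(version, kind)
```

Any entry reported as `sdist` is one that pip would have to compile locally, which is the step that fails in this report.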

Environment
Dataproc Cluster: image-version 2.2-debian12
Jupyter Notebook (testing the package in a Jupyter notebook on the Dataproc cluster)
Python Version: 3.11.8
Commands tried: !pip install ludwig[distributed] and !pip install ludwig

Package Repository: nexus.onedev.xxxx.biz, a private repository configured for installing packages (xxxx masks my organization's name for security)

Additional context
I also tried other environments, Python 3.10.8 and Python 3.8.15, by downgrading the image version of the Dataproc cluster (GCP).

@mhabedank
Collaborator

Hi @ayyappagundu, we seem to have some major problems with our dependencies. I am trying to get a handle on them, and I hope we can fix this in the near future. Thanks for the ticket.

@mhabedank mhabedank removed their assignment Oct 21, 2024
@mhabedank mhabedank added the bug Something isn't working label Oct 21, 2024
@ayyappagundu
Author

Hi! Any update on this?

@mhabedank
Collaborator

Hi @ayyappagundu, we are moving from requirements.txt to pyproject.toml with poetry and hatch. We also need to pin a lot of dependencies. That has actually cost us some time, but we are making good progress.

In addition, we will probably have to overhaul our distribution backend as we are using outdated versions of ray. Is it important that you distribute your task or can you also run it on a compute node?
