Unable to Install Ludwig Package in Dataproc Cluster Jupyter Notebook #4030

ayyappagundu · 2024-10-21T15:56:51Z

Description
I'm encountering an error when installing ludwig[distributed] in a Jupyter Notebook environment running on a Dataproc cluster. The installation seems to proceed normally until it attempts to install scikit-learn, at which point the process fails.

**Steps to Reproduce**
Launch a Dataproc cluster with a Jupyter Notebook environment.
Open a new Jupyter Notebook within the cluster.
Execute the command:
!pip install ludwig[distributed]

Error
Collecting scikit-learn (from ludwig[distributed])
Using cached https://nexus.onedev.neustar.biz/repository/ds-pypi-group/packages/scikit-learn/1.5.2/scikit_learn-1.5.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.3 MB)
Using cached https://nexus.onedev.neustar.biz/repository/ds-pypi-group/packages/scikit-learn/1.5.1/scikit_learn-1.5.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.3 MB)
Using cached https://nexus.onedev.neustar.biz/repository/ds-pypi-group/packages/scikit-learn/1.2.0/scikit_learn-1.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.5 MB)
Using cached https://nexus.onedev.neustar.biz/repository/ds-pypi-group/packages/scikit-learn/1.1.3/scikit_learn-1.1.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (32.0 MB)
Using cached https://nexus.onedev.xxxx.biz/repository/ds-pypi-group/packages/scikit-learn/1.1.2/scikit-learn-1.1.2.tar.gz (7.0 MB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... error
error: subprocess-exited-with-error

× Preparing metadata (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [2269 lines of output]
Partial import of sklearn during the build process.
setup.py:128: DeprecationWarning:

    `numpy.distutils` is deprecated since NumPy 1.23.0, as a result
    of the deprecation of `distutils` itself. It will be removed for
    Python >= 3.12. For older Python versions it will remain present.
    It is recommended to use `setuptools < 60.0` for those Python versions.
    For more details, see:
      https://numpy.org/devdocs/reference/distutils_status_migration.html

1. Declare '_subtract_histograms' as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
2. Use an 'int' return type on '_subtract_histograms' to allow an error code to be returned.

Error compiling Cython file:

...
if n_used_bins <= 1:
free(cat_infos)
return

       qsort(cat_infos, n_used_bins, sizeof(categorical_info),
             compare_cat_infos)
             ^

sklearn/ensemble/_hist_gradient_boosting/splitting.pyx:920:14: Cannot assign type 'int (const void *, const void *) except? -1 nogil' to 'int (*)(const void *, const void *) noexcept nogil'. Exception values are incompatible. Suggest adding 'noexcept' to the type of 'compare_cat_infos'.
Traceback (most recent call last):
File "/tmp/pip-build-env-357_itq6/overlay/lib/python3.11/site-packages/Cython/Build/Dependencies.py", line 1345, in cythonize_one_helper
File "/opt/conda/miniconda3/lib/python3.11/multiprocessing/pool.py", line 774, in get
    raise self._value
Cython.Compiler.Errors.CompileError: sklearn/ensemble/_hist_gradient_boosting/splitting.pyx
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.
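
The log above shows pip stepping back through several scikit-learn releases before it reaches a source archive, and the failure only starts once that archive is built. A minimal sketch for pulling that sequence out of a saved pip log follows; the function names and the regular expression are illustrative assumptions, not part of pip or Ludwig.

```python
import re

# Matches wheel names (scikit_learn-1.5.2-cp311-...whl) and sdist names
# (scikit-learn-1.1.2.tar.gz) as they appear in pip's "Using cached" lines.
CANDIDATE = re.compile(r"scikit[_-]learn-(\d+(?:\.\d+)+)\S*?\.(whl|tar\.gz)")
KINDS = {"whl": "wheel", "tar.gz": "sdist"}


def candidates(log_text):
    """Return (version, kind) pairs in the order pip tried them."""
    return [(m.group(1), KINDS[m.group(2)]) for m in CANDIDATE.finditer(log_text)]


def first_sdist(log_text):
    """Return the first version pip fetched as a source archive, or None."""
    for version, kind in candidates(log_text):
        if kind == "sdist":
            return version
    return None
```

Applied to the log in this report, `first_sdist` would be expected to return `1.1.2`, the release whose `splitting.pyx` fails to compile in the isolated build environment.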

Environment
Dataproc Cluster: image-version 2.2-debian12
Jupyter Notebook (Testing the package on jupyter notebook using dataproc cluster)
Python Version: 3.11.8
Commands tried: !pip install ludwig[distributed], and also !pip install ludwig

Package Repository: nexus.onedev.xxxx.biz, a private repository configured for installing packages (xxxx masks my organization name for security)

Additional context
I also tried other environments, Python 3.10.8 and Python 3.8.15, by downgrading the image version of the Dataproc cluster (GCP).
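
When comparing several environments like this, it can help to record the same set of versions from each notebook kernel. The sketch below uses only the standard library; the default package list is an assumption about what is relevant here, and `installed_versions` is a hypothetical helper name.

```python
import platform
import sys
from importlib import metadata


def installed_versions(packages=("pip", "setuptools", "Cython", "numpy",
                                 "scikit-learn", "ludwig")):
    """Collect the interpreter version and the installed version of each package."""
    report = {"python": platform.python_version(), "executable": sys.executable}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence explicitly so reports stay comparable.
            report[name] = "not installed"
    return report


# In a notebook cell:
# for key, value in installed_versions().items():
#     print(f"{key}: {value}")
```

Note that this reports the kernel's own environment. The Cython that failed above lives in pip's temporary build environment (`/tmp/pip-build-env-...`), so it will not necessarily match the Cython version listed here.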


mhabedank · 2024-10-21T18:08:44Z

Hi @ayyappagundu, we seem to have some major problems with our dependencies. I am trying to get a handle on them, and I hope we can fix this in the near future. Thanks for the ticket.

ayyappagundu · 2024-11-08T05:24:36Z

Hi! Any update on this?

mhabedank · 2024-11-08T08:15:44Z

Hi @ayyappagundu, we are moving from requirements.txt to pyproject.toml with poetry and hatch. We also need to pin down a lot of dependencies. That has actually cost us some time, but we are well on our way.

In addition, we will probably have to overhaul our distribution backend as we are using outdated versions of ray. Is it important that you distribute your task or can you also run it on a compute node?
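
Until the pins are reworked, one possible stopgap is to tell pip not to build scikit-learn from source, using pip's `--only-binary` option. This is a sketch under assumptions, not a confirmed fix: the log does not show why pip passed over the newer wheels, and the helper name `pip_install_command` is hypothetical. If no compatible wheel satisfies the constraints, pip should report a resolution error instead of starting the Cython build.

```python
import sys


def pip_install_command(requirement, binary_only=("scikit-learn",)):
    """Build a pip command that refuses source builds for the named packages."""
    cmd = [sys.executable, "-m", "pip", "install"]
    for name in binary_only:
        # --only-binary may be given several times; each use adds a package.
        cmd += ["--only-binary", name]
    cmd.append(requirement)
    return cmd


# To run it from a notebook cell (this performs a real install):
# import subprocess
# subprocess.run(pip_install_command("ludwig[distributed]"), check=True)
```

Passing the requirement as a separate list element also avoids shell quoting problems with the square brackets in `ludwig[distributed]`.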

mhabedank added ray dependency labels Oct 21, 2024

mhabedank self-assigned this Oct 21, 2024

mhabedank removed the ray label Oct 21, 2024

mhabedank removed their assignment Oct 21, 2024

mhabedank added the bug label Oct 21, 2024
