Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MNT] Add joblib backend option and set default to all parallelized estimators #1797

Open
baraline opened this issue Jul 14, 2024 · 9 comments
Labels
API design API design & software architecture good first issue Good for newcomers maintenance Continuous integration, unit testing & package distribution

Comments

@baraline
Copy link
Member

baraline commented Jul 14, 2024

Describe the issue

Some of the estimators that use joblib for parallelization use process-based backend, while other use threads-based backend. Ideally, we want this to be a parameter tunable by the users.

Suggest a potential alternative/fix

Including a joblib_backend parameter which would default to threading (from discussions with Matthew), and use this parameter to set the joblib backend parameter during Parallel calls would fix the issue.

@baraline baraline added good first issue Good for newcomers API design API design & software architecture maintenance Continuous integration, unit testing & package distribution labels Jul 14, 2024
@TonyBagnall
Copy link
Contributor

I remember we looked at this in some detail when debugging some rocket features, was there some interaction with its use of prange? I cant remember!

def _static_transform_uni(X, parameters, indices):
    """Transform a 2D collection of univariate time series.

    Implemented separately to the multivariate version for numba efficiency reasons.
    See issue #1778.
    """
    n_cases, n_timepoints = X.shape
    (
        _,
        _,
        dilations,
        n_features_per_dilation,
        biases,
    ) = parameters
    n_kernels = len(indices)
    n_dilations = len(dilations)
    f = n_kernels * np.sum(n_features_per_dilation)
    features = np.zeros((n_cases, f), dtype=np.float32)
    for i in prange(n_cases):

@CodeLionX
Copy link
Member

I think adding this parameter to so many APIs would clutter the parameter lists. What are your thoughts on using joblib.parallel_config?
This means that all aeon code does not specify any parallel backend in their calls to joblib.Parallel (if it does not require a certain backend) and the user can choose its backend by wrapping calls within:

with joblib.parallel_config(backend="loky"):
   # aeon code

The same is actually possible for n_jobs etc.

@baraline
Copy link
Member Author

True, forgot about this option ! That or we set a global variable such as AEON_JOBLIB_BACKEND and document its usage

@CodeLionX
Copy link
Member

A custom env var AEON_JOBLIB_BACKEND IMO just makes sense if it is also used for other parallel stuff besides joblib. Otherwise, we can use the existing (documented and potentially already known) joblib-facilities.

@MatthewMiddlehurst
Copy link
Member

MatthewMiddlehurst commented Jul 17, 2024

Sounds like a good idea, would have to be documented though. Not sure about removing n_jobs (not that it was really suggested), but would be good to tidy up the other bits.

@baraline
Copy link
Member Author

I'm OK with the use of the parallel_config option and let user warp estimators if they want to change from the default. But if we want to default to threads, we need to do the following, which look like parallel_config doesn't overwrite :

with parallel_config(backend='loky'):
    p = Parallel(backend='threading')
    print(p._backend)

You obtain <joblib._parallel_backends.ThreadingBackend object at 0x000001D91BB731F0>

This means that if we want the parallel_config option to work, we would need to use default Parallel(), which use the loky backend. This is exactly the one we want to move away from for multiple reasons (check https://scikit-learn.fondation-inria.fr/joblib-sprint/ for some).

Or did I miss something ?

@CodeLionX
Copy link
Member

ah, damn ... yes, Parallel() uses the loky-backend by default. Then, I don't see how parallel_config would help us here.

@baraline
Copy link
Member Author

So for options we have :

  • Add a parameter to specify backend, which defaults to threads for all estimators with parallel capabilities
  • Add an environment variable to specify backend package wide.

With an equal level of documentation, I think the first options would allow us to be more flexible if for whatever reason we need to use processes for some estimators and threads for others. Also it should be less prone to causing issues with existing methods.

But true, this adds a bit of parameter bloat to the API... We could introduce a dictionary regrouping all joblib params and use it as kwargs when calling parallel ?

Other ideas are welcome !

@CodeLionX
Copy link
Member

I'm also in favor of 1. How many joblib parameters do we have? If there are just 2, I would not use kwargs for that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API design API design & software architecture good first issue Good for newcomers maintenance Continuous integration, unit testing & package distribution
Projects
None yet
Development

No branches or pull requests

4 participants