
feat(wip): cupy histograms #1095

Draft · lgray wants to merge 11 commits into master
Conversation

@lgray (Collaborator) commented May 15, 2024

work in progress on using cupy in the old coffea-hist package as a demonstrator

@martindurant

(cf. https://docs.cupy.dev/en/stable/reference/generated/cupy.histogramdd.html, which is the standard regular-array API conforming to the numpy version - no boost or anything)
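For illustration, a minimal sketch of that numpy-conforming API (sizes and ranges here are arbitrary):

import cupy as cp

# data and result both stay in GPU memory; no boost involved
sample = cp.random.normal(size=(1_000_000, 2))
counts, edges = cp.histogramdd(sample, bins=(50, 50), range=((-5, 5), (-5, 5)))
print(type(counts))  # cupy.ndarray -- nothing moves to the host until .get()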

@lgray (Collaborator, Author) commented Jun 3, 2024

@martindurant - cool, it appears to use the same technique (fill by atomics), so it'll be subject to the same scaling limitations I'm seeing. However, the number of calls to fill is a bit leaner, so maybe it's worth backing an implementation with it. I'll have to try some benchmarks.
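For illustration, a rough sketch of how such a fill benchmark might look with cupy's built-in profiler (array size, binning, and repeat count are arbitrary):

import cupy as cp
from cupyx.profiler import benchmark

data = cp.random.normal(size=100_000_000)
edges = cp.linspace(-5.0, 5.0, 101)

# times both GPU and CPU sides, synchronizing the device around each run
print(benchmark(cp.histogram, (data, edges), n_repeat=10))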

Otherwise - there's significant functionality missing from the cupy hists that we'll still have to add on top, if it turns out to run faster in the first place.

@martindurant

> Otherwise - there's significant functionality missing from the cupy hists that we'll still have to add on top, if it turns out to run faster in the first place.

Yes, I expect this is the case. dask-histogram also uses boost and adds a numpy-compatible API on top for those that expect it; and of course, it's designed to work with awkward. I expect there's some sharing and refactoring that can eventually be done.

@lgray (Collaborator, Author) commented Jun 3, 2024

Yeah - I think it is all possible. Right now it's really about getting all the pipe-fittings in place. I'll let you know if there's any clear win in the benchmarks.

@lgray (Collaborator, Author) commented Jun 3, 2024

@Saransh-cpp: @jpivarski and I talked last Friday, and it came up that you might be interested in taking this "pilot project" and turning it into a full-blown UHI-compatible histogramming interface (a la scikit-hep/hist), but for cupy/cuda histograms.

What's in this PR has the necessary functionality for HEP and we can convert to scikit-hep/hist, but it would be nice to have a cohesive ecosystem and only convert to CPU memory-space at the last moment. This would grant us more or less infinite scaling.

We also have some ideas about warpwise-distributed histograms, where a warp (or collection of warps) would tend to a sub-range of bins so that filling can be done more in parallel. This old implementation description demonstrates that if you stick to a warp (i.e. 32 bins) and replicate histograms to fill in parallel, you can reach 10 GB/s filling rates, because there's no use of atomics (see the sketch below).
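For illustration, a hedged sketch of that bin-per-lane idea as a cupy.RawKernel (names and launch parameters are illustrative, not the PR's implementation): each of the 32 lanes in a warp owns one bin and keeps a private count, so the hot loop needs no atomics; a single atomicAdd per lane merges the warps at the end.

import cupy as cp

_src = r"""
extern "C" __global__
void hist32(const float* x, const long long n,
            const float lo, const float hi,
            unsigned long long* out)
{
    const int lane = threadIdx.x & 31;  // this lane's dedicated bin
    const long long warp   = ((long long)blockIdx.x * blockDim.x + threadIdx.x) >> 5;
    const long long nwarps = ((long long)gridDim.x * blockDim.x) >> 5;
    const float inv_w = 32.0f / (hi - lo);

    unsigned long long count = 0;  // private tally: no atomics in the hot loop
    for (long long i = warp; i < n; i += nwarps) {
        const float v = x[i];                // all 32 lanes read (broadcast) the same element
        if (!(v >= lo && v < hi)) continue;  // drops out-of-range values and NaN
        const int b = min(31, (int)((v - lo) * inv_w));
        if (b == lane) ++count;              // only the owning lane counts it
    }
    atomicAdd(&out[lane], count);  // one atomic per lane per launch, not per entry
}
"""
_hist32 = cp.RawKernel(_src, "hist32")

def fill32(values, lo=-5.0, hi=5.0):
    # fill a 32-bin histogram from a cupy array (block size must be a multiple of 32)
    x = cp.ascontiguousarray(values, dtype=cp.float32)
    out = cp.zeros(32, dtype=cp.uint64)
    _hist32((1024,), (256,), (x, cp.int64(x.size), cp.float32(lo), cp.float32(hi), out))
    return out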

This also has interesting parallels to cluster-distributed histograms, where a (relatively enormous) histogram could be distributed across a whole dask cluster and scale to 100s of GBs in size or more (sketched below). This would effectively remove scaling limitations for histograms for the foreseeable future and is probably important for achieving precision HL-LHC analyses.
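For illustration, a hedged sketch of that bin-sharding idea with dask.delayed (hypothetical helpers; no such API exists yet): each task histograms every data chunk into only its own slice of the bin axis, so the full histogram never has to fit on one worker.

import dask
import numpy as np
import dask.array as da

def fill_shard(chunk, shard_edges):
    # histogram one data chunk into this shard's slice of the global axis;
    # values outside the slice are simply not counted here
    counts, _ = np.histogram(chunk, bins=shard_edges)
    return counts

edges = np.linspace(-5, 5, 1_000_001)  # a deliberately enormous 1M-bin axis
data = da.random.normal(size=100_000_000, chunks=10_000_000)

shards = [
    dask.delayed(sum)([dask.delayed(fill_shard)(blk, edges[lo:lo + 250_001])
                       for blk in data.to_delayed()])
    for lo in range(0, 1_000_000, 250_000)  # four 250k-bin shards, e.g. one per worker
]
# each shard can be reduced or persisted on its own worker; the full
# 1M-bin histogram never lives in a single memory space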

Anyway - please let us know if you are interested in turning this into a more mature package and possibly adding features to it! We're happy to answer any questions you may have.

@Saransh-cpp (Contributor)

Hi @lgray, apologies for the late reply. I have submitted a proposal for the cuda-histogram project, and my contract here has been extended for two months (15th June - 15th August). I will start working on this project from the 15th of June. I will fly to India on the 18th of June to get my CUDA laptop and return to CERN on the 1st of July; I will be working remotely for that period.

I still need to go through this PR in detail, but I will pick it up very soon and discuss the work in awkward-dask meetings. Thank you again for the ping! :)

@Saransh-cpp (Contributor)

Hi @lgray @jpivarski, I am experimenting with inheriting the hist axis classes to make the axis classes in this PR match the existing APIs. Could you please let me know whether the axis methods and properties are supposed to return a cupy array, or is a numpy array fine?

For instance, right now the axis class Bin does include methods that return a cupy array. Once I inherit the hist axis class, the methods in Bin override the inherited hist.axis.Regular methods and return a cupy array, but the ones that are not present in Bin simply return a numpy array. Should I update Bin to have as many methods as hist.axis.Regular has and make them all return a cupy array, or should I eliminate the methods in Bin and just depend on the inherited (numpy-array-returning) methods -

In [1]: import cuda_histogram

In [2]: cuda_histogram.hist_tools.Regular(10, 0, 1)  # the repr is inherited, works!
Out[2]: Regular(10, 0, 1)

In [3]: cuda_histogram.hist_tools.Regular(10, 0, 1).extent  # inherited property, works!
Out[3]: 12

In [4]: type(cuda_histogram.hist_tools.Regular(10, 0, 1).edges)  # overridden by `Bin` to return a cupy array
Out[4]: cupy.ndarray

In [5]: type(cuda_histogram.hist_tools.Regular(10, 0, 1).widths)  # not overridden, inherited, thus a numpy array
Out[5]: numpy.ndarray

The class inheritance looks like this right now -

class Bin(hist.axis.Regular, family="cuda_histogram"):  # has methods common to the Regular and Variable axes, but I might split it
    ...


class Regular(Bin, family="cuda_histogram"):
    ...

@lgray (Collaborator, Author) commented Jul 23, 2024

@Saransh-cpp I've been mulling over this for a while, and I think that really only the filling should be done on the GPU, from the user's perspective.

I would say that any other user-facing thing, like axes and such, should produce numpy arrays, since axes and similar arrays tend to be very small and shipping them to/from the GPU is inefficient. fill should be the only thing that accepts cupy arrays, and it should complain if it receives anything else.

What do you think?

@jpivarski (Contributor)

If the histogram bin edges/descriptions are immutable, you could have a copy of them on both CPU and GPU. Then they'd be available on the GPU for the filling operations, and available on the CPU for user queries. The only thing that would need to be transferred is the bin data. It can be a new cudaMalloc on the GPU at the start of a fill, then copied to the CPU and added to a running total at the end of the fill. Since fill is a high-level function called from Python, the overhead of 1 cudaMalloc and 1 cudaMemcpy per fill would be hidden under all the Python overhead. Outside of a fill, the source of truth for histogram data would all be on the CPU.
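For illustration, a minimal sketch of that pattern (hypothetical class, with cupy.histogram standing in for the PR's fill kernel), also folding in the earlier point that fill should complain about non-cupy input:

import cupy as cp
import numpy as np

class GPUHist:  # hypothetical
    def __init__(self, edges):
        self.edges = np.asarray(edges)          # immutable CPU copy for user queries
        self._edges_gpu = cp.asarray(edges)     # immutable GPU copy for filling
        self._total = np.zeros(len(edges) - 1)  # source of truth lives on the CPU

    def fill(self, values):
        if not isinstance(values, cp.ndarray):
            raise TypeError("fill() accepts cupy arrays only")
        # fresh device buffer per fill, one device-to-host copy at the end
        counts, _ = cp.histogram(values, bins=self._edges_gpu)
        self._total += counts.get()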

@lgray (Collaborator, Author) commented Jul 24, 2024

@jpivarski that's just about what I'm thinking, but fill shouldn't trigger an xfer to CPU (as then, on multiple fills, you'd xfer to CPU on each fill, which synchronizes the device). So it should xfer to CPU on any non-fill access of the bin/weight contents, IMO.

I think this is possible as well?
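Continuing the sketch above, the lazy variant would keep the running total on the device and move the transfer to the read path (again illustrative, not the PR's code):

import cupy as cp

class LazyGPUHist:  # hypothetical
    def __init__(self, edges):
        self._edges_gpu = cp.asarray(edges)
        self._counts_gpu = cp.zeros(len(edges) - 1)  # running total stays on the GPU

    def fill(self, values):
        counts, _ = cp.histogram(values, bins=self._edges_gpu)
        self._counts_gpu += counts  # queued on the device; no synchronization

    def values(self):
        return self._counts_gpu.get()  # device sync + copy only when actually read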

@jpivarski (Contributor)

That would work. Although if fill is a Python function, the overhead is already high enough that a transfer from GPU to CPU wouldn't matter all that much.

@lgray (Collaborator, Author) commented Jul 24, 2024

Multiple fills being able to proceed in parallel is better than sync'ing on each call, I'd say.

@Saransh-cpp (Contributor)

Thanks for the detailed discussion, @lgray @jpivarski! I have been working on the API and right now it looks like this -

In [1]: import cuda_histogram; import numpy as np

In [2]: a = cuda_histogram.axis.Regular(10, 0, 1, name="r")

In [3]: c = cuda_histogram.Hist(a)

In [4]: a, c
Out[4]:
(Regular(10, 0, 1, name='r', label='Axis 0'),
 Hist(Regular(10, 0, 1), storage=Double()))

In [5]: c.fill(r=np.random.normal(size=1_000_000))  # the signature of fill needs to be changed to match other libraries

In [6]: c.values()
Out[6]: array([39441., 38312., 37423., 35879., 34110., 32467., 30292.])

In [7]: type(c.values())
Out[7]: cupy.ndarray

In [8]: type(a.edges)
Out[8]: numpy.ndarray

Only the fill method executes on GPU with this design. Could you please let me know if this is the right direction or should I change something in this example?

@lgray force-pushed the jitters branch 4 times, most recently from 35285fa to 8100467 on August 8, 2024
@Saransh-cpp (Contributor)

Hi @lgray, the flow behavior in this PR works like this right now -

In [1]: import cuda_histogram; import numpy as np; import hist

In [2]: a = cuda_histogram.axis.Bin("n", "l", 10, 0, 1)

In [3]: c = cuda_histogram.Hist("c", a)

In [4]: c.fill(n=np.random.normal(size=1_000_000))

In [5]: c.values()
Out[5]:
{(): array([59481., 62261., 63944., 65914., 67848., 69844., 70480., 71625.,
        72873., 73828.])}

In [6]: c.values("over")
Out[6]:
{(): (array([59481., 62261., 63944., 65914., 67848., 69844., 70480., 71625.,
         72873., 73828.]),
  array([59481., 62261., 63944., 65914., 67848., 69844., 70480., 71625.,
         72873., 73828.]))}

I am not sure if this is the intended behavior. The flow behavior in other histogramming libraries for a 1D histogram just adds the underflow and overflow values at the ends of the array instead of duplicating the entire array. Could you please let me know if this is the intended behavior?
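For reference, the convention described here as implemented in boost-histogram: flow=True prepends the underflow bin and appends the overflow bin rather than duplicating the array.

import boost_histogram as bh
import numpy as np

h = bh.Histogram(bh.axis.Regular(10, 0, 1))
h.fill(np.random.normal(size=1_000_000))

print(h.values().shape)           # (10,) -- in-range bins only
print(h.values(flow=True).shape)  # (12,) -- [underflow] + 10 bins + [overflow]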

@lgray (Collaborator, Author) commented Aug 14, 2024

Yes, coffea histograms also have "NaN-flow" which is a special overflow bin for NaNs, so that counts and weights don't get affected by floating point problems. I personally like this feature, but none of the other histogram libraries care to implement it.
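For illustration, a minimal numpy sketch of that layout (not coffea's actual code), with the NaN-flow bin placed after overflow so NaN entries never contaminate the under/overflow counts:

import numpy as np

def nanflow_counts(values, edges):
    # bin layout: [underflow, regular bins..., overflow, nanflow]
    idx = np.searchsorted(edges, values, side="right")     # 0 = underflow
    idx = np.where(np.isnan(values), len(edges) + 1, idx)  # route NaN to its own bin
    return np.bincount(idx, minlength=len(edges) + 2)

counts = nanflow_counts(np.array([-1.0, 0.25, 0.75, 2.0, np.nan]),
                        np.linspace(0, 1, 11))
print(counts[0], counts[-2], counts[-1])  # 1 1 1: underflow, overflow, nanflow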

As for the filling workflow you've shown, does it work without copies, using GPU-only arrays?

@Saransh-cpp (Contributor)

Yes! The fill method uses the exact logic in this PR and makes it compatible with other histogramming libraries -

Here is a more detailed usage example - https://github.com/Saransh-cpp/cuda-histogram?tab=readme-ov-file#usage

I have updated the repository with documentation, a few tests, and CI. The API is now very similar to boost-histogram's API. The Categorical axis is the only major thing missing (or left for refactoring).

I have requested to transfer the repository to Scikit-HEP and will create a v0.1.0 release once it is transferred.
