njit to speed up functions on awkward arrays #1824

fspinna · 2022-10-24T12:45:26Z


Hi everyone,
I'm relatively new to Numba and even newer to Awkward Arrays. I'm trying to understand how to use njitted functions with Awkward Arrays as input. For example, let's say that I have a list of NumPy arrays of different lengths:

import numpy as np
import numba as nb
import awkward as ak

n_rep = 100
lists = [np.array([1, 2, 3]), np.array([1,2])] * n_rep
ak_array = ak.Array(lists)
lists_numba = nb.typed.List(lists)  # numba typed list

I want to take the mean along axis=1. Of course, I can do it natively with Awkward Arrays, but to do it with Numba I have to iterate over each np.array in the list, so the function looks something like this:

@nb.njit
def nb_mean(arr):
  means = np.empty(shape=len(arr))
  for i in range(len(arr)):
    means[i] = np.mean(arr[i])
  return means

nb_mean(lists_numba)

If I pass the awkward array to this function, numba raises this error:

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<function mean at 0x7f05dd254b00>) found for signature:
 
 >>> mean(ak.ArrayView(ak.NumpyArrayType(array(int64, 1d, A), none, {}), None, ()))
 
There are 2 candidate implementations:
  - Of which 2 did not match due to:
  Overload in function 'Numpy_method_redirection.generic': File: numba/core/typing/npydecl.py: Line 379.
    With argument(s): '(ak.ArrayView(ak.NumpyArrayType(array(int64, 1d, A), none, {}), None, ()))':
   Rejected as the implementation raised a specific error:
     TypeError: array does not have a field with key 'mean'
   
   (https://github.com/scikit-hep/awkward-1.0/blob/1.10.1/src/awkward/_connect/_numba/layout.py#L339)
  raised from /usr/local/lib/python3.7/dist-packages/awkward/_connect/_numba/layout.py:339

During: resolving callee type: Function(<function mean at 0x7f05dd254b00>)
During: typing of call at <ipython-input-27-1014e122c504> (8)


File "<ipython-input-27-1014e122c504>", line 8:
def nb_mean(arr):
    <source elided>
  for i in range(len(arr)):
    means[i] = np.mean(arr[i])
    ^

The code works if I first convert the awkward array with np.array as in the following:

@nb.njit
def nb_ak_mean(arr):
  means = np.empty(shape=len(arr))
  for i in range(len(arr)):
    means[i] = np.mean(np.array(arr[i]))  # convert first to np.array then apply np.mean
  return means

So basically, if I pass an entry of ak_array directly to np.mean without first converting it with np.array, Numba raises an error.
This is the performance of the three approaches:

%timeit ak.mean(ak_array, axis=1)
%timeit nb_ak_mean(ak_array)
%timeit nb_mean(lists_numba)

1.42 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # native awkward
74 µs ± 1.33 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)  # numba + awkward
7.61 µs ± 41.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # numba + List

The approach using Numba's typed List seems to be significantly faster than using Awkward Arrays. Am I doing something wrong?

Answered by agoose77


ianna · 2022-10-24T13:29:25Z

ianna
Oct 24, 2022
Maintainer

@fspinna - which version of awkward do you use?

>>> ak.__version__
'2.0.0rc1'

1 reply

fspinna Oct 24, 2022
Author

numba == 0.56.3
awkward == 1.10.1
numpy == 1.21.6

agoose77 · 2022-10-24T13:48:51Z

agoose77
Oct 24, 2022
Maintainer

Awkward Array (and Numba's) performance can be measured approximately as time = initial_cost + rate*amount_of_work. In this case, your array is too small for the work-scaling to be properly measured, i.e. the setup costs (initial_cost) are dominating the performance.
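One way to separate the two terms is to time the same function at several input sizes and fit a straight line: the intercept estimates initial_cost and the slope estimates rate. The following is a minimal sketch, assuming the linear model above holds. `fit_cost_model` is a hypothetical helper (not part of Awkward or Numba), and the timings are synthetic values generated from the model itself, not measurements.

```python
def fit_cost_model(sizes, times):
    """Least-squares fit of times ~ initial_cost + rate * size."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    rate = sxy / sxx
    initial_cost = mean_y - rate * mean_x
    return initial_cost, rate


# Synthetic timings generated from the model (not real measurements):
# 50 microseconds of fixed overhead plus 20 nanoseconds per element.
sizes = [100, 10_000, 1_000_000]
times = [50e-6 + 20e-9 * n for n in sizes]

initial_cost, rate = fit_cost_model(sizes, times)

# Fraction of the total time that is fixed overhead at the smallest
# and at the largest input size.
overhead_share_small = initial_cost / times[0]
overhead_share_large = initial_cost / times[-1]
```

With these assumed numbers the fixed overhead accounts for most of the time at 100 elements and for a negligible share at 1,000,000 elements, which is why a small benchmark says little about the rate term.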

If you change your array to be more like ~1,000,000 elements, then the performance difference between the two Numba jitted cases is closer to a factor of 4:¹

%timeit ak.mean(ak_array, axis=1)
%timeit nb_ak_mean(ak_array)
%timeit nb_mean(lists_numba)
763 ms ± 19.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
117 ms ± 9.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
28.8 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The rest of this difference probably stems from the use of np.array() instead of np.asarray(). In Numba's case, we have a faster implementation of array conversion for asarray than array; asarray only handles 1D contiguous arrays, and directly interprets them as a numba Array, whereas the np.array function performs a copy (it can handle >1D arrays, and these can be non-contiguous in Awkward). This closes the performance gap, and now the Awkward jitted case is (barely) faster than the numba list jitted case:

%timeit nb_ak_mean(ak_array)
18.3 ms ± 703 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Of course, Numba isn't always able to eke out the best performance if you use NumPy operations. A bare loop is usually the fastest solution, and can help avoid these kinds of quirks.

In general, though, don't worry about this kind of performance difference if it's not bottlenecking your workflow. There are always things you can do to improve performance, e.g. handling raggedness explicitly in your kernel by flattening the array and passing in the sublist lengths, but it's not always worth the extra maintenance burden and code complexity.
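The "flatten and pass the sublist lengths" idea mentioned above could look roughly like the sketch below. The names `ragged_mean`, `content` and `counts` are illustrative, not an Awkward API, and the `jit` fallback is only there so the sketch also runs (slowly) where Numba is not installed.

```python
import numpy as np

try:
    import numba as nb
    jit = nb.njit
except ImportError:
    # Fallback so the sketch still runs without Numba installed.
    def jit(func):
        return func


@jit
def ragged_mean(content, counts):
    """Mean of each sublist, given flat data and per-sublist lengths."""
    means = np.empty(len(counts), dtype=np.float64)
    start = 0
    for i in range(len(counts)):
        stop = start + counts[i]
        total = 0.0
        for j in range(start, stop):
            total += content[j]
        # An empty sublist has no mean; report NaN instead of dividing by zero.
        means[i] = total / counts[i] if counts[i] > 0 else np.nan
        start = stop
    return means


# With Awkward, the two inputs could plausibly be obtained as
#   content = ak.to_numpy(ak.flatten(ak_array))
#   counts = ak.to_numpy(ak.num(ak_array, axis=1))
content = np.array([1, 2, 3, 1, 2], dtype=np.int64)
counts = np.array([3, 2], dtype=np.int64)
means = ragged_mean(content, counts)
```

The kernel only ever sees two plain NumPy arrays, so it does not depend on how the ragged data was stored, at the cost of carrying the bookkeeping (start and stop positions) by hand.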

¹ Note that this isn't a perfect test. Awkward Array should (in general) have a better memory layout than multiple small NumPy arrays: Awkward usually represents raggedness as variable-length views over a contiguous buffer (certainly not always, but often), which is a more favourable scenario in terms of memory access.

4 replies

fspinna Oct 24, 2022
Author

Thanks! Perfectly clear. This fits exactly my use case. Tested on my machine and your version is significantly faster than the numba List version.

jpivarski Oct 24, 2022
Maintainer

That's right: if the total time for a Numba-JITted function to run is on the order of µs, we don't know how it will scale (the rate part) until we give it a larger problem. A run that short is likely dominated by initial_cost.

Another consideration is that Python lists defined like

lists = [np.array([1, 2, 3]), np.array([1, 2])] * n_rep

are actually repeated references to the same two arrays, not a dataset consisting of 5 × n_rep numbers (as the Awkward Array is, since the ak.Array constructor iterates over the data, copying it into a columnar representation).
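The aliasing is a property of Python list multiplication, so it can be shown without NumPy. A small sketch using plain lists in place of the np.array objects (the variable names are only for illustration):

```python
n_rep = 3
lists = [[1, 2, 3], [1, 2]] * n_rep

# Six entries, but only two distinct objects behind them.
num_entries = len(lists)
num_distinct = len({id(item) for item in lists})

# Mutating one entry shows up in every repeat of it.
lists[0].append(99)
shared = lists[2]

# Building each entry separately gives independent objects.
independent = [list(pair) for _ in range(n_rep) for pair in ([1, 2, 3], [1, 2])]
num_independent = len({id(item) for item in independent})
```

A benchmark that is meant to exercise memory access would need something like the second construction, ideally with varying values and lengths.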

These two small arrays are easily small enough to fit in a CPU's cache (even the L1 cache, which is typically between 8 kB and 64 kB), so it does not need to pull data from main memory in the calculation. If the calculation you're performing is simple enough, getting data from main memory can easily become the bottleneck.

Getting estimates from this infographic,

the + in your np.mean takes < 1 clock cycle
the floating point / in your np.mean takes 10‒40 clock cycles (np.sum would be very different from np.mean!)
fetching data from L1 takes 3‒4 clock cycles
fetching data from main memory takes 100‒150 clock cycles

(Note: that's after you already have an n_rep large enough that the process isn't dominated by unboxing/boxing data as you enter and leave the Numba context. I'm assuming you've already applied @agoose77's correction.)

Naively, clock ticks translate into real time by dividing by your clock speed (~3 GHz), but even this is complicated by the fact that the CPU actually runs the process in pipelines and speculatively evaluates based on guesses that can later be invalidated. The main takeaway from these numbers is the relative orders of magnitude.
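As a worked version of that naive conversion, using the cycle estimates listed above and an assumed 3 GHz clock (`cycles_to_ns` is a hypothetical helper, and the result ignores pipelining and speculation entirely):

```python
def cycles_to_ns(cycles, clock_hz=3e9):
    """Naive conversion: seconds = cycles / frequency, reported in nanoseconds."""
    return cycles / clock_hz * 1e9


l1_fetch_ns = cycles_to_ns(4)        # upper end of the L1 estimate
main_memory_ns = cycles_to_ns(150)   # upper end of the main-memory estimate

# Relative cost of a main-memory fetch compared with an L1 fetch.
ratio = main_memory_ns / l1_fetch_ns
```

Under these assumptions a main-memory fetch costs a few tens of nanoseconds and is more than an order of magnitude slower than an L1 fetch, which is the point of the comparison.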

Very likely, the real dataset you're interested in is not the same two Python objects repeated many times; I'll bet it's a large set of distinct lists, all with different values and lengths. While %timeit is correctly measuring the rate of this test, you're probably interested in a different test. (Supposedly objective performance testing is full of subjective choices!)

Another thing to be aware of is that this line:

    means[i] = np.mean(np.asarray(arr[i]))

does not actually copy the data in the Awkward Array, so it's safe to use even if arr[i] is big. (Note that this is np.asarray, not np.array, as @agoose77 pointed out.) It has to create an ArrayModel struct on the stack, which is something like 64 bytes (depending on how many dimensions the array has), but that's a constant-size metadata specifying pointers to the object, data type, shape and strides, not the variable-sized data in the array. It should be relatively fast. (I can't put a number on it.)
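The copy-versus-view distinction has a close analogue in plain NumPy, outside any JIT-compiled code. The sketch below only illustrates that analogue; it does not exercise Awkward's Numba implementation of either function.

```python
import numpy as np

data = np.arange(5, dtype=np.int64)

as_view = np.asarray(data)   # no copy for an existing ndarray
as_copy = np.array(data)     # copies by default

view_shares = np.shares_memory(data, as_view)
copy_shares = np.shares_memory(data, as_copy)

# Writing through the original is visible in the view but not in the copy.
data[0] = 42
view_first = as_view[0]
copy_first = as_copy[0]
```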

It would be nice if you didn't have to do the explicit cast at all; that's open issue #509.

But as @agoose77 pointed out, vectorized functions inside the JIT-compiled context aren't as advantageous as they are in pure Python. You can sometimes do better with an explicitly defined function because LLVM will recognize that you don't need the full features of an np.ndarray object, just the ability to iterate over the contents, and it might be able to optimize that down into fewer instructions. I haven't tested it, but this might be a faster nb_ak_mean:

@nb.njit
def nb_ak_mean(arr):
  means = np.zeros(shape=len(arr))
  for i, x in enumerate(arr):
    for j in range(len(x)):
      means[i] += x[j]
    means[i] /= len(x)   # maybe you'll get some NaNs from division-by-zero
  return means

Refactoring this into helper functions (one iterates, the other computes the mean) shouldn't be any slower because @nb.njit functions that call @nb.njit functions skip the boxing/unboxing steps and may even be inlined by LLVM. Doing it in one function vs. multiple functions is not a performance/readability tradeoff, but functions that require an ArrayModel struct like np.mean do have to do the extra work of making that struct, small as it may be.
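The helper-function split described above might be sketched as follows. The names `row_mean` and `all_means` are illustrative, the performance claim is untested here, and a regular 2D NumPy array stands in for the Awkward Array so that the example runs without Awkward. The `jit` fallback lets it run without Numba as well.

```python
import numpy as np

try:
    import numba as nb
    jit = nb.njit
except ImportError:
    # Fallback so the sketch still runs without Numba installed.
    def jit(func):
        return func


@jit
def row_mean(x):
    """Mean of one 1D sequence using a bare loop; NaN if it is empty."""
    n = len(x)
    if n == 0:
        return np.nan
    total = 0.0
    for j in range(n):
        total += x[j]
    return total / n


@jit
def all_means(arr):
    """Apply row_mean to every entry of arr."""
    means = np.empty(len(arr), dtype=np.float64)
    for i in range(len(arr)):
        means[i] = row_mean(arr[i])
    return means


# A regular 2D NumPy array stands in for the ragged input here.
demo = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
result = all_means(demo)
```

Returning NaN for an empty entry makes the division-by-zero case explicit instead of leaving it to floating-point behaviour.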

jpivarski Oct 24, 2022
Maintainer

> The rest of this difference probably stems from the use of np.array() instead of np.asarray(). In Numba's case, we have a faster implementation of array conversion for asarray than array; asarray only handles 1D contiguous arrays, and directly interprets them as a numba Array, whereas the np.array function performs a copy (it can handle >1D arrays, and these can be non-contiguous in Awkward).

I missed this and it's absolutely true: np.array is a much slower implementation than np.asarray; it's more general and does more runtime conversion. What I was saying applies to np.asarray. In fact, I'm going to change history and fix the text in my comment.

For reference, these are the np.array and np.asarray implementations. The np.array builds up a lot of runtime code to check the sizes of iterables, to see if they are regular enough to be converted into an n-dimensional NumPy array, whereas the np.asarray implementation takes compile-time information and only applies to regular-typed arrays. (In fact, it should someday be expanded to regular n-dimensional arrays; this np.asarray only works on 1D arrays.)

awkward/src/awkward/_connect/numba/arrayview.py

Lines 856 to 1002 in 2bdd114

    
def array_supported(dtype):
    return dtype in (
        numba.types.boolean,
        numba.types.int8,
        numba.types.int16,
        numba.types.int32,
        numba.types.int64,
        numba.types.uint8,
        numba.types.uint16,
        numba.types.uint32,
        numba.types.uint64,
        numba.types.float32,
        numba.types.float64,
        numba.types.complex64,
        numba.types.complex128,
    ) or isinstance(dtype, (numba.types.NPDatetime, numba.types.NPTimedelta))

@numba.extending.overload(ak.nplikes.numpy.array)
def overload_np_array(array, dtype=None):
    if isinstance(array, ArrayViewType):
        ndim = array.type.ndim
        inner_dtype = array.type.inner_dtype
        if ndim is not None and array_supported(inner_dtype):
            declare_shape = []
            compute_shape = []
            specify_shape = ["len(array)"]
            ensure_shape = []
            array_name = "array"
            for i in range(ndim - 1):
                declare_shape.append(f"shape{i} = -1")
                compute_shape.append(
                    "{}for x{} in {}:".format("    " * i, i, array_name)
                )
                compute_shape.append("{}    if shape{} == -1:".format("    " * i, i))
                compute_shape.append(
                    "{0}        shape{1} = len(x{1})".format("    " * i, i)
                )
                compute_shape.append(
                    "{0}    elif shape{1} != len(x{1}):".format("    " * i, i)
                )
                compute_shape.append(
                    "{}        raise ValueError('cannot convert to NumPy because "
                    "subarray lengths are not regular')".format("    " * i)
                )
                specify_shape.append(f"shape{i}")
                ensure_shape.append("if shape{0} == -1: shape{0} = 0".format(i))
                array_name = f"x{i}"
            fill_array = []
            index = []
            array_name = "array"
            for i in range(ndim):
                fill_array.append(
                    "{0}for i{1}, x{1} in enumerate({2}):".format(
                        "    " * i, i, array_name
                    )
                )
                index.append(f"i{i}")
                array_name = f"x{i}"
            fill_array.append(
                "{}out[{}] = x{}".format("    " * ndim, "][".join(index), ndim - 1)
            )
            return code_to_function(
                """
def array_impl(array, dtype=None):
    {}
    {}
    {}
    out = numpy.zeros(({}), {})
    {}
    return out
""".format(
                    "\n    ".join(declare_shape),
                    "\n    ".join(compute_shape),
                    "\n    ".join(ensure_shape),
                    ", ".join(specify_shape),
                    f"numpy.{inner_dtype}" if dtype is None else "dtype",
                    "\n    ".join(fill_array),
                ),
                "array_impl",
                {"numpy": ak.nplikes.numpy},
            )

@numba.extending.type_callable(ak.nplikes.numpy.asarray)
def type_asarray(context):
    def typer(arrayview):
        if (
            isinstance(arrayview, ArrayViewType)
            and isinstance(arrayview.type, ak._connect.numba.layout.NumpyArrayType)
            and arrayview.type.ndim == 1
            and array_supported(arrayview.type.inner_dtype)
        ):
            return numba.types.Array(arrayview.type.inner_dtype, 1, "C")
    return typer

@numba.extending.lower_builtin(ak.nplikes.numpy.asarray, ArrayViewType)
def lower_asarray(context, builder, sig, args):
    rettype, (viewtype,) = sig.return_type, sig.args
    (viewval,) = args
    viewproxy = context.make_helper(builder, viewtype, viewval)
    assert isinstance(viewtype.type, ak._connect.numba.layout.NumpyArrayType)
    whichpos = ak._connect.numba.layout.posat(
        context, builder, viewproxy.pos, viewtype.type.ARRAY
    )
    arrayptr = ak._connect.numba.layout.getat(
        context, builder, viewproxy.arrayptrs, whichpos
    )
    bitwidth = ak._connect.numba.layout.type_bitwidth(rettype.dtype)
    itemsize = context.get_constant(numba.intp, bitwidth // 8)
    data = numba.core.cgutils.pointer_add(
        builder,
        arrayptr,
        builder.mul(viewproxy.start, itemsize),
        context.get_value_type(numba.types.CPointer(rettype.dtype)),
    )
    shape = context.make_tuple(
        builder,
        numba.types.UniTuple(numba.types.intp, 1),
        (builder.sub(viewproxy.stop, viewproxy.start),),
    )
    strides = context.make_tuple(
        builder,
        numba.types.UniTuple(numba.types.intp, 1),
        (itemsize,),
    )
    out = numba.np.arrayobj.make_array(rettype)(context, builder)
    numba.np.arrayobj.populate_array(
        out,
        data=data,
        shape=shape,
        strides=strides,
        itemsize=itemsize,
        meminfo=None,
        parent=None,
    )
    return out._getvalue()

This Python code runs during Numba's compilation phase.

overload_np_array builds Python source code as strings and evaluates it as a new Numba function. That function checks the size of the Awkward iterable (whatever its type may be), allocates a new NumPy array, and copies the data into that new array. That's expensive!
type_asarray checks the type information of the Awkward Array: it must be a 1D NumpyArray of a supported dtype, so it can't be a variable-length list.
lower_asarray generates LLVM IR that only builds the ~64 byte metadata for the lowered NumPy array.
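The string-based code generation used by overload_np_array can be illustrated in isolation. The sketch below is a simplified stand-in, not Awkward's code: `compile_source` and `make_nested_sum` are hypothetical names, and the generated function merely sums a nested list instead of filling a NumPy array. It shows the same pattern of emitting one loop per dimension, with the depth known before the function is built.

```python
def compile_source(source, function_name):
    """Execute generated source and return the named function from it."""
    namespace = {}
    exec(source, namespace)
    return namespace[function_name]


def make_nested_sum(ndim):
    """Generate a function that sums an ndim-deep nested list (ndim >= 1)."""
    lines = ["def nested_sum(array):", "    total = 0"]
    name = "array"
    for i in range(ndim):
        # One loop per dimension, each nested one level deeper than the last.
        lines.append("    " * (i + 1) + f"for x{i} in {name}:")
        name = f"x{i}"
    lines.append("    " * (ndim + 1) + f"total += {name}")
    lines.append("    return total")
    source = "\n".join(lines)
    return compile_source(source, "nested_sum"), source


nested_sum_2d, generated = make_nested_sum(2)
total = nested_sum_2d([[1, 2, 3], [4, 5]])
```

Printing `generated` shows the source that was built, which is a convenient way to debug this style of code generation.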

fspinna Oct 24, 2022
Author

Thanks a lot for the details. Very interesting indeed!
