[C++] Performance of numeric casts #40874

Open
jorisvandenbossche opened this issue Mar 28, 2024 · 11 comments

@jorisvandenbossche
Member

A simple numeric cast in pyarrow / Arrow C++ seems to be quite a bit slower compared to numpy. Often Arrow does more work (safe casting, handling nulls, ...), but I also see this for a plain unsafe int64 -> float64 cast of a 1d array without nulls, where Arrow is almost 4-5x slower than numpy. For such a simple operation, this seems like an unexpectedly big difference.

The python code I used to compare both:

import numpy as np
import pyarrow as pa

nrows = 10**5
arr_np = np.random.randn(nrows*20).astype("int64")
arr = pa.array(arr_np)

%timeit arr_np.astype("float64")
# 1.1 ms ± 2.69 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit arr.cast(pa.float64(), safe=False)
# 4.7 ms ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I see this both with my local development version and with released conda packages.

@jorisvandenbossche
Member Author

jorisvandenbossche commented Mar 28, 2024

Looking at the inner loop in both Arrow and numpy, they seem quite similar. In numpy, almost all time is spent in _aligned_contig_cast_long_to_double, which essentially boils down to:

    npy_intp N = dimensions[0];
    char *src = args[0], *dst = args[1];

    while (N--) {
        *(npy_double *)dst = ((npy_double)(*(npy_long *)src));
        dst += sizeof(npy_double);
        src += sizeof(npy_long);
    }

and in Arrow, almost all time is spent in CastPrimitive, which essentially does:

using OutT = typename OutType::c_type;
using InT = typename InType::c_type;
const InT* in_values = arr.GetValues<InT>(1);
OutT* out_values = out->GetValues<OutT>(1);
for (int64_t i = 0; i < arr.length; ++i) {
  *out_values++ = static_cast<OutT>(*in_values++);
}

Does anybody have any insight into why our templated C++ code is so much slower than numpy's C code? Logically it looks very similar; is there something in our code that prevents optimizations?

Running perf on the examples, here are two quick observations (made without much experience with this tool and without looking into detail) that might be relevant:

  • The perf flamegraph shows a big chunk of exc_page_fault in CastPrimitive (a bit more than half of the time spent in CastPrimitive is attributed to it)
  • Drilling down into the annotations of the hotspot, the actual store of the casted float into the destination buffer (movsd in the assembly) takes relatively much more time in the Arrow version compared to numpy
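
If the page faults indeed come from first-touching a freshly allocated destination buffer, that cost can be illustrated outside of Arrow. A minimal numpy sketch (assuming the allocation is large enough to be served by mmap, so the pages are only faulted in on first write):

import time
import numpy as np

n = 2_000_000  # same order of magnitude as the arrays benchmarked above
out = np.empty(n, dtype="float64")  # freshly allocated; pages not yet faulted in

t0 = time.perf_counter()
out[:] = 1.0   # first touch of the buffer pays the page-fault cost
t1 = time.perf_counter()
out[:] = 2.0   # buffer is now resident, so this write is much cheaper
t2 = time.perf_counter()
print(f"first write: {(t1 - t0) * 1e3:.2f} ms, second write: {(t2 - t1) * 1e3:.2f} ms")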

@pitrou
Member

pitrou commented Mar 28, 2024

Well, this really seems to be a problem with jemalloc actually:

>>> arr = pa.array([42]*1_000_000, type=pa.int64())

>>> %timeit arr.cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
2.17 ms ± 19.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit arr[:10_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
8.86 µs ± 41 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

>>> %timeit arr.cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
545 µs ± 6.75 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit arr[:10_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
8.16 µs ± 25.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

>>> %timeit arr.cast(pa.float64(), safe=False, memory_pool=pa.system_memory_pool())
802 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit arr[:10_000].cast(pa.float64(), safe=False, memory_pool=pa.system_memory_pool())
8.81 µs ± 32 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

>>> np_arr = arr.to_numpy()
>>> %timeit np_arr.astype('float64')
514 µs ± 2.54 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

@pitrou
Member

pitrou commented Mar 28, 2024

Trying to prefer mimalloc in #40875

@pitrou
Member

pitrou commented Mar 28, 2024

That said, even with mimalloc we're much slower than Numpy on larger arrays, so there's perhaps another issue (are we allocating a null bitmap?):

>>> arr = pa.array([42]*100_000_000, type=pa.int64())
>>> np_arr = arr.to_numpy()

>>> %timeit arr.cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
277 ms ± 636 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit arr[:10_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
27.6 ms ± 448 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit arr[:1_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
557 µs ± 2.43 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

>>> %timeit np_arr.astype('float64')
126 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit np_arr[:10_000_000].astype('float64')
12.6 ms ± 45.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit np_arr[:1_000_000].astype('float64')
466 µs ± 4.85 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
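
One quick way to check the null-bitmap question from Python is to inspect the result's buffers; a small sketch (Array.buffers()[0] is the validity bitmap, so it should be None if no bitmap was allocated for an all-valid input):

import pyarrow as pa

arr = pa.array([42] * 1_000_000, type=pa.int64())
res = arr.cast(pa.float64(), safe=False)
# buffers()[0] is the validity bitmap; None here would mean the cast did not
# allocate a bitmap for the all-valid input
print(res.null_count, res.buffers()[0])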

@pitrou
Member

pitrou commented Mar 28, 2024

By the way, it seems the mimalloc/jemalloc difference really only occurs for a certain range of sizes; above and below that range they seem to have similar perf:

>>> %timeit arr[:250_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
77 µs ± 955 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit arr[:250_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
77.1 µs ± 696 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

>>> %timeit arr[:500_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
149 µs ± 744 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit arr[:500_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
146 µs ± 197 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

>>> %timeit arr[:1_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
550 µs ± 543 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit arr[:1_000_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
2.21 ms ± 6.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit arr[:2_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
1.89 ms ± 6.83 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit arr[:2_000_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
4.92 ms ± 60.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit arr[:4_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
3.76 ms ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit arr[:4_000_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
10.3 ms ± 27.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit arr[:8_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
21.9 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit arr[:8_000_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
22 ms ± 128 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit arr[:16_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
41.6 ms ± 21.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit arr[:16_000_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
40.4 ms ± 376 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
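
One knob that might be worth ruling out for this size range is jemalloc's dirty-page decay, which pyarrow exposes through jemalloc_set_decay_ms. A hedged sketch (whether this actually influences the mid-size gap here is unverified):

import pyarrow as pa

# jemalloc keeps freed ("dirty") pages around for a while before returning them
# to the OS; a decay time of 0 releases them immediately
pa.jemalloc_set_decay_ms(0)

arr = pa.array([42] * 1_000_000, type=pa.int64())
%timeit arr.cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())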

@pitrou
Member

pitrou commented Mar 28, 2024

By the way, one can also take a look at the C++ results (with mimalloc):

CastInt64ToDoubleUnsafe/524288/1000     148707 ns       148677 ns         4705 items_per_second=3.52635G/s null_percent=0.1 size=524.288k
CastInt64ToDoubleUnsafe/524288/10       148746 ns       148718 ns         4601 items_per_second=3.5254G/s null_percent=10 size=524.288k
CastInt64ToDoubleUnsafe/524288/2        148716 ns       148681 ns         4709 items_per_second=3.52625G/s null_percent=50 size=524.288k
CastInt64ToDoubleUnsafe/524288/1        149002 ns       148976 ns         4699 items_per_second=3.51927G/s null_percent=100 size=524.288k
CastInt64ToDoubleUnsafe/524288/0        146288 ns       146262 ns         4783 items_per_second=3.58459G/s null_percent=0 size=524.288k

and compare them with a corresponding Python benchmark:

>>> %timeit arr[:524288].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
153 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

and then to Numpy:

>>> np_arr = arr.to_numpy()
>>> %timeit np_arr[:524288].astype('float64')
153 µs ± 5.48 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

They are all very similar (~150 µs per iteration, despite different levels of internal boilerplate).

@jorisvandenbossche
Member Author

import numpy as np
import pandas as pd
import pyarrow as pa

results = []
for exp in range(4, 10):  # array sizes from 10**4 (assumed lower bound) up to 10**9
    N = 10**exp
    arr = np.random.randint(0, 100, size=N)
    res = %timeit -o arr.astype("float64")
    results.append((N, "numpy", res.average))
    arr = pa.array(arr)
    for pool in [pa.mimalloc_memory_pool(), pa.jemalloc_memory_pool(), pa.system_memory_pool()]:
        res = %timeit -o arr.cast(pa.float64(), safe=False, memory_pool=pool)
        results.append((N, f"pyarrow ({pool.backend_name})", res.average))

df = pd.DataFrame(results, columns=["N", "backend", "time"])

import seaborn.objects as so
so.Plot(df, x="N", y="time", color="backend").add(so.Line()).scale(y="log", x="log").show()

[Figure 1: cast time vs. array size (log-log) for numpy and pyarrow with the mimalloc, jemalloc and system memory pools]

It's indeed a specific range of data sizes where jemalloc performs worse, and that happens to be exactly the size I was testing in the OP.

For larger data, the different allocators give the same result, but it's still around 2x slower than numpy (e.g. for the 10^9 size, 2.3 s vs 1.3 s).

@WillAyd
Contributor

WillAyd commented Jun 5, 2024

Tried running smaller examples through godbolt with gcc 14.1 to see if they produced any instruction differences. I think this replicates the NumPy example:

void foo() {
    char *src;
    char *dst;
    while (1) {
        *(double *)dst = ((double)(*(long *)src));
        dst += sizeof(double);
        src += sizeof(long);
    }
}

producing the following asm:

foo:
        push    rbp
        mov     rbp, rsp
.L2:
        mov     rax, QWORD PTR [rbp-8]
        mov     rax, QWORD PTR [rax]
        pxor    xmm0, xmm0
        cvtsi2sd        xmm0, rax
        mov     rax, QWORD PTR [rbp-16]
        movsd   QWORD PTR [rax], xmm0
        add     QWORD PTR [rbp-16], 8
        add     QWORD PTR [rbp-8], 8
        jmp     .L2

and this typifies the C++ example:

void foo() {
    using OutT = double; 
    using InT = long; 
    const InT* in_values; 
    OutT* out_values; 
    for (;;) { 
        *out_values++ = static_cast<OutT>(*in_values++); 
    } 
 }

producing the following asm:

foo():
        push    rbp
        mov     rbp, rsp
.L2:
        mov     rax, QWORD PTR [rbp-8]
        lea     rdx, [rax+8]
        mov     QWORD PTR [rbp-8], rdx
        mov     rax, QWORD PTR [rax]
        pxor    xmm0, xmm0
        cvtsi2sd        xmm0, rax
        mov     rax, QWORD PTR [rbp-16]
        lea     rdx, [rax+8]
        mov     QWORD PTR [rbp-16], rdx
        movsd   QWORD PTR [rax], xmm0
        jmp     .L2

Maybe the processor just handles the add instructions more effectively than the lea / mov combination in the C++ example?

FWIW, if you change the C++ code to do the pointer increments as separate statements rather than inline, the generated asm matches the C example:

void foo() {
    using OutT = double; 
    using InT = long; 
    const InT* in_values; 
    OutT* out_values; 
    for (;;) { 
        *out_values = static_cast<OutT>(*in_values); 
        out_values++;
        in_values++;
    } 
 }

@WillAyd
Contributor

WillAyd commented Jun 5, 2024

Maybe there is also the possibility to use memcpy in the case where sizeof(OutT) == sizeof(InT) to help boost performance?

@pitrou
Member

pitrou commented Jun 5, 2024

Maybe there is also the possibility to use memcpy in the case where sizeof(OutT) == sizeof(InT) to help boost performance?

When casting int64 to float64?

@WillAyd
Contributor

WillAyd commented Jun 5, 2024

Ah sorry - ignore that comment
