[C++] Performance of numeric casts #40874
Looking into the inner loop in both Arrow and numpy, it seems to be quite similar. In numpy, almost all time is spent in:

```c
npy_intp N = dimensions[0];
char *src = args[0], *dst = args[1];

while (N--) {
    *(npy_double *)dst = ((npy_double)(*(npy_long *)src));
    dst += sizeof(npy_double);
    src += sizeof(npy_long);
}
```

and in Arrow, almost all time is spent in `arrow/cpp/src/arrow/compute/kernels/scalar_cast_internal.cc`, lines 40 to 46 (at cf832b8).
Does anybody have any insight into why our templated C++ code is so much slower than numpy's C code? Logically it looks very similar, or would there be something in our code that prevents optimizations?
Well, this really seems to be a problem with jemalloc actually:

```python
>>> arr = pa.array([42]*1_000_000, type=pa.int64())
>>> %timeit arr.cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
2.17 ms ± 19.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit arr[:10_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
8.86 µs ± 41 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>>> %timeit arr.cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
545 µs ± 6.75 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit arr[:10_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
8.16 µs ± 25.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>>> %timeit arr.cast(pa.float64(), safe=False, memory_pool=pa.system_memory_pool())
802 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit arr[:10_000].cast(pa.float64(), safe=False, memory_pool=pa.system_memory_pool())
8.81 µs ± 32 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>>> np_arr = arr.to_numpy()
>>> %timeit np_arr.astype('float64')
514 µs ± 2.54 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
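Side note on the benchmarks above: the `memory_pool=` keyword only applies to that single call. A minimal sketch of switching the process-wide default pool instead, using the standard pyarrow API (treat the exact effect on other allocations as an illustration, not a guarantee):

```python
import pyarrow as pa

# Which allocator backs the default pool right now?
print(pa.default_memory_pool().backend_name)

# Make subsequent allocations in this process use mimalloc,
# instead of passing memory_pool= on each individual call.
pa.set_memory_pool(pa.mimalloc_memory_pool())

arr = pa.array([42] * 1_000_000, type=pa.int64())
out = arr.cast(pa.float64(), safe=False)  # now allocated from mimalloc
```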
Trying to prefer mimalloc in #40875
That said, even with mimalloc we're much slower than NumPy on larger arrays, so there's perhaps another issue (are we allocating a null bitmap?):

```python
>>> arr = pa.array([42]*100_000_000, type=pa.int64())
>>> np_arr = arr.to_numpy()
>>> %timeit arr.cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
277 ms ± 636 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit arr[:10_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
27.6 ms ± 448 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit arr[:1_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
557 µs ± 2.43 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit np_arr.astype('float64')
126 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit np_arr[:10_000_000].astype('float64')
12.6 ms ± 45.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit np_arr[:1_000_000].astype('float64')
466 µs ± 4.85 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
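On the null-bitmap question: one way to check is to inspect the buffers of the input and of the cast result. For a primitive array with no nulls the validity buffer should simply be absent. A quick sketch using the public pyarrow API (the expected output in the comments is my assumption, not something verified in this thread):

```python
import pyarrow as pa

arr = pa.array([42] * 1_000_000, type=pa.int64())
out = arr.cast(pa.float64(), safe=False)

# For primitive arrays, buffers() returns [validity_bitmap, data].
print(arr.null_count, out.null_count)  # expected: 0 0
print(arr.buffers())                   # expected: [None, <data buffer>]
print(out.buffers())                   # expected: [None, <data buffer>] if no bitmap was allocated
```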
By the way, it seems the mimalloc/jemalloc difference really occurs for a certain range of sizes, above and below which they seem to have similar perf:

```python
>>> %timeit arr[:250_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
77 µs ± 955 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit arr[:250_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
77.1 µs ± 696 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit arr[:500_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
149 µs ± 744 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit arr[:500_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
146 µs ± 197 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit arr[:1_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
550 µs ± 543 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit arr[:1_000_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
2.21 ms ± 6.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit arr[:2_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
1.89 ms ± 6.83 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit arr[:2_000_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
4.92 ms ± 60.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit arr[:4_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
3.76 ms ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit arr[:4_000_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
10.3 ms ± 27.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit arr[:8_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
21.9 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit arr[:8_000_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
22 ms ± 128 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit arr[:16_000_000].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
41.6 ms ± 21.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit arr[:16_000_000].cast(pa.float64(), safe=False, memory_pool=pa.jemalloc_memory_pool())
40.4 ms ± 376 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
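To dig a bit further into what the allocators do in that size range, the pool statistics around a single cast can be inspected as well. A rough sketch (it only shows how to collect the numbers; how to interpret them is left open):

```python
import pyarrow as pa

arr = pa.array([42] * 2_000_000, type=pa.int64())

for pool in (pa.mimalloc_memory_pool(), pa.jemalloc_memory_pool(), pa.system_memory_pool()):
    before = pool.bytes_allocated()
    out = arr.cast(pa.float64(), safe=False, memory_pool=pool)
    after = pool.bytes_allocated()
    # bytes held by the cast result vs. the pool's high-water mark
    print(pool.backend_name, after - before, pool.max_memory())
```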
By the way, one can also take a look at the C++ results (with mimalloc):

and compare them with a corresponding Python benchmark:

```python
>>> %timeit arr[:524288].cast(pa.float64(), safe=False, memory_pool=pa.mimalloc_memory_pool())
153 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```

and then to NumPy:

```python
>>> np_arr = arr.to_numpy()
>>> %timeit np_arr[:524288].astype('float64')
153 µs ± 5.48 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```

They are all very similar (~150 µs per iteration, despite different levels of internal boilerplate).
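For reference, 524288 is 2^19 elements, i.e. a 4 MiB float64 output buffer:

```python
n = 524_288
print(n == 2**19, n * 8 / 2**20)  # True 4.0  (MiB of float64 output)
```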
```python
import numpy as np
import pandas as pd
import pyarrow as pa

results = []
for exp in range(2, 10):  # sweep sizes up to 10**9 (lower bound assumed; it was undefined in the original snippet)
    N = 10**exp
    arr = np.random.randint(0, 100, size=N)
    res = %timeit -o arr.astype("float64")
    results.append((N, "numpy", res.average))
    arr = pa.array(arr)
    for pool in [pa.mimalloc_memory_pool(), pa.jemalloc_memory_pool(), pa.system_memory_pool()]:
        res = %timeit -o arr.cast(pa.float64(), safe=False, memory_pool=pool)
        results.append((N, f"pyarrow ({pool.backend_name})", res.average))

df = pd.DataFrame(results, columns=["N", "backend", "time"])

import seaborn.objects as so
so.Plot(df, x="N", y="time", color="backend").add(so.Line()).scale(y="log", x="log").show()
```

It's indeed a specific size of data where jemalloc is performing worse, and it was just that size that I was testing in the OP. For larger data, the different allocators give the same result, but it's still around 2x slower than numpy (e.g. for the 10^9 size, 2.3 s vs 1.3 s).
Tried running smaller examples through godbolt with gcc 14.1 to see if they produced any instruction differences. I think this replicates the NumPy example:

```c
void foo() {
    char *src;
    char *dst;
    while (1) {
        *(double *)dst = ((double)(*(long *)src));
        dst += sizeof(double);
        src += sizeof(long);
    }
}
```

producing the following asm:

```asm
foo:
        push    rbp
        mov     rbp, rsp
.L2:
        mov     rax, QWORD PTR [rbp-8]
        mov     rax, QWORD PTR [rax]
        pxor    xmm0, xmm0
        cvtsi2sd xmm0, rax
        mov     rax, QWORD PTR [rbp-16]
        movsd   QWORD PTR [rax], xmm0
        add     QWORD PTR [rbp-16], 8
        add     QWORD PTR [rbp-8], 8
        jmp     .L2
```

and this typifies the C++ example:

```cpp
void foo() {
    using OutT = double;
    using InT = long;
    const InT* in_values;
    OutT* out_values;
    for (;;) {
        *out_values++ = static_cast<OutT>(*in_values++);
    }
}
```

producing the following asm:

```asm
foo():
        push    rbp
        mov     rbp, rsp
.L2:
        mov     rax, QWORD PTR [rbp-8]
        lea     rdx, [rax+8]
        mov     QWORD PTR [rbp-8], rdx
        mov     rax, QWORD PTR [rax]
        pxor    xmm0, xmm0
        cvtsi2sd xmm0, rax
        mov     rax, QWORD PTR [rbp-16]
        lea     rdx, [rax+8]
        mov     QWORD PTR [rbp-16], rdx
        movsd   QWORD PTR [rax], xmm0
        jmp     .L2
```

Maybe the processor just handles the add instructions more effectively than the lea / mov combination in the C++ example? FWIW, if you change the C++ code to do the pointer increments in subsequent statements rather than inline, the generated asm matches the C example:

```cpp
void foo() {
    using OutT = double;
    using InT = long;
    const InT* in_values;
    OutT* out_values;
    for (;;) {
        *out_values = static_cast<OutT>(*in_values);
        out_values++;
        in_values++;
    }
}
```
When casting int64 to float64?
Ah sorry - ignore that comment |
A simple numeric cast in pyarrow / Arrow C++ seems to be quite a bit slower compared to numpy. Often Arrow does more work (safe casting, handling nulls, ...), but I also see this for a plain unsafe int64 -> float64 cast of a 1d array without nulls, where Arrow is almost 4-5x slower than numpy. For a simple operation like that, this seems like an unexpectedly big difference.

The Python code I used to compare both:
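(The snippet itself is not included in this excerpt; roughly, it would have been along these lines, reconstructed from the benchmarks shown above rather than taken from the original post:)

```python
import timeit

import numpy as np
import pyarrow as pa

np_arr = np.full(1_000_000, 42, dtype="int64")
arr = pa.array(np_arr)

# Time the unsafe pyarrow cast against numpy's astype on the same data.
t_arrow = timeit.timeit(lambda: arr.cast(pa.float64(), safe=False), number=200)
t_numpy = timeit.timeit(lambda: np_arr.astype("float64"), number=200)
print(f"pyarrow: {t_arrow / 200 * 1e6:.1f} µs, numpy: {t_numpy / 200 * 1e6:.1f} µs")
```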
I see this both with my local development version and with released conda packages.