@rnkrtt commented Nov 5, 2025

Description

Fixes O(n²) string allocations in fold operations

Found two places where we're doing string concatenation inside fold, which creates a new string allocation on every iteration and gets very slow with large inputs.

Changes:

  • cairo_pie.rs: switched to pre-allocated string with write! macro
    for hex serialization
  • vm_exception.rs: replaced fold with join() for formatting error
    reference lists

Both were instances of the same anti-pattern, just in different places.
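For illustration, a minimal sketch of the vm_exception.rs change (the helper name and data here are made up, not the actual code):

```rust
// Hypothetical helper mirroring the vm_exception.rs fix: the old code built
// the list with fold + string concatenation, allocating a temporary String
// per element; join() computes the final length up front and allocates once.
fn format_reference_list(refs: &[String]) -> String {
    refs.join(", ")
}

fn main() {
    let refs = vec![
        "cast(fp + 2, felt*)".to_string(),
        "cast(ap - 1, felt*)".to_string(),
    ];
    // Prints: cast(fp + 2, felt*), cast(ap - 1, felt*)
    println!("{}", format_reference_list(&refs));
}
```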

@gabrielbosio
Collaborator

Hi, @rnkrtt! Do you have a benchmark that shows any improvement with this change?

@rnkrtt
Author

rnkrtt commented Nov 7, 2025

@gabrielbosio hey!

Consistent ~3.5x speedup across different data sizes.

Benchmark code:
use std::time::Instant;
use std::fmt::Write;

// Allocates a temporary String via format! for every byte, on top of the
// accumulator growth.
fn old_way(data: &[u8]) -> String {
    data.iter()
        .fold(String::new(), |acc, b| acc + &format!("{:02x}", b))
}

// Pre-allocates the exact output length (2 hex chars per byte) and appends
// in place with write!, avoiding the per-byte temporaries.
fn new_way(data: &[u8]) -> String {
    data.iter().fold(
        String::with_capacity(data.len() * 2),
        |mut string, b| {
            write!(&mut string, "{:02x}", b).unwrap();
            string
        },
    )
}

fn bench(name: &str, size: usize, iterations: usize, f: fn(&[u8]) -> String) {
    let data: Vec<u8> = (0..size).map(|i| (i % 256) as u8).collect();
    
    // Warmup
    for _ in 0..5 {
        let _ = f(&data);
    }
    
    // Actual measurement
    let start = Instant::now();
    for _ in 0..iterations {
        let _ = f(&data);
    }
    let total_duration = start.elapsed();
    let avg_duration = total_duration / iterations as u32;
    
    println!("{} (size={}): {:?} avg over {} runs", name, size, avg_duration, iterations);
}

fn main() {
    println!("Testing string fold performance...\n");
    
    let configs = [
        (100, 10000),    // size, iterations
        (1000, 1000),
        (5000, 500),
        (10000, 200),
    ];
    
    for (size, iterations) in configs {
        println!("--- Size: {} bytes ---", size);
        bench("Old (fold + concat)", size, iterations, old_way);
        bench("New (fold + write!) ", size, iterations, new_way);
        
        // Calculate speedup
        let data: Vec<u8> = (0..size).map(|i| (i % 256) as u8).collect();
        
        let start = Instant::now();
        for _ in 0..iterations { let _ = old_way(&data); }
        let old_time = start.elapsed();
        
        let start = Instant::now();
        for _ in 0..iterations { let _ = new_way(&data); }
        let new_time = start.elapsed();
        
        let speedup = old_time.as_secs_f64() / new_time.as_secs_f64();
        println!("Speedup: {:.2}x faster\n", speedup);
    }
}

Run with: rustc -O test_string_perf.rs && ./test_string_perf

@gabrielbosio
Collaborator

**Hyper Threading Benchmark results**




hyperfine -r 2 -n "hyper_threading_main threads: 1" 'RAYON_NUM_THREADS=1 ./hyper_threading_main' -n "hyper_threading_pr threads: 1" 'RAYON_NUM_THREADS=1 ./hyper_threading_pr'
Benchmark 1: hyper_threading_main threads: 1
  Time (mean ± σ):     22.910 s ±  0.068 s    [User: 22.000 s, System: 0.907 s]
  Range (min … max):   22.861 s … 22.958 s    2 runs
 
Benchmark 2: hyper_threading_pr threads: 1
  Time (mean ± σ):     22.728 s ±  0.005 s    [User: 21.787 s, System: 0.938 s]
  Range (min … max):   22.725 s … 22.731 s    2 runs
 
Summary
  hyper_threading_pr threads: 1 ran
    1.01 ± 0.00 times faster than hyper_threading_main threads: 1




hyperfine -r 2 -n "hyper_threading_main threads: 2" 'RAYON_NUM_THREADS=2 ./hyper_threading_main' -n "hyper_threading_pr threads: 2" 'RAYON_NUM_THREADS=2 ./hyper_threading_pr'
Benchmark 1: hyper_threading_main threads: 2
  Time (mean ± σ):     12.321 s ±  0.011 s    [User: 21.943 s, System: 1.020 s]
  Range (min … max):   12.313 s … 12.329 s    2 runs
 
Benchmark 2: hyper_threading_pr threads: 2
  Time (mean ± σ):     12.320 s ±  0.069 s    [User: 21.933 s, System: 0.935 s]
  Range (min … max):   12.272 s … 12.369 s    2 runs
 
Summary
  hyper_threading_pr threads: 2 ran
    1.00 ± 0.01 times faster than hyper_threading_main threads: 2




hyperfine -r 2 -n "hyper_threading_main threads: 4" 'RAYON_NUM_THREADS=4 ./hyper_threading_main' -n "hyper_threading_pr threads: 4" 'RAYON_NUM_THREADS=4 ./hyper_threading_pr'
Benchmark 1: hyper_threading_main threads: 4
  Time (mean ± σ):      9.801 s ±  0.105 s    [User: 34.475 s, System: 1.165 s]
  Range (min … max):    9.726 s …  9.875 s    2 runs
 
Benchmark 2: hyper_threading_pr threads: 4
  Time (mean ± σ):      9.764 s ±  0.061 s    [User: 34.426 s, System: 1.166 s]
  Range (min … max):    9.721 s …  9.807 s    2 runs
 
Summary
  hyper_threading_pr threads: 4 ran
    1.00 ± 0.01 times faster than hyper_threading_main threads: 4




hyperfine -r 2 -n "hyper_threading_main threads: 6" 'RAYON_NUM_THREADS=6 ./hyper_threading_main' -n "hyper_threading_pr threads: 6" 'RAYON_NUM_THREADS=6 ./hyper_threading_pr'
Benchmark 1: hyper_threading_main threads: 6
  Time (mean ± σ):      9.461 s ±  0.078 s    [User: 34.762 s, System: 1.189 s]
  Range (min … max):    9.405 s …  9.516 s    2 runs
 
Benchmark 2: hyper_threading_pr threads: 6
  Time (mean ± σ):      9.726 s ±  0.209 s    [User: 34.427 s, System: 1.135 s]
  Range (min … max):    9.578 s …  9.874 s    2 runs
 
Summary
  hyper_threading_main threads: 6 ran
    1.03 ± 0.02 times faster than hyper_threading_pr threads: 6




hyperfine -r 2 -n "hyper_threading_main threads: 8" 'RAYON_NUM_THREADS=8 ./hyper_threading_main' -n "hyper_threading_pr threads: 8" 'RAYON_NUM_THREADS=8 ./hyper_threading_pr'
Benchmark 1: hyper_threading_main threads: 8
  Time (mean ± σ):      9.547 s ±  0.366 s    [User: 35.092 s, System: 1.134 s]
  Range (min … max):    9.288 s …  9.806 s    2 runs
 
Benchmark 2: hyper_threading_pr threads: 8
  Time (mean ± σ):      9.518 s ±  0.151 s    [User: 34.739 s, System: 1.116 s]
  Range (min … max):    9.411 s …  9.625 s    2 runs
 
Summary
  hyper_threading_pr threads: 8 ran
    1.00 ± 0.04 times faster than hyper_threading_main threads: 8




hyperfine -r 2 -n "hyper_threading_main threads: 16" 'RAYON_NUM_THREADS=16 ./hyper_threading_main' -n "hyper_threading_pr threads: 16" 'RAYON_NUM_THREADS=16 ./hyper_threading_pr'
Benchmark 1: hyper_threading_main threads: 16
  Time (mean ± σ):      9.588 s ±  0.041 s    [User: 35.077 s, System: 1.172 s]
  Range (min … max):    9.559 s …  9.616 s    2 runs
 
Benchmark 2: hyper_threading_pr threads: 16
  Time (mean ± σ):      9.397 s ±  0.108 s    [User: 35.462 s, System: 1.215 s]
  Range (min … max):    9.321 s …  9.473 s    2 runs
 
Summary
  hyper_threading_pr threads: 16 ran
    1.02 ± 0.01 times faster than hyper_threading_main threads: 16


@gabrielbosio
Collaborator

Benchmark Results for unmodified programs 🚀

| Command | Mean | Min | Max | Relative |
|---|---|---|---|---|
| base big_factorial | 1.940 ± 0.011 s | 1.924 s | 1.958 s | 1.00 ± 0.01 |
| head big_factorial | 1.930 ± 0.022 s | 1.911 s | 1.983 s | 1.00 |
| base big_fibonacci | 1.886 ± 0.017 s | 1.873 s | 1.928 s | 1.01 ± 0.01 |
| head big_fibonacci | 1.862 ± 0.012 s | 1.843 s | 1.879 s | 1.00 |
| base blake2s_integration_benchmark | 6.629 ± 0.061 s | 6.539 s | 6.704 s | 1.00 ± 0.02 |
| head blake2s_integration_benchmark | 6.605 ± 0.103 s | 6.465 s | 6.794 s | 1.00 |
| base compare_arrays_200000 | 2.025 ± 0.022 s | 1.995 s | 2.074 s | 1.01 ± 0.01 |
| head compare_arrays_200000 | 2.002 ± 0.016 s | 1.981 s | 2.032 s | 1.00 |
| base dict_integration_benchmark | 1.334 ± 0.006 s | 1.326 s | 1.345 s | 1.00 ± 0.01 |
| head dict_integration_benchmark | 1.328 ± 0.011 s | 1.319 s | 1.355 s | 1.00 |
| base field_arithmetic_get_square_benchmark | 1.131 ± 0.008 s | 1.117 s | 1.143 s | 1.01 ± 0.01 |
| head field_arithmetic_get_square_benchmark | 1.124 ± 0.009 s | 1.112 s | 1.137 s | 1.00 |
| base integration_builtins | 6.716 ± 0.027 s | 6.683 s | 6.763 s | 1.00 ± 0.01 |
| head integration_builtins | 6.686 ± 0.065 s | 6.616 s | 6.805 s | 1.00 |
| base keccak_integration_benchmark | 6.798 ± 0.151 s | 6.686 s | 7.190 s | 1.00 ± 0.03 |
| head keccak_integration_benchmark | 6.777 ± 0.105 s | 6.663 s | 6.946 s | 1.00 |
| base linear_search | 2.019 ± 0.018 s | 1.994 s | 2.047 s | 1.01 ± 0.01 |
| head linear_search | 2.005 ± 0.016 s | 1.987 s | 2.032 s | 1.00 |
| base math_cmp_and_pow_integration_benchmark | 1.438 ± 0.006 s | 1.428 s | 1.446 s | 1.00 ± 0.01 |
| head math_cmp_and_pow_integration_benchmark | 1.433 ± 0.013 s | 1.411 s | 1.460 s | 1.00 |
| base math_integration_benchmark | 1.414 ± 0.011 s | 1.395 s | 1.429 s | 1.01 ± 0.01 |
| head math_integration_benchmark | 1.397 ± 0.008 s | 1.387 s | 1.415 s | 1.00 |
| base memory_integration_benchmark | 1.137 ± 0.014 s | 1.129 s | 1.173 s | 1.00 ± 0.02 |
| head memory_integration_benchmark | 1.134 ± 0.015 s | 1.121 s | 1.172 s | 1.00 |
| base operations_with_data_structures_benchmarks | 1.459 ± 0.013 s | 1.445 s | 1.488 s | 1.00 |
| head operations_with_data_structures_benchmarks | 1.463 ± 0.009 s | 1.452 s | 1.477 s | 1.00 ± 0.01 |
| base pedersen | 508.6 ± 1.2 ms | 507.6 ms | 511.3 ms | 1.00 ± 0.01 |
| head pedersen | 508.0 ± 2.9 ms | 503.7 ms | 511.6 ms | 1.00 |
| base poseidon_integration_benchmark | 588.3 ± 2.7 ms | 584.5 ms | 593.6 ms | 1.00 |
| head poseidon_integration_benchmark | 590.7 ± 3.5 ms | 583.6 ms | 596.1 ms | 1.00 ± 0.01 |
| base secp_integration_benchmark | 1.714 ± 0.011 s | 1.703 s | 1.742 s | 1.00 |
| head secp_integration_benchmark | 1.717 ± 0.013 s | 1.694 s | 1.739 s | 1.00 ± 0.01 |
| base set_integration_benchmark | 650.3 ± 2.3 ms | 647.6 ms | 655.8 ms | 1.07 ± 0.01 |
| head set_integration_benchmark | 607.5 ± 2.6 ms | 605.4 ms | 613.4 ms | 1.00 |
| base uint256_integration_benchmark | 3.849 ± 0.055 s | 3.786 s | 3.922 s | 1.02 ± 0.02 |
| head uint256_integration_benchmark | 3.784 ± 0.039 s | 3.743 s | 3.873 s | 1.00 |

@gabrielbosio
Collaborator

The repo's benchmarks don't show improvements, so it would be great if you could add a benchmark that shows how much this change improves execution time in context.
