csv: Add option to specify custom null values #4795

Merged
merged 2 commits into from
Sep 13, 2023

vrongmeal
Contributor

Custom strings can now be specified as NULL values for CSVs. This allows reading CSV files that use placeholders for NULL values instead of empty strings.
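
As a rough illustration of the behaviour described above, the sketch below treats a caller-supplied set of placeholder strings as NULL while parsing one column. `parse_i64_column` is a hypothetical helper written for this example, not the arrow-csv implementation, and treating the empty string as NULL alongside the placeholders is an assumption.

```rust
use std::collections::HashSet;
use std::num::ParseIntError;

/// Hypothetical helper (not part of arrow-csv): parse one CSV column of
/// integers, mapping empty strings and any configured placeholder to `None`.
fn parse_i64_column(
    values: &[&str],
    nulls: &HashSet<String>,
) -> Result<Vec<Option<i64>>, ParseIntError> {
    values
        .iter()
        .map(|raw| {
            // Assumption: an empty field is always NULL, and the custom
            // placeholders are checked in addition to that default.
            if raw.is_empty() || nulls.contains(*raw) {
                Ok(None)
            } else {
                // Any other value must parse as an integer, or the
                // whole column fails with the parse error.
                raw.parse::<i64>().map(Some)
            }
        })
        .collect()
}
```

With the placeholders `NULL` and `N/A` configured, the values `1`, `NULL`, an empty field, `N/A` and `42` come back as `Some(1), None, None, None, Some(42)`.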

Which issue does this PR close?

Closes #4794

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

github-actions bot added the arrow (Changes to the arrow crate) label on Sep 7, 2023
@tustvold
Contributor

tustvold commented Sep 7, 2023

Have you run the benchmarks for this change?

@vrongmeal
Contributor Author

> Have you run the benchmarks for this change?

❯ cargo bench --bench csv_reader
   Compiling arrow-csv v46.0.0 (/Users/vrongmeal/Projects/arrow-rs/arrow-csv)
   Compiling arrow v46.0.0 (/Users/vrongmeal/Projects/arrow-rs/arrow)
    Finished bench [optimized] target(s) in 4.92s
     Running benches/csv_reader.rs (target/release/deps/csv_reader-dadff296bc34e88b)
4096 u64(0) - 128       time:   [238.26 µs 238.78 µs 239.28 µs]
                        change: [+6.9565% +7.2459% +7.5214%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

4096 u64(0) - 1024      time:   [208.08 µs 208.18 µs 208.29 µs]
                        change: [+7.0050% +7.4168% +7.7855%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

4096 u64(0) - 4096      time:   [214.62 µs 215.65 µs 216.80 µs]
                        change: [+7.4572% +7.8687% +8.3141%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 30 outliers among 100 measurements (30.00%)
  16 (16.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild
  11 (11.00%) high severe

4096 i64(0) - 128       time:   [266.09 µs 266.39 µs 266.66 µs]
                        change: [+5.7398% +5.8570% +5.9624%] (p = 0.00 < 0.05)
                        Performance has regressed.

4096 i64(0) - 1024      time:   [237.48 µs 238.31 µs 239.60 µs]
                        change: [+5.0834% +5.3245% +5.6549%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

4096 i64(0) - 4096      time:   [245.08 µs 246.59 µs 248.04 µs]
                        change: [+6.1555% +6.7570% +7.3428%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

4096 f32(0) - 128       time:   [257.63 µs 257.84 µs 258.05 µs]
                        change: [+7.4875% +7.5943% +7.6967%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) low mild
  8 (8.00%) high mild
  1 (1.00%) high severe

4096 f32(0) - 1024      time:   [229.77 µs 229.95 µs 230.15 µs]
                        change: [+8.2822% +8.4132% +8.5422%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

4096 f32(0) - 4096      time:   [238.71 µs 239.11 µs 239.45 µs]
                        change: [+6.0796% +6.6259% +7.0354%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild

4096 f64(0) - 128       time:   [280.21 µs 280.50 µs 280.81 µs]
                        change: [+7.8352% +7.9521% +8.0750%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

4096 f64(0) - 1024      time:   [252.86 µs 252.96 µs 253.07 µs]
                        change: [+7.8311% +7.9334% +8.0289%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

4096 f64(0) - 4096      time:   [266.61 µs 267.84 µs 269.00 µs]
                        change: [+4.7965% +5.3015% +5.8712%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 22 outliers among 100 measurements (22.00%)
  20 (20.00%) low mild
  2 (2.00%) high mild

4096 string(10, 0) - 128
                        time:   [120.30 µs 120.46 µs 120.62 µs]
                        change: [-0.4443% -0.3186% -0.1828%] (p = 0.00 < 0.05)
                        Change within noise threshold.

4096 string(10, 0) - 1024
                        time:   [100.84 µs 100.87 µs 100.90 µs]
                        change: [-0.1108% -0.0352% +0.0550%] (p = 0.38 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

4096 string(10, 0) - 4096
                        time:   [109.47 µs 110.80 µs 112.02 µs]
                        change: [-3.6030% -2.7357% -1.9150%] (p = 0.00 < 0.05)
                        Performance has improved.

4096 string(30, 0) - 128
                        time:   [199.71 µs 199.81 µs 199.90 µs]
                        change: [-0.1739% -0.0899% -0.0088%] (p = 0.04 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

4096 string(30, 0) - 1024
                        time:   [178.48 µs 178.53 µs 178.59 µs]
                        change: [-0.0855% +0.0097% +0.0933%] (p = 0.84 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

4096 string(30, 0) - 4096
                        time:   [189.35 µs 190.53 µs 191.52 µs]
                        change: [-3.7107% -2.9717% -2.3294%] (p = 0.00 < 0.05)
                        Performance has improved.

4096 string(100, 0) - 128
                        time:   [430.35 µs 430.69 µs 431.08 µs]
                        change: [-0.0244% +0.0797% +0.1834%] (p = 0.13 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

4096 string(100, 0) - 1024
                        time:   [465.68 µs 469.26 µs 472.13 µs]
                        change: [-2.7082% -1.3919% -0.1599%] (p = 0.03 < 0.05)
                        Change within noise threshold.
Found 23 outliers among 100 measurements (23.00%)
  20 (20.00%) low severe
  3 (3.00%) low mild

4096 string(100, 0) - 4096
                        time:   [436.40 µs 437.63 µs 438.81 µs]
                        change: [+0.1291% +0.4030% +0.6693%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  12 (12.00%) low mild
  1 (1.00%) high mild

4096 string(100, 0.5) - 128
                        time:   [310.80 µs 310.96 µs 311.13 µs]
                        change: [-0.6979% -0.5621% -0.4446%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

4096 string(100, 0.5) - 1024
                        time:   [299.95 µs 301.09 µs 302.20 µs]
                        change: [-1.3131% -0.8438% -0.3311%] (p = 0.00 < 0.05)
                        Change within noise threshold.

4096 string(100, 0.5) - 4096
                        time:   [300.49 µs 302.11 µs 303.68 µs]
                        change: [-2.2359% -1.6753% -1.1472%] (p = 0.00 < 0.05)
                        Performance has improved.

4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 128
                        time:   [965.35 µs 966.12 µs 966.97 µs]
                        change: [-0.4355% +0.1689% +0.5690%] (p = 0.61 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 1024
                        time:   [920.97 µs 925.02 µs 929.03 µs]
                        change: [+1.4017% +2.0410% +2.6493%] (p = 0.00 < 0.05)
                        Performance has regressed.

4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 4096
                        time:   [894.44 µs 900.48 µs 907.09 µs]
                        change: [+0.4112% +1.0848% +1.7978%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 21 outliers among 100 measurements (21.00%)
  19 (19.00%) high mild
  2 (2.00%) high severe

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 128
                        time:   [802.37 µs 803.12 µs 803.87 µs]
                        change: [+3.9605% +4.1417% +4.3429%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 1024
                        time:   [732.96 µs 733.86 µs 734.95 µs]
                        change: [+3.7733% +3.9502% +4.1396%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 4096
                        time:   [726.96 µs 728.58 µs 730.19 µs]
                        change: [+4.1967% +4.4437% +4.6878%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
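
The verdict lines in the output above appear to follow a two-step rule: first a significance test on the p-value, then a comparison of the confidence interval for the change against a noise threshold. The sketch below reproduces that rule under the assumption of a 0.05 significance level and a 1% noise threshold. These constants and the `classify` function are illustrative assumptions that happen to match the log, not something stated in this PR.

```rust
/// The four verdicts that appear in the benchmark output.
#[derive(Debug, PartialEq)]
enum Verdict {
    NoChange,    // "No change in performance detected."
    WithinNoise, // "Change within noise threshold."
    Improved,    // "Performance has improved."
    Regressed,   // "Performance has regressed."
}

/// Classify a change from the lower and upper bounds of its confidence
/// interval (as fractions, so +6.9565% is 0.069565) and the p-value.
fn classify(lower: f64, upper: f64, p_value: f64) -> Verdict {
    // Assumed thresholds; see the lead-in.
    const SIGNIFICANCE: f64 = 0.05;
    const NOISE: f64 = 0.01;

    if p_value >= SIGNIFICANCE {
        // Not statistically significant, whatever the interval says.
        Verdict::NoChange
    } else if upper < -NOISE {
        // The whole interval is a speed-up larger than the noise band.
        Verdict::Improved
    } else if lower > NOISE {
        // The whole interval is a slow-down larger than the noise band.
        Verdict::Regressed
    } else {
        // Significant, but the interval overlaps the +/-1% band.
        Verdict::WithinNoise
    }
}
```

Under this reading, `4096 u64(0) - 128` (interval +6.9565% to +7.5214%, p = 0.00) is a regression, while the 4096-batch run of the four-column string and `i64` benchmark (interval +0.4112% to +1.7978%) is reported as within noise because its lower bound sits inside the 1% band.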

@vrongmeal
Contributor Author

I'll try to make some improvements here.

@tustvold
Contributor

tustvold commented Sep 8, 2023

You may need to help LLVM by pulling the conditional out of the loop for it
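
A minimal sketch of what pulling the conditional out of the loop can look like, assuming the per-value check depends on whether any custom nulls are configured. Both functions are hypothetical and written for this example; they are not the code in this PR.

```rust
use std::collections::HashSet;

/// Checks "are custom nulls configured?" once per value, inside the loop.
fn count_nulls_naive(values: &[&str], nulls: &HashSet<String>) -> usize {
    let mut count = 0;
    for v in values {
        if v.is_empty() || (!nulls.is_empty() && nulls.contains(*v)) {
            count += 1;
        }
    }
    count
}

/// Same result, with the loop-invariant branch hoisted out: the check runs
/// once per batch, and each branch gets its own simpler loop.
fn count_nulls_hoisted(values: &[&str], nulls: &HashSet<String>) -> usize {
    if nulls.is_empty() {
        // Common case: only the empty string is NULL, no hashing at all.
        values.iter().filter(|v| v.is_empty()).count()
    } else {
        // Custom placeholders configured: pay for the set lookup here only.
        values
            .iter()
            .filter(|v| v.is_empty() || nulls.contains(**v))
            .count()
    }
}
```

The optimizer can sometimes perform this transformation on its own, but writing the two loops out explicitly removes the dependence on it doing so.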

@vrongmeal
Contributor Author

Updated changes:

❯ cargo bench --bench csv_reader
   Compiling arrow-csv v46.0.0 (/Users/vrongmeal/Projects/arrow-rs/arrow-csv)
   Compiling arrow v46.0.0 (/Users/vrongmeal/Projects/arrow-rs/arrow)
    Finished bench [optimized] target(s) in 4.89s
     Running benches/csv_reader.rs (target/release/deps/csv_reader-dadff296bc34e88b)
4096 u64(0) - 128       time:   [228.45 µs 228.63 µs 228.84 µs]
                        change: [+3.2331% +3.4075% +3.5555%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild

4096 u64(0) - 1024      time:   [199.65 µs 199.84 µs 200.02 µs]
                        change: [+3.5994% +3.7003% +3.8060%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

4096 u64(0) - 4096      time:   [207.64 µs 208.42 µs 209.10 µs]
                        change: [+1.6002% +2.2305% +2.8814%] (p = 0.00 < 0.05)
                        Performance has regressed.

4096 i64(0) - 128       time:   [252.46 µs 252.63 µs 252.85 µs]
                        change: [-1.1848% -1.0121% -0.8452%] (p = 0.00 < 0.05)
                        Change within noise threshold.

4096 i64(0) - 1024      time:   [226.17 µs 226.28 µs 226.40 µs]
                        change: [-1.3198% -1.2067% -1.1007%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

4096 i64(0) - 4096      time:   [237.37 µs 238.27 µs 239.31 µs]
                        change: [-0.2543% +0.2821% +0.8587%] (p = 0.33 > 0.05)
                        No change in performance detected.

4096 f32(0) - 128       time:   [242.69 µs 242.96 µs 243.22 µs]
                        change: [+0.2529% +0.4053% +0.5640%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

4096 f32(0) - 1024      time:   [216.50 µs 216.94 µs 217.40 µs]
                        change: [+0.8552% +1.0164% +1.1941%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

4096 f32(0) - 4096      time:   [218.59 µs 219.23 µs 219.86 µs]
                        change: [-2.4562% -2.1355% -1.7614%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  10 (10.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

4096 f64(0) - 128       time:   [265.63 µs 266.24 µs 266.88 µs]
                        change: [+0.7456% +0.9606% +1.2133%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

4096 f64(0) - 1024      time:   [235.71 µs 235.82 µs 235.95 µs]
                        change: [+0.2742% +0.3610% +0.4471%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

4096 f64(0) - 4096      time:   [247.69 µs 249.25 µs 250.70 µs]
                        change: [-0.6911% +0.0159% +0.7584%] (p = 0.96 > 0.05)
                        No change in performance detected.
Found 29 outliers among 100 measurements (29.00%)
  12 (12.00%) low severe
  3 (3.00%) low mild
  10 (10.00%) high mild
  4 (4.00%) high severe

4096 string(10, 0) - 128
                        time:   [120.22 µs 120.44 µs 120.66 µs]
                        change: [-0.1170% +0.0390% +0.1963%] (p = 0.63 > 0.05)
                        No change in performance detected.

4096 string(10, 0) - 1024
                        time:   [101.07 µs 101.10 µs 101.15 µs]
                        change: [-0.1678% -0.0755% +0.0136%] (p = 0.11 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  5 (5.00%) high severe

4096 string(10, 0) - 4096
                        time:   [118.83 µs 119.73 µs 120.52 µs]
                        change: [+1.7587% +2.7841% +3.7795%] (p = 0.00 < 0.05)
                        Performance has regressed.

4096 string(30, 0) - 128
                        time:   [199.96 µs 200.23 µs 200.45 µs]
                        change: [-0.4955% -0.3786% -0.2600%] (p = 0.00 < 0.05)
                        Change within noise threshold.

4096 string(30, 0) - 1024
                        time:   [178.41 µs 178.47 µs 178.54 µs]
                        change: [-0.3514% -0.2803% -0.2062%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

4096 string(30, 0) - 4096
                        time:   [184.26 µs 186.21 µs 188.00 µs]
                        change: [-1.9326% -0.9923% -0.1373%] (p = 0.02 < 0.05)
                        Change within noise threshold.

4096 string(100, 0) - 128
                        time:   [430.75 µs 431.09 µs 431.41 µs]
                        change: [-0.3590% -0.2513% -0.1409%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

4096 string(100, 0) - 1024
                        time:   [474.17 µs 479.79 µs 484.65 µs]
                        change: [+5.4907% +7.2059% +8.9017%] (p = 0.00 < 0.05)
                        Performance has regressed.

4096 string(100, 0) - 4096
                        time:   [435.40 µs 436.45 µs 437.52 µs]
                        change: [-2.3235% -2.1234% -1.8919%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  11 (11.00%) high mild
  3 (3.00%) high severe

4096 string(100, 0.5) - 128
                        time:   [310.66 µs 312.32 µs 315.34 µs]
                        change: [-2.0074% -1.5844% -0.9803%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

4096 string(100, 0.5) - 1024
                        time:   [291.23 µs 292.01 µs 292.91 µs]
                        change: [-4.9235% -4.1627% -3.4117%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

4096 string(100, 0.5) - 4096
                        time:   [305.76 µs 306.96 µs 308.23 µs]
                        change: [-1.0322% -0.4965% +0.0523%] (p = 0.07 > 0.05)
                        No change in performance detected.

4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 128
                        time:   [949.42 µs 950.06 µs 950.69 µs]
                        change: [-0.3684% -0.2576% -0.1468%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 1024
                        time:   [887.01 µs 888.57 µs 890.18 µs]
                        change: [-1.5528% -1.4042% -1.2439%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild

4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 4096
                        time:   [900.61 µs 909.35 µs 917.34 µs]
                        change: [-0.2915% +0.6089% +1.5111%] (p = 0.18 > 0.05)
                        No change in performance detected.

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 128
                        time:   [770.68 µs 777.47 µs 790.18 µs]
                        change: [-0.1682% +1.5906% +4.4086%] (p = 0.21 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 1024
                        time:   [706.19 µs 707.05 µs 707.90 µs]
                        change: [-0.6163% -0.3262% -0.0437%] (p = 0.03 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 4096
                        time:   [694.56 µs 695.28 µs 696.04 µs]
                        change: [-0.8192% -0.6161% -0.4063%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild

@tustvold
Contributor

tustvold commented Sep 9, 2023

I also ran the benchmarks and got similar results. I'm honestly somewhat surprised, but I guess string parsing inherently has a lot of conditionals.

Edit: Actually, running this with slightly more realistic benchmarks that use smaller integers still shows a 10% performance regression. I'll get a PR up with these so that you can take a look.

Edit Edit: Benchmarks in #4803

@@ -241,6 +243,11 @@ impl Format {
        self
    }

    pub fn with_nulls(mut self, nulls: HashSet<String>) -> Self {
Contributor

@tustvold tustvold Sep 9, 2023

I wonder if a Regex might be a more flexible interface, and more consistent with how we express things for schema inference. It might also perform better.

Contributor Author

Makes sense! I will try this with regex.

@vrongmeal
Contributor Author

Regex performs worse when a custom regex is provided. I can push a commit with the regex change. The benchmarks should be similar.

vrongmeal force-pushed the csv-nulls branch 2 times, most recently from f7c72ae to 4c11b3b, on September 11, 2023 14:58
@vrongmeal
Contributor Author

The latest changes seem to do quite well:

4096 i32_small(0) - 128 time:   [130.33 µs 132.50 µs 137.03 µs]
                        change: [-0.0644% +3.0447% +8.8724%] (p = 0.26 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe

4096 i32_small(0) - 1024
                        time:   [102.34 µs 102.38 µs 102.42 µs]
                        change: [+0.7437% +1.1352% +1.3835%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

4096 i32_small(0) - 4096
                        time:   [99.582 µs 99.633 µs 99.689 µs]
                        change: [+0.6403% +0.8823% +1.0898%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

4096 i32(0) - 128       time:   [205.46 µs 205.84 µs 206.24 µs]
                        change: [-3.7421% -2.6800% -1.8625%] (p = 0.00 < 0.05)
                        Performance has improved.

4096 i32(0) - 1024      time:   [179.71 µs 179.89 µs 180.08 µs]
                        change: [-1.7409% -1.1017% -0.5766%] (p = 0.00 < 0.05)
                        Change within noise threshold.

4096 i32(0) - 4096      time:   [183.32 µs 184.14 µs 184.85 µs]
                        change: [-1.9470% -1.4436% -0.9425%] (p = 0.00 < 0.05)
                        Change within noise threshold.

4096 u64_small(0) - 128 time:   [130.11 µs 132.99 µs 137.36 µs]
                        change: [-3.7062% -1.0089% +1.6673%] (p = 0.49 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

4096 u64_small(0) - 1024
                        time:   [103.47 µs 104.38 µs 106.41 µs]
                        change: [-0.9524% -0.0361% +1.5091%] (p = 0.96 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

4096 u64_small(0) - 4096
                        time:   [100.65 µs 100.68 µs 100.71 µs]
                        change: [-3.0726% -1.7815% -0.8379%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

4096 u64(0) - 128       time:   [219.91 µs 220.18 µs 220.45 µs]
                        change: [-0.5549% -0.4068% -0.2597%] (p = 0.00 < 0.05)
                        Change within noise threshold.

4096 u64(0) - 1024      time:   [193.99 µs 194.13 µs 194.27 µs]
                        change: [+0.6830% +0.8368% +0.9901%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

4096 u64(0) - 4096      time:   [207.50 µs 208.06 µs 208.60 µs]
                        change: [-0.7799% +2.1297% +3.8490%] (p = 0.07 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

4096 i64_small(0) - 128 time:   [148.42 µs 148.73 µs 149.07 µs]
                        change: [-1.1969% -0.8221% -0.5079%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

4096 i64_small(0) - 1024
                        time:   [122.20 µs 122.26 µs 122.33 µs]
                        change: [-0.4041% -0.2750% -0.1462%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

4096 i64_small(0) - 4096
                        time:   [119.54 µs 119.59 µs 119.64 µs]
                        change: [-0.6484% -0.5039% -0.3624%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

4096 i64(0) - 128       time:   [253.10 µs 253.41 µs 253.69 µs]
                        change: [-4.9520% -3.1144% -1.7332%] (p = 0.00 < 0.05)
                        Performance has improved.

4096 i64(0) - 1024      time:   [225.76 µs 225.85 µs 225.95 µs]
                        change: [-1.8082% -1.0685% -0.5455%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

4096 i64(0) - 4096      time:   [238.72 µs 238.94 µs 239.18 µs]
                        change: [-2.7638% -1.5168% -0.4835%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

4096 f32_small(0) - 128 time:   [189.77 µs 190.08 µs 190.42 µs]
                        change: [-3.0283% -2.3206% -1.8603%] (p = 0.00 < 0.05)
                        Performance has improved.

4096 f32_small(0) - 1024
                        time:   [163.36 µs 163.72 µs 164.23 µs]
                        change: [-0.8370% -0.5834% -0.3310%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) high mild
  8 (8.00%) high severe

4096 f32_small(0) - 4096
                        time:   [160.66 µs 160.71 µs 160.76 µs]
                        change: [-8.3509% -4.9169% -2.0822%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

4096 f32(0) - 128       time:   [239.17 µs 239.41 µs 239.68 µs]
                        change: [-5.3499% -3.4406% -2.0248%] (p = 0.00 < 0.05)
                        Performance has improved.

4096 f32(0) - 1024      time:   [212.94 µs 213.14 µs 213.34 µs]
                        change: [-8.6109% -3.8452% -0.4918%] (p = 0.05 > 0.05)
                        No change in performance detected.
Found 20 outliers among 100 measurements (20.00%)
  9 (9.00%) low mild
  8 (8.00%) high mild
  3 (3.00%) high severe

4096 f32(0) - 4096      time:   [220.86 µs 221.13 µs 221.41 µs]
                        change: [-0.4012% -0.0585% +0.2439%] (p = 0.73 > 0.05)
                        No change in performance detected.

4096 f64_small(0) - 128 time:   [190.02 µs 190.37 µs 190.71 µs]
                        change: [-1.2881% -1.1310% -0.9806%] (p = 0.00 < 0.05)
                        Change within noise threshold.

4096 f64_small(0) - 1024
                        time:   [163.64 µs 163.70 µs 163.76 µs]
                        change: [-0.0629% +0.0076% +0.0864%] (p = 0.85 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

4096 f64_small(0) - 4096
                        time:   [161.39 µs 161.45 µs 161.52 µs]
                        change: [-4.5426% -1.5487% +0.0554%] (p = 0.40 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

4096 f64(0) - 128       time:   [265.23 µs 265.51 µs 265.82 µs]
                        change: [-8.5473% -3.9192% -0.5514%] (p = 0.06 > 0.05)
                        No change in performance detected.

4096 f64(0) - 1024      time:   [236.95 µs 237.05 µs 237.14 µs]
                        change: [-0.1151% -0.0326% +0.0409%] (p = 0.44 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

4096 f64(0) - 4096      time:   [246.06 µs 247.75 µs 249.53 µs]
                        change: [-0.9090% -0.2578% +0.4224%] (p = 0.44 > 0.05)
                        No change in performance detected.

4096 string(10, 0) - 128
                        time:   [120.45 µs 120.63 µs 120.79 µs]
                        change: [-1.1682% -1.0363% -0.9072%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

4096 string(10, 0) - 1024
                        time:   [101.01 µs 101.04 µs 101.08 µs]
                        change: [-0.0422% +0.0156% +0.0662%] (p = 0.58 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

4096 string(10, 0) - 4096
                        time:   [118.38 µs 118.87 µs 119.37 µs]
                        change: [+7.7898% +8.1893% +8.5642%] (p = 0.00 < 0.05)
                        Performance has regressed.

4096 string(30, 0) - 128
                        time:   [199.88 µs 199.99 µs 200.08 µs]
                        change: [-0.5842% -0.5261% -0.4711%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

4096 string(30, 0) - 1024
                        time:   [178.55 µs 178.81 µs 179.19 µs]
                        change: [-0.1499% -0.0245% +0.0912%] (p = 0.71 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

4096 string(30, 0) - 4096
                        time:   [194.25 µs 194.96 µs 195.63 µs]
                        change: [+1.5664% +2.1975% +2.7777%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low severe
  5 (5.00%) low mild
  1 (1.00%) high mild

4096 string(100, 0) - 128
                        time:   [430.01 µs 430.44 µs 430.87 µs]
                        change: [-0.7295% -0.5852% -0.3922%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

4096 string(100, 0) - 1024
                        time:   [443.96 µs 452.71 µs 460.82 µs]
                        change: [-5.6742% -4.0818% -2.5121%] (p = 0.00 < 0.05)
                        Performance has improved.

4096 string(100, 0) - 4096
                        time:   [438.85 µs 439.64 µs 440.49 µs]
                        change: [-0.4321% -0.1080% +0.2058%] (p = 0.51 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

4096 string(100, 0.5) - 128
                        time:   [312.43 µs 312.86 µs 313.23 µs]
                        change: [-0.4155% -0.2683% -0.1336%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 24 outliers among 100 measurements (24.00%)
  11 (11.00%) low severe
  12 (12.00%) low mild
  1 (1.00%) high mild

4096 string(100, 0.5) - 1024
                        time:   [293.62 µs 294.98 µs 296.50 µs]
                        change: [-2.3134% -1.8461% -1.3810%] (p = 0.00 < 0.05)
                        Performance has improved.

4096 string(100, 0.5) - 4096
                        time:   [300.82 µs 302.16 µs 303.50 µs]
                        change: [-1.0920% -0.7349% -0.3356%] (p = 0.00 < 0.05)
                        Change within noise threshold.

4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 128
                        time:   [950.54 µs 951.49 µs 952.46 µs]
                        change: [-0.0619% +0.0370% +0.1391%] (p = 0.47 > 0.05)
                        No change in performance detected.

4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 1024
                        time:   [882.83 µs 886.20 µs 889.62 µs]
                        change: [-2.5615% -2.1071% -1.6910%] (p = 0.00 < 0.05)
                        Performance has improved.

4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 4096
                        time:   [879.77 µs 881.35 µs 882.90 µs]
                        change: [-3.4161% -2.7473% -2.0204%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 128
                        time:   [768.85 µs 769.68 µs 770.49 µs]
                        change: [-1.3762% -0.9834% -0.6629%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 1024
                        time:   [706.64 µs 707.19 µs 707.71 µs]
                        change: [-3.9781% -3.2693% -2.6500%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 4096
                        time:   [700.29 µs 701.35 µs 702.44 µs]
                        change: [-0.9342% -0.4612% -0.0244%] (p = 0.05 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe

Contributor

@tustvold tustvold left a comment

Looking good, just left some relatively minor comments

@@ -241,6 +242,11 @@ impl Format {
self
}

pub fn with_null_regex(mut self, null_regex: Regex) -> Self {
Contributor

Perhaps some doc comments?

Contributor Author

Yup, I will add doc comments.

@@ -319,6 +325,7 @@ impl Format {
if let Some(t) = self.terminator {
builder.terminator(csv::Terminator::Any(t));
}
// TODO: Null regex
Contributor

?

Contributor Author

My mistake, I thought it was building the arrow CSV reader.

@@ -336,6 +343,7 @@ impl Format {
if let Some(t) = self.terminator {
builder.terminator(csv_core::Terminator::Any(t));
}
// TODO: Null regex
Contributor

?

 ) -> Result<ArrayRef, ArrowError> {
     let mut decimal_builder = PrimitiveBuilder::<T>::with_capacity(rows.len());
     for row in rows.iter() {
         let s = row.get(col_idx);
-        if s.is_empty() {
+        if s.is_empty() || null_regex.is_some_and(|r| r.is_match(s)) {
Contributor

@tustvold tustvold Sep 11, 2023

If a null regex is provided should we also treat empty strings as nulls?

Edit: BTW TIL about is_some_and, very nice 👌

Contributor Author

Let's avoid checking is_empty in that case. Users can configure the regex to include it as well.
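
The difference between the line under review and the behaviour proposed in this reply can be sketched as follows. This is a minimal, std-only illustration and not the code in the PR: `NullMatcher`, `is_null_as_reviewed`, `is_null_as_proposed`, and the two example matchers are hypothetical names, and a plain function pointer stands in for `Regex::is_match` so the sketch has no dependencies.

```rust
/// Hypothetical stand-in for the optional null regex: any predicate over the
/// raw field text. In the PR this role is played by `Regex::is_match`.
type NullMatcher = fn(&str) -> bool;

/// Shape of the check in the reviewed diff line: an empty field is always
/// NULL, and the matcher can only add further NULL spellings.
fn is_null_as_reviewed(s: &str, matcher: Option<NullMatcher>) -> bool {
    s.is_empty() || matcher.is_some_and(|m| m(s))
}

/// Shape of the check proposed in the reply: when a matcher is configured it
/// alone decides, and the empty-string rule applies only without one.
fn is_null_as_proposed(s: &str, matcher: Option<NullMatcher>) -> bool {
    match matcher {
        Some(m) => m(s),
        None => s.is_empty(),
    }
}

/// Example matcher that accepts two placeholder spellings but not "".
fn placeholders_only(s: &str) -> bool {
    s == "NULL" || s == "N/A"
}

/// Example matcher that also opts back in to treating "" as NULL.
fn placeholders_or_empty(s: &str) -> bool {
    s.is_empty() || placeholders_only(s)
}
```

Under the proposed variant, `is_null_as_proposed("", Some(placeholders_only))` is `false`, so users who still want empty fields read as NULL would pass something like `placeholders_or_empty` (with a real regex, a pattern that also matches the empty string).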

null_regex: Option<Regex>,
}

impl Debug for Decoder {
Contributor

Why this change?

Contributor Author

I'll undo this. It was left over from when the struct had an is_null closure.
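
For context, a closure field is a common reason to hand-write `Debug`: closures and `dyn Fn` trait objects do not implement `Debug`, so `#[derive(Debug)]` fails on a struct that stores one. The sketch below assumes that this is what happened here. `SketchDecoder` and its fields are hypothetical and do not reflect the actual `Decoder` layout.

```rust
use std::fmt;

/// Hypothetical decoder-like struct that stores its null check as a closure.
struct SketchDecoder {
    batch_size: usize,
    is_null: Box<dyn Fn(&str) -> bool>,
}

impl SketchDecoder {
    /// Applies the stored null check to one raw field.
    fn treats_as_null(&self, s: &str) -> bool {
        (self.is_null)(s)
    }
}

// `#[derive(Debug)]` would not compile for this struct because
// `dyn Fn(&str) -> bool` has no `Debug` impl, so the impl is written by hand
// and leaves the closure field out.
impl fmt::Debug for SketchDecoder {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("SketchDecoder")
            .field("batch_size", &self.batch_size)
            // Marks the output as intentionally incomplete.
            .finish_non_exhaustive()
    }
}
```

Once the closure is replaced by a field whose type implements `Debug` (such as `Option<Regex>`), the derive works again and the manual impl can be dropped, which matches the decision to undo the change.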

Can specify custom strings as `NULL` values for CSVs as a regular
expression. This allows reading CSV files that use placeholders for
NULL values instead of empty strings.

Fixes apache#4794

Signed-off-by: Vaibhav <[email protected]>
@vrongmeal
Contributor Author

Cleaned up the PR. I made some minor refactors. Here we are with the final benchmarks (not much different from the previous ones):

4096 i32_small(0) - 128 time:   [129.48 µs 129.69 µs 129.93 µs]
                        change: [-0.1022% +0.2640% +0.6342%] (p = 0.16 > 0.05)
                        No change in performance detected.
Found 26 outliers among 100 measurements (26.00%)
  10 (10.00%) low severe
  5 (5.00%) low mild
  7 (7.00%) high mild
  4 (4.00%) high severe

4096 i32_small(0) - 1024
                        time:   [102.46 µs 102.58 µs 102.73 µs]
                        change: [+0.9391% +1.0842% +1.2562%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

4096 i32_small(0) - 4096
                        time:   [100.04 µs 100.18 µs 100.32 µs]
                        change: [+1.5302% +1.6691% +1.8415%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

4096 i32(0) - 128       time:   [206.07 µs 206.22 µs 206.36 µs]
                        change: [+0.2724% +0.3939% +0.5160%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

4096 i32(0) - 1024      time:   [180.19 µs 180.31 µs 180.44 µs]
                        change: [+0.3165% +0.4330% +0.5506%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

4096 i32(0) - 4096      time:   [186.55 µs 186.78 µs 187.02 µs]
                        change: [+0.8841% +1.1445% +1.3982%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

4096 u64_small(0) - 128 time:   [129.82 µs 130.01 µs 130.21 µs]
                        change: [-0.2892% -0.1184% +0.0488%] (p = 0.17 > 0.05)
                        No change in performance detected.

4096 u64_small(0) - 1024
                        time:   [103.63 µs 103.69 µs 103.76 µs]
                        change: [+0.2414% +0.3326% +0.4239%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

4096 u64_small(0) - 4096
                        time:   [101.59 µs 101.64 µs 101.69 µs]
                        change: [+0.8516% +0.9603% +1.0727%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high severe

4096 u64(0) - 128       time:   [221.43 µs 221.57 µs 221.72 µs]
                        change: [-1.7539% +0.3171% +1.5285%] (p = 0.82 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

4096 u64(0) - 1024      time:   [194.33 µs 194.44 µs 194.55 µs]
                        change: [+1.6837% +1.8242% +1.9828%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe

4096 u64(0) - 4096      time:   [202.89 µs 204.05 µs 205.06 µs]
                        change: [-1.0604% -0.5223% -0.0185%] (p = 0.06 > 0.05)
                        No change in performance detected.

4096 i64_small(0) - 128 time:   [147.68 µs 147.85 µs 148.02 µs]
                        change: [-1.2601% -1.0891% -0.9241%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

4096 i64_small(0) - 1024
                        time:   [122.26 µs 122.33 µs 122.39 µs]
                        change: [-0.3205% -0.2296% -0.1350%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

4096 i64_small(0) - 4096
                        time:   [119.52 µs 119.58 µs 119.65 µs]
                        change: [-0.3122% -0.2236% -0.1442%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

4096 i64(0) - 128       time:   [252.20 µs 252.36 µs 252.52 µs]
                        change: [-0.4413% -0.2934% -0.1557%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

4096 i64(0) - 1024      time:   [226.34 µs 226.43 µs 226.53 µs]
                        change: [+0.2713% +0.3469% +0.4166%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  9 (9.00%) high mild
  1 (1.00%) high severe

4096 i64(0) - 4096      time:   [239.07 µs 239.31 µs 239.56 µs]
                        change: [-1.4105% -1.2061% -1.0047%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

4096 f32_small(0) - 128 time:   [188.32 µs 188.39 µs 188.46 µs]
                        change: [-1.2398% -1.0624% -0.8853%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

4096 f32_small(0) - 1024
                        time:   [163.03 µs 163.16 µs 163.37 µs]
                        change: [-0.1988% -0.1012% +0.0076%] (p = 0.05 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

4096 f32_small(0) - 4096
                        time:   [162.22 µs 162.64 µs 163.13 µs]
                        change: [+0.8514% +1.0582% +1.2712%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

4096 f32(0) - 128       time:   [239.99 µs 240.17 µs 240.35 µs]
                        change: [-0.1242% -0.0110% +0.0980%] (p = 0.85 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

4096 f32(0) - 1024      time:   [214.23 µs 214.37 µs 214.51 µs]
                        change: [+0.1882% +0.3555% +0.5017%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

4096 f32(0) - 4096      time:   [220.51 µs 220.78 µs 221.05 µs]
                        change: [+0.0601% +0.3376% +0.6216%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

4096 f64_small(0) - 128 time:   [189.29 µs 189.42 µs 189.54 µs]
                        change: [-0.3008% -0.1870% -0.0795%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

4096 f64_small(0) - 1024
                        time:   [164.10 µs 164.15 µs 164.21 µs]
                        change: [+0.2648% +0.3359% +0.4127%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

4096 f64_small(0) - 4096
                        time:   [161.84 µs 162.18 µs 162.65 µs]
                        change: [+0.3241% +0.4859% +0.6660%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

4096 f64(0) - 128       time:   [264.38 µs 264.51 µs 264.63 µs]
                        change: [-0.2875% -0.1436% +0.0010%] (p = 0.05 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

4096 f64(0) - 1024      time:   [236.75 µs 236.82 µs 236.90 µs]
                        change: [-0.1708% -0.0758% +0.0053%] (p = 0.09 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

4096 f64(0) - 4096      time:   [250.96 µs 251.21 µs 251.45 µs]
                        change: [-1.6759% -1.4032% -1.1108%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

4096 string(10, 0) - 128
                        time:   [120.43 µs 120.67 µs 120.96 µs]
                        change: [-1.5145% -1.3509% -1.1836%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

4096 string(10, 0) - 1024
                        time:   [100.94 µs 100.97 µs 101.00 µs]
                        change: [-0.1498% -0.0674% +0.0154%] (p = 0.10 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

4096 string(10, 0) - 4096
                        time:   [114.31 µs 114.59 µs 114.87 µs]
                        change: [-2.9515% -2.3473% -1.6384%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) low severe
  4 (4.00%) low mild
  4 (4.00%) high severe

4096 string(30, 0) - 128
                        time:   [200.39 µs 200.64 µs 200.89 µs]
                        change: [-0.0823% +0.0362% +0.1587%] (p = 0.56 > 0.05)
                        No change in performance detected.

4096 string(30, 0) - 1024
                        time:   [179.58 µs 179.79 µs 180.01 µs]
                        change: [+0.6443% +0.7986% +0.9492%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

4096 string(30, 0) - 4096
                        time:   [193.35 µs 193.84 µs 194.27 µs]
                        change: [-1.1721% -0.8889% -0.6075%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

4096 string(100, 0) - 128
                        time:   [432.10 µs 432.97 µs 433.84 µs]
                        change: [+0.2068% +0.3655% +0.5314%] (p = 0.00 < 0.05)
                        Change within noise threshold.

4096 string(100, 0) - 1024
                        time:   [438.27 µs 446.28 µs 454.66 µs]
                        change: [-3.6040% -2.1777% -0.9708%] (p = 0.00 < 0.05)
                        Change within noise threshold.

4096 string(100, 0) - 4096
                        time:   [441.42 µs 442.67 µs 443.87 µs]
                        change: [+1.1279% +1.4411% +1.7686%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

4096 string(100, 0.5) - 128
                        time:   [311.91 µs 312.07 µs 312.23 µs]
                        change: [-0.3517% -0.1610% +0.0619%] (p = 0.16 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) high mild
  3 (3.00%) high severe

4096 string(100, 0.5) - 1024
                        time:   [298.65 µs 300.69 µs 302.52 µs]
                        change: [+0.9439% +1.5945% +2.2157%] (p = 0.00 < 0.05)
                        Change within noise threshold.

4096 string(100, 0.5) - 4096
                        time:   [313.77 µs 316.21 µs 318.96 µs]
                        change: [+0.8984% +1.5120% +2.1361%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 128
                        time:   [950.86 µs 951.80 µs 952.96 µs]
                        change: [+0.2635% +0.4109% +0.5742%] (p = 0.00 < 0.05)
                        Change within noise threshold.

4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 1024
                        time:   [894.21 µs 897.00 µs 899.93 µs]
                        change: [-1.8062% -1.2546% -0.7216%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 4096
                        time:   [882.42 µs 888.96 µs 899.89 µs]
                        change: [-1.7490% -0.7579% +0.3868%] (p = 0.17 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 128
                        time:   [777.10 µs 777.90 µs 778.76 µs]
                        change: [+0.3738% +0.5824% +0.7619%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 1024
                        time:   [706.16 µs 707.73 µs 709.08 µs]
                        change: [-0.3864% -0.1618% +0.0398%] (p = 0.13 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 4096
                        time:   [702.01 µs 704.64 µs 707.09 µs]
                        change: [+0.4500% +0.7426% +1.0359%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
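
The verdict lines in these reports ("No change in performance detected", "Change within noise threshold", "Performance has regressed/improved") appear to follow a simple rule. The sketch below is a reading of the output above, assuming Criterion's default significance level of 0.05 and default noise threshold of 1%. `Verdict` and `classify` are hypothetical names, not Criterion's API.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    NoChange,
    WithinNoise,
    Improved,
    Regressed,
}

/// Classifies one benchmark comparison from the numbers printed on its
/// `change:` line. The bounds are the first and last values of the bracketed
/// confidence interval, in percent.
fn classify(p_value: f64, lower_pct: f64, upper_pct: f64) -> Verdict {
    // Assumed defaults; both can be configured in Criterion.
    const SIGNIFICANCE_LEVEL: f64 = 0.05;
    const NOISE_THRESHOLD_PCT: f64 = 1.0;

    if p_value >= SIGNIFICANCE_LEVEL {
        // The difference is not statistically significant.
        return Verdict::NoChange;
    }
    if lower_pct < -NOISE_THRESHOLD_PCT && upper_pct < -NOISE_THRESHOLD_PCT {
        // The whole interval is more than 1% faster.
        Verdict::Improved
    } else if lower_pct > NOISE_THRESHOLD_PCT && upper_pct > NOISE_THRESHOLD_PCT {
        // The whole interval is more than 1% slower.
        Verdict::Regressed
    } else {
        // Significant, but the interval overlaps the +/-1% band.
        Verdict::WithinNoise
    }
}
```

For example, `classify(0.00, 1.5302, 1.8415)` gives `Regressed` (the `i32_small(0) - 4096` case), while `classify(0.00, 0.9439, 2.2157)` gives `WithinNoise` because the lower bound is under 1%. Printed p-values are rounded to two decimals, so lines such as `p = 0.05 < 0.05` cannot be reproduced exactly from the printed numbers.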

@vrongmeal vrongmeal requested a review from tustvold September 12, 2023 12:43
Contributor

@tustvold tustvold left a comment

Sorry for taking so long to review this, been a bit swamped.

Just some very minor docs nits, and then I think this is good to go. Just double-checking the benchmarks on some reference hardware now...

Edit: Benchmarks look good 🎉

arrow-csv/src/reader/mod.rs (outdated, resolved)
arrow-csv/src/reader/mod.rs (outdated, resolved)
@tustvold
Contributor

I took the liberty of applying the docs changes so I can get this in

@vrongmeal
Contributor Author

> I took the liberty of applying the docs changes so I can get this in

Thank you so much!

@tustvold tustvold merged commit 2075cd1 into apache:master Sep 13, 2023
22 checks passed
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add option to specify custom null values for CSV reader
2 participants