csv: Add option to specify custom null values #4795

vrongmeal · 2023-09-07T17:45:56Z

Can specify custom strings as NULL values for CSVs. This allows reading CSV files that have placeholders for NULL values instead of empty strings.
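
As a rough illustration of the behaviour described above, here is a minimal standard-library sketch. It is not the arrow-csv implementation: `parse_i64_column` is a hypothetical helper, and the placeholders `NA` and `NULL` are only example values. It treats empty fields and any configured placeholder as NULL while parsing an integer column.

```rust
use std::collections::HashSet;
use std::num::ParseIntError;

/// Hypothetical helper (not part of arrow-csv): parses one column of integer
/// fields, mapping empty fields and any field listed in `nulls` to `None`.
fn parse_i64_column(
    fields: &[&str],
    nulls: &HashSet<String>,
) -> Result<Vec<Option<i64>>, ParseIntError> {
    fields
        .iter()
        .map(|field| {
            if field.is_empty() || nulls.contains(*field) {
                // Empty string or a configured placeholder: treat as NULL.
                Ok(None)
            } else {
                // Anything else must parse as an integer.
                field.parse::<i64>().map(Some)
            }
        })
        .collect()
}

fn main() {
    // Example placeholders; real data may use others.
    let mut nulls = HashSet::new();
    nulls.insert("NA".to_string());
    nulls.insert("NULL".to_string());

    let column = ["1", "NA", "", "42", "NULL"];
    let parsed = parse_i64_column(&column, &nulls).unwrap();
    println!("{:?}", parsed);
}
```

Without the placeholder set, a field such as `NA` in an integer column would be a parse error rather than a NULL, which is the case this option is meant to cover.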

Which issue does this PR close?

Closes #4794

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

tustvold · 2023-09-07T18:01:37Z

Have you run the benchmarks for this change?

vrongmeal · 2023-09-08T12:02:28Z

> Have you run the benchmarks for this change?

❯ cargo bench --bench csv_reader
   Compiling arrow-csv v46.0.0 (/Users/vrongmeal/Projects/arrow-rs/arrow-csv)
   Compiling arrow v46.0.0 (/Users/vrongmeal/Projects/arrow-rs/arrow)
    Finished bench [optimized] target(s) in 4.92s
     Running benches/csv_reader.rs (target/release/deps/csv_reader-dadff296bc34e88b)
4096 u64(0) - 128       time:   [238.26 µs 238.78 µs 239.28 µs]
                        change: [+6.9565% +7.2459% +7.5214%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

4096 u64(0) - 1024      time:   [208.08 µs 208.18 µs 208.29 µs]
                        change: [+7.0050% +7.4168% +7.7855%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

4096 u64(0) - 4096      time:   [214.62 µs 215.65 µs 216.80 µs]
                        change: [+7.4572% +7.8687% +8.3141%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 30 outliers among 100 measurements (30.00%)
  16 (16.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild
  11 (11.00%) high severe

4096 i64(0) - 128       time:   [266.09 µs 266.39 µs 266.66 µs]
                        change: [+5.7398% +5.8570% +5.9624%] (p = 0.00 < 0.05)
                        Performance has regressed.

4096 i64(0) - 1024      time:   [237.48 µs 238.31 µs 239.60 µs]
                        change: [+5.0834% +5.3245% +5.6549%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

4096 i64(0) - 4096      time:   [245.08 µs 246.59 µs 248.04 µs]
                        change: [+6.1555% +6.7570% +7.3428%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

4096 f32(0) - 128       time:   [257.63 µs 257.84 µs 258.05 µs]
                        change: [+7.4875% +7.5943% +7.6967%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) low mild
  8 (8.00%) high mild
  1 (1.00%) high severe

4096 f32(0) - 1024      time:   [229.77 µs 229.95 µs 230.15 µs]
                        change: [+8.2822% +8.4132% +8.5422%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

4096 f32(0) - 4096      time:   [238.71 µs 239.11 µs 239.45 µs]
                        change: [+6.0796% +6.6259% +7.0354%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild

4096 f64(0) - 128       time:   [280.21 µs 280.50 µs 280.81 µs]
                        change: [+7.8352% +7.9521% +8.0750%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

4096 f64(0) - 1024      time:   [252.86 µs 252.96 µs 253.07 µs]
                        change: [+7.8311% +7.9334% +8.0289%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

4096 f64(0) - 4096      time:   [266.61 µs 267.84 µs 269.00 µs]
                        change: [+4.7965% +5.3015% +5.8712%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 22 outliers among 100 measurements (22.00%)
  20 (20.00%) low mild
  2 (2.00%) high mild

4096 string(10, 0) - 128
                        time:   [120.30 µs 120.46 µs 120.62 µs]
                        change: [-0.4443% -0.3186% -0.1828%] (p = 0.00 < 0.05)
                        Change within noise threshold.

4096 string(10, 0) - 1024
                        time:   [100.84 µs 100.87 µs 100.90 µs]
                        change: [-0.1108% -0.0352% +0.0550%] (p = 0.38 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

4096 string(10, 0) - 4096
                        time:   [109.47 µs 110.80 µs 112.02 µs]
                        change: [-3.6030% -2.7357% -1.9150%] (p = 0.00 < 0.05)
                        Performance has improved.

4096 string(30, 0) - 128
                        time:   [199.71 µs 199.81 µs 199.90 µs]
                        change: [-0.1739% -0.0899% -0.0088%] (p = 0.04 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

4096 string(30, 0) - 1024
                        time:   [178.48 µs 178.53 µs 178.59 µs]
                        change: [-0.0855% +0.0097% +0.0933%] (p = 0.84 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

4096 string(30, 0) - 4096
                        time:   [189.35 µs 190.53 µs 191.52 µs]
                        change: [-3.7107% -2.9717% -2.3294%] (p = 0.00 < 0.05)
                        Performance has improved.

4096 string(100, 0) - 128
                        time:   [430.35 µs 430.69 µs 431.08 µs]
                        change: [-0.0244% +0.0797% +0.1834%] (p = 0.13 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

4096 string(100, 0) - 1024
                        time:   [465.68 µs 469.26 µs 472.13 µs]
                        change: [-2.7082% -1.3919% -0.1599%] (p = 0.03 < 0.05)
                        Change within noise threshold.
Found 23 outliers among 100 measurements (23.00%)
  20 (20.00%) low severe
  3 (3.00%) low mild

4096 string(100, 0) - 4096
                        time:   [436.40 µs 437.63 µs 438.81 µs]
                        change: [+0.1291% +0.4030% +0.6693%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  12 (12.00%) low mild
  1 (1.00%) high mild

4096 string(100, 0.5) - 128
                        time:   [310.80 µs 310.96 µs 311.13 µs]
                        change: [-0.6979% -0.5621% -0.4446%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

4096 string(100, 0.5) - 1024
                        time:   [299.95 µs 301.09 µs 302.20 µs]
                        change: [-1.3131% -0.8438% -0.3311%] (p = 0.00 < 0.05)
                        Change within noise threshold.

4096 string(100, 0.5) - 4096
                        time:   [300.49 µs 302.11 µs 303.68 µs]
                        change: [-2.2359% -1.6753% -1.1472%] (p = 0.00 < 0.05)
                        Performance has improved.

4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 128
                        time:   [965.35 µs 966.12 µs 966.97 µs]
                        change: [-0.4355% +0.1689% +0.5690%] (p = 0.61 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 1024
                        time:   [920.97 µs 925.02 µs 929.03 µs]
                        change: [+1.4017% +2.0410% +2.6493%] (p = 0.00 < 0.05)
                        Performance has regressed.

4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 4096
                        time:   [894.44 µs 900.48 µs 907.09 µs]
                        change: [+0.4112% +1.0848% +1.7978%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 21 outliers among 100 measurements (21.00%)
  19 (19.00%) high mild
  2 (2.00%) high severe

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 128
                        time:   [802.37 µs 803.12 µs 803.87 µs]
                        change: [+3.9605% +4.1417% +4.3429%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 1024
                        time:   [732.96 µs 733.86 µs 734.95 µs]
                        change: [+3.7733% +3.9502% +4.1396%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 4096
                        time:   [726.96 µs 728.58 µs 730.19 µs]
                        change: [+4.1967% +4.4437% +4.6878%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

vrongmeal · 2023-09-08T12:03:26Z

I'll try to make some improvements here.

tustvold · 2023-09-08T12:04:56Z

You may need to help LLVM by pulling the conditional out of the loop for it
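
The suggestion above can be sketched as follows. This is an illustrative standard-library example, not the code from this PR: both function names are hypothetical, and whether LLVM would perform this transformation on its own in any given case is not guaranteed. The first version re-examines the configuration for every field; the second checks it once and runs a separate loop per case.

```rust
use std::collections::HashSet;

/// Branch inside the loop: the `Option` is re-examined for every field.
fn count_nulls_branch_inside(fields: &[&str], custom: Option<&HashSet<String>>) -> usize {
    let mut count = 0;
    for field in fields {
        let is_null = match custom {
            Some(set) => field.is_empty() || set.contains(*field),
            None => field.is_empty(),
        };
        if is_null {
            count += 1;
        }
    }
    count
}

/// Branch hoisted: the `Option` is examined once, and each arm runs its own
/// loop with a fixed null test.
fn count_nulls_hoisted(fields: &[&str], custom: Option<&HashSet<String>>) -> usize {
    match custom {
        Some(set) => fields
            .iter()
            .filter(|field| field.is_empty() || set.contains(**field))
            .count(),
        None => fields.iter().filter(|field| field.is_empty()).count(),
    }
}

fn main() {
    let mut nulls = HashSet::new();
    nulls.insert("NA".to_string());
    let fields = ["1", "", "NA", "7"];

    // Both versions agree; only the loop structure differs.
    println!("{}", count_nulls_branch_inside(&fields, Some(&nulls)));
    println!("{}", count_nulls_hoisted(&fields, Some(&nulls)));
    println!("{}", count_nulls_hoisted(&fields, None));
}
```

The hoisted form keeps the default path (no custom nulls) free of the set lookup, which is the path the regressed benchmarks exercise.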

vrongmeal · 2023-09-08T14:24:00Z

Updated changes:

❯ cargo bench --bench csv_reader
   Compiling arrow-csv v46.0.0 (/Users/vrongmeal/Projects/arrow-rs/arrow-csv)
   Compiling arrow v46.0.0 (/Users/vrongmeal/Projects/arrow-rs/arrow)
    Finished bench [optimized] target(s) in 4.89s
     Running benches/csv_reader.rs (target/release/deps/csv_reader-dadff296bc34e88b)
4096 u64(0) - 128       time:   [228.45 µs 228.63 µs 228.84 µs]
                        change: [+3.2331% +3.4075% +3.5555%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild

4096 u64(0) - 1024      time:   [199.65 µs 199.84 µs 200.02 µs]
                        change: [+3.5994% +3.7003% +3.8060%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

4096 u64(0) - 4096      time:   [207.64 µs 208.42 µs 209.10 µs]
                        change: [+1.6002% +2.2305% +2.8814%] (p = 0.00 < 0.05)
                        Performance has regressed.

4096 i64(0) - 128       time:   [252.46 µs 252.63 µs 252.85 µs]
                        change: [-1.1848% -1.0121% -0.8452%] (p = 0.00 < 0.05)
                        Change within noise threshold.

4096 i64(0) - 1024      time:   [226.17 µs 226.28 µs 226.40 µs]
                        change: [-1.3198% -1.2067% -1.1007%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

4096 i64(0) - 4096      time:   [237.37 µs 238.27 µs 239.31 µs]
                        change: [-0.2543% +0.2821% +0.8587%] (p = 0.33 > 0.05)
                        No change in performance detected.

4096 f32(0) - 128       time:   [242.69 µs 242.96 µs 243.22 µs]
                        change: [+0.2529% +0.4053% +0.5640%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

4096 f32(0) - 1024      time:   [216.50 µs 216.94 µs 217.40 µs]
                        change: [+0.8552% +1.0164% +1.1941%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

4096 f32(0) - 4096      time:   [218.59 µs 219.23 µs 219.86 µs]
                        change: [-2.4562% -2.1355% -1.7614%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  10 (10.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

4096 f64(0) - 128       time:   [265.63 µs 266.24 µs 266.88 µs]
                        change: [+0.7456% +0.9606% +1.2133%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

4096 f64(0) - 1024      time:   [235.71 µs 235.82 µs 235.95 µs]
                        change: [+0.2742% +0.3610% +0.4471%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

4096 f64(0) - 4096      time:   [247.69 µs 249.25 µs 250.70 µs]
                        change: [-0.6911% +0.0159% +0.7584%] (p = 0.96 > 0.05)
                        No change in performance detected.
Found 29 outliers among 100 measurements (29.00%)
  12 (12.00%) low severe
  3 (3.00%) low mild
  10 (10.00%) high mild
  4 (4.00%) high severe

4096 string(10, 0) - 128
                        time:   [120.22 µs 120.44 µs 120.66 µs]
                        change: [-0.1170% +0.0390% +0.1963%] (p = 0.63 > 0.05)
                        No change in performance detected.

4096 string(10, 0) - 1024
                        time:   [101.07 µs 101.10 µs 101.15 µs]
                        change: [-0.1678% -0.0755% +0.0136%] (p = 0.11 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  5 (5.00%) high severe

4096 string(10, 0) - 4096
                        time:   [118.83 µs 119.73 µs 120.52 µs]
                        change: [+1.7587% +2.7841% +3.7795%] (p = 0.00 < 0.05)
                        Performance has regressed.

4096 string(30, 0) - 128
                        time:   [199.96 µs 200.23 µs 200.45 µs]
                        change: [-0.4955% -0.3786% -0.2600%] (p = 0.00 < 0.05)
                        Change within noise threshold.

4096 string(30, 0) - 1024
                        time:   [178.41 µs 178.47 µs 178.54 µs]
                        change: [-0.3514% -0.2803% -0.2062%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

4096 string(30, 0) - 4096
                        time:   [184.26 µs 186.21 µs 188.00 µs]
                        change: [-1.9326% -0.9923% -0.1373%] (p = 0.02 < 0.05)
                        Change within noise threshold.

4096 string(100, 0) - 128
                        time:   [430.75 µs 431.09 µs 431.41 µs]
                        change: [-0.3590% -0.2513% -0.1409%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

4096 string(100, 0) - 1024
                        time:   [474.17 µs 479.79 µs 484.65 µs]
                        change: [+5.4907% +7.2059% +8.9017%] (p = 0.00 < 0.05)
                        Performance has regressed.

4096 string(100, 0) - 4096
                        time:   [435.40 µs 436.45 µs 437.52 µs]
                        change: [-2.3235% -2.1234% -1.8919%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  11 (11.00%) high mild
  3 (3.00%) high severe

4096 string(100, 0.5) - 128
                        time:   [310.66 µs 312.32 µs 315.34 µs]
                        change: [-2.0074% -1.5844% -0.9803%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

4096 string(100, 0.5) - 1024
                        time:   [291.23 µs 292.01 µs 292.91 µs]
                        change: [-4.9235% -4.1627% -3.4117%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

4096 string(100, 0.5) - 4096
                        time:   [305.76 µs 306.96 µs 308.23 µs]
                        change: [-1.0322% -0.4965% +0.0523%] (p = 0.07 > 0.05)
                        No change in performance detected.

4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 128
                        time:   [949.42 µs 950.06 µs 950.69 µs]
                        change: [-0.3684% -0.2576% -0.1468%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 1024
                        time:   [887.01 µs 888.57 µs 890.18 µs]
                        change: [-1.5528% -1.4042% -1.2439%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild

4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 4096
                        time:   [900.61 µs 909.35 µs 917.34 µs]
                        change: [-0.2915% +0.6089% +1.5111%] (p = 0.18 > 0.05)
                        No change in performance detected.

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 128
                        time:   [770.68 µs 777.47 µs 790.18 µs]
                        change: [-0.1682% +1.5906% +4.4086%] (p = 0.21 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 1024
                        time:   [706.19 µs 707.05 µs 707.90 µs]
                        change: [-0.6163% -0.3262% -0.0437%] (p = 0.03 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 4096
                        time:   [694.56 µs 695.28 µs 696.04 µs]
                        change: [-0.8192% -0.6161% -0.4063%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild

tustvold · 2023-09-09T15:55:00Z

I also ran the benchmarks and got similar results. I'm honestly somewhat surprised, but I guess string parsing inherently has a lot of conditionals.

Edit: Actually, running this with slightly more realistic benchmarks that use smaller integers still shows a 10% performance regression. I'll get a PR up with these so that you can take a look.

Edit Edit: Benchmarks in #4803

tustvold · 2023-09-09T16:37:58Z

arrow-csv/src/reader/mod.rs

@@ -241,6 +243,11 @@ impl Format {
        self
    }

+    pub fn with_nulls(mut self, nulls: HashSet<String>) -> Self {


I wonder if a Regex might be a more flexible interface, and more consistent with how we express things for schema inference. It might also perform better

Makes sense! I will try this with regex.

vrongmeal · 2023-09-11T14:12:41Z

Regex performs worse when a custom regex is provided. I can push a commit with the regex change; the benchmarks should be similar.

vrongmeal · 2023-09-11T14:58:43Z

The latest changes seem to do quite well:

4096 i32_small(0) - 128 time:   [130.33 µs 132.50 µs 137.03 µs]
                        change: [-0.0644% +3.0447% +8.8724%] (p = 0.26 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe

4096 i32_small(0) - 1024
                        time:   [102.34 µs 102.38 µs 102.42 µs]
                        change: [+0.7437% +1.1352% +1.3835%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

4096 i32_small(0) - 4096
                        time:   [99.582 µs 99.633 µs 99.689 µs]
                        change: [+0.6403% +0.8823% +1.0898%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

4096 i32(0) - 128       time:   [205.46 µs 205.84 µs 206.24 µs]
                        change: [-3.7421% -2.6800% -1.8625%] (p = 0.00 < 0.05)
                        Performance has improved.

4096 i32(0) - 1024      time:   [179.71 µs 179.89 µs 180.08 µs]
                        change: [-1.7409% -1.1017% -0.5766%] (p = 0.00 < 0.05)
                        Change within noise threshold.

4096 i32(0) - 4096      time:   [183.32 µs 184.14 µs 184.85 µs]
                        change: [-1.9470% -1.4436% -0.9425%] (p = 0.00 < 0.05)
                        Change within noise threshold.

4096 u64_small(0) - 128 time:   [130.11 µs 132.99 µs 137.36 µs]
                        change: [-3.7062% -1.0089% +1.6673%] (p = 0.49 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

4096 u64_small(0) - 1024
                        time:   [103.47 µs 104.38 µs 106.41 µs]
                        change: [-0.9524% -0.0361% +1.5091%] (p = 0.96 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

4096 u64_small(0) - 4096
                        time:   [100.65 µs 100.68 µs 100.71 µs]
                        change: [-3.0726% -1.7815% -0.8379%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

4096 u64(0) - 128       time:   [219.91 µs 220.18 µs 220.45 µs]
                        change: [-0.5549% -0.4068% -0.2597%] (p = 0.00 < 0.05)
                        Change within noise threshold.

4096 u64(0) - 1024      time:   [193.99 µs 194.13 µs 194.27 µs]
                        change: [+0.6830% +0.8368% +0.9901%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

4096 u64(0) - 4096      time:   [207.50 µs 208.06 µs 208.60 µs]
                        change: [-0.7799% +2.1297% +3.8490%] (p = 0.07 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

4096 i64_small(0) - 128 time:   [148.42 µs 148.73 µs 149.07 µs]
                        change: [-1.1969% -0.8221% -0.5079%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

4096 i64_small(0) - 1024
                        time:   [122.20 µs 122.26 µs 122.33 µs]
                        change: [-0.4041% -0.2750% -0.1462%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

4096 i64_small(0) - 4096
                        time:   [119.54 µs 119.59 µs 119.64 µs]
                        change: [-0.6484% -0.5039% -0.3624%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

4096 i64(0) - 128       time:   [253.10 µs 253.41 µs 253.69 µs]
                        change: [-4.9520% -3.1144% -1.7332%] (p = 0.00 < 0.05)
                        Performance has improved.

4096 i64(0) - 1024      time:   [225.76 µs 225.85 µs 225.95 µs]
                        change: [-1.8082% -1.0685% -0.5455%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

4096 i64(0) - 4096      time:   [238.72 µs 238.94 µs 239.18 µs]
                        change: [-2.7638% -1.5168% -0.4835%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

4096 f32_small(0) - 128 time:   [189.77 µs 190.08 µs 190.42 µs]
                        change: [-3.0283% -2.3206% -1.8603%] (p = 0.00 < 0.05)
                        Performance has improved.

4096 f32_small(0) - 1024
                        time:   [163.36 µs 163.72 µs 164.23 µs]
                        change: [-0.8370% -0.5834% -0.3310%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) high mild
  8 (8.00%) high severe

4096 f32_small(0) - 4096
                        time:   [160.66 µs 160.71 µs 160.76 µs]
                        change: [-8.3509% -4.9169% -2.0822%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

4096 f32(0) - 128       time:   [239.17 µs 239.41 µs 239.68 µs]
                        change: [-5.3499% -3.4406% -2.0248%] (p = 0.00 < 0.05)
                        Performance has improved.

4096 f32(0) - 1024      time:   [212.94 µs 213.14 µs 213.34 µs]
                        change: [-8.6109% -3.8452% -0.4918%] (p = 0.05 > 0.05)
                        No change in performance detected.
Found 20 outliers among 100 measurements (20.00%)
  9 (9.00%) low mild
  8 (8.00%) high mild
  3 (3.00%) high severe

4096 f32(0) - 4096      time:   [220.86 µs 221.13 µs 221.41 µs]
                        change: [-0.4012% -0.0585% +0.2439%] (p = 0.73 > 0.05)
                        No change in performance detected.

4096 f64_small(0) - 128 time:   [190.02 µs 190.37 µs 190.71 µs]
                        change: [-1.2881% -1.1310% -0.9806%] (p = 0.00 < 0.05)
                        Change within noise threshold.

4096 f64_small(0) - 1024
                        time:   [163.64 µs 163.70 µs 163.76 µs]
                        change: [-0.0629% +0.0076% +0.0864%] (p = 0.85 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

4096 f64_small(0) - 4096
                        time:   [161.39 µs 161.45 µs 161.52 µs]
                        change: [-4.5426% -1.5487% +0.0554%] (p = 0.40 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

4096 f64(0) - 128       time:   [265.23 µs 265.51 µs 265.82 µs]
                        change: [-8.5473% -3.9192% -0.5514%] (p = 0.06 > 0.05)
                        No change in performance detected.

4096 f64(0) - 1024      time:   [236.95 µs 237.05 µs 237.14 µs]
                        change: [-0.1151% -0.0326% +0.0409%] (p = 0.44 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

4096 f64(0) - 4096      time:   [246.06 µs 247.75 µs 249.53 µs]
                        change: [-0.9090% -0.2578% +0.4224%] (p = 0.44 > 0.05)
                        No change in performance detected.

4096 string(10, 0) - 128
                        time:   [120.45 µs 120.63 µs 120.79 µs]
                        change: [-1.1682% -1.0363% -0.9072%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

4096 string(10, 0) - 1024
                        time:   [101.01 µs 101.04 µs 101.08 µs]
                        change: [-0.0422% +0.0156% +0.0662%] (p = 0.58 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

4096 string(10, 0) - 4096
                        time:   [118.38 µs 118.87 µs 119.37 µs]
                        change: [+7.7898% +8.1893% +8.5642%] (p = 0.00 < 0.05)
                        Performance has regressed.

4096 string(30, 0) - 128
                        time:   [199.88 µs 199.99 µs 200.08 µs]
                        change: [-0.5842% -0.5261% -0.4711%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

4096 string(30, 0) - 1024
                        time:   [178.55 µs 178.81 µs 179.19 µs]
                        change: [-0.1499% -0.0245% +0.0912%] (p = 0.71 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

4096 string(30, 0) - 4096
                        time:   [194.25 µs 194.96 µs 195.63 µs]
                        change: [+1.5664% +2.1975% +2.7777%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low severe
  5 (5.00%) low mild
  1 (1.00%) high mild

4096 string(100, 0) - 128
                        time:   [430.01 µs 430.44 µs 430.87 µs]
                        change: [-0.7295% -0.5852% -0.3922%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

4096 string(100, 0) - 1024
                        time:   [443.96 µs 452.71 µs 460.82 µs]
                        change: [-5.6742% -4.0818% -2.5121%] (p = 0.00 < 0.05)
                        Performance has improved.

4096 string(100, 0) - 4096
                        time:   [438.85 µs 439.64 µs 440.49 µs]
                        change: [-0.4321% -0.1080% +0.2058%] (p = 0.51 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

4096 string(100, 0.5) - 128
                        time:   [312.43 µs 312.86 µs 313.23 µs]
                        change: [-0.4155% -0.2683% -0.1336%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 24 outliers among 100 measurements (24.00%)
  11 (11.00%) low severe
  12 (12.00%) low mild
  1 (1.00%) high mild

4096 string(100, 0.5) - 1024
                        time:   [293.62 µs 294.98 µs 296.50 µs]
                        change: [-2.3134% -1.8461% -1.3810%] (p = 0.00 < 0.05)
                        Performance has improved.

4096 string(100, 0.5) - 4096
                        time:   [300.82 µs 302.16 µs 303.50 µs]
                        change: [-1.0920% -0.7349% -0.3356%] (p = 0.00 < 0.05)
                        Change within noise threshold.

4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 128
                        time:   [950.54 µs 951.49 µs 952.46 µs]
                        change: [-0.0619% +0.0370% +0.1391%] (p = 0.47 > 0.05)
                        No change in performance detected.

4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 1024
                        time:   [882.83 µs 886.20 µs 889.62 µs]
                        change: [-2.5615% -2.1071% -1.6910%] (p = 0.00 < 0.05)
                        Performance has improved.

4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 4096
                        time:   [879.77 µs 881.35 µs 882.90 µs]
                        change: [-3.4161% -2.7473% -2.0204%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 128
                        time:   [768.85 µs 769.68 µs 770.49 µs]
                        change: [-1.3762% -0.9834% -0.6629%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 1024
                        time:   [706.64 µs 707.19 µs 707.71 µs]
                        change: [-3.9781% -3.2693% -2.6500%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 4096
                        time:   [700.29 µs 701.35 µs 702.44 µs]
                        change: [-0.9342% -0.4612% -0.0244%] (p = 0.05 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe

tustvold

Looking good, just left some relatively minor comments

tustvold · 2023-09-11T21:51:21Z

arrow-csv/src/reader/mod.rs

@@ -241,6 +242,11 @@ impl Format {
        self
    }

+    pub fn with_null_regex(mut self, null_regex: Regex) -> Self {


Perhaps some doc comments?

Yup, I will add doc comments.

tustvold · 2023-09-11T21:51:25Z

arrow-csv/src/reader/mod.rs

@@ -319,6 +325,7 @@ impl Format {
        if let Some(t) = self.terminator {
            builder.terminator(csv::Terminator::Any(t));
        }
+        // TODO: Null regex


My mistake, I thought it was building the arrow CSV reader.

tustvold · 2023-09-11T21:51:28Z

arrow-csv/src/reader/mod.rs

@@ -336,6 +343,7 @@ impl Format {
        if let Some(t) = self.terminator {
            builder.terminator(csv_core::Terminator::Any(t));
        }
+        // TODO: Null regex


tustvold · 2023-09-11T21:52:20Z

arrow-csv/src/reader/mod.rs

 ) -> Result<ArrayRef, ArrowError> {
    let mut decimal_builder = PrimitiveBuilder::<T>::with_capacity(rows.len());
    for row in rows.iter() {
        let s = row.get(col_idx);
-        if s.is_empty() {
+        if s.is_empty() || null_regex.is_some_and(|r| r.is_match(s)) {


If a null regex is provided should we also treat empty strings as nulls?

Edit: BTW TIL about is_some_and, very nice 👌

Let's avoid checking is_empty in that case. Users can configure the regex to include it as well.
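
To make the agreed behaviour concrete, here is a minimal std-only sketch of the rule discussed above: when a null matcher is configured it alone decides, otherwise only empty strings are NULL. `NullMatcher`, `is_null`, `matches_na` and `matches_na_or_empty` are hypothetical names for illustration, and a plain function pointer stands in for `Regex::is_match` so the snippet has no external dependencies. It is not the code in this PR.

```rust
/// Hypothetical stand-in for `Regex::is_match`, so the sketch needs only std.
type NullMatcher = fn(&str) -> bool;

/// Returns true when a raw CSV field should be read as NULL.
fn is_null(field: &str, null_matcher: Option<NullMatcher>) -> bool {
    match null_matcher {
        // A configured matcher fully replaces the default rule.
        Some(matches) => matches(field),
        // Without a matcher, only the empty string is NULL.
        None => field.is_empty(),
    }
}

/// Roughly what a pattern such as `^NA$` would match (assumption).
fn matches_na(field: &str) -> bool {
    field == "NA"
}

/// Roughly what `^(NA)?$` would match: the user opts empty strings back in.
fn matches_na_or_empty(field: &str) -> bool {
    field.is_empty() || field == "NA"
}

fn main() {
    let na_only: NullMatcher = matches_na;
    let na_or_empty: NullMatcher = matches_na_or_empty;

    // Print how each configuration treats a few sample fields.
    for field in ["", "NA", "42"] {
        println!(
            "{:?}: default={} na_only={} na_or_empty={}",
            field,
            is_null(field, None),
            is_null(field, Some(na_only)),
            is_null(field, Some(na_or_empty)),
        );
    }
}
```

With this shape, a user who wants both placeholders and empty strings treated as NULL has to say so in the pattern, which is the trade-off accepted in the comment above.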

tustvold · 2023-09-11T21:53:34Z

arrow-csv/src/reader/mod.rs

+    null_regex: Option<Regex>,
+}
+
+impl Debug for Decoder {


Why this change?

I'll undo this. It was when the struct had is_null closure.

Custom strings can be specified as `NULL` values for CSVs via a regular expression. This allows reading CSV files that use placeholders for NULL values instead of empty strings. Fixes apache#4794 Signed-off-by: Vaibhav <[email protected]>

vrongmeal · 2023-09-12T12:39:38Z

Cleaned up the PR. I made some minor refactors. Here we are with the final benchmarks (not much different from the previous ones):

4096 i32_small(0) - 128 time:   [129.48 µs 129.69 µs 129.93 µs]
                        change: [-0.1022% +0.2640% +0.6342%] (p = 0.16 > 0.05)
                        No change in performance detected.
Found 26 outliers among 100 measurements (26.00%)
  10 (10.00%) low severe
  5 (5.00%) low mild
  7 (7.00%) high mild
  4 (4.00%) high severe

4096 i32_small(0) - 1024
                        time:   [102.46 µs 102.58 µs 102.73 µs]
                        change: [+0.9391% +1.0842% +1.2562%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

4096 i32_small(0) - 4096
                        time:   [100.04 µs 100.18 µs 100.32 µs]
                        change: [+1.5302% +1.6691% +1.8415%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

4096 i32(0) - 128       time:   [206.07 µs 206.22 µs 206.36 µs]
                        change: [+0.2724% +0.3939% +0.5160%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

4096 i32(0) - 1024      time:   [180.19 µs 180.31 µs 180.44 µs]
                        change: [+0.3165% +0.4330% +0.5506%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

4096 i32(0) - 4096      time:   [186.55 µs 186.78 µs 187.02 µs]
                        change: [+0.8841% +1.1445% +1.3982%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

4096 u64_small(0) - 128 time:   [129.82 µs 130.01 µs 130.21 µs]
                        change: [-0.2892% -0.1184% +0.0488%] (p = 0.17 > 0.05)
                        No change in performance detected.

4096 u64_small(0) - 1024
                        time:   [103.63 µs 103.69 µs 103.76 µs]
                        change: [+0.2414% +0.3326% +0.4239%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

4096 u64_small(0) - 4096
                        time:   [101.59 µs 101.64 µs 101.69 µs]
                        change: [+0.8516% +0.9603% +1.0727%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high severe

4096 u64(0) - 128       time:   [221.43 µs 221.57 µs 221.72 µs]
                        change: [-1.7539% +0.3171% +1.5285%] (p = 0.82 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

4096 u64(0) - 1024      time:   [194.33 µs 194.44 µs 194.55 µs]
                        change: [+1.6837% +1.8242% +1.9828%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe

4096 u64(0) - 4096      time:   [202.89 µs 204.05 µs 205.06 µs]
                        change: [-1.0604% -0.5223% -0.0185%] (p = 0.06 > 0.05)
                        No change in performance detected.

4096 i64_small(0) - 128 time:   [147.68 µs 147.85 µs 148.02 µs]
                        change: [-1.2601% -1.0891% -0.9241%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

4096 i64_small(0) - 1024
                        time:   [122.26 µs 122.33 µs 122.39 µs]
                        change: [-0.3205% -0.2296% -0.1350%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

4096 i64_small(0) - 4096
                        time:   [119.52 µs 119.58 µs 119.65 µs]
                        change: [-0.3122% -0.2236% -0.1442%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

4096 i64(0) - 128       time:   [252.20 µs 252.36 µs 252.52 µs]
                        change: [-0.4413% -0.2934% -0.1557%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

4096 i64(0) - 1024      time:   [226.34 µs 226.43 µs 226.53 µs]
                        change: [+0.2713% +0.3469% +0.4166%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  9 (9.00%) high mild
  1 (1.00%) high severe

4096 i64(0) - 4096      time:   [239.07 µs 239.31 µs 239.56 µs]
                        change: [-1.4105% -1.2061% -1.0047%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

4096 f32_small(0) - 128 time:   [188.32 µs 188.39 µs 188.46 µs]
                        change: [-1.2398% -1.0624% -0.8853%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

4096 f32_small(0) - 1024
                        time:   [163.03 µs 163.16 µs 163.37 µs]
                        change: [-0.1988% -0.1012% +0.0076%] (p = 0.05 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

4096 f32_small(0) - 4096
                        time:   [162.22 µs 162.64 µs 163.13 µs]
                        change: [+0.8514% +1.0582% +1.2712%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

4096 f32(0) - 128       time:   [239.99 µs 240.17 µs 240.35 µs]
                        change: [-0.1242% -0.0110% +0.0980%] (p = 0.85 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

4096 f32(0) - 1024      time:   [214.23 µs 214.37 µs 214.51 µs]
                        change: [+0.1882% +0.3555% +0.5017%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

4096 f32(0) - 4096      time:   [220.51 µs 220.78 µs 221.05 µs]
                        change: [+0.0601% +0.3376% +0.6216%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

4096 f64_small(0) - 128 time:   [189.29 µs 189.42 µs 189.54 µs]
                        change: [-0.3008% -0.1870% -0.0795%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

4096 f64_small(0) - 1024
                        time:   [164.10 µs 164.15 µs 164.21 µs]
                        change: [+0.2648% +0.3359% +0.4127%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

4096 f64_small(0) - 4096
                        time:   [161.84 µs 162.18 µs 162.65 µs]
                        change: [+0.3241% +0.4859% +0.6660%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

4096 f64(0) - 128       time:   [264.38 µs 264.51 µs 264.63 µs]
                        change: [-0.2875% -0.1436% +0.0010%] (p = 0.05 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

4096 f64(0) - 1024      time:   [236.75 µs 236.82 µs 236.90 µs]
                        change: [-0.1708% -0.0758% +0.0053%] (p = 0.09 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

4096 f64(0) - 4096      time:   [250.96 µs 251.21 µs 251.45 µs]
                        change: [-1.6759% -1.4032% -1.1108%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

4096 string(10, 0) - 128
                        time:   [120.43 µs 120.67 µs 120.96 µs]
                        change: [-1.5145% -1.3509% -1.1836%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

4096 string(10, 0) - 1024
                        time:   [100.94 µs 100.97 µs 101.00 µs]
                        change: [-0.1498% -0.0674% +0.0154%] (p = 0.10 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

4096 string(10, 0) - 4096
                        time:   [114.31 µs 114.59 µs 114.87 µs]
                        change: [-2.9515% -2.3473% -1.6384%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) low severe
  4 (4.00%) low mild
  4 (4.00%) high severe

4096 string(30, 0) - 128
                        time:   [200.39 µs 200.64 µs 200.89 µs]
                        change: [-0.0823% +0.0362% +0.1587%] (p = 0.56 > 0.05)
                        No change in performance detected.

4096 string(30, 0) - 1024
                        time:   [179.58 µs 179.79 µs 180.01 µs]
                        change: [+0.6443% +0.7986% +0.9492%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

4096 string(30, 0) - 4096
                        time:   [193.35 µs 193.84 µs 194.27 µs]
                        change: [-1.1721% -0.8889% -0.6075%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

4096 string(100, 0) - 128
                        time:   [432.10 µs 432.97 µs 433.84 µs]
                        change: [+0.2068% +0.3655% +0.5314%] (p = 0.00 < 0.05)
                        Change within noise threshold.

4096 string(100, 0) - 1024
                        time:   [438.27 µs 446.28 µs 454.66 µs]
                        change: [-3.6040% -2.1777% -0.9708%] (p = 0.00 < 0.05)
                        Change within noise threshold.

4096 string(100, 0) - 4096
                        time:   [441.42 µs 442.67 µs 443.87 µs]
                        change: [+1.1279% +1.4411% +1.7686%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

4096 string(100, 0.5) - 128
                        time:   [311.91 µs 312.07 µs 312.23 µs]
                        change: [-0.3517% -0.1610% +0.0619%] (p = 0.16 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) high mild
  3 (3.00%) high severe

4096 string(100, 0.5) - 1024
                        time:   [298.65 µs 300.69 µs 302.52 µs]
                        change: [+0.9439% +1.5945% +2.2157%] (p = 0.00 < 0.05)
                        Change within noise threshold.

4096 string(100, 0.5) - 4096
                        time:   [313.77 µs 316.21 µs 318.96 µs]
                        change: [+0.8984% +1.5120% +2.1361%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 128
                        time:   [950.86 µs 951.80 µs 952.96 µs]
                        change: [+0.2635% +0.4109% +0.5742%] (p = 0.00 < 0.05)
                        Change within noise threshold.

4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 1024
                        time:   [894.21 µs 897.00 µs 899.93 µs]
                        change: [-1.8062% -1.2546% -0.7216%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0) - 4096
                        time:   [882.42 µs 888.96 µs 899.89 µs]
                        change: [-1.7490% -0.7579% +0.3868%] (p = 0.17 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 128
                        time:   [777.10 µs 777.90 µs 778.76 µs]
                        change: [+0.3738% +0.5824% +0.7619%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 1024
                        time:   [706.16 µs 707.73 µs 709.08 µs]
                        change: [-0.3864% -0.1618% +0.0398%] (p = 0.13 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

4096 string(20, 0.5), string(30, 0), f64(0), i64(0) - 4096
                        time:   [702.01 µs 704.64 µs 707.09 µs]
                        change: [+0.4500% +0.7426% +1.0359%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
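
For readers skimming the verdict lines, the following is a rough sketch of a rule that is consistent with the output above. It assumes a 1% noise threshold and a 0.05 significance level, and takes the lower and upper bounds of the reported change interval as fractions. `Verdict` and `classify` are hypothetical names; this is not Criterion's actual code.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    NoChange,
    WithinNoise,
    Improved,
    Regressed,
}

/// Classify a relative change given its confidence interval and p-value.
/// `noise` and `significance` are assumed thresholds (0.01 and 0.05 below).
fn classify(lower: f64, upper: f64, p_value: f64, noise: f64, significance: f64) -> Verdict {
    // Not statistically significant: report no change at all.
    if p_value >= significance {
        return Verdict::NoChange;
    }
    if lower < -noise && upper < -noise {
        // The whole interval lies below the negative noise threshold.
        Verdict::Improved
    } else if lower > noise && upper > noise {
        // The whole interval lies above the positive noise threshold.
        Verdict::Regressed
    } else {
        // Significant, but the interval overlaps the noise band.
        Verdict::WithinNoise
    }
}

fn main() {
    // `4096 i32_small(0) - 4096`: [+1.5302% .. +1.8415%], p = 0.00
    println!("{:?}", classify(0.015302, 0.018415, 0.0, 0.01, 0.05));
    // `4096 string(100, 0) - 1024`: [-3.6040% .. -0.9708%], p = 0.00
    println!("{:?}", classify(-0.036040, -0.009708, 0.0, 0.01, 0.05));
}
```

Under these assumptions the second case is only "within noise" because the upper bound of the interval (-0.97%) does not clear the 1% band, even though the point estimate is above 2%.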

tustvold

Sorry for taking so long to review this, been a bit swamped.

Just some very minor docs nits, and then I think this is good to go. Just double-checking the benchmarks on some reference hardware now...

Edit: Benchmarks look good 🎉

arrow-csv/src/reader/mod.rs

tustvold · 2023-09-13T09:53:22Z

I took the liberty of applying the docs changes so I can get this in

vrongmeal · 2023-09-13T10:05:34Z

I took the liberty of applying the docs changes so I can get this in

Thank you so much!

github-actions bot added the arrow Changes to the arrow crate label Sep 7, 2023

vrongmeal mentioned this pull request Sep 7, 2023

Debug why we can't handle this csv file GlareDB/glaredb#1707

Closed

vrongmeal force-pushed the csv-nulls branch from ebbf1fb to d2ca888 Compare September 8, 2023 14:23

tustvold mentioned this pull request Sep 9, 2023

Improve CSV Reader Benchmark Coverage of Small Primitives #4803

Merged

tustvold reviewed Sep 9, 2023

vrongmeal force-pushed the csv-nulls branch 2 times, most recently from f7c72ae to 4c11b3b Compare September 11, 2023 14:58

tustvold reviewed Sep 11, 2023

csv: Add option to specify custom null regex

bafae8e

vrongmeal force-pushed the csv-nulls branch from 4c11b3b to bafae8e Compare September 12, 2023 12:38

vrongmeal requested a review from tustvold September 12, 2023 12:43

tustvold approved these changes Sep 13, 2023

Apply suggestions from code review

0523596

tustvold merged commit 2075cd1 into apache:master Sep 13, 2023
22 checks passed

tustvold mentioned this pull request Sep 18, 2023

Add option to specify custom null values for CSV reader #4794

Closed
