Skip to content

Exploring novel algorithms to parse rows of CSV 64 bytes at a time using `pclmulqdq`, vectorized classification, and bit twiddling.

Notifications You must be signed in to change notification settings

friendlymatthew/simdcsv

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

simdcsv

simdcsv is a CSV parser that evaluates 64 bytes at a time. There are many kinds of CSV files; this project adheres to the format described in RFC 4180.

Introduction

We can classify every character in CSV into the following: a COMMA, QUOTATION, NEW_LINE, OTHER. We can build a perfect lookup table and use vqtbl1q_u8 to classify 16 characters at once. Daniel Lemire calls this "vectorized classification" in the simdjson paper. [code pointer]

Once we classify every character, we can build a bitset for each class. We chunk through 64 characters at a time, building a u64 for every chunk. Here is a naive case:

[//]: # COMMA = 0, QUOTATION = 1, NEW_LINE = 2, OTHER = 3

aaa,bbb,ccc
33303330333

Then the bitsets look like:

comma_bitset = 0b00010001000
other_bitset = 0b11101110111

Now, we can just count the number of leading zeros in the comma bitset to pull the csv entries.

Using a bitset is pretty powerful in cases where one wants to check if there exists a symbol, count the # of symbols, or remove escaped symbols.

Detecting Escaped Quotations and Commas

Consider the csv row: "aaa,norm","b""bb","ccc"

In CSV, quotes are escaped by doubling them (""). The "" in b""bb is field content, not a structural delimiter. We detect escaped pairs by finding adjacent quotes:

let escaped = q & (q << 1);        // Find adjacent quote pairs
let escaped = escaped | (escaped >> 1);  // Mark both quotes in each pair
let valid_quotes = q & !escaped;   // Remove escaped quotes
quote_bitset   = 0b100010010011010000000001
q << 1         = 0b000100100110100000000010
escaped        = 0b000000010000000000000000  // Found the "" pair
escaped | >> 1 = 0b000000011000000000000000  // Both quotes marked
valid_quotes   = 0b100010000011010000000001  // Only structural quotes remain

Marking Inside Quotations

With only structural quotes, we use parallel prefix XOR to mark all bits between quote pairs:

valid_quotes  = 0b100010000011010000000001  // Structural quotes only
inside_quotes = 0b011100011111000011111110  // All bits between quote pairs marked as 1

Masking out commas inside quotes:

comma_bitset  = 0b000001000000100000010000  // Commas at positions 4, 10, 18
valid_commas  = comma_bitset & !inside_quotes
              = 0b000001000000100000000000  // Comma at 4 masked out

Reading

https://www.rfc-editor.org/rfc/rfc4180.html
https://arxiv.org/pdf/1902.08318
https://branchfree.org/2019/03/06/code-fragment-finding-quote-pairs-with-carry-less-multiply-pclmulqdq/

About

Exploring novel algorithms to parse rows of CSV 64 bytes at a time using `pclmulqdq`, vectorized classification, and bit twiddling.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages