handling invalid quotes gracefully #453

Open
mmarkell opened this issue Jan 16, 2025 · 1 comment

@mmarkell

Referencing #421 which has been closed:

I just opened the referenced file in Excel, and it seems to do a good job of telling quotes that mark cell boundaries apart from quotes that are part of the text.

Image

Out of curiosity, how does this work? It seems quite complicated to write a regex that reliably solves all such cases, and it's a bit of a chicken-and-egg problem to build a CSV parser that works on these kinds of strings if the quote: true option is enabled.

If I want to gracefully handle unescaped / unmatched quotes in the middle of a cell value, what options do I have? I really appreciate your advice!

For reference, right now I'm doing something like this:

import { Readable, Transform, TransformCallback } from 'stream';

function replaceEmbeddedQuotes(
    readStream: Readable,
): Readable {
    // Create a transform stream to process the data
    const transformer = new Transform({
        objectMode: true,
        transform(
            chunk: Buffer | string,
            encoding: BufferEncoding,
            callback: TransformCallback
        ) {
            // Convert chunk to string if it's a buffer
            // (note: a chunk is not guaranteed to be exactly one line)
            const line = chunk instanceof Buffer ? chunk.toString() : chunk;

            // Double any quote that has whitespace on both sides,
            // as a heuristic for "quote in the middle of a field"
            const processedLine = line.replaceAll(/(\s)"(\s)/g, '$1""$2');

            // Push the processed chunk to the output stream
            this.push(processedLine);
            callback();
        },
    });

    // Pipe the input stream through our transformer
    return readStream.pipe(transformer);
}

I then pass that readable to csv-parse, but this doesn't handle some cases, such as an extra quote that is not surrounded by whitespace on both sides. I assume there are also cases where a valid closing quote is surrounded by whitespace on both sides, like:
"a " , "b","c"
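
To make those failure modes concrete, here is a small sketch that applies the same pattern to plain strings. The helper name `doubleSpacedQuotes` and the sample inputs are made up for illustration; only the regex comes from the snippet above.

```typescript
// Hypothetical helper: the same replacement as in the transform, on a plain string.
function doubleSpacedQuotes(line: string): string {
  return line.replace(/(\s)"(\s)/g, '$1""$2');
}

// False positive: the valid closing quote after `a ` has whitespace on both
// sides, so it gets doubled and the field no longer closes where it should.
console.log(doubleSpacedQuotes('"a " , "b","c"'));
// → "a "" , "b","c"

// False negative: the embedded quotes touch a letter on one side,
// so nothing is rewritten.
console.log(doubleSpacedQuotes('"say "hi" now","x"'));
// → "say "hi" now","x"

// Overlap: the first match consumes the space that the second quote would
// need in front of it, so only the first of two adjacent quotes is doubled.
console.log(doubleSpacedQuotes('a " " b'));
// → a "" " b
```

The third case could probably be avoided with lookarounds instead of capture groups, but the first two are inherent to deciding on local whitespace alone.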

@wdavidw
Member

wdavidw commented Jan 16, 2025

I don't have much time to dig into your quote, but we are not using regular expressions. Instead, we parse bytes one by one and maintain a state of what we have seen so far. I don't have all the implementation details in memory, but whitespace is supported around quotes and delimiters; see trim. A quick search reveals there is a test for it, "with whitespaces around quotes", in test/options.ltrim.coffee.
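
For readers wondering what "parse one by one and maintain a state" can look like, below is a minimal sketch of a lenient single-record parser. It is not csv-parse's implementation and not Excel's (neither is described in this thread). The function name `parseLenientLine`, the three states, and the rule "a quote closes a field only if the next non-blank character is the delimiter or the end of the line" are all assumptions chosen for illustration. It works on characters of one line rather than on bytes of a stream.

```typescript
// Hypothetical sketch, not csv-parse's code: a lenient one-record parser.
type State = "fieldStart" | "unquoted" | "quoted";

function parseLenientLine(line: string, delimiter = ","): string[] {
  const fields: string[] = [];
  let field = "";
  let state: State = "fieldStart";
  let i = 0;

  // Index of the first character at or after `from` that is not a space or tab.
  const skipBlanks = (from: number): number => {
    let j = from;
    while (j < line.length && (line[j] === " " || line[j] === "\t")) j++;
    return j;
  };

  while (i < line.length) {
    const ch = line[i];

    if (state === "fieldStart") {
      const j = skipBlanks(i);
      if (line[j] === '"') {
        state = "quoted"; // opening quote; blanks before it are dropped
        i = j + 1;
      } else {
        state = "unquoted"; // re-read the same character as field content
      }
      continue;
    }

    if (state === "unquoted") {
      if (ch === delimiter) {
        fields.push(field);
        field = "";
        state = "fieldStart";
      } else {
        field += ch; // quotes inside an unquoted field are kept literally
      }
      i++;
      continue;
    }

    // state === "quoted"
    if (ch !== '"') {
      field += ch;
      i++;
    } else if (line[i + 1] === '"') {
      field += '"'; // properly escaped quote ("")
      i += 2;
    } else {
      const j = skipBlanks(i + 1);
      if (j >= line.length) {
        fields.push(field); // closing quote at end of line
        return fields;
      }
      if (line[j] === delimiter) {
        fields.push(field); // closing quote, blanks after it are dropped
        field = "";
        state = "fieldStart";
        i = j + 1;
      } else {
        field += '"'; // stray quote in the middle of the value
        i++;
      }
    }
  }

  // End of input: flush the last field (this also covers a trailing
  // delimiter and an unterminated quoted field).
  fields.push(field);
  return fields;
}

console.log(parseLenientLine('"a " , "b","c"'));
// → [ 'a ', 'b', 'c' ]
console.log(parseLenientLine('"say "hi" now","x"'));
// → [ 'say "hi" now', 'x' ]
```

The heuristic is still ambiguous: in `"he said "no", then left",x` the quote after `no` is followed by a delimiter, so this sketch would end the field there. Any lenient rule has to guess in cases like that, which is one reason strict parsers reject such input by default.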
