-
-
Notifications
You must be signed in to change notification settings - Fork 135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Patch issue 70 #121
base: master
Are you sure you want to change the base?
Patch issue 70 #121
Conversation
Any news or thoughts about this? |
My apologies @czaefferer, I have not been able to find the time to review this properly. Hopefully soon. |
This comment has been minimized.
This comment has been minimized.
joe,sam,ja"n | ||
joe,sam,"ja"n" | ||
joe,"sa | ||
"m",jan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test here should actually fail. each line in a csv file indicates a row, and a newline indicates the start of a new row. the result in the snapshot isn't accurate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you're right there, this is from the RFC 4180 (https://tools.ietf.org/html/rfc4180):
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
As I understand this, a newline within a quoted field does not indicate the start of a new line (I wish it would, that would have saved me so much trouble in the past).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm you may be right there. I'm giving things a big refresh in order to try and alleviate the pain of working through tokenizing things like this, remove some complication. hang tight and I'll follow up when I have that work done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of curious note is this bit:
7. If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
That would seem to indicate that joe,sam,"ja"n"
is invalid, would it not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it should be invalid. However, if a single double-quote is found which isn't followed by a separator or a newline, it can be identified as invalid, and could then be treated like any other character. This does indeed take the standard a bit more lax, but then again, so do a lot of other options like configurable separators, quote- and escape-characters. To me it would be totally fine if this behaviour would need to be enabled by configuration. But considering how often I see this intentionally used in real world files, I think this is a far better solution than to invalidate the whole record (which, given the alternative, would not be acceptable in my use-cases).
Right now however, the parser does neither invalidate rows nor treat misplaced double-quotes as regular text, usually it merges two rows, but I've also seen it merge 8K out of 40K rows due to a single misplaced double-quote.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it certainly doesn't follow the spec at present. I'm working on a proper tokenizer for this that'll make implementing parts of the spec far easier. I work with the PostCSS folks quite a bit and maintain the postcss-less, postcss-values-parser modules which both follow the PostCSS tokenizer model and it's amazing how much easier it is to work with versus an ad-hoc parser like what's in the module at present
FWIW I confirmed that Excel does export as |
This PR contains:
Breaking Changes?
Unescaped quotes that are not preceded or succeeded by a separator or Linebreak are handled like any other character.
Please Describe Your Changes
fixes issue #70