Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix row_count always being 1 when \r used as lineTerminator #6

Merged
merged 1 commit into from
Dec 12, 2024

Conversation

kspurgin
Copy link
Collaborator

@kspurgin kspurgin commented Dec 12, 2024

This PR fixes the fact that row_count was always reported as 1 for CSVs using \r as lineTerminator, even if the dialect explicitly set lineTerminator to \r.

The cause of this issue was the use of #each_line method to split streams/files into lines sent to #parse_line. With no arguments given, #each_linereads lines as determined by line separator $/.1 By default, $/ is set to \n.2

This results in the entire stream/file being sent to #parse_line, which was explicitly only treating \n as a line break.

This was fixed by:

  1. Making all places that checked for line breaks by matching only on \n check for either \n or \r

  2. Deriving the correct $/ value prior to calling #each_line, and then resetting $/ to the default value after the #each_line block is closed.

The latter is currently accomplished by:

  • Using the dialect lineTerminator value if set to an actual String value (e.g. not :auto)
  • Otherwise, get the entire @source as a String. If this includes \n, set $/ to \n. Otherwise, set it to \r

The second step has 2 problems:

  • Performance on gigantic files
  • Potential to blow up or cause other problems if a quoted field value contains \n or \r\n, but the file's actual lineTerminator is \r

I am going ahead with this implementation for now because:

  • Our use case generally should not involve CSV so large that the first is an issue
  • The second just opens up an abyss of pain for a somewhat unlikely use case that isn't working correctly prior to this change, either

I have an idea for an approach that would address both of those issues, but it gets me halfway to writing a CSV parser from scratch and I ain't got time for that at the moment.

Footnotes

  1. https://docs.ruby-lang.org/en/master/IO.html#method-i-each_line

  2. https://docs.ruby-lang.org/en/master/globals_rdoc.html#label-24-2F+-28Input+Record+Separator-29

@kspurgin kspurgin merged commit 26d071e into main Dec 12, 2024
35 checks passed
@kspurgin kspurgin deleted the cr-eol-row-count branch December 12, 2024 23:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant