Improved column separator detection by ignoring quoted sections #276
…quoted occurrences of the delimiter
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff            @@
##              main     #276   +/-   ##
=========================================
  Coverage   100.00%  100.00%
=========================================
  Files           11       11
  Lines          380      382    +2
=========================================
+ Hits           380      382    +2
lib/smarter_csv/auto_detection.rb
Outdated
@@ -19,7 +19,10 @@ def guess_column_separator(filehandle, options)
   count.times do
     line = readline_with_counts(filehandle, options)
     delimiters.each do |d|
       candidates[d] += line.scan(d).count
       # Count only non-quoted occurrences of the delimiter
       non_quoted_text = line.split(/"[^"]*"|'[^']*'/).join
@nicastelo `:quote_char` can be passed in by the user. It would be better if this used `options[:quote_char]`.
@tilo Oops, missed that. Implemented the change, thank you!
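To illustrate the point raised in this thread, here is a minimal sketch of stripping quoted sections with a configurable quote character instead of hard-coding `"` and `'`. The helper name `strip_quoted_sections` is hypothetical and not part of SmarterCSV; the sketch also assumes `quote_char` is a single character.

```ruby
# Hypothetical helper (not SmarterCSV's API): removes quoted sections from a
# line, using a caller-supplied quote character. Assumes a single-character
# quote_char.
def strip_quoted_sections(line, quote_char = '"')
  q = Regexp.escape(quote_char)
  # Opening quote, any run of non-quote characters, closing quote.
  line.gsub(/#{q}[^#{q}]*#{q}/, '')
end

options = { quote_char: "'" }
line = "'Smith, John';42;'Sales; EMEA'"

unquoted = strip_quoted_sections(line, options[:quote_char])
puts unquoted            # ;42;
puts unquoted.count(';') # 2
puts unquoted.count(',') # 0
```

With the quote character taken from the options, a file quoted with `'` only is handled without also treating `"` as a quote.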
@nicastelo sorry for the delay - I'll have a look at it this week
it 'does not detect separators that are between quotes' do
  data = SmarterCSV.process(
    "#{fixture_path}/separator_chars_between_quotes_no_headers.csv",
    options.merge(user_provided_headers: %w[Name Age Job Department Project])
Suggested change:
- options.merge(user_provided_headers: %w[Name Age Job Department Project])
+ options.merge(headers_in_file: false, user_provided_headers: %w[Name Age Job Department Project])
important to add `headers_in_file: false`, otherwise the first line will be ignored
looks good! 👍
Summary:
This pull request improves how SmarterCSV guesses the column separator (delimiter) of a CSV file. Previously, the method guess_column_separator simply counted every occurrence of each candidate delimiter (comma, tab, semicolon, colon, pipe) without considering context. Delimiter characters inside quoted fields were therefore counted as if they were real separators, which could cause the wrong delimiter to be chosen. The updated logic ignores delimiters found within quoted sections, so the counts reflect only actual field separators.
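The difference between the two counting strategies can be sketched as follows. This is an illustration only, not SmarterCSV's implementation: `DELIMITERS`, `count_delimiters`, `guess_separator`, and the `ignore_quoted` flag are hypothetical names, and the sample lines are made up.

```ruby
# Illustrative sketch with hypothetical names; not SmarterCSV's actual code.
DELIMITERS = [',', "\t", ';', ':', '|'].freeze

# Counts candidate delimiters across the sample lines, optionally removing
# quoted sections first. Assumes a single-character quote_char.
def count_delimiters(lines, quote_char: '"', ignore_quoted: true)
  q = Regexp.escape(quote_char)
  quoted = /#{q}[^#{q}]*#{q}/
  counts = Hash.new(0)
  lines.each do |line|
    text = ignore_quoted ? line.gsub(quoted, '') : line
    DELIMITERS.each { |d| counts[d] += text.count(d) }
  end
  counts
end

# Picks the candidate with the highest count.
def guess_separator(lines, **opts)
  count_delimiters(lines, **opts).max_by { |_, n| n }.first
end

lines = [
  %("Doe, Jane, Dr.";34;"Engineer, Senior"),
  %("Roe, Richard, Jr.";51;"Manager, Regional")
]

puts guess_separator(lines, ignore_quoted: false).inspect # ","
puts guess_separator(lines).inspect                       # ";"
```

In this sample the naive count sees six commas against four semicolons and picks the comma, while the quote-aware count sees no commas at all and picks the semicolon.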
Changes: