Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved column separator detection by ignoring quoted sections #276

Merged
merged 4 commits into from
Jul 10, 2024

Conversation

nicastelo
Copy link
Contributor

@nicastelo nicastelo commented Mar 14, 2024

Summary:

This pull request enhances the logic used to determine the column separator (delimiter) in CSV files processed by our system. Previously, the method guess_column_separator simply counted occurrences of potential delimiters (such as commas, tabs, semicolons, colons, and pipes) without considering their context. This could lead to misidentification, especially when non-delimiter characters within quoted fields were mistaken for actual delimiters. The updated logic now intelligently ignores delimiters found within quoted sections, leading to more accurate delimiter detection.

Changes:

  • Modified the guess_column_separator method to exclude text within quotes when counting delimiter occurrences.
  • Added regex-based splitting to remove quoted sections before counting delimiter occurrences in each line.
  • Added tests

@nicastelo nicastelo closed this Mar 14, 2024
@nicastelo nicastelo reopened this Mar 14, 2024
@nicastelo nicastelo marked this pull request as ready for review March 14, 2024 17:48
@nicastelo nicastelo changed the title Improved column separator detection by Ignoring quoted sections Improved column separator detection by ignoring quoted sections Mar 14, 2024
@nicastelo nicastelo closed this Mar 14, 2024
@nicastelo nicastelo reopened this Mar 14, 2024
Copy link

codecov bot commented Mar 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (07160c1) to head (b812d03).

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #276   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           11        11           
  Lines          380       382    +2     
=========================================
+ Hits           380       382    +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -19,7 +19,10 @@ def guess_column_separator(filehandle, options)
count.times do
line = readline_with_counts(filehandle, options)
delimiters.each do |d|
candidates[d] += line.scan(d).count
# Count only non-quoted occurrences of the delimiter
non_quoted_text = line.split(/"[^"]*"|'[^']*'/).join
Copy link
Owner

@tilo tilo Mar 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nicastelo :quote_char can be passed in by the user.
It would be better if this would use options[:quote_char]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tilo Oops, missed that. Implemented the change, thank you!

@nicastelo nicastelo requested a review from tilo March 26, 2024 22:43
@tilo
Copy link
Owner

tilo commented Jul 10, 2024

@nicastelo sorry for the delay - I'll have a look at it this week

it 'does not detect separators that are between quotes' do
data = SmarterCSV.process(
"#{fixture_path}/separator_chars_between_quotes_no_headers.csv",
options.merge(user_provided_headers: %w[Name Age Job Department Project])
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
options.merge(user_provided_headers: %w[Name Age Job Department Project])
options.merge(headers_in_file: false, user_provided_headers: %w[Name Age Job Department Project])

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

important to add headers_in_file: false, otherwise the first line will be ignored

@tilo
Copy link
Owner

tilo commented Jul 10, 2024

looks good! 👍

@tilo tilo merged commit f8048f0 into tilo:main Jul 10, 2024
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants