BUG: include files named \s+ with -p/--pathpy or --pathlib (fixes #24)
Showing 1 changed file with 2 additions and 2 deletions.
6c3f658
This bugfix causes a regression for what are probably the most frequent uses of the path options and may require multiple additional commandline options to fix.
The characters involved are whitespace (`\s`), newlines (`\n`, `\r\n`), and carriage returns (`\r`), all of which `.rstrip()` strips.

For example:
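As a minimal sketch (not the example originally posted in this comment), the snippet below shows what `str.rstrip()` with no arguments does to a few input lines. The file names in `lines` are hypothetical and chosen only to illustrate the behaviour.

```python
# Hypothetical input lines, as a line-based reader might yield them.
lines = ["notes.txt\n", "   \n", "\n\n\n", "\r\n\r\n"]

# str.rstrip() with no arguments removes ALL trailing whitespace,
# not just a single end-of-line sequence.
stripped = [line.rstrip() for line in lines]

for line, name in zip(lines, stripped):
    print(repr(line), "->", repr(name))
```

Only the first name survives; the names made up entirely of spaces, newlines, or carriage returns all collapse to the empty string.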
The problem fixed here is that `line.rstrip()` strips all trailing whitespace, including `'\r'`, `'\n'`, and `' '`, which reduces files named, for example, `\n\n\n`, `\s\s\s`, and `\r\n\r\n` (which are valid file names, by the way) to the empty string.

In most cases, it seemed safe to assume that, for the `--pathpy` and `--pathlib` options, which populate `path = p = Path(line OR line.rstrip())`, stripping the trailing newline from a line is a safe thing to do. It's the most convenient thing to do; and it may or may not be safe to assume that, in this case, that's what the user expects: a file of records delimited by either newline or carriage-return-newline.

So, at first, the simplest solution seems to be to write an `rstrip_one()` which strips only one `\n` and, for DOS line endings, instead strips only one `\r\n` two-character line-ending combination. However, because pyline is a stream-processing tool, auto-detecting the line ending from the first line of a list of files may be dangerous: the file names may or may not contain newlines (`\n`), and the lines may or may not end in `\n` (Unix) or `\r\n` (DOS), so valid filenames would cause line-ending detection to fail unless the line ending is explicitly specified.

I think the correct solution is:

- add a `--dos-eol` option which explicitly sets the EOL (end-of-line) to `\r\n`;
- keep `\n` as the default EOL: specify that as `--unix-eol` (default: True);
- use `rstrip_one()` so that file names consisting entirely of spaces, carriage returns, or newlines aren't reduced to the empty string for the path options.

See: #28
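The proposal above might be sketched roughly as follows. Only the names `rstrip_one`, `--dos-eol`, and `--unix-eol` come from this comment; the `eol` parameter, the `argparse` wiring, and the `dest="eol"` attribute are assumptions made for illustration and are not pyline's actual code.

```python
import argparse


def rstrip_one(line, eol="\n"):
    """Remove at most one trailing `eol` sequence from `line` (hypothetical helper)."""
    if eol and line.endswith(eol):
        return line[:-len(eol)]
    return line


# Assumed option wiring: both flags write to the same `eol` attribute,
# and "\n" (Unix) is the default when neither flag is given.
parser = argparse.ArgumentParser()
parser.add_argument("--unix-eol", dest="eol", action="store_const",
                    const="\n", default="\n")
parser.add_argument("--dos-eol", dest="eol", action="store_const",
                    const="\r\n", default="\n")

unix_opts = parser.parse_args([])
dos_opts = parser.parse_args(["--dos-eol"])

# A name made only of spaces survives; only the final EOL is removed.
print(repr(rstrip_one("   \n", unix_opts.eol)))
# With the DOS EOL, one "\r\n" pair is removed and the rest is kept.
print(repr(rstrip_one("\r\n\r\n", dos_opts.eol)))
```

Because the EOL is passed in explicitly rather than detected from the first line, a file name that itself ends in `\r` or `\n` cannot mislead the stripping step.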
6c3f658
This is the trouble:
Pyline already has a `--shlex` option which does `words = w = shlex.split(line)`. This may be dangerous if the input data contains newlines within quoted fields, because pyline (and many, perhaps most, other line-based tools) doesn't differentiate between quoted and non-quoted EOL sequences, and treats both as record delimiters.

https://en.wikipedia.org/wiki/Newline

This is the actual trouble: newlines are control characters which are also presentational characters. Naive line-based parsers (such as Pyline v0.3.x) do not differentiate between EOL line feed control characters within quoted strings and EOL record delimiters.
See: #28
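The quoted-newline problem can be illustrated with a short sketch. The sample record in `data` and the helper `try_shlex` are hypothetical; only `shlex.split` is taken from the comment above, and this is not pyline's actual reading loop.

```python
import shlex

# One logical record whose quoted field contains a newline (hypothetical data).
data = 'one "two\nthree" four\n'

# A line-based reader treats the quoted "\n" as a record delimiter,
# so the single record arrives as two physical lines.
physical_lines = data.splitlines()


def try_shlex(line):
    """Return the shlex tokens for `line`, or the ValueError that was raised."""
    try:
        return shlex.split(line)
    except ValueError as exc:
        return exc


# Each half contains an unbalanced quote, so shlex cannot parse it.
per_line = [try_shlex(line) for line in physical_lines]

# Given the whole record at once, shlex keeps the newline inside the field.
whole = shlex.split(data)

print(physical_lines)
print(per_line)
print(whole)
```

Splitting on EOL before tokenizing loses the information about which newlines were quoted, which is why the two halves fail while the whole record parses.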