Skip to content

Commit

Permalink
BUG: include files named \s+ with -p/--pathpy or --pathlib (fixes #24)
Browse files Browse the repository at this point in the history
  • Loading branch information
westurner committed Apr 30, 2016
1 parent b4a3ec7 commit 6c3f658
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions pyline/pyline.py
Original file line number Diff line number Diff line change
Expand Up @@ -324,9 +324,9 @@ def splitfunc(obj):
rgx = _rgx and _rgx.match(line) or None

p = path = None
if path_tools_pathpy or path_tools_pathlib and line.rstrip():
if path_tools_pathpy or path_tools_pathlib:
try:
p = path = Path(line.strip()) or None
p = path = Path(line) or None
except Exception as e:
log.exception(e)
pass
Expand Down

2 comments on commit 6c3f658

@westurner
Copy link
Owner Author

@westurner westurner commented on 6c3f658 Feb 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bugfix causes a regression for what are probably the most frequent uses of the path options and may require multiple additional commandline options to fix.

  • A filename may consist of only spaces (\s); which .rstrip() strips
  • A filename may contain newlines (\n, \r\n) and carriage returns (\r); which .rstrip() strips

For example:

$ echo -e 'README.rst\n' | pyline -m path 'str(__import__("path").Path(l.rstrip()).exists())'
True

$ echo -e 'README.rst\n' | pyline -m path 'str(__import__("path").Path(l).exists())'
False
$ echo -e 'README.rst\n' | pyline -m path 'str(p.exists())'
False

The problem fixed here is/was that line.rstrip() strips all trailing '\r', '\n', ' ';
which reduces files named, for example, \n\n\n, (\s\s\s), and \r\n\r\n (which are valid file names, by the way) to the empty string.

In most cases, it seemed safe to assume that -- for the --pathpy and --pathlib options which populate path = p = Path(line OR line.rstrip()) -- stripping the trailing newline from a line is a safe thing to do. It's the most convenient thing to do; and it may or may not be safe to assume that, in this case, that's what the user expects: a file of records delimited by either newline or carriage-return-newline.

So, at first, the simplest solution seems to be to write an rstrip_one() which strips only one \n. And then, for DOS line endings instead strip only one \r\n two-character line-ending combination. However, because pyline is a stream-processing tool, auto-detecting the line ending from the first line from a list of files which may or may not contain newlines \n and may or may not end in \n (Unix) or \r\n (DOS) may be dangerous because valid filenames would cause line-ending detection to fail unless explicitly specified:

# This would be autodetected as a Unix line ending
one\n1\r\ntwo\n2\r\n
# This would be autodetected as a DOS line ending
one 1\r\ntwo 2\r\n

I think the correct solution is:

  • to add a --dos-eol option which explicitly sets the EOL (end-of-line) to \r\n;
  • default to \n: specify that as --unix-eol (default:True);
  • and write rstrip_one() so that files consisting of all spaces, carriage returns, or newlines aren't reduced to empty string for the path options.

See: #28

@westurner
Copy link
Owner Author

@westurner westurner commented on 6c3f658 Feb 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and it may or may not be safe to assume that, in this case, that's what the user expects: a file of records delimited by either newline or carriage-return-newline.

This is the trouble:

  • EOL sequences within a quoted string MAY NOT be considered field or record delimiters.
  • EOL sequences within a quoted string SHOULD NOT be considered field or record delimiters.
  • EOL sequences within a quoted string MUST NOT be considered field or record delimiters.

Pyline already has a --shlex option which does words = w = shlex.split(line). This may be dangerous if the input data contains newlines within the quoted fields because pyline (and many/most? other line-based tools tools) don't differentiate between quoted and non-quoted EOL sequences which serve as record delimiters.

https://en.wikipedia.org/wiki/Newline

This is the actual trouble: newlines are control characters which are also presentational characters. Naieve line-based parsers (such as Pyline v0.3.x) do not differentiate between EOL line feed control characters within quoted strings and EOL record delimiters.

See: #28

Please sign in to comment.