Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix macOS/BSD incompatibility in general:check-filenames task #423

Merged
merged 1 commit into from
Dec 5, 2023

Conversation

per1234
Copy link
Contributor

@per1234 per1234 commented Dec 2, 2023

The ""Check Files" (Task)" template includes an asset task named general:check-filenames that checks for the presence of non-portable filenames in the project.

Ironically, the task itself was non-portable. The problem was that it used the --perl-regexp flag in the grep command. This flag is not supported by the BSD version of grep used on macOS and BSD machines. This caused the task to fail spuriously with grep: unrecognized option '--perl-regexp' errors when ran on a macOS or BSD machine.

The incompatibility is resolved by changing the --perl-regexp flag to --extended-regexp. This flag, which is supported by the BSD and GNU versions of grep, allows the use of the modern and reasonable capable POSIX ERE syntax on all platforms.

Unfortunately the regular expression used in the previous command relied on one of the additional features only present in the PCRE syntax so the previous pattern can not be used in combination with the --extended-regexp flag. The PCRE-specific syntax was used to check for the presence of a range of characters prohibited by the Windows filename specification:

https://learn.microsoft.com/en-us/windows/win32/fileio/naming-a-file#naming-conventions

Use any character [...] except for the following:

  • Integer value zero, sometimes referred to as the ASCII NUL character.
  • Characters whose integer representations are in the range from 1 through 31

Due to the nature of these characters, they must be represented by code in the regular expression. This was done using the \x{hhh..} syntax supported by PCRE. Neither that syntax nor any of the equivalent escape patterns are supported by POSIX ERE. A solution is offered in the GNU grep documentation:

https://www.gnu.org/software/grep/manual/grep.html#Matching-Non_002dASCII-and-Non_002dprintable-Characters

the command grep "$(printf '\316\233\t\317\211\n')" is a portable albeit hard-to-read alternative

So this approach can be used to represent the character range using octal codes (\000-\037):

https://www.gnu.org/software/coreutils/manual/html_node/printf-invocation.html#printf-invocation:~:text=printf%20interprets%20%E2%80%98%5Cooo%E2%80%99%20in%20format%20as%20an%20octal%20number

As also mentioned in the grep manual:

none of these techniques will let you put a null character directly into a command-line pattern

$ echo "foo" | grep --extended-regexp --regexp='([<>:"/\\|?*'"$(printf "\000-\037")"'])|(.+\.$)'

bash: warning: command substitution: ignored null byte in input
grep: Invalid range end

So the range of characters in the pattern can not include NUL. However, it turns out that even the previous command did not detect this character in filenames although it was included in the regular expression pattern, so this limitation doesn't result in any regression in practice.

Alternative solution

I also considered the alternative of using perl instead of grep. This approach allows the use of the PCRE syntax even on macOS/BSD machines.

This command is equivalent to the previous grep command, with similar performance to the grep --extended-regexp command proposed in this PR:

perl \
  -e \
  '
    chomp($_);
    if ($_ =~ /([<>:"\/\\|?*\x{0000}-\x{001F}])|(.+\.$)/) {
      exit 0;
    };
    exit 1;
  '
  -n

the full command using this approach:

find . \
  -type d -name '.git' -prune -o \
  -type d -name '.licenses' -prune -o \
  -type d -name '__pycache__' -prune -o \
  -type d -name 'node_modules' -prune -o \
  -exec \
    sh \
      -c \
        ' \
          basename "$0" | \
            perl \
              -e \
              '"'"'
                chomp($_);
                if ($_ =~ /([<>:"\/\\|?*\x{0000}-\x{001F}])|(.+\.$)/) {
                  exit 0;
                };
                exit 1;
              '"'"' \
              -n \
          && \
          echo "$0"
        ' \
      '{}' \
    \; \
  -execdir \
    false \
    '{}' \
    + \
|| \
{
  echo
  echo "Prohibited characters found in filenames"
  echo "See:"
  echo "https://learn.microsoft.com/en-us/windows/win32/fileio/naming-a-file#naming-conventions:~:text=except%20for%20the%20following"
  false
}

I am reluctant to introduce the use of perl into the assets because I feel that the maintainers of the assets and the projects they are used in are less likely to have existing familiarity with perl that with grep and also will not likely benefit much from the effort of gaining a familiarity with perl.

The "Check Files" (Task) template includes an asset task named `general:check-filenames` that checks for the presence of
non-portable filenames in the project.

Ironically, the task itself was non-portable. The problem was that it used the `--perl-regexp` flag in the `grep`
command. This flag is not supported by the BSD version of grep used on macOS and BSD machines. This caused the task to
fail spuriously with `grep: unrecognized option '--perl-regexp'` errors when ran on a macOS or BSD machine.

The incompatibility is resolved by changing the `--perl-regexp` flag to `--extended-regexp`. This flag, which is
supported by the BSD and GNU versions of grep, allows the use of the modern and reasonable capable POSIX ERE syntax on
all platforms.

Unfortunately the regular expression used in the previous command relied on one of the additional features only present
in the PCRE syntax. This syntax was used to check for the presence of a range of characters prohibited by the Windows
filename specification:

https://learn.microsoft.com/en-us/windows/win32/fileio/naming-a-file#naming-conventions

> Use any character [...] except for the following:
> - Integer value zero, sometimes referred to as the ASCII NUL character.
> - Characters whose integer representations are in the range from 1 through 31

Due to the nature of these characters, they must be represented by code in the regular expression. This was done using
the `\x{hhh..}` syntax supported by PCRE. Neither that syntax nor any of the equivalent escape patterns are supported by
POSIX ERE. A solution is offered in the GNU grep documentation:

https://www.gnu.org/software/grep/manual/grep.html#Matching-Non_002dASCII-and-Non_002dprintable-Characters

> the command `grep "$(printf '\316\233\t\317\211\n')"` is a portable albeit hard-to-read alternative

As also mentioned there:

> none of these techniques will let you put a null character directly into a command-line pattern

So the range of characters in the pattern can not include NUL. However, it turns out that even the previous command did
not detect this character although it was present by the pattern. So this limitation doesn't result in any regression in
practice.
@per1234 per1234 added type: imperfection Perceived defect in any part of project topic: code Related to content of the project itself labels Dec 2, 2023
@per1234 per1234 self-assigned this Dec 2, 2023
@per1234 per1234 merged commit 76df0e6 into arduino:main Dec 5, 2023
47 checks passed
@per1234 per1234 deleted the portable-grep branch December 5, 2023 17:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic: code Related to content of the project itself type: imperfection Perceived defect in any part of project
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants