update regex doc page re strmatch/strmatchx
johnkerl committed Dec 19, 2023
1 parent f4f8137 commit 1b14ef9
Showing 2 changed files with 186 additions and 2 deletions.
96 changes: 95 additions & 1 deletion docs/src/reference-main-regular-expressions.md
@@ -61,7 +61,7 @@ name=jane,regex=^j.*e$
name=bull,regex=^b[ou]ll$
</pre>

## Regex captures
## Regex captures for the `=~` operator

Regex captures of the form `\0` through `\9` are supported as follows:

@@ -145,6 +145,100 @@ false
"\1:\2"
</pre>

## The `strmatch` and `strmatchx` DSL functions

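The `=~` and `!=~` operators have been in Miller for a long time, and they will continue to be
supported. They do, however, have some deficiencies: captures are limited to `\1..\9`, and they are
held in (function-scoped) globals rather than returned as a value. As of Miller 6.11, the `strmatch`
and `strmatchx` functions provide more robust ways to do matching and capturing.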

First, some examples.

The `strmatch` function only returns a boolean result, and it doesn't set `\0..\9`:

<pre class="pre-highlight-in-pair">
<b>mlr repl</b>
</pre>
<pre class="pre-non-highlight-in-pair">

[mlr] strmatch("abc", "....")
false

[mlr] strmatch("abc", "...")
true

[mlr] strmatch("abc", "(.).(.)")
true

[mlr] strmatch("[ab:3458]", "([a-z]+):([0-9]+)")
true
</pre>

The `strmatchx` function also doesn't set `\0..\9`, but returns a map-valued result:

<pre class="pre-highlight-in-pair">
<b>mlr repl</b>
</pre>
<pre class="pre-non-highlight-in-pair">

[mlr] strmatchx("abc", "....")
{
"matched": false
}

[mlr] strmatchx("abc", "...")
{
"matched": true,
"full_capture": "abc",
"full_start": 1,
"full_end": 3
}

[mlr] strmatchx("abc", "(.).(.)")
{
"matched": true,
"full_capture": "abc",
"full_start": 1,
"full_end": 3,
"captures": ["a", "c"],
"starts": [1, 3],
"ends": [1, 3]
}

[mlr] "[ab:3458]" =~ "([a-z]+):([0-9]+)"
true

[mlr] "\1"
"ab"

[mlr] "\2"
"3458"

[mlr] strmatchx("[ab:3458]", "([a-z]+):([0-9]+)")
{
"matched": true,
"full_capture": "ab:3458",
"full_start": 2,
"full_end": 8,
"captures": ["ab", "3458"],
"starts": [2, 5],
"ends": [3, 8]
}
</pre>

Notes:

* When there is no match, the result from `strmatchx` only has the `"matched":false` key/value pair.
* When there is a match with no captures, the result from `strmatchx` has the `"matched":true` key/value pair,
as well as `full_capture` (taking the place of `\0` set by `=~`), and `full_start` and `full_end`
which `=~` does not offer.
* When there is a match with captures, the result from `strmatchx` also has the `captures` array
whose slots 1, 2, 3, ... are the same as would have been set by `=~` via `\1, \2, \3, ...`.
However, `strmatchx` offers an arbitrary number of captures, not just `\1..\9`.
Additionally, the `starts` and `ends` arrays give the 1-up start and end positions of each capture in the input string.
* Since you hold the return value from `strmatchx`, you can operate on it as you wish --- instead of
relying on the (function-scoped) globals `\0..\9`.
* The price paid is that using `strmatchx` does indeed tend to take more keystrokes than `=~`.
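
The index conventions in the notes above can be sketched in Go, whose `regexp` package is the engine Miller's regexes are built on. This is only an illustration, not Miller's implementation: the `MatchInfo` type and the `matchX` helper are hypothetical names, and the sketch works in byte offsets, which coincide with character positions only for ASCII input. It converts the 0-up, end-exclusive offsets from `FindStringSubmatchIndex` into the 1-up, inclusive positions seen in the `strmatchx` examples.

```go
package main

import (
	"fmt"
	"regexp"
)

// MatchInfo is a hypothetical struct mirroring the keys of the map that
// strmatchx returns; it is not a type taken from Miller's source code.
type MatchInfo struct {
	Matched     bool
	FullCapture string
	FullStart   int // 1-up, inclusive
	FullEnd     int // 1-up, inclusive
	Captures    []string
	Starts      []int
	Ends        []int
}

// matchX is an illustrative helper (an assumption, not Miller's code) that
// reports the leftmost match of pattern in input using 1-up positions.
func matchX(input, pattern string) (MatchInfo, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return MatchInfo{}, err
	}
	// loc holds 0-up, end-exclusive byte offsets:
	// [fullStart, fullEnd, group1Start, group1End, ...], or nil if no match.
	loc := re.FindStringSubmatchIndex(input)
	if loc == nil {
		return MatchInfo{}, nil // Matched is false; all other fields stay empty
	}
	info := MatchInfo{
		Matched:     true,
		FullCapture: input[loc[0]:loc[1]],
		FullStart:   loc[0] + 1, // 0-up start becomes 1-up start
		FullEnd:     loc[1],     // exclusive 0-up end equals inclusive 1-up end
	}
	for i := 2; i+1 < len(loc); i += 2 {
		if loc[i] < 0 {
			// The group did not participate in the match; using -1 here is
			// this sketch's own placeholder convention.
			info.Captures = append(info.Captures, "")
			info.Starts = append(info.Starts, -1)
			info.Ends = append(info.Ends, -1)
			continue
		}
		info.Captures = append(info.Captures, input[loc[i]:loc[i+1]])
		info.Starts = append(info.Starts, loc[i]+1)
		info.Ends = append(info.Ends, loc[i+1])
	}
	return info, nil
}

func main() {
	for _, pattern := range []string{"....", "...", "(.).(.)"} {
		info, err := matchX("abc", pattern)
		if err != nil {
			fmt.Println("bad regex:", err)
			continue
		}
		fmt.Printf("%q -> %+v\n", pattern, info)
	}
}
```

Because the result is an ordinary value, the caller can keep several match results side by side, which is the same advantage the notes attribute to holding the return value of `strmatchx`.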

## More information

Regular expressions are those supported by the [Go regexp package](https://pkg.go.dev/regexp), which in turn are of type [RE2](https://github.com/google/re2/wiki/Syntax) except for `\C`:
92 changes: 91 additions & 1 deletion docs/src/reference-main-regular-expressions.md.in
@@ -36,7 +36,7 @@ GENMD-RUN-COMMAND
mlr filter '$name =~ $regex' data/regex-in-data.dat
GENMD-EOF

## Regex captures
## Regex captures for the `=~` operator

Regex captures of the form `\0` through `\9` are supported as follows:

@@ -118,6 +118,96 @@ false
"\1:\2"
GENMD-EOF

## The `strmatch` and `strmatchx` DSL functions

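The `=~` and `!=~` operators have been in Miller for a long time, and they will continue to be
supported. They do, however, have some deficiencies: captures are limited to `\1..\9`, and they are
held in (function-scoped) globals rather than returned as a value. As of Miller 6.11, the `strmatch`
and `strmatchx` functions provide more robust ways to do matching and capturing.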

First, some examples.

The `strmatch` function only returns a boolean result, and it doesn't set `\0..\9`:

GENMD-CARDIFY-HIGHLIGHT-ONE
mlr repl

[mlr] strmatch("abc", "....")
false

[mlr] strmatch("abc", "...")
true

[mlr] strmatch("abc", "(.).(.)")
true

[mlr] strmatch("[ab:3458]", "([a-z]+):([0-9]+)")
true
GENMD-EOF

The `strmatchx` function also doesn't set `\0..\9`, but returns a map-valued result:

GENMD-CARDIFY-HIGHLIGHT-ONE
mlr repl

[mlr] strmatchx("abc", "....")
{
"matched": false
}

[mlr] strmatchx("abc", "...")
{
"matched": true,
"full_capture": "abc",
"full_start": 1,
"full_end": 3
}

[mlr] strmatchx("abc", "(.).(.)")
{
"matched": true,
"full_capture": "abc",
"full_start": 1,
"full_end": 3,
"captures": ["a", "c"],
"starts": [1, 3],
"ends": [1, 3]
}

[mlr] "[ab:3458]" =~ "([a-z]+):([0-9]+)"
true

[mlr] "\1"
"ab"

[mlr] "\2"
"3458"

[mlr] strmatchx("[ab:3458]", "([a-z]+):([0-9]+)")
{
"matched": true,
"full_capture": "ab:3458",
"full_start": 2,
"full_end": 8,
"captures": ["ab", "3458"],
"starts": [2, 5],
"ends": [3, 8]
}
GENMD-EOF

Notes:

* When there is no match, the result from `strmatchx` only has the `"matched":false` key/value pair.
* When there is a match with no captures, the result from `strmatchx` has the `"matched":true` key/value pair,
as well as `full_capture` (taking the place of `\0` set by `=~`), and `full_start` and `full_end`
which `=~` does not offer.
* When there is a match with captures, the result from `strmatchx` also has the `captures` array
whose slots 1, 2, 3, ... are the same as would have been set by `=~` via `\1, \2, \3, ...`.
However, `strmatchx` offers an arbitrary number of captures, not just `\1..\9`.
Additionally, the `starts` and `ends` arrays give the 1-up start and end positions of each capture in the input string.
* Since you hold the return value from `strmatchx`, you can operate on it as you wish --- instead of
relying on the (function-scoped) globals `\0..\9`.
* The price paid is that using `strmatchx` does indeed tend to take more keystrokes than `=~`.

## More information

Regular expressions are those supported by the [Go regexp package](https://pkg.go.dev/regexp), which in turn are of type [RE2](https://github.com/google/re2/wiki/Syntax) except for `\C`: