From 211b15ad4fff43db371e9e22c624a044234c9d9e Mon Sep 17 00:00:00 2001 From: John Kerl Date: Tue, 19 Dec 2023 09:52:16 -0500 Subject: [PATCH] make docs --- docs/src/kubectl-and-helm.md.in | 2 +- docs/src/reference-dsl-time.md.in | 4 +- .../src/reference-main-regular-expressions.md | 42 +++++++++++++++++++ .../reference-main-regular-expressions.md.in | 2 +- docs/src/reference-main-strings.md.in | 2 +- docs/src/release-docs.md.in | 2 +- docs/src/shapes-of-data.md.in | 12 +++--- docs/src/statistics-examples.md.in | 4 +- docs/src/why.md.in | 2 +- 9 files changed, 57 insertions(+), 15 deletions(-) diff --git a/docs/src/kubectl-and-helm.md.in b/docs/src/kubectl-and-helm.md.in index 2f7d7d26f2..14c0facf44 100644 --- a/docs/src/kubectl-and-helm.md.in +++ b/docs/src/kubectl-and-helm.md.in @@ -136,7 +136,7 @@ $ helm list | mlr --itsv --ojson head -n 1 ] GENMD-EOF -A solution here is Miller's +A solution here is Miller's [clean-whitespace verb](reference-verbs.md#clean-whitespace): GENMD-CARDIFY diff --git a/docs/src/reference-dsl-time.md.in b/docs/src/reference-dsl-time.md.in index e2e02c3970..869a584958 100644 --- a/docs/src/reference-dsl-time.md.in +++ b/docs/src/reference-dsl-time.md.in @@ -67,7 +67,7 @@ the [ISO8601](https://en.wikipedia.org/wiki/ISO_8601) format. This was the first (and initially only) human-readable date/time format supported by Miller going all the way back to Miller 1.0.0. -You can get these from epoch-seconds using the +You can get these from epoch-seconds using the [sec2gmt](reference-dsl-builtin-functions.md#sec2gmt) DSL function. (Note that the terms _UTC_ and _GMT_ are used interchangeably in Miller.) We also have [sec2gmtdate](reference-dsl-builtin-functions.md#sec2gmtdate) DSL function. @@ -142,7 +142,7 @@ GENMD-EOF Note that for local times, Miller omits the `T` and the `Z` you see in GMT times. -We also have the +We also have the [gmt2localtime](reference-dsl-builtin-functions.md#gmt2localtime) and [localtime2gmt](reference-dsl-builtin-functions.md#localtime2gmt) convenience functions: diff --git a/docs/src/reference-main-regular-expressions.md b/docs/src/reference-main-regular-expressions.md index c221c48dec..ba6d955ff7 100644 --- a/docs/src/reference-main-regular-expressions.md +++ b/docs/src/reference-main-regular-expressions.md @@ -103,6 +103,48 @@ Regex captures of the form `\0` through `\9` are supported as follows: * Up to nine matches are supported: `\1` through `\9`, while `\0` is the entire match string; `\15` is treated as `\1` followed by an unrelated `5`. +## Resetting captures + +If you use `(...)` in your regular expression, then up to 9 matches are supported for the `=~` +operator, and an arbitrary number of matches are supported for the `match` DSL function. + +* Before any match is done, `"\1"` etc. in a string evaluate to themselves. +* After a successful match is done, `"\1"` etc. in a string evaluate to the matched substring. +* After an unsuccessful match is done, `"\1"` etc. in a string evaluate to the empty string. +* You can match against `null` to reset to the original state. + +
+mlr repl
+
+
+
+[mlr] "\1:\2"
+"\1:\2"
+
+[mlr] "abc" =~ "..."
+true
+
+[mlr] "\1:\2"
+":"
+
+[mlr] "abc" =~ "(.).(.)"
+true
+
+[mlr] "\1:\2"
+"a:c"
+
+[mlr] "abc" =~ "(.)x(.)"
+false
+
+[mlr] "\1:\2"
+":"
+
+[mlr] "abc" =~ null
+
+[mlr] "\1:\2"
+"\1:\2"
+
+ ## More information Regular expressions are those supported by the [Go regexp package](https://pkg.go.dev/regexp), which in turn are of type [RE2](https://github.com/google/re2/wiki/Syntax) except for `\C`: diff --git a/docs/src/reference-main-regular-expressions.md.in b/docs/src/reference-main-regular-expressions.md.in index 434225f35b..d3b0912079 100644 --- a/docs/src/reference-main-regular-expressions.md.in +++ b/docs/src/reference-main-regular-expressions.md.in @@ -83,7 +83,7 @@ GENMD-EOF If you use `(...)` in your regular expression, then up to 9 matches are supported for the `=~` operator, and an arbitrary number of matches are supported for the `match` DSL function. -* Before any match is done, `"\1"` etc. in a string evaluate to themselves. +* Before any match is done, `"\1"` etc. in a string evaluate to themselves. * After a successful match is done, `"\1"` etc. in a string evaluate to the matched substring. * After an unsuccessful match is done, `"\1"` etc. in a string evaluate to the empty string. * You can match against `null` to reset to the original state. diff --git a/docs/src/reference-main-strings.md.in b/docs/src/reference-main-strings.md.in index e675605505..7ad9e431d6 100644 --- a/docs/src/reference-main-strings.md.in +++ b/docs/src/reference-main-strings.md.in @@ -143,4 +143,4 @@ See also [https://en.wikipedia.org/wiki/Escape_sequences_in_C](https://en.wikipe These replacements apply only to strings you key in for the DSL expressions for `filter` and `put`: that is, if you type `\t` in a string literal for a `filter`/`put` expression, it will be turned into a tab character. If you want a backslash followed by a `t`, then please type `\\t`. -However, these replacements are done automatically only for string literals within DSL expressions -- they are not done automatically to fields within your data stream. If you wish to make these replacements, you can do (for example) `mlr put '$field = gsub($field, "\\t", "\t")'`. If you need to make such a replacement for all fields in your data, you should probably use the system `sed` command instead. +However, these replacements are done automatically only for string literals within DSL expressions -- they are not done automatically to fields within your data stream. If you wish to make these replacements, you can do (for example) `mlr put '$field = gsub($field, "\\t", "\t")'`. If you need to make such a replacement for all fields in your data, you should probably use the system `sed` command instead. diff --git a/docs/src/release-docs.md.in b/docs/src/release-docs.md.in index 07dc91719c..e82b427551 100644 --- a/docs/src/release-docs.md.in +++ b/docs/src/release-docs.md.in @@ -1,6 +1,6 @@ # Documents for releases -If your `mlr version` says something like `mlr 6.0.0-dev`, with the `-dev` suffix, you're likely building from source, or you've obtained a recent artifact from GitHub Actions -- +If your `mlr version` says something like `mlr 6.0.0-dev`, with the `-dev` suffix, you're likely building from source, or you've obtained a recent artifact from GitHub Actions -- the page [https://miller.readthedocs.io/en/main](https://miller.readthedocs.io/en/main) contains information for the latest contributions to the [Miller repository](https://github.com/johnkerl/miller). If your `mlr version` says something like `Miller v5.10.2` or `mlr 6.0.0`, without the `-dev` suffix, you're likely using a Miller executable from a package manager -- please see below for the documentation for Miller as of the release you're using. diff --git a/docs/src/shapes-of-data.md.in b/docs/src/shapes-of-data.md.in index c32b0dad18..3636f406d2 100644 --- a/docs/src/shapes-of-data.md.in +++ b/docs/src/shapes-of-data.md.in @@ -17,14 +17,14 @@ Also try `od -xcv` and/or `cat -e` on your file to check for non-printable chara Use the `file` command to see if there are CR/LF terminators (in this case, there are not): GENMD-CARDIFY-HIGHLIGHT-ONE -file data/colours.csv +file data/colours.csv data/colours.csv: Unicode text, UTF-8 text GENMD-EOF Look at the file to find names of fields: GENMD-CARDIFY-HIGHLIGHT-ONE -cat data/colours.csv +cat data/colours.csv KEY;DE;EN;ES;FI;FR;IT;NL;PL;TO;TR masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Wit;Biały;Alb;Beyaz masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah @@ -33,13 +33,13 @@ GENMD-EOF Extract a few fields: GENMD-CARDIFY-HIGHLIGHT-ONE -mlr --csv cut -f KEY,PL,TO data/colours.csv +mlr --csv cut -f KEY,PL,TO data/colours.csv GENMD-EOF Use XTAB output format to get a sharper picture of where records/fields are being split: GENMD-CARDIFY-HIGHLIGHT-ONE -mlr --icsv --oxtab cat data/colours.csv +mlr --icsv --oxtab cat data/colours.csv KEY;DE;EN;ES;FI;FR;IT;NL;PL;TO;TR masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Wit;Biały;Alb;Beyaz KEY;DE;EN;ES;FI;FR;IT;NL;PL;TO;TR masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah @@ -48,7 +48,7 @@ GENMD-EOF Using XTAB output format makes it clearer that `KEY;DE;...;TR` is being treated as a single field name in the CSV header, and likewise each subsequent line is being treated as a single field value. This is because the default field separator is a comma but we have semicolons here. Use XTAB again with different field separator (`--fs semicolon`): GENMD-CARDIFY-HIGHLIGHT-ONE -mlr --icsv --ifs semicolon --oxtab cat data/colours.csv +mlr --icsv --ifs semicolon --oxtab cat data/colours.csv KEY masterdata_colourcode_1 DE Weiß EN White @@ -77,7 +77,7 @@ GENMD-EOF Using the new field-separator, retry the cut: GENMD-CARDIFY-HIGHLIGHT-ONE -mlr --csv --fs semicolon cut -f KEY,PL,TO data/colours.csv +mlr --csv --fs semicolon cut -f KEY,PL,TO data/colours.csv KEY;PL;TO masterdata_colourcode_1;Biały;Alb masterdata_colourcode_2;Czarny;Negru diff --git a/docs/src/statistics-examples.md.in b/docs/src/statistics-examples.md.in index a98ead194c..1da4aa235f 100644 --- a/docs/src/statistics-examples.md.in +++ b/docs/src/statistics-examples.md.in @@ -7,7 +7,7 @@ For one or more specified field names, simply compute p25 and p75, then write th GENMD-RUN-COMMAND mlr --oxtab stats1 -f x -a p25,p75 \ then put '$x_iqr = $x_p75 - $x_p25' \ - data/medium + data/medium GENMD-EOF For wildcarded field names, first compute p25 and p75, then loop over field names with `p25` in them: @@ -19,7 +19,7 @@ mlr --oxtab stats1 --fr '[i-z]' -a p25,p75 \ $["\1_iqr"] = $["\1_p75"] - $["\1_p25"] } }' \ - data/medium + data/medium GENMD-EOF ## Computing weighted means diff --git a/docs/src/why.md.in b/docs/src/why.md.in index 3c83c39c43..e33529ba2f 100644 --- a/docs/src/why.md.in +++ b/docs/src/why.md.in @@ -32,7 +32,7 @@ Eighth thing: It's an **awful lot of fun to write**. In my experience I didn't f Miller is command-line-only by design. People who want a graphical user interface won't find it here. This is in part (a) accommodating my personal preferences, and in part (b) guided by my experience/belief that the command line is very expressive. Steeper learning curve than a GUI, yes. I consider that price worth paying for the tool-niche which Miller occupies. -Another tradeoff: supporting lists of records keeps me supporting only what can be expressed in *all* of those formats. For example, `[1,2,3,4,5]` is valid but unmillerable JSON: the list elements are not records. So Miller can't (and won't) handle arbitrary JSON -- because Miller only handles tabular data which can be expressed in a variety of formats. +Another tradeoff: supporting lists of records keeps me supporting only what can be expressed in *all* of those formats. For example, `[1,2,3,4,5]` is valid but unmillerable JSON: the list elements are not records. So Miller can't (and won't) handle arbitrary JSON -- because Miller only handles tabular data which can be expressed in a variety of formats. A third tradeoff is doing build-from-scratch in a low-level language. It'd be quicker to write (but slower to run) if written in a high-level language. If Miller were written in Python, it would be implemented in significantly fewer lines of code than its current Go implementation. The DSL would just be an `eval` of Python code. And it would run slower, but maybe not enough slower to be a problem for most folks. Later I found out about the [rows](https://github.com/turicas/rows) tool -- if you find Miller useful, you should check out `rows` as well.