Multiple replacements with peg/match #1422

MaxGyver83 · 2024-03-03T20:45:25Z

I would like to replace this (simplified) function with a peg/match version of it:

(defn replace-utf8 [str]
  (->> str
       (string/replace-all "=C3=A4" "ä")
       (string/replace-all "=C3=B6" "ö")
       (string/replace-all "=C3=BC" "ü")))

(test (replace-utf8 "M=C3=B6ller") "Möller")

I tried to do it like in the example in Janet for Mortals: Pegular Expressions:

(defn replace-utf8-peg [str]
  (peg/match ~(any (+
                     (/ "=C3=A4" "ä")
                     (/ "=C3=B6" "ö")
                     (/ "=C3=BC" "ü")
                     1))
             str))

(test (replace-utf8-peg "M=C3=B6ller") @["Möller"])

But it doesn't work. The test (replace-utf8-peg "M=C3=B6ller") gives me only @["\xC3\xB6"] instead of the expected @["Möller"]. Doesn't the 1 in the pattern match any byte other than those being replaced?

ianthehenry · 2024-03-03T21:19:58Z

1 matches the byte, but doesn't capture it. You can use:

(defn replace-utf8-peg [str]
  (peg/match ~(accumulate
                (any (+ (/ "=C3=A4" "ä")
                        (/ "=C3=B6" "ö")
                        (/ "=C3=BC" "ü")
                        '1)))   # '1 captures the byte; a bare 1 only matches it
             str))

That does what you want: accumulate joins all the captured pieces into a single string efficiently, instead of giving you an array of one-character captures at the end.
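To make the difference concrete, here is a small sketch comparing the capturing pattern with and without accumulate. The names pieces-patt and joined-patt are made up for this illustration, only one replacement is kept for brevity, and it assumes the source file is saved as UTF-8 so that "ö" is the two bytes \xC3\xB6.

```janet
# Hypothetical names; one replacement only, to keep the sketch short.
# Every alternative captures: the replacement yields "ö", '1 yields one byte.
(def pieces-patt ~(any (+ (/ "=C3=B6" "ö") '1)))

# Wrapping the same pattern in accumulate joins those captures.
(def joined-patt ~(accumulate ,pieces-patt))

(peg/match pieces-patt "M=C3=B6ller") # -> @["M" "ö" "l" "l" "e" "r"]
(peg/match joined-patt "M=C3=B6ller") # -> @["Möller"]
```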

peg/replace is probably a better fit here, though. And assuming that you want to produce utf-8 encoded strings in your final result (which I guess is ambiguous given the code you provided -- I don't know your file encoding), you can just parse the bytes and don't need to enumerate things to decode:

(defn replace-utf8-peg [str]
  (string (peg/replace-all
    ~{:main (* "=" :hex-byte)
      :hex-byte (number (* :hex-digit :hex-digit) 16)
      :hex-digit (range "09" "AF")}
    (fn [_ byte] (string/from-bytes byte))
    str)))

The lookup table might be preferable depending on the circumstances, though, e.g. if you don't want to think about overlong or otherwise illegal encodings: this function doesn't validate that the result is UTF-8, it just trusts it, whereas your approach parses a subset that is definitely valid.
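For comparison, a lookup-table version can also be written with peg/replace-all. This is only a sketch: the names umlauts and replace-utf8-table are invented here, and it again assumes a UTF-8 source file. Sequences that are not in the table are left untouched.

```janet
# Hypothetical lookup table; extend it with whatever sequences you expect.
(def umlauts
  {"=C3=A4" "ä"
   "=C3=B6" "ö"
   "=C3=BC" "ü"})

(defn replace-utf8-table [str]
  (string (peg/replace-all
            # Builds (+ "=C3=A4" "=C3=B6" "=C3=BC") from the table keys.
            (tuple '+ ;(keys umlauts))
            # The substitution function receives the matched text.
            (fn [matched] (get umlauts matched))
            str)))

(replace-utf8-table "M=C3=B6ller") # -> "Möller"
```

The outer string call is there because peg/replace-all returns a buffer.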

1 reply

MaxGyver83 Mar 4, 2024
Author

Thanks a lot! My script and the address book it writes to are both in UTF-8.
