Usage Instructions

TheOnlyTails edited this page Aug 3, 2022 · 14 revisions

Creating a basic regex

To create a basic regex, you have 3 options: regex, regexAsString, and buildRegex.

  • regex is the default option, returning a basic Regex object to use with the Kotlin standard library extensions or the JDK methods.
  • regexAsString returns the raw regex string, which can then be used as you wish.
  • buildRegex returns an object of the internal KetexBuilder type. Using it directly is possible, but not recommended.

Examples:

regex {
  +'A'
} // : Regex

regexAsString {
  +'A'
} // : String

buildRegex {
  +'A'
} // : KetexBuilder
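
Whichever builder you pick, the result plugs into the Kotlin standard library. The sketch below assumes the `+'A'` examples above produce the pattern `A`, and hard-codes that pattern so it runs without Ketex. The names `asRegex`, `asString`, and `countMatches` are illustrative, not part of the library.

```kotlin
// Stand-ins for the values the builders are assumed to yield for `+'A'`.
val asRegex: Regex = Regex("A")   // assumed result of `regex { +'A' }`
val asString: String = "A"        // assumed result of `regexAsString { +'A' }`

// Count how many times a pattern occurs in the input.
fun countMatches(pattern: Regex, input: String): Int = pattern.findAll(input).count()

fun main() {
    println(asRegex.containsMatchIn("BANANA"))   // true
    println(countMatches(asRegex, "BANANA"))     // 3

    // A raw pattern string can be compiled later, for example with extra options.
    val caseInsensitive = Regex(asString, RegexOption.IGNORE_CASE)
    println(countMatches(caseInsensitive, "banana")) // 3
}
```

Getting the raw string is mostly useful when the pattern has to be passed to an API that compiles it itself, or compiled with options of your own.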

Tokens

Tokens are a special concept used internally by Ketex to represent text fragments.

They can represent:

Some features of the DSL automatically convert those types into the KetexToken type. Others (such as quantifiers) require you to convert them explicitly, which is done with the token extension property.

regex {
  +('A'.token count 5) // here, we use `token` to convert the Char into a token, and add a quantifier to it.
}
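
To illustrate why an explicit conversion can be necessary, here is a toy model of the idea. It is not Ketex's implementation: `ToyToken`, `toyToken`, and this `count` are hypothetical names invented for the sketch. The point is only that an infix quantifier declared on a token type cannot be called on a bare Char.

```kotlin
// Hypothetical stand-in for a token type; this is NOT Ketex's real KetexToken.
data class ToyToken(val pattern: String)

// Hypothetical counterpart of the `token` extension property, for Char only.
val Char.toyToken: ToyToken
    get() = ToyToken(Regex.escape(toString()))

// The quantifier is declared on ToyToken, so a bare Char cannot use it.
infix fun ToyToken.count(times: Int): ToyToken = ToyToken("(?:$pattern){$times}")

fun main() {
    val fiveAs = 'A'.toyToken count 5
    println(Regex(fiveAs.pattern).matches("AAAAA")) // true
    println(Regex(fiveAs.pattern).matches("AAAA"))  // false
}
```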

To create custom, more complex tokens, you can use the KetexToken interface (though it's very rarely needed):

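// requires: import java.util.Random
regex {
  +(KetexToken {
    // the token's text is chosen at random each time the regex is built
    if (Random().nextBoolean()) "true" else "false"
  })
}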

Built-in tokens

Some tokens are built into Ketex, and have special meanings inside regexes (AKA meta-sequences).

Property name          Token          Description
any                    .              any character - [^\n\r]
newline                \n             newline character
carriageReturn         \r             carriage return character
tab                    \t             tab character
word                   \w             word character - [a-zA-Z0-9_]
digit                  \d             digit - [0-9]
whitespace             \s             whitespace character
unicodeNewlines        \R             unicode newline character
verticalWhitespace     \v             vertical whitespace character
horizontalWhitespace   \h             horizontal whitespace character (spaces, tabs, etc.)
index                  \#             reference another matching group by its index
property               \p{Property}   a character in the given script/block or with the given property

If a token begins with \ followed by a single character, it can be inverted using the ! operator. For example, !word would be equivalent to \W (any character that does not match \w).

regex {
  +word // \w
  +!word // \W
  +digit // \d
  +any // .
  +index(1) // \1
}
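
The meta-sequences in the table can be checked directly against the JVM regex engine. This sketch uses the raw patterns from the Token column with the Kotlin standard library rather than Ketex itself. `matchesWhole` is a helper introduced here for illustration.

```kotlin
// True if the whole input is matched by the pattern.
fun matchesWhole(pattern: String, input: String): Boolean = Regex(pattern).matches(input)

fun main() {
    println(matchesWhole("""\w""", "_"))      // true: underscore is a word character
    println(matchesWhole("""\W""", "-"))      // true: the inverted form of \w
    println(matchesWhole("""\d""", "7"))      // true
    println(matchesWhole(".", "\n"))          // false: `.` excludes line breaks by default
    println(matchesWhole("""\h""", "\t"))     // true: a tab is horizontal whitespace
    println(matchesWhole("""\R""", "\r\n"))   // true: CRLF counts as one newline sequence
    println(matchesWhole("""\p{Lu}""", "A"))  // true: uppercase letter property
}
```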

Groups

Groups let you isolate part of the full match for later referencing or assert the presence of a match without capturing it.

There are 3 major group types:

  • capture groups (default)
  • non-capture groups
  • positive/negative lookahead/lookbehind

To create a group, use the group function. The type can be set using the type parameter (the types can be found in the KetexGroupType enum).

Capture Groups

Capture groups isolate a part of the match and allow you to reference it later on, while still being included in the overall match.

You can reference a group either by its ID (IDs start at 1) or by its name (if it's named).

regex {
  +group { // declare a group matching a single character
    +any
  }
  +index(1) // reference the group and find another match equal to the text matched by group 1
}
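
At the regex level, a group followed by a reference to it corresponds to the pattern `(.)\1`, assuming the DSL above emits exactly that. A minimal sketch with the standard library, where `doubled` and `firstDoubled` are illustrative names:

```kotlin
// One captured character followed by a backreference to the same text.
val doubled = Regex("""(.)\1""")

// Returns the first doubled character pair in the input, or null if there is none.
fun firstDoubled(input: String): String? = doubled.find(input)?.value

fun main() {
    val match = doubled.find("hello")
    println(match?.value)               // ll
    println(match?.groupValues?.get(1)) // l (the text captured by group 1)
    println(firstDoubled("abc"))        // null
}
```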

To name a group, use the optional name parameter in the group function. Only capture groups can be named: if you pass a name to any other type of group, the name is ignored and an error is printed.

regex {
  +group(name = "anychar") { // will match any single character
    +any
  }
}
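
Once a group has a name, the matched text can be read back by that name. The sketch below assumes the named group compiles to `(?<anychar>.)` and uses only the standard library. Looking up `groups["anychar"]` relies on named-group support in a recent Kotlin version on the JVM. The `\k<anychar>` backreference is plain regex syntax, not a Ketex feature shown on this page.

```kotlin
// A named group, followed by a backreference to it by name.
val anyCharNamed = Regex("""(?<anychar>.)\k<anychar>""")

// Returns the character that appears twice in a row, looked up by group name.
fun repeatedChar(input: String): String? =
    anyCharNamed.find(input)?.groups?.get("anychar")?.value

fun main() {
    println(repeatedChar("hello")) // l
    println(repeatedChar("abc"))   // null
}
```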

Non-capture Groups

Non-capture groups work the same as capture groups, but aren't assigned an ID and cannot be named.

They can be created by passing KetexGroupType.NonCapture to type in group.

regex {
  +"A"
  +group(type = KetexGroupType.NonCapture) {
    +"A"
  }
  +index(1) // this is invalid, the preceding group has no ID
}
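
The difference from a capture group is visible in the match result: a non-capture group adds no entry to the group list. A small sketch using the raw patterns `A(A)` and `A(?:A)`, where `capturedGroups` is a helper made up for this example:

```kotlin
// Number of capture groups reported for the first match (group 0 is excluded).
fun capturedGroups(pattern: String, input: String): Int? =
    Regex(pattern).find(input)?.groupValues?.size?.minus(1)

fun main() {
    println(capturedGroups("A(A)", "AA"))   // 1
    println(capturedGroups("A(?:A)", "AA")) // 0
}
```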

Positive/Negative Lookahead/Lookbehind

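These types of groups allow you to assert that text is (or isn't) present without including it in the match.

  • Positive lookahead: only match if the group's content exists after the preceding token
  • Negative lookahead: only match if the group's content doesn't exist after the preceding token
  • Positive lookbehind: only match if the group's content exists before the next token
  • Negative lookbehind: only match if the group's content doesn't exist before the next token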

These types of groups can be created by passing the relevant types in KetexGroupType to the type parameter in group.

regex {
  +"A" // this "A" will only be matched if followed by another "A"
  +group(type = KetexGroupType.PositiveLookahead) {
    +"A" // this "A" won't be captured, only the first "A" will
  }
}
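
The four lookaround variants can be compared side by side with raw patterns. This sketch uses the standard regex syntax `(?=…)`, `(?!…)`, `(?<=…)`, and `(?<!…)`, which the group types are assumed to map to. `matchStarts` is an illustrative helper.

```kotlin
// Start indices of every match of the pattern in the input.
fun matchStarts(pattern: String, input: String): List<Int> =
    Regex(pattern).findAll(input).map { it.range.first }.toList()

fun main() {
    println(matchStarts("A(?=A)", "AAA"))  // [0, 1]  an A followed by another A
    println(matchStarts("A(?!A)", "AAA"))  // [2]     an A not followed by an A
    println(matchStarts("(?<=A)B", "ABB")) // [1]     a B preceded by an A
    println(matchStarts("(?<!A)B", "ABB")) // [2]     a B not preceded by an A

    // The asserted text is not part of the match itself.
    println(Regex("A(?=A)").find("AA")?.value) // A
}
```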

Sets

Sets are character classes: a set matches any single character listed inside it.

To create a set, use the set function:

regex {
  +set {
    +"a"
    +"1"
  } // turns into "[a1]", will match either 'a' or '1'
}

Additionally, you can use the set property with the get ([]) operator to create an inline set.

This method accepts String, Char, CharRange, and KetexToken, but only one type per set.

regex {
  +set["a", "b", "c"]
  +set['a'..'z', 'A'..'Z', '0'..'9']
  +set[word, whitespace]
}
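
To see what such sets match, the sketch below runs the equivalent raw character classes through the standard library. The patterns `[a1]`, `[a-zA-Z0-9]`, and `[\w\s]` are what the examples above are assumed to produce. `keepMatching` is a helper introduced for illustration.

```kotlin
// Keeps only the characters of the input that the set matches.
fun keepMatching(setPattern: String, input: String): String =
    Regex(setPattern).findAll(input).joinToString("") { it.value }

fun main() {
    println(keepMatching("[a1]", "banana 101"))    // aaa11
    println(keepMatching("[a-zA-Z0-9]", "Hi, 5!")) // Hi5
    println(keepMatching("""[\w\s]""", "a-b c"))   // ab c
}
```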

You can add all types of tokens to sets (including sets themselves).

You can invert sets ([^abcd]) using the not operator (!).

regex {
  +!set {
    +"abcd"
  } // turns into `[^abcd]`
}

You can also intersect sets with the infix intersect function, which acts like a boolean && operator.

regex {
  // turns into `[a-z&&[^g]]`
  // will match any character from 'a' to 'z' except 'g'
  +(set {
    +"a-z"
  } intersect !set { +'g' }) 
}
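
Both the inverted and the intersected form can be verified with the raw patterns quoted in the comments above. A minimal sketch against the JVM regex engine, where `filterBySet` is an illustrative helper:

```kotlin
// Keeps only the characters of the input that the character class matches.
fun filterBySet(setPattern: String, input: String): String =
    Regex(setPattern).findAll(input).joinToString("") { it.value }

fun main() {
    println(filterBySet("[^abcd]", "abcdef"))   // ef  (everything except a, b, c, d)
    println(filterBySet("[a-z&&[^g]]", "fghG")) // fh  (lowercase letters, minus g)
}
```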

Quantifiers

Quantifiers specify how many times you expect to see a pattern.

Name      Token   Description
count     {5}     specifies an exact amount
between   {5,6}   specifies a range
atLeast   {5,}    specifies a minimum amount
some      +       matches 1 or more
maybe     *       matches 0 or more
option    ?       matches 0 or 1
lazy      +?      makes the preceding quantifier match as few characters as possible
or        a|b     matches the 1st token or the 2nd token; can be chained

regex {
  +("A".token count 5) // "A{5}"
  +("A".token between 5..6) // "A{5,6}"
  +("A".token atLeast 5) // "A{5,}"
  +"A".token.some() // "A+"
  +"A".token.maybe() // "A*"
  +"A".token.option() // "A?"
  +"A".token.some().lazy() // "A+?"
  +("A".token or "B".token) // "A|B"
}
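
The effect of each quantifier is easiest to see on the first match it produces. This sketch feeds the raw patterns from the comments above to the standard library. `firstMatch` is a helper made up for the example.

```kotlin
// The text of the first match, or null if the pattern does not match at all.
fun firstMatch(pattern: String, input: String): String? = Regex(pattern).find(input)?.value

fun main() {
    println(firstMatch("A{5}", "AAAAAAA"))   // AAAAA
    println(firstMatch("A{5,6}", "AAAAAAA")) // AAAAAA (greedy: takes the upper bound)
    println(firstMatch("A{5,}", "AAAAAAA"))  // AAAAAAA
    println(firstMatch("A+", "AAA"))         // AAA
    println(firstMatch("A+?", "AAA"))        // A (lazy: as few characters as possible)
    println(firstMatch("A?", "AAA"))         // A
    println(firstMatch("A*", "BBB") == "")   // true: zero repetitions is a valid match
    println(firstMatch("A|B", "xBA"))        // B
}
```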

Anchors

Anchors allow you to assert the position of matches in the regex.

Name          Token   Description
start         ^       matches the beginning of the string
end           $       matches the end of the string
wordBoundry   \b      matches the position between a word character (word) and a non-word character (!word)

regex {
  +start // ^
  +wordBoundry // \b
  +end // $
}
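
Anchors match positions rather than characters, which the following sketch demonstrates with raw patterns and the standard library. The sample words are arbitrary, and `occursIn` is an illustrative helper.

```kotlin
// True if the pattern matches anywhere in the input.
fun occursIn(pattern: String, input: String): Boolean = Regex(pattern).containsMatchIn(input)

fun main() {
    println(occursIn("^cat", "catalog"))             // true: "cat" at the start
    println(occursIn("^cat", "bobcat"))              // false
    println(occursIn("cat\$", "bobcat"))             // true: "cat" at the end
    println(occursIn("""\bcat\b""", "a cat sat"))    // true: "cat" as a whole word
    println(occursIn("""\bcat\b""", "concatenate"))  // false: no word boundary around it
}
```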

Escaping

To avoid accidentally adding meta-sequences to the regex, Ketex automatically escapes all meta characters for you.

To add raw content without escaping, do the following:

regex {
  add("This will not be escaped. That dot will match any character.", escape = false)
}
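
The practical difference between escaped and raw content shows up with a character like the dot. This page does not describe how Ketex escapes text internally, so the sketch below uses the standard library's `Regex.escape` as a stand-in for an escaped string, and a plain pattern for the unescaped one.

```kotlin
// Unescaped: the dot is a meta character and matches any character.
val rawDot = Regex("a.b")

// Escaped: the dot is treated literally (Regex.escape stands in for Ketex's escaping).
val literalDot = Regex(Regex.escape("a.b"))

fun main() {
    println(rawDot.matches("axb"))     // true
    println(literalDot.matches("axb")) // false
    println(literalDot.matches("a.b")) // true
}
```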