Usage Instructions
To create a basic regex, you have 3 options: `regex`, `regexAsString`, and `buildRegex`.
- `regex` is the default option, returning a basic `Regex` object to use with the Kotlin standard library extensions or the JDK methods.
- `regexAsString` returns the raw regex string, which can then be used as you wish.
- `buildRegex` returns an object of the internal `KetexBuilder` type, which is not recommended, but possible if you wish.
Examples:
regex {
+'A'
} // : Regex
regexAsString {
+'A'
} // : String
buildRegex {
+'A'
} // : KetexBuilder
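Once you hold a `Regex`, it behaves like any other Kotlin `Regex`. The sketch below uses only the Kotlin standard library; the pattern `"A"` is written by hand and is assumed to be what `regex { +'A' }` produces, and `countMatches` is a hypothetical helper named for illustration.

```kotlin
// Hand-written stand-in for the result of `regex { +'A' }` (an assumption, not Ketex output).
val pattern = Regex("A")

// Hypothetical helper: counts non-overlapping matches of `regex` in `input`.
fun countMatches(regex: Regex, input: String): Int = regex.findAll(input).count()

fun main() {
    println(pattern.containsMatchIn("BANANA")) // true
    println(countMatches(pattern, "BANANA"))   // 3
    println(pattern.replace("BANANA", "o"))    // BoNoNo
}
```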
Tokens are a special concept used internally by Ketex to represent text fragments.
They can represent:
- `String`
- `Char`
- `CharRange` (uses a set)
- `KetexGroup` (implements `KetexToken` internally, no need for any conversion)
- `KetexSet` (implements `KetexToken` internally, no need for any conversion)

While some features of the DSL automatically convert those types into the `KetexToken` type, some (such as Quantifiers) require you to explicitly convert said types into tokens, which can be done with the `token` extension property.
regex {
+('A'.token count 5) // here, we use `token` to convert the Char into a token, and add a quantifier to it.
}
To create custom, more complex tokens, you can use the `KetexToken` interface (though it's very rarely needed):
regex {
+(KetexToken {
if (Random().nextBoolean()) "true" else "false"
})
}
Some tokens are built into Ketex, and have special meanings inside regexes (AKA meta-sequences).
Property name | Token | Description |
---|---|---|
`any` | `.` | any character - `[^\n\r]` |
`newline` | `\n` | newline character |
`carriageReturn` | `\r` | carriage return character |
`tab` | `\t` | tab character |
`word` | `\w` | word character - `[a-zA-Z0-9_]` |
`digit` | `\d` | digit - `[0-9]` |
`whitespace` | `\s` | whitespace character |
`unicodeNewlines` | `\R` | unicode newline character |
`verticalWhitespace` | `\v` | vertical whitespace character |
`horizontalWhitespace` | `\h` | horizontal whitespace character (spaces, tabs, etc) |
`index` | `\#` | reference another matching group by its index |
`property` | `\p{Property}` | a character in the given script/block or with the given property |
If a token begins with `\` followed by a single character, it can be inverted using the `!` operator.
For example, `!word` would be equivalent to `\W` (any character that does not match `\w`).
regex {
+word // \w
+!word // \W
+digit // \d
+any // .
+index(1) // \1
}
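To make the effect of inversion concrete, here is a small sketch with the standard library only. The patterns `\w` and `\W` are written by hand as the strings that `word` and `!word` are described as producing; `partition` is a hypothetical helper.

```kotlin
// Hand-written patterns standing in for `word` and `!word` (assumed equivalents).
val wordChar = Regex("""\w""")
val nonWordChar = Regex("""\W""")

// Hypothetical helper: splits the input into its word and non-word characters.
fun partition(input: String): Pair<String, String> {
    val words = wordChar.findAll(input).joinToString("") { it.value }
    val others = nonWordChar.findAll(input).joinToString("") { it.value }
    return words to others
}

fun main() {
    // Letters, digits and '_' land on the left; everything else on the right.
    println(partition("a_1 !?")) // (a_1,  !?)
}
```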
Groups let you isolate part of the full match for later referencing or assert the presence of a match without capturing it.
There are 3 major group types:
- capture groups (default)
- non-capture groups
- positive/negative lookahead/lookbehind

To create a group, use the `group` function.
The type can be set using the `type` parameter (the types can be found in the `KetexGroupType` enum).
Capture groups isolate a part of the match and allow you to reference it later on, while still being included in the overall match.
You can reference a group either by its ID (IDs start at 1) or by its name (if it's named).
regex {
+group { // declare a group matching a single character
+any
}
+index(1) // reference the group and find another match equal to the text matched by group 1
}
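The backreference behaviour can be checked with the standard library alone. In this sketch the pattern `(.)\1` is written by hand and is assumed to match what the DSL block above generates; `doubledChars` is a hypothetical helper.

```kotlin
// Hand-written stand-in for `+group { +any }` followed by `+index(1)` (an assumption).
val doubled = Regex("""(.)\1""")

// Hypothetical helper: returns every place where a character is immediately repeated.
fun doubledChars(input: String): List<String> =
    doubled.findAll(input).map { it.value }.toList()

fun main() {
    println(doubledChars("bookkeeper")) // [oo, kk, ee]
    // groupValues[1] holds the text captured by group 1.
    println(doubled.find("bookkeeper")?.groupValues?.get(1)) // o
}
```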
To name a group, use the optional `name` parameter in the `group` function.
Only capture groups can be named: passing a name to any other type of group ignores the name and prints an error.
regex {
+group(name = "anychar") { // will match any single character
+any
}
}
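Reading a named group back is done through the standard `MatchResult` API. The sketch below assumes the named group is emitted in the usual JVM syntax, `(?<anychar>.)`, which is not confirmed by this document; `firstNamed` is a hypothetical helper, and looking groups up by name needs a reasonably recent Kotlin standard library on the JVM.

```kotlin
// Assumed shape of the generated pattern for `group(name = "anychar") { +any }`.
val anyChar = Regex("""(?<anychar>.)""")

// Hypothetical helper: returns the text captured by the group named "anychar", if any.
fun firstNamed(input: String): String? =
    anyChar.find(input)?.groups?.get("anychar")?.value

fun main() {
    println(firstNamed("xyz")) // x
    println(firstNamed(""))    // null
}
```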
Non-capture groups work the same as capture groups, but aren't assigned an ID and cannot be named.
They can be created by passing `KetexGroupType.NonCapture` to `type` in `group`.
regex {
+"A"
+group(type = KetexGroupType.NonCapture) {
+"A"
}
+index(1) // this is invalid, the preceding group has no ID
}
Lookahead and lookbehind groups let you assert that text is present (or absent) without including it in the match.

| | Positive | Negative |
|---|---|---|
| Lookahead | matches only if the group's pattern follows the preceding token | matches only if the group's pattern does not follow the preceding token |
| Lookbehind | matches only if the group's pattern precedes the next token | matches only if the group's pattern does not precede the next token |
These types of groups can be created by passing the relevant types in `KetexGroupType` to the `type` parameter in `group`.
regex {
+"A" // this "A" will only be matched if followed by another "A"
+group(type = KetexGroupType.PositiveLookahead) {
+"A" // this "A" won't be captured, only the first "A" will
}
}
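A quick way to see that the lookahead text is excluded from the match is to inspect the match ranges. This sketch uses the hand-written pattern `A(?=A)`, assumed to be equivalent to the DSL block above; `matchRanges` is a hypothetical helper.

```kotlin
// Hand-written stand-in for "A" followed by a positive lookahead on "A" (an assumption).
val lookahead = Regex("A(?=A)")

// Hypothetical helper: index ranges of every match in the input.
fun matchRanges(input: String): List<IntRange> =
    lookahead.findAll(input).map { it.range }.toList()

fun main() {
    // Only the first "A" qualifies, and the match is one character long.
    println(matchRanges("AAB")) // [0..0]
    // The last "A" has nothing after it, so it is not matched.
    println(matchRanges("AAA")) // [0..0, 1..1]
}
```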
Sets are character classes: a set matches any single character listed inside it.
To create a set, use the `set` function:
regex {
+set {
+"a"
+"1"
} // turns into "[a1]", will match either 'a' or '1'
}
Additionally, you can use the `set` property with the get (`[]`) operator to create an inline set.
This method accepts `String`, `Char`, `CharRange`, and `KetexToken`, but only one type per set.
regex {
+set["a", "b", "c"]
+set['a'..'z', 'A'..'Z', '0'..'9']
+set[word, whitespace]
}
You can add all types of tokens to sets (including sets themselves).
You can invert sets (`[^abcd]`) using the not operator (`!`).
regex {
+!set {
+"abcd"
} // turns into `[^abcd]`
}
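As a rough check of what an inverted set accepts, the following sketch runs the hand-written pattern `[^abcd]` through the standard library. It is assumed to be the string the block above turns into; `charsOutsideSet` is a hypothetical helper.

```kotlin
// Hand-written stand-in for `!set { +"abcd" }` (an assumption).
val notAbcd = Regex("[^abcd]")

// Hypothetical helper: keeps only the characters that are NOT in the set.
fun charsOutsideSet(input: String): String =
    notAbcd.findAll(input).joinToString("") { it.value }

fun main() {
    println(charsOutsideSet("cabbage")) // ge
}
```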
Or intersect them with the infix `intersect` function, which acts the same as a boolean `&&` operator.
regex {
// turns into `[a-z&&[^g]]`
// will match any character from 'a' to 'z' except 'g'
+(set {
+"a-z"
} intersect !set { +'g' })
}
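Set intersection is a feature of the JVM regex engine, so the generated string can be tried directly on Kotlin/JVM. This sketch uses the pattern from the comment above, written by hand; `accepts` is a hypothetical helper.

```kotlin
// Pattern copied from the comment above; `&&` intersection is JVM regex syntax.
val lowercaseExceptG = Regex("[a-z&&[^g]]")

// Hypothetical helper: true if the single character is in the intersected set.
fun accepts(c: Char): Boolean = lowercaseExceptG.matches(c.toString())

fun main() {
    println(accepts('f')) // true
    println(accepts('g')) // false
    println(accepts('G')) // false
}
```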
Quantifiers specify how many times you expect to see a pattern.
Name | Token | Description |
---|---|---|
`count` | `{5}` | specifies an exact amount |
`between` | `{5,6}` | specifies a range |
`atLeast` | `{5,}` | specifies a minimum amount |
`some` | `+` | matches 1 or more |
`maybe` | `*` | matches 0 or more |
`option` | `?` | matches 0 or 1 |
`lazy` | `+?` | makes the preceding quantifier match as few characters as possible |
`or` | `a\|b` | matches the 1st token or the 2nd token; can be chained |
regex {
+("A".token count 5) // "A{5}"
+("A".token between 5..6) // "A{5,6}"
+("A".token atLeast 5) // "A{5,}"
+"A".token.some() // "A+"
+"A".token.maybe() // "A*"
+"A".token.option() // "A?"
+"A".token.some().lazy() // "A+?"
+("A".token or "B".token) // "A|B"
}
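The difference between a greedy and a lazy quantifier is easiest to see side by side. The sketch below uses hand-written patterns taken from the comments above (`A+`, `A+?`, `A{5,6}`) with the standard library; `firstMatch` is a hypothetical helper.

```kotlin
// Hand-written patterns copied from the comments above (assumed DSL output).
val greedyOneOrMore = Regex("A+")
val lazyOneOrMore = Regex("A+?")
val fiveOrSix = Regex("A{5,6}")

// Hypothetical helper: text of the first match, or null if there is none.
fun firstMatch(regex: Regex, input: String): String? = regex.find(input)?.value

fun main() {
    println(firstMatch(greedyOneOrMore, "AAAB")) // AAA
    println(firstMatch(lazyOneOrMore, "AAAB"))   // A
    // `matches` requires the whole input to match the pattern.
    println(fiveOrSix.matches("AAAAAA"))         // true
    println(fiveOrSix.matches("AAAAAAA"))        // false
}
```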
Anchors allow you to assert the position of matches in the regex.
Name | Token | Description |
---|---|---|
`start` | `^` | matches the beginning of the string |
`end` | `$` | matches the end of the string |
`wordBoundry` | `\b` | matches the position between a word character (`word`) and a non-word character (`!word`) |
regex {
+start // ^
+wordBoundry // \b
+end // $
}
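To illustrate how anchors constrain where a match may occur, here is a sketch with two hand-written patterns. The literal `cat` is added purely for illustration and does not come from the block above.

```kotlin
// Illustrative patterns: a word-boundary-delimited word and a whole-string match.
val wholeWord = Regex("""\bcat\b""")
val wholeString = Regex("^cat\$")

fun main() {
    // "concat" ends in "cat" but has no boundary before it, so it is skipped.
    println(wholeWord.findAll("cat concat cat").count()) // 2
    println(wholeString.containsMatchIn("cat"))          // true
    println(wholeString.containsMatchIn("a cat"))        // false
}
```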
To keep meta-sequences from slipping into the regex by accident, Ketex automatically escapes all meta characters for you.
To add raw content without escaping, pass `escape = false` to `add`:
regex {
add("This will not be escaped. That dot will match any character.", escape = false)
}
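Why escaping matters can be shown with the standard library's own `Regex.escape`. This is not necessarily how Ketex escapes internally; it is only a sketch of the escaped-versus-raw distinction, and `literal` and `raw` are hypothetical helpers.

```kotlin
// Hypothetical helpers contrasting escaped and raw content (not Ketex's implementation).
fun literal(text: String): Regex = Regex(Regex.escape(text)) // meta characters are neutralised
fun raw(text: String): Regex = Regex(text)                   // meta characters keep their meaning

fun main() {
    // Unescaped, the dot matches any character, so "125" matches "1.5".
    println(raw("1.5").containsMatchIn("125"))      // true
    // Escaped, the dot only matches a literal dot.
    println(literal("1.5").containsMatchIn("125"))  // false
    println(literal("1.5").containsMatchIn("v1.5")) // true
}
```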
Visit the API docs for more info: https://ketex.theonlytails.com/