Usage Instructions

Creating a basic regex

To create a basic regex, you have 3 options: regex, regexAsString, and buildRegex.

regex is the default option, returning a basic Regex object to use with the Kotlin standard library extensions or the JDK methods.
regexAsString returns the raw regex string, which can then be used as you wish.
buildRegex returns the internal KetexBuilder object. Using it directly is possible, but not recommended.

Examples:

regex {
  +'A'
} // : Regex

regexAsString {
  +'A'
} // : String

buildRegex {
  +'A'
} // : KetexBuilder
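
Once you have a Regex, it behaves like any other Kotlin Regex. Below is a minimal sketch of the standard-library calls. It assumes a hand-written Regex("A") as a stand-in for the object the regex block above returns; the stand-in and the countAs helper are illustrative and not part of Ketex.

```kotlin
// Stand-in for the result of `regex { +'A' }`; written by hand, not produced by Ketex.
val letterA = Regex("A")

// Hypothetical helper: counts how many times the pattern occurs in the text.
fun countAs(text: String): Int = letterA.findAll(text).count()

fun main() {
    println(letterA.containsMatchIn("BANANA")) // true
    println(countAs("BANANA"))                 // 3
    println(letterA.replace("BANANA", "_"))    // B_N_N_
}
```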

Tokens

Tokens are a special concept used internally by Ketex to represent text fragments.

They can represent:

String
Char
CharRange (uses a set)
KetexGroup (implements KetexToken internally, no need for any conversion)
KetexSet (implements KetexToken internally, no need for any conversion)

Some features of the DSL convert these types into KetexToken automatically. Others (such as quantifiers) require an explicit conversion, which you can do with the token extension property.

regex {
  +('A'.token count 5) // here, we use `token` to convert the Char into a token, and add a quantifier to it.
}

To create custom, more complex tokens, you can use the KetexToken interface (though it's very rarely needed):

regex {
  +(KetexToken {
    if (Random().nextBoolean()) "true" else "false"
  })
}

Built-in tokens

Some tokens are built into Ketex, and have special meanings inside regexes (AKA meta-sequences).

Property name	Token	Description
`any`	`.`	any character - `[^\n\r]`
`newline`	`\n`	newline character
`carriageReturn`	`\r`	carriage return character
`tab`	`\t`	tab character
`word`	`\w`	word character - `[a-zA-Z0-9_]`
`digit`	`\d`	digit - `[0-9]`
`whitespace`	`\s`	whitespace character
`unicodeNewlines`	`\R`	unicode newline character
`verticalWhitespace`	`\v`	vertical whitespace character
`horizontalWhitespace`	`\h`	horizontal whitespace character (spaces, tabs, etc)
`index`	`\#`	reference another matching group by its index
`property`	`\p{Property}`	a character in the given script/block or with the given property

If a token begins with \ followed by a single character, it can be inverted using the ! operator. For example, !word would be equivalent to \W (any character that does not match \w).

regex {
  +word // \w
  +!word // \W
  +digit // \d
  +any // .
  +index(1) // \1
}
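
To see what inversion means in practice, here is a small sketch using the Kotlin standard library directly. The patterns are written by hand to mirror the comments above (\w for word, \W for !word); the splitByClass helper is hypothetical.

```kotlin
val wordChar = Regex("\\w")    // the pattern `word` stands for
val nonWordChar = Regex("\\W") // the pattern `!word` stands for

// Hypothetical helper: separates word characters from everything else.
fun splitByClass(text: String): Pair<String, String> {
    val words = wordChar.findAll(text).joinToString("") { it.value }
    val others = nonWordChar.findAll(text).joinToString("") { it.value }
    return words to others
}

fun main() {
    val (words, others) = splitByClass("a_1 b-2!")
    println(words)        // a_1b2
    println("[$others]")  // [ -!]
}
```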

Groups

Groups let you isolate part of the full match for later referencing or assert the presence of a match without capturing it.

There are 3 major group types:

capture groups (default)
non-capture groups
positive/negative lookahead/lookbehind

To create a group, use the group function. The type can be set using the type parameter (the types can be found in the KetexGroupType enum).

Capture Groups

Capture groups isolate a part of the match and let you reference it later on, while still being included in the overall match.

You can reference a group either by its ID (IDs start at 1) or by its name (if it's named).

regex {
  +group { // declare a group matching a single character
    +any
  }
  +index(1) // reference the group and find another match equal to the text matched by group 1
}
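
A backreference matches the same text the group matched, not the same pattern. The sketch below uses a hand-written pattern, (.)\1, which is assumed to be equivalent to the example above; firstDoubled is a hypothetical helper.

```kotlin
// Hand-written equivalent of "a group matching any character, followed by \1".
val doubledChar = Regex("(.)\\1")

// Hypothetical helper: returns the first run of two identical characters, if any.
fun firstDoubled(text: String): String? = doubledChar.find(text)?.value

fun main() {
    println(firstDoubled("hello")) // ll
    println(firstDoubled("world")) // null
    // groupValues[1] holds the text captured by group 1.
    println(doubledChar.find("hello")?.groupValues?.get(1)) // l
}
```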

To name a group, use the optional name parameter of the group function. Only capture groups can be named: for any other group type, the name is ignored and an error is printed.

regex {
  +group(name = "anychar") { // will match any single character
    +any
  }
}
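
Reading a named group back can be sketched with the JDK matcher. This assumes the generated pattern uses the JDK named-group syntax (?<anychar>.), which is not confirmed here; the pattern is written by hand and namedChar is a hypothetical helper.

```kotlin
// Hand-written pattern using the JDK named-group syntax (an assumption about Ketex's output).
val anyCharGroup = Regex("(?<anychar>.)")

// Hypothetical helper: returns the text captured by the group named "anychar".
fun namedChar(text: String): String? {
    val matcher = anyCharGroup.toPattern().matcher(text)
    return if (matcher.find()) matcher.group("anychar") else null
}

fun main() {
    println(namedChar("xyz")) // x
    println(namedChar(""))    // null
}
```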

Non-capture Groups

Non-capture groups work the same as capture groups, but aren't assigned an ID and cannot be named.

They can be created by passing KetexGroupType.NonCapture to type in group.

regex {
  +"A"
  +group(type = KetexGroupType.NonCapture) {
    +"A"
  }
  +index(1) // this is invalid, the preceding group has no ID
}
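
The difference between the two group types shows up in the number of groups a match reports. A minimal sketch with hand-written patterns (capturedGroups is a hypothetical helper):

```kotlin
val capturing = Regex("A(A)")      // second "A" is in capture group 1
val nonCapturing = Regex("A(?:A)") // second "A" is grouped, but not captured

// Hypothetical helper: number of capture groups reported by the first match.
// groupValues[0] is always the whole match, so it is not counted.
fun capturedGroups(regex: Regex, text: String): Int =
    (regex.find(text)?.groupValues?.size ?: 1) - 1

fun main() {
    println(capturedGroups(capturing, "AA"))    // 1
    println(capturedGroups(nonCapturing, "AA")) // 0
    println(nonCapturing.find("AA")?.value)     // AA
}
```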

Positive/Negative Lookahead/Lookbehind

These types of groups let you assert that text is present (or absent) at a position without including it in the match.

	Positive	Negative
Lookahead	matches only if the group's pattern follows the preceding token	matches only if the group's pattern does not follow the preceding token
Lookbehind	matches only if the group's pattern precedes the next token	matches only if the group's pattern does not precede the next token

These types of groups can be created by passing the relevant types in KetexGroupType to the type parameter in group.

regex {
  +"A" // this "A" will only be matched if followed by another "A"
  +group(type = KetexGroupType.PositiveLookahead) {
    +"A" // this "A" won't be captured, only the first "A" will
  }
}
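
The effect of lookaround is easiest to see in the positions of the matches. The sketch below uses hand-written patterns in the standard JDK lookaround syntax; matchStarts is a hypothetical helper.

```kotlin
val aBeforeA = Regex("A(?=A)")   // positive lookahead: "A" followed by "A"
val aNotBeforeA = Regex("A(?!A)") // negative lookahead: "A" not followed by "A"
val bAfterA = Regex("(?<=A)B")   // positive lookbehind: "B" preceded by "A"

// Hypothetical helper: start index of every match in the text.
fun matchStarts(regex: Regex, text: String): List<Int> =
    regex.findAll(text).map { it.range.first }.toList()

fun main() {
    println(matchStarts(aBeforeA, "AAB"))    // [0]
    println(matchStarts(aNotBeforeA, "AAB")) // [1]
    println(matchStarts(bAfterA, "AAB"))     // [2]
    println(aBeforeA.find("AAB")?.value)     // A (the lookahead text is not part of the match)
}
```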

Sets

Sets are character classes: a set matches any single character listed inside it.

To create a set, use the set function:

regex {
  +set {
    +"a"
    +"1"
  } // turns into "[a1]", will match either 'a' or '1'
}

Additionally, you can use the set property with the get ([]) operator to create an inline set.

This method accepts String, Char, CharRange, and KetexToken, but only one type per set.

regex {
  +set["a", "b", "c"]
  +set['a'..'z', 'A'..'Z', '0'..'9']
  +set[word, whitespace]
}

You can add all types of tokens to sets (including sets themselves).

You can invert sets ([^abcd]) using the not operator (!).

regex {
  +!set {
    +"abcd"
  } // turns into `[^abcd]`
}

You can also intersect sets with the infix intersect function, which works like a boolean && operator: a character must match both sets.

regex {
  // turns into `[a-z&&[^g]]`
  // will match any character from 'a' to 'z' except 'g'
  +(set {
    +"a-z"
  } intersect !set { +'g' }) 
}
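
The inverted and intersected patterns from the comments above can be tried directly against the standard library. This is a sketch with hand-written patterns; note that the && intersection syntax is specific to JVM regexes. The keepMatching helper is hypothetical.

```kotlin
val notAbcd = Regex("[^abcd]")         // inverted set
val lowerExceptG = Regex("[a-z&&[^g]]") // intersection of [a-z] and [^g]

// Hypothetical helper: keeps only the characters the set matches.
fun keepMatching(set: Regex, text: String): String =
    set.findAll(text).joinToString("") { it.value }

fun main() {
    println(keepMatching(notAbcd, "abcdef"))     // ef
    println(keepMatching(lowerExceptG, "fghG1")) // fh
}
```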

Quantifiers

Quantifiers specify how many times you expect to see a pattern.

Name	Token	Description
`count`	`{5}`	specifies an exact amount
`between`	`{5,6}`	specifies a range
`atLeast`	`{5,}`	specifies a min amount
`some`	`+`	matches 1 or more
`maybe`	`*`	matches 0 or more
`option`	`?`	matches 0 or 1
`lazy`	`+?`	makes the preceding quantifier match as few characters as possible
`or`	`a|b`	matches the first token or the second token; can be chained

regex {
  +("A".token count 5) // "A{5}"
  +("A".token between 5..6) // "A{5,6}"
  +("A".token atLeast 5) // "A{5,}"
  +"A".token.some() // "A+"
  +"A".token.maybe() // "A*"
  +"A".token.option() // "A?"
  +"A".token.some().lazy() // "A+?"
  +("A".token or "B".token) // "A|B"
}
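
To see how each quantifier changes what is matched, the patterns from the comments above can be run against sample input. A minimal sketch using the standard library only (firstMatch is a hypothetical helper):

```kotlin
// Hypothetical helper: text of the first match of the pattern, or null if there is none.
fun firstMatch(pattern: String, text: String): String? = Regex(pattern).find(text)?.value

fun main() {
    println(firstMatch("A{5}", "AAAAAAA"))   // AAAAA
    println(firstMatch("A{5,6}", "AAAAAAA")) // AAAAAA
    println(firstMatch("A{5,}", "AAAAAAA"))  // AAAAAAA
    println(firstMatch("A+", "AAAA"))        // AAAA (greedy)
    println(firstMatch("A+?", "AAAA"))       // A (lazy)
    println(firstMatch("A?", "AAAA"))        // A
    println(firstMatch("A|B", "xBA"))        // B
    // "A*" also matches zero characters, so the first match in "BAA" is empty.
    println(firstMatch("A*", "BAA") == "")   // true
}
```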

Anchors

Anchors allow you to assert the position of matches in the regex.

Name	Token	Description
`start`	`^`	matches the beginning of the string
`end`	`$`	matches the end of the string
`wordBoundry`	`\b`	matches the position between a word character (`word`) and a non-word character (`!word`)

regex {
  +start // ^
  +wordBoundry // \b
  +end // $
}
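
Anchors match positions rather than characters. The sketch below shows the effect with hand-written patterns and the standard library:

```kotlin
val wholeWordCat = Regex("\\bcat\\b") // "cat" with a word boundary on both sides
val onlyCat = Regex("^cat$")          // "cat" spanning the entire string

fun main() {
    println(wholeWordCat.containsMatchIn("a cat sat"))   // true
    println(wholeWordCat.containsMatchIn("concatenate")) // false
    println(onlyCat.containsMatchIn("cat"))              // true
    println(onlyCat.containsMatchIn("a cat"))            // false
}
```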

Escaping

To avoid accidentally adding meta sequences to the regex, Ketex automatically escapes all meta characters for you.

To add raw content without escaping, do the following:

regex {
  add("This will not be escaped. That dot will match any character.", escape = false)
}
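
The difference between escaped and raw content can be illustrated with the standard library. This sketch uses Regex.escape as a stand-in; how Ketex escapes characters internally is not specified here, so treat the mechanism as an assumption.

```kotlin
// Escaped: the dot is a literal dot. Regex.escape is stdlib, used here as a stand-in.
val literalDot = Regex(Regex.escape("a.c"))
// Raw: the dot is a meta character that matches any character.
val rawDot = Regex("a.c")

fun main() {
    println(literalDot.matches("a.c")) // true
    println(literalDot.matches("abc")) // false
    println(rawDot.matches("a.c"))     // true
    println(rawDot.matches("abc"))     // true
}
```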

Visit the API docs for more info: https://ketex.theonlytails.com/
