Support raw regular expression triggers #6

Open
kirsle opened this issue Mar 10, 2017 · 15 comments


kirsle commented Mar 10, 2017

Per aichaos/rivescript-js#147 and aichaos/rivescript-python#78, it may be time for RiveScript to regain the ~Regexp command from its ancestor Chatbot::Alpha.

I'm increasingly becoming aware that Unicode is hard and regular expression engines are not all created equal. Each programming language has its own quirks in how metacharacters like the \b word-boundary sequence behave when matching Unicode symbols.
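
A minimal illustration of the \b quirk in JavaScript, where \w (and therefore \b) only considers ASCII word characters. This is a standalone sketch with made-up variable names; it is not RiveScript's trigger-compilation code.

```javascript
// In JavaScript, \b marks a transition between \w ([A-Za-z0-9_]) and non-\w.
// CJK characters are not \w, so no boundary exists around them on their own.
const asciiWord = /\bhello\b/;
const cjkWord = /\b你好\b/;

const asciiMatches = asciiWord.test("hello");  // true
const cjkMatches = cjkWord.test("你好");        // false: no ASCII word character on either side
const cjkInsideAscii = cjkWord.test("a你好b");  // true: the boundaries come from "a" and "b"
```

Other engines may treat \b differently for the same input, which is the inconsistency described above.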

The RiveScript spec should be amended (and the primary implementations updated) to support a ~ command for writing a raw regular expression. This will enable users to help themselves when they run into regexp matching bugs that +Triggers can't handle, and can't be modified to handle (either because it would break backward compatibility or because the +Trigger already reserves too many regexp special characters for its own use case).

The use of the ~Regexp should be generally discouraged in all documentation and it should be stated that its purpose is only to help with advanced use cases where the +Trigger system is inadequate. You could compare it to the way that database ORMs still allow you to write a raw SQL query by hand, but they strongly encourage you to use the ORM's object model as intended.

Implementation Notes

The ~Regexp should be treated the same as the +Trigger in RiveScript source files (when either command is seen, it becomes the new "root" of the reply data and any following *Condition, -Reply, @Redirect and so on would apply to the most recently seen +Trigger or ~Regexp). In the case that a ~Regexp was used, the functions like triggerRegexp() do not get called and the raw regexp is used as-is. This means of course that you can't use tags like <bot name> inside a ~Regexp.

Captured groups from the regexp that would go into $1..$n are captured the same way into <star1>..<starN>.
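
A rough sketch of how captured groups could map onto stars. The helper name `matchRawRegexp` and the choice to anchor the pattern to the whole message are assumptions for illustration, not part of any RiveScript implementation.

```javascript
// Hypothetical helper: match a message against a raw ~Regexp source string
// and expose the captured groups as star1..starN.
function matchRawRegexp(source, message) {
  // Assumption: the raw regexp must match the entire message.
  const re = new RegExp("^(?:" + source + ")$");
  const m = message.match(re);
  if (!m) return null;

  const stars = {};
  for (let i = 1; i < m.length; i++) {
    stars["star" + i] = m[i];
  }
  return stars;
}

const nameStars = matchRawRegexp("my name is (.+?)", "my name is alice");
const ageStars = matchRawRegexp("i am (\\d+?) years old", "i am 25 years old");
// nameStars.star1 is "alice"; ageStars.star1 is "25"
```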

Syntax Examples

Here are a couple of examples of how some common triggers would be represented by raw regular expressions:

+Trigger Version          ~Regexp Equivalent
+ my name is *            ~ my name is (.+?)
+ i am # years old        ~ i am (\d+?) years old
+ [*] (hello|hi) [*]
+ @hello                  N/A
+ i am <bot name>         N/A

dcsan commented May 29, 2017

note that superscript uses ~ for wordnet expansion, which seems like a very nice syntax.
https://github.com/superscriptjs/superscript/wiki/Triggers#wordnet-expansion


kirsle commented May 30, 2017

The other weekend I was feeling motivated to begin implementing this for the JavaScript version, but discovered it will take a bit more planning.

The approach I was going for: during parsing, I added a block of code to handle the ~ command, which did mostly what the + one does except it put the line into a regexp attribute instead of trigger. The rest of the parsing just worked as normal because I keep a pointer to the current trigger object that the replies/conditions/etc. insert their things into.

Then I got to the reply sorting algorithm. Initially this was complicated because everything used the trigger attribute (I later thought maybe I should make the regexp attribute be a boolean, and store the raw regexp in the trigger to make the sorting code easier). But besides that, the normal rules for how I sort triggers (atomic > alternatives > optionals > wildcards) don't apply very well to regular expressions. I didn't feel like coming up with a solution at the time so I shelved the project for later.

My ideas now for how to sort the regexps in the same bucket as the triggers might be along the lines of:

  • Just sort them by length. Don't do any special introspection of their contents like you do with normal triggers.
  • Bend the regular expression system by allowing a {weight} tag to be supported, like in triggers. So shorter regexps can still be given higher priority; if they're being sorted purely by length they'd be placed pretty far down the sort list.
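
The two ideas above can be sketched together as follows. The names `parseWeight` and `sortRegexps` are hypothetical, and the exact {weight} syntax inside a ~Regexp is an assumption borrowed from how triggers write it.

```javascript
// Hypothetical: pull an optional {weight=N} tag out of a raw regexp line.
function parseWeight(line) {
  const m = line.match(/\{weight=(\d+)\}/);
  return {
    weight: m ? parseInt(m[1], 10) : 0,
    pattern: line.replace(/\s*\{weight=\d+\}\s*/, "").trim(),
  };
}

// Sort by weight (highest first), then by pattern length (longest first),
// without inspecting the contents of the regexp.
function sortRegexps(lines) {
  return lines
    .map(parseWeight)
    .sort((a, b) => b.weight - a.weight || b.pattern.length - a.pattern.length)
    .map((entry) => entry.pattern);
}

const sorted = sortRegexps([
  "hi",
  "my name is (.+?)",
  "help{weight=10}",
  "i am (\\d+?) years old",
]);
// "help" comes first because of its weight; the rest follow by length.
```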

The other hard part will be merging the regexp list with the trigger list, since the sorting algorithm creates a whole bunch of different 'buckets' for types of triggers and concatenates them at the end. I think maybe I'll store the regexp list in its own array, and do a final pass-thru of the 'final' sorted trigger array, and inject the regexps in the appropriate places sorted by their length.
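
One possible reading of that final pass, as a sketch only: walk the already-sorted trigger array and drop each regexp in just before the first trigger that is shorter than it. The function name `injectRegexps` and the `{ type, text }` record shape are made up for this example, and the real sorted trigger list is ordered by bucket rather than strictly by length, so the placement is approximate.

```javascript
// Hypothetical merge step: triggers keep their existing relative order,
// and regexps are injected by length as the list is walked.
function injectRegexps(sortedTriggers, regexps) {
  // Longest regexps first, so they are placed earliest.
  const pending = [...regexps].sort((a, b) => b.length - a.length);
  const merged = [];

  for (const trigger of sortedTriggers) {
    // Emit every pending regexp that is longer than the current trigger.
    while (pending.length > 0 && pending[0].length > trigger.length) {
      merged.push({ type: "regexp", text: pending.shift() });
    }
    merged.push({ type: "trigger", text: trigger });
  }

  // Anything left is shorter than every trigger and goes at the end.
  for (const re of pending) {
    merged.push({ type: "regexp", text: re });
  }
  return merged;
}

const merged = injectRegexps(
  ["what is your name", "hello bot", "hi", "*"],
  ["yo", "my name is (.+?)"]
);
```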

If anyone has better ideas, feel free to suggest them. My idea isn't perfect and is bound to have all sorts of issues, but, since the ~Regexp command is really just designed for handling the crazy edge cases, a typical RiveScript brain should be expected to use them very sparingly.


dcsan commented May 31, 2017

> since the ~Regexp command is really just designed for handling the crazy edge cases,

actually we would use it quite a lot for ALL our bots and triggers that are in Chinese. It's impossible to use the normal simple regex syntax for languages that don't use the space as a word separator.


dcsan commented Jun 19, 2017

would it help raise priority if we offered a bounty for this issue? It could be quite useful for us going forward...


kirsle commented Jul 19, 2017

> note that superscript uses ~ for wordnet expansion, which seems like a very nice syntax.
> https://github.com/superscriptjs/superscript/wiki/Triggers#wordnet-expansion

That's cool for them. It won't interfere with the ~Regexp command in RiveScript because it's a command symbol vs. some syntax sugar in the data part of the command.

I'm not interested in integrating WordNet into RiveScript though because it would make the implementations diverge, as there isn't a WordNet equivalent for all the other programming languages I use.

I would be open to implementing a pluggable NLP system, though: RiveScript could look for ~words in commands and call out to an NLP plugin to dynamically provide the array of synonyms to use. Then you could hook WordNet up to it while not marrying RiveScript to WordNet.
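
A minimal sketch of what such a plugin hook might look like. The function `expandSynonyms`, the provider signature (word in, array of synonyms out), and the toy data are all assumptions for illustration; no such API exists in RiveScript today.

```javascript
// Hypothetical: replace each ~word in a trigger with an alternation built
// from whatever the pluggable synonym provider returns.
function expandSynonyms(trigger, synonymProvider) {
  return trigger.replace(/~(\w+)/g, (match, word) => {
    const synonyms = synonymProvider(word);
    if (!synonyms || synonyms.length === 0) return word;
    // Keep the original word and add its synonyms as alternatives.
    return "(" + [word, ...synonyms].join("|") + ")";
  });
}

// Stand-in provider with hard-coded data; a real one might wrap WordNet.
const toyTable = new Map([["like", ["love", "enjoy"]]]);
const toyProvider = (word) => toyTable.get(word) || [];

const expanded = expandSynonyms("i ~like you", toyProvider);
// expanded === "i (like|love|enjoy) you"
```

Keeping the provider as a plain function means the core stays free of any NLP dependency, which matches the goal of not tying RiveScript to WordNet.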

> since the ~Regexp command is really just designed for handling the crazy edge cases,
>
> actually we would use it quite a lot for ALL our bots and triggers that are in Chinese. It's impossible to use the normal simple regex syntax for languages that don't use the space as a word separator.

Good to know; I'll try to find a robust solution to this problem then.

> would it help raise priority if we offered a bounty for this issue? It could be quite useful for us going forward...

Sure! I only work on RiveScript once in a while between all my other projects and limited free time. 😄


mysticBliss commented Oct 5, 2017

I love your work, @kirsle. Is there any update on whether RegEx is supported now, or any timeline for when this can be implemented?


lijiarui commented Oct 6, 2017

Thanks @kirsle for your work!

I just switched from SuperScript to RiveScript for its complete documentation and simplicity. I'd also like to know whether there has been any progress on this issue.

In my situation, [*]你好[*] cannot match 哈你好.


dcsan commented Oct 7, 2017

@lijiarui Welcome!

FYI we've had some discussion here about chinese matching:
aichaos/rivescript-js#147

currently it doesn't seem to work well and we haven't found a good workaround, beyond using something like a separate upstream regex filter - but that doesn't scale very well to lots of content.


lijiarui commented Oct 7, 2017

Thanks, @dcsan! Following your suggestion, I tried RiveScript these past few days and found it really convenient to use, but regular expression triggers are a real problem. I'm looking forward to a fix.


alecl commented Dec 17, 2017

I wonder if RegExp is often overkill as a solution when the problem is just optional wildcards not matching in Unicode (see below), and it adds more risk of error for the end user:

aichaos/rivescript-js#147
aichaos/rivescript-js#253

The Java trigger engine seems to have a workaround perhaps due to its more forgiving regex engine.

In Java, this syntax is not enough to match the Japanese word for dog (though it works for English):

+ [*]犬[*]

This workaround seems to work

+ ([*]犬[*]|*犬*|*犬[*]|[*]犬*)

What if we leave ~ for RegEx but come up with another syntax just for this wildcard mode? The good symbols are taken, but we could use & I suppose:

& 犬

could be handled internally as

+ ([*]犬[*]|*犬*|*犬[*]|[*]犬*)

whereas

& (犬|ハムスター)

would end up

+ ([*]犬[*]|*犬*|*犬[*]|[*]犬*|[*]ハムスター[*]|*ハムスター*|*ハムスター[*]|[*]ハムスター*)

If the idea works in Java we can see if the constrained case just for wildcards can be figured out in the other language interpreters as well.

I was thinking of implementing this as a pre-ingestion filter on the Rivescript files themselves and see how it works out without having to touch the internals. Thoughts?
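
A minimal sketch of such a pre-ingestion filter, written in JavaScript for illustration. The function names `keywordVariations` and `preprocess` are hypothetical, and this version ignores tags such as {weight=x}; it only handles a bare keyword or an (a|b) array.

```javascript
// Expand one keyword into the four wildcard variations shown above.
function keywordVariations(word) {
  return [`[*]${word}[*]`, `*${word}*`, `*${word}[*]`, `[*]${word}*`];
}

// Rewrite "& ..." lines into "+ ..." lines before the RiveScript source
// reaches the interpreter; every other line passes through unchanged.
function preprocess(source) {
  return source
    .split("\n")
    .map((line) => {
      const m = line.match(/^\s*&\s*(.+?)\s*$/);
      if (!m) return line;
      // Accept either a bare keyword or an (a|b) array of keywords.
      const body = m[1].replace(/^\((.*)\)$/, "$1");
      const words = body.split("|").map((w) => w.trim()).filter(Boolean);
      return "+ (" + words.flatMap(keywordVariations).join("|") + ")";
    })
    .join("\n");
}

const filtered = preprocess("& 犬\n- woof");
// filtered === "+ ([*]犬[*]|*犬*|*犬[*]|[*]犬*)\n- woof"
```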


dcsan commented Dec 18, 2017

@alecl
Are you using the Java engine then?
agree that using real regex may create many problems, our content authors get easily confused.


kirsle commented Dec 18, 2017

@alecl an 'alias' to help write those kinds of triggers sounds like a good idea.

What do you think of this sort of syntax?

+ ? 犬 ?
- response

The rule would be: if the trigger text begins and ends with a ? character, it translates the trigger text into that whole set of optional versions that work with Unicode. Parser code example:

var trigger = "? 犬 ?";  // parsed after the + command
var m = trigger.match(/^\?(.+?)\?$/);
if (m) {
    var word = m[1].trim();  // spaces around the ?'s are optional
    // Backticks (a template literal) are required for ${word} to be interpolated.
    trigger = `([*]${word}[*]|*${word}*|*${word}[*]|[*]${word}*)`;
}

For bot authors, you'd just replace the [*] on either side of the trigger with ? symbols.


alecl commented Dec 18, 2017

@kirsle I'm concerned about lifting the prohibition on ? in triggers. Removing it so the alias can work might lead people to put question marks in triggers by mistake, for the wrong reason.

Do you disagree with adding & for a wildcard trigger instead?

I started going through the code in Parser.java to test it out. I began by replacing cmd.equals("+") with the following:

private boolean isTrigger(String cmd) {
	return (cmd.equals("+") || cmd.equals("&"));
}

Then, in the case statement of the parse function, I handle & before + by rewriting the line with the alias expansion (including array syntax support) and letting it fall through so the rest of the + code handles it as usual.

case "&": { // & Keyword anywhere alias trigger
	logger.debug("\tKeyword anywhere alias trigger pre-processing: {}", line);

	// Warning: We do not support parentheses in the & triggers for aliases while plain triggers
	// will support lone open or closing parentheses without considering it a mismatched array
	// Java RiveScript doesn't currently support <input> and <reply> in triggers
	// This code will need updating as well once it does.

	// Find any alias values whether individual or in (a|b) array syntax.
	Pattern keywordAliaserTriggerPattern = Pattern.compile("([^{}()|]+)");
	StringBuffer sb = new StringBuffer();
	Matcher matcher = keywordAliaserTriggerPattern.matcher(line);
	while (matcher.find()) {
		String value = matcher.group(1).trim();
		// Skip any control commands like {weight=x} that start with { from being replaced.
		// They are added back by appendTail and are not lost.
		if (!value.isEmpty() && line.charAt(Math.max(matcher.start() - 1, 0)) != '{') {
			// Note: we can't use $1 because we trim the value. quoteReplacement keeps any
			// '$' or '\' in the keyword from being read as a group reference or escape.
			matcher.appendReplacement(sb, Matcher.quoteReplacement("*"+value+"*|[*]"+value+"[*]|*"+value+"[*]|[*]"+value+"*"));
		}
	}
	matcher.appendTail(sb);

	line = sb.toString();
	// Note: Intentional fall through to case statement for + trigger below after & trigger alias expansion above
}


alecl commented Dec 18, 2017

@kirsle I updated the above snippet with a working approach. If agreed I can see about submitting a PR to rivescript-java that also includes unit tests.


kirsle commented Feb 21, 2018

So I think the simple approach is indeed going to be to add an alias of sorts. I think the ? character might be a more appropriate command for this case than & (which I read as "and"), and the parsers could all be updated pretty easily to handle it.

We can call the command ?Keyword as an alternative to +Trigger, and the Working Draft and tutorials can make a mention about this when they start talking about [*], with mention that the [*] optional wildcard has known issues with Unicode in some programming languages. All the tickets I've seen about this bug have been people wanting keyword support, and this is an easier solution than trying to allow raw regexps while still sorting them intelligently among the normal triggers.

  1. If the command symbol is ?, then take the text and rewrite it to every variation of wildcard to make it work in all cases.
  2. And then, pretend the command symbol was actually + but with the rewritten trigger text, and parse it as normal.

For example, this logic could be done pretty early right after the command symbol has been separated from its text, after here: https://github.com/aichaos/rivescript-js/blob/2d1a81d9cebea682be8c1eab4423a6a2b882d1d3/src/parser.coffee#L161-L166

And so before much else happens, it would set cmd="+" and line=(the rewritten trigger) and the remaining logic would continue without needing modification. It would be as if the bot author had written the long form of the trigger in the first place, but without them actually needing to do that themselves.
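
The rewrite step described above might look roughly like this. The names `rewriteKeywordCommand` and `parseLine` are hypothetical and the line-splitting is simplified; the actual parser in rivescript-js is structured differently.

```javascript
// Hypothetical: if the command is ?, rewrite the text into every wildcard
// variation and pretend the command was + all along.
function rewriteKeywordCommand(cmd, line) {
  if (cmd !== "?") return { cmd, line };
  const word = line.trim();
  return {
    cmd: "+",
    line: `([*]${word}[*]|*${word}*|*${word}[*]|[*]${word}*)`,
  };
}

// Simplified stand-in for the point where the parser has just separated
// the command symbol from its text.
function parseLine(raw) {
  const trimmed = raw.trim();
  const cmd = trimmed.charAt(0);
  const line = trimmed.slice(1).trim();
  return rewriteKeywordCommand(cmd, line);
}

const keywordResult = parseLine("? 犬");
const replyResult = parseLine("- response");
// keywordResult.cmd is "+"; replyResult is left untouched.
```

Because the rewrite happens before any other handling, the sorting and matching code never needs to know that a ?Keyword command existed.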
