Support raw regular expression triggers #6
Comments
Note that SuperScript uses ~ for WordNet expansion, which seems like a very nice syntax.
The other weekend I was feeling motivated to begin implementing this for the JavaScript version, but discovered it will take a bit more planning.
The approach I was going for was that: during parsing, I added a block of code to handle the
Then I got to the reply sorting algorithm. Initially this was complicated because everything used the
My ideas now for how to sort the regexps in the same bucket as the triggers might be along the lines of:
The other hard part will be merging the regexp list with the trigger list, since the sorting algorithm creates a whole bunch of different 'buckets' for types of triggers and concatenates them at the end. I think maybe I'll store the regexp list in its own array, do a final pass through the 'final' sorted trigger array, and inject the regexps in the appropriate places sorted by their length.
If anyone has better ideas, feel free to suggest them. My idea isn't perfect and is bound to have all sorts of issues, but, since the
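One way to read the merge step described above, as a minimal sketch: keep the regexps in their own array, walk the already-sorted trigger array, and place each regexp ahead of the first trigger that is shorter than it. The function name `injectRegexps` and the `{ type, text }` entry shape are hypothetical and do not come from rivescript-js, and comparing raw string lengths is only an assumption about what "sorted by their length" would mean in practice.

```javascript
// Hypothetical helper, not part of rivescript-js.
// sortedTriggers: trigger strings, already in their final sorted order.
// regexps: raw regexp source strings collected separately during parsing.
function injectRegexps(sortedTriggers, regexps) {
  // Longest regexps first, on the assumption that longer means more specific.
  const pending = [...regexps].sort((a, b) => b.length - a.length);
  const merged = [];

  for (const trigger of sortedTriggers) {
    // Emit every pending regexp that is longer than the current trigger.
    while (pending.length > 0 && pending[0].length > trigger.length) {
      merged.push({ type: "regexp", text: pending.shift() });
    }
    merged.push({ type: "trigger", text: trigger });
  }

  // Regexps no longer than any trigger end up at the tail of the list.
  for (const text of pending) {
    merged.push({ type: "regexp", text });
  }
  return merged;
}
```

Because the real sorted array is built from several buckets and is not strictly ordered by length, this rule would not necessarily keep a regexp inside the bucket it "belongs" to; it only illustrates the single final pass.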
Actually, we would use it quite a lot for ALL our bots and triggers that are in Chinese. It's impossible to use the normal simple regex syntax for languages that don't use the space as a word separator.
Would it help raise priority if we offered a bounty for this issue? It could be quite useful for us going forward...
That's cool for them. It won't interfere with the
I'm not interested in integrating WordNet into RiveScript though because it would make the implementations diverge, as there isn't a WordNet equivalent for all the other programming languages I use. I would be open to implementing a pluggable NLP system, though: RiveScript could look for
Good to know; I'll try to find a robust solution to this problem then.
Sure! I only work on RiveScript once in a while between all my other projects and limited free time. 😄
I love your work, @kirsle. Is there any update on whether RegEx is supported now, or any timeline for when this can be implemented?
Thanks @kirsle for your work! I just switched from SuperScript to RiveScript for its full documentation and simplicity. I would also like to know whether there has been any progress on this issue. In my situation,
@lijiarui Welcome! FYI, we've had some discussion here about Chinese matching: currently it doesn't seem to work well, and we haven't found a good workaround beyond using something like a separate upstream regex filter, but that doesn't scale very well to lots of content.
Thanks, @dcsan! Following your suggestion, I have tried RiveScript over the past few days and found it really convenient to use, but the lack of regular expression triggers is a real problem, and I'm looking forward to this being fixed.
I wonder if RegExp is perhaps often overkill as a solution just for optional wildcards not matching in Unicode (below), and whether it adds more risk of error for the end user: aichaos/rivescript-js#147
The Java trigger engine seems to have a workaround, perhaps due to its more forgiving regex engine.
For Java only: while this syntax is not enough to match the Japanese word for dog (though it works for English)
This workaround seems to work
What if we leave ~ for RegEx but come up with another syntax just for this wildcard mode? The good symbols are taken, but we could use & I suppose
could be handled internally as
whereas
would end up
If the idea works in Java, we can see whether the constrained case just for wildcards can be figured out in the other language interpreters as well. I was thinking of implementing this as a pre-ingestion filter on the RiveScript files themselves, to see how it works out without having to touch the internals. Thoughts?
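The pre-ingestion filter idea can be sketched without touching an interpreter's internals. The following is a minimal illustration only: the function name `expandWildcardAliases` is hypothetical, the `& word` line shape is an assumption about the proposed syntax, and the expansion pattern is borrowed from the `?` alias snippet that appears later in this thread rather than from the examples elided in this comment. It also ignores the array syntax support mentioned further down.

```javascript
// Hypothetical pre-ingestion filter: rewrites "& word" lines into ordinary
// "+" triggers before the source text reaches a RiveScript parser.
function expandWildcardAliases(source) {
  return source
    .split("\n")
    .map((line) => {
      // Assumed alias shape: optional indent, "&", then the word to match.
      const m = line.match(/^(\s*)&\s*(.+?)\s*$/);
      if (!m) return line; // every other line passes through untouched
      const indent = m[1];
      const word = m[2];
      // Expansion taken from the "?" alias snippet in this thread: the word
      // with every combination of optional and required wildcards around it.
      return `${indent}+ ([*]${word}[*]|*${word}*|*${word}[*]|[*]${word}*)`;
    })
    .join("\n");
}
```

Running the filter over the file text before loading it would mean the parser only ever sees standard `+` triggers, which is what keeps the change outside the interpreter.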
@alecl
@alecl an 'alias' to help write those kinds of triggers sounds like a good idea. What do you think of this sort of syntax?
The rule would be: if the trigger text begins and ends with a
var trigger = "? 犬 ?"; // parsed after the + command
var m = trigger.match(/^\?(.+?)\?$/);
if (m) {
  var word = m[1].trim(); // spaces around the ?'s are optional
  // A template literal (backticks) is required for ${word} to be interpolated.
  trigger = `([*]${word}[*]|*${word}*|*${word}[*]|[*]${word}*)`;
}
For bot authors, you'd just replace the
@kirsle I'm concerned about relaxing the prohibition on ? in triggers: removing it so the alias can work may lead people to put question marks in triggers by mistake, for the wrong reason. Do you disagree with adding & for a wildcard trigger instead? I started going through the code in Parser.java to test it out, beginning by replacing cmd.equals("+") with the following
Then, in the case statement of the parse function, I handle the & before the +, rewriting the line with the alias expansion (including array syntax support) and letting the rest of the + code fall through and handle it as usual.
@kirsle I updated the above snippet with a working approach. If agreed, I can see about submitting a PR to rivescript-java that also includes unit tests.
So I think the simple approach is indeed going to be to add an alias of sorts. I think the
We can call the command
For example, this logic could be done pretty early, right after the command symbol has been separated from its text, after here: https://github.com/aichaos/rivescript-js/blob/2d1a81d9cebea682be8c1eab4423a6a2b882d1d3/src/parser.coffee#L161-L166
And so before much else happens, it would set
Per aichaos/rivescript-js#147 and aichaos/rivescript-python#78, it may be time for RiveScript to re-gain the ~Regexp command from its ancestor Chatbot::Alpha.
I'm increasingly becoming aware that Unicode is hard and regular expression engines are not all created equal. Each programming language has its own little quirks wrt. how meta expressions like the \b word-boundary sequence behave when matching Unicode symbols.
The RiveScript spec should be amended (and the primary implementations updated) to support a ~ command for writing a raw regular expression. This will enable users to help themselves when they run into regexp matching bugs that +Triggers can't handle, and can't be modified to handle (either because it would break backward compatibility or because the +Trigger already reserves too many regexp special characters for its own use case).
The use of the ~Regexp should be generally discouraged in all documentation, and it should be stated that its purpose is only to help with advanced use cases where the +Trigger system is inadequate. You could compare it to the way that database ORMs still allow you to write a raw SQL query by hand, but they strongly encourage you to use the ORM's object model as intended.
Implementation Notes
The ~Regexp should be treated the same as the +Trigger in RiveScript source files (when either command is seen, it becomes the new "root" of the reply data and any following *Condition, -Reply, @Redirect and so on would apply to the most recently seen +Trigger or ~Regexp). In the case that a ~Regexp was used, functions like triggerRegexp() do not get called and the raw regexp is used as-is. This means of course that you can't use tags like <bot name> inside a ~Regexp.
Captured groups from the regexp that would go into $1..$n will get captured the same for <star1>..<starN>.
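The notes above can be illustrated with a deliberately simplified model. This is a sketch under stated assumptions, not the rivescript-js implementation: `parseLines` and `matchRegexpRoot` are hypothetical names, only the `+`, `~` and `-` commands are handled, and anchoring the raw pattern to the whole message with `^...$` is an assumption that the proposal does not spell out.

```javascript
// Hypothetical, simplified model; names are illustrative only.
function parseLines(lines) {
  const roots = [];
  let current = null;
  for (const raw of lines) {
    const line = raw.trim();
    if (line === "") continue;
    const cmd = line[0];
    const text = line.slice(1).trim();
    if (cmd === "+" || cmd === "~") {
      // Either command becomes the new "root" for the lines that follow.
      current = { kind: cmd === "~" ? "regexp" : "trigger", pattern: text, replies: [] };
      roots.push(current);
    } else if (cmd === "-" && current) {
      current.replies.push(text); // applies to the most recently seen root
    }
  }
  return roots;
}

function matchRegexpRoot(root, message) {
  // The raw pattern is used as-is: no trigger-to-regexp conversion, no tags.
  // Anchoring with ^ and $ is an assumption made for this sketch.
  const m = new RegExp("^" + root.pattern + "$").exec(message);
  if (!m || root.replies.length === 0) return null;
  // Capture groups $1..$n are exposed to the reply as <star1>..<starN>.
  return root.replies[0].replace(/<star(\d+)>/g, (_, n) => m[Number(n)] || "");
}
```

For instance, parsing the two lines `~ my name is (.+?)` and `- Nice to meet you, <star1>.` and then matching the message `my name is alice` against the resulting root would fill `<star1>` from the first capture group.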
Syntax Examples
Here are a few examples of how some common triggers would be represented by raw regular expressions:
+ my name is *
~ my name is (.+?)
+ i am # years old
~ i am (\d+?) years old
+ @hello
+ i am <bot name>