-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Like function: Error while using double escape char in pattern string #8040
Like function: Error while using double escape char in pattern string #8040
Conversation
✅ Deploy Preview for meta-velox canceled.
|
Hi @mbasmanova and @xumingming, please take a look. Thanks! |
velox/functions/lib/Re2Functions.cpp
Outdated
@@ -1170,7 +1176,16 @@ PatternMetadata determinePatternKind( | |||
} else { | |||
// Record the first fixed pattern start. | |||
if (fixedPatternStart == -1) { | |||
fixedPatternStart = iterator.currentIndex(); | |||
if (iterator.isEscaped()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch. Thank you for the fix.
currentIndex() method and this logic beaks the abstraction, the promise that iterator handles escape characters transparently.
Can we maybe replace currentIndex() with currentStart() and nextStart() methods that return the start position of the current logical character and start position of the next logical character. Or something similar that doesn't require the caller to handle escape character explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhztheplayer thanks for the catch!
Um, just like @mbasmanova pointed out, the key here is currentIndex()
breaks the abstraction, current simple way of parsing the pattern is rigid so it depends on currentIndex(). I have an incoming PR which will support optimizations for more relaxed pattern like _a_b%
which will require more flexible parsing of the pattern. In that PR I remove the dependency on currentIndex()
, just rely on current()
to tell me what current char is, so after collecting all the chars from current()
, it is the unescaped
string we are looking for.
So my suggestion is:
- I submit a PR to disable the optimization for escaped scenario for the moment.
- In my PR to support relaxed pattern, fix this issue thoroughly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion.
I looked at the code but found that we don't appear to require for nextStart()
, currentStart()
could be enough. Let's only add one for now to keep the code simple?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xumingming Hi Mingming, sorry for missing your message. So would you mean to fix the issue in your incoming PR instead? If yes, it should be OK to me but the regression blocks our daily testing which is a little bit hurry for us to have a fix landed. Is it possible to use this PR then rebase your patch onto it, if it works for you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhztheplayer Sure, go ahead.
|
||
int32_t currentIndex_{-1}; | ||
size_t currentStart_{0}; | ||
size_t nextStart_{0}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nextStart_
is just a internal state
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
velox/functions/lib/Re2Functions.cpp
Outdated
|
||
// Index of the first % or _ character(not escaped). | ||
int32_t wildcardStart = -1; | ||
size_t wildcardStart = -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wildcardStart is an index, if it is not found, it could be negative(-1), while size_t
is unsigned?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I'll change it back to int32_t
. Just missed that the values are -1
s.
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@mbasmanova merged this pull request in 840aefd. |
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
Expression:
Error:
The bug is a regression caused by #7730