[BUG] Integration test test_re_replace_all fails with a corner case #9731
Comments
I modified the test to show the input that causes this error:
|
Simpler repro: .with_special_case('a\x85')
Note that |
Adding the triage label back now that there is a summary of the issue. |
Need to investigate why we got \x85 as an input string, since it is not a valid UTF-8 string. |
It looks like this might be related to how Spark/Python interprets the string 'a\x85'.
[C2 85] is the UTF-8 encoding of "\u0085", the Next Line (NEL) control character. The result, when it is pulled back into Spark converts the So it looks like a kind of odd situation where
|
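For anyone following along, here is a minimal sketch of the encoding point above, assuming plain Python 3 with no Spark involved. The helper names `utf8_bytes` and `is_valid_utf8` are hypothetical and only for illustration. It shows that the str literal 'a\x85' holds the code point U+0085 and encodes to [C2 85], while the lone byte 0x85 is not valid UTF-8.

```python
def utf8_bytes(text: str) -> bytes:
    """Return the UTF-8 encoding of a Python 3 str."""
    return text.encode("utf-8")


def is_valid_utf8(raw: bytes) -> bool:
    """Report whether a byte sequence decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The str literal from the simpler repro. In a Python 3 str literal,
# '\x85' is the code point U+0085 (NEL), not a raw byte.
special_case = "a\x85"

print(special_case == "a\u0085")     # same str as the \u escape form
print(utf8_bytes(special_case))      # 'a' followed by the bytes C2 85
print(is_valid_utf8(b"a\x85"))       # lone continuation byte 0x85
print(is_valid_utf8(b"a\xc2\x85"))   # properly encoded NEL
```

This only covers what Python does with the literal. How Spark handles the value on the way in and out is not shown here.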
Note that
So the replace in this case is not actually working properly on the GPU. Also, Python 3 strings are by default UTF-8, so |
Note that we transpile the pattern |
cuDF repro:
@Test
void testStringReplaceEdgeCase() {
  TableDebug debug = TableDebug.builder().build();
  RegexProgram target = new RegexProgram(
      "[^\n\r\u0085\uc285\u2028\u2029]*(?:\r|\u0085|\uc285|\u2028|\u2029|\r\n)?$");
  try (ColumnVector input = ColumnVector.fromStrings("a\n", "a\u0085");
       ColumnVector expected = ColumnVector.fromStrings("PRODPROD\nPROD", "PRODPROD\u0085PROD");
       Scalar replace = Scalar.fromString("PROD");
       ColumnVector output = input.replaceRegex(target, replace)) {
    debug.debug("input", input);
    debug.debug("output", output);
    assertColumnsAreEqual(expected, output);
  }
}
Output:
|
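A small side check on the escapes in the repro pattern, offered as a sketch and not a diagnosis. The `describe` helper is hypothetical, and the code assumes Python 3.8+ for `bytes.hex` with a separator. It prints the code point and UTF-8 bytes for each non-ASCII escape in the pattern. The point of interest is that `\uc285` in a string literal is the single code point U+C285, which is a different character from U+0085 even though U+0085 encodes to the bytes [C2 85]. Whether that escape is intended in the transpiled pattern is not stated in this thread.

```python
def describe(ch: str) -> str:
    """Format a single character as its code point and UTF-8 bytes."""
    code = f"U+{ord(ch):04X}"
    encoded = ch.encode("utf-8").hex(" ").upper()
    return f"{code} -> UTF-8 [{encoded}]"


# The non-ASCII escapes that appear in the repro pattern.
for ch in ["\u0085", "\uc285", "\u2028", "\u2029"]:
    print(describe(ch))
```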
@NVnavkumar based on the test I posted here, I am not sure if this is really a bug in cuDF or not. What do you think? |
Re-evaluate after implementation of #11554 |
Due to random seed generation for testing, we uncovered a new corner case where test_re_replace_all fails: