[BUG] Integration test `test_re_replace_all` fails with a corner case #9731

ttnghia · 2023-11-15T20:42:15Z

Due to random seed generation for testing, we uncovered a new corner case that test_re_replace_all fails:

SPARK_RAPIDS_TEST_DATAGEN_SEED=1700077791 ./integration_tests/run_pyspark_from_build.sh -k test_re_replace_all

...
--- CPU OUTPUT
+++ GPU OUTPUT
@@ -1055,7 +1055,7 @@
 Row(regexp_replace(a, .*$, PROD, 1)='zm\ngy\nPROD\nPROD')
 Row(regexp_replace(a, .*$, PROD, 1)='lv\nvo\nPROD\nPROD')
 Row(regexp_replace(a, .*$, PROD, 1)='tu\n\nPRODPROD')
-Row(regexp_replace(a, .*$, PROD, 1)='PRODPROD\x85PROD')
+Row(regexp_replace(a, .*$, PROD, 1)='PROD\x85PROD')

The text was updated successfully, but these errors were encountered:

andygrove · 2023-11-20T17:26:02Z

I modified the test to show the input that causes this error:

-Row(a='oNÍ[\x87\x01áe>\x85', regexp_replace(a, .*$, PROD, 1)='PRODPROD\x85PROD')
+Row(a='oNÍ[\x87\x01áe>\x85', regexp_replace(a, .*$, PROD, 1)='PROD\x85PROD')

andygrove · 2023-11-20T17:35:28Z

Simpler repro:

    .with_special_case('a\x85')

-Row(a='a\x85', regexp_replace(a, .*$, PROD, 1)='PRODPROD\x85PROD')
+Row(a='a\x85', regexp_replace(a, .*$, PROD, 1)='PROD\x85PROD')

Note that a\x84 works fine, so this appears to be specific to certain characters.

andygrove · 2023-12-19T23:02:45Z

Adding the triage label back now that there is a summary of the issue.

sameerz · 2023-12-26T21:18:42Z

Need to investigate why we got \x85 as an input string, since it is not a valid UTF-8 string.

revans2 · 2023-12-26T22:25:39Z

It looks like this might be related to how Spark/Python interprets the string 'a\x85'.

import pyspark.sql.types

df = spark.createDataFrame(SparkContext.getOrCreate().parallelize([("a\x85")]), pyspark.sql.types.StringType())
spark.conf.set("spark.rapids.sql.enabled", False)
df.selectExpr('CAST(value as BINARY)').show()
+----------+
|     value|
+----------+
|[61 C2 85]|
+----------+

[C2 85] is the UTF-8 encoded character that is the same as "\u0085", which is Next Line - NEL. The result, when it is pulled back into Spark converts the [C2 85] back to latin1 as an "\x85" in python.

So it looks like a kind of odd situation where .*$ is behaving differently for NEL compared to \n.

import pyspark.sql.types
df = spark.createDataFrame(SparkContext.getOrCreate().parallelize([("a\x85"), ("a\n")]), pyspark.sql.types.StringType())
spark.conf.set("spark.rapids.sql.enabled", True)
df.selectExpr("CAST(regexp_replace(value, '.*$', 'P', 1) AS BINARY) as p").show(truncate=False)
+-------------+
|p            |
+-------------+
|[50 C2 85 50]|
|[50 50 0A 50]|
+-------------+

spark.conf.set("spark.rapids.sql.enabled", False)
df.selectExpr("CAST(regexp_replace(value, '.*$', 'P', 1) AS BINARY) as p").show(truncate=False)
+----------------+
|p               |
+----------------+
|[50 50 C2 85 50]|
|[50 50 0A 50]   |
+----------------+

NVnavkumar · 2024-01-09T23:38:02Z

Note that \u0085 is recognized as a valid line terminator: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html

A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:

A newline (line feed) character ('\n'),
A carriage-return character followed immediately by a newline character ("\r\n"),
A standalone carriage-return character ('\r'),
A next-line character ('\u0085'),
A line-separator character ('\u2028'), or
A paragraph-separator character ('\u2029).

So the replace in this case is not actually working properly on the GPU.

Also, Python 3 strings are by default UTF-8, so a\x85 and a\u0085 are equivalent as strings, Python just displays as 'a\x85', so it's not latin1.

andygrove · 2024-01-23T00:10:43Z

Note that we transpile the pattern .*$ to [^\n\r\u0085\u2028\u2029]*(?:\r|\u0085|\u2028|\u2029|\r\n)?$ before passing to cuDF, so maybe we need to update this to handle NEL. I am investigating.

andygrove · 2024-01-23T00:57:57Z

cuDF repro:

@Test
void testStringReplaceEdgeCase() {
    TableDebug debug = TableDebug.builder().build();
    RegexProgram target = new RegexProgram(
            "[^\n\r\u0085\uc285\u2028\u2029]*(?:\r|\u0085|\uc285|\u2028|\u2029|\r\n)?$");
    try (ColumnVector input = ColumnVector.fromStrings("a\n", "a\u0085");
         ColumnVector expected = ColumnVector.fromStrings("PRODPROD\nPROD", "PRODPROD\u0085PROD");
         Scalar replace = Scalar.fromString("PROD");
         ColumnVector output = input.replaceRegex(target, replace)) {
        debug.debug("input", input);
        debug.debug("output", output);
        assertColumnsAreEqual(expected, output);
    }
}

Output:

GPU COLUMN input - NC: 0 DATA: DeviceMemoryBufferView{address=0x7f7e86000000, length=5, id=-1} VAL: null
COLUMN input - STRING
0 "a
" 610a
1 "a<85>" 61c285
GPU COLUMN output - NC: 0 DATA: DeviceMemoryBufferView{address=0x7f7e86000a00, length=21, id=-1} VAL: null
COLUMN output - STRING
0 "PRODPROD
PROD" 50524f4450524f440a50524f44
1 "PRODPROD" 50524f4450524f44
HOST PADDING SIZE: 32

andygrove · 2024-01-23T01:01:55Z

@NVnavkumar based on the test I posted here, I am not sure if this is really a bug in cuDF or not. What do you think?

NVnavkumar · 2024-10-01T21:24:54Z

Re-evaluate after implementation of #11554

ttnghia added bug Something isn't working ? - Needs Triage Need team to review and classify labels Nov 15, 2023

abellina mentioned this issue Nov 15, 2023

Follow up from random datagen seed PR #9703

Open

abellina removed the ? - Needs Triage Need team to review and classify label Nov 15, 2023

ttnghia mentioned this issue Nov 15, 2023

Temporarily force datagen_seed=0 for test_re_replace_all to unblock CI #9732

Merged

andygrove self-assigned this Nov 15, 2023

andygrove added the ? - Needs Triage Need team to review and classify label Dec 19, 2023

andygrove removed their assignment Dec 19, 2023

sameerz removed the ? - Needs Triage Need team to review and classify label Dec 26, 2023

mattahrens assigned andygrove Jan 5, 2024

andygrove removed their assignment Apr 1, 2024

andygrove added the ? - Needs Triage Need team to review and classify label Apr 1, 2024

mattahrens removed the ? - Needs Triage Need team to review and classify label Apr 2, 2024

NVnavkumar mentioned this issue Oct 1, 2024

[FEA] Reimplement $ transpilation using cuDF new line terminator support #11554

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Integration test `test_re_replace_all` fails with a corner case #9731

[BUG] Integration test `test_re_replace_all` fails with a corner case #9731

ttnghia commented Nov 15, 2023

andygrove commented Nov 20, 2023

andygrove commented Nov 20, 2023 •

edited

Loading

andygrove commented Dec 19, 2023

sameerz commented Dec 26, 2023

revans2 commented Dec 26, 2023

NVnavkumar commented Jan 9, 2024

andygrove commented Jan 23, 2024

andygrove commented Jan 23, 2024

andygrove commented Jan 23, 2024

NVnavkumar commented Oct 1, 2024

[BUG] Integration test test_re_replace_all fails with a corner case #9731

[BUG] Integration test test_re_replace_all fails with a corner case #9731

Comments

ttnghia commented Nov 15, 2023

andygrove commented Nov 20, 2023

andygrove commented Nov 20, 2023 • edited Loading

andygrove commented Dec 19, 2023

sameerz commented Dec 26, 2023

revans2 commented Dec 26, 2023

NVnavkumar commented Jan 9, 2024

andygrove commented Jan 23, 2024

andygrove commented Jan 23, 2024

andygrove commented Jan 23, 2024

NVnavkumar commented Oct 1, 2024

[BUG] Integration test `test_re_replace_all` fails with a corner case #9731

[BUG] Integration test `test_re_replace_all` fails with a corner case #9731

andygrove commented Nov 20, 2023 •

edited

Loading