[BUG] test_regexp_choice failed #10641

jlowe · 2024-03-27T21:28:56Z

Nightly integration test run had a failure in test_regexp_choice.

[2024-03-27T19:43:44.838Z] FAILED ../../src/main/python/regexp_test.py::test_regexp_choice[DATAGEN_SEED=1711561563, TZ=UTC, INJECT_OOM] - AssertionError: GPU and CPU string values are different at [609, 'regexp_extract(a, (abc1a$|ab2ab$), 1)']

Details

[2024-03-27T19:43:44.835Z] =================================== FAILURES ===================================

[2024-03-27T19:43:44.835Z] ______________________________ test_regexp_choice ______________________________

[2024-03-27T19:43:44.835Z] [gw0] linux -- Python 3.9.19 /opt/conda/bin/python

[2024-03-27T19:43:44.835Z] 

[2024-03-27T19:43:44.835Z]     def test_regexp_choice():

[2024-03-27T19:43:44.835Z]         gen = mk_str_gen('[abcd]{1,3}[0-9]{1,3}[abcd]{1,3}[ \n\t\r]{0,2}')

[2024-03-27T19:43:44.835Z] >       assert_gpu_and_cpu_are_equal_collect(

[2024-03-27T19:43:44.835Z]                 lambda spark: unary_op_df(spark, gen).selectExpr(

[2024-03-27T19:43:44.835Z]                     'rlike(a, "[abcd]|[123]")',

[2024-03-27T19:43:44.835Z]                     'rlike(a, "[^\n\r]|abcd")',

[2024-03-27T19:43:44.835Z]                     'rlike(a, "abd1a$|^ab2a")',

[2024-03-27T19:43:44.835Z]                     'rlike(a, "[a-c]*|[\n]")',

[2024-03-27T19:43:44.835Z]                     'rlike(a, "[a-c]+|[\n]")',

[2024-03-27T19:43:44.835Z]                     'regexp_extract(a, "(abc1a$|^ab2ab|a3abc)", 1)',

[2024-03-27T19:43:44.835Z]                     'regexp_extract(a, "(abc1a$|ab2ab$)", 1)',

[2024-03-27T19:43:44.835Z]                     'regexp_extract(a, "(ab+|^ab)", 1)',

[2024-03-27T19:43:44.835Z]                     'regexp_extract(a, "(ab*|^ab)", 1)',

[2024-03-27T19:43:44.835Z]                     'regexp_replace(a, "[abcd]$|^abc", "@")',

[2024-03-27T19:43:44.835Z]                     'regexp_replace(a, "[ab]$|[cd]$", "@")',

[2024-03-27T19:43:44.835Z]                     'regexp_replace(a, "[ab]+|^cd1", "@")'

[2024-03-27T19:43:44.835Z]                 ),

[2024-03-27T19:43:44.835Z]             conf=_regexp_conf)

[2024-03-27T19:43:44.835Z] 

[2024-03-27T19:43:44.835Z] ../../src/main/python/regexp_test.py:568: 

[2024-03-27T19:43:44.835Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

[2024-03-27T19:43:44.835Z] ../../src/main/python/asserts.py:595: in assert_gpu_and_cpu_are_equal_collect

[2024-03-27T19:43:44.835Z]     _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first, result_canonicalize_func_before_compare=result_canonicalize_func_before_compare)

[2024-03-27T19:43:44.835Z] ../../src/main/python/asserts.py:517: in _assert_gpu_and_cpu_are_equal

[2024-03-27T19:43:44.835Z]     assert_equal(from_cpu, from_gpu)

[2024-03-27T19:43:44.835Z] ../../src/main/python/asserts.py:107: in assert_equal

[2024-03-27T19:43:44.835Z]     _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])

[2024-03-27T19:43:44.835Z] ../../src/main/python/asserts.py:43: in _assert_equal

[2024-03-27T19:43:44.835Z]     _assert_equal(cpu[index], gpu[index], float_check, path + [index])

[2024-03-27T19:43:44.835Z] ../../src/main/python/asserts.py:36: in _assert_equal

[2024-03-27T19:43:44.835Z]     _assert_equal(cpu[field], gpu[field], float_check, path + [field])

[2024-03-27T19:43:44.835Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

[2024-03-27T19:43:44.835Z] 

[2024-03-27T19:43:44.835Z] cpu = 'ab2ab', gpu = 'ab2ab\r'

[2024-03-27T19:43:44.835Z] float_check = <function get_float_check.<locals>.<lambda> at 0x7f86b47c1040>

[2024-03-27T19:43:44.835Z] path = [609, 'regexp_extract(a, (abc1a$|ab2ab$), 1)']

[2024-03-27T19:43:44.835Z] 

[2024-03-27T19:43:44.835Z]     def _assert_equal(cpu, gpu, float_check, path):

[2024-03-27T19:43:44.835Z]         t = type(cpu)

[2024-03-27T19:43:44.835Z]         if (t is Row):

[2024-03-27T19:43:44.835Z]             assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))

[2024-03-27T19:43:44.835Z]             if hasattr(cpu, "__fields__") and hasattr(gpu, "__fields__"):

[2024-03-27T19:43:44.835Z]                 assert cpu.__fields__ == gpu.__fields__, "CPU and GPU row have different fields at {} CPU: {} GPU: {}".format(path, cpu.__fields__, gpu.__fields__)

[2024-03-27T19:43:44.836Z]                 for field in cpu.__fields__:

[2024-03-27T19:43:44.836Z]                     _assert_equal(cpu[field], gpu[field], float_check, path + [field])

[2024-03-27T19:43:44.836Z]             else:

[2024-03-27T19:43:44.836Z]                 for index in range(len(cpu)):

[2024-03-27T19:43:44.836Z]                     _assert_equal(cpu[index], gpu[index], float_check, path + [index])

[2024-03-27T19:43:44.836Z]         elif (t is list):

[2024-03-27T19:43:44.836Z]             assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))

[2024-03-27T19:43:44.836Z]             for index in range(len(cpu)):

[2024-03-27T19:43:44.836Z]                 _assert_equal(cpu[index], gpu[index], float_check, path + [index])

[2024-03-27T19:43:44.836Z]         elif (t is tuple):

[2024-03-27T19:43:44.836Z]             assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))

[2024-03-27T19:43:44.836Z]             for index in range(len(cpu)):

[2024-03-27T19:43:44.836Z]                 _assert_equal(cpu[index], gpu[index], float_check, path + [index])

[2024-03-27T19:43:44.836Z]         elif (t is pytypes.GeneratorType):

[2024-03-27T19:43:44.836Z]             index = 0

[2024-03-27T19:43:44.836Z]             # generator has no zip :( so we have to do this the hard way

[2024-03-27T19:43:44.836Z]             done = False

[2024-03-27T19:43:44.836Z]             while not done:

[2024-03-27T19:43:44.836Z]                 sub_cpu = None

[2024-03-27T19:43:44.836Z]                 sub_gpu = None

[2024-03-27T19:43:44.836Z]                 try:

[2024-03-27T19:43:44.836Z]                     sub_cpu = next(cpu)

[2024-03-27T19:43:44.836Z]                 except StopIteration:

[2024-03-27T19:43:44.836Z]                     done = True

[2024-03-27T19:43:44.836Z]     

[2024-03-27T19:43:44.836Z]                 try:

[2024-03-27T19:43:44.836Z]                     sub_gpu = next(gpu)

[2024-03-27T19:43:44.836Z]                 except StopIteration:

[2024-03-27T19:43:44.836Z]                     done = True

[2024-03-27T19:43:44.836Z]     

[2024-03-27T19:43:44.836Z]                 if done:

[2024-03-27T19:43:44.836Z]                     assert sub_cpu == sub_gpu and sub_cpu == None, "CPU and GPU generators have different lengths at {}".format(path)

[2024-03-27T19:43:44.836Z]                 else:

[2024-03-27T19:43:44.836Z]                     _assert_equal(sub_cpu, sub_gpu, float_check, path + [index])

[2024-03-27T19:43:44.836Z]     

[2024-03-27T19:43:44.836Z]                 index = index + 1

[2024-03-27T19:43:44.836Z]         elif (t is dict):

[2024-03-27T19:43:44.836Z]             # The order of key/values is not guaranteed in python dicts, nor are they guaranteed by Spark

[2024-03-27T19:43:44.836Z]             # so sort the items to do our best with ignoring the order of dicts

[2024-03-27T19:43:44.836Z]             cpu_items = list(cpu.items()).sort(key=_RowCmp)

[2024-03-27T19:43:44.836Z]             gpu_items = list(gpu.items()).sort(key=_RowCmp)

[2024-03-27T19:43:44.836Z]             _assert_equal(cpu_items, gpu_items, float_check, path + ["map"])

[2024-03-27T19:43:44.836Z]         elif (t is int):

[2024-03-27T19:43:44.836Z]             assert cpu == gpu, "GPU and CPU int values are different at {}".format(path)

[2024-03-27T19:43:44.836Z]         elif (t is float):

[2024-03-27T19:43:44.836Z]             if (math.isnan(cpu)):

[2024-03-27T19:43:44.836Z]                 assert math.isnan(gpu), "GPU and CPU float values are different at {}".format(path)

[2024-03-27T19:43:44.836Z]             else:

[2024-03-27T19:43:44.836Z]                 assert float_check(cpu, gpu), "GPU and CPU float values are different {}".format(path)

[2024-03-27T19:43:44.836Z]         elif isinstance(cpu, str):

[2024-03-27T19:43:44.836Z] >           assert cpu == gpu, "GPU and CPU string values are different at {}".format(path)

[2024-03-27T19:43:44.836Z] E           AssertionError: GPU and CPU string values are different at [609, 'regexp_extract(a, (abc1a$|ab2ab$), 1)']

[2024-03-27T19:43:44.836Z] 

[2024-03-27T19:43:44.836Z] ../../src/main/python/asserts.py:85: AssertionError

The text was updated successfully, but these errors were encountered:

sameerz · 2024-04-24T17:12:13Z

Not sure if we should set a fixed seed for this issue, seems like it needs to be investigated and resolved.

revans2 · 2024-04-25T18:20:14Z

We should resolve all of the failures. Setting a fixed seed is a way for the build to stop failing while we do it.

NVnavkumar · 2024-05-14T22:45:12Z

The particular failure here:

CPU: ... regexp_extract(a, (abc1a$|ab2ab$), 1)='ab2ab'....
GPU: ...regexp_extract(a, (abc1a$|ab2ab$), 1)='ab2ab\r'...

The input string is aab2ab\r\n

NVnavkumar · 2024-06-12T15:04:52Z

A potential fix for this will be impacted by rapidsai/cudf#15961, so this issue will be blocked until that is merged into cuDF

NVnavkumar · 2024-10-01T21:25:32Z

Re-evaluate after implementation of #11554

SurajAralihalli · 2024-10-25T22:06:51Z

To verify results with regex_flags::EXT_NEWLINE (#11554), I applied cuDF's cudf::strings::extract on the input aab2ab\r\n.

Original regex pattern: ab2ab$

Since cuDF doesn’t recognize \r\n as a valid line break, we transpire $ to (?:\r\n)?$ to work around this limitation.

Regex Pattern	Extracted Value	Index
`((ab2ab)(?:\r\n)?$)`	`aab2ab\r\n`	0
`(ab2ab(?:\r\n)?$)`	`aab2ab\r\n`	0

Expected Spark/Java output: aab2ab

It appears cuDF does not handle non-capturing groups (?:\r\n) as expected when they are nested inside another capturing group.

NVnavkumar · 2024-10-28T17:03:11Z

To verify results with regex_flags::EXT_NEWLINE (#11554), I applied cuDF's cudf::strings::extract on the input aab2ab\r\n.

Original regex pattern: ab2ab$

Since cuDF doesn’t recognize \r\n as a valid line break, we transpire $ to (?:\r\n)?$ to work around this limitation.

Regex Pattern Extracted Value Index
((ab2ab)(?:\r\n)?$) aab2ab\r\n 0
(ab2ab(?:\r\n)?$) aab2ab\r\n 0
Expected Spark/Java output: aab2ab

It appears cuDF does not handle non-capturing groups (?:\r\n) as expected when they are nested inside another capturing group.

A couple of things:

cudf extract inherits the API definition from Pandas so it doesn't actually include the whole string as possible extraction. Also Java complicates this by ultimately truncating out the new line characters in the final output for index 0.
Should we even be including the $ in the full capture group that we are adding to make it work with cuDF. I think we should try this with it filtered out like ab2ab$ -> (ab2ab)(?:\r\n)?$ instead.
I think the nested non-capturing group should behave as expected here. Can we confirm what the behavior is in Python/Pandas is? If this behaves as expected for Java, then we should file a bug with cuDF.

SurajAralihalli · 2024-10-28T18:50:35Z

Thanks @NVnavkumar, you're right (ab2ab)(?:\r\n)?$ would be the right regex replacement here since we are looking at index 0. I see cuDf's behavior is as expected.

#11554 doesn't directly help will this issue. To fix this issue we need to transpile differently, one option is ab2ab$ -> ((ab2ab)(?:\r\n)?$) instead of ab2ab$ -> (ab2ab(?:\r\n)?$). This should be addressed in a different PR.

jlowe added bug Something isn't working ? - Needs Triage Need team to review and classify labels Mar 27, 2024

mattahrens assigned NVnavkumar Apr 1, 2024

mattahrens removed the ? - Needs Triage Need team to review and classify label Apr 2, 2024

thirtiseven mentioned this issue Apr 24, 2024

Use fixed seed for some random failed tests #10739

Merged

NVnavkumar added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Jun 12, 2024

NVnavkumar mentioned this issue Oct 1, 2024

[FEA] Reimplement $ transpilation using cuDF new line terminator support #11554

Closed

SurajAralihalli mentioned this issue Oct 28, 2024

Simplify Transpilation of $ with Extended Line Separator Support in cuDF Regex #11663

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] test_regexp_choice failed #10641

[BUG] test_regexp_choice failed #10641

jlowe commented Mar 27, 2024

sameerz commented Apr 24, 2024

revans2 commented Apr 25, 2024

NVnavkumar commented May 14, 2024

NVnavkumar commented Jun 12, 2024

NVnavkumar commented Oct 1, 2024

SurajAralihalli commented Oct 25, 2024 •

edited

Loading

NVnavkumar commented Oct 28, 2024

SurajAralihalli commented Oct 28, 2024 •

edited

Loading

[BUG] test_regexp_choice failed #10641

[BUG] test_regexp_choice failed #10641

Comments

jlowe commented Mar 27, 2024

sameerz commented Apr 24, 2024

revans2 commented Apr 25, 2024

NVnavkumar commented May 14, 2024

NVnavkumar commented Jun 12, 2024

NVnavkumar commented Oct 1, 2024

SurajAralihalli commented Oct 25, 2024 • edited Loading

NVnavkumar commented Oct 28, 2024

SurajAralihalli commented Oct 28, 2024 • edited Loading

SurajAralihalli commented Oct 25, 2024 •

edited

Loading

SurajAralihalli commented Oct 28, 2024 •

edited

Loading