[FEA] Reimplement $ transpilation using cuDF new line terminator support #11554

Closed
NVnavkumar opened this issue Oct 1, 2024 · 1 comment · Fixed by #11663

@NVnavkumar
Collaborator

Is your feature request related to a problem? Please describe.

cuDF added support for multiple newline characters in rapidsai/cudf#15961, which makes it possible to support the different Java Unicode line terminator characters. Using it requires passing a flag to the cuDF regex APIs to enable this mode and updating the transpiler to a simpler implementation of $, which only needs to add support for the \r\n combination in addition to the individual characters already supported by cuDF:

  • \n line-feed (already supported)
  • \r carriage-return
  • \u0085 next line (NEL)
  • \u2028 line separator
  • \u2029 paragraph separator
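
For reference, the CPU-side behavior that the transpiled pattern has to reproduce can be checked directly against java.util.regex. The following is a minimal sketch, not plugin code: the class name `DollarAnchorDemo` and its helpers are made up for illustration. It shows that in Java's default mode (no MULTILINE, no UNIX_LINES) `$` matches at the end of input or just before a single trailing line terminator, including the two-character \r\n sequence.

```java
import java.util.regex.Pattern;

// Illustrative only: probes how java.util.regex treats `$` in its default mode.
class DollarAnchorDemo {
    // Hypothetical helper: true if `regex` is found anywhere in `input`.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Render control and non-ASCII characters as \\uXXXX so the output is readable.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 0x20 || c > 0x7e) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "a", "a\n", "a\r", "a\r\n", "a\u0085", "a\u2028", "a\u2029", // expected to match
            "a\n\n", "a\n\r"                                             // expected not to match
        };
        for (String s : inputs) {
            System.out.println(escape(s) + " -> " + found("a$", s));
        }
    }
}
```

The last two inputs are the interesting ones: two trailing terminators that do not form \r\n are not skipped by `$`, which is why \r\n needs to be handled as a unit rather than as two independent characters.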

Additional context

This might fix failing tests here:

Also, can look into a possible solution for:

@NVnavkumar NVnavkumar added ? - Needs Triage Need team to review and classify feature request New feature or request labels Oct 1, 2024
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Oct 8, 2024
@SurajAralihalli
Collaborator

SurajAralihalli commented Oct 21, 2024

We should start by adding EXT_LINE to the RegexFlag.java file so that the Java APIs can enable support for multiple newline characters, as introduced in rapidsai/cudf#15961. See RegexFlag.java, line 27; small adjustments will be needed in the cuDF repo.

Update: PR #17139

After this, we can begin migrating the Spark regex APIs. This involves updating the transpiler function in RegexParser to remove the workaround for cuDF's earlier limitation on multiple line delimiters; see RegexParser.scala#L850. Additionally, we must modify all Spark GPU regex expressions, such as GpuRegExpExtract, to incorporate cuDF's fix; see stringFunctions.scala#L1419:GpuRegExpExtract.
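
To give a sense of what such a workaround has to spell out, here is a hedged sketch written in Java's own regex syntax. It is not the code at RegexParser.scala#L850, and the class name `DollarExpansionDemo` and the constant `EXPLICIT_DOLLAR` are invented for this example. It expands default-mode `$` into explicit lookarounds over the five terminators plus \r\n, then checks that the expansion agrees with Java's built-in `$` on a handful of inputs.

```java
import java.util.regex.Pattern;

// Illustrative only: a hand-written expansion of Java's default-mode `$`.
class DollarExpansionDemo {
    // Assumption: this expansion is ours, for illustration; it is not what the plugin emits.
    static final String EXPLICIT_DOLLAR =
        "(?:(?=\\r\\n\\z)"                        // before a final \r\n pair
        + "|(?<!\\r)(?=\\n\\z)"                   // before a final \n, but never between \r and \n
        + "|(?=[\\r\\u0085\\u2028\\u2029]\\z)"    // before any other single final terminator
        + "|\\z)";                                // at the very end of input

    // True if `prefix + "$"` and `prefix + EXPLICIT_DOLLAR` give the same find() result.
    static boolean agrees(String prefix, String input) {
        boolean viaDollar = Pattern.compile(prefix + "$").matcher(input).find();
        boolean viaExpansion = Pattern.compile(prefix + EXPLICIT_DOLLAR).matcher(input).find();
        return viaDollar == viaExpansion;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "a", "ab", "a\n", "a\r", "a\r\n", "a\u0085", "a\u2028", "a\u2029",
            "a\n\n", "a\n\r", "a\r\r"
        };
        for (String s : inputs) {
            System.out.println("agrees on input of length " + s.length() + ": " + agrees("a", s));
        }
        // `$` must not match between \r and \n, so "\r$" does not match "\r\n".
        System.out.println("no match inside CRLF: " + agrees("\\r", "\r\n"));
    }
}
```

Most of the bulk in the expansion comes from enumerating the single-character terminators. If the engine already recognizes those natively, the remaining special case is the \r\n pair, which matches the simplification described in the issue.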

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Oct 23, 2024
This PR introduces the necessary changes to the cuDF JNI to support the issue described in NVIDIA/spark-rapids#11554. For further information, refer to the details in the comment on that issue.

Issue #15961 adds support for handling multiple line delimiters. This PR extends that functionality to JNI, which was previously missing, and also includes a test to validate the changes.

Authors:
  - Suraj Aralihalli (https://github.com/SurajAralihalli)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: #17139
@sameerz sameerz added task Work required that improves the product but is not user facing and removed feature request New feature or request labels Oct 28, 2024
@SurajAralihalli SurajAralihalli self-assigned this Oct 28, 2024