[FEA] Improve `GpuJsonToStructs` performance #11560

ttnghia · 2024-10-04T18:29:58Z

The performance of our current `GpuJsonToStructs` is poor. A profile of it looks like this:

(profiler screenshot not included)

In the test case profiled above, the only useful work is what runs up to the end of the `read_json` range (just over 300ms), which is less than 50% of the entire `GpuJsonToStructs` projection (>800ms). The rest is overhead, consisting mostly of hundreds of small kernel calls and stream syncs caused purely by copying data from the intermediate result to the final output.

We can do much better by removing the unnecessary overhead, or by reworking those steps so that they run in much less time. Suppose we divide the runtime of `GpuJsonToStructs` into sections:

(screenshot of the profile divided into sections not included)

The improvements can be made through the following tasks:

Section 1: Improve libcudf's `cudf::read_json`. This needs help from the cudf team. Beyond that, we can improve the performance of this section with some auxiliary work:
- Implement concat_json to join JSON strings given by strings column spark-rapids-jni#2457
- Adopt JSONUtils.concatenateJsonStrings for concatenating JSON strings #11549
Section 2: Improve the process that assembles the output table from `cudf::read_json` into the output structs column with the desired read schema. Currently, this process may need to copy many columns (hundreds of them) from the output table of `cudf::read_json`, which is a significant overhead, as the profile of this section shows. We can simply move the columns instead. This can be achieved by:
- [FEA] Implement a better JNI function to assemble the output columns from cudf::read_json rapidsai/cudf#17002
- Dependencies:
  - [FEA] The output columns of read_json need to follow depth-first-search order as in the input schema rapidsai/cudf#17090
  - [FEA] read_json should output all-nulls columns for the schema columns that do not exist in the input rapidsai/cudf#17091
Section 3: Improve the conversion step from strings columns into the desired types. Some columns need to be converted, while others are output directly without any conversion. However, instead of being moved into the output, the unconverted columns are copied again, which causes significant overhead when there are many strings columns.
- spark-rapids-jni issue: [FEA] Implement a function to convert a list of strings columns into a structs column with desired schema spark-rapids-jni#2468, PR Convert strings columns output from cudf::read_json to other types spark-rapids-jni#2510.
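
The copy-versus-move difference behind Sections 2 and 3 can be sketched in plain C++. This is only an illustration under stated assumptions: `Column`, `Table`, `assemble_by_copy`, and `assemble_by_move` are hypothetical stand-ins written for this example, not libcudf or spark-rapids-jni types, and host vectors stand in for device buffers.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a column; NOT libcudf's cudf::column.
struct Column {
  std::string name;
  std::vector<int> data;  // stands in for a device buffer
};

// Hypothetical stand-in for the table returned by the reader; it owns its columns.
struct Table {
  std::vector<std::unique_ptr<Column>> columns;
};

// Copying assembly: every column gets a fresh allocation plus a data copy.
// With hundreds of columns this means hundreds of small copies.
std::vector<std::unique_ptr<Column>> assemble_by_copy(Table const& input) {
  std::vector<std::unique_ptr<Column>> out;
  out.reserve(input.columns.size());
  for (auto const& col : input.columns) {
    out.push_back(std::make_unique<Column>(*col));
  }
  return out;
}

// Moving assembly: ownership of the existing buffers is transferred,
// so no per-column allocation or data copy takes place.
std::vector<std::unique_ptr<Column>> assemble_by_move(Table&& input) {
  return std::move(input.columns);
}
```

After `assemble_by_move`, the output holds the very same buffers the reader produced, and the input table is left empty. That ownership transfer is what the tasks above aim for in place of per-column copies.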


karthikeyann · 2024-10-07T23:28:16Z

After discussion with @ttnghia, here are the improvements planned for the different sections:

Section 1: @karthikeyann and @shrshi are working on validation and memory usage reduction here.
Performance optimization of JSON validation rapidsai/cudf#16996
JSON tokenizer memory optimizations rapidsai/cudf#16978
TBD: To eliminate or minimize concat_json, we are considering a new strings_column input as the data source or a new JSON reader option. This needs more planning (@shrshi).
Section 2: To eliminate Section 2 completely, @karthikeyann will work on adding a new schema interface that supports column ordering and all-null columns for non-existent columns.
The only requirement on the input schema is that it should not need sanitization inside the libcudf reader (this covers UTF-8 matching of column names, duplicate paths, invalid schemas, etc.).
[FEA] The output columns of read_json need to follow depth-first-search order as in the input schema rapidsai/cudf#17090
[FEA] read_json should output all-nulls columns for the schema columns that do not exist in the input rapidsai/cudf#17091
Section 3: @ttnghia will work on avoiding column copies after parsing. This covers invoking the libcudf reader from C++, parsing the string columns into the target data types, and moving/replacing columns.
TBD: INT, FLOAT, STRING parsing rules: check whether libcudf is compliant with Spark requirements. Some types (DECIMAL) may have special cases that can only be handled in Spark.
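
The two schema requirements listed under Section 2 (output columns in depth-first order of the input schema, and all-null columns for schema fields missing from the input) can be sketched as follows. This is a minimal illustration, not the libcudf interface: `SchemaNode`, `PlannedColumn`, `plan_dfs`, and `plan_output` are hypothetical names, and the set of dotted paths stands in for whatever the reader actually found in the JSON input.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical schema tree node; NOT libcudf's schema type.
struct SchemaNode {
  std::string name;
  std::vector<SchemaNode> children;
};

// One planned output column: its dotted path, and whether it has to be
// materialized as an all-null column because the input did not contain it.
struct PlannedColumn {
  std::string path;
  bool all_null;
};

// Visit a node before its children (pre-order depth-first search), so the
// output order mirrors the order in which the schema was declared.
void plan_dfs(SchemaNode const& node, std::string const& prefix,
              std::set<std::string> const& present,
              std::vector<PlannedColumn>& out) {
  std::string const path = prefix.empty() ? node.name : prefix + "." + node.name;
  out.push_back({path, present.count(path) == 0});
  for (auto const& child : node.children) {
    plan_dfs(child, path, present, out);
  }
}

// `present` holds the dotted paths that were actually found in the input.
std::vector<PlannedColumn> plan_output(std::vector<SchemaNode> const& schema,
                                       std::set<std::string> const& present) {
  std::vector<PlannedColumn> out;
  for (auto const& root : schema) {
    plan_dfs(root, "", present, out);
  }
  return out;
}
```

For a schema `a{b, c}, d` and an input that only contains `a`, `a.b`, and `d`, the plan lists `a`, `a.b`, `a.c`, `d` in that order, with only `a.c` marked as all-null. When the reader already returns columns in this shape, the caller no longer needs a separate reordering and gap-filling pass.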

In Spark, the `DecimalType` has a specific number of digits to represent the numbers. However, when creating a data Schema, only type and name of the column are stored, thus we lose that precision information. As such, it would be difficult to reconstruct the original decimal types from cudf's `Schema` instance. This PR adds a `precision` member variable to the `Schema` class in cudf Java, allowing it to store the precision number of the original decimal column.

Partially contributes to NVIDIA/spark-rapids#11560.

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Robert (Bobby) Evans (https://github.com/revans2)

URL: #17176
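
To see why the precision cannot be recovered from the column type alone, consider the sketch below. It assumes the usual mapping from a decimal precision to the narrowest fixed-width integer storage (up to 9 digits in 32 bits, up to 18 in 64 bits, up to 38 in 128 bits). `DecimalStorage`, `storage_for_precision`, `SchemaEntry`, and `make_decimal_entry` are hypothetical names for illustration and do not reproduce the cudf Java `Schema` class.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical enum naming three fixed-width decimal storage sizes.
enum class DecimalStorage { DECIMAL32, DECIMAL64, DECIMAL128 };

// Narrowest storage able to hold any value with `precision` decimal digits:
// 32-bit integers cover 9 digits, 64-bit cover 18, 128-bit cover 38.
DecimalStorage storage_for_precision(int precision) {
  assert(precision >= 1 && precision <= 38);
  if (precision <= 9) return DecimalStorage::DECIMAL32;
  if (precision <= 18) return DecimalStorage::DECIMAL64;
  return DecimalStorage::DECIMAL128;
}

// Hypothetical schema entry that keeps the precision next to the name and
// type, so the original decimal type can be rebuilt later.
struct SchemaEntry {
  std::string name;
  DecimalStorage type;
  int precision;
};

SchemaEntry make_decimal_entry(std::string name, int precision) {
  return SchemaEntry{std::move(name), storage_for_precision(precision), precision};
}
```

Under this mapping, a precision of 10 and a precision of 18 both land in 64-bit storage, so an entry holding only the name and storage type cannot tell them apart. Keeping the precision as an extra field removes that ambiguity.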

ttnghia added labels on Oct 4, 2024:
- cudf_dependency (An issue or PR with this label depends on a new feature in cudf)
- epic (Issue that encompasses a significant feature or body of work)
- feature request (New feature or request)
- improve performance (A performance related task/issue)

ttnghia self-assigned this Oct 4, 2024

This was referenced Oct 17, 2024:
- Convert strings columns output from cudf::read_json to other types NVIDIA/spark-rapids-jni#2510 (Draft)
- Perform conversion for the columns output from Table.readJSON to other data types using JSONUtils.convertDataTypes() #11618 (Draft)

karthikeyann mentioned this issue Oct 21, 2024:
- JSON spark reader plan for 24.12 rapidsai/cudf#17138 (Open)
