[FEA] Improve GpuJsonToStructs performance #11560
Labels: cudf_dependency, epic, feature request, improve, performance
The performance of our current `GpuJsonToStructs` is not good. When profiling, it looks like this:

[profiling image]

In the particular test case profiled above, the only useful work is what runs up to the end of the `read_json` range (just above 300ms), which is less than 50% of the entire `GpuJsonToStructs` projection (>800ms). The rest is overhead, consisting mostly of hundreds of small kernel calls and stream syncs caused purely by copying data from the intermediate result to the final output.

We can do a lot better by removing the unnecessary overhead, or by reworking those steps so that they run in much less time. If we divide the runtime of `GpuJsonToStructs` into sections:

[image of the runtime divided into sections]

The improvement can be done by the following tasks:
- `cudf::read_json`. This needs help from the cudf team. Beyond that, we can improve the performance of this section with some auxiliary work:
  - `concat_json` to join JSON strings given by a strings column: spark-rapids-jni#2457
  - `JSONUtils.concatenateJsonStrings` for concatenating JSON strings: #11549
- Converting the output of `cudf::read_json` into the output structs column with the desired read schema. Currently, this process may need to copy a lot of columns (hundreds) from the output table of `cudf::read_json`, which is a significant overhead, as the profile of this section shows. We can just move them instead. This can be achieved by:
  - `cudf::read_json`: rapidsai/cudf#17002
  - `read_json` needs to follow depth-first-search order as in the input schema: rapidsai/cudf#17090
  - `read_json` should output all-null columns for the schema columns that do not exist in the input: rapidsai/cudf#17091
- Converting the `cudf::read_json` output to other types: spark-rapids-jni#2510.
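
To illustrate the idea behind the second task, here is a minimal Python sketch of aligning a parsed result to a read schema without copying. It is not cudf or spark-rapids code: `Column`, `Schema`, `all_nulls` and `align_to_schema` are hypothetical names, leaf columns are modelled as plain lists and struct columns as dicts. The sketch walks the schema depth-first, reuses parsed columns by reference (the analogue of moving them), and fills schema columns missing from the input with all-null columns.

```python
from typing import Any, Dict, List, Optional, Union

# Hypothetical stand-ins, not cudf or spark-rapids types:
# a leaf column is a Python list; a struct column is a dict of child columns.
Column = Union[List[Any], Dict[str, "Column"]]
# A schema maps a field name to None (leaf) or to a nested schema (struct).
Schema = Dict[str, Optional["Schema"]]


def all_nulls(schema: Optional[Schema], num_rows: int) -> Column:
    """Build an all-null leaf, or a struct of all-null children, for a schema node."""
    if schema is None:
        return [None] * num_rows
    return {name: all_nulls(child, num_rows) for name, child in schema.items()}


def align_to_schema(parsed: Dict[str, Column], schema: Schema, num_rows: int) -> Dict[str, Column]:
    """Assemble the output in schema order by a depth-first walk of the schema.

    Columns found in `parsed` are reused by reference (no copy); columns that
    are absent, or that are not structs where the schema expects one, are
    replaced by all-null columns.
    """
    out: Dict[str, Column] = {}
    for name, child_schema in schema.items():
        col = parsed.get(name)
        if col is None:
            out[name] = all_nulls(child_schema, num_rows)
        elif child_schema is None:
            out[name] = col  # leaf: hand over the parsed column as-is
        elif isinstance(col, dict):
            out[name] = align_to_schema(col, child_schema, num_rows)
        else:
            out[name] = all_nulls(child_schema, num_rows)
    return out


# The parsed columns arrive in a different order than the schema and lack "b.x" and "c".
parsed = {"b": {"y": [1, 2]}, "a": ["x", "z"]}
schema = {"a": None, "b": {"x": None, "y": None}, "c": None}
aligned = align_to_schema(parsed, schema, num_rows=2)
```

If the reader already returned columns in schema order with all-null columns for missing fields, as the linked cudf issues request, this alignment pass would reduce to handing the columns over unchanged.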