
[VL] Support map_concat spark function #5093

Closed
wants to merge 2 commits

Conversation

Surbhi-Vijay
Contributor

@Surbhi-Vijay Surbhi-Vijay commented Mar 24, 2024

Adds support for map_concat spark function as part of #4039

Added supporting test case and updated documentation


Thanks for opening a pull request!

Could you open an issue for this pull request on GitHub Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename the commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:


Run Gluten Clickhouse CI

rui-mo
rui-mo previously approved these changes Mar 25, 2024
Contributor

@rui-mo rui-mo left a comment


Thanks. Could be merged after CI passes.

@Surbhi-Vijay Surbhi-Vijay force-pushed the MapConcatSparkFunction branch from 4653a1e to 6dd97b8 on March 27, 2024 13:30

Run Gluten Clickhouse CI

@Surbhi-Vijay
Contributor Author

df.collect() works as expected for the map_concat function (the query converts to a Velox plan).

df.show() still executes in Spark due to a fallback. The reason for the fallback is that Spark adds a limit of 20 rows for df.show and tries to cast all the columns to the string data type.
(Code snippet from Dataset.scala)

private[sql] def getRows(
      numRows: Int,
      truncate: Int): Seq[Seq[String]] = {
    val newDf = toDF()
    val castCols = newDf.logicalPlan.output.map { col =>
      // Since binary types in top-level schema fields have a specific format to print,
      // so we do not cast them to strings here.
      if (col.dataType == BinaryType) {
        Column(col)
      } else {
        Column(col).cast(StringType)
      }
    }
    // ... rest of getRows elided
  }

The CAST expression is then transformed by Gluten into CASTTransformer, but it fails validation because casting from the map type is not yet supported in Gluten.
(Code snippet from SubstraitToVeloxPlanValidator.cc)

switch (input->type()->kind()) {
    case TypeKind::ARRAY:
    case TypeKind::MAP:
    case TypeKind::ROW:
    case TypeKind::VARBINARY:
      LOG_VALIDATION_MSG("Invalid input type in casting: ARRAY/MAP/ROW/VARBINARY.");
      return false;
    case TypeKind::TIMESTAMP: {
      LOG_VALIDATION_MSG("Casting from TIMESTAMP is not supported or has incorrect result.");
      return false;
    }
    default: {
    }
  }
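For illustration, the fallback decision above can be reduced to a tiny predicate. This is a hypothetical Python sketch, not Gluten's actual API; the type names mirror the switch in SubstraitToVeloxPlanValidator.cc:

```python
# Hypothetical sketch: mirrors the validator's type switch that rejects
# casts whose input is a complex (or TIMESTAMP) type, forcing a Spark fallback.
UNSUPPORTED_CAST_INPUTS = {"ARRAY", "MAP", "ROW", "VARBINARY", "TIMESTAMP"}

def cast_is_offloadable(input_type_kind: str) -> bool:
    """Return True when a CAST from this input type can stay on Velox."""
    return input_type_kind not in UNSUPPORTED_CAST_INPUTS

# df.collect() adds no casts, so the map_concat plan offloads to Velox;
# df.show() casts every column to string, and the MAP input fails validation.
print(cast_is_offloadable("INTEGER"))  # True
print(cast_is_offloadable("MAP"))      # False
```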

@Surbhi-Vijay
Contributor Author

@rui-mo can you please check this patch again? I have added a few comments and rewritten the Spark test case, since no exception is thrown for duplicate keys. Velox already handles them by keeping the last value.

@rui-mo
Contributor

rui-mo commented Mar 28, 2024

Hi @Surbhi-Vijay, I notice MapKeyDedupPolicy.EXCEPTION seems to be the default behavior of Spark, is that right? In your opinion, do we need to customize the map_concat function in Velox to allow configurable behavior?

@Surbhi-Vijay
Contributor Author

Hi @Surbhi-Vijay, I notice MapKeyDedupPolicy.EXCEPTION seems to be the default behavior of Spark, is that right? In your opinion, do we need to customize the map_concat function in Velox to allow configurable behavior?

Yes, MapKeyDedupPolicy.EXCEPTION is the default behavior in Spark. An exception will have to be thrown from Velox if the behavior should match Spark's. The Velox implementation is based on Presto, where LAST_WIN is the default behavior.

Maybe we can introduce a change in Velox to throw an exception based on a config that will only be used in Gluten.

Please let me know your thoughts.
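To make the two behaviors concrete, here is a minimal self-contained sketch. This is a hypothetical illustration only; the policy names mirror Spark's MapKeyDedupPolicy, and this is not the Spark or Velox implementation:

```python
# Hypothetical sketch of map_concat under the two duplicate-key policies
# discussed above; not Spark/Velox code.
def map_concat(maps, policy="LAST_WIN"):
    result = {}
    for m in maps:
        for key, value in m.items():
            if key in result and policy == "EXCEPTION":
                # Spark's default (spark.sql.mapKeyDedupPolicy=EXCEPTION)
                raise ValueError(f"Duplicate map key {key!r} was found")
            result[key] = value  # LAST_WIN: the later value overwrites
    return result

# Presto/Velox-style LAST_WIN keeps the last value for a duplicate key:
print(map_concat([{"a": 1, "b": 2}, {"a": 3}]))  # {'a': 3, 'b': 2}
```

Under `policy="EXCEPTION"`, the same input would raise, matching Spark's default behavior.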

@Surbhi-Vijay
Contributor Author


@rui-mo @PHILO-HE can you please check this comment so we can finalize the next set of actions for this patch?

@PHILO-HE
Contributor

PHILO-HE commented Apr 7, 2024

Maybe we can introduce a change in Velox to throw an exception based on a config that will only be used in Gluten.
Please let me know your thoughts.

@Surbhi-Vijay, yes, it is better to introduce a config in Velox to support those two behaviors. I guess Spark's default EXCEPTION behavior is commonly used, so please fix it in Velox first. cc @rui-mo


github-actions bot commented Jul 4, 2024

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the stale label Jul 4, 2024
@Surbhi-Vijay
Contributor Author

I will be updating these PRs in a few days.

@github-actions github-actions bot removed the stale label Jul 26, 2024

github-actions bot commented Sep 9, 2024

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the stale label Sep 9, 2024

This PR was auto-closed because it has been stalled for 10 days with no activity. Please feel free to reopen if it is still valid. Thanks.

@github-actions github-actions bot closed this Sep 20, 2024