
Spark needs micro precision in the 'in' function #11755

Open
rui-mo opened this issue Dec 5, 2024 · 4 comments · May be fixed by #11812
Labels
bug (Something isn't working), triage (Newly created issue that needs attention)

Comments

@rui-mo
Collaborator

rui-mo commented Dec 5, 2024

Bug description

The InPredicate is registered for Spark SQL and might need to provide configurable precision to adapt to both Presto and Spark.

values.emplace_back(simpleValues->valueAt(i).toMillis());

In Velox, the Spark 'in' implementation compares timestamps at millisecond precision (note the toMillis() call above). In vanilla Spark, the timestamp precision is microseconds.

struct InFunctionOuter {

System information

Velox System Info v0.0.2
Commit: edfb582
CMake Version: 3.28.3
System: Linux-5.4.0-200-generic
Arch: x86_64
C++ Compiler: /usr/bin/c++
C++ Compiler Version: 11.1.0
C Compiler: /usr/bin/cc
C Compiler Version: 11.1.0
CMake Prefix Path: /usr/local;/usr;/;/usr/local/lib/python3.8/dist-packages/cmake/data;/usr/local;/usr/X11R6;/usr/pkg;/opt

Relevant logs

No response

@rui-mo rui-mo added the bug and triage labels Dec 5, 2024
@zuyu
Contributor

zuyu commented Dec 6, 2024

Probably toPrecision from #11719 would help.

@rui-mo
Collaborator Author

rui-mo commented Dec 10, 2024

Is the issue here that the precision of a Timestamp in Spark is in microseconds, and converting it to milliseconds for comparison would result in a loss of precision? For example, Timestamp(1, 999000) and Timestamp(1, 998000) would be considered the same?

@liujiayi771 I took a further look and found that Spark has its own specific implementation. Fixing it in #11812 to adapt to Spark's micro precision.

@rui-mo rui-mo changed the title Spark needs micro precision in the InPredicate Spark needs micro precision in the 'in' function Dec 10, 2024
@liujiayi771
Contributor

@rui-mo Do we still need to adapt to Spark's micro precision in the Iceberg equality deletes reader?

@rui-mo
Collaborator Author

rui-mo commented Dec 10, 2024

@rui-mo Do we still need to adapt to Spark's micro precision in the Iceberg equality deletes reader?

@liujiayi771 I think so, because if milli precision is used, the micros are lost and timestamps with the same millis but different micros will be treated as the same.
