Core: Fix numeric overflow of timestamp nano literal #11775
Conversation
@@ -300,8 +300,7 @@ public <T> Literal<T> to(Type type) {
       case TIMESTAMP:
         return (Literal<T>) new TimestampLiteral(value());
       case TIMESTAMP_NANO:
-        // assume micros and convert to nanos to match the behavior in the timestamp case above
-        return new TimestampLiteral(value()).to(type);
+        return (Literal<T>) new TimestampNanoLiteral(value());
This change seems correct to me. The previous behavior was to assume the value was in microseconds and pass it through to TimestampLiteral, but that can overflow and does not actually represent a nanosecond timestamp!
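For illustration, a minimal sketch of what this change means at the call site, using Iceberg's public Literal API (the 400000L value is borrowed from the tests below; this is a sketch, not the patch itself):

import org.apache.iceberg.expressions.Literal;
import org.apache.iceberg.types.Types;

// Before this PR: the long is assumed to be microseconds and multiplied by 1000,
// so 400000L becomes 400,000,000 ns, and large inputs silently overflow.
// After this PR: the long is taken as nanoseconds directly, so 400000L stays 400,000 ns.
Literal<Long> lit = Literal.of(400000L);
lit.to(Types.TimestampNanoType.withoutZone());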
FWIW I still think this is correct, but it's worth getting others' perspectives, since we are changing one of the assumptions about how the value is interpreted when the type to convert to is nanoseconds.
CC @nastra @epgif @jacobmarble @rdblue thoughts?
I think the versions both before and after this change are correct. In Iceberg we had the assumption that everything is in microseconds, but this no longer holds now that we have nanos. I do think the version after the change is more correct and more closely aligns with my expectations. If we can make sure that folks are not using this yet, I think this change is a good one 👍
I had a chat with @rdblue, who reviewed the PR that introduced this, and it is actually on purpose: Spark always passes in microseconds, and changing this would break that assumption for Spark. So I think we have to revert this line. That said, I do think we need to check for (and raise an error on) overflow. The easiest way of doing this is to convert the value to nanos, convert it back to micros, and check that it is still the same value.
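A minimal sketch of that round-trip check (the helper name and error message are illustrative, not the actual Iceberg code):

// Convert micros to nanos, convert back, and fail if the multiplication wrapped.
static long microsToNanosChecked(long micros) {
  long nanos = micros * 1000L; // may silently wrap past Long.MAX_VALUE
  if (nanos / 1000L != micros) {
    throw new ArithmeticException("timestamp value overflows nanoseconds: " + micros);
  }
  return nanos;
}

Math.multiplyExact(micros, 1000L) would achieve the same in a single call by throwing ArithmeticException on overflow.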
Thank you for sharing the context. What about other query engines? I actually found this issue while trying to support nanosecond precision in the Trino Iceberg connector. As you may know, the maximum timestamp precision in Trino is picoseconds (12 digits).
@Fokko Could you confirm the question above? The current implementation is surprising from other query engines' perspectives.
Sorry, the example I gave was not the best one. Prior to V3, Iceberg did not have nanoseconds, so we have the assumption that every long we see is in microseconds. Otherwise, the following queries would change behavior:

CREATE TABLE tbl (ts timestamp);
-- We want to have all the events from the future
SELECT * FROM tbl WHERE ts > 1739288553127; -- this is interpreted as microseconds

-- We need more precision
ALTER TABLE tbl MODIFY COLUMN ts timestamp_ns;
SELECT * FROM tbl WHERE ts > 1739288553127; -- this is interpreted as nanoseconds, changing the result
The trick here is to let Trino construct a TimestampNanoLiteral and push that into the evaluator, instead of a plain LongLiteral.
Please forgive my delayed participation here. I'm not familiar with either Spark or Trino, but @epgif and I did author the nanoseconds PR.
Originally, we did not allow conversion to long because the unit of the value was not known. When I implemented Spark filter pushdown, I added the conversion to timestamp because Spark internally uses the same microsecond representation. That set a precedent that longs are converted to timestamp using microseconds.
^^ @rdblue's comment #9008 (comment) is where we switched away from new TimestampNanoLiteral(value()).
"we have the assumption that every long we see, is in microseconds"

This is the core of the problem, right?
Here is an earlier comment from that same PR: #9008 (comment).
Here is the problematic place in Trino:
https://github.com/trinodb/trino/blob/be9ae2f2d61aeee03352c34326416ea7e7fe1354/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/ExpressionConverter.java#L112-L121
for (Range range : orderedRanges) {
    if (range.isSingleValue()) {
        icebergValues.add(convertTrinoValueToIceberg(type, range.getLowBoundedValue()));
    }
    else {
        rangeExpressions.add(toIcebergExpression(columnName, range));
    }
}
Expression ranges = or(rangeExpressions);
Expression values = icebergValues.isEmpty() ? alwaysFalse() : in(columnName, icebergValues);
The convertTrinoValueToIceberg method returns a Long (not a Literal) for the timestamp(9) type. The values expression at the bottom is created with Iceberg's Expressions.in method.
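To make the mismatch concrete, a hypothetical sketch (the value is made up; any epoch-nanosecond value behaves the same):

import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Expressions;

// Trino's timestamp(9) domain yields a raw long of epoch nanoseconds, but
// Expressions.in(...) wraps it in a LongLiteral, and the TIMESTAMP_NANO
// conversion then re-interprets that long as microseconds.
long trinoEpochNanos = 1_739_288_553_127_000_000L;
Expression values = Expressions.in("ts", trinoEpochNanos);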
"let Trino construct TimestampNanoLiteral and push that into the evaluator, instead of a plain LongLiteral"

TimestampNanoLiteral and the other literal classes are package-private. Let me know if there is a method we can use from other repositories.
assertThat(Literal.of(400000L).to(TimestampNanoType.withoutZone()).toByteBuffer().array())
-    .isEqualTo(new byte[] {0, -124, -41, 23, 0, 0, 0, 0});
+    .isEqualTo(new byte[] {-128, 26, 6, 0, 0, 0, 0, 0});
assertThat(Literal.of(400000L).to(TimestampNanoType.withZone()).toByteBuffer().array())
I'm a bit confused about how the original assertion was passing. Shouldn't this always have been equal to {-128, 26, 6, 0, 0, 0, 0, 0}?
I think the cause is that the original logic called the DateTimeUtil.microsToNanos method, which multiplies the value by 1000:

iceberg/api/src/main/java/org/apache/iceberg/expressions/Literals.java, lines 444 to 445 in b9b61b1:

case TIMESTAMP_NANO:
  return (Literal<T>) new TimestampNanoLiteral(DateTimeUtil.microsToNanos(value()));
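As a sanity check on the two byte arrays above, Iceberg serializes long-backed literals as 8-byte little-endian values, so both expected buffers can be reproduced directly (an illustrative snippet, not from the PR):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// 400000 micros promoted to nanos: 400,000,000 -> {0, -124, -41, 23, 0, 0, 0, 0}
byte[] oldBytes =
    ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putLong(400_000_000L).array();
// 400000 taken as nanos directly: 400,000 -> {-128, 26, 6, 0, 0, 0, 0, 0}
byte[] newBytes =
    ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putLong(400_000L).array();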
Ah, I see the comment on line 107/108. Could we update assertConversion to instead test against 400000L and then remove the comment? At this point we no longer have to pass in different values, since Literal.of(someLong).to(TimestampNanos) will always interpret someLong as nanoseconds.
The previous logic leads to overflow when the long value we pass in is already a nanosecond timestamp.
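A small sketch of that overflow (the timestamp value is made up; any recent epoch-nanosecond value triggers it):

// A current epoch-nanos value is about 1.7e18; multiplying by 1000 exceeds
// Long.MAX_VALUE (about 9.2e18) and wraps to a meaningless value.
long epochNanos = 1_739_288_553_127_000_000L;
long wrapped = epochNanos * 1000L; // silent two's-complement wraparound
// Math.multiplyExact(epochNanos, 1000L) would throw ArithmeticException instead.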