Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interval Comparison #5180

Closed
wants to merge 9 commits into from

Conversation

berkaysynnada
Copy link
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

When intervals are compared, IntervalDayTimeArray and IntervalMonthDayNanoArray exhibit unexpected behavior. This occurs because they are treated as signed integers, but they require special handling since they consist of multiple contiguous time unit blocks.

#[test]
fn test_interval_cmp() {
    let a = IntervalDayTimeArray::from(vec![Some(IntervalDayTimeType::make_value(0, -5))]);
    let b = IntervalDayTimeArray::from(vec![Some(IntervalDayTimeType::make_value(0, 5))]);
    let res = gt(&a, &b).unwrap();
    assert_eq!(&res, &BooleanArray::from(vec![ Some(false)]));
}
...
...
...
assertion `left == right` failed
  left: BooleanArray
[
  true,
]
 right: BooleanArray
[
  false,
]

The first unexpected behavior is observed when the LSD parts of these intervals are negative numbers and the MSD's are 0; under these conditions, intervals with negative values are evaluated as if they are greater than those with positive values.

  #[test]
  fn test_interval_cmp() {
      let a = IntervalMonthDayNanoArray::from(
          vec![Some(IntervalMonthDayNanoType::make_value(0, -10, 0))],
      );
      let b = IntervalMonthDayNanoArray::from(
          vec![Some(IntervalMonthDayNanoType::make_value(0, 20,0))],
      );
      let res = gt(&a, &b).unwrap();
      assert_eq!(&res, &BooleanArray::from(vec![ Some(false)]));
  }

This test also fails in a similar manner.

The second problem arises when comparing relative intervals, such as 1 month and 30 days. In such cases, we can not arrive at an absolute result (e.g. days in a month can change). Therefore, to stay on the safe side, we adopt the following semantics: A comparison returns true if it holds certainly. Otherwise, it returns false. Going back to our example, we return false for both 1 month < 30 days and 1 month > 30 days. It is impossible to impose a total order on intervals, but we can impose a consistent partial order with this logic.

What changes are included in this PR?

IntervalDayTimeArray and IntervalMonthDayNanoArray are separately handled before passing to apply() in compare_op(). They will be represented with milliseconds and nanoseconds, respectively, as the maximum or minimum value that the interval may correspond to.

As we need an exact representation of IntervalMonthDayNano in nanoseconds to prevent overflow cases, a variant for a 128-bit signed integer in DataType is added. Necessary extensions are accordingly implemented.

Are there any user-facing changes?

@github-actions github-actions bot added parquet Changes to the parquet crate arrow Changes to the arrow crate labels Dec 7, 2023
@tustvold
Copy link
Contributor

tustvold commented Dec 7, 2023

I'm not sure about this, intervals cannot be accurately converted to a number of milliseconds, etc... Whilst the previous ordering might have been surprising it is a total order that matches the equality relation, this PR is neither.

Perhaps we might take a step back as to what you're hoping to achieve with this?

@@ -50,6 +50,8 @@ pub enum DataType {
Int32,
/// A signed 64-bit integer.
Int64,
/// A signed 128-bit integer.
Int128,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can unilaterally add array types to the arrow specification

@ozankabak
Copy link

ozankabak commented Dec 7, 2023

Whilst the previous ordering might have been surprising it is a total order that matches the equality relation, this PR is neither.

Right, I agree that any order should match equality relation and there is a bug here. We will discuss with @berkaysynnada shortly.

Perhaps we might take a step back as to what you're hoping to achieve with this?

Sure, thanks for offering to help. Basically the issue we are facing is that the current total order results in wrong branches being taken during pruning operations in downstream use cases (like Datafusion). We are trying to arrive at a partial order that is compatible with equality, and only returns true when the comparison is definite. Such a consistent partial order can be safely used in branching logic. Do you have any ideas how we can achieve that?

@tustvold
Copy link
Contributor

tustvold commented Dec 7, 2023

We are trying to arrive at a partial order that is compatible with equality, and only returns true when the comparison is definite
Do you have any ideas how we can achieve that?

I'm not sure such a relation is possible or at least it is not trivially obvious to me. Parquet for example does not produce interval statistics at all (although arrow-rs has historically been writing them, fixed by #5147) - https://github.com/apache/parquet-format/blob/066f9817332da32bdc6dc6dea833b6ee9c269934/LogicalTypes.md?plain=1#L537-L539

A couple of thoughts:

  • The arrow-ord kernels must be kept consistent with the order used for sorting, which in turn must be a total order, I think this means any partial order would need to be a separate notion
  • Intervals can only be meaningfully evaluated with respect to a temporal value, it therefore seems surprising to me that we would be pruning on intervals at all, as opposed to using them in the context of range analysis for an arithmetic operation involving them

@ozankabak
Copy link

Intervals can only be meaningfully evaluated with respect to a temporal type, it therefore seems surprising to me that we would be pruning on intervals at all, as opposed to using them in the context of range analysis for an arithmetic operation involving them

Yes, this was surprising to us too. The issue is there is nothing that bars a user to have a column whose type is an interval and write an expression involving that. In such cases range analysis gets things wrong because of the order issue.

I'm not sure such a relation is possible or at least it is not trivially obvious to me.

Agreed, definitely not trivial. We will try to iterate on this to see if we can arrive at a relation like that -- and if we fail we will learn something and we can come up with a plan B. @berkaysynnada will post an update here today after we think about this a little more.

@tustvold
Copy link
Contributor

tustvold commented Dec 7, 2023

In such cases range analysis gets things wrong because of the order issue.

Perhaps it could just bail out or something... If it isn't an expected use-case, perhaps we can just fallback to not doing range analysis or something - I have to confess to not have a good grasp of when or how frequently such a situation would arise

@berkaysynnada
Copy link
Contributor Author

berkaysynnada commented Dec 7, 2023

I've revised the implementation and now the comparison semantic is as follows: The interval values on both sides of the comparison work as if they reference a common timestamp. If comparing these intervals with respect to this reference gives a definite answer, like 1 month and 1 month + 1 day, that answer is given as the result. However, if there is an indefinite case, like 1 month and 30 days, the result will be false for both greater than and less than comparisons.

This means that comparisons such as "1 month < 1 month + 1 day" are now returning true. The current status of this PR includes:

  1. Proper handling of negative cases.
  2. Correct evaluations of results such as "1 month < 1 month + 1 day".
  3. Inexact results like "1 month + 30 days < 2 months" or "1 month + 30 days > 2 months" consistently return false for all cases, including greater than and less than equal comparisons.
  4. Equality cases remain unaffected.

@ozankabak
Copy link

Thanks @berkaysynnada. @tustvold, what do you think about this semantics? Seems like it improves the behavior without breaking any use case, but let's try to poke holes and make sure

@tustvold
Copy link
Contributor

tustvold commented Dec 7, 2023

Whilst I appreciate you working on this, I feel I wasn't clear enough before:

  • We cannot change the ordering relation of these kernels to a partial order, as it must match that of sorts, which in turn must be a total order
  • We cannot add new array types that aren't part of the arrow standard

From my perspective this seems like it is trying to frame range analysis as an ordering relation, which seems somewhat contrived? Perhaps this logic could be implemented within the range analysis framework of DataFusion, as opposed to here?

@ozankabak
Copy link

We cannot change the ordering relation of these kernels to a partial order, as it must match that of sorts, which in turn must be a total order

AFAICT the way the proposed ordering relation works will not break any sorts. Since indeterminate cases return false all ways, it will act like a pseudo total order. This is what I'm trying to nail down by discussing with you -- if it breaks sorts, I agree it is a no-go.

We cannot add new array types that aren't part of the arrow standard
Perhaps this logic could be implemented within the range analysis framework of DataFusion, as opposed to here?

At this point, we are trying to get the code to compile so we can iterate to find the right semantics. The purpose here is not rush a merge or make a de-facto change to arrow spec. The code may very well move to somewhere else if this is not the right place at this time.

My personal take is that returning clearly wrong comparison results for interval datatypes should be fixed at this level. The library claims to support these datatypes after all, performs such comparisons without any error or warning, so returning arbitrary (relative to real-world interval semantics) comparison results silently is simply wrong IMHO.

However, it doesn't really matter at this point -- we can always revisit such decisions over time. Let's try to find a good solution together first, and we can place the code somewhere else to get things moving.

@tustvold
Copy link
Contributor

tustvold commented Dec 7, 2023

AFAICT the way the proposed ordering relation works will not break any sorts. Since indeterminate cases return false all ways, it will act like a pseudo total order. This is what I'm trying to nail down by discussing with you -- if it breaks sorts, I agree it is a no-go.

Currently the following is always true regardless of input

let sorted = sort(input, None).unwrap();
let len_one = sorted.len() - 1;
let comparison = lt_eq(&sorted.slice(0, len_one), &sorted.slice(1, len_one)).unwrap();
assert_eq!(comparison.true_count(), len_one)

Or to put it differently the arrow library uses a consistent total ordering relation across its kernels, including things like the row format. Whilst I agree the ordering is surprising, and may violate people's expectations, but this is general problem for intervals where even basic arithmetic axioms don't hold, e.g. a + b - a == b.

@ozankabak
Copy link

Currently the following is always true regardless of input

This will indeed be false with the proposed semantics. I understand that you want to preserve such invariants, even though I'd say it's arguable users would want such an invariant if the underlying datatype doesn't really admit it with its natural ordering semantics. The price to pay (unnatural comparisons) is quite heavy.

There is unfortunately no perfect solution here, the interval data type is too weird 🙁

Whilst I agree the ordering is surprising, and may violate people's expectations, but this is general problem for intervals where even basic arithmetic axioms don't hold, e.g. a + b - a == b.

Such basic arithmetic axioms do not hold for floating point values either but they still have sensible ordering semantics (comparisons coincide with their natural meaning for non-NaN cases, indeterminate comparisons involving NaNs return false all ways). So if you are pointing out yet another weirdness with the interval type, I agree with you -- but these points are not really logically related or support/normalize each other.

Let's fix this in Datafusion in the short term and we can revisit an upstream solution later on when we have more clarity. Maybe you or us will have even better ideas/semantics over time.

@ozankabak ozankabak closed this Dec 7, 2023
@tustvold
Copy link
Contributor

tustvold commented Dec 7, 2023

comparisons coincide with their natural meaning for non-NaN cases, indeterminate comparisons involving NaNs return false all ways

FWIW we use IEEE total ordering for floats, which don't behave as you describe - https://docs.rs/arrow-ord/latest/arrow_ord/cmp/fn.gt.html

@ozankabak
Copy link

ozankabak commented Dec 7, 2023

FWIW we use IEEE total ordering for floats, which don't behave as you describe - https://docs.rs/arrow-ord/latest/arrow_ord/cmp/fn.gt.html

It actually does behave the way I describe. The totalOrder predicate was added to the standard in 2008 and my description is according to pre-2008 behavior. Actually, unless you explicitly use the totalOrder predicate (e.g. simply use something like less-than), it still behaves the way I described (you can check this on your machine too).

Anyway it doesn't matter, we will do the fix downstream. Thanks for the discussions which facilitated an improved semantics for this

@tustvold
Copy link
Contributor

tustvold commented Dec 7, 2023

Actually, unless you explicitly use the totalOrder predicate (e.g. simply use something like less-than), it still behaves the way I described (you can check this on your machine too).

We do indeed, as documented in the link

@alamb
Copy link
Contributor

alamb commented Dec 7, 2023

I know this PR is closed, but I would like to offer two comments:

  1. Since this is the arrow-rs crate, I don't think adding new datatypes like DataType::Int128 is appropriate as they do not appear in the Arrow standard. However, I think the semantics of interval comparison could be implemented without that particular change.
  2. The question of total order / partial order and what invariants are required is still somewhat confusing to me. Can someone give me an example of intervals where this PR's behavior would invalidate the invariant stated by @tustvold ?
let sorted = sort(input, None).unwrap();
let len_one = sorted.len() - 1;
let comparison = lt_eq(&sorted.slice(0, len_one), &sorted.slice(1, len_one)).unwrap();
assert_eq!(comparison.true_count(), len_one)

I would like to (and I will) add a test that documents this assumption / feature in code so that we don't end up in a situation like this again where a proposed changes breaks a non obvious invariant.

@tustvold
Copy link
Contributor

tustvold commented Dec 7, 2023

I would like to (and I will) add a test that documents this assumption / feature in code so that we don't end up in a situation like this again where a proposed changes breaks a non obvious invariant.

Thank you Andrew. I think the broader invariant is that we consistently order values across kernels, so things like row1 < row2 <=> value1 < value2, or if I sort an array in ascending order i1 < i2 <=> array[i1] <= array[i2]. I agree this isn't documented very clearly beyond a brief mention on https://docs.rs/arrow-array/latest/arrow_array/trait.ArrowNativeTypeOp.html, and should be highlighted more obviously, but I also think this is broadly speaking inline with what people would reasonably expect.

@alamb
Copy link
Contributor

alamb commented Dec 7, 2023

Right -- I was looking for a counter example that that would be different with the two different semantic

Maybe something like the following

Sort order on master (even though "logically" one might expect one month - 1 day to be 27, 29, or 31 days, which are all larger than 15 days

IntervalMonthDayNano(1, -1, 0)    # one month - 1 day, 0 ns)
IntervalMonthDayNano(0, 15, 0)    # 0 months, 15 days, 0ns)

But if you changed comparison to the possibly more intuitive ordering:

IntervalMonthDayNano(1, -1, 0)    # one month - 1 day, 0 ns)
IntervalMonthDayNano(0, 15, 0)    # 0 months, 15 days, 0ns)

then the order doesn't match the order of the underlying i128 🤔

@alamb
Copy link
Contributor

alamb commented Dec 8, 2023

I filed #5192 with tests and documentation explaining the current behavior and my understanding of the rationale.

@alamb
Copy link
Contributor

alamb commented Dec 8, 2023

After further study of this matter, I think it could make sense to add new compairson kernels that were specific to intervals and implemented the partial order interval comparisons as proposed in this PR,

Something like

let arr1 = IntervalMonthDayNanoArray::from(...);
let arr2 = IntervalMonthDayNanoArray::from(...);

// compare arr1 and arr2 using interval specific logic / partial order
let res = interval_partial_lt()

@tustvold
Copy link
Contributor

tustvold commented Dec 8, 2023

Perhaps I might suggest filing a ticket to discuss what the desired semantics are, linking back to the one in DF, as the first step before jumping into code. In particular I am curious what systems like postgres do.

In general I would like to encourage this pattern of working, as it can help avoid frustration, and ensure all the necessary context is understood before work gets underway. Filing subtle and/or large PRs with no prior discussion is not setting things up for sucess.

@alamb
Copy link
Contributor

alamb commented Dec 8, 2023

Perhaps I might suggest filing a ticket to discuss what the desired semantics are

Yes, this is an excellent idea -- I will do so. I agree that for items where the semantics may not be clear, it often helps to start with a ticket rather than a PR

@alamb
Copy link
Contributor

alamb commented Dec 11, 2023

I have filed #5200 to discuss how to handle interval comparisons (if at all)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants