Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-44501: [C#] Use an index lookup for O(1) field index access #44633

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

vthemelis
Copy link

@vthemelis vthemelis commented Nov 4, 2024

Fixes #44501

Rationale for this change

This should speed up the reasonably common operation of looking up a column via name in Schema or RecordBatch.

I would argue that the fact that this operation was previously O(N) (now O(1)) is unintuitive and could easily lead to accidental performance woes.

Note that I would like to also replace the existing Lookups with signature string -> Field but unfortunately those use StringComparer.Default instead of StringComparer.CurrentCulture (which I'm using to make the changed public function backwards compatible). Not sure if this is intentional.

What changes are included in this PR?

This PR simply memoizes the index lookup of the fields. It should not break any existing behaviour.

It also fixes a recent regression about the behaviour of the changed function in the case of using a custom comparator and have no matches in the schema.

Are these changes tested?

All code paths changed are covered by new unit tests.

Are there any user-facing changes?

Yes, the lookup function throws an exception when there are no matches fuels rather than returning -1. This was a regression introduces O(days) ago.

Copy link

github-actions bot commented Nov 4, 2024

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@vthemelis vthemelis changed the title Use an index lookup for O(1) field index access GH-44501: [C#] Use an index lookup for O(1) field index access Nov 4, 2024
Copy link

github-actions bot commented Nov 4, 2024

⚠️ GitHub issue #44501 has been automatically assigned in GitHub to PR creator.

@vthemelis vthemelis force-pushed the constant-col-lookup branch 4 times, most recently from ce4dda4 to c7a0c58 Compare November 4, 2024 16:45
@vthemelis
Copy link
Author

@CurtHagenlocher it looks like my tests are failing in CI but I can't replicate this locally.

@vthemelis vthemelis force-pushed the constant-col-lookup branch from c7a0c58 to 21a86cd Compare November 5, 2024 10:18
@vthemelis
Copy link
Author

Oh, I see the issue. The CI does a merge into main first and this has changed since my PR: #44576

The PR above has changed the lookup behaviour when the field is not matched.

CC @georgevanburgh

@vthemelis vthemelis force-pushed the constant-col-lookup branch 2 times, most recently from e6d024d to 52b374a Compare November 5, 2024 10:26
@vthemelis
Copy link
Author

@CurtHagenlocher and @georgevanburgh I've reverted to the original behaviour (throwing an InvalidOperationException) for now. Let me know if you prefer keeping the return -1 regression.

@vthemelis vthemelis force-pushed the constant-col-lookup branch from 52b374a to d099cdf Compare November 5, 2024 10:41
@CurtHagenlocher
Copy link
Contributor

I'm a bit tied up with work but will look at this by Friday.

@vthemelis
Copy link
Author

I'm a bit tied up with work but will look at this by Friday.

Thank you very much!

Copy link
Contributor

@CurtHagenlocher CurtHagenlocher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the change! (and sorry for the delay) I left a few suggestions.

}

return -1;
throw new InvalidOperationException();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change. Is there a specific reason for it?

Copy link
Author

@vthemelis vthemelis Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually fixing a regression that was added in #44576

The function would previously throw InvalidOperationException due to .First(...). The description of the PR is erroneous in asserting that -1 is returned in the case of no match as the exception would first be thrown by the argument of the .IndexOf(...) function.

I noticed this because I wrote my unit tests before that PR was merged. Are you happy with keeping the return -1 regression?


_fieldsIndexLookup = _fieldsList
.Select((x, idx) => (Name: x.Name, Index: idx))
.ToLookup(x => x.Name, x => x.Index, StringComparer.CurrentCulture);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for new code, we should use Ordinal. I don't think CurrentCulture really makes sense here.

Copy link
Author

@vthemelis vthemelis Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I change to Ordinal then this would be a breaking change as this is used in GetFieldIndex(string). Are you okay with changing the behaviour?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, I can then only use the lookup if the user has provided Ordinal as the comparator but this would defeat the purpose of this PR which is to make lookup of Columns in RecordBatch O(1).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or I could change this in a separate PR that implements your suggestions here #44650 (comment) ?

@@ -31,6 +31,7 @@ public partial class Schema : IRecordType
private readonly List<Field> _fieldsList;

public ILookup<string, Field> FieldsLookup { get; }
private readonly ILookup<string, int> _fieldsIndexLookup;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also expose this in a way that lets users get all the indexes for a particular field name?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'm happy to add this, though it feels like it's adding further to the scope of the PR. I'm happy to implement any API suggestions that you may have; what would be your preferred way?

@vthemelis
Copy link
Author

Thanks for the change! (and sorry for the delay) I left a few suggestions.

Thank you very much for your review @CurtHagenlocher ! I added some replies to your comments. Let me know your thoughts and will proceed with changes.

@vthemelis
Copy link
Author

Hi @CurtHagenlocher , do you have some time to look into this?

@vthemelis
Copy link
Author

Hi @CurtHagenlocher, I've undone the style change as you suggested. Let me know your views on the other questions and I'm happy to change this PR accordingly.

@vthemelis
Copy link
Author

If you'd prefer, I can arrange separate PRs for aspects suggested?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C#] Column(string) method in RecordBatch is linear to the number of columns
2 participants