Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gh 5121 fedx bind left join #5122

Merged
merged 5 commits into from
Oct 23, 2024
Merged

Conversation

aschwarte10
Copy link
Contributor

@aschwarte10 aschwarte10 commented Sep 6, 2024

Note: The PR is still WIP (some further changes, and especially unit test coverage will follow) however it is ready for early review. The goal is to have this available in RDF4J 5.1

GitHub issue resolved: #5121

Briefly describe the changes proposed in this PR:

The PR consists of multiple commits which add an implementation for bind left joins. It is recommended to follow the change commit by commit (as the first ones are preparational refactoring)

Commit 1: GH-5121: refactor the bind join logic into a reusable base class

Refactor the existing logic for executing bind joins into a reusable
base class.

This change mostly moves the implementation logic from the existing
ControlledWorkerBindJoin class to a new intermediate implementation
(with the goal to make it reusable in a second step for left joins).

Note that the bind join implementation no longer uses the
ControlledWorkerJoin as base class, i.e. the decision of which join
implementation to use is moved to the strategy.

For backwards code compatibility the "ControlledWorkerBoundJoin" is
kept, but no longer used. Instead the new code is in
ControlledWorkerBindJoin.

Commit 2: GH-5121: prepare execution of left joins in the federation strategy

Prepare to execute a specific implementation of a left join
implementation through the federation strategy.

Commit 3: GH-5121: implementation of left bind join operator

This change provides the implementation and activation for the left bind
join operator.

The algorithm is as follows:

  • execute left bind join using regular bound join query
  • process result iteration similar to BoundJoinVALUESConversionIteration
  • remember seen set of bindings (using index) and add original bindings
    to those, i.e. put to result return all non-seen bindings directly from
    the input

Note that the terminology in literature has changed to "bind joins".
Hence, for new classes and methods I try to follow that.

Change is covered with a unit tests

Commit 4: GH-5121: configurability of bind left joins

Bind left joins for OPTIONAL can be disabled using the
"enableOptionalAsBindJoin" flag in the federation config

Integrate the switch between implementations in the unit test as
parameterized test

Commit 5: GH-5121: code simplifications in bind join implementation

  • use for-each loop for iterating bindingset
  • use IntHashSet
  • use Literal#intValue instead of Integer#parseInt

TODOs

  • further unit test coverage
  • validate check left bind join pattern
  • configurability to turn bind left joins off
  • ...

PR Author Checklist (see the contributor guidelines for more details):

  • my pull request is self-contained
  • I've added tests for the changes I made
  • I've applied code formatting (you can use mvn process-resources to format from the command line)
  • I've squashed my commits where necessary
  • every commit message starts with the issue number (GH-xxxx) followed by a meaningful description of the change

@aschwarte10 aschwarte10 added the 📦 fedx fedx: optimized federated query support label Sep 6, 2024
Comment on lines 26 to 38
/**
* A {@link LookAheadIteration} for processing bind left join results.
*
* Algorithm:
*
* <ul>
* <li>execute left bind join using regular bound join query</li>
* <li>process result iteration similar to {@link BoundJoinVALUESConversionIteration}</li>
* <li>remember seen set of bindings (using index) and add original bindings to those, i.e. put to result return all
* non-seen bindings directly from the input</li>
*
*
* @author Andreas Schwarte
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add that this is used for OPTIONAL clauses. Is it used for anything else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in the docs. This is only for OPTIONALs

protected BindingSet convert(BindingSet bIn, int bIndex) throws QueryEvaluationException {
QueryBindingSet res = new QueryBindingSet();
Iterator<Binding> bIter = bIn.iterator();
while (bIter.hasNext()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be a for-each look?

for (Binding bs : bIn){
  ...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, can be simplified like that. I think when initially writing the bind join logic it was not possible

Comment on lines 60 to 61
int bIndex = Integer.parseInt(
bIn.getBinding(BoundJoinVALUESConversionIteration.INDEX_BINDING_NAME).getValue().stringValue());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the binding Value object ever a literal? So it would be possible to check if it's instanceof IntegerLiteral or something like that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is always an integer literal. is passed as that in the VALUES clause. Have optimized it

@aschwarte10
Copy link
Contributor Author

@hmottestad thanks for the feedback. I will get back to this PR soon (we are currently in the release finalization phase of our product and this requires most of my attention).

Quick question: is there a timeline for the 5.1 release? I would like to see this change included (if I get it ready) such that we can make use of it in our next release. Ideally the 5.1 release would be available in the second half of november. What are the current project plans?

@hmottestad
Copy link
Contributor

We don't have any plans at the moment.

I'm currently working on the last SHACL paths that aren't supported. I also have a PR for configuring the Apache HttpClient timeouts. And there is a request for adding support for configuring the document loader in the new JSON-LD 1.1 parser.

Would really want the timeouts included, and if I start on the JSON-LD document loader then it shouldn't take more than a few weeks to get it merged.

So we can say mid november for RDF4J 5.1.

@aschwarte10 aschwarte10 force-pushed the GH-5121-fedx-bind-left-join branch 2 times, most recently from c058da3 to 40b03a5 Compare September 29, 2024 09:52
Refactor the existing logic for executing bind joins into a reusable
base class.

This change mostly moves the implementation logic from the existing
ControlledWorkerBindJoin class to a new intermediate implementation
(with the goal to make it reusable in a second step for left joins).

Note that the new bind join implementation no longer uses the
ControlledWorkerJoin as base class, i.e. the decision of which join
implementation to use is moved to the strategy.

For backwards code compatibility the "ControlledWorkerBoundJoin" is
kept, but no longer used. Instead the new code is in
ControlledWorkerBindJoin.
Prepare to execute a specific implementation of a left join
implementation through the federation strategy.
This change provides the implementation and activation for the left bind
join operator.

The algorithm is as follows:

- execute left bind join using regular bound join query
- process result iteration similar to BoundJoinVALUESConversionIteration
- remember seen set of bindings (using index) and add original bindings
to those, i.e. put to result return all non-seen bindings directly from
the input

Note that the terminology in literature has changed to "bind joins".
Hence, for new classes and methods I try to follow that.

Change is covered with some unit tests
Bind left joins for OPTIONAL can be disabled using the
"enableOptionalAsBindJoin" flag in the federation config

Integrate the switch between implementations in the unit test as
parameterized test
- use for-each loop for iterating bindingset
- use IntHashSet
- use Literal#intValue instead of Integer#parseInt
@aschwarte10 aschwarte10 force-pushed the GH-5121-fedx-bind-left-join branch from 40b03a5 to 01bc075 Compare September 29, 2024 11:26
@aschwarte10
Copy link
Contributor Author

The PR should not be fully ready for review. I have addressed my remaining TODOS and the feedback from @hmottestad (Thanks for that btw).

Can I ask for reviews on this?

Note: when this is merged, I will create some additional improvements as part of orthogonal improvement issues (still to be created)

-So we can say mid november for RDF4J 5.1.

That sounds really great. Let's aim for that. Might be helpful for us in our product to test the 5.1 branch earlier in our product (to be able to react when there are still issues encountered during testing / using). Note: we cannot really use RDF4J snapshot versions. Would it be poissible to create a milestone build some time end of October for acceptance testing?

@aschwarte10
Copy link
Contributor Author

I would like to merge this PR in the upcoming days, to continue with #5151 and #5152

Anybody interested in doing further reviews before merging (e.g. @hmottestad )? I will also ask a colleague of mine to help with a review, will likely be there early next week

@hmottestad
Copy link
Contributor

LGTM. Feel free to merge!

@aschwarte10 aschwarte10 merged commit eefa338 into develop Oct 23, 2024
9 checks passed
@aschwarte10 aschwarte10 deleted the GH-5121-fedx-bind-left-join branch October 23, 2024 06:46
@aschwarte10
Copy link
Contributor Author

aschwarte10 commented Oct 23, 2024

thanks @hmottestad

Have merged it. Unfortunately the merge commit message does not reference the PR correctly (and I messed up the commit message, has a lower case issue reference). Is this something that requires to be resolved?

The individual commits below that merge commit are fine though

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
📦 fedx fedx: optimized federated query support
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants