-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-4819 Merge Join #4822
GH-4819 Merge Join #4822
Conversation
core/common/ordering/src/main/java/org/eclipse/rdf4j/common/ordering/StatementOrder.java
Outdated
Show resolved
Hide resolved
core/sail/base/src/main/java/org/eclipse/rdf4j/sail/base/SailDataset.java
Outdated
Show resolved
Hide resolved
core/sail/base/src/main/java/org/eclipse/rdf4j/sail/base/SailDataset.java
Outdated
Show resolved
Hide resolved
core/sail/base/src/main/java/org/eclipse/rdf4j/sail/base/SailDataset.java
Outdated
Show resolved
Hide resolved
@Override | ||
public final CloseableIteration<? extends Statement> getStatements(StatementOrder order, Resource subj, IRI pred, | ||
Value obj, boolean includeInferred, Resource... contexts) throws SailException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is at the SailConnection level. It might be better to have the order
argument be after includeInferred
and before Resource... contexts
. Though I do prefer to keep things consistent between the Sail level and the Dataset level, which doesn't have an inferred
argument. Keeping the order
as the first argument also makes sense since the specified order could be for context.
@abrokenjester @kenwenzel I'm working on support merge join. The first step is to add support for the sail to return ordered statements. I would really appreciate some feedback on how to make my changes as consistent as possible with the existing interfaces. Any suggestions on naming is also very welcome. |
Multi variable joins might be problematic:
|
@hmottestad I can offer to add support for ordered indexes to LmdbStore. Do you already have any performance figures? |
|
||
@Override | ||
public CloseableIteration<? extends Statement> getStatements(StatementOrder statementOrder, Resource subj, | ||
IRI pred, Value obj, Resource... contexts) throws SailException { | ||
throw new UnsupportedOperationException("Not implemented yet"); | ||
} | ||
|
||
@Override | ||
public Set<StatementOrder> getSupportedOrders(Resource subj, IRI pred, Value obj, Resource... contexts) { | ||
return Set.of(); | ||
} | ||
|
||
@Override | ||
public Comparator<Value> getComparator() { | ||
throw new UnsupportedOperationException("Not implemented yet"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kenwenzel here are the three methods you need to implement to get merge join working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hmottestad I've added an initial implementation in:
https://github.com/kenwenzel/rdf4j/tree/GH-4819-merge-join
Unfortunately, QueryBenchmark.complexQuery()
directly fails with an Exception. Maybe it helps to locate the problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed it, was an optimisation I had made where I hadn't quite accounted for all the edge cases.
This would be really awesome. I've tagged you above in a comment to show what needs to be implemented. That should hopefully be all that is needed. One thing I'm not certain about is transaction isolation, so it might be best to test it out with IsolationLevels.NONE. There is also some work left to be done on the DualUnionIteration, let me know if you run into the UnsupportedOperationException in that class. I haven't made any benchmarks, mostly because I've only implemented this in the ExtensibleStore with a backing data structure that sorts the data for each request. That is just so I can test things out locally. The end goal is for https://github.com/the-qa-company/qEndpoint to use it for analytical queries. They have a really innovative product that uses HDT for storing the data on disk and can work with massive datasets using very little memory. The queries that we want to support are ones that would anyway need to read most of the data from a range query on the index. For instance:
If you want to see if the query planner chooses merge join you can use the query explanation:
And you should see Here is an example for a completely different query:
|
This all gives me an idea for making the query explanation smarter by including what index is being used and a recommendation for the most optimal index so the user can configure the best indexes for their particular queries. |
@kenwenzel I think the performance is a bit bad because it doesn't pick the optimal index.
I changed up the code that picks the index just to see what would happen (TripleStore.java):
Combined with changing the indexes used in the QueryBenchmark:
And the performance is on par with the current performance. Not really any faster or slower. Looking at the query explanation I can see that there isn't much that is joined using merge join:
I'll try out some other queries to see how they look. |
@kenwenzel try this query:
On my machine this is 3x faster with merge join. PS: You can disable merge join by changing the set of orders that are returned:
|
@hmottestad Probably the merge join will show its strength when the IO performance degrades due to memory mapping and/or slow storage systems. This should be the case if the databases grow larger than RAM. |
@kenwenzel collaborating on merge join would be a lot simpler if you were a committer. Would you be interested in me nominating you? |
I can't make guarantees regarding contributions but I don't want to refuse the offer a second time (Jeen already asked a while ago). So yes and thank you for asking. |
@hmottestad I was asked by the team (@D063520) behind the qEndpoint to help with implementing merge joins :) and I see you are already working on it. |
They are sponsoring this feature :) |
@JervenBolleman: we reached out different people and I understood you are too busy, @hmottestad had some free time and he was super reactive |
I think it is wonderful you/the qa company are sponsoring this feature and @hmottestad is much better at this kind of work then I am :) win win all round. |
TODO:
No simple way to do |
import org.eclipse.rdf4j.rio.RDFHandler; | ||
import org.eclipse.rdf4j.rio.RDFHandlerException; | ||
|
||
public abstract class AbstractQueryPreparer implements QueryPreparer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class was deleted by accident in 5.0.0-M2.
TODO
|
Develop branch
This branch
|
Doesn't seem to be any performance degradation. |
6fd295e
to
5be175e
Compare
…he write transaction pointer because it was allocated by the java.nio.ByteBuffer and not using unsafe or equivalent.
… complains if we reference an older release for some reason
5be175e
to
1e5c78c
Compare
GitHub issue resolved: #4819
Briefly describe the changes proposed in this PR:
PR Author Checklist (see the contributor guidelines for more details):
mvn process-resources
to format from the command line)