Using Flow instead of iterators for query evaluation? #4320
Replies: 12 comments 25 replies
-
Interesting idea. I'm not fully across what this would look like, but it sounds worth writing a draft proposal for. Of course, as you rightly say, the impact on products like GraphDB will likely be major. On the other hand, if we can do this in a way where we can at least offer a good migration path, that should not be a massive concern. In the long run, getting rid of the wrapped iterator stack will have significant performance benefits.
-
Looks like Flow is part of the Reactive Systems field. Does Flow solve some of the same problems as RxJava? I'm curious what will happen to the likes of RxJava once Loom is introduced. My understanding is that the issues that lead developers to use RxJava can be solved in new and better ways once Loom becomes available. I can't say that I actually understand how or why, but it was the impression I got from the various talks I've heard about Loom. From the same talks I also get the impression that developers don't really like the complexity that comes with RxJava, but those speakers would naturally be biased and I don't have any first-hand knowledge myself.
-
Not much experience with this, but IIRC the main difference is that RxJava (and Flow) also deal with backpressure, while Loom doesn't (or at least not as gracefully)?
-
TL;DR I think Loom is really, really cool, but I think its benefits for a query execution engine are limited.

To my understanding, Loom will dramatically lower the cost of multi-threading in terms of memory and thread-switching overhead. Thread pooling also helps with this, but with a poorer developer experience. CompletableFuture in Java partly addresses this; RxJava and the like improve on it by providing a dataflow paradigm. Some also provide tools for back-pressure and for scheduling different parts of a dataflow on different thread pools (e.g. to separate IO-bound from CPU-bound work).

I think that Loom will have a tremendous impact on workloads that do a lot of (network) IO, e.g. web servers and services in a micro-services environment that to a large extent 'orchestrate' with external services / clients, with the convenience of imperative programming. For query execution I don't know if Loom brings that much benefit. For CPU-bound parts it doesn't make sense to have (many) more threads than cores. For IO-bound parts it could be desirable to pipeline requests to storage in order to take advantage of reordering / batching, especially with spinning disks. Maybe this could be cheaper with Loom than with a thread pool? I'm not so sure though how well Loom / Java supports async disk I/O. At least, using async disk I/O (on Linux) in Java is not available out of the (OpenJDK) box.
-
The dataflow paradigm of Flow, RxJava, reactivestreams, etc. does fit query execution quite well, although I wonder whether it will provide a substantial benefit over plain old iterators. Async disk I/O could be one of these areas. In order to improve efficiency in RDF4J's query execution, vectorisation, batching, pipelining and re-ordering are definitely interesting. A challenge with Flow and related libraries could be that the dataflow is 'record' oriented: you request n elements and then get up to n elements, but one-at-a-time via onNext. For vectorisation etc. I think many systems are taking a 'record batch' approach. Perhaps that could be developed based on Flow or the like. @Jerven, could you outline in some more detail how you would see Flow helping with vectorisation and multi-threading?

(As for some context on why I am trying to chime in on this discussion at all: I'm working on an append-mostly, read-many RDF-star storage with dictionary encoding and a reverse index from values per 'position' to statements, based on RDF4J.)
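A 'record batch' dataflow on top of the JDK's `Flow` could look something like the sketch below. This is purely illustrative and not RDF4J code: `String` stands in for `BindingSet`, each `onNext` carries a whole batch, and back-pressure is expressed in batches rather than single solutions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;

// Sketch: batch-oriented dataflow. Instead of one onNext per solution,
// each onNext carries a whole batch (List<String> stands in for a
// hypothetical List<BindingSet> record batch).
public class BatchFlowSketch {

    public static List<List<String>> consumeBatches(List<List<String>> batches) throws InterruptedException {
        List<List<String>> received = new ArrayList<>();
        CountDownLatch done = new CountDownLatch(1);
        try (SubmissionPublisher<List<String>> pub = new SubmissionPublisher<>()) {
            pub.subscribe(new Flow.Subscriber<List<String>>() {
                private Flow.Subscription subscription;

                public void onSubscribe(Flow.Subscription s) {
                    subscription = s;
                    s.request(1); // the unit of demand is a batch, not a record
                }

                public void onNext(List<String> batch) {
                    received.add(batch); // a real operator would process the batch here
                    subscription.request(1);
                }

                public void onError(Throwable t) { done.countDown(); }

                public void onComplete() { done.countDown(); }
            });
            batches.forEach(pub::submit);
        } // closing the publisher signals onComplete after pending deliveries
        done.await();
        return received;
    }

    public static void main(String[] args) throws InterruptedException {
        List<List<String>> out = consumeBatches(List.of(List.of("a", "b"), List.of("c")));
        System.out.println(out); // [[a, b], [c]]
    }
}
```

The point of the sketch is only that the reactive-streams demand protocol works unchanged when the element type is a batch, which is what a vectorised evaluator would want.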
-
I think generally reactive services and async code are horrible. I regret picking vert.x for one of my work services. However, the win of Flow (with Loom or without it) in our situation is that it gives us thread safety if we go into query parallelization. Still, the main issue is taking the result of a Flow and turning it into a lazy closeable iterator as we have now.

So let's take a simple query as an example:

```sparql
SELECT ?x (count(?y) as ?ys)
WHERE {
  ?x rdf:value ?y .
  FILTER(?y > 1)
} GROUP BY ?x
```

Here we currently have a stack of iterators evaluating the query. Each iterator may (but does not have to) run in its own thread. The publish/subscribe is buffered, and each iterator, when scheduled, can work on a few hundred items before the next iterator is scheduled. This gives us DB-style vectored results. In the end we move from a pull architecture to a push one, which is interesting for Halyard as that seems to do the same.

Notes on loom:
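As a hypothetical illustration of this buffered push pipeline for the query above, here is a sketch using the JDK's `SubmissionPublisher`. A `Row` record stands in for `BindingSet`, each stage runs on the publisher's executor threads, and none of this is actual RDF4J code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SubmissionPublisher;

// Sketch: the query plan as push-based stages. A source publishes (x, y)
// pairs, a filter stage re-publishes pairs with y > 1, and a terminal
// subscriber performs the GROUP BY ?x with count(?y).
public class PushPipelineSketch {

    record Row(String x, int y) {}

    public static Map<String, Long> run(Row... rows) throws InterruptedException {
        ConcurrentHashMap<String, Long> counts = new ConcurrentHashMap<>();
        CountDownLatch done = new CountDownLatch(1);
        SubmissionPublisher<Row> source = new SubmissionPublisher<>();
        SubmissionPublisher<Row> filtered = new SubmissionPublisher<>();

        // FILTER(?y > 1): subscribes to the source, publishes downstream.
        source.consume(r -> { if (r.y() > 1) filtered.submit(r); })
                .whenComplete((v, t) -> filtered.close());

        // GROUP BY ?x with count(?y): terminal aggregation stage.
        filtered.consume(r -> counts.merge(r.x(), 1L, Long::sum))
                .whenComplete((v, t) -> done.countDown());

        for (Row r : rows) {
            source.submit(r); // push into the pipeline
        }
        source.close();
        done.await();
        return counts;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(new Row("a", 1), new Row("a", 2), new Row("b", 3)));
    }
}
```

`SubmissionPublisher` buffers between stages by default, which is what gives each stage a chance to process a run of items before the next stage is scheduled.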
Beta Was this translation helpful? Give feedback.
-
Well, there is an attempt at trying to be iterator-compatible. This deadlocks though, and the implementation is not correct :(
-
Anyone revisited this idea recently? I've embarked on a mission with my team to build an RDF4J layer on top of our existing HBase tables and Elasticsearch indices. We're able to greatly reduce query latency by employing a pipelined version of the nested-loop join. Essentially the nested loop becomes:

```java
/**
 * Joins left bindings with right in a nested-loop like fashion. The left {@link Flux} is flat-mapped against the
 * right query evaluation step. The sequential version of flat-map is used to preserve order. Subscribing to the
 * nested flux is done on a different thread for the pipelined (eager) evaluation.
 */
public Flux<BindingSet> evaluateReactive(BindingSet bindings) {
    return evaluate(leftPrepared, bindings)
            .flatMapSequential(
                    leftBindings -> evaluate(rightPrepared, leftBindings)
                            .subscribeOn(SCHEDULER)
            );
}
```

Combined with a buffering subscription to adapt the `Flux` to `CloseableIteration` to work within the RDF4J architecture, we've seen the latency of some queries drop from just shy of 3 minutes to 10-15 seconds. A small step was taken to support stacking such operators (e.g. a join on a pipelined join), and also to provide this from 'leaf' data accessors:

```java
public interface ReactiveQueryEvaluationStep extends QueryEvaluationStep {

    Flux<BindingSet> evaluateReactive(BindingSet bindings);

    int getBlockingBufferSize();

    @Override
    default CloseableIteration<BindingSet, QueryEvaluationException> evaluate(BindingSet bindings) {
        Flux<BindingSet> flux = evaluateReactive(bindings);
        return new CloseableFluxIteration<>(flux, getBlockingBufferSize());
    }
}
```

Ideally though, such adaptation would only be necessary at the 'edge of the system'. E.g.:

```java
TupleQueryResult result = connection.prepareTupleQuery("select * where { ... }").evaluate();
Flux<BindingSet> flux = result.flux();
```

Such a `QueryResult` could look like:

```java
public interface QueryResult<T> extends AutoCloseable, Iterable<T> {

    Flux<T> flux();

    @Override
    default Iterator<T> iterator() {
        return flux().toIterable().iterator();
    }

    default Stream<T> stream() {
        return flux().toStream();
    }
}
```

I'll park the whole _Closeable_Iteration concern here 🥸
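The 'buffering subscription' idea can be sketched against the JDK's `Flow` API. The real code above adapts a reactor `Flux` to RDF4J's `CloseableIteration`; `BlockingBufferIterator` below is a hypothetical stand-in that shows the same mechanics with a bounded queue: the subscriber prefetches a buffer's worth of items, and the pull side blocks on the queue and tops demand back up as it consumes.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;

// Sketch: adapt a push-based Flow.Publisher to a pull-based Iterator
// via a bounded blocking queue. Error handling is deliberately crude.
public class BlockingBufferIterator<T> implements Iterator<T>, AutoCloseable, Flow.Subscriber<T> {

    private static final Object DONE = new Object(); // completion sentinel
    private final BlockingQueue<Object> buffer;
    private final int prefetch;
    private Flow.Subscription subscription;
    private Object next;

    public BlockingBufferIterator(int bufferSize) {
        this.buffer = new ArrayBlockingQueue<>(bufferSize + 1); // + 1 slot for the sentinel
        this.prefetch = bufferSize;
    }

    @Override public void onSubscribe(Flow.Subscription s) {
        subscription = s;
        s.request(prefetch); // prefetch one buffer's worth
    }
    @Override public void onNext(T item) { buffer.add(item); }
    @Override public void onError(Throwable t) { buffer.add(DONE); } // a real adapter would propagate t
    @Override public void onComplete() { buffer.add(DONE); }

    @Override public boolean hasNext() {
        if (next == null) {
            try {
                next = buffer.take(); // block until the producer pushes
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                next = DONE;
            }
        }
        return next != DONE;
    }

    @Override public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        @SuppressWarnings("unchecked") T item = (T) next;
        next = null;
        subscription.request(1); // keep the buffer topped up
        return item;
    }

    @Override public void close() { if (subscription != null) subscription.cancel(); }

    public static void main(String[] args) {
        try (SubmissionPublisher<Integer> pub = new SubmissionPublisher<>()) {
            BlockingBufferIterator<Integer> it = new BlockingBufferIterator<>(2);
            pub.subscribe(it);
            new Thread(() -> { List.of(1, 2, 3).forEach(pub::submit); pub.close(); }).start();
            List<Integer> out = new ArrayList<>();
            it.forEachRemaining(out::add);
            System.out.println(out); // [1, 2, 3]
        }
    }
}
```

The bounded queue is what makes the adaptation safe: the producer can run ahead by at most one buffer, which is the same role `getBlockingBufferSize()` plays in the interface above.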
Beta Was this translation helpful? Give feedback.
-
This is also very similar to what Halyard is doing for query evaluation. |
Beta Was this translation helpful? Give feedback.
-
@hmottestad I don't know exactly what the state of the elasticsearch-store is, but the latency issue also plays a big role in that implementation. I can imagine
-
I've dabbled a bit in an implementation based on reactor.io: frensjan@4d9fad5. The changes show that it's possible to use reactive flows in RDF4J, and I think that it's possible to do so with a migration phase wherein both iterators and reactive streams can be mixed. It's awfully rough around the edges though; it cuts corners, ignores the fact that iterators in RDF4J need to be closed, and what not. Next to the question of whether this is desirable from an architectural point of view, there is the matter of performance. I've used the query benchmarks to compare.

Some profiling shows that the reactive-streams protocol is simply a lot more involved than the more straightforward volcano-style iterators. I guess that in many use cases such extra cost is negligible in comparison to the actual work. But in the 'complex' query benchmark, the tight nested-loop evaluation of the joins (implemented with the flat-map sequential operator) generates substantial overhead. I haven't found an easy way to improve this.

From an architecture view this of course does open some avenues, e.g. concurrency for operators that access remote systems through multi-threaded pipelined joins. But also some areas which are CPU-bound could be sped up by parallelisation.
-
Edit: it was too good to be true and the advantage was due to broken code.

I did an experiment with Loom but the results look too good to be true. I will investigate next week. This is on the memory store, so mostly a big advantage in compute, or a major advantage in broken code ;)
-
I have been looking at the `Flow` class as an alternative to the internal use of iterators during query evaluation. I think we could use this to replace the current iterator approach. This would not only make query evaluation multi-threaded but also give us "vectored" volcano for free.

The `QueryContext` will need to be extended to carry along the `SubmissionPublisher`, and each `QueryEvaluationStep` needs to subscribe to the priors in the execution plan and publish as well.

This would be a major change inside the query evaluation and is unlikely to be backwards compatible. Halyard and GraphDB might be majorly affected.
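A minimal sketch of what such a subscribe-and-publish step could look like, using only `java.util.concurrent`. The `FilterStep` name and shape are assumptions for illustration, not existing RDF4J API; it is a `Flow.Processor` that subscribes to the prior step in the plan and publishes its own output downstream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;
import java.util.function.Predicate;

// Sketch: a query operator as a Flow.Processor. It consumes from the prior
// step (Subscriber side) and publishes to the next step (Publisher side,
// inherited from SubmissionPublisher).
public class FilterStep<T> extends SubmissionPublisher<T> implements Flow.Processor<T, T> {

    private final Predicate<T> condition;

    public FilterStep(Predicate<T> condition) { this.condition = condition; }

    @Override public void onSubscribe(Flow.Subscription s) { s.request(Long.MAX_VALUE); } // unbounded demand in this sketch
    @Override public void onNext(T item) { if (condition.test(item)) submit(item); }      // publish matches downstream
    @Override public void onError(Throwable t) { closeExceptionally(t); }
    @Override public void onComplete() { close(); }

    // Demo: scan -> filter -> collect, mimicking FILTER(?y > 1).
    public static List<Integer> demo(List<Integer> input) throws InterruptedException {
        List<Integer> out = new ArrayList<>();
        CountDownLatch done = new CountDownLatch(1);
        FilterStep<Integer> filter = new FilterStep<>(y -> y > 1);
        try (SubmissionPublisher<Integer> scan = new SubmissionPublisher<>()) {
            scan.subscribe(filter); // the step subscribes to its prior in the plan
            filter.consume(out::add).whenComplete((v, t) -> done.countDown());
            input.forEach(scan::submit);
        }
        done.await();
        return out;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo(List.of(1, 2, 3))); // [2, 3]
    }
}
```

A real version would need bounded demand (the buffered, vectored behaviour discussed above) and a way to propagate close/cancel through the plan, which is exactly the `CloseableIteration` concern raised in this thread.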