Enable pipeline to configure connection pooling #9
Comments
+1 for the issue. It seems that several connections are opened and kept in the pool, but only one of them is used at any time.
Any updates please?
Is the problem here that one filter is serially accessed by each worker? We have discussed batch event processing in enhancement filters before. No conclusive approach was decided upon.
@guyboertje In our particular case the problem was an unexpected number of connections to the database and no ability to control them. We had 10 pipelines with several filters each, and this setup sometimes gave bursts of 100 open connections, which were too many for our DB. The best solution would be the ability to configure a shared connection pool across all pipelines and filters, but that may lead to starvation, and I'm not sure it is technically possible to implement. So the bare minimum is to simply allow setting the number of connections in each filter's connection pool. Right now it uses 4 connections per filter by default and cannot be capped at 1 connection per filter.
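For illustration, the bare-minimum version of that request could look like this in a pipeline config. Note that max_connections is a hypothetical option sketched here for the sake of the proposal, not something the plugin currently exposes, and the connection string and query are made up:

```
filter {
  jdbc_streaming {
    jdbc_connection_string => "jdbc:postgresql://db-host:5432/app"   # hypothetical
    jdbc_user => "lookup_user"
    statement => "SELECT name FROM users WHERE id = :id"
    parameters => { "id" => "user_id" }
    target => "user_details"
    max_connections => 1   # hypothetical option: cap this filter's pool at one connection
  }
}
```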
@guyboertje Thank you for your response.
Yes, in our case we have one pipeline, with 5 workers and a batch size of 5000.
It is a simple setup, I guess; however, we noticed that the process freezes for record counts > 50. We have close to 2.5 million rows from the input SQL query, and if we don't include the JDBC streaming filter, the process completes in a jiffy. We suspect that the JDBC streaming is causing some bottleneck (sequential executions??), though we're not sure if the contention is on the Oracle connection or something else.
Would this approach help with the JDBC streaming filter?
Sorry, but I'm not too familiar with the architecture of Logstash, i.e. how the workers get distributed among the batches etc. Any help or insight into our situation would be appreciated. I'm including a trimmed-down snippet of our config, sketched below:
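A minimal sketch of a comparable pipeline (the Oracle connection details, table names, and queries below are hypothetical stand-ins, not our actual config):

```
input {
  jdbc {
    jdbc_driver_library => "/path/to/ojdbc8.jar"                        # hypothetical
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "jdbc:oracle:thin:@//db-host:1521/ORCL"   # hypothetical
    jdbc_user => "app"
    statement => "SELECT id, payload FROM source_rows"                  # ~2.5 million rows
  }
}
filter {
  jdbc_streaming {
    jdbc_driver_library => "/path/to/ojdbc8.jar"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "jdbc:oracle:thin:@//db-host:1521/ORCL"
    jdbc_user => "app"
    statement => "SELECT detail FROM lookup_table WHERE id = :id"       # hypothetical lookup
    parameters => { "id" => "id" }
    target => "details"
  }
}
```

The 5 workers and 5000 batch size are set via pipeline.workers and pipeline.batch.size in logstash.yml.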
Ahh OK. So you mean global connection pooling. I don't think this can be done in Logstash, as each filter is seen as an autonomous transformation engine working on a single event at a time. I have seen similar questions some time back; at that time I searched for any kind of JDBC DB proxy. I searched again today and found a couple of candidates. Both are actively developed and are open source. I have not tested either of them. Please feed back any conclusions if you decide to evaluate them and/or use one in production. In either the original direct setup or a proxied (pooled) indirect one, we still have a job to do understanding how the various JDBC plugins will react to waiting for their turn at execution and the timing out thereof. I must admit that this is a lesser understood facet of the jdbc plugin behaviour. A concrete sketch of the proxied variant follows.
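In the proxied setup only the connection string changes: the filter keeps its own small Sequel pool while the proxy multiplexes those client connections onto a shared server-side pool. The proxy endpoint and query below are hypothetical:

```
filter {
  jdbc_streaming {
    # Point the plugin at a local pooling proxy instead of the DB itself;
    # the proxy owns the real server connections and shares them out.
    jdbc_connection_string => "jdbc:oracle:thin:@//localhost:9999/ORCL"   # hypothetical proxy endpoint
    jdbc_user => "lookup_user"
    statement => "SELECT name FROM users WHERE id = :id"
    parameters => { "id" => "user_id" }
    target => "user_details"
  }
}
```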
Ok. That leads to a follow-up question: do (or can) the workers process filters in parallel? That is, suppose I have 5 workers, each processing an input event (so 5 in all) in parallel; for each of those events, will an autonomous instance of the filter be created to process the event? Thanks for the pointers to JDBC connection pooling; depending on the answer to my question above, we can narrow down where the main bottleneck is, i.e. if the filters can't process multiple events in parallel, then connection pooling becomes secondary (but a bottleneck nonetheless).
From @dmitrymurashenkov above...
Looking at the Logstash and Sequel code: each pipeline is autonomous, and each filter plugin instance (as seen in the config) declares itself (per its authors) threadsafe or not. Threadsafe filters are reused across workers (but not pipelines) and are assumed to be callable in parallel by each worker thread. On my Macbook, the jdbc_streaming filter uses a Sequel threaded connection pool with the default maximum of 4 connections.

Each worker takes a batch of events from the queue (the inputs feed newly minted events into the queue) and feeds the events from the batch through each filter sequentially, based on the conditional logic in the config (if any). This means that if you have two jdbc_streaming filters one after the other, only one will be executing a statement at any one time per worker thread. Simultaneous execution of a statement by multiple worker threads is probable (up to the 4-connection limit, the default pool size), but the degree to which this simultaneous statement execution occurs is determined by how synchronised the worker loops become, as each worker loop is subject to variable delays while it executes the filters and output(s).

Thinking about a worst case scenario, imagine that a jdbc_streaming filter is used to look up user details from an external database for every single event: each worker thread would then contend for one of the pooled connections on every event it processes. To test whether a bigger pool size will improve throughput, or whether a smaller pool size will put less load on the DB, you can modify the pool size the plugin passes to Sequel (its max_connections option) and compare runs; a sketch of that experiment follows.
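This is a standalone sketch at the Sequel level, outside Logstash. The connection string, credentials, and lookup query are hypothetical; max_connections is Sequel's real pool-size option, though where the plugin sets it internally is an assumption here:

```ruby
# Pool-size experiment, run directly against Sequel (under JRuby, as Logstash uses).
# Sequel's threaded pool defaults to :max_connections => 4, the source of the
# "4 connections per filter" figure discussed above.
require "sequel"

[1, 4, 8].each do |pool_size|
  db = Sequel.connect(
    "jdbc:oracle:thin:@//db-host:1521/ORCL",  # hypothetical connection string
    user: "app", password: "secret",          # hypothetical credentials
    max_connections: pool_size
  )
  start = Time.now
  # Simulate 5 worker threads doing per-event lookups, as the filter would.
  threads = Array.new(5) do
    Thread.new do
      1000.times { db["SELECT detail FROM lookup_table WHERE id = ?", rand(100)].all }
    end
  end
  threads.each(&:join)
  puts "pool=#{pool_size} elapsed=#{(Time.now - start).round(2)}s"
  db.disconnect
end
```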
If you do this test, please report your findings back here.
@guyboertje Much appreciated! Will keep you posted.
Since Sequel supports connection pooling by default, exposing the ability to control aspects of pooling up to the pipeline configuration should be pretty straightforward.
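A sketch of what exposing that could look like inside the filter plugin; the option name, method name, and surrounding code are assumptions for illustration, not the plugin's actual source:

```ruby
# Hypothetical sketch: surface Sequel's pool size as a plugin setting.
require "logstash/filters/base"
require "sequel"

class LogStash::Filters::JdbcStreaming < LogStash::Filters::Base
  # Assumed option name; not present in the released plugin.
  config :max_connections, :validate => :number, :default => 4

  def prepare_connection  # assumed method name
    # Sequel manages the pool itself; the plugin only forwards the setting.
    @database = Sequel.connect(
      @jdbc_connection_string,
      :user => @jdbc_user,
      :password => @jdbc_password.nil? ? nil : @jdbc_password.value,
      :max_connections => @max_connections
    )
  end
end
```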