Report input level dropped/filtered event metrics #42325
Comments
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane) |
We already know the target index/datastream for each event in the output, I wonder if we could instead have per data stream stats at the output level? That might get the same result with less internal pipeline wiring. |
I like the idea. However, for standalone Beats users this might not be as useful since most users are writing all data to |
On a related note, the warning messages that are logged when an event is dropped do not contain the index. We should log the index name and the pipeline (if set) as structured data. beats/libbeat/outputs/elasticsearch/client.go Line 517 in 4dfef8b
|
Even per-datastream stats would be very useful. |
Hello folks, I started looking at it and I have a few questions. The intent is to have the dropped and filtered metrics per input. So far, from what I've seen, the easiest option to implement would be to add those to the pipeline metrics. Making the pipeline or the output somehow communicate back to the input would be either a big change or some hack. So if the aim is just to have the metrics, rather than having them in a specific place, I'd avoid putting them on the inputs. Is that OK? |
Adding the new metrics in the pipeline, they'd show up in the monitoring output like this: "libbeat": {
"pipeline": {
"clients": 2,
"events": {
"active": 0,
"dropped": 0,
"failed": 0,
"filtered": 4,
"published": 1984,
"retry": 1600,
"total": 1988
},
"inputs": {
"logfile-system-6ecb8afb-3445-41e8-a7bd-914959b160d4": {
"events": {
"failed": 0,
"filtered": 0,
"published": 1984,
"total": 0
}
}
}, |
It would be easier to consume if it were present in the HTTP inputs endpoint. The metrics above are ingested into ES by Agent monitoring. So if we do put the data into the libbeat pipeline metrics, I would still probably recommend doing an optimistic "join" to bring those pipeline metrics into the HTTP inputs response. This is the code that would need to be modified to do this. I'd be happy to take care of this part if you do the work of aggregating the metrics per input in the pipeline. beats/libbeat/monitoring/inputmon/httphandler.go Lines 73 to 95 in e9272ad
The API response would then look something like [
{
"events": { # NEW METRICS
"failed": 0,
"filtered": 0,
"published": 1984,
"total": 0
},
"batch_read_period": {
"histogram": {
"count": 6388,
"max": 1108938500,
"mean": 121099029.8828125,
"median": 96635900,
"min": 6726500,
"p75": 102812500,
"p95": 114321550,
"p99": 1096386525,
"p999": 1108896965,
"stddev": 163688364.02907938
}
},
"discarded_events_total": 0,
"errors_total": 0,
"id": "winlog-system.security-58eacef7-041e-49fe-b276-58f0b5f9b2c2",
"input": "winlog",
"provider": "Security",
"received_events_count": {
"histogram": {
"count": 6389,
"max": 100,
"mean": 95.3232421875,
"median": 100,
"min": 3,
"p75": 100,
"p95": 100,
"p99": 100,
"p999": 100,
"stddev": 15.715736752558572
}
},
"received_events_total": 611509,
"source_lag_time": {
"histogram": {
"count": 611509,
"max": 23587791100,
"mean": 2871361958.3984375,
"median": 1054898450,
"min": 53831700,
"p75": 4440275750,
"p95": 8395522725,
"p99": 18535951875,
"p999": 23581712035.000004,
"stddev": 3709899503.770204
}
}
}
] |
Hi, thanks. I'm just out of a chat with Fae and she gave me a better explanation of how those metrics end up in ES, which is aligned with what you just said. :) Reporting them in a way that gets them into ES in an efficient form seems to be a key part of this task. It also seems Metricbeat will need changes to collect them. I see now why having them inside the pipeline metrics isn't good, but the pipeline still seems the best place to collect them. As long as the metrics get reported in the right place, where they are collected won't be a problem. I'll tinker a bit more; probably on Monday I'll have something more concrete. |
Hello folks, I've been doing a mix of POC and investigation to see how deep this rabbit hole goes. TL;DR:
I confess I'm more inclined to require users to configure their inputs to add the needed metadata, as the Agent already does. It'd allow a much more generic approach which would work for any input. What do you think @nimarezainia and @andrewkroh? The long version: As I said before, collecting the per-input data in the pipeline would be ideal, as it'd be a single place to modify and all the inputs would benefit from it. However, if the Beats aren't running under Agent, the needed information (inputID or streamID) might not be there, so it would not work for standalone. What seems to always be there is the input type, which would end up aggregating the metrics of several inputs of the same type in one place. Making the input and pipeline communicate is possible with either a significant refactor or some hacky solution. I've experimented with that; it's possible without a big refactor, but it'd require all inputs to be updated to receive the status of each event. Another option would be to require the inputs to add the metadata to the events for the new metrics to be available. That could be a simpler solution even though it adds responsibility on the user side. Regardless of the solution, it'll also require adding to the Metricbeat beat module the ability to query the |
@AndersonQ before jumping straight into the implementation details can we please work on an RFC here to make sure we have all potential solutions in mind? |
@jlind23 that's what I'm going for; I just needed to understand the current state of things better before proposing something that would be complex to implement, or proposing without enough information to compare different approaches. |
Just to clarify, I don't think we need a full RFC sent out to multiple teams. What we need is more like a 1-2 page summary of what the final design will look like, and we can do some prototyping to find out what that should be. That can be either a Google doc or just a comment here, depending on how easy it is to summarize. |
I was going to suggest that; I don't think there is enough for a full RFC. I'll reorganise what I started and post it here. |
Problem Statement

Currently the metrics related to event publishing come from the pipeline and the output. Both of them report the metrics for all events processed by them, without discerning which input produced the events. These are the metrics currently* available:
Whereas this allows understanding and investigating issues in the pipeline and output, it does not help to understand if there is a specific input misbehaving, if the

*there are more, but they are not relevant for this discussion

How to expose and consume the new metrics

Elastic Agent users
They will profit from the new metrics and dashboards out of the box. No extra configuration will be required, and the new or modified dashboards will be added by the Elastic Agent integration. Also, the diagnostics will include the new metrics for each Beat.

Beats standalone users
Proposed design

The publishing pipeline already has the desired metrics; however, they are not input-specific. If they were, it'd solve the question of how to collect them. When running in managed mode, the Elastic Agent configures the inputs to add metadata to the events that the publishing pipeline can use to know which input generated each event. The metadata can easily be added to a standalone Beats configuration, and the lack of such metadata would only mean the input-specific metrics would not be tracked. The main advantage of using the publishing pipeline to track the new metrics is containing the changes to a single place, without having to change every single input and without having to ensure that any future input also collects these metrics. Besides, making the publishing pipeline report back the status of each event would require either a big change in the code or a hacky solution. The solution is to have the publishing pipeline collect the metrics and aggregate them by input ID (an illustrative sketch of this idea is included after this comment). The required config would be a small addition to the standalone Beats configuration.
The new metrics will be stored in a new monitoring namespace.
They might be added to
As the current input metrics aren't sent to ES, it will require extending the monitoring index. An example document:

{"id":"my-filestream-id","input":"filestream","bytes_processed_total":0,"events_dropped_total":0,"events_filtered_total":0,"events_published_total":0,"events_processed_total":0,"files_active":0,"files_closed_total":0,"files_opened_total":0,"messages_read_total":0,"messages_truncated_total":0,"processing_errors_total":0,"processing_time":{"histogram":{"count":0,"max":0,"mean":0,"median":0,"min":0,"p75":0,"p95":0,"p99":0,"p999":0,"stddev":0}}}

Besides that, the current Agent and standalone Beats dashboards need to be updated or new dashboards need to be created. The Agent dashboards will require changes in the Elastic Agent package, and the standalone Beats dashboards in the Beats repository. Inputs without an ID or with an empty ID cannot have these metrics tracked; however, any solution would have this restriction, as without a way to uniquely identify the input it would not be possible to differentiate the metrics of this input from those of other inputs.

Alternative Designs

Input ID passed to the publishing pipeline as an argument, or status returned as a return parameter to the input
The publish pipeline interface is modified to receive the

Publish pipeline 'sends back' the final status of the event by an alternative channel
Either the publishing pipeline has a communication channel the inputs can subscribe to, or the

Testing
Besides unit tests, integration or e2e tests need to be added:
Open questions
@nimarezainia, @cmacknz, @andrewkroh what do you think? cc: @pierrehilbert, @jlind23 |
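For illustration, here is a minimal, self-contained Go sketch of the proposal above: the publishing pipeline aggregates its existing counters per input, keyed by an identifier carried in the event metadata. The type names, the `input_id` metadata key, and the counter functions are hypothetical stand-ins for this discussion, not the actual libbeat API.

```go
// Illustrative sketch: per-input aggregation of pipeline event counters keyed
// by an "input_id" value carried in the event metadata (set by the input under
// Agent, or by configuration on standalone Beats). All names are hypothetical.
package main

import (
	"fmt"
	"sync"
)

// event mimics only the relevant part of a published event: its metadata.
type event struct {
	Meta map[string]interface{}
}

// inputCounters holds the per-input counters the proposal describes.
type inputCounters struct {
	published, filtered, dropped, total int64
}

// inputMetrics aggregates counters by input ID.
type inputMetrics struct {
	mu   sync.Mutex
	byID map[string]*inputCounters
}

func newInputMetrics() *inputMetrics {
	return &inputMetrics{byID: map[string]*inputCounters{}}
}

// counters returns the counter set for the input that produced the event,
// falling back to "unknown" when no input_id metadata is present.
func (m *inputMetrics) counters(e event) *inputCounters {
	id := "unknown"
	if v, ok := e.Meta["input_id"].(string); ok && v != "" {
		id = v
	}
	m.mu.Lock()
	defer m.mu.Unlock()
	c, ok := m.byID[id]
	if !ok {
		c = &inputCounters{}
		m.byID[id] = c
	}
	return c
}

func (m *inputMetrics) onPublished(e event) { c := m.counters(e); c.total++; c.published++ }
func (m *inputMetrics) onFiltered(e event)  { c := m.counters(e); c.total++; c.filtered++ }
func (m *inputMetrics) onDropped(e event)   { c := m.counters(e); c.total++; c.dropped++ }

func main() {
	m := newInputMetrics()
	m.onPublished(event{Meta: map[string]interface{}{"input_id": "my-filestream-id"}})
	m.onFiltered(event{Meta: map[string]interface{}{"input_id": "my-filestream-id"}})
	m.onDropped(event{Meta: nil}) // no metadata: counted under "unknown"
	for id, c := range m.byID {
		fmt.Printf("%s: total=%d published=%d filtered=%d dropped=%d\n",
			id, c.total, c.published, c.filtered, c.dropped)
	}
}
```

Events lacking the metadata simply fall into an "unknown" bucket, which matches the proposal's point that missing metadata only means the input-specific metrics are not tracked.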
You need the input type + the input ID. For standalone agents the input ID can be the empty string.
Integrating any of this with Beats stack monitoring should be out of scope here. We want to augment the input metrics in their current form, not make them available in new places. If we want to do that it can be done separately. Focus on making the new metrics available in the places where input metrics are currently available.
The processor-based approach will add some small performance overhead; we already see the additional processors Elastic Agent adds, that a standalone Filebeat doesn't, as a cost in our benchmarks. Can you avoid the processor and just unconditionally include this metadata in the Publish call to the pipeline?
This does exist, it's the beats/libbeat/beat/pipeline.go Lines 58 to 59 in 81e2def
You would have to subscribe every input to it, which isn't as nice as just putting metadata in the pipeline. It also doesn't tell you why the event was dropped and adding that would be complicated I think. beats/libbeat/beat/pipeline.go Lines 65 to 90 in 81e2def
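As a conceptual sketch of the "subscribe every input" approach being discussed, the snippet below shows why each input would need its own listener instance and why the listener cannot see the reason for a drop. The interface here is a hypothetical simplification for this thread, not the actual one in libbeat/beat/pipeline.go.

```go
// Conceptual sketch only: a per-input event listener that counts events.
// The interface and type names are hypothetical simplifications.
package main

import "fmt"

// eventListener receives publish/drop notifications for one pipeline client.
type eventListener interface {
	AddEvent(published bool) // called once per event handed to the client
	ACKEvents(n int)         // called as events are acknowledged downstream
}

// perInputListener counts events for a single input; one of these would have
// to be wired into every input's client configuration.
type perInputListener struct {
	inputID        string
	total, dropped int
	acked          int
}

func (l *perInputListener) AddEvent(published bool) {
	l.total++
	if !published {
		// The event was discarded before reaching the queue; note the
		// listener cannot tell *why* it was dropped.
		l.dropped++
	}
}

func (l *perInputListener) ACKEvents(n int) { l.acked += n }

func main() {
	var l eventListener = &perInputListener{inputID: "my-filestream-id"}
	l.AddEvent(true)
	l.AddEvent(false)
	l.ACKEvents(1)
	p := l.(*perInputListener)
	fmt.Printf("%s: total=%d dropped=%d acked=%d\n", p.inputID, p.total, p.dropped, p.acked)
}
```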
You will still need to detect dropped events in two places regardless:
|
I don't see a way of doing that without having to change every input, or something close to that :/ I think the ideal scenario would be to have the inputID and type in the event struct, to be used by anything manipulating the event, but that isn't the case. Besides, if those metrics' primary use case is for integrations, then the Agent already adds the necessary metadata.
Exactly
I believe you're mixing
I'm just splitting those by input. |
I got something working :) @andrewkroh did you have any more specific ideas for the charts to add to the dashboards? Or shall I just come up with something and show you? @nimarezainia @cmacknz any suggestions for dashboards/charts? As I said before, for the Beats running under Agent it works out of the box; for standalone Beats they'd need to add an add_fields processor to add
it's using |
I'm still not convinced you need to rely on processors at all; you should just be able to write directly to the event metadata in the publish client that every input has an instance of. beats/libbeat/publisher/pipeline/client.go Lines 90 to 104 in 090584e
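A rough sketch of that suggestion: the pipeline client, which already knows which input it belongs to, stamps the input identity into the event metadata before the event enters the queue, so no processor is needed. The types below are simplified stand-ins for beat.Event and the pipeline client, and the metadata keys are hypothetical.

```go
// Sketch: the publish client annotates events with the producing input before
// enqueueing them. Simplified stand-ins, not the real libbeat implementation.
package main

import "fmt"

type event struct {
	Meta   map[string]interface{}
	Fields map[string]interface{}
}

// client represents one input's publish client; the input type and ID are
// known when the client is created, so no add_fields processor is required.
type client struct {
	inputType string
	inputID   string
	enqueue   func(event) // stands in for pushing the event into the queue
}

// Publish annotates the event with the producing input before enqueueing it.
// Downstream, the pipeline and output can aggregate metrics by these keys.
func (c *client) Publish(e event) {
	if e.Meta == nil {
		e.Meta = map[string]interface{}{}
	}
	e.Meta["input_id"] = c.inputID // hypothetical metadata keys
	e.Meta["input_type"] = c.inputType
	c.enqueue(e)
}

func main() {
	c := &client{
		inputType: "filestream",
		inputID:   "my-filestream-id",
		enqueue:   func(e event) { fmt.Println("queued:", e.Meta) },
	}
	c.Publish(event{Fields: map[string]interface{}{"message": "hello"}})
}
```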
Also, looking at the generation code, we actually include the stream_id as a processor already. We can probably remove this; I think it was added to support the original shipper. beats/x-pack/libbeat/management/generate.go Lines 258 to 261 in 81e2def
|
The current "Agent Metrics" dashboard has output-level metrics (e.g. active, filtered, published, total, dropped) broken down by sub-process (e.g. component.id). I think we should have a very similar view that instead splits by the input ID. @strawgate was showing me a refreshed "Agent Metrics" dashboard so I would defer to him about where exactly these visualizations should go. |
I haven't found how the pipeline client can know the inputID for each event without either relying on some event field or metadata OR having to change the inputs to add it to the
At the pipeline client right now there is no access to the inputID. @andrewkroh, is there a specific input or set of inputs that would be the most important ones to have the new metrics? If we go the route of modifying each input, knowing that would be useful. |
@andrewkroh thanks! I was talking to Craig and he really wants to avoid relying on the event metadata added by processors, so we decided to go the route which will require changes per input. Well, perhaps for some inputs there is a generic place which will make the new metrics work for a set of inputs. Anyway, we'll need to choose the inputs to add the new metrics to and check case by case. |
@andrewkroh, @cmacknz, @flexitrev, @pierrehilbert While I was doing the integration tests I discovered the pipeline client metric for As I said in the beginning, the amount of change needed to properly track events per input, from the input all the way to the output, would be considerable, as the current architecture isn't meant for that. It also wouldn't be reusable in OTel. So I'd like to bring this up to discuss whether it's indeed what should be pursued. As I was talking to @faec, she said that perhaps it'd be better to direct the effort to doing this in OTel instead of in the current Beats, even with some hacky solution. Even using some new event metadata, without a processor adding it, would require changes to the inputs and outputs, and I would not assume the additional overhead is negligible. Besides, it'd create a contract between input and output to use a new metadata field, and it would not be portable to OTel. The |
None of the existing pipeline metrics we rely on will apply in the OTel collector; that's not a reason not to iterate on things that already exist. I would agree we should not embark on a massive refactor of the Beats pipeline at this time, but we can take on smaller work to provide a better view into the existing metrics, or enhance the existing metrics to be more accurate.
I don't think there was ever an intent to track events dropped by the queue; events get dropped by the output or filtered out by processors, not the queue. What specific metrics are you talking about here, and where in the code are they currently defined? If they don't make sense, should we shift the focus to fixing or removing them? I originally suggested adding the metrics to the output, primarily to avoid having to create the feedback loop from the output back to the input metrics, which I suspect is the hardest part of this. Is that still viable? We track detailed stats for each batch we attempt to send, and we have a reference to the underlying beat.Event when that happens. Can we just take the existing metrics and break them down by datastream or input ID? I think that would accomplish the goal of making this visible, just in a slightly less convenient way. beats/libbeat/outputs/elasticsearch/client.go Lines 474 to 475 in fb79d49
Taking the existing metrics and breaking them down by index or source doesn't create any net new problems; the metrics being broken down are in the same place with respect to any equivalent in our OTel collector. |
It isn't as simple as it looks. We had agreed that using the current streamID or inputID isn't ideal because we want to remove the processors which add them. It's possible to have the inputs add it directly to the event's metadata, and now that #42559 is merged it'll be available. However, it isn't straightforward, and I'd say we'd need to evaluate the overhead it might cause to check each event's status once ES sends its response. There might be events from different inputs in a batch, as far as I know. We should also consider whether it should be added to other outputs and whether it's actually possible to do so. Anyway, tracking dropped events is quite different from tracking filtered ones. Filtering happens when the processors run, and this metric is already being collected. So I believe we should track them as separate tasks. In conclusion, what I'm getting is that we want to explore how to track the dropped events, so it's worth coming up with a proposal for that. |
We already iterate over every single event because ES sets an HTTP status code per event in a _bulk request. That is what this function is doing: beats/libbeat/outputs/elasticsearch/client.go Lines 474 to 533 in fb79d49
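To make the idea concrete, here is a small, self-contained Go sketch of breaking the existing per-item bookkeeping down by input while walking the per-item statuses of a _bulk response. It is a simplified stand-in for the real loop in libbeat/outputs/elasticsearch/client.go, with hypothetical type and function names and with the input ID assumed to come from event metadata.

```go
// Sketch: per-input acked/dropped/retried counts derived from _bulk item
// statuses. Simplified stand-in for the real ES client code.
package main

import "fmt"

type bulkItem struct {
	status  int    // HTTP status ES returned for this document
	inputID string // would come from event metadata (e.g. e.Meta["input_id"])
}

type perInputStats struct {
	acked, dropped, retried int
}

// collectPerInputStats mimics the existing loop over bulk item results, with
// one extra map lookup and increment per event for the per-input split.
func collectPerInputStats(items []bulkItem, stats map[string]*perInputStats) {
	for _, it := range items {
		s, ok := stats[it.inputID]
		if !ok {
			s = &perInputStats{}
			stats[it.inputID] = s
		}
		switch {
		case it.status < 300:
			s.acked++
		case it.status == 429 || it.status >= 500:
			s.retried++ // retryable: would be re-enqueued, not dropped
		default:
			s.dropped++ // non-retryable client error: the event is dropped
		}
	}
}

func main() {
	stats := map[string]*perInputStats{}
	collectPerInputStats([]bulkItem{
		{status: 201, inputID: "my-filestream-id"},
		{status: 400, inputID: "my-filestream-id"}, // e.g. a mapping conflict
		{status: 429, inputID: "winlog-system.security"},
	}, stats)
	for id, s := range stats {
		fmt.Printf("%s: acked=%d dropped=%d retried=%d\n", id, s.acked, s.dropped, s.retried)
	}
}
```

The extra cost per event is a map lookup and a counter increment on a loop the output already runs for every bulk item.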
We'd be incrementing different counters, which doesn't worry me, but we do have to sanity check it. We need to avoid having to decode the pre-serialized event; it looks like #42559 does what we'd need already. If what you are saying is that filtered events are now tracked and supporting dropped events becomes a separate issue or set of PRs, then sure. |
yes, that's what I meant.
I was looking at that yesterday. It's doable, but I would not assume the overhead is negligible. I just don't want to inadvertently decrease the output performance in order to track dropped events per input. |
Yeah, we should measure the impact in the end, but the point I'm trying to make is that it doesn't seem like such a bad idea that we shouldn't even try. You could probably get some idea of this with a microbenchmark, since the impact will be contained to that one function we know is on the hot path. |
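A microbenchmark along those lines could look like the sketch below (it would live in a `_test.go` file). It reuses the simplified stand-in types from the previous sketch rather than the real output code; the package, function, and input names are illustrative only.

```go
// Sketch of the suggested microbenchmark: measure the per-batch cost of the
// per-input counter increments in isolation. Names are illustrative.
package espublish

import "testing"

type bulkItem struct {
	status  int
	inputID string
}

type perInputStats struct{ acked, dropped, retried int }

func collectPerInputStats(items []bulkItem, stats map[string]*perInputStats) {
	for _, it := range items {
		s, ok := stats[it.inputID]
		if !ok {
			s = &perInputStats{}
			stats[it.inputID] = s
		}
		switch {
		case it.status < 300:
			s.acked++
		case it.status == 429 || it.status >= 500:
			s.retried++
		default:
			s.dropped++
		}
	}
}

// BenchmarkCollectPerInputStats exercises a batch shaped like a typical _bulk
// response: a few distinct inputs and mostly successful items.
func BenchmarkCollectPerInputStats(b *testing.B) {
	items := make([]bulkItem, 0, 2048)
	inputs := []string{"filestream-a", "filestream-b", "winlog-security"}
	for i := 0; i < 2048; i++ {
		status := 201
		if i%100 == 0 {
			status = 400 // sprinkle in a few non-retryable failures
		}
		items = append(items, bulkItem{status: status, inputID: inputs[i%len(inputs)]})
	}
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		stats := map[string]*perInputStats{}
		collectPerInputStats(items, stats)
	}
}
```

Comparing this against the same loop without the per-input map would give a rough upper bound on the overhead before touching the real hot path.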
Describe the enhancement:
Today, metrics related to event publishing are available only in an aggregated form that combines metrics from all inputs. This makes it difficult to know which specific input source is responsible for dropped or filtered events. The input metrics should contain data about the number of events dropped or filtered by each input. This would help narrow down the source of problems to a particular input (and hence a particular integration data stream in the case of Elastic Agent).
I believe this would require the beat publisher client (from libbeat/publisher) to provide a way for inputs to subscribe to this data.
Describe a specific use case for the enhancement or feature:
- dropped and filtered metrics that are specific to an integration data stream (via input_metrics.json).
- when logging_metrics_namespaces: [stats, inputs] is used. This would allow them to quickly identify and resolve issues.