[System] Issue with short time frame (5min) #1437
Pinging @elastic/integrations (Team:Integrations)
Alright, I can reproduce this. Honestly, I'm at a loss here. The data coming in seems fine, and it's across too many different visualizations to just be an issue with a single metric. It's also really unpredictable. Sometimes it happens with a 15 minute interval, sometimes it doesn't. Sometimes it happens with a 20 minute interval, sometimes it doesn't. Is there some rounding issue going on in the visualizations themselves?
@fearful-symmetry Could it be because the system module dashboard is still using TSVB? I tested this with the EC2 and RDS dashboards: the same thing happens with the EC2 dashboard, which uses TSVB for its visualizations, but the RDS dashboard, which uses Lens, is fine.
Seconded. I tinkered around with Lens for a bit and everything seems fine. I wonder if that's where the issue is.
@jasonrhodes Do you by chance have any idea what is happening here? Which team should be pinged on this in case it is a TSVB bug?
Hi! We just realized that we haven't looked into this issue in a while. We're sorry! We're labeling this issue as
@neptunian this issue is very old, but I'm seeing it in my notifications again because of the stale bot. When you have time, can you check whether this System Module dashboard still displays the problem described here? It would be interesting to understand whether using TSVB can cause data issues in dashboards, since we are always weighing Lens vs. TSVB. This isn't urgent, but it'd be nice to know whether this is a simple/obvious TSVB problem or not. If not, I don't think we need to spend a ton of time digging beyond that unless the answer seems obvious to you.
Thanks for looking into this, @neptunian -- I guess we should probably work with the integration teams and submit changes for these shipped dashboards? I wonder if we could distill what you found into some "Common Advice" for those creating dashboards like this, so that each team could check their own dashboards and adjust accordingly?
I'd like to understand why it's currently being done this way before making changes. In the (current) case of using "last value", TSVB auto-calculates the interval for the given range, returns the average for each time-series bucket, and displays only the last bucket's value. If we change this to "entire_time_range", so that it stops getting time-series data and instead gets the average over the entire time range, could that be slower over a larger time range? I wouldn't think so. Is there a reason why we only want the last bucket? Is that more "real time"? I would expect to get the whole time range, not the last bucket, when I am allowed to choose a time range. TSVB doesn't currently default to this option when I create a gauge, so I assume this was intentional.

Also, if we did change it to "entire_time_range", it wouldn't solve the problem completely. If the user selects a small enough time range, say the last 10 seconds while only sending metrics every 30s, there will still be an empty bucket over that "entire time range".

In the screenshots I posted above, I got the Metricbeat dashboard after enabling the Metricbeat system module, whereas @ruflin's screenshot says "metrics", so I'm assuming it's the integration. However, when I install the system integration it looks different for me, so maybe it has been updated. There it's somewhat easier to understand why you might not get a value, because it's shown as null in that bucket. But for some reason, when the time range is small and the interval drops to 1s, it stops showing the interval in the tooltip. I'm not sure how these info tooltips work, but if we could extend them to show the value and the interval, the user might understand that the time range is too small to reliably show data in intervals smaller than their collection period.

Happy to work with @elastic/integrations as I'm assuming they own these?
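The failure mode discussed here can be sketched in a few lines (hypothetical numbers and bucket count; this is a simplified model, not TSVB's actual implementation): when the selected range is short, the auto-calculated bucket interval drops below the 30s metric collection interval, so the final bucket can contain no samples and the gauge goes blank.

```python
from datetime import datetime, timedelta

def bucket_last_value(samples, range_start, range_end, n_buckets=30):
    """Mimic a 'last value' mode: split the range into auto-sized buckets,
    average the samples in each bucket, and report only the last bucket."""
    bucket_len = (range_end - range_start) / n_buckets
    buckets = [[] for _ in range(n_buckets)]
    for ts, value in samples:
        if range_start <= ts < range_end:
            idx = min(int((ts - range_start) / bucket_len), n_buckets - 1)
            buckets[idx].append(value)
    last = buckets[-1]
    return sum(last) / len(last) if last else None  # None -> blank gauge

# Metrics arrive every 30s; the dashboard shows the last 5 minutes.
end = datetime(2023, 1, 1, 12, 5, 0)
start = end - timedelta(minutes=5)
samples = [(start + timedelta(seconds=30 * i), 50.0) for i in range(10)]

# 5 min / 30 buckets = 10s buckets: the last bucket holds no sample.
print(bucket_last_value(samples, start, end))                # None
# With coarser buckets (e.g. one per minute) a value comes back.
print(bucket_last_value(samples, start, end, n_buckets=5))   # 50.0
```

With a longer time range the auto interval grows past the collection period, which is why selecting 30min instead of 5min makes the values reappear.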
There have been quite a few iterations on this integration dashboard since I opened the issue. I remember we always had quite a few discussions around using last value vs. not (@simianhacker might be able to chime in for general advice).
That would be an ideal outcome for me of this discussion. @drewdaemon might also be able to chime in here in the context of Lens on what the recommendation is / should be. |
@ruflin thanks for the ping. At this point, everything in this issue should be accomplished with Lens. I even have an open PR to replace the gauges with Lens metrics. But I think the questions about behavior are mostly more fundamental than any specific visualization tool, and a full discussion of these behaviors occurred when the Kibana visualizations team originally reworked these very dashboards. A big problem with TSVB's "last-value mode" is that it can be unpredictable and hard to understand, as noted by @neptunian.

The tables

As far as whether to use the average over the full range vs. the real-time value, we proposed the following approach for the tables, which would contain the best of both. However, at the time, using the Lens table instead of TSVB introduced an extra click to get to the host dashboard. Because of this, we reverted to the old TSVB tables. This blocker is no longer relevant, so I would suggest we go with the original suggestion pictured above (part of #4868).

The gauges (soon to be Lens metrics)

The single-number visualizations are harder because we can't just add another column as we can in the tables. In my PR, I have left the single-metric visualizations taking the average over a limited time range simply to maintain parity with the existing visualizations. This suffers from the same issues as TSVB "last value mode." I would much rather we decide which of the two approaches is most valuable and be consistent: either show the last reported value, or show the aggregation over the entire selected time range.
My general advice is implicit in my comment above, but I would say: make your visualizations consistent and predictable, favoring one of these two approaches: show the last reported value, or show an aggregation over the entire selected time range.
In very few cases does it make sense to constrain the time range independently of the time picker (i.e. TSVB's "last value mode" and Lens's "reduced time range"). But of course, every case is different. cc @stratoula in case she has any more thoughts.
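For illustration, the two recommended approaches could be sketched as follows (hypothetical data; the function names are mine, not a Kibana API). Note that either approach returns a value whenever any sample falls inside the selected range, unlike a last-bucket mode:

```python
def last_reported_value(samples):
    """Approach 1: show the most recent document in the selected range,
    regardless of how histogram buckets would fall."""
    return max(samples, key=lambda s: s[0])[1] if samples else None

def entire_range_average(samples):
    """Approach 2: aggregate over the whole selected time range."""
    return sum(v for _, v in samples) / len(samples) if samples else None

# (seconds-offset, cpu %) pairs inside the user's selected range
samples = [(0, 40.0), (30, 50.0), (60, 60.0)]
print(last_reported_value(samples))   # 60.0
print(entire_range_average(samples))  # 50.0
```

Both are deterministic and easy to label ("Last value of cpu" vs. "Average of cpu"), which keeps the number consistent with what a user would find by checking the latest document in Discover or running their own aggregation.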
Is there a point in allowing the user to select a time range if everything is showing the last value? I can see selecting a point in time (like the Inventory waffle map) but not a time range. I assume this was only for the chart visualizations, where we want users to be able to define a time range? Favoring either of the two approaches doesn't seem consistent. It might also confuse users that once they move into the curated UIs, we do things another way.
Hi @neptunian , thanks for the question and comments.
Yes, the user sees the last value in their selected time range. Sorry—I see now that my original comment was ambiguous on this point.
TBH, I'm surprised to hear this! Can you explain your thinking? Users are often confused by the magic behind TSVB's last-value mode (isn't that the main reason for this issue?). To us, it seems more consistent to indicate either a metric that covers the entire time range, or the last reported value. In the table screenshot I posted above, each number is specifically labeled and will be consistent with any independent analysis/checking a user performs on their end (e.g. looking at the latest document in Discover or running their own Elasticsearch aggregation queries).
Hmmm, I guess I don't understand this one. What are these "curated UIs"?
Sorry, I think I misunderstood. I agree we should "decide which of the two approaches is most valuable and be consistent". When it comes to the "common advice", I am hoping we stick with one of the two approaches rather than allowing either, unless there is some exceptional case, as you mentioned.
This might be related to my misunderstanding above. We are building UIs (observability infra) that are influenced by these dashboards. When it comes to something like these gauges, or "summary"-type metrics, we do not use the "last value". Mainly because I didn't know that was happening behind the scenes, and it had me wondering whether we should be, too, if that's what a user expects.
++ on having ONE recommended way that is consistent across all parts of our UIs, be it embedded Lens or not.
As I say, some visualization types do support using both approaches simultaneously, such as the table which can have an aggregation and a last value column, both clearly labeled. However, others, especially single-value visualizations like gauges and metrics, clearly don't, and choosing one approach for some and another for others will definitely be confusing. But, that's just me chiming in from the general visualization best-practices angle. I certainly defer to y'all on any questions regarding what is most valuable for observability users, along with how you decide to come to consistency across your dashboards and UI views.
// this article is also posted on discuss.elastic.co

Dashboards!

Why do dashboards exist in Observability? What is the difference between business-service-level dashboards and host-level dashboards? The remainder of this discussion is centered on technology dashboards and host-level views. Business service dashboards are a different animal. In viewing host-level metrics there are two general groupings: multiple systems – typically of a similar function (web, services, DB) – vs. a view of a single system.
These two views of metrics are different, both important, and each has its challenges when it comes to displaying data. When viewing the ‘latest’ data, we need to know about its precision (is it averaged or point-in-time? over what period?), its actual collection interval (1 sec, 1 min), and any lag between ‘now’ and when the data was collected (is it truly the ‘latest’ real-time, or just ‘near-time’?). When viewing trending data we need to show normalized information as well as important data that identifies behavior that may be buried by simple averages.

Note: an important consideration for monitoring is choosing a meaningful interval for raw data collection. In some cases this may be reviewed per host, per metric, and even per app component. For disk, 10 minutes may be perfectly acceptable for established, mature environments but not necessarily safe for systems with immature processes and rapid changes – or for systems where disk usage is vital to the stability of the system, e.g. database environments. Similarly, for memory and CPU a default interval of 10 seconds may be fine-grained enough, but for some systems 60 or 120 seconds may be more than adequate.

Visualization Concepts for Elastic Observability

When dealing with time-graphed data

Let's take a real-world example

Now let's take this to the next level. Here is the same data (same system, same timeframe) as presented in Elastic Kibana's [Metricbeat System] ECS ‘Host Overview’ dashboard. (Note this holds true in a v8.5 test env. as well)

How do we deal?
@mgevans-5 Thanks a lot for taking the time to write all this down and share it! Really appreciated. It is a good point that in many cases we should show Avg + Max (+ maybe Min) to get the full picture.
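The Avg + Max point can be illustrated with a small sketch (hypothetical numbers): a brief CPU spike survives in the per-bucket max but is buried by the per-bucket average.

```python
def summarize_buckets(values, bucket_size):
    """Report avg and max per bucket; the max preserves spikes
    that the average smooths away."""
    out = []
    for i in range(0, len(values), bucket_size):
        chunk = values[i:i + bucket_size]
        out.append({"avg": sum(chunk) / len(chunk), "max": max(chunk)})
    return out

# 10s CPU samples: mostly idle, with one brief saturation spike.
cpu = [10.0] * 29 + [100.0] + [10.0] * 30
for bucket in summarize_buckets(cpu, 30):
    print(bucket)
# {'avg': 13.0, 'max': 100.0}  <- spike visible only in max
# {'avg': 10.0, 'max': 10.0}
```

The first bucket's average (13%) looks harmless, while its max (100%) reveals the saturation event, which is why showing both gives the fuller picture.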
I went through this issue again and quite a few related issues. Thanks everyone for contributing. Here are my follow-up thoughts:
The part that seems to create the most confusion is the empty graphs with
@ruflin that's a good summary. A few responses:
I haven't heard any discussion about this level of guidance. cc @dej611, @stratoula
In some ways, this is probably true, but the "Edit visualization in Lens" button will retain the TSVB "last value" mode as Lens's "reduced time range" setting, perpetuating the issue. Because of this, the person converting legacy visualizations currently has to understand this problem and manually intervene. The situation is made worse by the fact that TSVB used to turn on "last value" mode by default, so it is widespread in the integration dashboards, probably in many cases without the original authors understanding what was happening. I'm still trying to figure out what to do about this TBH.
Lens already gives you a default label of either "Last value of <field>" or "Average of <field>". Seems like the responsibility here should ultimately rest with the dashboard authors since they're the ones with the power to override the labelling. But, always open to enhancement requests to Lens (not sure if this is what was meant).
As far as I know there's not been any discussion on this yet. |
Let's brainstorm a bit more on this one as it might also help solve the TSVB migration. What does a user do with this label? How does a user understand what it means (without being an expert)? Can we help the user somehow? If the user selected the wrong option, can they switch over? Do we have recommendations during setup in case the user is unsure what to select? The way I think we should approach it: let's make the right choice by default for "new" visualisations. For everything existing, let's guide the user as much as possible and not assume magical knowledge about how things work. Ideally, for 80% of the docs / explanations, users do not have to jump to a website and come back ;-)
By way of clarification, Lens offers this "reduced time range" option outside of the TSVB migration context since there are valid use cases for it. However, it is a little buried in the UI and not on by default. 👍
I think the concern I stated above ^^ doesn't have to be tied to the TSVB to Lens migration efforts currently underway, so sorry for any confusion that caused. We can always track the integration visualizations that are using this feature and perform an "audit" later. 👍
Strong yes here. This is always our goal and I will open an issue with the ideas you've stated about counter rate and gauge fields.
In discussing this with the visualizations team, the general sentiment was that we should avoid inserting guidance around best practices into the user flow, to keep the automatic convert-to-Lens button as frictionless as possible. But I think that's okay as far as this issue goes, since the scope is limited to improving single-number visualizations in our curated integration dashboards. That is an effort which probably has to be undertaken manually anyway.
This issue was initially opened around the 5min time frame issue for the metrics dashboards but evolved into a pretty long and fruitful discussion. I expect this issue to also serve partially as documentation if these issues pop up again. Nevertheless, I would like to see us having documentation around the above topics in our docs pages where we can send users, but I'm not sure about the right place. Should it be with Lens? Should there be docs pages focused on the metrics use cases? @drewdaemon @ninoslavmiskovic @mlunadia Interested to hear your take. My plan is to close this issue soonish. The topic is now in good hands with @drewdaemon and team on the Lens side. The topic that likely deserves a follow-up issue in Kibana is the max / min discussion for gauges. The thing not clear to me is who takes the lead on follow-up docs.
Closing this issue but please keep chiming in. |
@ruflin you identified this as being in good hands with @drewdaemon - for reference's sake, can you link the related efforts or identify any internal project ID/name we can use when working with our support/sales groups?
@mgevans-5 Can you ping me through support (@ruflin ) in reference to this comment here so we can handle this "internally"? @drewdaemon Can you add all the links here that are publicly available for future reference?
I don't know of any resources to share outside the comments on this thread. Mainly I'm talking about
As far as ownership goes, the Lens enhancement ideas obviously fall under us (visualizations team), but I would expect the integration authors/maintainers to correct instances of this issue (intermittent blank values due to constrained time ranges) in their own dashboards. It may be that we (visualizations team) can help by pointing integration owners to visualizations that are likely to exhibit this behavior, but an initiative like this probably makes the most sense once the Lens migration is further toward completion. Actually, @ruflin I'm a little surprised to have this specific issue closed because the system dashboard's single-value visualizations haven't yet been corrected... I hoped to do so with #4975, but I never received feedback from the owners on which of the two approaches (last value or average over time range) would be most appropriate, and it fell by the wayside.
Reopening this issue. @drewdaemon I missed on my end that we have not fixed this yet :-( @cmacknz @lalit-satapathy One of your teams should take this on.
Integration owners are one of the target groups, but I think it's broader. Basically everyone can build integrations now, so only targeting the current owners is not enough. @drewdaemon You linked above some of the most important comments in this thread. Where in the docs should these go? My assumption is they are Lens related?
Let's get the PR here re-opened so we can get the necessary feedback. #4975 I will find someone to make sure it gets reviewed, there have been multiple customer issues with this as it is today. @kpollich or someone from his team is probably better suited to review the changes to the visualizations in the package. My team owns the system integration but most of our knowledge is on the data collection side, not the visualization side. |
Sometimes when a short timeframe is selected for the system dashboard (for example 5min) some of the values are not shown. Everything works again as expected as soon as a longer timeframe like 30min is selected. Initially I thought some data did not make it through in time but it seems other graphs work and the data is coming from the same machine. Any idea on what could cause this?