Rework system metrics overview and host overview #3630

Merged
merged 14 commits into from
Sep 27, 2022

Conversation

flash1293
Contributor

@flash1293 flash1293 commented Jun 30, 2022

What does this PR do?

This PR reworks the system overview and host overview dashboards, following the best practices documented here: https://docs.google.com/document/d/1uyyFGx6xA5Kvl8c-ZdvXdvBGrHTylxU9F69TGqfzdmw/edit

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • I have verified that Kibana version constraints are current according to guidelines.

Detailed changes in this PR

System overview:

  • Make all visualizations "by value" (no library visualizations necessary)
  • Enable margins between panels
  • Remove titles from most panels if the vis itself is sufficient
  • Replace the navigation on top with an explainer for how to drill down to the host overview (by using the row click menu of the "hosts" table)
  • Convert "Number of hosts" to Lens vis
  • Memory/CPU/Disk usage gauges: Use EUI status color scheme
  • Turn the two top n TSVB metrics for top hosts by CPU and Memory into a single "hosts" table with both metrics ordered by CPU usage, showing 1000 hosts
  • The user can navigate to the host overview page using a row click drilldown (the three dots context menu next to each row) - it will navigate to the dashboard with a pre-set filter - by default it's inline because this speeds up the navigation massively, but shift-click works to open in a new tab (like with any link)
  • ⚠️ It shows the top 1000 hosts by CPU, the table can be client-side ordered by memory directly on the dashboard, but it will only consider the already loaded 1000 hosts - does this make sense or are two tables needed?(one for top by CPU, one for top by memory)
  • ⚠️ The hosts table is not using "last value" mode, instead it takes the average of cpu/memory over the full time range - does this make sense?
  • Convert CPU usage histogram to Lens and use similar coloring like for the gauges, but de-emphasizing the "normal" state (light grey for below 70%, yellow up to 85%, then red)
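
The colour bands described for the CPU usage histogram can be sketched as a small lookup. This is only an illustration of the thresholds named above, not the Lens configuration itself; the function name `cpu_usage_color` is hypothetical, and treating exactly 85% as still yellow is an assumption.

```python
def cpu_usage_color(usage_pct: float) -> str:
    """Map a CPU usage percentage to the colour band described in the PR.

    Assumption: the 85% boundary is inclusive for yellow; the PR text only
    says "yellow up to 85%, then red".
    """
    if usage_pct < 70:
        return "light grey"  # de-emphasised "normal" state
    if usage_pct <= 85:
        return "yellow"      # warning band
    return "red"             # critical band


for value in (35, 70, 85, 92):
    print(value, cpu_usage_color(value))
```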

Host overview:

  • Make all visualizations "by value" (no library visualizations necessary)
  • Enable margins between panels
  • Remove titles from most panels if the vis itself is sufficient
  • Add input control to allow changing the host without leaving the page and without knowing available host names
  • Change markdown header to explain how to use input control and also link back to overview page
  • Restructure the page:
    • Most important metrics, then a separate section for CPU, Memory, Disk and Network
    • Convert "Number of processes" to Lens vis
  • Memory/CPU/Disk usage/Load/Swap gauges: Use EUI status color scheme
  • CPU section
    • Convert CPU Usage and Load over time charts to Lens
      • ⚠️ Lens uses larger intervals by default than TSVB - is this a problem?
    • Turn Top N processes by CPU usage into Lens table with color coding
    • ⚠️ The processes table is not using "last value" mode, instead it takes the average of cpu over the full time range - does this make sense?
  • Memory section
    • Convert Memory Usage over time charts to Lens
      • ⚠️ Lens uses larger intervals by default than TSVB - is this a problem?
    • Turn Top N processes by memory usage into Lens table with color coding
    • ⚠️ The processes table is not using "last value" mode, instead it takes the average of memory over the full time range - does this make sense?
    • Move swap usage gauge from top section down
  • Disk section
    • Restyle Disk IO chart to use same colors
      • ⚠️ Lens uses larger intervals by default than TSVB - is this a problem?
    • Turn Top N mountpoints by disk usage into Lens table with color coding
    • ⚠️ The mountpoints table is not using "last value" mode, instead it takes the average of disk usage over the full time range - does this make sense?
    • Copy disk usage gauge from top section down as a reference
  • Network section
    • Copy down inbound and outbound traffic metric vis from top section as a reference
    • Split up in and out packetloss visualizations into separate panels
    • Turn top in and out interfaces top n visualizations into a single combined Lens table, using the max of the counter metrics
    • ⚠️ The old interface topn visualizations were using average of in/out bytes, but using the maximum seems more correct, does this make sense?
    • Color in/out traffic in bytes and packets visualizations similar to Lens
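
To illustrate the question about average versus maximum for the interface table, here is a minimal sketch. It assumes the in/out byte fields are cumulative counters that only grow within the time range (no resets); the interface names and readings are made up for the example.

```python
from statistics import mean

# Hypothetical cumulative byte counters sampled three times in the time range.
readings = {
    "eth0": [100, 250, 400],  # grew by 300 bytes during the range
    "eth1": [390, 395, 398],  # grew by only 8 bytes during the range
}


def rank(readings, aggregate):
    """Return interface names ordered by the aggregated counter, highest first."""
    return sorted(readings, key=lambda name: aggregate(readings[name]), reverse=True)


print(rank(readings, mean))  # ['eth1', 'eth0']
print(rank(readings, max))   # ['eth0', 'eth1']
```

With a growing counter, the maximum equals the most recent total, while the average depends on how early in the range the counter was already high. If a counter resets inside the range, the maximum no longer reflects the latest state, which is the case where ranking by last value would be more accurate.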

Divergence from datavis proposal

  • Moving some metrics from the top section into the aspect-specific sections (swap usage, packet loss) - these seem too specific to me and don't help when trying to get a "general understanding" of the health of the host
  • Adding the host switcher to the top of the dashboard - it takes away
  • Giving the aspect-specific sections a header instead of simple separators - IMHO it doesn't hurt the overall look of the dashboard and it helps setting the context
  • Not duplicating as many metrics for the aspect-specific sections - I didn't want to take away as much space from the time series chart, in some places I did duplication to make the dashboard look more consistent. Happy to discuss this part
  • Not adding a "Top hosts by memory usage" heatmap to the system overview - it wasn't there in the existing dashboard, not sure whether it makes sense
  • Using tables instead of horizontal bar charts for the top processes / interfaces / mountpoints and putting them into their respective section instead of grouping them at the bottom. Visually I like them better in the bottom, but for investigation I think it's helpful to have them next to the chart they relate to. I also picked a table so the list can grow larger than the available space and start scrolling - IMHO in this situation it's a helpful model to discover more

cc @gvnmagni

Problems/Leftovers

I couldn't do some things because they are either not available in Lens yet at all or not at the current version

  • Rank by last value not in 8.1 - could be used for the disk usage per mountpoint table instead of max (it's more correct when resets are happening) - this is available in 8.2
  • As mentioned above - some of the "top processes/interfaces/... by x" visualizations use "last value" - this is not possible in Lens yet, we are working on this feature: [Lens] Window config for last value, counter rate and average/percentile kibana#132112 (might be available in 8.4)
  • Nicely looking metrics are not available, we are working on this [Lens] Implement new metric grid visualization kibana#134242 (might be available in 8.4)
  • Formula can't be time scaled in Lens (prevents conversion of disk.io traffic over time) - this is available in 8.3
  • Breakdown can't be collapsed (prevents conversion of network in/out over time for bytes/packets) - this is available in 8.3
  • Can't use the new input controls - the ones I used got deprecated in 8.3 in favor of a new implementation but they need to be integrated with drilldowns [Controls] Drilldown and Links panel Integration kibana#136650 - unclear when available

How to test this PR locally

  • For each visualization, validate whether the configuration still makes sense - I'm not that familiar with the dashboard and maybe I made some mistake
  • Take special care with the points marked with a warning triangle in the list above - I'm not 100% sure about these changes myself, but the others might be problematic too

Screenshots

Screenshot 2022-06-30 at 09 41 44

Screenshot 2022-06-29 at 19 15 16

@flash1293 flash1293 added the enhancement New feature or request label Jun 30, 2022
@flash1293 flash1293 requested review from a team as code owners June 30, 2022 08:46
@flash1293 flash1293 requested review from cmacknz and kvch June 30, 2022 08:46
@elasticmachine

elasticmachine commented Jun 30, 2022

💚 Build Succeeded

Build stats

  • Start Time: 2022-09-27T15:01:51.587+0000

  • Duration: 18 min 50 sec

Test stats 🧪

Test Results
Failed 0
Passed 246
Skipped 0
Total 246

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

@elasticmachine

elasticmachine commented Jun 30, 2022

🌐 Coverage report

| Name | Metrics % (covered/total) | Diff |
| --- | --- | --- |
| Packages | 100.0% (3/3) | 💚 |
| Files | 100.0% (4/4) | 💚 2.688 |
| Classes | 100.0% (4/4) | 💚 2.688 |
| Methods | 60.759% (48/79) | 👎 -29.538 |
| Lines | 98.793% (2702/2735) | 👍 7.408 |
| Conditionals | 100.0% (0/0) | 💚 |

@cmacknz
Member

cmacknz commented Jun 30, 2022

Added @joshdover as reviewer to get someone with more Kibana experience to look at this. Josh, feel free to assign someone else if needed.

@cmacknz cmacknz removed the request for review from kvch June 30, 2022 12:21
@joshdover
Contributor

@flash1293 This is a huge improvement in visual layout and use of new Kibana features. I exported the dashboard to my personal cluster and I found it a lot easier to use. 🎉

  • ⚠️ It shows the top 1000 hosts by CPU, the table can be client-side ordered by memory directly on the dashboard, but it will only consider the already loaded 1000 hosts - does this make sense or are two tables needed?(one for top by CPU, one for top by memory)

I think being able to find the top memory consumers is important, regardless of CPU usage. If we have to do a separate table for this we probably should or we should add a similar heatmap for Top Hosts by Memory Usage.
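
The concern can be reproduced with a small sketch: if only the top N hosts by CPU are loaded and then re-sorted by memory in the browser, a host with low CPU but high memory usage never appears. The host names, values, and helper functions below are hypothetical, and the limit is shrunk from 1000 to 2 to keep the example readable.

```python
from typing import NamedTuple


class HostStats(NamedTuple):
    name: str
    cpu: float
    memory: float


def top_by_cpu_then_sort_by_memory(hosts, limit):
    """Mimic the single table: load the top `limit` hosts by CPU, then re-sort client-side."""
    loaded = sorted(hosts, key=lambda h: h.cpu, reverse=True)[:limit]
    return sorted(loaded, key=lambda h: h.memory, reverse=True)


def top_by_memory(hosts, limit):
    """Mimic a dedicated table that ranks all hosts by memory."""
    return sorted(hosts, key=lambda h: h.memory, reverse=True)[:limit]


hosts = [
    HostStats("web-1", cpu=0.90, memory=0.30),
    HostStats("web-2", cpu=0.80, memory=0.40),
    HostStats("db-1", cpu=0.10, memory=0.95),
]

client_side = top_by_cpu_then_sort_by_memory(hosts, limit=2)
dedicated = top_by_memory(hosts, limit=2)
print([h.name for h in client_side])  # ['web-2', 'web-1']
print([h.name for h in dedicated])    # ['db-1', 'web-2']
```

Here `db-1` is the largest memory consumer but is cut off by the CPU ranking before the client-side sort runs.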

  • ⚠️ The hosts table is not using "last value" mode, instead it takes the average of cpu/memory over the full time range - does this make sense?

For this comment + all others regarding last value mode:

I think this is tricky because there are two use cases:

  • Help me find what is the problem with my system right now
  • Help me determine what processes are consuming the most resources in <time range>

The previous behavior showed the "realtime" value. My guess is the most helpful metric is the "what is happening right now" case, but makes the tables useless for the other use case. Should we have separate dashboards or visualizations for these different use cases? I'm guessing this is a problem we need to help solve for all integrations.

IMO this is the biggest outstanding problem with this PR.


Question about the drilldowns: it seems that clicking on the table rows doesn't open the host drilldown, but clicking the context menu works. Is there a bug here, because I see the drilldown is configured for "table row click" but that doesn't seem to work?

Related to that, it'd be great for us to expand on this in the future and have a process dashboard that can be drilled down into from the host overview board.

@joshdover
Contributor

Also, can the CPU heatmap on the overview dashboard also drilldown to the host dashboard on click?

@flash1293
Contributor Author

Thanks for the review @joshdover . I think I can address most of your points somehow - I will report back on these once I have a solution.

about

The previous behavior showed the "realtime" value. My guess is the most helpful metric is the "what is happening right now" case, but makes the tables useless for the other use case. Should we have separate dashboards or visualizations for these different use cases? I'm guessing this is a problem we need to help solve for all integrations.

An important note is that the tsvb vis would show the average of the last few seconds/minutes, but still order the list by the overall average, so the shown entries even in the current dashboard are not necessarily the top consumers. We have a feature on our near term list (8.5 or even 8.4 if we hurry) to allow to do the same thing in Lens (order by overall time range, show the last few minutes in the table). However I’m not sure whether it’s the best behavior as it’s hard to understand for a user - it’s a bit averaged over the full time range and a bit “current state”.
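
A rough sketch of the mixed behaviour described here, under the assumption that ranking uses the average over the whole time range while the displayed number is the average over a recent window. This is not TSVB code; `hybrid_top_n` and the sample data are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def hybrid_top_n(samples, n, window_start):
    """Rank hosts by their overall average, but report the recent-window average.

    samples: iterable of (timestamp, host, value) tuples.
    """
    overall = defaultdict(list)
    recent = defaultdict(list)
    for ts, host, value in samples:
        overall[host].append(value)
        if ts >= window_start:
            recent[host].append(value)
    ranked = sorted(overall, key=lambda h: mean(overall[h]), reverse=True)[:n]
    return [(h, mean(recent[h]) if recent[h] else None) for h in ranked]


samples = [
    (0, "a", 0.9), (1, "a", 0.9), (2, "a", 0.1),  # busy earlier, idle now
    (0, "b", 0.2), (1, "b", 0.2), (2, "b", 0.8),  # idle earlier, busy now
]
print(hybrid_top_n(samples, n=1, window_start=2))  # [('a', 0.1)]
```

Host `a` wins the ranking because of its earlier load, yet the table shows its current low value, while host `b`, the top consumer right now, is not listed at all.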

to make it consistent there are two approaches:

  • Show data for the full time range (like it is right now) - if the user wants to see the latest state, they have to set the dashboard time range to 30s
  • Show the very last state - take the very last value, order the list by that and also show that very last value (not averaging at all, just show the value from the last document in the current time range that has the field defined).

What do you think? Should we keep the current tsvb behavior or go one of the other ways?

@joshdover
Contributor

What do you think? Should we keep the current tsvb behavior or go one of the other ways?

Thanks for clarifying how this currently works and I do think this likely matches what we want, but the downside is we'd need to bump the minimum package version to 8.4 or 8.5 which isn't super desirable.

I'd be ok keeping the current behavior in the PR over increasing the minimum version right now. As you mentioned the user can still get the latest value using a shorter time span, which I think is better than only solving one of the use cases or increasing the minimum version right now.

We should track some of the improvements we'd like to make the next time we increase the minimum version. I think we'll want to try supporting at least 2-3 minor releases back from the latest public one. Today the latest release is 8.3.x, so bumping the minimum to 8.1 seems fine.

@flash1293
Contributor Author

I think we'll want to try supporting at least 2-3 minor releases back from the latest public one. Today the latest release is 8.3.x, so bumping the minimum to 8.1 seems fine.

Actually that’s an important point - in the original PR description I listed things that can be improved in future versions (most notably the new metric vis). When discussing this with @akshay-saraswat we concluded bumping to the latest minor would be justifiable (as users of older stack versions can still use the existing legacy assets). Do you think we should make a lag of 2 minors a rule?

@flash1293
Contributor Author

Another important point I just realized - the url drilldown used to link from the overview to the detail dashboard is not part of the basic subscription. I guess we need to make sure to only use basic features, is that assumption correct?

@joshdover
Contributor

Do you think we should make a lag of 2 minors a rule?

We should probably have some policy on this, but I don't think we do. I think the owning team of this package should chime in on how often we make bug fixes to this package that need to be backported to previous minors. cc @cmacknz

Another important point I just realized - the url drilldown used to link from the overview to the detail dashboard is not part of the basic subscription. I guess we need to make sure to only use basic features, is that assumption correct?

Hmm that will be a breaking change of sorts since the current dashboard does support this (is it also only on basic?).

I still think it'd be good to include it assuming that there is some graceful fallback when not on basic and nothing breaks. May need to update the copy to not mention the drilldown feature though.

@flash1293
Contributor Author

The problem with drilldown is that I can’t link the filter up with the input control on the details dashboard. But in 8.3 there’s a new input controls feature that might not have this problem (I have to check though)

@flash1293
Contributor Author

flash1293 commented Jul 22, 2022

Addressed some points:

  • Two tables on the overview page - one ordered by CPU, one ordered by memory
  • Switched to filter-based drilldown as table context url drilldown is not available in basic
  • Added another heatmap for memory usage over time
  • Added drilldowns to the heatmaps, too
    • ⚠️ The heatmap always selects both host and time range and there's no way to select only one of them. However it still seems useful to me
  • Removed input control on host overview page as it can't be integrated with multiple dashboards

The biggest open question is

The previous behavior showed the "realtime" value. My guess is the most helpful metric is the "what is happening right now" case, but makes the tables useless for the other use case. Should we have separate dashboards or visualizations for these different use cases? I'm guessing this is a problem we need to help solve for all integrations.

@cmacknz @joshdover could you have a look and check whether it makes sense like this?

Screenshot 2022-07-22 at 16 35 17

@joshdover
Contributor

Removed input control on host overview page as it can't be integrated with multiple dashboards

@flash1293 What does this mean exactly?

We have a feature on our near term list (8.5 or even 8.4 if we hurry) to allow to do the same thing in Lens (order by overall time range, show the last few minutes in the table).

A couple questions on this feature:

  • Is it available yet?
  • Would it be possible to both show the overall average in the time range and the last value?

@flash1293
Contributor Author

flash1293 commented Aug 4, 2022

@joshdover

What does this mean exactly?

When navigating from one dashboard to another via filter drilldown, then the filter will be set on the target dashboard, but if there is an input control on the same field it won't be picked up as the "current value". So the user ends up with a regular filter pill and an empty select box - if they select something from the select box, a second filter pill is added, effectively filtering out everything. The user would need to remove the filter pill first and reselect which is pretty confusing.
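
Why the second filter pill empties the dashboard can be shown with a minimal sketch, assuming dashboard filters on the same field are combined with AND. The documents, field values, and `apply_filters` helper are hypothetical.

```python
def apply_filters(docs, filters):
    """Keep documents that match every (field, value) filter, i.e. filters are ANDed."""
    return [d for d in docs if all(d.get(field) == value for field, value in filters)]


docs = [{"host.name": "web-1"}, {"host.name": "db-1"}]

drilldown_pill = ("host.name", "web-1")     # set by the filter drilldown
control_selection = ("host.name", "db-1")   # picked later in the input control

print(apply_filters(docs, [drilldown_pill]))                     # [{'host.name': 'web-1'}]
print(apply_filters(docs, [drilldown_pill, control_selection]))  # []
```

No document can carry two different values for the same single-valued field, so the combined filter matches nothing until the original pill is removed.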

A couple questions on this feature:
Is it available yet?
Would it be possible to both show the overall average in the time range and the last value?

I merged it yesterday, so it will be in 8.5

We can show the overall average and the very last value (top metric of the field sorted by timestamp descending) - would that make sense?
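
As a sketch of what the two values mean, the following computes both per host from raw samples: the average over the whole time range, and the value of the document with the newest timestamp. It is an illustration only, not how Lens evaluates it; the function name and sample data are assumptions.

```python
from statistics import mean


def average_and_last(samples):
    """Return {host: (average over all samples, value at the newest timestamp)}.

    samples: iterable of (timestamp, host, value) tuples.
    """
    by_host = {}
    for ts, host, value in samples:
        by_host.setdefault(host, []).append((ts, value))
    rows = {}
    for host, points in by_host.items():
        avg = mean(value for _, value in points)
        last = max(points, key=lambda p: p[0])[1]  # sort by timestamp descending, take first
        rows[host] = (avg, last)
    return rows


samples = [(0, "web-1", 0.25), (1, "web-1", 0.5), (2, "web-1", 0.75)]
print(average_and_last(samples))  # {'web-1': (0.5, 0.75)}
```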

@joshdover
Contributor

We can show the overall average and the very last value (top metric of the field sorted by timestamp descending) - would that make sense?

I think if there were two separate columns and clearly labeled it would make sense ("Average CPU" and "Last CPU"?). We could compare this to what other visualizations are doing in other packages but I think we're trying to define what should be the best practice pattern here and I'm not sure looking at what we've done elsewhere is helpful.

WDYT?

@flash1293
Contributor Author

Agreed. I like "average" vs "last" - it's less vague than "average over the last few seconds" which is what TSVB is doing at the moment. Gonna update the PR

@flash1293
Contributor Author

@joshdover Split up all the metrics in the tables into "Average" and "Last value":
Screenshot 2022-08-12 at 14 47 26
Screenshot 2022-08-12 at 14 47 40

It seems helpful to me - the space is there to show another value and it's better information than just the current value or just the average

Contributor

@joshdover joshdover left a comment

LGTM - did not manually test the latest iteration. Thanks for all your work on this @flash1293! 🎉

Contributor

@nimarezainia nimarezainia left a comment

Thanks for these changes. I haven't been able to personally view the changes, but based on the discussions it looks like a great improvement. Once I have access to 8.5, I'll provide more feedback if required.

@joshdover
Contributor

@nimarezainia As mentioned, we can have some folks test this before we merge these changes. This can be done by downloading these two files and then importing them into Kibana from Stack Management > Saved Objects. I'd recommend creating a test space to do this in since it will override the dashboards from the integration that is currently installed:

https://raw.githubusercontent.com/flash1293/integrations/system-dashboard-rework/packages/system/kibana/dashboard/system-79ffd6e0-faa0-11e6-947f-177f697178b8.json
https://raw.githubusercontent.com/flash1293/integrations/system-dashboard-rework/packages/system/kibana/dashboard/system-Metrics-system-overview.json

@flash1293
Contributor Author

Slight correction - the files can’t be imported directly as they aren’t in the right format (elastic-package is doing some transformations). I will provide a proper export tomorrow and send it around.

@flash1293
Contributor Author

OK, as discussed offline I reverted the Lens tables on the system overview dashboard back to the TSVB top n visualizations for one-click-drilldown functionality (with adjusted color schemes):
Screenshot 2022-08-31 at 16 36 08

@joshdover @nimarezainia The latest state as importable file can be found here: https://gist.githubusercontent.com/flash1293/d3a8b167ad91576f9c9a770d163e1b20/raw/505cfe2a74083f957f29e260a435bc14be0560b3/export.ndjson

Save this link as an ndjson file, then go to Stack management > Saved object management and import it there (should work for every instance which is receiving system metric data via integration >= 8.1.0)
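
Before importing, it can help to check what the saved file contains. The sketch below counts objects per type in an ndjson export. It assumes each line is one JSON object, that saved objects carry a `type` field, and that any line without one (such as a trailing export summary) can be skipped; the sample text is made up and is not the content of the gist above.

```python
import json
from collections import Counter


def summarize_export(ndjson_text):
    """Count saved objects per type in an ndjson export, skipping lines without a type."""
    counts = Counter()
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        obj = json.loads(line)
        if "type" in obj:
            counts[obj["type"]] += 1
    return dict(counts)


# Hypothetical two-object export followed by a summary line.
sample = "\n".join([
    json.dumps({"type": "dashboard", "id": "overview", "attributes": {}}),
    json.dumps({"type": "visualization", "id": "top-hosts", "attributes": {}}),
    json.dumps({"exportedCount": 2}),
])
print(summarize_export(sample))  # {'dashboard': 1, 'visualization': 1}
```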

@ruflin
Contributor

ruflin commented Sep 12, 2022

@joshdover @cmacknz @nimarezainia @jlind23 It would be great to get this over the line. This is not only about the system dashboard itself which is a huge improvement but it will also serve as an example for many other integrations on how we should build the dashboards.

@joshdover
Contributor

I'm good with this being merged. @nimarezainia did you still want to get additional feedback from SAs or are you happy with the modifications @flash1293 made?

@nimarezainia
Contributor

@joshdover sorry I haven't been able to take care of this. if you are all happy let's merge and I will try and find SAs to review it.

Contributor

@drewdaemon drewdaemon left a comment

Beautiful work. Only one comment from my side.

I think the directions in the System Overview markdown panel are out of date since the tables got switched back to TSVB.
Screen Shot 2022-09-23 at 9 09 33 AM

Also, "table below" should probably be changed to "tables below."

Approving anyway so as not to hold this PR up.

@elasticmachine

elasticmachine commented Sep 26, 2022

🚀 Benchmarks report

Package system 👍(1) 💚(1) 💔(1)

| Data stream | Previous EPS | New EPS | Diff (%) | Result |
| --- | --- | --- | --- | --- |
| syslog | 16393.44 | 9433.96 | -6959.48 (-42.45%) | 💔 |

To see the full report comment with /test benchmark fullreport

@joshdover
Contributor

@flash1293 I say we ship this. If we get user complaints, it's not hard to revert and release another update.

@flash1293
Contributor Author

Alright, I just removed the unused visualizations - if the build goes green I'm going to merge.

@flash1293 flash1293 merged commit a57592a into elastic:main Sep 27, 2022
@flash1293
Contributor Author

@cmacknz could you take it from here in terms of releasing?

@cmacknz
Member

cmacknz commented Sep 29, 2022

Yes I can promote the integration, thanks!

Labels
enhancement New feature or request
7 participants