Add GPU note on Efficiency dashboard #1156

Merged
merged 10 commits into from
Jan 4, 2025

Conversation

Contributor

@mmurph3 mmurph3 commented Dec 2, 2024

Related Issue

Proposed Changes

  • Makes minor changes to the GPU Savings Insights page.
  • Adds a note to the Efficiency dashboard page to help troubleshoot the missing GPU column.

@mmurph3 mmurph3 requested a review from a team as a code owner December 2, 2024 20:55
Contributor

@chipzoller chipzoller left a comment

First sentence isn't exactly true. Second one belongs on the Efficiency page.

@thomasvn
Member

thomasvn commented Dec 2, 2024

Thanks @mmurph3 for starting this! This is an effort to address some confusion raised by users in SUP-6416.

Agree with Chip here that the first bullet point isn't necessarily true. It's probably safe to remove it.

The second bullet point is good. How about we modify it slightly to the following?

## Troubleshooting

### Kubecost dashboards not showing GPU Efficiency or GPU Savings

In order for Kubecost to begin displaying GPU features, it must first detect that **at least one** of your clusters has a nonzero amount of GPU usage. Please validate that DCGM-Exporter is running in the clusters which have GPUs and that Kubecost is scraping nonzero GPU metrics from the exporter.
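The "scraping nonzero GPU metrics" check could be sketched roughly as follows. This is a minimal illustration, not Kubecost's own detection logic: it assumes the Prometheus instance that scrapes DCGM-Exporter is reachable over the standard Prometheus HTTP API, and that `DCGM_FI_DEV_GPU_UTIL` is the utilization metric being exported. The helper names and the base URL are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: DCGM-Exporter exposes GPU utilization under this metric name.
GPU_QUERY = "DCGM_FI_DEV_GPU_UTIL"


def has_nonzero_gpu_samples(prom_response: dict) -> bool:
    """Return True if an instant-query response has any sample with value > 0."""
    if prom_response.get("status") != "success":
        return False
    results = prom_response.get("data", {}).get("result", [])
    for series in results:
        # Instant vectors carry "value": [timestamp, "<value as string>"].
        value = series.get("value")
        if not value or len(value) != 2:
            continue
        try:
            if float(value[1]) > 0:
                return True
        except (TypeError, ValueError):
            continue
    return False


def query_prometheus(base_url: str, query: str = GPU_QUERY) -> dict:
    """Run an instant query against the Prometheus HTTP API (hypothetical helper)."""
    params = urllib.parse.urlencode({"query": query})
    url = f"{base_url.rstrip('/')}/api/v1/query?{params}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

With Prometheus port-forwarded locally (the address is an assumption), `has_nonzero_gpu_samples(query_prometheus("http://localhost:9090"))` returning `False` would suggest that either the exporter is not being scraped or every GPU is currently idle.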

It may be good to add this "Troubleshooting" section to this doc here, as well as the Efficiency doc we have. https://docs.kubecost.com/using-kubecost/navigating-the-kubecost-ui/efficiency-dashboard

@chipzoller
Contributor

If we're going to create a public Troubleshooting section specific to GPU, we may want to take this opportunity to build it out more completely à la what I have put together here (internal resource).

@mmurph3
Contributor Author

mmurph3 commented Dec 4, 2024

Made some changes before seeing the most recent comments. If we don't agree with what I wrote, I'm OK with changing or moving it. I agree, a built-out troubleshooting doc would be good.

@chipzoller, for some reason I'm getting a 403 on that internal link you gave: https://app.gitbook.com/o/MQuX6uFwV0j7vIHtR15E/s/xLM07kCOoiNtRubOhU77/customer-nvidia-gpu-troubleshooting#no-gpu-column-in-efficiency-page

I can see about getting access through Cliff.

@thomasvn
Member

thomasvn commented Dec 4, 2024

@mmurph3 These are good changes, but are you sure it's enough? I'm concerned these small add-ons may be missed by some users. Hard to catch the sentence in a long document.

What do you think about adding a "Troubleshooting" section to both these docs, and filling in a bit more detail about what users should do if they are not seeing the GPU efficiency features?

@chipzoller
Contributor

If you guys don't mind, I'd like to take this over and propose some changes here. It just will have to be next week.

@thomasvn
Member

thomasvn commented Dec 5, 2024

@chipzoller Good with me!

srpomeroy and others added 4 commits December 27, 2024 10:44
@chipzoller chipzoller changed the title from "gpu docs update" to "Add GPU note on Efficiency dashboard" Dec 30, 2024
Signed-off-by: chipzoller <[email protected]>
@chipzoller chipzoller requested a review from thomasvn December 30, 2024 15:59
Signed-off-by: chipzoller <[email protected]>
@chipzoller
Contributor

Requesting review from @thomasvn and @mmurph3.

Contributor Author

@mmurph3 mmurph3 left a comment

lgtm

@chipzoller chipzoller merged commit b0609a4 into main Jan 4, 2025
5 checks passed
@chipzoller chipzoller deleted the gpu-metrics branch January 4, 2025 13:06
4 participants