Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bugfix][Frontend] Fix missing /metrics endpoint #6463

Merged
merged 5 commits into from
Jul 19, 2024

Conversation

DarkLight1337
Copy link
Member

@DarkLight1337 DarkLight1337 commented Jul 16, 2024

Mount does not work on router, so I'm reverting the change in #5090 so that mount is performed on the application object. Since app is no longer global, I've moved the code into a function to be called after app is constructed.

To avoid similar issues in the future, I've added some tests to check that the endpoints remain accessible.

FIX #6461

Copy link

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only trigger fastcheck CI to run, which consists only a small and essential subset of tests to quickly catch errors with the flexibility to run extra individual tests on top (you can do this by unblocking test steps in the Buildkite run).

Full CI run is still required to merge this PR so once the PR is ready to go, please make sure to run it. If you need all test signals in between PR commits, you can trigger full CI as well.

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@DarkLight1337 DarkLight1337 marked this pull request as draft July 16, 2024 07:17
@DarkLight1337
Copy link
Member Author

@robertgshaw2-neuralmagic can you help update the test to check #4523? Since I haven't worked with prometheus metrics before, I'm worried that I might break something with this refactor.

@DarkLight1337 DarkLight1337 marked this pull request as ready for review July 16, 2024 07:30
@simon-mo simon-mo added the ready ONLY add when PR is ready to merge/full CI is needed label Jul 19, 2024
@simon-mo simon-mo enabled auto-merge (squash) July 19, 2024 02:00
@simon-mo simon-mo merged commit 6366efc into vllm-project:main Jul 19, 2024
86 checks passed
@DarkLight1337
Copy link
Member Author

@robertgshaw2-neuralmagic since this PR has been merged already, you can open a new PR if you need to update the tests to be more robust.

@DarkLight1337 DarkLight1337 deleted the fix-metrics-endpoint branch July 19, 2024 04:00
dtrifiro added a commit to opendatahub-io/vllm-tgis-adapter that referenced this pull request Jul 19, 2024
dtrifiro added a commit to opendatahub-io/vllm-tgis-adapter that referenced this pull request Jul 19, 2024
github-merge-queue bot pushed a commit to opendatahub-io/vllm-tgis-adapter that referenced this pull request Jul 19, 2024
github-merge-queue bot pushed a commit to opendatahub-io/vllm-tgis-adapter that referenced this pull request Jul 19, 2024
dtrifiro added a commit to opendatahub-io/vllm-tgis-adapter that referenced this pull request Jul 19, 2024
@xfalcox
Copy link

xfalcox commented Jul 22, 2024

Can we get a new docker release with this fix?

xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 24, 2024
gnpinkert pushed a commit to gnpinkert/vllm that referenced this pull request Jul 26, 2024
@ZeroYuJie
Copy link

v0.5.4 docker release still missing /metrics endpoint

@ZeroYuJie
Copy link

I try use main branch to build docker image,The metrics information has become completely unavailable,no longer contains any statistical information related to the model.

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 7313.0
python_gc_objects_collected_total{generation="1"} 5988.0
python_gc_objects_collected_total{generation="2"} 45.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 922.0
python_gc_collections_total{generation="1"} 83.0
python_gc_collections_total{generation="2"} 6.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="10",patchlevel="14",version="3.10.14"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.3022793728e+010
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 7.00747776e+08
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.72291887821e+09
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 18.240000000000002
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 22.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06

@DarkLight1337
Copy link
Member Author

DarkLight1337 commented Aug 6, 2024

I try use main branch to build docker image,The metrics information has become completely unavailable,no longer contains any statistical information related to the model.

Can you revert to the commit from this PR to ensure that the problem isn't caused by subsequent commits? (Notably #6883)

cc @robertgshaw2-neuralmagic

@frittentheke
Copy link
Contributor

The fix https://github.com/vllm-project/vllm/pull/6463/files for the former issue #6461 also contained tests to ensure metrics are there. Are they not being used @DarkLight1337 ?

@DarkLight1337
Copy link
Member Author

The test only checks that the metrics endpoint exists. It doesn't check which metrics are being exposed.

@frittentheke
Copy link
Contributor

frittentheke commented Aug 6, 2024

The test only checks that the metrics endpoint exists. It doesn't check which metrics are being exposed.

That might be a cool thing to test then 🙄 ;-)

cduk pushed a commit to cduk/vllm-pascal that referenced this pull request Aug 6, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready ONLY add when PR is ready to merge/full CI is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: No metrics exposed at /metrics with 0.5.2 (0.5.1 is fine), possible regression?
5 participants