Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow Presto Coordinator to ignore (not throw) negative runtime metrics. #23077

Open
spershin opened this issue Jun 25, 2024 · 1 comment
Open

Comments

@spershin
Copy link
Contributor

We recently had this situation:
Coordinator's log was full of exception call stacks:
java.lang.IllegalArgumentException: size is negative
at io.airlift.units.Preconditions.checkArgument(Preconditions.java:26)
at io.airlift.units.DataSize.(DataSize.java:58)
at io.airlift.units.DataSize.succinctDataSize(DataSize.java:48)
at io.airlift.units.DataSize.succinctBytes(DataSize.java:43)
...
at com.facebook.presto.execution.scheduler.SqlQueryScheduler.getBasicStageStats(SqlQueryScheduler.java:853)
...
at com.facebook.presto.dispatcher.DispatchManager.getQueries(DispatchManager.java:392)
...

Clients report that 'query is gone' or something similar.
Coordinator UI is unresponsive.
CPU is unused.

All this was caused by a negative metric (rawInputDataSize) returned by native worker, which has been fixed already.

Expected Behavior or Use Case

It might make sense not to be so strict about runtime metrics and stats and still allow the query and endpoints requesting query stats/state to go through the usual process.

Presto Component, Service, or Connector

Not sure

Possible Implementation

  • Remove the checks (they are, unfortunately in the AirLift (io/airlift/units/DataSize.java) repo.
  • Catch the exception and fallback to something (probably too expensive and will break the work flow).

Example Screenshots (if appropriate):

Context

A simple mistake in a single metric can put Presto cluster into some weird state where queries fail with 'query is gone' error and Coordinator UI is unresponsive.
Would be nice to make Presto more resilient to such problems.

@elharo
Copy link
Contributor

elharo commented Jun 29, 2024

I propose option 3: catch and handle the exception, or maybe just fail the query, but definitely catch it. Don't remove the checks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants