[RFC][QSB] Approached to enforcement of system resource limits #11846

Open
kaushalmahi12 opened this issue Jan 10, 2024 · 2 comments
Assignees
Labels
discuss Issues intended to help drive brainstorming and decision making enhancement Enhancement or improvement to existing feature or request RFC Issues requesting major changes Roadmap:Stability/Availability/Resiliency Project-wide roadmap label Search:Resiliency

Comments

@kaushalmahi12
Contributor

kaushalmahi12 commented Jan 10, 2024

Author: Kaushal Kumar

Is your feature request related to a problem? Please describe

This is a meta request for the QSB feature request, opened to gather community feedback on the possible approaches.

Describe the solution you'd like

Approaches to Enforce Resource Limits

I can think of two basic ways to enforce resource consumption limits. The first allocates and maintains a fixed share of a resource for each sandbox, while the second is more flexible so that it makes optimum use of the available resources.

  1. Reserved - With this approach we assign a fixed percentage of a resource to each sandbox. The limits of all sandboxes together must not exceed 100. With this approach a sandbox can trigger cancellations as soon as it hits its own limit, even when multiple other sandboxes are underutilised.
    • Pros
      1. Cancellation becomes a bit easier, as we only need to cancel when a sandbox exceeds its limit.
      2. We no longer need to track sandbox resource usage cumulatively, and we need not employ a hierarchical topology for sandboxes.
      3. Sandboxes need not have a priority.
      4. Overall efficacy of system resources is higher, since #rejections > #cancellations.
    • Cons
      1. This can lead to system-wide underutilisation of resources.
      2. There is additional overhead in validating the individual sandbox resource limits each time a user creates a sandbox.
  2. Constrained - With this approach we assign each sandbox a limit that we will always honor. To make optimum use of the available resources, however, the limits across all sandboxes need not add up to the system-level duress limit for a resource. This creates the problem of choosing which sandbox should have its queries cancelled. To solve it, sandboxes will have a priority that decides the outcome when contention happens.
    • Pros
      1. Optimum use of available system resources.
    • Cons
      1. It can cause more cancellations than rejections if not configured properly (free-flowing limits, e.g. every sandbox configured with the max limit).
      2. It is complex to maintain a tree topology and priority-based cancellation in case of contention.
      3. Efficacy of system resources is lower, as #cancellations > #rejections in cases where none of the sandboxes hits its configured limit but cumulatively they put the node under duress. Cancelling a task wastes the resources already spent on its progress so far.

Let's understand them with the help of some examples. For the sake of simplicity I am using only a single value for each resource limit, but there will be two limits for each system resource: low and high.

Constrained

Let's say we have 3 sandboxes in the system

  • Sandbox1 - { ResourceLimit: 60, Priority: 1}
  • Sandbox2 - { ResourceLimit: 20, Priority: 3}
  • Sandbox3 - { ResourceLimit: 40, Priority: 2}

System wide resource limit: 90

Let's capture the current resource usage of the sandboxes at different times.

Cancellation Case: sandbox limit breached

Time  Sandbox1  Sandbox2  Sandbox3
T1    40        10        30
T2    50        25        10

At T2 Sandbox2 (usage 25 against a limit of 20) will start rejecting new requests for this sandbox and cancel some running ones.

Cancellation Case: system limit breached

Time  Sandbox1  Sandbox2  Sandbox3
T1    40        10        30
T2    50        15        30

At T2 the cumulative usage (50 + 15 + 30 = 95) breaches the system-wide limit of 90, so cancellation will follow. It means that Sandbox2 will face cancellation even though no sandbox-level limit is breached here.
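
The system-limit case can be sketched roughly as follows. This is only an illustration, not the proposed implementation: the `Sandbox` class and `pick_cancellation_victim` function are hypothetical names, and the sketch assumes that a larger `Priority` number means a lower priority (inferred from Sandbox2, with Priority 3, being described as the lowest-priority sandbox).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Sandbox:
    # Hypothetical model of a sandbox, mirroring the fields in the example.
    name: str
    resource_limit: int
    priority: int  # assumption: larger number = lower priority


SYSTEM_LIMIT = 90  # system-wide resource limit from the example

SANDBOXES = [
    Sandbox("Sandbox1", 60, 1),
    Sandbox("Sandbox2", 20, 3),
    Sandbox("Sandbox3", 40, 2),
]


def pick_cancellation_victim(
    sandboxes: List[Sandbox],
    usage: Dict[str, int],
    system_limit: int = SYSTEM_LIMIT,
) -> Optional[Sandbox]:
    """Return the sandbox to cancel from, or None if the system limit holds."""
    if sum(usage.values()) <= system_limit:
        return None
    # Only sandboxes that are actually using the resource can give some back.
    active = [s for s in sandboxes if usage.get(s.name, 0) > 0]
    # Under contention, the lowest-priority active sandbox is chosen.
    return max(active, key=lambda s: s.priority, default=None)


t1 = {"Sandbox1": 40, "Sandbox2": 10, "Sandbox3": 30}  # total 80
t2 = {"Sandbox1": 50, "Sandbox2": 15, "Sandbox3": 30}  # total 95

print(pick_cancellation_victim(SANDBOXES, t1))       # None
print(pick_cancellation_victim(SANDBOXES, t2).name)  # Sandbox2
```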

Rejection Case:

Time  Sandbox1  Sandbox2  Sandbox3
T1    40        10        30
T2    35        22        30

In this case Sandbox2 will face rejections because its sandbox-level limit is breached (22 against a limit of 20).

Reserved

Let's say we have 3 sandboxes in the system

  • Sandbox1 - { ResourceLimit: 50, Priority: 1}
  • Sandbox2 - { ResourceLimit: 20, Priority: 3}
  • Sandbox3 - { ResourceLimit: 30, Priority: 2}

The sandbox limits in this example are chosen so that the resource limits of all sandboxes sum up to 100, as is inherent in the approach.

Cancellation Case: sandbox limit breached

Time  Sandbox1  Sandbox2  Sandbox3
T1    40        10        30
T2    40        20 (1)    20
T3    45        25 (2)    10

(1) At this point Sandbox2 will start rejecting new incoming requests.
(2) At this point we will also start cancelling running requests from Sandbox2 due to the sandbox-level resource limit breach.
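
The two footnotes can be read as a small decision rule per sandbox. The sketch below is one possible reading, using the single simplified limit from the example rather than the separate low and high limits the real design would have. The `Action` enum and `reserved_action` function are hypothetical names introduced only for illustration.

```python
from enum import Enum


class Action(Enum):
    ADMIT = "admit"
    REJECT_NEW = "reject_new"
    REJECT_NEW_AND_CANCEL = "reject_new_and_cancel"


def reserved_action(usage: int, limit: int) -> Action:
    """Decide what a sandbox does, given its own usage and its reserved limit."""
    if usage > limit:
        # Footnote (2): the limit is breached, so running requests are cancelled too.
        return Action.REJECT_NEW_AND_CANCEL
    if usage == limit:
        # Footnote (1): the limit is reached, so new requests are rejected.
        return Action.REJECT_NEW
    return Action.ADMIT


SANDBOX2_LIMIT = 20
sandbox2_usage = {"T1": 10, "T2": 20, "T3": 25}

for time, usage in sandbox2_usage.items():
    print(time, reserved_action(usage, SANDBOX2_LIMIT).value)
```

Note that the decision depends only on the sandbox's own usage, which is what makes cancellation simpler in the reserved approach.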

Cancellation Case: system level limit breached

Time  Sandbox1  Sandbox2  Sandbox3
T1    40        10        30
T2    50        15        25
T3    50        18        30

In this case Sandbox2 will start cancelling requests because it is the lowest-priority sandbox.

Decision-driving factors for selecting one of the above approaches

  • We want to improve the overall efficacy of system resources, which means we should avoid wasting resources on tasks that can potentially shoot beyond the enforced limits. Basically we will favor rejections over cancellations.
  • Our system should try its best to honor the user-assigned limits for these sandboxes even though this can lead to underutilisation in the system. For example, let's say there are 3 sandboxes in the system with limits of 60, 20 and 10 respectively. There might be a time when only sandbox 2 has traffic and is inundated with it; it will then start rejecting requests even though the system is still not under duress.
  • At any point in time the limit assigned to a sandbox should be honored. For example, a sandbox should never face cancellation or rejection until its defined limit is breached.

Personal Verdict

  • We will go ahead with the reserved approach for enforcing resource limits, considering the above points.

Problems with the selected approach to enforce system resource limits and possible solutions

The only ambiguity with this approach is how to keep the cumulative resource limit within 100, since the user can supply any arbitrary value for a new sandbox.

To understand this with the help of an example, let's say that at some point in time we have 3 sandboxes in the system

  • sandbox1: { limit: 40 }
  • sandbox2: { limit: 30 }
  • sandbox3: { limit: 20 }

Now let's say the user wants to create a new sandbox with a resource limit of 30. The new cumulative sum would become 120 (> 100). This warrants either readjusting the existing sandbox limits or creating the new sandbox with a limit of 10.
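
The arithmetic behind this can be checked with a small sketch. The helper names `headroom` and `fits` are hypothetical, and the total of 100 is taken from the reserved approach described above.

```python
from typing import Dict

TOTAL_CAPACITY = 100  # cumulative cap inherent in the reserved approach


def headroom(existing_limits: Dict[str, int], total: int = TOTAL_CAPACITY) -> int:
    """Return how much of the resource is still unassigned."""
    return total - sum(existing_limits.values())


def fits(existing_limits: Dict[str, int], requested: int, total: int = TOTAL_CAPACITY) -> bool:
    """Return True if a new sandbox with the requested limit keeps the sum within the total."""
    return requested <= headroom(existing_limits, total)


existing = {"sandbox1": 40, "sandbox2": 30, "sandbox3": 20}

print(sum(existing.values()) + 30)  # 120, the new cumulative sum
print(headroom(existing))           # 10, the largest limit that still fits
print(fits(existing, 30))           # False
print(fits(existing, 10))           # True
```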

How do we resolve this conflict? I can think of two ways:

  1. We re-adjust the resource limits of the existing sandboxes in the same proportion on the user's behalf. E.g., in the above scenario we can let the new sandbox be created with a limit of 30 * 10/12 = 25 and readjust the other sandbox limits to 40 * 10/12, 30 * 10/12 and 20 * 10/12 (roughly 33.3, 25 and 16.7).
  2. We error out the request to create the new sandbox and ask the user to re-adjust the limits of the existing sandboxes to accommodate the new one.
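
Both options can be sketched side by side. This is a minimal illustration under the assumption that limits are plain numbers capped at a total of 100. `create_with_rescale`, `create_or_error` and `SandboxLimitError` are hypothetical names, not part of any existing API.

```python
from typing import Dict


class SandboxLimitError(ValueError):
    """Raised when a new sandbox would push the cumulative limit past the total."""


def create_with_rescale(existing: Dict[str, float], name: str, limit: float,
                        total: float = 100) -> Dict[str, float]:
    """Option 1: add the sandbox, then shrink all limits in the same proportion."""
    proposed = {**existing, name: limit}
    cumulative = sum(proposed.values())
    if cumulative <= total:
        return proposed
    # Scale every limit by total / cumulative (10/12 in the example).
    return {n: l * total / cumulative for n, l in proposed.items()}


def create_or_error(existing: Dict[str, float], name: str, limit: float,
                    total: float = 100) -> Dict[str, float]:
    """Option 2: refuse the request and leave the readjustment to the user."""
    cumulative = sum(existing.values()) + limit
    if cumulative > total:
        raise SandboxLimitError(
            f"cumulative limit would be {cumulative}, which exceeds {total}; "
            f"lower the existing limits or request at most {total - sum(existing.values())}"
        )
    return {**existing, name: limit}


existing = {"sandbox1": 40, "sandbox2": 30, "sandbox3": 20}

print(create_with_rescale(existing, "sandbox4", 30))

try:
    create_or_error(existing, "sandbox4", 30)
except SandboxLimitError as err:
    print("rejected:", err)
```

With option 1 the user's existing limits change silently, whereas option 2 keeps every configured limit exactly as the user set it and makes the conflict explicit.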

Personally I think the 2nd option provides a better user experience, but I am looking forward to hearing from folks on this.

I am using the Sandbox keyword because we started envisioning this feature with it, but it is not the final name for the construct to be used in the implementation.

Main Issues

Related component

Search:Resiliency

Describe alternatives you've considered

No response

Additional context

No response

@kaushalmahi12 kaushalmahi12 added enhancement Enhancement or improvement to existing feature or request untriaged labels Jan 10, 2024
@kaushalmahi12
Contributor Author

@backslasht @Bukhtawar @msfroh
Can you guys provide your feedback on this?

@kaushalmahi12 kaushalmahi12 changed the title [Meta Isuue][QSB] Approached to enforcement of system resource limits [Meta Issue][QSB] Approached to enforcement of system resource limits Jan 11, 2024
@peternied peternied added RFC Issues requesting major changes and removed untriaged labels Jan 24, 2024
@peternied
Member

[Triage - attendees 1 2]
@kaushalmahi12 Thanks for writing up this great discussion / proposal

@peternied peternied added the discuss Issues intended to help drive brainstorming and decision making label Jan 24, 2024
@peternied peternied changed the title [Meta Issue][QSB] Approached to enforcement of system resource limits [QSB] Approached to enforcement of system resource limits Jan 24, 2024
@peternied peternied changed the title [QSB] Approached to enforcement of system resource limits [RFC][QSB] Approached to enforcement of system resource limits Jan 24, 2024
@andrross andrross added the Roadmap:Stability/Availability/Resiliency Project-wide roadmap label label May 31, 2024
@github-project-automation github-project-automation bot moved this to Planned work items in OpenSearch Roadmap May 31, 2024
@getsaurabh02 getsaurabh02 moved this from 🆕 New to Later (6 months plus) in Search Project Board Aug 15, 2024
Projects
Status: New
Status: Later (6 months plus)
Development

No branches or pull requests

3 participants