RHELAI-2756 Adding Adaptive Throttling #476

Conversation

@eshwarprasadS (Contributor) commented Jan 10, 2025

Why Adaptive Throttling?

Fixes #424 (Control maximum parallel requests to inference server)

The Problem:

If too many requests are sent concurrently to the server, it might:

  • Run out of GPU memory (VRAM) or other resources.
  • Fail to respond to requests within the timeout period.

Reducing concurrency manually (batch_num_workers) is a workaround, but it is static and does not adapt to server conditions.

The Solution:

  • Monitor the server's behavior (e.g., response success or failure).
  • Dynamically increase or decrease concurrency based on server feedback.

Key Components in the Code

The AdaptiveThrottler Class

This class encapsulates the logic for adjusting concurrency (a sketch follows the parameter list below):

Parameters:

        min_workers: Minimum number of concurrent workers (never go below this).
        max_workers: Maximum number of concurrent workers; matches num_cpus from instructlab core.
        initial_workers: Starting number of workers; begins at 50% of max_workers.
        tolerance: Fraction by which to reduce the worker count when errors occur.
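
For illustration, a minimal sketch of how such a throttler could work. This is a reconstruction under assumptions, not the PR's actual code: the additive-increase/multiplicative-decrease rule and the exact adjust_workers signature (seen later in this thread) are inferred.

```python
import threading

class AdaptiveThrottler:
    """Sketch: tracks a concurrency ceiling that reacts to request outcomes."""

    def __init__(self, min_workers=1, max_workers=8, initial_workers=None, tolerance=0.25):
        self.min_workers = min_workers
        self.max_workers = max_workers
        # Start at 50% of the maximum unless an explicit value is given.
        self.current_workers = initial_workers or max(min_workers, max_workers // 2)
        self.tolerance = tolerance  # fraction of workers to shed on error
        self._lock = threading.Lock()

    def adjust_workers(self, success: bool) -> int:
        """Grow by one on success; shed a `tolerance` fraction on failure."""
        with self._lock:
            if success:
                self.current_workers = min(self.max_workers, self.current_workers + 1)
            else:
                reduced = int(self.current_workers * (1 - self.tolerance))
                self.current_workers = max(self.min_workers, reduced)
            return self.current_workers
```

This shape probes upward slowly while backing off quickly when the server shows distress, the same idea TCP uses for congestion control.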

@mergify (bot) added the testing (Relates to testing) and ci-failure labels on Jan 10, 2025
@bbrowning (Contributor) commented:

I pulled this locally and attempted to verify this fix. What I did was run a Mock OpenAI server locally (using https://github.com/polly3d/mockai with a response delay set to 30 seconds) and then edited my generate_data.py to set client.timeout = 5.0 early in the method, to give a 5 second timeout to our OpenAI client.
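
(For reference, the timeout tweak amounts to roughly the following; the mock server's base_url and port here are assumptions:)

```python
from openai import OpenAI

# Point the client at the local mock server (assumed endpoint).
client = OpenAI(base_url="http://localhost:8080/v1", api_key="mock")
# Force a 5-second client timeout so the mock's 30-second response
# delay reliably trips it on every request.
client.timeout = 5.0
```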

Running ilab data generate after that, I get:

WARNING 2025-01-14 09:33:31,371 instructlab.sdg.pipeline:214: Retryable error in pipeline batch generation: Connection error.
failed to generate data with exception: Unable to concatenate an empty list of datasets.

So, I'm hitting the new logging warning about Retryable errors. And, that should be adjusting the throttler down. But, it doesn't appear to actually be retrying that failed request that timed out?

@eshwarprasadS (Contributor, Author) commented Jan 14, 2025

> I pulled this locally and attempted to verify this fix. What I did was run a Mock OpenAI server locally (using https://github.com/polly3d/mockai with a response delay set to 30 seconds) and then edited my generate_data.py to set client.timeout = 5.0 early in the method, to give a 5 second timeout to our OpenAI client.
>
> Running ilab data generate after that, I get:
>
> WARNING 2025-01-14 09:33:31,371 instructlab.sdg.pipeline:214: Retryable error in pipeline batch generation: Connection error.
> failed to generate data with exception: Unable to concatenate an empty list of datasets.
>
> So, I'm hitting the new logging warning about Retryable errors. And, that should be adjusting the throttler down. But, it doesn't appear to actually be retrying that failed request that timed out?

Hmm. Good question @bbrowning. From my understanding, the logs you got indicate that this code path was activated:

if isinstance(root_exception, (requests.exceptions.RequestException, TimeoutError, OpenAIError)):
    # Retryable errors
    logger.warning("Retryable error in pipeline batch generation: %s", root_exception)
    throttler.adjust_workers(success=False)

From our collective understanding, OpenAIError is potentially thrown back to us only after the retries have already been attempted (correct me if I am wrong). In that case, I think the logging message could be clearer, but the behavior is as expected(?): the request that was sent through the OpenAI client was retried a couple of times with backoff, still failed, and gracefully died without succeeding, raising its OpenAIError (I am assuming), and we caught and reported it as such.
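
(For context on that retry behavior: the OpenAI Python client retries certain failures internally, with exponential backoff, before raising; the budget is controlled by max_retries. An illustrative snippet, not code from this PR:)

```python
from openai import OpenAI

# The v1 client retries connection errors and retryable HTTP statuses
# (408, 409, 429, 5xx) with exponential backoff before surfacing an
# APIConnectionError / APIStatusError (both subclasses of OpenAIError).
client = OpenAI(
    base_url="http://localhost:8080/v1",  # hypothetical local server
    api_key="not-needed-for-mock",
    max_retries=2,  # the default; raise it to retry more before giving up
)
```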

My understanding of this behavior came from the Perf. team, who said:

> My understanding of the code is that if we have num-cpus greater than 1 and one of the threads hits the timeout or any other error, it would just gracefully shut down that thread without affecting the data generation in the other threads before finally doing the data mixing.

Did your generation 'fail' entirely at the end, or did you see these warnings while the generation still completed successfully?

@eshwarprasadS force-pushed the RHELAI-2756-adding-adaptive-throttling branch from 9ee943d to 31fa4ba on January 14, 2025 at 21:02
@mergify (bot) removed the ci-failure label on Jan 14, 2025
@bbrowning (Contributor) commented:

My generation did fail entirely at the end, because it produced zero output. The requests that timed out were ultimately "lost": after we received a timeout, those requests were dropped and no further data generation was attempted from that input sample.

@bbrowning (Contributor) commented:

Do we want to leave this PR open for now? Or do we want to close it with the expectation that #484 fixes the underlying issue we observed?

@eshwarprasadS (Contributor, Author) commented:

> Do we want to leave this PR open for now? Or do we want to close it with the expectation that #484 fixes the underlying issue we observed?

Yes, let's close this, since #484 already addresses the issue.
