RHELAI-2756 Adding Adaptive Throttling #476

Conversation

@eshwarprasadS (Contributor) commented Jan 10, 2025

Why Adaptive Throttling?

Fixes #424 (Control maximum parallel requests to inference server)

The Problem:

If too many requests are sent concurrently to the server, it might:

  • Run out of GPU memory (VRAM) or other resources.
  • Fail to respond to requests within the timeout period.

Reducing concurrency manually (batch_num_workers) is a workaround, but it is static and does not adapt to server conditions.

The Solution:

  • Monitor the server's behavior (e.g., response success or failure).
  • Dynamically increase or decrease concurrency based on server feedback.

Key Components in the Code

The AdaptiveThrottler Class

This class encapsulates the logic for adjusting concurrency (a sketch follows the parameter list below):

Parameters:

        min_workers: Minimum number of concurrent workers (never go below this).
        max_workers: Maximum number of concurrent workers; matches num_cpus from instructlab core.
        initial_workers: Starting number of workers; begins at 50% of max_workers.
        tolerance: Fraction by which to reduce the worker count when errors occur.
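
For illustration, a minimal sketch of how such a throttler could work. This is a reconstruction under assumptions, not the PR's actual code: the additive-increase/multiplicative-decrease rule and the exact adjust_workers signature (seen later in this thread) are inferred.

```python
import threading

class AdaptiveThrottler:
    """Sketch: tracks a concurrency ceiling that reacts to request outcomes."""

    def __init__(self, min_workers=1, max_workers=8, initial_workers=None, tolerance=0.25):
        self.min_workers = min_workers
        self.max_workers = max_workers
        # Start at 50% of the maximum unless an explicit value is given.
        self.current_workers = initial_workers or max(min_workers, max_workers // 2)
        self.tolerance = tolerance  # fraction of workers to shed on error
        self._lock = threading.Lock()

    def adjust_workers(self, success: bool) -> int:
        """Grow by one on success; shed a `tolerance` fraction on failure."""
        with self._lock:
            if success:
                self.current_workers = min(self.max_workers, self.current_workers + 1)
            else:
                reduced = int(self.current_workers * (1 - self.tolerance))
                self.current_workers = max(self.min_workers, reduced)
            return self.current_workers
```

This shape probes upward slowly while backing off quickly when the server shows distress, the same idea TCP uses for congestion control.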

@mergify (bot) added the testing (Relates to testing) and ci-failure labels on Jan 10, 2025
@bbrowning (Contributor) commented:

I pulled this locally and attempted to verify this fix. What I did was run a Mock OpenAI server locally (using https://github.com/polly3d/mockai with a response delay set to 30 seconds) and then edited my generate_data.py to set client.timeout = 5.0 early in the method, to give a 5 second timeout to our OpenAI client.
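
(For reference, the timeout tweak amounts to roughly the following; the mock server's base_url and port here are assumptions:)

```python
from openai import OpenAI

# Point the client at the local mock server (assumed endpoint).
client = OpenAI(base_url="http://localhost:8080/v1", api_key="mock")
# Force a 5-second client timeout so the mock's 30-second response
# delay reliably trips it on every request.
client.timeout = 5.0
```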

Running ilab data generate after that, I get:

WARNING 2025-01-14 09:33:31,371 instructlab.sdg.pipeline:214: Retryable error in pipeline batch generation: Connection error.
failed to generate data with exception: Unable to concatenate an empty list of datasets.

So, I'm hitting the new logging warning about Retryable errors. And, that should be adjusting the throttler down. But, it doesn't appear to actually be retrying that failed request that timed out?

@eshwarprasadS (Contributor, Author) commented Jan 14, 2025

> I pulled this locally and attempted to verify this fix. What I did was run a Mock OpenAI server locally (using https://github.com/polly3d/mockai with a response delay set to 30 seconds) and then edited my generate_data.py to set client.timeout = 5.0 early in the method, to give a 5 second timeout to our OpenAI client.
>
> Running ilab data generate after that, I get:
>
> WARNING 2025-01-14 09:33:31,371 instructlab.sdg.pipeline:214: Retryable error in pipeline batch generation: Connection error.
> failed to generate data with exception: Unable to concatenate an empty list of datasets.
>
> So, I'm hitting the new logging warning about Retryable errors. And, that should be adjusting the throttler down. But, it doesn't appear to actually be retrying that failed request that timed out?

Hmm. Good question @bbrowning. From my understanding, the logs you got indicate that this code path was activated:

if isinstance(root_exception, (requests.exceptions.RequestException, TimeoutError, OpenAIError)):
    # Retryable errors
    logger.warning("Retryable error in pipeline batch generation: %s", root_exception)
    throttler.adjust_workers(success=False)

From our collective understanding, OpenAIError is potentially thrown back to us only after the retries have already been attempted (correct me if I am wrong). In that case, I think the logging message could be clearer, but the behavior is as expected(?): the request that was sent through the OpenAI client was retried a couple of times with backoff, still failed, and gracefully died without succeeding, raising its OpenAIError (I am assuming), and we caught and reported it as such.
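
(For context on that retry behavior: the OpenAI Python client retries certain failures internally, with exponential backoff, before raising; the budget is controlled by max_retries. An illustrative snippet, not code from this PR:)

```python
from openai import OpenAI

# The v1 client retries connection errors and retryable HTTP statuses
# (408, 409, 429, 5xx) with exponential backoff before surfacing an
# APIConnectionError / APIStatusError (both subclasses of OpenAIError).
client = OpenAI(
    base_url="http://localhost:8080/v1",  # hypothetical local server
    api_key="not-needed-for-mock",
    max_retries=2,  # the default; raise it to retry more before giving up
)
```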

My understanding of this behavior came from the Perf. team, who said:

> My understanding of the code is that if we have num-cpus greater than 1 and one of the threads hits the timeout or any other error, it would just gracefully shut down that thread without affecting the data generation in the other threads before finally doing the data mixing.

Did your generation 'fail' entirely at the end, or did you see these warnings while the generation still completed successfully?

@eshwarprasadS force-pushed the RHELAI-2756-adding-adaptive-throttling branch from 9ee943d to 31fa4ba on January 14, 2025 at 21:02
@mergify (bot) removed the ci-failure label on Jan 14, 2025
@bbrowning (Contributor) commented:

My generation did fail entirely at the end, because it produced zero output. The requests that timed out were ultimately "lost": after we received a timeout, those requests were dropped and no further data generation was attempted from that input sample.

@bbrowning (Contributor) commented:

Do we want to leave this PR open for now? Or do we want to close it with the expectation that #484 fixes the underlying issue we observed?

@eshwarprasadS (Contributor, Author) commented:

> Do we want to leave this PR open for now? Or do we want to close it with the expectation that #484 fixes the underlying issue we observed?

Yes, let's close this, since #484 already addresses the issue.
