[GPU] activations scaling to resolve accuracy issues for infer precision of f16 #27265

Merged: 65 commits merged into openvinotoolkit:master on Jan 14, 2025

@e-ddykim (Contributor) commented Oct 28, 2024

Details:

  • When a model runs with an inference precision of f16, it may produce incorrect results because of the limited range of f16.
  • This PR avoids overflow during computation by scaling down the activations, so that results are correct when the inference precision is f16.
  • A new config property "ACTIVATIONS_SCALE_FACTOR" is introduced, which holds a single floating-point value. For example, if it is 64, activations are divided by 64 before Convolution and MatMul. If it is less than or equal to 0, this feature is disabled.
    • This property can also be set via the rt_info of a model, as shown below.
    <rt_info>
        <runtime_options>
            <ACTIVATIONS_SCALE_FACTOR value="8.0" />
        </runtime_options>
    </rt_info>
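
To illustrate the arithmetic behind the description above, here is a minimal, self-contained sketch. It is not code from this PR: it simulates the f16 range check in plain `float`, and `kF16Max`, `fits_in_f16`, and `scaled_dot` are hypothetical names. Multiplying the result back by the scale factor in a wider precision is an assumption made for the illustration, not something the description states.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Largest finite value representable in IEEE 754 half precision (f16).
constexpr float kF16Max = 65504.0f;

// Returns true if v would stay finite when stored as f16.
inline bool fits_in_f16(float v) {
    return std::fabs(v) <= kF16Max;
}

// Hypothetical helper: dot product with activations divided by `scale` first.
// The returned value is still in the scaled-down domain.
inline float scaled_dot(const std::vector<float>& activations,
                        const std::vector<float>& weights,
                        float scale) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < activations.size(); ++i) {
        acc += (activations[i] / scale) * weights[i];
    }
    return acc;
}
```

With activations {300, 300} and weights {200, 200}, the unscaled dot product is 120000, which exceeds the f16 maximum of 65504. Dividing the activations by 64 gives 1875, which fits, and multiplying by 64 afterwards recovers 120000.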

Tickets:

  • 147052

@e-ddykim requested review from a team as code owners October 28, 2024 02:20
@e-ddykim requested review from itikhono and removed request for a team October 28, 2024 02:20
@github-actions bot added labels Oct 28, 2024: category: inference (OpenVINO Runtime library - Inference); category: GPU (OpenVINO GPU plugin); category: transformations (OpenVINO Runtime library - Transformations); category: CPP API (OpenVINO CPP API bindings)
@geunhwan added this to the 2024.5 milestone Oct 28, 2024
/**
 * @brief This property scales down activations to prevent overflows when inference precision is f16.
 * @ingroup ov_runtime_cpp_prop_api
 */
static constexpr Property<float, PropertyMutability::RW> activations_scale_factor{"ACTIVATIONS_SCALE_FACTOR"};
Contributor

Please add Python bindings for this property.

Contributor

How are users supposed to know which value to set?

Contributor

Mainly experimentally, for now. In the future, we plan to add an RT Info attribute to ov::Model that can be set from optimum pipelines or NNCF (if they add a calibration flow at some point); this attribute will then be converted to the plugin property.

Contributor

Maybe we should merge this feature later, then?

Contributor

The property is enough to solve issues in notebooks or in customers' pipelines. The features I mentioned are needed for a better user experience, but they are not mandatory for delivering improvements to end users.

Contributor

@ilya-lavrenov please take a look
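
On the question above of how to pick a value: since the answer for now is "experimentally", one hypothetical way to turn a measurement into a candidate value is sketched here. It assumes you have already observed, for example in an f32 run, the largest absolute intermediate value that overflows in f16. `suggest_scale_factor` and its `headroom` parameter are illustrative names, not part of OpenVINO, and the returned value would still need to be validated on the actual model.

```cpp
#include <cmath>

// Largest finite value representable in IEEE 754 half precision (f16).
constexpr float kF16Max = 65504.0f;

// Hypothetical helper: returns the smallest power-of-two factor that brings
// `observed_peak` under kF16Max * headroom, or -1 (the "disabled" default)
// when the peak already fits. Powers of two are used because dividing by
// them does not add rounding error in binary floating point, barring underflow.
inline float suggest_scale_factor(float observed_peak, float headroom = 0.5f) {
    const float limit = kF16Max * headroom;
    const float peak = std::fabs(observed_peak);
    if (peak <= limit) {
        return -1.0f;  // no scaling needed
    }
    float scale = 2.0f;
    while (peak / scale > limit) {
        scale *= 2.0f;
    }
    return scale;
}
```

The `headroom` of 0.5 is an arbitrary safety margin chosen for the sketch; a calibration flow like the one mentioned above would presumably derive the margin from data.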

@geunhwan removed this from the 2024.5 milestone Oct 29, 2024
@e-ddykim force-pushed the static_scaling branch 2 times, most recently from 0d7c7cd to bc284f5 October 29, 2024 18:59
@e-ddykim requested a review from a team as a code owner October 29, 2024 18:59
@github-actions bot added the category: Python API (OpenVINO Python bindings) label Oct 29, 2024
@e-ddykim force-pushed the static_scaling branch 2 times, most recently from 8f22485 to ebca03d November 4, 2024 12:40
@github-actions bot removed labels Nov 4, 2024: category: inference (OpenVINO Runtime library - Inference); category: Python API (OpenVINO Python bindings); category: CPP API (OpenVINO CPP API bindings)
@AlexKoff88 (Contributor)

@e-ddykim, please consider this PR: huggingface/optimum-intel#994

@geunhwan added this to the 2025.0 milestone Jan 14, 2025

float activations_scale_factor = config.get_property(ov::hint::activations_scale_factor);

if (activations_scale_factor > 0.f && infer_precision == ov::element::f16 && !enableInt8) {
Contributor

Why is !enableInt8 needed? What if we run a model with hybrid quantization?

Contributor Author

When enableInt8 is true, the activations of Convolution and MatMul are int8, so I thought activations scaling could not be applied in this case. In fact, I ran into an issue when I tested with a resnet50-int8 model. But I agree with your comment that we need to support hybrid quantized models. I think we can do this better once ScaleDownSingleLayer is replaced with the updated LPT passes.

Contributor

As an option, we can move the activation scaling pipeline after the main LPT and match ScaleDownSingleLayer only on nodes that are not in low precision.
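
The difference between the condition under review and the alternative suggested in this thread can be sketched as follows. This is an illustration only: `Precision` is a hypothetical stand-in for the real element types, both function names are invented, and treating `i8` as the marker for a low-precision node is an assumption.

```cpp
// Hypothetical stand-in for the element types involved; not the OpenVINO enum.
enum class Precision { f32, f16, i8 };

// Mirrors the condition quoted in the review: scaling is applied only when the
// factor is positive, inference precision is f16, and int8 is not enabled.
inline bool scaling_enabled_for_model(float scale_factor,
                                      Precision infer_precision,
                                      bool enable_int8) {
    return scale_factor > 0.0f && infer_precision == Precision::f16 && !enable_int8;
}

// Sketch of the suggested alternative: decide per node and skip only nodes
// that already run in low precision (assumed here to mean i8).
inline bool scaling_enabled_for_node(float scale_factor,
                                     Precision infer_precision,
                                     Precision node_precision) {
    return scale_factor > 0.0f && infer_precision == Precision::f16 &&
           node_precision != Precision::i8;
}
```

Under the first predicate a hybrid quantized model gets no scaling at all; under the second, its f16 nodes would still be scaled while its int8 nodes are left alone.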

@@ -61,7 +61,7 @@ void ExecutionConfig::set_default() {
     std::make_tuple(ov::hint::kv_cache_precision, ov::element::undefined),
     std::make_tuple(ov::intel_gpu::hint::enable_kernels_reuse, false),
     std::make_tuple(ov::weights_path, ""),
-    std::make_tuple(ov::hint::activations_scale_factor, 0.f),
+    std::make_tuple(ov::hint::activations_scale_factor, -1.f),
Contributor

Don't we need to re-enable scale factor reading from RT info?

Contributor Author

The current implementation of activations scaling causes a significant performance drop for LLMs on the oneDNN path, and most LLM IRs already have rt_info now. So I think it would be safer to re-enable it after resolving the perf issue.

Contributor

But it means that models which really need scaling (flux, sd) won't work out of the box. How big is the perf drop for LLMs with current impl?

Contributor Author

My test results showed about a 2x perf drop, and the drop was bigger on faster devices. I'm working on resolving this issue and hope to fix it before the next timeline.

@vladimir-paramuzov (Contributor) left a comment

Overall, LGTM. Please enable scaling by default for dGPU and support models with hybrid quantization later.

@vladimir-paramuzov added this pull request to the merge queue Jan 14, 2025
Merged via the queue into openvinotoolkit:master with commit cc67ad1 Jan 14, 2025
185 checks passed
MirceaDan99 pushed a commit to MirceaDan99/openvino that referenced this pull request Jan 22, 2025
…ion of f16 (openvinotoolkit#27265)

Co-authored-by: Andrew Park
Labels
category: GPU (OpenVINO GPU plugin)
category: LP transformations (OpenVINO Low Precision transformations)
category: transformations (OpenVINO Runtime library - Transformations)
Code Freeze