Conversation

@codeflash-ai codeflash-ai bot commented Nov 11, 2025

📄 24% (0.24x) speedup for CompressedTensorsConfig._quantization_scheme_map_from_config in python/sglang/srt/layers/quantization/compressed_tensors/compressed_tensors.py

⏱️ Runtime : 587 microseconds → 475 microseconds (best of 37 runs)

📝 Explanation and details

The optimization achieves a 23% speedup by eliminating redundant computations and improving data structure efficiency:

Key Optimizations:

  1. Module-level constant creation: Moved _ACTIVATION_QUANTIZATION_FORMATS from inside the function to module level as a set instead of recreating a list on every call. The line profiler shows this eliminated 400+ microseconds spent repeatedly constructing the list and accessing CompressionFormat attributes (33.4% + 28.2% + 27% of original function time).

  2. Set vs List membership testing: Changed from list to set for O(1) vs O(n) membership checks in is_activation_quantization_format.

  3. Loop-invariant hoisting: Cached is_activation_quantization_format(quant_format) and QuantizationType.FLOAT outside the nested loops since quant_format doesn't change during iteration. This eliminates 105 redundant function calls (22.5% of original total time). A minimal sketch of the rewritten pattern follows this list.
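
The gist of the change, as a minimal self-contained sketch: CompressionFormat and build_scheme_map here are stand-ins for the real compressed-tensors enum and for _quantization_scheme_map_from_config, not the actual implementation; only _ACTIVATION_QUANTIZATION_FORMATS and is_activation_quantization_format are names taken from the report above, and the format strings mirror the generated tests below.

    from enum import Enum

    # Stand-in for the compressed-tensors CompressionFormat enum; member names and
    # values are illustrative only.
    class CompressionFormat(Enum):
        int_quantized = "int_quantized"
        float_quantized = "float_quantized"
        naive_quantized = "naive_quantized"

    # Before: a list like this was rebuilt inside the helper on every call.
    # After: built once at module import time, as a set for O(1) membership tests.
    _ACTIVATION_QUANTIZATION_FORMATS = {
        CompressionFormat.int_quantized.value,
        CompressionFormat.float_quantized.value,
        CompressionFormat.naive_quantized.value,
    }

    def is_activation_quantization_format(fmt) -> bool:
        return fmt in _ACTIVATION_QUANTIZATION_FORMATS

    def build_scheme_map(config: dict) -> dict:
        quant_format = config.get("format")
        # Loop-invariant hoisting: the check depends only on quant_format, so
        # evaluate it once instead of once per (config group, target) pair.
        is_activation_quant = is_activation_quantization_format(quant_format)
        scheme_map: dict = {}
        for group in config.get("config_groups", {}).values():
            for target in group.get("targets", []):
                scheme_map[target] = {
                    "weights": group.get("weights"),
                    "input_activations": group.get("input_activations")
                    if is_activation_quant
                    else None,
                }
        return scheme_map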

Performance Impact:

  • The is_activation_quantization_format function time dropped from 425μs to 19μs (95% reduction)
  • Most test cases show 5-35% improvements, with the largest gains on scenarios with many config groups (35% faster on 100 groups)
  • Edge cases with empty configs show minor regressions due to the upfront caching cost, but real workloads with substantial config processing benefit significantly

Why This Matters:
This function processes quantization configurations during model initialization. The nested loops over config groups and targets mean is_activation_quantization_format gets called repeatedly with the same quant_format value, making the caching optimization particularly effective for configurations with multiple target layers.
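
For instance, using the stand-in build_scheme_map from the sketch above with a hypothetical config shaped like the large-scale test further down, the format check runs once rather than 100 times:

    config = {
        "format": "float_quantized",
        "config_groups": {
            f"group{i}": {
                "targets": [f"layer{i}"],
                "weights": {"type": "float"},
                "input_activations": {"type": "float"},
            }
            for i in range(100)
        },
    }

    scheme_map = build_scheme_map(config)
    assert len(scheme_map) == 100  # one entry per target layer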

Correctness verification report:

Test                            Status
⚙️ Existing Unit Tests           🔘 None Found
🌀 Generated Regression Tests    13 Passed
⏪ Replay Tests                  🔘 None Found
🔎 Concolic Coverage Tests       🔘 None Found
📊 Tests Coverage                73.3%
🌀 Generated Regression Tests and Runtime

import pytest
from sglang.srt.layers.quantization.compressed_tensors.compressed_tensors import (
    CompressedTensorsConfig,
)

# QuantizationType stub

class QuantizationType:
    FLOAT = "float"
    INT8 = "int8"
    FP8 = "fp8"

# QuantizationArgs stub with validation

class QuantizationArgs:
    def __init__(self, type, bits=None, group_size=None):
        self.type = type
        self.bits = bits
        self.group_size = group_size

    @classmethod
    def model_validate(cls, d):
        # Simulate pydantic validation: must be dict with 'type'
        if not isinstance(d, dict):
            raise TypeError("QuantizationArgs must be a dict")
        if "type" not in d:
            raise ValueError("Missing 'type' in QuantizationArgs")
        return cls(type=d["type"], bits=d.get("bits"), group_size=d.get("group_size"))

# QuantizationConfig base class stub

class QuantizationConfig:
    def __init__(self):
        self.packed_modules_mapping = {}

from sglang.srt.layers.quantization.compressed_tensors.compressed_tensors import (
    CompressedTensorsConfig,
)

# --- Unit tests ---

# 1. Basic Test Cases

def test_edge_missing_input_activations_with_activation_quantization():
    # Test config where activation quantization is enabled but input_activations is missing
    config = {
        "format": "float_quantized",
        "config_groups": {
            "group1": {
                "targets": ["layer1"],
                "weights": {"type": "float"},
                # No input_activations
            }
        }
    }
    codeflash_output = CompressedTensorsConfig._quantization_scheme_map_from_config(config); result = codeflash_output  # 23.8μs -> 22.8μs (4.78% faster)

def test_edge_missing_weights_raises():
    # Test config missing weights (should raise ValueError from QuantizationArgs.model_validate)
    config = {
        "format": "int_quantized",
        "config_groups": {
            "group1": {
                "targets": ["layer1"],
                # "weights" missing
                "input_activations": {"type": "fp8"}
            }
        }
    }
    with pytest.raises(ValueError):
        CompressedTensorsConfig._quantization_scheme_map_from_config(config)  # 7.82μs -> 8.33μs (6.13% slower)

def test_edge_empty_config_groups():
    # Test config with empty config_groups
    config = {
        "format": "float_quantized",
        "config_groups": {}
    }
    codeflash_output = CompressedTensorsConfig._quantization_scheme_map_from_config(config); result = codeflash_output  # 1.69μs -> 2.64μs (36.0% slower)

def test_edge_targets_empty_list():
    # Test config group with empty targets list
    config = {
        "format": "float_quantized",
        "config_groups": {
            "group1": {
                "targets": [],
                "weights": {"type": "float"},
                "input_activations": {"type": "fp8"}
            }
        }
    }
    codeflash_output = CompressedTensorsConfig._quantization_scheme_map_from_config(config); result = codeflash_output  # 1.63μs -> 2.24μs (27.4% slower)

def test_edge_no_format_key():
    # Test config missing 'format' key
    config = {
        "config_groups": {
            "group1": {
                "targets": ["layer1"],
                "weights": {"type": "float"},
                "input_activations": {"type": "fp8"}
            }
        }
    }
    codeflash_output = CompressedTensorsConfig._quantization_scheme_map_from_config(config); result = codeflash_output  # 24.2μs -> 22.9μs (5.97% faster)

# 3. Large Scale Test Cases

def test_large_scale_many_config_groups():
    # Test config with many config groups (up to 100)
    num_groups = 100
    config_groups = {}
    for i in range(num_groups):
        config_groups[f"group{i}"] = {
            "targets": [f"layer{i}"],
            "weights": {"type": "float"},
            "input_activations": {"type": "fp8"}
        }
    config = {
        "format": "float_quantized",
        "config_groups": config_groups
    }
    codeflash_output = CompressedTensorsConfig._quantization_scheme_map_from_config(config); result = codeflash_output  # 434μs -> 321μs (35.1% faster)
    for i in range(num_groups):
        layer = f"layer{i}"

#------------------------------------------------
from typing import Any, Dict, Optional

# imports

import pytest
from sglang.srt.layers.quantization.compressed_tensors.compressed_tensors import (
    CompressedTensorsConfig,
)

# QuantizationType stub

class QuantizationType:
    FLOAT = "float"
    INT = "int"
    NAIVE = "naive"

# QuantizationArgs stub

class QuantizationArgs:
    def __init__(self, type_: str, bits: int):
        self.type = type_
        self.bits = bits

    @classmethod
    def model_validate(cls, data: Dict[str, Any]):
        # Simple validation for testing
        if not isinstance(data, dict):
            raise ValueError("QuantizationArgs must be a dict")
        type_ = data.get("type")
        bits = data.get("bits")
        if type_ not in [QuantizationType.FLOAT, QuantizationType.INT, QuantizationType.NAIVE]:
            raise ValueError("Invalid type")
        if not isinstance(bits, int) or bits < 0 or bits > 32:
            raise ValueError("Invalid bits")
        return cls(type_, bits)

    def __eq__(self, other):
        return isinstance(other, QuantizationArgs) and self.type == other.type and self.bits == other.bits

    def __repr__(self):
        return f"QuantizationArgs(type={self.type}, bits={self.bits})"

from sglang.srt.layers.quantization.compressed_tensors.compressed_tensors import (
    CompressedTensorsConfig,
)

# --- Unit tests ---

# 1. Basic Test Cases

def test_edge_invalid_weights_type():
    # Invalid weights type should raise ValueError
    config = {
        "format": "int_quantized",
        "config_groups": {
            "group1": {
                "targets": ["layer1"],
                "weights": {"type": "invalid_type", "bits": 8},
                "input_activations": {"type": "int", "bits": 8},
            }
        }
    }
    with pytest.raises(ValueError):
        CompressedTensorsConfig._quantization_scheme_map_from_config(config)  # 20.5μs -> 21.6μs (5.41% slower)

def test_edge_invalid_bits_value():
    # Invalid bits value for weights
    config = {
        "format": "int_quantized",
        "config_groups": {
            "group1": {
                "targets": ["layer1"],
                "weights": {"type": "int", "bits": -1},
                "input_activations": {"type": "int", "bits": 8},
            }
        }
    }
    with pytest.raises(ValueError):
        CompressedTensorsConfig._quantization_scheme_map_from_config(config)  # 13.0μs -> 13.8μs (6.17% slower)

def test_edge_empty_config_groups():
    # No config_groups: should return empty dict
    config = {
        "format": "int_quantized",
        "config_groups": {}
    }
    codeflash_output = CompressedTensorsConfig._quantization_scheme_map_from_config(config); result = codeflash_output  # 1.45μs -> 1.98μs (26.9% slower)

def test_edge_empty_targets():
    # No targets: should not add anything to the result
    config = {
        "format": "int_quantized",
        "config_groups": {
            "group1": {
                "targets": [],
                "weights": {"type": "int", "bits": 8},
                "input_activations": {"type": "int", "bits": 8},
            }
        }
    }
    codeflash_output = CompressedTensorsConfig._quantization_scheme_map_from_config(config); result = codeflash_output  # 1.54μs -> 2.21μs (30.3% slower)

def test_edge_non_dict_weights():
    # weights is not a dict
    config = {
        "format": "int_quantized",
        "config_groups": {
            "group1": {
                "targets": ["layer1"],
                "weights": "not_a_dict",
                "input_activations": {"type": "int", "bits": 8},
            }
        }
    }
    with pytest.raises(ValueError):
        CompressedTensorsConfig._quantization_scheme_map_from_config(config)  # 10.0μs -> 10.9μs (7.76% slower)

def test_edge_non_dict_input_activations():
    # input_activations is not a dict
    config = {
        "format": "int_quantized",
        "config_groups": {
            "group1": {
                "targets": ["layer1"],
                "weights": {"type": "int", "bits": 8},
                "input_activations": "not_a_dict",
            }
        }
    }
    with pytest.raises(ValueError):
        CompressedTensorsConfig._quantization_scheme_map_from_config(config)  # 15.2μs -> 15.4μs (1.68% slower)

# 3. Large Scale Test Cases

To edit these changes git checkout codeflash/optimize-CompressedTensorsConfig._quantization_scheme_map_from_config-mhtxo751 and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 11, 2025 02:09
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: Medium Optimization Quality according to Codeflash labels Nov 11, 2025