@xtqqczze
Contributor

@xtqqczze xtqqczze commented Nov 3, 2025

Saves an instruction on XARCH: a single `lea` replaces a `mov` + `add` pair.

Example diff:

        mov      edx, edi
        movzx    rdx, word  ptr [r12+2*rdx]
-       mov      esi, edx
-       or       esi, 32
-       add      esi, -97
-       cmp      esi, 25
+       lea      esi, [rdx-0x30]
+       cmp      esi, 9
        setbe    sil
        movzx    rsi, sil
-       add      edx, -48
-       cmp      edx, 9
+       or       edx, 32
+       add      edx, -97
+       cmp      edx, 25
        setbe    dl
        movzx    rdx, dl
        or       edx, esi
-       je       G_M15724_IG11
+       je       SHORT G_M15724_IG11
        inc      edi
        cmp      edi, 8
        jge      G_M15724_IG11
        jmp      SHORT G_M15724_IG14
-						;; size=59 bbWeight=0.64 PerfScore 7.04
+						;; size=53 bbWeight=0.64 PerfScore 7.04
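
For readers following the assembly, the two range checks in the diff can be sketched in C# as below. This is a minimal illustration, not the code changed in this PR: the class and method names (`AsciiChecks`, `LetterThenDigit`, `DigitThenLetter`) are hypothetical, and the assumption that operand order in source maps onto the instruction order shown above depends entirely on the JIT.

```csharp
using System;

// Hypothetical helper for illustration only; not the runtime's implementation.
public static class AsciiChecks
{
    // Unsigned range check: values below '0' wrap around to large uints,
    // so a single comparison covers both bounds (the "cmp ..., 9" above).
    public static bool IsDigit(char c) => (uint)(c - '0') <= 9u;

    // OR-ing 0x20 folds 'A'..'Z' onto 'a'..'z' before the range check
    // (the "or ..., 32" / "add ..., -97" / "cmp ..., 25" sequence above).
    public static bool IsLetter(char c) => (uint)((c | 0x20) - 'a') <= 25u;

    // Letter check first, as on the "-" side of the diff (assumed mapping).
    public static bool LetterThenDigit(char c) => IsLetter(c) | IsDigit(c);

    // Digit check first, as on the "+" side of the diff (assumed mapping).
    public static bool DigitThenLetter(char c) => IsDigit(c) | IsLetter(c);
}
```

Both orderings return the same result for every `char`; only the generated code may differ, which is the point under discussion.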

@github-actions github-actions bot added the needs-area-label label (an area label is needed to ensure this gets routed to the appropriate area owners) Nov 3, 2025
@dotnet-policy-service dotnet-policy-service bot added the community-contribution label (indicates that the PR has been added by a community member) Nov 3, 2025
@xtqqczze
Contributor Author

xtqqczze commented Nov 3, 2025

@MihuBot

@tannergooding
Member

tannergooding commented Nov 3, 2025

> `lea` vs `mov`, `add`

This is potentially a de-optimization.

Not only are there often fewer LEA than ALU ports, but LEA is expected to be used for "addressing" and as such often gets specialized hardware support such as utilizing the AGU, participating in stack pointer tracking, fast store forwarding prediction, etc. There are also often special considerations of the "two operand" vs "three operand" LEA, with the latter being more restricted and more expensive.

Some newer hardware is more flexible and will allow LEA without scaled index and with only two sources from base, index, and displacement to be executed as an ALU operation instead, but this isn't a guarantee and may still break the other optimizations that are possible.

If this was beneficial to generally do, it's likely a general purpose optimization that should be done by the JIT (rather than a "one off" micro-optimization to a single method).

@xtqqczze
Contributor Author

xtqqczze commented Nov 3, 2025

@EgorBot -amd -intel

using System;
using System.Collections.Generic; // needed for IEnumerable<string>
using System.Net;
using BenchmarkDotNet.Attributes;

public class IPv4_u16_Benchmarks
{
    public IEnumerable<string> Data() => [
        new string('A', 64),
        "HelloWorld1234567890"
    ];

    [Benchmark]
    [ArgumentsSource(nameof(Data))]
    public bool M(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            char ch = s[i];
            if (!char.IsAsciiLetterOrDigit(ch))
                return false;
        }
        return true;
    }
}

@xtqqczze
Contributor Author

xtqqczze commented Nov 4, 2025

> This is potentially a de-optimization.

@tannergooding Benchmarks show ratios of 0.99, 0.95, 1.05 for znver4, cascadelake and skylake respectively. So this is indeed a deoptimization on older processors (but an optimization on newer ones).

@tannergooding
Member

> Benchmarks show ratios of 0.99, 0.95, 1.05 for znver4, cascadelake and skylake respectively. So this is indeed a deoptimization on older processors (but an optimization on newer ones).

I would say the benchmark differences are likely too small (0-6ns) to give any kind of definitive result. They are going to be influenced by things like run-to-run differences in code alignment, BDN's measurement of the overhead of an "empty call", and even the latency of the hardware timer itself (typically around 10-15 cycles on such CPUs).
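
To get a rough feel for how timer cost compares with a difference of a few nanoseconds, one could time back-to-back timestamp reads. The sketch below is only an illustration under stated assumptions: `TimerOverhead` and `AverageTimestampCostNs` are hypothetical names, and `Stopwatch.GetTimestamp()` goes through the OS and runtime, so its cost is not the same thing as the raw hardware timer latency mentioned above or as what BDN measures internally.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper for illustration only.
public static class TimerOverhead
{
    // Returns the average cost, in nanoseconds, of one Stopwatch.GetTimestamp() call.
    public static double AverageTimestampCostNs(int iterations)
    {
        long sink = 0;
        long start = Stopwatch.GetTimestamp();
        for (int i = 0; i < iterations; i++)
        {
            // XOR the results together so the calls are not dead code.
            sink ^= Stopwatch.GetTimestamp();
        }
        long end = Stopwatch.GetTimestamp();
        GC.KeepAlive(sink);

        // Convert ticks to nanoseconds using the timer frequency (ticks per second).
        double totalNs = (end - start) * 1e9 / Stopwatch.Frequency;
        return totalNs / iterations;
    }
}
```

Calling `TimerOverhead.AverageTimestampCostNs(1_000_000)` a few times and comparing the spread against the 0-6ns gap gives a sense of how much noise a single measurement can carry.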

The code in question is a small sample that differs by an elidable register-to-register mov instruction. It is a micro-optimization in every sense of the word and so it's not something we'd typically take without significantly more evidence showing it's worthwhile.

Beyond that, I still stand by the earlier point: if this were a desirable optimization, it isn't something we should be touching managed code to achieve. These types of subtle codegen differences are the type of thing that needs to be handled in the JIT instead. Doing so ensures that it's not just one function that benefits, but most functions that employ similar patterns. It is probably representative of some general-purpose transform that is missing and which might have broader impact for cases where it can actually impact the addressing mode of a load/store.

I'd recommend closing this PR and focusing any efforts for this or similar single-method micro-optimizations in the JIT instead so the impact can be more than a handful of nanoseconds.

@xtqqczze xtqqczze closed this Nov 5, 2025
@xtqqczze xtqqczze deleted the IsAsciiLetterOrDigit branch November 5, 2025 02:33