UCRTbase.dll toupper() is 133x slower wall time than perl/msvcrt.dll #23037

Open
bulk88 opened this issue Feb 27, 2025 · 6 comments

@bulk88
Contributor

bulk88 commented Feb 27, 2025

Module:

Description
A certain profiling call stack caught my eye: the final report from my profiler said 8% of all of perl's CPU time is spent inside isupper()/toupper() from ucrtbase.dll. These float between place 4 and place 8 among the highest CPU hogs on random core .t files. toupper() reaching #1 was jaw dropping, hence I investigated.

Image

Image

Some research. This is 1 call for 1 U8, BTW. ::LocaleUpdate has 6 FlsGetValue() calls (wrapped with GetLastError() preserving), toupper() fires ::LocaleUpdate() every time, errno in UCRT adds another 4-5 FlsGetValue() calls, and __acrt_LCMapStringA() fires ::LocaleUpdate again:

KernelBase.dll!LCMapStringEx()	Unknown
kernel32.dll!LCMapStringExStub()	Unknown
ucrtbase.dll!__acrt_LCMapStringA_stat()	Unknown
ucrtbase.dll!__acrt_LCMapStringA()	Unknown
ucrtbase.dll!toupper()	Unknown

soon after

KernelBase.dll!DefaultSortVersion()	Unknown
KernelBase.dll!VersionValue()	Unknown
KernelBase.dll!SortChangeCase()	Unknown
KernelBase.dll!LCMapStringEx()	Unknown

a few CPU instruction addresses later (remember, lines of code have loops)

MinSortChangeCase()
KernelBase.dll!GetNamedLocaleHashNode()	Unknown
KernelBase.dll!SortChangeCase()	Unknown
KernelBase.dll!LCMapStringEx()	Unknown
kernel32.dll!LCMapStringExStub()	Unknown
ucrtbase.dll!__acrt_LCMapStringA_stat()	Unknown
ucrtbase.dll!__acrt_LCMapStringA()	Unknown
ucrtbase.dll!toupper()	Unknown

kernelbase.dll seems to be building a tree of nodes, or iterating over all the country codes on earth. The data being searched by KernelBase.dll!GetNamedLocaleHashNode looks like

but this is raw memory with the unprintables regexped out. I think it's country codes, but I'm not going to reverse engineer it.

Benchmarks: it's horrible.

C:\sources\crtslow\CRTSlow>perl -Mblib -MCRTSlow -e"CRT::Be();";
cache wake                       78063695 us Ln 337
tolower                        2440267443 us Ln 341
_tolower                         18608920 us Ln 345
toLOWER_A                        17415609 us Ln 349
toLOWER_L1                       17737359 us Ln 353
isgraph                        2562596668 us Ln 357
isGRAPH_A                        18712061 us Ln 361
isGRAPH_L1                       17790779 us Ln 365
isalnum                        2520004815 us Ln 369
isALPHANUMERIC_A                 18165949 us Ln 373
isALPHANUMERIC_L1                18033632 us Ln 377
isalnum msvcrt                  111663798 us Ln 392
isalnum msvcr100                 98945415 us Ln 402
isalnum msvcr120                 97687178 us Ln 412
C:\sources\crtslow\CRTSlow>

With pseudo-fork threads on 3 cores. IDK enough to say whether this is just scaling, or whether lock contention on the Perl side or the MS side is happening.

C:\sources\crtslow\CRTSlow>perl -Mblib -MCRTSlow -e"$r = fork(); $t = fork(); exit if !$t && $r;  CRT::Be();";
cache wake                       32082575 us Ln 337
cache wake                       33263148 us Ln 337
cache wake                       40918586 us Ln 337
tolower                        3157700393 us Ln 341
tolower                        3165074351 us Ln 341
_tolower                         22321171 us Ln 345
_tolower                         20705845 us Ln 345
toLOWER_A                        22430065 us Ln 349
toLOWER_A                        20589965 us Ln 349
toLOWER_L1                       22076263 us Ln 353
toLOWER_L1                       23926635 us Ln 353
tolower                        3348870216 us Ln 341
_tolower                         23873216 us Ln 345
toLOWER_A                        24134972 us Ln 349
toLOWER_L1                       23457776 us Ln 353
isgraph                        3397327541 us Ln 357
isGRAPH_A                        20434638 us Ln 361
isGRAPH_L1                       24083196 us Ln 365
isgraph                        3679713786 us Ln 357
isgraph                        3694650315 us Ln 357
isGRAPH_A                        24393851 us Ln 361
isGRAPH_A                        25253907 us Ln 361
isGRAPH_L1                       18637274 us Ln 365
isGRAPH_L1                       23057951 us Ln 365
isalnum                        3123988931 us Ln 369
isALPHANUMERIC_A                 23420793 us Ln 373
isALPHANUMERIC_L1                20040976 us Ln 377
isalnum                        3220601143 us Ln 369
isalnum                        3333259367 us Ln 369
isALPHANUMERIC_A                 26456259 us Ln 373
isALPHANUMERIC_A                 25220212 us Ln 373
isALPHANUMERIC_L1                19804287 us Ln 377
isALPHANUMERIC_L1                26882794 us Ln 377
isalnum msvcrt                  144513972 us Ln 392
isalnum msvcrt                  151367705 us Ln 392
isalnum msvcrt                  151673430 us Ln 392
isalnum msvcr100                133158199 us Ln 402
isalnum msvcr100                124220690 us Ln 402
isalnum msvcr100                123605543 us Ln 402
isalnum msvcr120                129325137 us Ln 412
isalnum msvcr120                118722581 us Ln 412
isalnum msvcr120                121761334 us Ln 412
C:\sources\crtslow\CRTSlow>

Steps to Reproduce

#define PERL_NO_GET_CONTEXT
#define  WIN32_LEAN_AND_MEAN
#include <windows.h>
#include "EXTERN.h"
#include "perl.h"
#include "XSUB.h"


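/* Global Data */
LARGE_INTEGER Frequency = { 0 };
#define g_Frequency Frequency

/* Assumed: the usual h2xs boilerplate that START_MY_CXT and MY_CXT_INIT need.
   It is not shown in the original listing. */
#define MY_CXT_KEY "CRTSlow::_guts" XS_VERSION
typedef struct {
    int dummy;
} my_cxt_t;

START_MY_CXT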


/* BTIME = BENCH TIME.
   NOTE: the 1000000000 multiplier converts ticks to nanoseconds, so the
   values printed with the "us" label (and held in ElapsedMicroseconds) are
   really ns. */
#define BTIMESTART do { \
    LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds; NV nv1; NV nv2; \
    QueryPerformanceCounter(&StartingTime)

#define BTIMEEND(label) \
    QueryPerformanceCounter(&EndingTime); \
    ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart; \
    ElapsedMicroseconds.QuadPart *= 1000000000; \
    ElapsedMicroseconds.QuadPart = ElapsedMicroseconds.QuadPart           \
    /((LARGE_INTEGER*)(&Frequency))->QuadPart; \
    printf("%-30s %10I64u us Ln %u\n", label, ElapsedMicroseconds.QuadPart, __LINE__); \
} while(0)

#define VP(_vp) ((size_t)(_vp))
#define CP(_cp) ((char *)(_cp))
#define VPP(_p) ((void**)(_p))
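/* LST ... LEND wrap the statement under test: 10 passes over every byte in
   [low, hi). c is the current byte; r is only assigned in the "cache wake"
   pass and is summed into c1 on every iteration. */
#define LST for(n=0; n < 10; n++) { p = CP(low); while(p < CP(hi)){ c = *p;
#define LEND  c1 += r; p++;}   }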
#  ifndef MIN
#    define MIN(a,b) ((a) < (b) ? (a) : (b))
#  endif
#  ifndef MAX
#    define MAX(a,b) ((a) > (b) ? (a) : (b))
#  endif

MODULE = CRT		PACKAGE = CRT

void
Be()
PPCODE:
  PUTBACK;
  HMODULE h_orig;
  /* Find the loaded image that contains PL_No; its sections are the
     benchmark input buffer. */
  GetModuleHandleExW(
    GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS,
    (LPCWSTR)&PL_No[0],
    &h_orig
  );
  const IMAGE_DOS_HEADER* h = (const IMAGE_DOS_HEADER*)h_orig;
  PIMAGE_NT_HEADERS nt_header = (PIMAGE_NT_HEADERS)(VP(h) + VP(h->e_lfanew));
  WORD NumberOfSections = nt_header->FileHeader.NumberOfSections;
  /* IMAGE_FIRST_SECTION() macro is universal, and corrects for OPTIONAL32 vs OPTIONAL64 */
  const IMAGE_SECTION_HEADER * sec = IMAGE_FIRST_SECTION(nt_header);
  const IMAGE_SECTION_HEADER * secEnd = sec + NumberOfSections;
  char * low = (char *)~(size_t)0;
  char * hi = 0;
  char * lowi;
  char * hii;
  U32 r;
  U32 c1 = 0;
  U8 c;
  char * p;
  int n = 0;
  /* Compute the lowest and highest addresses covered by any section. */
  for(; sec < secEnd ; sec++) {
    lowi = CP(VP(h) + VP(sec->VirtualAddress));
    hii = lowi + VP(sec->SizeOfRawData);
    low = MIN(low,lowi);
    hi = MAX(hi,hii);
  }
 BTIMESTART;
LST r = c; LEND
BTIMEEND("cache wake");
BTIMESTART; LST
    tolower(c);
LEND
BTIMEEND("tolower");
BTIMESTART; LST
    _tolower(c);
    LEND
      BTIMEEND("_tolower");
    BTIMESTART; LST
    toLOWER_A(c);
    LEND
      BTIMEEND("toLOWER_A");
    BTIMESTART; LST
    toLOWER_L1(c);
    LEND
      BTIMEEND("toLOWER_L1");
    BTIMESTART; LST
    isgraph(c);
    LEND
      BTIMEEND("isgraph");
    BTIMESTART; LST
    isGRAPH_A(c);
    LEND
      BTIMEEND("isGRAPH_A");
    BTIMESTART; LST
    isGRAPH_L1(c);
    LEND
      BTIMEEND("isGRAPH_L1");
    BTIMESTART; LST
    isalnum(c);
    LEND
      BTIMEEND("isalnum");
    BTIMESTART; LST
    isALPHANUMERIC_A(c);
    LEND
      BTIMEEND("isALPHANUMERIC_A");
    BTIMESTART; LST
    isALPHANUMERIC_L1(c);
    LEND
      BTIMEEND("isALPHANUMERIC_L1");
  /* Repeat the isalnum benchmark against older CRT DLLs, resolved at run time. */
  typedef int (__cdecl * isctypefn_t)(int);
  HMODULE h2;
    h2 = LoadLibrary("msvcrt");
    if(h2) {
      isctypefn_t pfn = (isctypefn_t)GetProcAddress(h2, "isalnum");
      if(pfn) {
            BTIMESTART; LST
            pfn(c);
            LEND
            BTIMEEND("isalnum msvcrt");
      }
    }
    h2 = LoadLibrary("msvcr100.dll");
    if(h2) {
      isctypefn_t pfn = (isctypefn_t)GetProcAddress(h2, "isalnum");
      if(pfn) {
            BTIMESTART; LST
            pfn(c);
            LEND
            BTIMEEND("isalnum msvcr100");
      }
    }
    h2 = LoadLibrary("msvcr120.dll");
    if(h2) {
      isctypefn_t pfn = (isctypefn_t)GetProcAddress(h2, "isalnum");
      if(pfn) {
            BTIMESTART; LST
            pfn(c);
            LEND
            BTIMEEND("isalnum msvcr120");
      }
    }
    iscntrl(c);
    isCNTRL_A(c);
    isCNTRL_L1(c);
    ispunct(c);
    isPUNCT_A(c);
    isPUNCT_L1(c);
    isspace(c);
    isSPACE_A(c) ;
    isSPACE_L1(c) ;
    isxdigit(c);
    isXDIGIT_A(c);
    isXDIGIT_L1(c);
    isdigit(c);
    isDIGIT_A(c);
    isDIGIT_L1(c);
    isalpha(c);
    isALPHA_A(c);
    isALPHA_L1(c);
    XSRETURN_IV(c1);
    return;
 

MODULE = CRTSlow		PACKAGE = CRTSlow		


BOOT:
{
    MY_CXT_INIT;
    /* If any of the fields in the my_cxt_t struct need
       to be initialised, do it here.
     */
    QueryPerformanceFrequency(&g_Frequency);
}
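
For reading the numbers above: BTIMEEND multiplies the tick delta by 1000000000 before dividing by the counter frequency, so the printed values are nanoseconds even though the label says "us". A minimal, portable sketch of that conversion follows. ticks_to_ns() is a hypothetical helper, not part of the benchmark, and it assumes a counter frequency below roughly 9.2e9 Hz so that the remainder term cannot overflow.

```c
#include <stdint.h>

/* Hypothetical helper: convert a performance-counter tick delta to
 * nanoseconds. Splitting into whole seconds and a sub-second remainder
 * keeps the intermediate products small, unlike ticks * 1000000000. */
int64_t ticks_to_ns(int64_t ticks, int64_t freq)
{
    int64_t whole = ticks / freq;   /* full seconds */
    int64_t rem   = ticks % freq;   /* leftover ticks, always < freq */
    return whole * INT64_C(1000000000)
         + rem * INT64_C(1000000000) / freq;
}
```

With a 10 MHz counter, a delta of 24402674 ticks comes out as 2440267400 ns, i.e. about 2.44 seconds.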

Expected behavior

Half joke, half serious: remove UCRT from the default build config of Win32 perl and link against msvcrt.dll.

Perl configuration

Summary of my perl5 (revision 5 version 41 subversion 7) configuration:
  Derived from: 73172a67eaae5671dffc06b427f005810d151472
  Platform:
    osname=MSWin32
    osvers=6.1.7601
    archname=MSWin32-x64-multi-thread
    uname=''
    config_args='undef'
    hint=recommended
    useposix=true
    d_sigaction=undef
    useithreads=define
    usemultiplicity=define
    use64bitint=define
    use64bitall=undef
    uselongdouble=undef
    usemymalloc=n
    default_inc_excludes_dot=define
  Compiler:
    cc='cl'
    ccflags ='-nologo -GF -W3 -MD -TC -DWIN32 -D_CONSOLE -DNO_STRICT -DWIN64 -D_CRT_SECURE_NO_DEPRECATE -D_CRT_NONSTDC_NO_DEPRECATE -D_WINSOCK_DEPRECATED_NO_WARNINGS -DPERL_TEXTMODE_SCRIPTS -DMULTIPLICITY -DPERL_IMPLICIT_SYS -DWIN32_NO_REGISTRY -DUSE_PERLIO'
    optimize='-O1 -Zi -GL -fp:precise'
    cppflags='-DWIN32'
    ccversion='19.36.32535'
    gccversion=''
    gccosandvers=''
    intsize=4
    longsize=4
    ptrsize=8
    doublesize=8
    byteorder=12345678
    doublekind=3
    d_longlong=undef
    longlongsize=8
    d_longdbl=define
    longdblsize=8
    longdblkind=0
    ivtype='__int64'
    ivsize=8
    nvtype='double'
    nvsize=8
    Off_t='__int64'
    lseeksize=8
    alignbytes=8
    prototype=define
  Linker and Libraries:
    ld='link'
    ldflags ='-nologo -nodefaultlib -debug -opt:ref,icf -ltcg -libpath:"c:\pb64\lib\CORE" -machine:AMD64 -subsystem:console,"5.02"'
    libpth="C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\lib\x64"
    libs=oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib comctl32.lib msvcrt.lib vcruntime.lib ucrt.lib
    perllibs=oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib comctl32.lib msvcrt.lib vcruntime.lib ucrt.lib
    libc=ucrt.lib
    so=dll
    useshrplib=true
    libperl=perl541.lib
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_win32.xs
    dlext=dll
    d_dlsymun=undef
    ccdlflags=' '
    cccdlflags=' '
    lddlflags='-dll -nologo -nodefaultlib -debug -opt:ref,icf -ltcg -libpath:"c:\pb64\lib\CORE" -machine:AMD64 -subsystem:console,"5.02"'


Characteristics of this binary (from libperl):
  Compile-time options:
    HAS_LONG_DOUBLE
    HAS_TIMES
    HAVE_INTERP_INTERN
    MULTIPLICITY
    PERLIO_LAYERS
    PERL_COPY_ON_WRITE
    PERL_DONT_CREATE_GVSV
    PERL_HASH_FUNC_SIPHASH13
    PERL_HASH_USE_SBOX32
    PERL_IMPLICIT_SYS
    PERL_MALLOC_WRAP
    PERL_OP_PARENT
    PERL_PRESERVE_IVUV
    PERL_USE_SAFE_PUTENV
    USE_64_BIT_INT
    USE_ITHREADS
    USE_LARGE_FILES
    USE_LOCALE
    USE_LOCALE_COLLATE
    USE_LOCALE_CTYPE
    USE_LOCALE_NUMERIC
    USE_LOCALE_TIME
    USE_NO_REGISTRY
    USE_PERLIO
    USE_PERL_ATOF
    USE_THREAD_SAFE_LOCALE
  Locally applied patches:
    uncommitted-changes
  Built under MSWin32
  Compiled at Dec 20 2024 10:03:46
  %ENV:
    PERL_DOBK="1"
    PERL_DOBP="1"
    PERL_DODB="1"
  @INC:
    C:/pb64/site/lib/MSWin32-x64-multi-thread
    C:/pb64/site/lib
    C:/pb64/lib

@bulk88
Contributor Author

bulk88 commented Feb 27, 2025

python/cpython#79376

https://bugs.python.org/issue35195

In 2018 Python identified this problem. The Python ticket remains open as of Feb 2025. IDK enough arch/API/design/tech info to understand all the comments in the two CPython tickets, so I can't tell whether they contain a proposed fix, a rejected fix, or an unfairly rejected fix.

@khwilliamson
Contributor

UCRT works; many bugs went away when we converted to use it.

@tonycoz
Contributor

tonycoz commented Feb 27, 2025

I'm not so worried about the performance of toupper() here, but there are a few other problems with this code:

  • it's trying to emulate the case-insensitivity of the Win32 UTF-16 environment with localized 8-bit (possibly MBCS) strings
  • it's using locale sensitive functions to do it, so if you change locale the order may become invalid
  • the ::Add function pushes new strings on the end and then does a sort, making it $O(N \log N)$ when it could be $O(N)$ (the inserting would be $O(N)$, finding the right place to put it is $O(\log N)$ )
  • the copy constructor calls ::Add for each entry in the table, making it $O(N^2 \log N)$ when simply allocating a new array and copying over the entries would be $O(N)$.
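
The $O(N)$ insertion from the third bullet can be sketched as a binary search followed by one memmove(). This is a minimal illustration over a plain array of C strings, not the actual perlhost.h code; lower_bound() and sorted_insert() are hypothetical names, and the caller is assumed to have reserved room for one more element.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: index of the first element of the sorted array
 * `tab` (length n) that compares >= key under `cmp`. O(log N). */
size_t lower_bound(const char **tab, size_t n, const char *key,
                   int (*cmp)(const char *, const char *))
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cmp(tab[mid], key) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Insert `key` so the array stays sorted; no re-sort is needed.
 * Assumption: `tab` has capacity for n + 1 pointers.
 * Returns the new length. O(N) because of the memmove(). */
size_t sorted_insert(const char **tab, size_t n, const char *key,
                     int (*cmp)(const char *, const char *))
{
    size_t pos = lower_bound(tab, n, key, cmp);
    memmove(&tab[pos + 1], &tab[pos], (n - pos) * sizeof tab[0]);
    tab[pos] = key;
    return n + 1;
}
```

Copying a whole table, as in the last bullet, would then be a single allocation plus memcpy() of the already-sorted array instead of N calls to the insert routine.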

Fixing all this would eliminate the toupper/isupper() calls, I don't know off-hand what the appropriate Win32 API would be.

@tonycoz tonycoz self-assigned this Feb 28, 2025
@bulk88
Contributor Author

bulk88 commented Feb 28, 2025

Forgot to add in the OP.

Since 5.37.10 and commit 8a548d1, msvcrt.dll (yes, that file, C:\windows\System32\msvcrt.dll) is statically DLL-linked into the perl process's virtual address space 100% of the time, because ws2_32.dll (aka winsock) is statically linked and itself links against msvcrt.dll. I am very dissatisfied with commit 8a548d1, since delay loading winsock was the best millisecond-saving optimization ever done to WinPerl, especially given how EU::MM makefiles spawn dozens and dozens of very short-lived (15-100 ms lifetime) perl processes to build a module.

The P5P repo's .t files, and less so CPAN, call sub runperl() (perl5/t/test.pl, line 857 in 16196ae) 10Ks, IDK maybe 100Ks, of times during a core blead make test. Copy-pasted from a GH runner: blead perl has 1.2 million tests.

Files=2910, Tests=1193052, 418 wallclock secs (94.06 usr 10.78 sys + 493.74 cusr 65.30 csys = 663.88 CPU)

100K * 4 ms = 400 s, i.e. about 6.7 minutes faster core make test. 4 ms is a wild guess; it will vary machine to machine and user to user. It's probably 10 or 20 ms for me and my older equipment. I have plans to return ws2_32.dll to delay loading. My current patch for that is failing a couple of tests and needs some refactoring to remove irrelevant code edits. It's also stalled by another patch I have that makes user32.dll a delay load, which is not simple, b/c WinPerl's runloop has always monitored the Windows GUI event loop since day 1, and CPAN GUI libs can't be broken by WinPerl's runloop suddenly not dispatching signals/Windows OS GUI messages from user32.dll. But I have already prototyped and run 3 implementations of delay-loaded user32.dll and have 4 more implementations brainstormed in my comments. Did you know that 50% of the perl.exe processes in a blead perl gmake test exit()/terminate before reaching the 35-50 millisecond point of their process lifetime?

55 milliseconds is 1.7 frames at 30 frames per second. Blead perl currently waits for a 33,000 OP*s-executed counter before the first time it polls the Win32 GUI loop. It's crazy how much "link av.obj hv.obj perl.obj /delayload:user32.dll -o perl541.dll" helps blead perl's own core gmake test, with no other changes. A PR with just /delayload:user32.dll in it is inappropriate IMO, b/c of technical details I won't go into.

So, UCRTbase.dll vs msvcrt.dll for isupper()/toupper(). msvcrt.dll has been statically linked into perl since 5.37.10, so my benchmarks in the OP were done with ucrtbase.dll and msvcrt.dll inside the same process. The perl process already had msvcrt.dll in its VA space; I didn't explicitly load it.

The question now is: does WinPerl selectively replace cherry-picked, problematic, slow libc calls in ucrtbase.dll with the same-named libc calls in msvcrt.dll?

perl.exe has the choice of which one to call at runtime, because both are available at all times inside a perl process. The call stacks, profiler reports, and my benchmarks show a performance difference of one to two orders of magnitude between 2 different implementations of the exact same C standard lib function.

Next question: why is WinPerl even linking against MS's isupper()/toupper() when Perl/P5P implements its own portable isupper()/toupper()? Is that itself a bug? My benchmarks included the Perl portable replacements for the platform vendor/OS/CC provided versions. They exist, and they are fast/perfect.

Would slurping/looping U8 values 0x00-0xFF, 1x on process start, through MS UCRT's isupper()/toupper(), and caching the output 0/1 bits or U8s, work?
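
A minimal sketch of that idea, using only standard C: run every possible byte through the C library once and answer later queries from a table. The names ctype_cache_init(), cached_toupper() and cached_isupper() are hypothetical, and the sketch assumes the tables are rebuilt whenever the LC_CTYPE locale changes, since otherwise the cached answers go stale.

```c
#include <ctype.h>
#include <limits.h>

/* Hypothetical cache: one entry per possible unsigned char value. */
static unsigned char upper_tab[UCHAR_MAX + 1];
static unsigned char is_upper_tab[UCHAR_MAX + 1];

/* Fill both tables from the C library, once. Assumption: the caller
 * re-runs this after every LC_CTYPE locale change. */
void ctype_cache_init(void)
{
    int c;
    for (c = 0; c <= UCHAR_MAX; c++) {
        upper_tab[c]    = (unsigned char)toupper(c);
        is_upper_tab[c] = (unsigned char)(isupper(c) != 0);
    }
}

/* Table-lookup replacements; the casts keep the index in 0..UCHAR_MAX. */
int cached_toupper(int c) { return upper_tab[(unsigned char)c]; }
int cached_isupper(int c) { return is_upper_tab[(unsigned char)c]; }
```

After ctype_cache_init(), each query is one array load, so the cost of the slow library path is paid 256 times per locale change rather than once per character.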

Nobody can justify enumerating all 250 country codes on earth in a SQL DB or a for loop + strcmp(), malloc()ing a red-black tree in a loop, balancing it in a loop, walking the nodes until a match, then free()ing in a loop, for each execution of if(isupper(char var_char)) {0;} else {0;} on an 8-bit integer.

You can't upper-case an ASCII string if, for each 8-bit character, you post a new job ad on LinkedIn, interview and hire a new developer, and agree on a consulting contract and fee schedule; he reads the ASCII char, writes 01000001 with a pen, and hands you the paper with 01000001 written on it; you hand him a check for $500, and his employment at your company terminates. He was paid $500 for 15-25 seconds of work. Great company to work for. 5 stars employer. That's what UCRT is doing internally.

3rd possible fix, the most difficult one, which is beyond my expertise: figure out why ucrtbase.dll's cache logic is disabled at runtime inside a perl.exe process. I can physically see that my ucrtbase.dll has caching and shortcut mechanisms/conditional branches, but for unknown reasons, at runtime, every single time an if(){}else{} conditional jmp opportunity exists, the "heavy" no-cache branch of machine code is picked for execution, and not the "fast cache" array/C struct bitfield execution path. I'm not going to single-step it again right now, but I counted at least 3 if/else "fast cache, quick return" opportunities inside ucrtbase.dll that didn't execute before ucrtbase.dll handed control over to kernel32.dll, and kernel32.dll/kernelbase.dll LCMapStringEx() started searching a SQL DB of country codes, or digging through a memory-mapped file of locales and country codes on my disk and inflating that "binary disk file" of variable-length integers, gzip/huffman-encoded values, or 0-indexed offsets into CPU-friendly C structs.

The API docs for LCMapStringEx()@kernel32.dll https://learn.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-lcmapstringex#remarks actually spell out, the user should NOT call LCMapStringEx() 1x per char, but pass LCMAP_SORTKEY, and obtain a reference counted (???) sort object from LCMapStringEx() and feed that "sort cache object" into some other API call. ucrtbase.dll is calling LCMapStringEx() 1x per char, which the API docs of LCMapStringEx() specifically say is a bad idea.

So did perl.exe/perl5xx.dll/perl5porters do something wrong and explicitly disable the cache logic inside ucrtbase.dll?

Or is this a bug inside ucrtbase.dll, which only Microsoft can fix, so that a member of the public must file a public bug ticket with MS, and MS devs must recompile and publish a new, higher build number of ucrtbase.dll? It's beyond my scope to diagnose this. IDK enough.

@bulk88
Contributor Author

bulk88 commented Feb 28, 2025

I'm not so worried about the performance of toupper() here, but there are a few other problems with this code:

  • it's trying to emulate the case-insensitivity of the Win32 UTF-16 environment with localized 8-bit (possibly MBCS) strings

Maybe my ucrtbase.dll is going "crazy" because it thinks I'm passing Asian MBCS, up to 2^16 fake-ASCII, and it needs to be compliant with MBCS rules, even though I live in the USA. If I have a 2nd locale or keyboard installed for personal reasons, it will be a locale/keyboard layout from somewhere in Continental Europe, but quite far from London or Paris (not Latin-1, but still a normal 0x80-0xFF code page). I have no idea why an MBCS code path would ever trigger on my system, but CPython has the exact same bug I have, and there it is reported and reproduced by other people. It's not unique to me.

PS I've spent 3 days searching ReactOS for the limit on U8s per "char" for an "MBCS" code page at the technical MS NLS C API level. I believe MultiByteToWideChar allows 1 MBCS character to be of length I32_MAX, or the input buffer length in U8s, or up to a U8 '\0'. You can do it by taking an MBCS continuation-mask U8 byte and repeating it forever to the end of your input byte buffer. A 1 MB long character will become 2 bytes of output, aka the UTF-16 replacement char L'?'. The other direction, UTF-16 to legacy MBCS, can't generate such nonsense, since at most 4 bytes of UTF-16 generate 4 bytes of legacy CP or 4 bytes of UTF-8 CP. Perhaps with surrogate abuse 4 bytes of wide input become 6 bytes of legacy output.

use Encode;
use Devel::Peek;
my $source = Encode::decode("UTF-16LE","\xF0\xDB\xF0\xDB");
Dump($source);
SV = PV(0x48e1f8) at 0x50a0a0
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x2208aa8 "\xEF\xBF\xBD\xEF\xBF\xBD"\0 [UTF8 "\x{fffd}\x{fffd}"]
  CUR = 6
  LEN = 16

BTW I believe RtlAnsiStringToUnicodeString/RtlMultiByteToUnicodeN and NTFS will not put up with MBCS continuation character abuse, and they are NOT identical to MultiByteToWideChar for all inputs. Benchmarking RtlAnsiStringToUnicodeString vs MultiByteToWideChar is on my todo list; I've never seen someone do it before. I strongly suspect RtlAnsiStringToUnicodeString and siblings will be faster, since they take a lot of shortcuts, such as being incapable of stopping conversion to wide at characters that are formally unassigned by spec in legacy 1-byte-per-char CPs. "Incapable of stopping conversion to wide" refers to L"?" wchar_ts in the output.

/* pseudocode: the kind of shortcut a 2-byte-max decoder can take */
char_low = *char_ptr;
if(char_low & MBCS_CONTINUE) {
    char_high = *(char_ptr+1);
    wide = (char_high <<16) | char_low;
    char_ptr += 2; /* game over */
}
  • it's using locale sensitive functions to do it, so if you change locale the order may become invalid

IDK enough. Maybe this toupper()/isupper() bug has something to do with the newish many-reader/single-writer locking in Perl that serializes access to the process-global locale between OS threads (the anti-race code).

Fixing all this would eliminate the toupper/isupper() calls, I don't know off-hand what the appropriate Win32 API would be.

What are the mandatory requirements of Perl's C code for the vendor C std lib's toupper()/isupper()?

https://en.cppreference.com/w/cpp/string/byte/toupper says no 2^32 or 2^64 abuse

If the value of ch is not representable as unsigned char and does not equal [EOF] , the behavior is undefined.
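
In practice that rule means a byte read through a plain char pointer should be converted to unsigned char before it reaches toupper(), because plain char may be signed. A minimal sketch follows; upcase_bytes() is a hypothetical helper, not code from perl.

```c
#include <ctype.h>

/* Hypothetical helper: upper-case a NUL-terminated byte string in place. */
void upcase_bytes(char *s)
{
    for (; *s != '\0'; s++) {
        /* Plain char may be signed, so a byte such as 0xE9 can be negative.
         * Converting to unsigned char first keeps the argument inside the
         * range the C standard requires for toupper(). */
        *s = (char)toupper((unsigned char)*s);
    }
}
```

The same cast applies to isupper() and the other is*() classifiers, which share the unsigned char or EOF requirement.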

As you and I both agreed on IRC, there is some really poor quality Win32-only code inside https://github.com/Perl/perl5/blob/blead/win32/perlhost.h that turns the toupper() O(n^2) (?) bug into an O(n^n) (?) bug in 1 rare place, in WinPerl not UnixPerl, and only if you create a 2nd ithread (use threads.pm; or fork();) in WinPerl.

But I'm less concerned about the performance of creating ithread #2 in a WinOS proc than about the perl interp executing this broken, slow toupper() from the run loop in very generic, very common, very production PP code executing through very vanilla cross-platform Perl C code in Perl_pp_some_op();. Image #2 in the first post shows a call stack where toupper() was executed by the interpreter from very vanilla, cross-platform, POSIX-ish Perl C code, not #ifdef WIN32\n #endif code. Vanilla PP code executing toupper() at a high frequency scares me a lot more. Below is a report, generated from a C symbol file, of all callers of these problem UCRT functions in perl5.41.10.dll. 95% of the callers are cross-platform perl code, and half of them are extreme-frequency runloop Perl_pp_*() ops.

call    cs:__imp_isalnum S_find_byclass
call    cs:__imp_isalnum S_isFOO_lc
call    cs:__imp_isalnum S_new_ctype
call    cs:__imp_isalnum S_regmatch
call    cs:__imp_isalpha S_isFOO_lc
call    cs:__imp_isalpha S_new_ctype
call    cs:__imp_iscntrl Perl_mem_collxfrm_
call    cs:__imp_iscntrl Perl_pp_fttext
call    cs:__imp_iscntrl Perl_yyerror_pvn
call    cs:__imp_iscntrl S_find_byclass
call    cs:__imp_iscntrl S_isFOO_lc
call    cs:__imp_iscntrl S_new_ctype
call    cs:__imp_iscntrl S_regmatch
call    cs:__imp_iscntrl S_sv_display
call    cs:__imp_isdigit S_isFOO_lc
call    cs:__imp_isdigit S_new_ctype
call    cs:__imp_isgraph S_isFOO_lc
call    cs:__imp_isgraph S_new_ctype
call    cs:__imp_islower S_isFOO_lc
call    cs:__imp_islower S_new_ctype
call    cs:__imp_isprint Perl_pp_fttext
call    cs:__imp_isprint Perl_yyerror_pvn
call    cs:__imp_isprint S_isFOO_lc
call    cs:__imp_isprint S_new_ctype
call    cs:__imp_isprint S_sv_display
call    cs:__imp_ispunct S_find_byclass
call    cs:__imp_ispunct S_isFOO_lc
call    cs:__imp_ispunct S_new_ctype
call    cs:__imp_ispunct S_regmatch
call    cs:__imp_isspace Perl_pp_fttext
call    cs:__imp_isspace Perl_pp_split
call    cs:__imp_isspace S_isFOO_lc
call    cs:__imp_isspace S_new_ctype
call    cs:__imp_isupper S_isFOO_lc
call    cs:__imp_isupper S_new_ctype
call    cs:__imp_isxdigit S_isFOO_lc
call    cs:__imp_isxdigit S_new_ctype
call    cs:__imp_tolower Perl__to_utf8_fold_flags
call    cs:__imp_tolower Perl__to_utf8_lower_flags
call    cs:__imp_tolower Perl_pp_fc
call    cs:__imp_tolower Perl_pp_lc
call    cs:__imp_tolower Perl_pp_ucfirst
call    cs:__imp_tolower S_new_ctype
call    cs:__imp_toupper Perl__to_utf8_title_flags
call    cs:__imp_toupper Perl__to_utf8_upper_flags
call    cs:__imp_toupper Perl_pp_uc
call    cs:__imp_toupper Perl_pp_ucfirst
call    cs:__imp_toupper S_new_ctype
call    cs:__imp_toupper compare
call    cs:__imp_toupper lookup

@bulk88
Contributor Author

bulk88 commented Feb 28, 2025

Another idea: on WinPerl, is a codebase-wide grep of isupper() -> isupper_l() (spelled _isupper_l() in the UCRT) needed, with 100% removal of isupper() forever? If you use isupper_l() and pass a locale ptr in arg 2, the horribly inefficient LocaleUpdate::_LocaleUpdate, which calls FlsGetValue() at least 5x, will take a shortcut that is 2 CPU instructions big and return instantly:

9 stack push non-vol reg+test+cond_jmp+sse_mov_ptr_to_reg+sse_mov_reg_to_ptr+abs_jmp+9 stack pop non-vol reg+x86 ret

That branch in LocaleUpdate::_LocaleUpdate is a NOOP compared to what LocaleUpdate::_LocaleUpdate does right now inside the perl.exe proc, in terms of workload, CPU op count, and runtime overhead.
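
A minimal sketch of what calling the explicit-locale variant looks like. On Windows the UCRT spells these _create_locale(), _isupper_l() and _free_locale(); POSIX.1-2008 systems use newlocale(), isupper_l() and freelocale(). The x-prefixed wrapper names and count_upper() are hypothetical, the "C" locale is just an example choice, and whether this actually avoids the per-call overhead described above is the claim of this comment, not something the sketch measures.

```c
#ifndef _WIN32
#  ifndef _POSIX_C_SOURCE
#    define _POSIX_C_SOURCE 200809L   /* for newlocale()/isupper_l() */
#  endif
#endif
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical thin wrappers so the same code builds on both platforms. */
#ifdef _WIN32
typedef _locale_t xlocale_t;
#define xlocale_create_c()  _create_locale(LC_CTYPE, "C")
#define xlocale_free(l)     _free_locale(l)
#define xisupper_l(c, l)    _isupper_l((c), (l))
#else
typedef locale_t xlocale_t;
#define xlocale_create_c()  newlocale(LC_CTYPE_MASK, "C", (locale_t)0)
#define xlocale_free(l)     freelocale(l)
#define xisupper_l(c, l)    isupper_l((c), (l))
#endif

/* Count upper-case bytes, passing the locale object explicitly on every
 * call instead of letting the CRT look up the current thread locale. */
size_t count_upper(const char *s, xlocale_t loc)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (xisupper_l((unsigned char)*s, loc))
            n++;
    return n;
}
```

The locale object would be created once (for example at interpreter start or on a locale switch) and reused, so the creation cost is not paid per character.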

If libperl.dll always passes a locale_t as arg 2, the whole Perl process-wide/thread-wide locale setting race bug basically disappears from WinPerl/libperl.dll through macros/etc. Today WinPerl serializes multi-OS-thread access with a very poor, DIY-ed by Perl, re-implementation of MS's Slim reader/writer (SRW) API https://learn.microsoft.com/en-us/windows/win32/sync/slim-reader-writer--srw--locks. Maybe the exported lock variables stay for less-than-perfect CPAN XS code, but nothing in libperl.dll would ever obtain that serializing lock again.

And MS UCRT devs probably can't even see the LocaleUpdate::_LocaleUpdate method call, since it looks like a "POD" void * class member field in their source code and they forgot it is a getter/setter operator-overload method and not a machine type. LocaleUpdate::_LocaleUpdate can't constant fold anything b/c of the rules of C/C++ (FlsGetValue() is a C symbol; you evaluated it multiple times, and inlining or constant folding it is illegal in C).

It doesn't matter in 2025, but IIRC ucrtbase.dll on WinXP with VC2013 calls GetProcAddress("FlsGetValue"), GPA returns NULL, and then ucrtbase.dll executes TlsGetValue() every single time (5x-6x).


4 participants