Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perl_sv_clear: unrolled SvREFCNT_dec and sv_free2....isn't that?! #22798

Open
richardleach opened this issue Nov 30, 2024 · 9 comments
Open

Perl_sv_clear: unrolled SvREFCNT_dec and sv_free2....isn't that?! #22798

richardleach opened this issue Nov 30, 2024 · 9 comments

Comments

@richardleach
Copy link
Contributor

Description
Towards the bottom of Perl_sv_clear, which is frequently a very hot function, there is the following comment:
/* unrolled SvREFCNT_dec and sv_free2 follows: */

Sounds good, makes sense that it would unroll SvREFCNT_dec and sv_free2.
But that's not actually what it does:

            /* unrolled SvREFCNT_dec and sv_free2 follows: */

            if (!sv)
                continue;
            if (!SvREFCNT(sv)) {
                sv_free(sv);
                continue;
            }
            .....

At least nowadays, sv_free means calling Perl_sv_free:

void
Perl_sv_free(pTHX_ SV *const sv)
{
    SvREFCNT_dec(sv);
}

SvREFCNT_dec is an inline function in sv_inline.h:

PERL_STATIC_INLINE void
Perl_SvREFCNT_dec(pTHX_ SV *sv)
{
    if (LIKELY(sv != NULL)) {
        U32 rc = SvREFCNT(sv);
        if (LIKELY(rc > 1))
            SvREFCNT(sv) = rc - 1;
        else
            Perl_sv_free2(aTHX_ sv, rc);
    }
}

So instead of unrolling SvREFCNT_dec and sv_free2, it calls a function (probably inlined)
to call SvREFCNT_dec which likely will call sv_free2. That's not unrolled at all!

The likely impact of this is some SV freeing is taking longer than it strictly has to. Looking at
gcov coverage when running the test harness, there are 35099459 calls to Perl_sv_free and
797851111 calls to Perl_sv_free2, so maybe ~4% of cleared SVs.

Expected behavior
I haven't sat down to figure this out. It seems like the status quo in Perl_sv_clear must be
wrong though and either the comment should be amended, (more likely) the call to sv_free
is intended to be a recursive call back to sv_free2, or some other code change should happen.

Besides Perl_sv_clear, only ext/Opcode/Opcode.xs and dist/Storable/Storable.xs seem
to call sv_free directly. Probably they should be using SvREFCNT_dec or calling sv_free2
instead.

Perl configuration
blead

@jkeenan
Copy link
Contributor

jkeenan commented Nov 30, 2024

@iabyn, it appears that much of this section of code entered the codebase back in 2010 in this commit:

commit 5239d5c4bfde4ec02e1787e9dc9ada189ad868e5
Author:     David Mitchell <[email protected]>
AuthorDate: Fri Oct 8 16:22:42 2010 +0100
Commit:     David Mitchell <[email protected]>
CommitDate: Mon Oct 11 00:41:17 2010 +0100

    make sv_clear() iterate over AVs
    
    In sv_clear(), rather than calling av_undef(), iterate over the AV's
    elements. This is the first stage in making sv_clear() non-recursive,
    and thus non-stack-blowing when freeing deeply nested structures.
...

Can you take a look?

@bulk88
Copy link
Contributor

bulk88 commented Nov 30, 2024

.text:0000000180001640 Perl_SvREFCNT_dec proc
.text:0000000180001640 48 83 EC 28             sub     rsp, 28h
.text:0000000180001644 48 85 D2                test    rdx, rdx
.text:0000000180001647 74 1B                   jz      short loc_180001664
.text:0000000180001649 44 8B 42 08             mov     r8d, [rdx+8]
.text:000000018000164D 41 83 F8 01             cmp     r8d, 1
.text:0000000180001651 76 0C                   jbe     short loc_18000165F
.text:0000000180001653 41 8D 40 FF             lea     eax, [r8-1]
.text:0000000180001657 89 42 08                mov     [rdx+8], eax
.text:000000018000165A 48 83 C4 28             add     rsp, 28h
.text:000000018000165E C3                      retn
.text:000000018000165F loc_18000165F:
.text:000000018000165F E8 84 20 08 00          call    Perl_sv_free2
.text:0000000180001664 loc_180001664:
.text:0000000180001664 48 83 C4 28             add     rsp, 28h
.text:0000000180001668 C3                      retn
.text:0000000180001668 Perl_SvREFCNT_dec endp

svrefdec

I've never liked svrefcountdec's or sv_free2()'s prototype, Its been on my mind since forever that it looks sketchy, and its putting too much logic and bloat in its one billion caller fns. Why can't it read the refcount field itself? why do all callers have to push that refcount arg on the reg-stack? Yes by chance AMD64 register R8 sneaks into the CPU's Win64 prototype and passes thru to callee frame without a dedicated mov op, but that just one cpu and one os. and doesn't apply for Win32 X86. Not going to git dig it, but I think that badly designed 3th arg, is only used for process-exiting, yes proc termination, why do all the caller's need that 3rd arg to be inlined into them?

@bulk88
Copy link
Contributor

bulk88 commented Dec 1, 2024

Another thought, perl does ref == 1; jmp maybe; ref -= 1; or as written above, load; test 1; jmp maybe; sub 1; store;. Much better ways of writing it is load; sub 1; jmp maybe; store; On almost all CPUs, Intel at minimum, sub includes a completely free if or ><== operator. Intel 86 and x64 CPUs are weirdos in microprocessor world that include lea op that does integer math without the free if or ><== operator. Perl should be doing load; sub 1; jmp maybe; store; to be efficient. The 2nd sub op is inefficient. The 2nd 1 operand is inefficient.

cmp 0 always has less bytes of machine code on Intel and often on ARM vs cmp 1. Intel has no operand (0) shorter op encodings for cmp. ARM32/64 machine code language doesn't allow integers over 2^12 (4096) as operands/constants inside 1 op. ARM64 doesn't allow constants at all above 2^16 ARM64's dedicated move constant to register op only has 2 bytes to work with b/c of 4 byte sized ops. ARM32 has unique on earth 2 byte ops, so 2^12 to work with. When you write if(flags & 0x1_0001_0001){} all ARM CCs will do reg = *((U64*)(instruction_register+offset_to_readonly_c_strings_area_of_binary)); for constant 0x1_0001_0001. ARM64 has a dedicated move U16 op, and 0x10011001 becomes mov; mov;, mov; move; cmp;, or mov reg 0x1234; cmp reg & (reg|0x1234); since 3 part operators are becoming a trend in electrical engineering world.

Intel is playing catch up adding 3 part operators to x64 in 2023

https://en.wikipedia.org/wiki/X86#APX_(Advanced_Performance_Extensions)

So perl's sv_ref_dec() should really be doing ref = ref-1; if(ref == 0) {} or doing this which is better than right now perl blead's design, this is worse than perfect, but better ref2 = ref-1; if(ref2 == 0) {} else {ref = ref2;}. ref = ref - 1; on x86-32 is a dedicated no operand/no constant on 32b. No constant ref = ref - 1; was removed for AMD64. But still ref-1 always includes a free 0 test. IDK why perl is writing ref-1 twice in sv_ref_dec(). Once in void context no assignment, then again with assignment. Atleast change something to == 0 instead of == 1 so its using no a operand/no constant operator on x86/x64. sv_free2() can always do ref = ref + 1; on entry if needed for many chained SV*s (rv/av/hv) dtoring mode.

current blead

PERL_STATIC_INLINE void
Perl_SvREFCNT_dec(pTHX_ SV *sv){
    if (LIKELY(sv != NULL)) {
        U32 rc = SvREFCNT(sv);
        if (LIKELY(rc > 1))
            SvREFCNT(sv) = rc - 1;
        else
            Perl_sv_free2(aTHX_ sv, rc);
    }
}

better not perfect

PERL_STATIC_INLINE void
Perl_SvREFCNT_dec(pTHX_ SV *sv){
    if (LIKELY(sv != NULL)) {
        U32 rc = SvREFCNT(sv) - 1;
        if (LIKELY(rc))
            SvREFCNT(sv) = rc;
        else
            Perl_sv_free2(aTHX_ sv, rc);
    }
}

perfection would be just like MS COM ABI does its ref counting since x86/x64 is CISC and includes a free store to RAM, inside sub operator.

PERL_STATIC_INLINE void
Perl_SvREFCNT_dec(pTHX_ SV *sv){
    if (LIKELY(sv != NULL)) {
        U32 rc = --SvREFCNT(sv);
        if (rc  == 0)
            Perl_sv_free2(aTHX_ sv);
    }
}

If sv_free2() needs to ++ it for perl api global memory sanity, oh well just ++ again, don't burden the one billion callers that inline sv_ref_dec() . The ++ is meaningless for perf if your about to call the fat on any OS libc free().

Your not saving anything perf wise trying to skip the assignment to RAM. The cache line is already sitting in secret registers, the cache line addr in DDR ram is already SMP lock-held in the "northbridge". Writes from a core to DDR are already async for decades.

@bulk88
Copy link
Contributor

bulk88 commented Dec 1, 2024

The likely impact of this is some SV freeing is taking longer than it strictly has to. Looking at gcov coverage when running the test harness, there are 35099459 calls to Perl_sv_free and 797851111 calls to Perl_sv_free2, so maybe ~4% of cleared SVs.

The interp has an addiction to malloc mem and mortal stack when C autos are appropriate. examples

https://github.com/Perl/perl5/blob/blead/malloc.c#L2127

[ Why on earth is this not # ifdef rmved on Win32 when Perl's Unix putenv just calls Perl Win32's P5P controlled backend putenv at https://github.com/Perl/perl5/blob/blead/win32/win32.c#L2415 ? why does the backend win32 P5P putenv malloc and copy the string before the searching it? not even knowing if it will ever use the modified 2nd copy !!! why does the perl api act like perl's front end p5p putenv() must accept RO stored const strings when we knows its RW memory and trappable perl exceptions cant happen in syscalls? why does Win32 Perl, even bother converting from Microsoft-ese PP Native putenv API (%ENV) to Unix putenv API, then immediately convert back to Microsoft-ese PP Native putenv API which is 100% compatible with the Win32 putenv kernel C API. Guess its time to write another patch but I have too many PRs up that haven't been merged yet to want to write more PRs. ]

Perl's malloc addiction is is made exponentially worse by khw's semi-recent fixes to serialize and de-race condition Unix and Win32 libc locale API vs interp's locale API vs OS getcwd() and friends.

example https://github.com/Perl/perl5/blob/blead/locale.c#L5206

khw's fixes are constantly making new malloc buffers nested as layers of strings process tools/fn calls get applied to get the final correct behavior needed. All these new malloc buffers are obviously added to save stack or mortal, then tossed/libc free() within a dozen microseconds. Note khw's code is implementing Unix os_foo_r() feature on non-os_foo_r() Unix. Note real Unix os_foo_r() is still using static global buffers just like 1970 Unix, except the static global buffers are per-OS-thread-synchronous-thread-locked using OS TLS. Real Unix will grow the malloc buffer and constantly reuse the malloc buffer, but never free/shrink it (exception for OS thread exit). Perl instead keeps using its mortal/save stack over and over, instead of having "global static" my_perl malloc buffers (+PvCUR+PvMAX) that can grow, but never be freed or shrink, the way Unix getcwd() is implemented (hopefully, you really don't want to look at BSD/GNU backend or /proc or ring 0 kernel source code without heavy drinking beforehand, /proc should be removed from Linux outright unless its a beta build of Ubuntu, your sh.exe can implement it since /proc is really a GUI widget and not an API for serious use and doesn't belong in libc or ring-0, and if /proc offers a data field not available in /var or /dev or libc/Linux-Kernel C API with a struct, thats a Linux kernel defect).

On my Win32 blead perl, if I look through a process memory dump of perl -e"sleep(9999999)", 75% of all malloc() memory regions, are chunks of %ENV and $PATH and @argv. 50% is P5P controlled code reasons, 50% is MS msvcrt.dll controlled code reasons. This 75% is 10s of KBs or maybe 1MB worth of these strings. IDK if they are freed or live malloc blocks when I do the mem snapshots. Remember modern OSes very rarely return free user mode mem pages back to the OS. User-mode WinOS will almost never return a sub 65KB long malloc block back to Win Kernel so I assume I'm looking at freed malloc blocks, but stretching the proc's private pages to 1MB or 4MB just for perl -e"sleep(9999999)" is less than perfect even if later on all 3.5 MB will be eaten up by CV* OPs and SVs for realistic production perl scripts.

Another really big perl XS C api design problem is, perl's keywords are correctly POSIX analogs, but over the last 25 years perl is gaining more and more bug fixes and new features to the PP keywords. Perl's middle layer C code, and XS author facing code/API/func call prototypes, keep adding 1970 Unix, C prototype arguments in new no-CPAN not exported functions, and sometimes in the CPAN-approved API.

1970 Unix API mandates 2 horrible API requirements, #1 all incoming char *s, are assumed to be immutable RO C strings owned by the fn's caller. #2 its sacrilegious desecration of a holy book to spend a precious 1, 2 or 4 bytes to record a strings length in RAM or spinning rust. Pascal is hellbent on genocide of the unix people, its a struggle for survival.

Over the last 25 years Perl's middle C layer keeps implementing/adding more and more code, using 1970 Unix C strings, instead of moving around SV *s, which SOLVE all those 1970 problems. Its just bad to keep extending this, instead of leaving creation of the const char * variable, to the very last moment before perl calls into ring-0/real Linux/real libc. Nobody has ever, and nobody will ever, use libperl.so as a wannabe Boost library, with no my_perl variable allocated. What C fns even work without crashing in libperl.so without an initialized interp struct? So why is new P5P code being added assuming users want to use libperl without perl? SV s need to be passed around the core and exposed to CPAN XS. Not instantly dropping the SV api right in pp_foo() and then calling Perl_my_fix_all_unix_bugs_foo() and all its child Perl_xxx() calls, plus presenting 1970 Unix C APIs to CPAN XS devs.

Perl C API does need some more thinking tho on SV* API for "RO" caller owned SV* arg-mode vs ownership takeover SV* arg-mode. sv.c does some tricks of PV* stealing or SV head stealing SV body stealing or PADTMP array stealing currently, but RO-mode vs ownership takeover-mode is fuzzy in core-only and CPAN facing API design/docs today. ``sv.c``` tricks can be expanded.

ISO C strlen() is not "free" and only 1% time or less in real life will be const-folded/intrinsic/inlined-away. IDK why most GPL-type devs believe GCC can inline away all strlen() calls as if ISO C99 is just an emulation layer ontop of Javascript/Java/Rust/.NET./Perl!!!!

I've profiled gmake as spending 15% of all CPU usage in libc strlen(), and 24% of all CPU usage in its string hash algorithm loop. gmake's code will never write U32 length to a memory address except for its struct gmk_dirent, and will never write U32 hash to a memory address except for its struct GMK_HEK (perl-ese). The rest of CPU usage is for parsing/bsearch/memchr of strings, AST/RB/Binary tree crawling, and calling into MS code/MS DLL C fncs. These 3 CPU time users are probably legit, and difficult to easily optimize. gmake's design is not something to show off as high quality code or follow.

I'd advise for all P5 core devs to once in a while, get a hot cup of coffee, start your C debugger, disable profiling/stepping of libc.so in your C debugger, put a rock on key F11, and watch what perl C code flashes by for a good 3 or 5 minutes while sipping something. And think are those lines of C code "justified" or not. You never know where that train will take you in perl C core. Thats how I write all my misc core PRs.

@richardleach
Copy link
Contributor Author

I really want this issue to just be about the code following /* unrolled SvREFCNT_dec and sv_free2 follows: */. Attempting to shoehorn in critiques about how the core does memory allocations or uses the tmps stack, or how the XS or C APIs work, or what gmake does is unlikely to result in changes to those things. More likely it will just put off engagement with this specific issue.

To achieve changes on those other topics, focussed Issues or Pull Requests (or topics on the mailing list) seem like better vehicles to me.

better not perfect

PERL_STATIC_INLINE void
Perl_SvREFCNT_dec(pTHX_ SV *sv){
    if (LIKELY(sv != NULL)) {
        U32 rc = SvREFCNT(sv) - 1;
        if (LIKELY(rc))
            SvREFCNT(sv) = rc;
        else
            Perl_sv_free2(aTHX_ sv, rc);
    }
}

It's not immediately obvious that this is safe. From looking at Perl_sv_free2, it looks like SvREFCNT_dec could be called on an SV with rc==0. In which case SvREFCNT(sv) - 1 in the suggested rewritten function would wrap and the SV wouldn't be freed as it should be.

@richardleach
Copy link
Contributor Author

@iabyn, it appears that much of this section of code entered the codebase back in 2010 in this commit:

commit 5239d5c4bfde4ec02e1787e9dc9ada189ad868e5
Author:     David Mitchell <[email protected]>
AuthorDate: Fri Oct 8 16:22:42 2010 +0100
Commit:     David Mitchell <[email protected]>
CommitDate: Mon Oct 11 00:41:17 2010 +0100

    make sv_clear() iterate over AVs
    
    In sv_clear(), rather than calling av_undef(), iterate over the AV's
    elements. This is the first stage in making sv_clear() non-recursive,
    and thus non-stack-blowing when freeing deeply nested structures.
...

Can you take a look?

I think that section of code made sense at the time. Changes to SvREFCNT_dec, sv_free, and sv_free2 occurred in 75a9bf9 and probably the lone call to sv_free in question got overlooked at that time.

@richardleach
Copy link
Contributor Author

FWIW This builds and the test harness passes, but it might not be the best way to update this call site:

diff --git a/sv.c b/sv.c
index f0b0b006d4..b7c950506f 100644
--- a/sv.c
+++ b/sv.c
@@ -7125,7 +7125,7 @@ Perl_sv_clear(pTHX_ SV *const orig_sv)
             if (!sv)
                 continue;
             if (!SvREFCNT(sv)) {
-                sv_free(sv);
+                Perl_sv_free2(sv, 0);
                 continue;
             }
             if (--(SvREFCNT(sv)))

@jkeenan
Copy link
Contributor

jkeenan commented Dec 2, 2024

I really want this issue to just be about the code following /* unrolled SvREFCNT_dec and sv_free2 follows: */. Attempting to shoehorn in critiques about how the core does memory allocations or uses the tmps stack, or how the XS or C APIs work, or what gmake does is unlikely to result in changes to those things. More likely it will just put off engagement with this specific issue.

To achieve changes on those other topics, focussed Issues or Pull Requests (or topics on the mailing list) seem like better vehicles to me.

+1

FWIW This builds and the test harness passes, but it might not be the best way to update this call site:

diff --git a/sv.c b/sv.c
index f0b0b006d4..b7c950506f 100644
--- a/sv.c
+++ b/sv.c
@@ -7125,7 +7125,7 @@ Perl_sv_clear(pTHX_ SV *const orig_sv)
             if (!sv)
                 continue;
             if (!SvREFCNT(sv)) {
-                sv_free(sv);
+                Perl_sv_free2(sv, 0);
                 continue;
             }
             if (--(SvREFCNT(sv)))

For me this passed on an unthreaded build on Linux using both gcc and clang as the compiler -- but make failed on threaded builds with both gcc and clang -- albeit with different error output -- on the same platform.

Tail of make output:

$ clang --version
Ubuntu clang version 18.1.3 (1ubuntu1)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

$ sh ./Configure -des -Dusedevel -Duseithreads -Dcc=clang && make test_harness
...
clang -c -DPERL_CORE -D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -std=c99 -O2 -Wall -Werror=pointer-arith -Werror=vla -Wextra -Wno-long-long -Wno-declaration-after-statement -Wc++-compat -Wwrite-strings sv.c
sv.c:7128:36: error: too few arguments to function call, expected 3, have 2
 7128 |                 Perl_sv_free2(sv, 0);
      |                 ~~~~~~~~~~~~~      ^
./proto.h:4701:1: note: 'Perl_sv_free2' declared here
 4701 | Perl_sv_free2(pTHX_ SV * const sv, const U32 refcnt);
      | ^             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
make: *** [makefile:265: sv.o] Error 1
$ gcc --version
gcc (Ubuntu 13.2.0-23ubuntu4) 13.2.0

$ sh ./Configure -des -Dusedevel -Duseithreads -Dcc=gcc && make test_harness
...
gcc -c -DPERL_CORE -D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -std=c99 -O2 -Wall -Werror=pointer-arith -Werror=vla -Wextra -Wno-long-long -Wno-declaration-after-statement -Wc++-compat -Wwrite-strings -Wno-use-after-free sv.c
sv.c: In function ‘Perl_sv_clear’:
sv.c:7128:31: warning: passing argument 1 of ‘Perl_sv_free2’ from incompatible pointer type [-Wincompatible-pointer-types]
 7128 |                 Perl_sv_free2(sv, 0);
      |                               ^~
      |                               |
      |                               SV * {aka struct sv *}
In file included from sv.c:32:
perl.h:224:22: note: expected ‘PerlInterpreter *’ {aka ‘struct interpreter *’} but argument is of type ‘SV *’ {aka ‘struct sv *’}
  224 | #  define pTHX  tTHX my_perl PERL_UNUSED_DECL
      |                      ^
perl.h:229:25: note: in expansion of macro ‘pTHX’
  229 | #  define pTHX_         pTHX,
      |                         ^~~~
proto.h:4701:15: note: in expansion of macro ‘pTHX_’
 4701 | Perl_sv_free2(pTHX_ SV * const sv, const U32 refcnt);
      |               ^~~~~
sv.c:7128:17: error: too few arguments to function ‘Perl_sv_free2’
 7128 |                 Perl_sv_free2(sv, 0);
      |                 ^~~~~~~~~~~~~
In file included from perl.h:6160:
proto.h:4701:1: note: declared here
 4701 | Perl_sv_free2(pTHX_ SV * const sv, const U32 refcnt);
      | ^~~~~~~~~~~~~
make: *** [makefile:265: sv.o] Error 1

@mauke
Copy link
Contributor

mauke commented Dec 2, 2024

Those compiler error both indicate that 2 arguments were passed where 3 were expected. Basically, aTHX_ is missing:

Perl_sv_free2(aTHX_ sv, 0);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants