Fix: libxml manual memory management #15906

ysbaddaden · 2025-06-17T09:28:25Z

libxml allows to customize the memory allocators, which we used to plug the GC.

Under certain conditions (e.g. MT, libxml 2.14), this integration leads to segfaults when a GC cycle happens within a libxml function.

I'm not quite sure what's happening. Maybe libxml keeps pointers somewhere that the GC doesn't scan (thread locals?) and the GC collects them... though that would create random segfaults, not segfaults during GC. Anyway: removing the GC integration fixes the issue.

In addition, the libxml2 distributed in macOS 15.4 is patched to remove the custom memory allocators API

Other bindings to external libraries in Crystal don't plug the GC but use manual memory management to free the external allocations when not needed anymore (automated through finalizers).

The difficulty is that we allocate the whole DOM tree with libxml. We could use a libxml parser to build a DOM tree with Crystal objects, and it would be fantastic, but then we'd have to reimplement XPath 😰

This patch replaces the GC integration with explicit memory management instead.

For readers, writers, or xpath contexts, we merely free the libxml allocations in finalizers.

Tree nodes are more complex. The xml free functions will recursively free the whole subtree; a node might be unlinked and thus won't be freed with the rest of the document anymore, etc.

We can store a reference to the XML::Node of documents into the libxml node (using the _private struct member) because the XML::Node will live exactly as long as the libxml doc.

We can't do that for subtree nodes that may be collected at any time and would leave dangling pointers. We could not care (and allow multiple XML::Node to represent one libxml node for example, since collecting the doc will free the libxml node) but a XML::Node can be unlinked from the main tree, and must be manually freed. I thus introduced a dual reference:

Every XML::Node has a strong reference to its document's XML::Node (can't free document while we have references to a subtree node);
Every document XML::Node keeps a cache with weak references to all its subtree XML::Node (only those that have been instantiated).

A XML::Node is thus unique per libxml node, yet can be collected when unused without freeing some of the document, after which we can instantiate a new XML::Node.

NOTE: I was able to avoid breaking changes, especially in constructors, but I still undocumented the constructors that take an external libxml pointer, or expose internal details. I assume they should be internal API. We should extend the XML integration, for example with DOM manipulation (following the DOM spec, not the libxml API) so nobody has to extend the libxml integration themselves.

~~Depends on #15899.~~
Closes #15619 (to be verified on macOS).

On Linux: no more segfaults with libxml 2.14 when the crystal code is compiled, however the interpreter still segfaults and prints error messages to STDERR.

No issues with libxml 2.13 and below.

ysbaddaden · 2025-06-19T12:55:14Z

It's probably not fully safe. There's nothing preventing to dup a XML::Node document. We can have multiple XML::Node instances for the same tree node, for example multiple #xpath requests, a call to #children and #next and so on.

What if we take a XML::Node twice: we call Node#unlink from one instance and we lose the reference, the GC will finalize and free the LibXML::Node, then we try to access the second node => 💥

ysbaddaden · 2025-06-19T12:59:27Z

What's the _private struct member for LibXML structs? it's always NULL... could we... use it for application data 🤔

HertzDevil · 2025-06-19T13:07:44Z

It says:

    /**
     * @deprecated Use xmlCtxtGetPrivate() and xmlCtxtSetPrivate()
     *
     * For user data, libxml won't touch it
     */

Another one:

    /** Application data. Often used by language bindings. */

ysbaddaden · 2025-06-19T15:53:17Z

I tried to use the _private struct member to store the corresponding XML::Node reference, so we only ever have one XML::Node instance per LibXML::Node* but we can't safely nullify the _private when we lose the XML::Node because xmlFreeDoc or xmlFreeNode can run before a finalizer and will have freed the LibXML::Node* and we get malloc errors or segfaults.

Edit: only a Hash(LibXML::Node*, WeakRef(XML::Node)) seems safe.

Edit: or just make sure we can't dup a the XML::Node of a document (so we don't free twice), or can't unlink a node twice... with the difficulty that different XML::Node for the same LibXML::Node* might try to unlink (or future proof: it got added to another doc) 😭

ysbaddaden · 2025-06-19T18:36:35Z

I may have found a solution (no more issues):

Allocate a document Node once and store it into _private; this is fine because the libxml node lives as long as the document XML::Node (the latter is the one that frees the former).
Allocate each subtree Node once and keep a cache of libxml node to Node references per document, so we're sure that every unlinked node will share the same Node and will be unlinked once without having to access the libxml node that may have already been freed.

And... I'm just noticing an issue: collecting an unlinked node doesn't clean up the unlinked node (nor its subtree), and at best we're leaking some entries in the cache (not cool), and at worst new LibXML::Node* reusing the same malloc 💥

~~I could cleanup the cache in the finalizer, at least zero the Node address to avoid mutating the Hash, but there's no telling a thread wasn't mutating or rehashing the hash 😭~~

Oh, we keep the unlinked node reference to the document, so either we collect everything or we collect nothing and thus never collect an unlinked node before its document, so we're not leaking 🎉

~~The complexity will be when we adopt the node into another document (not supported): we'll have to move the unlinked node (and its subtree) from the original doc's cache into the new doc's cache.~~

Edit: but... if I don't need to do any cleanup from the cache, and we collect everything or nothing, then I can just put every instantiated Node in _private, and that would simplify adopting a Node into another document (only need to update the @document of sub nodes)

Edit 2: Nope. Without a strong reference from doc Node to subtree nodes we can lose a reference to a XML::Node!

With _private the XML::Node finalizer must nullify _private, but we can't because it may have been freed 💥.
With a blind cache (no pointers) the finalizer must clear each XML::Node from the cache (optimization: unless the libxml doc has been freed).
If the cache has pointers, Boehm complains that finalizers have dual references (node -> doc -> cache -> node).
This leaves us with WeakRef, maybe?

Edit 3: we indeed needed WeakRef 🎉

The main advantage is a much simpler interface. The disadvantages are: 1. `XML::NodeSet` can now only hold 2.1 billion entries because `Slice#size` is an Int32; it shouldn't be an issue in practice. 2. `XML::Node#children` will now allocate an `XML::Node` for every child node, but will only allocate once, not on every access.

src/xml/node.cr

ysbaddaden · 2025-06-20T12:45:25Z

I updated the PR description to reflect the current state.

~~I'm putting the PR back into DRAFT because there's still an issue: we can't free the libxml of an unlinked node because it will free its subtree, yet we might still have live XML::Node to them 😭~~

That would free its subtree nodes, but we might have live XML::Node referencing these nodes. We thus have to delay until the document is unreachable, at which point we can free everything at once.

src/xml/node.cr

ysbaddaden · 2025-06-23T16:52:06Z

Argh, XML::Node#content= ~~also unlinks the children nodes for element and attribute nodes:~~ is even worse: it immediately frees the subtree for document fragments, elements and attribute nodes 😱

https://gnome.pages.gitlab.gnome.org/libxml2/html/tree_8h.html#adb743abd3a548d61e4a40df29c441e30

straight-shoota · 2025-06-23T17:13:51Z

XML::Node#content= could unlink all children first?

ysbaddaden · 2025-06-23T17:16:53Z

Yep, I'm doing that. I think xmlSetProp might also be an issue: we do expose the XML::Node for a LibXML::Attr* 😭

Confirmed: xmlSetProp -> xmlSetPropNs -> xmlFreeNodeList(prop->children). Same for xmlUnsetProp: it unlinks and frees the prop.

Edit: it was already an issue if GC.free frees the allocation, which I believe Boehm does?

The document only has weak references to its nodes, so a node may be collected before the document... but if the node was unlinked the document's finalizer wouldn't know about the node anymore, and it leaked the libxml node. Fixes the issue by keeping a list of unlinked libxml nodes.

This is unsafe because the nodes and/or any descendant node might still be referenced by a XML::Node. This patch explicitly unlinks the nodes so they won't be freed until the document is collected.

ysbaddaden · 2025-06-23T18:38:18Z

I fixed the unlinked nodes issue, as well as xmlSetContent, xmlSetProp and xmlUnsetProp freeing the nodes recursively.

src/xml/node.cr

ysbaddaden · 2025-06-26T11:55:17Z

The interpreter still segfaulting may be due to the compiler being linked to libLLVM that links with libxml2.so.2, but the interpreter is trying to load libxml.so.16 (ABI bump in 2.14) and we end up with conflicts?

I got a warning when trying to link the std specs with both LLVM and libxml about this.

ysbaddaden self-assigned this Jun 17, 2025

ysbaddaden added kind:refactor breaking-change topic:stdlib:serialization labels Jun 17, 2025

ysbaddaden mentioned this pull request Jun 17, 2025

XML: modernize API when available & workaround issues with legacy versions #15899

Merged

ysbaddaden added 2 commits June 19, 2025 09:17

XML: manual memory management of libxml objects

8dfbed6

XML: undocument constructors and drop XML::Node.new(node)

7053f50

ysbaddaden force-pushed the fix/libxml-manual-memory-management branch from 905d839 to 0100017 Compare June 19, 2025 12:48

ysbaddaden added 2 commits June 19, 2025 20:53

Fix: make sure to allocate a XML::Node once per LibXML::Node*

22edf3f

ysbaddaden force-pushed the fix/libxml-manual-memory-management branch from 623c758 to 22edf3f Compare June 19, 2025 18:53

Fix: avoid breaking change in XML::Node.new

2263aab

ysbaddaden force-pushed the fix/libxml-manual-memory-management branch from 5b55479 to 83bd1dc Compare June 20, 2025 10:21

Fix: use weak references in document node's cache

2790e21

ysbaddaden force-pushed the fix/libxml-manual-memory-management branch from 83bd1dc to 2790e21 Compare June 20, 2025 10:33

ysbaddaden marked this pull request as ready for review June 20, 2025 10:34

Sija reviewed Jun 20, 2025

View reviewed changes

src/xml/node.cr Outdated Show resolved Hide resolved

src/xml/node.cr Outdated Show resolved Hide resolved

Fix: removed unused XML::Node#unlinked?

b36b5e1

ysbaddaden commented Jun 20, 2025

View reviewed changes

src/xml/node.cr Show resolved Hide resolved

Fix: avoid breaking change in NodeSet.new

fdb2901

ysbaddaden marked this pull request as draft June 20, 2025 12:45

Fix: can't free unlinked nodes while document is reachable

7202519

That would free its subtree nodes, but we might have live XML::Node referencing these nodes. We thus have to delay until the document is unreachable, at which point we can free everything at once.

ysbaddaden marked this pull request as ready for review June 20, 2025 13:31

ysbaddaden commented Jun 20, 2025

View reviewed changes

src/xml/node.cr Outdated Show resolved Hide resolved

ysbaddaden mentioned this pull request Jun 23, 2025

Extract XML::Document from XML::Node #15920

Open

straight-shoota requested changes Jun 23, 2025

View reviewed changes

src/xml/node.cr Outdated Show resolved Hide resolved

src/xml/node.cr Outdated Show resolved Hide resolved

src/xml/node.cr Outdated Show resolved Hide resolved

ysbaddaden added 2 commits June 23, 2025 20:24

Fix: xmlSetContent, xmlSetProp and xmlUnsetProp immediately free nodes

d4f0262

This is unsafe because the nodes and/or any descendant node might still be referenced by a XML::Node. This patch explicitly unlinks the nodes so they won't be freed until the document is collected.

straight-shoota reviewed Jun 23, 2025

View reviewed changes

src/xml/node.cr Outdated Show resolved Hide resolved

src/xml/node.cr Outdated Show resolved Hide resolved

ysbaddaden added 3 commits June 24, 2025 10:59

Fix: don't unlink twice in XML::Node#content=

414068e

Fix: remove the XML::Node#unlinked property

5244f5f

Fix: don't free unlinked node adopted into another doc

a6126c8

straight-shoota reviewed Jun 24, 2025

View reviewed changes

src/xml/node.cr Outdated Show resolved Hide resolved

straight-shoota reviewed Jun 24, 2025

View reviewed changes

src/xml/node.cr Outdated Show resolved Hide resolved

ysbaddaden mentioned this pull request Jun 24, 2025

XML: improve the DOM interface #15915

Open

ysbaddaden added 3 commits June 24, 2025 14:04

Mark internal details as :nodoc:

d91d2cd

Undocument the #to_unsafe methods (internal API)

1376bbc

More details around safety rules for internal XML::Node properties

6ada9e9

straight-shoota approved these changes Jun 24, 2025

View reviewed changes

straight-shoota added this to the 1.17.0 milestone Jun 24, 2025

straight-shoota merged commit 7f0f956 into crystal-lang:master Jun 26, 2025
38 checks passed

straight-shoota added kind:breaking and removed breaking-change kind:refactor labels Jun 26, 2025

ysbaddaden deleted the fix/libxml-manual-memory-management branch June 26, 2025 08:14

ysbaddaden mentioned this pull request Jun 26, 2025

Incompatible libxml changes in upcoming Crystal release toddsundsted/libxml_ext#1

Open

Uh oh!

Fix: libxml manual memory management #15906

Fix: libxml manual memory management #15906

Uh oh!

Conversation

ysbaddaden commented Jun 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ysbaddaden commented Jun 19, 2025

Uh oh!

ysbaddaden commented Jun 19, 2025

Uh oh!

HertzDevil commented Jun 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ysbaddaden commented Jun 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ysbaddaden commented Jun 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ysbaddaden commented Jun 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ysbaddaden commented Jun 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

straight-shoota commented Jun 23, 2025

Uh oh!

ysbaddaden commented Jun 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ysbaddaden commented Jun 23, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ysbaddaden commented Jun 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

ysbaddaden commented Jun 17, 2025 •

edited

Loading

HertzDevil commented Jun 19, 2025 •

edited

Loading

ysbaddaden commented Jun 19, 2025 •

edited

Loading

ysbaddaden commented Jun 19, 2025 •

edited

Loading

ysbaddaden commented Jun 20, 2025 •

edited

Loading

ysbaddaden commented Jun 23, 2025 •

edited

Loading

ysbaddaden commented Jun 23, 2025 •

edited

Loading

ysbaddaden commented Jun 26, 2025 •

edited

Loading