Skip to content

Fix: libxml manual memory management #15906

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

ysbaddaden
Copy link
Contributor

@ysbaddaden ysbaddaden commented Jun 17, 2025

libxml allows to customize the memory allocators, which we used to plug the GC.

Under certain conditions (e.g. MT, libxml 2.14), this integration leads to segfaults when a GC cycle happens within a libxml function.

I'm not quite sure what's happening. Maybe libxml keeps pointers somewhere that the GC doesn't scan (thread locals?) and the GC collects them... though that would create random segfaults, not segfaults during GC. Anyway: removing the GC integration fixes the issue.

In addition, the libxml2 distributed in macOS 15.4 is patched to remove the custom memory allocators API :trollface:

Other bindings to external libraries in Crystal don't plug the GC but use manual memory management to free the external allocations when not needed anymore (automated through finalizers).

The difficulty is that we allocate the whole DOM tree with libxml. We could use a libxml parser to build a DOM tree with Crystal objects, and it would be fantastic, but then we'd have to reimplement XPath 😰

This patch replaces the GC integration with explicit memory management instead.

For readers, writers, or xpath contexts, we merely free the libxml allocations in finalizers.

Tree nodes are more complex. The xml free functions will recursively free the whole subtree; a node might be unlinked and thus won't be freed with the rest of the document anymore, etc.

We can store a reference to the XML::Node of documents into the libxml node (using the _private struct member) because the XML::Node will live exactly as long as the libxml doc.

We can't do that for subtree nodes that may be collected at any time and would leave dangling pointers. We could not care (and allow multiple XML::Node to represent one libxml node for example, since collecting the doc will free the libxml node) but a XML::Node can be unlinked from the main tree, and must be manually freed. I thus introduced a dual reference:

  1. Every XML::Node has a strong reference to its document's XML::Node (can't free document while we have references to a subtree node);
  2. Every document XML::Node keeps a cache with weak references to all its subtree XML::Node (only those that have been instantiated).

A XML::Node is thus unique per libxml node, yet can be collected when unused without freeing some of the document, after which we can instantiate a new XML::Node.

NOTE: I was able to avoid breaking changes, especially in constructors, but I still undocumented the constructors that take an external libxml pointer, or expose internal details. I assume they should be internal API. We should extend the XML integration, for example with DOM manipulation (following the DOM spec, not the libxml API) so nobody has to extend the libxml integration themselves.

Depends on #15899.
Closes #15619 (to be verified on macOS).

On Linux: no more segfaults with libxml 2.14 when the crystal code is compiled, however the interpreter still segfaults and prints error messages to STDERR.

No issues with libxml 2.13 and below.

@ysbaddaden ysbaddaden force-pushed the fix/libxml-manual-memory-management branch from 905d839 to 0100017 Compare June 19, 2025 12:48
@ysbaddaden
Copy link
Contributor Author

It's probably not fully safe. There's nothing preventing to dup a XML::Node document. We can have multiple XML::Node instances for the same tree node, for example multiple #xpath requests, a call to #children and #next and so on.

What if we take a XML::Node twice: we call Node#unlink from one instance and we lose the reference, the GC will finalize and free the LibXML::Node, then we try to access the second node => 💥

@ysbaddaden
Copy link
Contributor Author

What's the _private struct member for LibXML structs? it's always NULL... could we... use it for application data 🤔

@HertzDevil
Copy link
Contributor

HertzDevil commented Jun 19, 2025

It says:

    /**
     * @deprecated Use xmlCtxtGetPrivate() and xmlCtxtSetPrivate()
     *
     * For user data, libxml won't touch it
     */

Another one:

    /** Application data. Often used by language bindings. */

@ysbaddaden
Copy link
Contributor Author

ysbaddaden commented Jun 19, 2025

I tried to use the _private struct member to store the corresponding XML::Node reference, so we only ever have one XML::Node instance per LibXML::Node* but we can't safely nullify the _private when we lose the XML::Node because xmlFreeDoc or xmlFreeNode can run before a finalizer and will have freed the LibXML::Node* and we get malloc errors or segfaults.

Edit: only a Hash(LibXML::Node*, WeakRef(XML::Node)) seems safe.

Edit: or just make sure we can't dup a the XML::Node of a document (so we don't free twice), or can't unlink a node twice... with the difficulty that different XML::Node for the same LibXML::Node* might try to unlink (or future proof: it got added to another doc) 😭

@ysbaddaden
Copy link
Contributor Author

ysbaddaden commented Jun 19, 2025

I may have found a solution (no more issues):

  1. Allocate a document Node once and store it into _private; this is fine because the libxml node lives as long as the document XML::Node (the latter is the one that frees the former).

  2. Allocate each subtree Node once and keep a cache of libxml node to Node references per document, so we're sure that every unlinked node will share the same Node and will be unlinked once without having to access the libxml node that may have already been freed.

And... I'm just noticing an issue: collecting an unlinked node doesn't clean up the unlinked node (nor its subtree), and at best we're leaking some entries in the cache (not cool), and at worst new LibXML::Node* reusing the same malloc 💥

I could cleanup the cache in the finalizer, at least zero the Node address to avoid mutating the Hash, but there's no telling a thread wasn't mutating or rehashing the hash 😭

Oh, we keep the unlinked node reference to the document, so either we collect everything or we collect nothing and thus never collect an unlinked node before its document, so we're not leaking 🎉

The complexity will be when we adopt the node into another document (not supported): we'll have to move the unlinked node (and its subtree) from the original doc's cache into the new doc's cache.

Edit: but... if I don't need to do any cleanup from the cache, and we collect everything or nothing, then I can just put every instantiated Node in _private, and that would simplify adopting a Node into another document (only need to update the @document of sub nodes)

Edit 2: Nope. Without a strong reference from doc Node to subtree nodes we can lose a reference to a XML::Node!

  • With _private the XML::Node finalizer must nullify _private, but we can't because it may have been freed 💥.
  • With a blind cache (no pointers) the finalizer must clear each XML::Node from the cache (optimization: unless the libxml doc has been freed).
  • If the cache has pointers, Boehm complains that finalizers have dual references (node -> doc -> cache -> node).
  • This leaves us with WeakRef, maybe?

Edit 3: we indeed needed WeakRef 🎉

The main advantage is a much simpler interface. The disadvantages are:

1. `XML::NodeSet` can now only hold 2.1 billion entries because
   `Slice#size` is an Int32; it shouldn't be an issue in practice.

2. `XML::Node#children` will now allocate an `XML::Node` for every child
   node, but will only allocate once, not on every access.
@ysbaddaden ysbaddaden force-pushed the fix/libxml-manual-memory-management branch from 623c758 to 22edf3f Compare June 19, 2025 18:53
@ysbaddaden ysbaddaden force-pushed the fix/libxml-manual-memory-management branch from 5b55479 to 83bd1dc Compare June 20, 2025 10:21
@ysbaddaden ysbaddaden force-pushed the fix/libxml-manual-memory-management branch from 83bd1dc to 2790e21 Compare June 20, 2025 10:33
@ysbaddaden ysbaddaden marked this pull request as ready for review June 20, 2025 10:34
@ysbaddaden
Copy link
Contributor Author

ysbaddaden commented Jun 20, 2025

I updated the PR description to reflect the current state.

I'm putting the PR back into DRAFT because there's still an issue: we can't free the libxml of an unlinked node because it will free its subtree, yet we might still have live XML::Node to them 😭

@ysbaddaden ysbaddaden marked this pull request as draft June 20, 2025 12:45
That would free its subtree nodes, but we might have live XML::Node
referencing these nodes. We thus have to delay until the document is
unreachable, at which point we can free everything at once.
@ysbaddaden ysbaddaden marked this pull request as ready for review June 20, 2025 13:31
@ysbaddaden
Copy link
Contributor Author

ysbaddaden commented Jun 23, 2025

Argh, XML::Node#content= also unlinks the children nodes for element and attribute nodes: is even worse: it immediately frees the subtree for document fragments, elements and attribute nodes 😱

https://gnome.pages.gitlab.gnome.org/libxml2/html/tree_8h.html#adb743abd3a548d61e4a40df29c441e30

@straight-shoota
Copy link
Member

XML::Node#content= could unlink all children first?

@ysbaddaden
Copy link
Contributor Author

ysbaddaden commented Jun 23, 2025

Yep, I'm doing that. I think xmlSetProp might also be an issue: we do expose the XML::Node for a LibXML::Attr* 😭

Confirmed: xmlSetProp -> xmlSetPropNs -> xmlFreeNodeList(prop->children). Same for xmlUnsetProp: it unlinks and frees the prop.

Edit: it was already an issue if GC.free frees the allocation, which I believe Boehm does?

The document only has weak references to its nodes, so a node may be
collected before the document... but if the node was unlinked the
document's finalizer wouldn't know about the node anymore, and it leaked
the libxml node.

Fixes the issue by keeping a list of unlinked libxml nodes.
This is unsafe because the nodes and/or any descendant node might still
be referenced by a XML::Node. This patch explicitly unlinks the nodes so
they won't be freed until the document is collected.
@ysbaddaden
Copy link
Contributor Author

I fixed the unlinked nodes issue, as well as xmlSetContent, xmlSetProp and xmlUnsetProp freeing the nodes recursively.

@straight-shoota straight-shoota added this to the 1.17.0 milestone Jun 24, 2025
@straight-shoota straight-shoota merged commit 7f0f956 into crystal-lang:master Jun 26, 2025
38 checks passed
@ysbaddaden ysbaddaden deleted the fix/libxml-manual-memory-management branch June 26, 2025 08:14
@ysbaddaden
Copy link
Contributor Author

ysbaddaden commented Jun 26, 2025

The interpreter still segfaulting may be due to the compiler being linked to libLLVM that links with libxml2.so.2, but the interpreter is trying to load libxml.so.16 (ABI bump in 2.14) and we end up with conflicts?

I got a warning when trying to link the std specs with both LLVM and libxml about this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Segfault when using XML module on MacOS 15.4
4 participants