Fix: libxml manual memory management #15906
905d839 to 0100017
It's probably not fully safe. There's nothing preventing to What if we take a |
What's the |
/**
* @deprecated Use xmlCtxtGetPrivate() and xmlCtxtSetPrivate()
*
* For user data, libxml won't touch it
*/
/** Application data. Often used by language bindings. */
I tried to use the Edit: only a Edit: or just make sure we can't |
I may have found a solution (no more issues):
Edit 2: Nope. Without a strong reference from the doc Node to subtree nodes, we can lose a reference to an XML::Node!
Edit 3: we indeed needed |
The main advantage is a much simpler interface. The disadvantages are:
1. `XML::NodeSet` can now only hold 2.1 billion entries because `Slice#size` is an Int32; it shouldn't be an issue in practice.
2. `XML::Node#children` will now allocate an `XML::Node` for every child node, but will only allocate once, not on every access.
623c758 to 22edf3f
5b55479 to 83bd1dc
83bd1dc to 2790e21
I updated the PR description to reflect the current state.
That would free its subtree nodes, but we might still have live XML::Node instances referencing these nodes. We thus have to delay until the document is unreachable, at which point we can free everything at once.
Argh, https://gnome.pages.gitlab.gnome.org/libxml2/html/tree_8h.html#adb743abd3a548d61e4a40df29c441e30
Yep, I'm doing that. I think Confirmed: xmlSetProp -> xmlSetPropNs -> xmlFreeNodeList(prop->children). Same for Edit: it was already an issue if |
The document only has weak references to its nodes, so a node may be collected before the document. But if the node was unlinked, the document's finalizer no longer knew about it, and the libxml node leaked. This fixes the issue by keeping a list of unlinked libxml nodes.
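To illustrate the "list of unlinked nodes" idea, here is a minimal sketch in plain C. It does not use libxml2: `node_t`, `doc_t`, `doc_unlink` and `doc_free` are hypothetical stand-ins, not the types or functions from this patch, and `freed_count` exists only to make the behaviour observable. The point is just that unlinking frees nothing, and that everything is released once, when the document itself goes away.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a libxml node; NOT the real xmlNode layout. */
typedef struct node {
    struct node *parent;
    struct node *first_child;
    struct node *next_sibling;
    int unlinked;              /* 1 once recorded in the document's list */
} node_t;

/* Hypothetical document: the tree root plus the subtrees unlinked from it. */
typedef struct {
    node_t *root;
    node_t **unlinked;
    size_t unlinked_len;
    size_t unlinked_cap;
} doc_t;

size_t freed_count = 0;        /* instrumentation only */

/* Allocate a node and, if a parent is given, prepend it to its children. */
node_t *node_new(node_t *parent) {
    node_t *n = calloc(1, sizeof *n);
    if (!n) abort();
    if (parent) {
        n->parent = parent;
        n->next_sibling = parent->first_child;
        parent->first_child = n;
    }
    return n;
}

/* Recursively free a node and its whole subtree. */
void node_free_tree(node_t *n) {
    node_t *child = n->first_child;
    while (child) {
        node_t *next = child->next_sibling;
        node_free_tree(child);
        child = next;
    }
    free(n);
    freed_count++;
}

/* Detach a node from its parent and remember it; nothing is freed here. */
void doc_unlink(doc_t *doc, node_t *n) {
    if (n->unlinked) return;   /* already recorded: avoid a double free later */
    if (n->parent) {
        node_t **link = &n->parent->first_child;
        while (*link != n) link = &(*link)->next_sibling;
        *link = n->next_sibling;
        n->parent = NULL;
        n->next_sibling = NULL;
    } else if (doc->root == n) {
        doc->root = NULL;      /* the root itself was unlinked */
    }
    if (doc->unlinked_len == doc->unlinked_cap) {
        size_t cap = doc->unlinked_cap ? doc->unlinked_cap * 2 : 4;
        node_t **grown = realloc(doc->unlinked, cap * sizeof *grown);
        if (!grown) abort();
        doc->unlinked = grown;
        doc->unlinked_cap = cap;
    }
    doc->unlinked[doc->unlinked_len++] = n;
    n->unlinked = 1;
}

/* What a document finalizer would do: free the tree, then every unlinked subtree. */
void doc_free(doc_t *doc) {
    if (doc->root) node_free_tree(doc->root);
    for (size_t i = 0; i < doc->unlinked_len; i++)
        node_free_tree(doc->unlinked[i]);
    free(doc->unlinked);
    doc->root = NULL;
    doc->unlinked = NULL;
    doc->unlinked_len = doc->unlinked_cap = 0;
}
```

The sketch deliberately ignores re-linking an unlinked node into the tree, which a real implementation would also need to handle (by removing it from the list again).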
This is unsafe because the nodes and/or any descendant node might still be referenced by an XML::Node. This patch explicitly unlinks the nodes so they won't be freed until the document is collected.
I fixed the unlinked nodes issue, as well as xmlSetContent, xmlSetProp and xmlUnsetProp freeing the nodes recursively.
The interpreter still segfaulting may be due to the compiler being linked to libLLVM that links with I got a warning when trying to link the std specs with both LLVM and libxml about this. |
libxml allows customizing its memory allocators, which we used to plug in the GC.
Under certain conditions (e.g. MT, libxml 2.14), this integration leads to segfaults when a GC cycle happens within a libxml function.
I'm not quite sure what's happening. Maybe libxml keeps pointers somewhere that the GC doesn't scan (thread locals?) and the GC collects them... though that would create random segfaults, not segfaults during GC. Anyway: removing the GC integration fixes the issue.
In addition, the libxml2 distributed with macOS 15.4 is patched to remove the custom memory allocators API.
Other bindings to external libraries in Crystal don't plug the GC but use manual memory management to free the external allocations when not needed anymore (automated through finalizers).
The difficulty is that we allocate the whole DOM tree with libxml. We could use a libxml parser to build a DOM tree with Crystal objects, and it would be fantastic, but then we'd have to reimplement XPath 😰
This patch replaces the GC integration with explicit memory management.
For readers, writers, or XPath contexts, we merely free the libxml allocations in finalizers.
Tree nodes are more complex. The xml free functions will recursively free the whole subtree; a node might be unlinked and thus won't be freed with the rest of the document anymore, etc.
We can store a reference to the XML::Node of documents into the libxml node (using the `_private` struct member) because the XML::Node will live exactly as long as the libxml doc.
We can't do that for subtree nodes, which may be collected at any time and would leave dangling pointers. We could ignore that (and allow multiple XML::Node to represent one libxml node, for example, since collecting the doc will free the libxml node), but an XML::Node can be unlinked from the main tree, and must then be manually freed. I thus introduced a dual reference:
An XML::Node is thus unique per libxml node, yet can be collected when unused without freeing some of the document, after which we can instantiate a new XML::Node.
NOTE: I was able to avoid breaking changes, especially in constructors, but I still undocumented the constructors that take an external libxml pointer, or expose internal details. I assume they should be internal API. We should extend the XML integration, for example with DOM manipulation (following the DOM spec, not the libxml API) so nobody has to extend the libxml integration themselves.
Depends on #15899.
Closes #15619 (to be verified on macOS).
On Linux: no more segfaults with libxml 2.14 when the Crystal code is compiled; however, the interpreter still segfaults and prints error messages to STDERR.
No issues with libxml 2.13 and below.