Skip to content

XML: modernize API when available & workaround issues with legacy versions #15899

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

ysbaddaden
Copy link
Contributor

@ysbaddaden ysbaddaden commented Jun 13, 2025

Uses the libxml per context error handlers when available.

For example we can always use it for XML::Reader since it's available since at least libxml 2.9 (released in 2012), that becomes the default expected libxml version.

Sadly the other per context error handlers only appeared in libxml 2.13 (released in 2024) so we can't assume they exist, but can still use them when compiling against this version.

This patch adds a libxml version detection using pkg-config. We can specify the LIBXML_VERSION environment variable to target another release if the runtime version will be different (though it is highly recommended to use libxml 2.13+), or for the Windows MSVC target.

On older libxml releases, errors can be raised per context (e.g. xml reader) and through the structured error handler (now deprecated) and the older generic (long deprecated).

For these older releases, this patch still sets the globals (actually thread locals), be they error handlers or some xml save configuration, but saves them when the fiber might yield, restoring the default handlers; it eventually restores the globals when the fiber is resumed, so the current thread will see the values previously configured for the fiber, whatever if another fiber did something else, or the fiber got resumed by another thread (execution context).

NOTE: libxml 2.14 almost always segfaults during GC while running spec/std/xml, while the 2.9 to 2.13 releases are fine. fixed in #15906.

Use the new per-context error handlers when available (libxml 2.13 and
newer) that now report errors properly (and enforces NOERROR).

Tries to avoid thread safety issues in libxml 2.9 to 2.12 by resetting
the default error handlers while the current fiber may swapcontext, for
example while doing IO, and setting them back afterwards. This will at
least avoid leaking error handlers in the thread locals while we
switched to another fiber that may mess with these (oops), and makes
sure to restore them properly, even if the fiber is resumed to another
thread (no lost references).

Deprecates the public API for generic and structured error collect
handlers. They're unsafe and musn't be used!
Use the new per-context indentation configuration from libxml 2.14 when
availmable, otherwise tries to avoid concurrency and parallelism issues
by saving the globals before a potential fiber swapcontext and restoring
them afterwards.
@ysbaddaden ysbaddaden self-assigned this Jun 13, 2025
line_number = LibXML.xmlTextReaderLocatorLineNumber(locator)
raise Error.new(msg_str, line_number)
end
LibXML.xmlTextReaderSetStructuredErrorHandler(@reader, ->Error.structured_callback, Box.box(@errors))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I notice this is a breaking change... but I don't understand why "read from string" should raise, while "read from io" should collect errors instead 🤨

@@ -487,38 +487,68 @@ class XML::Node
end
end

# :nodoc:
SAVE_MUTEX = ::Mutex.new
Copy link
Contributor Author

@ysbaddaden ysbaddaden Jun 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't need the mutex anymore and it would still not restore the xmlIndentTreeOutput global thread local when the fiber is resumed by another thread.

@ysbaddaden ysbaddaden marked this pull request as ready for review June 16, 2025 15:35
ysbaddaden added a commit to ysbaddaden/crystal that referenced this pull request Jun 17, 2025
)

XML::Reader: use xmlTextReaderSetStructuredErrorHandler

libxml: add bindings to new bindings in 2.13 and 2.14 (improved thread safety)

libxml 2.13+ never report errors when NOERROR option is set

Modernize the error handling & improve thread safety

Use the new per-context error handlers when available (libxml 2.13 and
newer) that now report errors properly (and enforces NOERROR).

Tries to avoid thread safety issues in libxml 2.9 to 2.12 by resetting
the default error handlers while the current fiber may swapcontext, for
example while doing IO, and setting them back afterwards. This will at
least avoid leaking error handlers in the thread locals while we
switched to another fiber that may mess with these (oops), and makes
sure to restore them properly, even if the fiber is resumed to another
thread (no lost references).

Deprecates the public API for generic and structured error collect
handlers. They're unsafe and musn't be used!

Modernize XML generation + improve fiber/thread issues

Use the new per-context indentation configuration from libxml 2.14 when
availmable, otherwise tries to avoid concurrency and parallelism issues
by saving the globals before a potential fiber swapcontext and restoring
them afterwards.

Fix: review suggestions + format
@straight-shoota straight-shoota added this to the 1.17.0 milestone Jun 17, 2025
@straight-shoota straight-shoota merged commit 4505f4f into crystal-lang:master Jun 18, 2025
38 checks passed
@ysbaddaden ysbaddaden deleted the fix/libxml-error-handlers branch June 19, 2025 07:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants