SILE and XML - A Return on Experience #2111
Point 1 above, I forgot to say, tangentially relates to #1957. I might be wrong but one seldom find |
This was referenced Oct 20, 2024
So Saith Simon:
And we obviously mention similar things in the User Manual and other places (such as our Wiki, etc.)
But is that actually true?
And what sort of challenges might a package designer face when trying to use SILE as a typesetter for "arbitrary" XML files? I have some experience with this,1 and I'd like to share it with you.
SILE ships with an XML streaming parser (Luaexpat, a.k.a. lxp), and merely converts XML elements and their attributes to SILE syntax tree nodes, as a straightforward "exact" mapping (tags become commands, attributes become options, text nodes become content).
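To make the "exact" mapping concrete, here is a minimal sketch in Python rather than SILE's Lua, using the standard library's Expat binding. The node shape (a dict with `command`, `options` and `content` keys) and the function name `parse_to_tree` are assumptions for illustration, not SILE's actual data structures.

```python
import xml.parsers.expat


def parse_to_tree(xml_text):
    """Build a nested tree where each XML element becomes a 'command' node."""
    # Dummy holder node so the real root element has a parent to attach to.
    holder = {"command": None, "options": {}, "content": []}
    stack = [holder]

    def start(name, attrs):
        # Tag -> command, attributes -> options.
        node = {"command": name, "options": dict(attrs), "content": []}
        stack[-1]["content"].append(node)
        stack.append(node)

    def end(name):
        stack.pop()

    def chars(data):
        content = stack[-1]["content"]
        # Expat may deliver one text run in several chunks; merge them.
        if content and isinstance(content[-1], str):
            content[-1] += data
        else:
            content.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return holder["content"][0]


tree = parse_to_tree('<chapter id="c1"><title>One</title> text</chapter>')
```

Nothing in this sketch knows what a `chapter` or a `title` means: every interpretation is left to whatever consumes the tree.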
It doesn't do anything special with the XML, and notably:
[...] (xmlns, xmlns:xxx) specifically, though that is also a part of the equation; but rather to conflicts between different input documents. Let's call the former "internal namespacing" and the latter "external namespacing".
Thoughts on document models
The problem in this space is that XML is a very broad format, with different applications.
But in practice, there are two different types of XML content (possibly, and actually often, mixed in a given schema/model, depending on context):
This is a first-order approximation, but it's a useful distinction.2
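As a rough sketch of why the distinction matters for whitespace, the Python snippet below drops whitespace-only text inside elements declared structural and leaves mixed content untouched. The `STRUCTURAL` set and the element names (`entry`, `form`, `orth`, `def`) are hypothetical and only serve the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set: elements whose direct children are pure structure.
STRUCTURAL = {"entry", "form"}


def drop_structural_whitespace(elem):
    """Remove whitespace-only text found directly inside structural elements."""
    if elem.tag in STRUCTURAL:
        if elem.text is not None and not elem.text.strip():
            elem.text = None
        for child in elem:
            if child.tail is not None and not child.tail.strip():
                child.tail = None
    # Recurse everywhere; non-structural elements keep all their spaces.
    for child in elem:
        drop_structural_whitespace(child)
    return elem


doc = ET.fromstring(
    "<entry>\n  <form>\n    <orth>elementary</orth>\n  </form>\n"
    "  <def><strong>text</strong> is <em>good,</em> always</def>\n</entry>"
)
drop_structural_whitespace(doc)
cleaned = ET.tostring(doc, encoding="unicode")
```

The indentation between `entry`, `form` and `orth` disappears, while the spaces around "is" and "always" inside `def` survive, since that element is treated as presentational.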
So beyond its naive parser, what does SILE do for us here?
Not much, so:
[...] SU.ast.walkContent() (original, deeply recursive, not seen in any code base), and SU.ast.processAsStructure() (which I introduced recently and used in my modules), but they fail on some non-trivial cases, where one has to loop on the content manually, anyway...3
[...] (csl/core/utils/xmlparser in the current PR), with a few customizable "rules", albeit restricted to a minimal set of features (spaces and namespaces). We could go further and generalize the idea, importing it into the XML inputter, with a few add-ons (e.g. better content appropriation strategies).
Thoughts on external namespacing
It should be obvious that complex enough schemas will have conflicts with other schemas, including SILE's own.
Say a document encodes chapters as <chapter><title> ... </title><body> ... </body></chapter>. We'd like to use our existing book class, but wait... We need to save our chapter command, which has a wholly different structure, and use that saved version in our re-implementation. Ah. But then other documents are in trouble. And it's a lot of potential command saving/restoring, notwithstanding 3rd-party package expectations...
In my above-mentioned approach (#2082), I've used a simple "namespace" mechanism (= read a "prefix"). It's not really used for CSL (which is processed differently), but heh! I just imported the idea from my other in-progress projects in SILE.
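The prefix idea can be sketched as follows, in Python and under loud assumptions: the `commands` registry, `register_schema`, `dispatch` and the `mybook` prefix are all hypothetical names, and this is not the mechanism implemented in #2082. The point is only that schema handlers live under prefixed keys, so the class's own `chapter` never has to be saved and restored.

```python
# Hypothetical registry standing in for the commands a class has registered.
commands = {
    "chapter": lambda content: f"[book chapter: {content}]",
}


def register_schema(prefix, handlers):
    """Register a schema's element handlers under 'prefix:name' keys."""
    for name, handler in handlers.items():
        commands[f"{prefix}:{name}"] = handler


def dispatch(prefix, tag, content):
    """Resolve a document tag through its schema prefix only."""
    return commands[f"{prefix}:{tag}"](content)


# The schema's <chapter> handler can reuse the class's own chapter command
# because the two no longer share a name.
register_schema("mybook", {
    "chapter": lambda content: commands["chapter"](content.upper()),
})

from_schema = dispatch("mybook", "chapter", "intro")
from_class = commands["chapter"]("intro")
```

Both commands coexist, so a second document using the plain book class is unaffected.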
Thoughts on internal namespacing
A document could include, say, an SVG not wrapped in a CDATA (uh-oh), but simply with a namespace declared on the root element, or actually any element.
That's the most usual way, but note the namespace prefix doesn't even have to be svg, it could be anything...
So we'd need a special provision for explicitly namespaced elements (Luaexpat has some options, but I am not sure they are what we'd want here)...
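To illustrate why the prefix itself cannot be relied upon, the sketch below uses Python's `xml.etree.ElementTree` (not Luaexpat), which resolves prefixes to `{namespace-uri}local-name`. The three sample documents and the helper `namespaced_tags` are made up for the example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Same SVG content, three different ways of declaring the namespace.
doc_a = '<doc><svg:svg xmlns:svg="http://www.w3.org/2000/svg"><svg:rect/></svg:svg></doc>'
doc_b = '<doc><svg xmlns="http://www.w3.org/2000/svg"><rect/></svg></doc>'
doc_c = '<doc xmlns:x="http://www.w3.org/2000/svg"><x:svg><x:rect/></x:svg></doc>'


def namespaced_tags(xml_text):
    """Return the resolved names of all elements that carry a namespace."""
    root = ET.fromstring(xml_text)
    return [el.tag for el in root.iter() if el.tag.startswith("{")]


resolved = [namespaced_tags(d) for d in (doc_a, doc_b, doc_c)]
```

All three documents resolve to the same element names, so any dispatch on namespaced content would have to key on the namespace URI rather than on the literal prefix found in the source.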
Notes on the root element
By the way, currently, the XML inputter wraps the parse tree in a document command node if it does not start with a <document> tag,4 with no class (plain applies) and no clean strategy for loading the necessary tag support (enforcing a class from the command line? Dubious at best...; using a wrapper document is better, but not straightforward; a preamble is possible too...).5
Thoughts on paragraphing
I'm less advanced in my thoughts here, but I have a deep feeling that paragraphing done at the typesetter's level (typesetter.parseppattern) is inherently wrong, and that it should be done at the inputter's level.
To be honest, I even feel the newPar/endPar typesetter hooks into the class are not that great, and that we should have a different general approach to this. Our syntax tree is not even really an AST. The latter would have explicit paragraph nodes, where appropriate. (I have a few ideas on this, but I'll keep them for another time.)
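Purely as an illustration of what "explicit paragraph nodes" could look like, and not as the design hinted at above, here is a small Python sketch that splits a mixed content list on blank lines at input time. The node shape, the `paragraph` command name and the blank-line rule are all assumptions.

```python
import re


def split_paragraphs(content):
    """Turn blank-line-separated mixed content into explicit 'paragraph' nodes."""
    paragraphs, current = [], []

    def flush():
        if current:
            paragraphs.append({"command": "paragraph", "content": list(current)})
            current.clear()

    for item in content:
        if not isinstance(item, str):
            current.append(item)  # inline command nodes stay in the open paragraph
            continue
        parts = re.split(r"\n[ \t]*\n\s*", item)
        for i, part in enumerate(parts):
            if i > 0:
                flush()  # a blank line closes the current paragraph
            # Skip empty strings and whitespace that would start a paragraph.
            if part and (current or part.strip()):
                current.append(part)
    flush()
    return paragraphs


emphasis = {"command": "em", "content": ["one"]}
result = split_paragraphs(["First ", emphasis, ".\n\nSecond."])
```

With a tree shaped like this, a later stage would receive paragraph boundaries as structure instead of rediscovering them from a pattern in the text stream.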
Concluding remarks
Thanks for reading this far, if you made it. I hope I've given you some food for thought, and that you'll consider these points in the future. I'm happy to discuss any of these points further.
Footnotes
In my now >3-year involvement with SILE, I've implemented parsers and/or processors for 2 subsets of TEI XML (dictionaries, critical apparatus), a substantial portion of 2 biblical scripture XML schemas (USX, and a prior attempt at USFX), CSL (locales, styles), and an attempt at reviewing SILE's DocBook support. In the same time frame, I have not seen any other SILE developer or user attempt to use SILE as a typesetter for other arbitrary XML documents. ↩
For instance, <strong>text</strong> is <em>good,</em> always is presentational, and all spaces are significant.
On the other hand, ... is structural, and spaces or linebreaks are not significant. Order might be...
... is also structural at the "form" level, but the order of elements is not significant, and the rendering might reshuffle them as, say: "elementary /ˌɛləˈmɛntəri/ (el-emen-tary)". This is just to show that the developer already needs fairly complex code strategies to implement this, and anything that helps alleviate the burden would be welcome. One already has a lot of SU.findInTree()/SU.removeFromTree() calls to make, several inputfilter or string-parsing passes to involve, and that's just the beginning...
EDIT: feat: DocBook class overhaul #1789 for docbook support is far from complete... #1338 is full of such syntax tree ad-hoc operations, far beyond decency. ↩
See also lists package: enforceListType precludes XML handling #2073, with attempts to use a "schema" (SILE's lists) for another (HTML lists), and the difficulties encountered in the process. I promised a discussion in my comments: Well, this is it. ↩
Or <sile> tag, see also Top level tags differ between XML/TeX flavors #508, an old discussion that also points towards a sensible resolution... ↩
So we have many "workarounds". But in reality, we might be more frequently in the "wrapper" scenario for most real-world cases. Personally, I'd use a "master document" (cool re·sil·ient stuff), for metadata, book covers, etc. so the XML would end up just as an included fragment. (My approach even here https://github.com/Freely-Given-org/BibleTypesetter/pull/3) ↩