forked from tatuylonen/wiktextract
-
Notifications
You must be signed in to change notification settings - Fork 0
/
NOTES-processing-order.text
45 lines (38 loc) · 1.97 KB
/
NOTES-processing-order.text
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
PROBLEM:
- templates are transcluded before parsing tables
- templates are also transcluded before processing and matching HTML tags
(start tag may come from a different template than end tag)
- templates do occur inside HTML tags generating attributes
- templates do occur inside attribute positions in tables
POSSIBLE APPROACHES TO SOLVE IT:
- process file in two phases, first processing templates and then rest
-> parse tree will not contain template nodes
-> need to define template expansion/capture functions
-> noinclude etc need to be handled specially, e.g. using regexps
-> template processing cannot refer to the subtitles it is contained in
(particularly problematic for wiktextract)
REJECT
- identify which templates contain unbalanced structures and process those
first; postpone other templates for parsing and later processing.
Assume templates occur in certain positions in tables, inside HTML tags, etc.
Expand them before parsing in just those positions. Additionally expand
any macros that generate
- unbalanced HTML tags
- table delimiters |-, |+, |, !, ||, !! at top level
-> will allow most templates in parse tree
-> some templates (semi-arbitrarily) will not be captured
-> noinclude etc need to be handled specially, e.g. using regexps
-> would probably be sufficient for wiktextract
- dynamically add rows and columns to tables while expanding templates
-> difficult to implement
-> does not handle all cases, e.g., templates inside HTML tags
REJECT
- perform initial parse, and then re-parse tables after template expansion
-> does not allow recursive expansion strategy for text generation like
currently done
-> does not handle all cases, e.g., templates inside HTML tags
REJECT
Processing <noinclude> etc specially before any parsing would probably
make sense. This way at least they are out of the way from confusing
the hierarchy.
Removing HTML comments specially before parsing would simplify many things