-*- mode: markup; -*-
* Code Challenge: Markup
* The task
The challenge, as accepted, was to provide:
# \i{The source code to a parser that, given the contents of a file in
whatever form is natural for your language of choice—perhaps as an
open stream or a string—returns a data structure representing the
document tree as described in the [Markup specification] if the
file is syntactically correct.}
You will find the code to what I would consider a reasonable parser for
Markup provided in src/lib. Specifically, the Perl modules
\code{Markup::Tokenizer} and \code{Markup::Parser}.
# \i{English language documentation for the data structure returned by
the parser sufficient to allow a programmer familiar with your
implementation language to write a back-end that generates some
kind of output from the results returned by the parser.}
Descriptions of the data structures returned by Markup::Parser may
be found in the pod\note{pod, or 'Plain Old Documentation', is a
standard form of inline documentation used throughout Perl. It can be
viewed using the \code{perldoc} utility like so:

   % perldoc Markup::Parser
   % perldoc Markup::Tree

in a location where perldoc can find the module in question.}
documentation.
A more detailed description will be provided towards the end of this
document.
# \i{The source code to a sample back-end that generates well-formed
XML according to the “Trivial XML backend” section of the
spec.\note{We don’t use XML because we have any great love for it
but because it is a reasonable lowest-common-denominator format.}}
A back-end meeting the above criteria can be found in the Perl
module \code{Markup::Backend::XML}. An example of using the back-end:

   use Markup::Parser;
   use Markup::Tokenizer;
   use Markup::Backend::XML;

   my $links     = 1;    # enable link processing (cf. --no-links below)
   my $tokenizer = Markup::Tokenizer->new(links => $links);
   my $parser    = Markup::Parser->new(tokenizer => $tokenizer);
   my $backend   = Markup::Backend::XML->new();

   # code to retrieve $source

   my $tree   = $parser->parse($source);
   my $string = $backend->string($tree);
# \i{The source code to a driver program that, given the name of a
file, parses the file and then generates a corresponding XML file.}
The file markup.pl in src should fulfill this particular requirement.
- In a pipe:

   % cat test | markup.pl | less

- To stdout:

   % markup.pl test | less

- To a file:

   % markup.pl test -o test.xml
You may also pass it the argument \code{--no-links} to turn off link
processing as required by the spec.
# \i{Any libraries you use beyond those that are normally part of an
installation of your chosen programming language.}
Everything used is core Perl according to \code{Module::CoreList}.
# \i{Any build instructions or scripts needed to build the driver
program, or instructions for how to run it if it requires no separate
building.}
Since this is Perl, there really aren't any build instructions. The
shebang\note{For the users of that other operating system out there, the
shebang is the \code{#!/usr/bin/perl} line at the top of the script. It
tells the system what program to use as an interpreter for the file.}
assumes that perl is in /usr/bin. Test cases can be run by calling
\code{prove} in the root directory of the project. It should be able
to locate the test files located in the t directory.
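For example, from the root of the project:

   % prove t/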
# \i{Optionally, any notes about your experience implementing this
code: how you came up with your design, blind alleys you went up,
or surprising problems you ran into. Or anything else you want to
share.}
The commit log for this repository contains a good deal of these. I will
definitely be fleshing things out a bit more here as well.
* Architectural Structure
The processor is broken down into four components, two of which
contain the AST generation code, \code{Markup::Tokenizer} and
\code{Markup::Parser}, one being the output back-end,
\code{Markup::Backend::XML}, and the last being the \code{markup.pl}
driver script. With the exception of \code{markup.pl}, each of these
represents a pass through a representation of the source data. The
entirety of the source data is loaded into a scalar variable and
parsed as a whole to simplify the parsing process.
** Markup::Tokenizer
The tokenizer is really a lexer in the spirit of lex. It uses an
array of prioritized regular expressions to match individual tokens
within a string of text. Once a token is matched, the matched text is
removed from the beginning of the string and a new token search is
started. If no token can be found at the start of the string, the
tokenizer attempts to find the earliest match for any token that it
can match. Anything between the current start of the string and that
token is considered a raw text token.
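As a rough illustration, the matching loop works along these lines (a
minimal sketch; the pattern names and regular expressions are
placeholders, not the actual \code{Markup::Tokenizer} tables):

   use strict;
   use warnings;

   # Placeholder priority table; the real tokenizer's patterns differ.
   my @patterns = (
       [ 'ESCAPE',          qr/\\/   ],
       [ 'TAG_BLOCK_START', qr/\{/   ],
       [ 'TAG_BLOCK_END',   qr/\}/   ],
       [ 'PARAGRAPH_END',   qr/\n\n/ ],
   );

   sub tokenize {
       my ($source) = @_;
       my @tokens;

       while (length $source) {
           my $matched = 0;

           # Try each pattern, in priority order, at the very start
           # of the remaining source.
           for my $p (@patterns) {
               my ($name, $re) = @$p;
               if ($source =~ s/\A($re)//) {
                   push @tokens, [ $name, $1 ];
                   $matched = 1;
                   last;
               }
           }
           next if $matched;

           # No token starts here: find the earliest point where any
           # pattern matches and emit the text before it as raw text.
           my $earliest = length $source;
           for my $p (@patterns) {
               if ($source =~ $p->[1] and $-[0] < $earliest) {
                   $earliest = $-[0];
               }
           }
           push @tokens, [ 'TEXT', substr($source, 0, $earliest, '') ];
       }

       return @tokens;
   }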
Potential tokens are compared to a list of simple "must follow" rules
to determine if a given token is legal in the current location. If a
rule is found for the given token, it is compared to the previously
matched token. Tokens which are found to not follow a token listed in
their rule are either deleted, as in the case of an emacs mode line,
or converted into a text token. When completed, \code{tokenize} will
return an array of arrays, each containing a token symbol followed by
the text matched to produce that symbol.
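Concretely, the result has this shape (the symbols and text shown are
illustrative):

   my @tokens = tokenize($source);

   # Each element is a [SYMBOL, text] pair, for example:
   #   [ 'TAG_BLOCK_START', '{'   ]
   #   [ 'TEXT',            'foo' ]
   #   [ 'TAG_BLOCK_END',   '}'   ]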
** Markup::Parser
Initially, designing a parser for Markup seemed to pose several rather
annoying problems. One being that the language appeared to have
elements of context that would prevent a reasonable representation of
the grammar that could be understood by yacc or bison. A more
detailed analysis of the spec by a more schooled eye might reach a
different conclusion. As such, it seemed logical to build a recursive
descent style parser\note{Assuming that I have the definition right.}
rather than look for parser generator tools.
Parsing is primarily handled by looking up each initial token in a
hash and mapping those tokens to a more specific processing function.
This helps to keep the primary parsing loop free of extraneous
\code{if ... elsif ... elsif ... else} combinations and greatly
simplifies the overall parsing logic. Each specific parsing function uses a
combination of rewriting tokens in the token stream and recursion as a
means of handling the various Markup structures. For instance,
\code{\[} is converted from a \code{LINK_BLOCK_START} token into
\code{ESCAPE}, text, and \code{TAG_BLOCK_START} tokens. This allows
for reuse of the tagged markup code for links rather than rewriting
special logic for them.
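A minimal sketch of that dispatch style, with hypothetical handler
names rather than the actual \code{Markup::Parser} internals:

   # Hypothetical token-to-handler map; the real table is larger.
   my %parse_function = (
       'HEADER'           => \&_parse_header,
       'INDENT'           => \&_parse_indent,
       'LINK_BLOCK_START' => \&_parse_link,
       'TAG_BLOCK_START'  => \&_parse_tag,
   );

   sub _parse_internal {
       my ($self, $tokens) = @_;
       while (@$tokens) {
           my $symbol  = $tokens->[0][0];
           # Fall back to plain text handling for unmapped tokens.
           my $handler = $parse_function{$symbol} || \&_parse_text;
           # Each handler consumes (or rewrites) tokens in the stream.
           $self->$handler($tokens);
       }
   }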
Headers and indent sections are matched as single tokens by the
tokenizer and parsed further within the parser. Trying to match them
as multiple tokens turned out to be more problematic than it was
worth. Initial designs attempted to treat two and three space
combinations as separate tokens. Counting these tokens in the parser
to determine a degree of indention required a rather complex while
loop which did little more than determine the length of the initially
matched block of white space. Header tags caused a similar problem as
they were initially treated as two kinds of tokens: a '*' token and a
'* ' token.
Rather than try to deal with the added complexity of treating them as
separate tokens, it made more sense to treat indentation as a single
token. A "can follow" rule in the tokenizer and code to determine the
length of the matched text virtually eliminated the parsing complexity.
Similar code, accounting for the extra space at the end of a header,
greatly simplified header parsing.
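Both computations reduce to something along these lines (a sketch;
\code{$indent_text} and \code{$header_text} are hypothetical names for
the matched token text):

   # A single INDENT token carries all the leading whitespace, so
   # the degree of indentation is just the length of the matched text.
   my $depth = length $indent_text;               # e.g. '   ' -> 3

   # Headers work the same way, minus the trailing space: the header
   # level is the number of stars in a token like '** '.
   my ($stars) = $header_text =~ /\A(\*+) \z/;
   my $level   = length $stars;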
Empty paragraph tokens are parsed as normal and consumed when appending to
the body of a \code{Markup::Tree} object. Any attempt to deal with
these in the parser directly would require checking the state
of the node currently being constructed in every place where a new
child node is added. As such, doing the empty check in
\code{Markup::Tree::append_node} greatly simplifies the calling code.
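A hedged sketch of what that check amounts to (the real
\code{Markup::Tree::append_node} may differ in detail):

   sub append_node {
       my ($self, $node) = @_;

       # Quietly drop empty paragraph nodes so that no caller ever
       # has to check for them before appending (assuming paragraph
       # nodes are named 'p').
       return if ref $node
              && $node->name eq 'p'
              && !@{ $node->body };

       push @{ $self->body }, $node;
   }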
** Markup::Backend::XML
The trivial XML back-end is meant solely as a proof of concept and a
means of testing the entire processing stack. Its output is designed
to match the test XML documents provided with these instructions as
closely as possible. The one exception to this rule has to do with
the way that links are processed: link_def elements are all on one
line rather than spread out over several.
As a general rule, constructing an output document involves walking a
given \code{Markup::Tree} in a recursive manner. Each tree node has a
\code{body} member which may contain either a text scalar or a reference to a
child \code{Markup::Tree} node. Child nodes are processed by calling
\code{string} recursively, while text nodes are escaped and added to
the output document.
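A simplified sketch of that walk, assuming an \code{escape_xml} helper
and a \code{string($node, $depth)} signature (the actual back-end
differs in detail, e.g. in its handling of the \code{inline} and
\code{subdocument} flags described next):

   sub string {
       my ($self, $node, $depth) = @_;
       $depth ||= 0;
       my $pad = '    ' x $depth;
       my $out = $pad . '<' . $node->name . ">\n";

       for my $child (@{ $node->body }) {
           if (ref $child) {
               # A child Markup::Tree node: recurse one level deeper.
               $out .= $self->string($child, $depth + 1);
           }
           else {
               # A text scalar: escape it and emit it in place.
               $out .= $pad . '    ' . escape_xml($child) . "\n";
           }
       }

       return $out . $pad . '</' . $node->name . ">\n";
   }

   sub escape_xml {
       my ($text) = @_;
       $text =~ s/&/&amp;/g;
       $text =~ s/</&lt;/g;
       $text =~ s/>/&gt;/g;
       return $text;
   }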
Indentation is handled via a string of spaces constructed based on the
current call depth. Nodes with the \code{inline} flag set do not
receive indentation or newlines. Tags with the \code{subdocument}
flag set append an extra layer of indentation to avoid skewing the
indentation of the parent node.
** Markup::Tree
\code{Markup::Tree} contains state for \code{Markup::Parser} and the AST
representation of a parsed Markup document. The internal members
\code{indent}, \code{text}, and \code{node} are used by the parser to
maintain state while processing. \code{indent} stores the current degree
of indentation while \code{text} and \code{node} are a representation of
the node being constructed. Once a node is completed, the \code{body}
member holds the children of the current node, and \code{name} is
self-explanatory.
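To give a concrete picture, a finished tree for a small document might
look roughly like this (shown as plain hashes; the actual objects are
blessed \code{Markup::Tree} instances and the names are illustrative):

   {
       name => 'body',
       body => [
           {
               name => 'h1',
               body => [ 'The task' ],
           },
           {
               name => 'p',
               body => [ 'The challenge, as accepted, was to provide...' ],
           },
       ],
   }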
Using \code{Markup::Tree} in this manner does pollute the AST with data
members that have very little to do with document structure. The benefit
of not having to create two data structures in parallel does seem to
outweigh the added noise.
** Markup::Base
While this code could be considered "reinventing the wheel", the few features
required by this parser did not warrant pulling in \code{Moose} or another
OO framework. \code{Markup::Base} attempts to provide a reasonable default
constructor, a means of requiring certain arguments, and a means of specifying
default values for fields.
The \code{AUTOLOAD} method is a special function in Perl which handles calls
to undefined methods. By explicitly turning off a portion of the strict pragma
with \code{no strict 'refs';}, it is possible to use a scalar as a reference value.
In combination with Perl typeglobs, this allows us to modify the symbol table for
a given package (which represents a class definition) and create field accessor
methods at runtime. The \code{lvalue} attribute also allows the created accessor
methods to be assigned to as though they were variables rather than functions.
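A minimal sketch of the technique (simplified; the real
\code{Markup::Base} also handles required arguments and default
values):

   package Markup::Base;
   use strict;
   use warnings;

   our $AUTOLOAD;

   sub new {
       my ($class, %args) = @_;
       return bless {%args}, $class;
   }

   sub DESTROY { }    # keep destructor calls away from AUTOLOAD

   sub AUTOLOAD : lvalue {
       my $name = $AUTOLOAD;
       $name =~ s/.*:://;    # strip the package qualifier

       {
           # Typeglob assignment installs a named lvalue accessor in
           # the symbol table, so later calls bypass AUTOLOAD entirely.
           no strict 'refs';
           *{$AUTOLOAD} = sub : lvalue { $_[0]->{$name} };
       }

       $_[0]->{$name};    # the final expression is the lvalue
   }

which permits assignments such as:

   my $obj = Markup::Base->new(name => 'body');
   $obj->name = 'section';    # accessor used as an lvalue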
* Implementation Notes
Source Statistics:

    127   371  2328  ./src/lib/Markup/Tree.pm
    116   316  2431  ./src/lib/Markup/Backend/XML.pm
     74   210  1591  ./src/lib/Markup/Tester.pm
     54   158   982  ./src/lib/Markup/Util.pm
    219   766  5333  ./src/lib/Markup/Tokenizer.pm
    653  1831 13732  ./src/lib/Markup/Parser.pm
    139   334  2588  ./src/lib/Markup/Base.pm
    124   318  2371  ./src/markup.pl
     72   125   880  ./t/make_tests.pl
   1578  4429 32236  total
Commit Count: 128
While the commit count will be out of date by the time I finish this
document, the rest of the numbers should be pretty accurate. It's
somewhat amazing to me how little code there actually is in this
parser. The entirety of the lexing code fits in 219 lines with added
noise from pod. I'm very certain that there is someone who could do
it in less and that it probably has been done in significantly less.
It's still amazing to me how little perl you need to accomplish
things.
One of the things that I had wanted to do here was build a parser that
didn't use a parser generator. I've built a few flex/bison based
parsers and interpreters in the past but have not constructed any
non-trivial parsers by hand. Of course, as I stated above, the Markup
specification did appear somewhat daunting when I tried to think of it
in terms of BNF. More than anything, I had a mental
block around trying to define a syntactic ruleset which used arbitrary
whitespace. If I were to try it now, I might be able to come up with
something, having examined the rules in detail.
De-denting also proved to be an interesting issue. Keeping track of the
current degree of indentation and how much the next line has changed in
relation to the current one is not all that difficult. Determining
exactly where in the parsing process you should de-dent and when to
de-dent without special indentation is an entirely different matter.
One thing that I really like about working with perl is the native
hash implementation. Java has them as a container class, and I know there
are C and C++ implementations that can be considered canonical, but I really
like not having to think about them in great detail. It makes certain
kinds of things very easy to accomplish.
For instance, perl does not have the equivalent of the switch/case
construct that other languages do. My preferred method of
accomplishing the same thing is to use a hash as a lookup table for
function refs. I'm not really sure how they compare in performance
to a chain of \code{if ... elsif} constructs, but they do make things
a heck of a lot simpler to read. Compare the v1.1 and v1.0 tags in my
git repository. The \code{_parse_internal} function goes from being a
rather opaque \code{if ... elsif} chain to a reasonably simple loop.
Once you understand what the loop does, you barely have to think about
it anymore and can use the \code{%parse_function} hash to see how each
token is mapped.
As to the rest of the structure in \code{Markup::Parser}, I'm not really
sure how much I like it. The portions that do token rewriting are a
little tricky to understand if you just look at them cursorily. For
instance, the order of tokens being unshifted is the reverse of what
you might initially expect. Since \code{unshift} behaves equivalently
to a \code{push} onto the front of an array, the tokens have to be in
that order.
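A quick illustration of why (hypothetical token names):

   my @tokens = ([ 'TEXT', 'foo' ]);

   # unshift places its whole argument list at the front in list
   # order, so a single call keeps the tokens in reading order...
   unshift @tokens, [ 'ESCAPE', '\\' ], [ 'TAG_BLOCK_START', '{' ];

   # ...whereas separate calls must be made in reverse order to
   # produce the same stream.
   my @stream = ([ 'TEXT', 'foo' ]);
   unshift @stream, [ 'TAG_BLOCK_START', '{' ];
   unshift @stream, [ 'ESCAPE', '\\' ];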
There are also some places where returning through multiple layers of
the call stack necessitated adding in a throwaway token. The reasoning
behind this might not be immediately obvious unless you can keep track of
the state of the processing loop in multiple layers of the call stack.
Overall, I'd say this was a pretty interesting build. It gave me a chance
to try out a few things I hadn't done before.