-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathps01.scm
241 lines (219 loc) · 11.9 KB
/
ps01.scm
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
;;; Ben Zinberg
;;; Regex combinators
;;; (Assignment for 6.945 Advanced Symbolic Programming)
;;; Write-up. (This file contains answers to the accompanying questions in
;;; ps.txt. Most of the actual code is found in regexp.scm and its includes.)
(load "regexp.scm")
;;; PROBLEM 1
(define (r:+ expr)
(r:repeat 1 #f expr))
(define (r:* expr)
(r:repeat 0 #f expr))
#|
(pp (r:grep* (r:seq (r:quote "I ")
(r:* (r:quote "really "))
(r:quote "like Scheme"))
"tests.txt"))
("[16]. I like Scheme"
"[17]. I really like Scheme"
"[18]. I really really really like Scheme"
"[20]. I really really really really really like Scheme"
"[22]. I really really really really really really really really really like Scheme")
(pp (r:grep* (r:seq (r:quote "I ")
(r:+ (r:quote "really "))
(r:quote "like Scheme"))
"tests.txt"))
("[17]. I really like Scheme"
"[18]. I really really really like Scheme"
"[20]. I really really really really really like Scheme"
"[22]. I really really really really really really really really really like Scheme")
(pp (r:grep* (r:seq (r:quote "I ")
(r:+ (r:alt (r:quote "really ")
(r:quote "truly ")))
(r:quote "like Scheme"))
"tests.txt"))
("[17]. I really like Scheme"
"[18]. I really really really like Scheme"
"[20]. I really really really really really like Scheme"
"[21]. I really truly really like Scheme"
"[22]. I really really really really really really really really really like Scheme")
|#
;;; PROBLEM 2
;;;
;;; a. The body of (r:repeat 0 1 expr) will itself call (r:repeat 0 1 expr),
;;; resulting in an infinite loop.
;;;
;;; b. Bonnie's code will produce much shorter regular expressions, and they
;;; will be more human readable mostly for that reason. We should also note
;;; that having extremely large regular expressions may cause efficiency
;;; problems, both for storage and for matching. For Bonnie's suggestion, less
;;; new code will be required: we can just substitute (r:? expr) for (r:alt
;;; expr "") directly, whereas implementing Alyssa's suggestion would likely
;;; involve explicitly the dirty details of copying the base string several
;;; times, and tacking on expr to the end of these copies various numbers of
;;; times. The code might need comments explaining this convoluted strategy.
;;;
;;; c. Ben's code uses BRE syntax, whereas Alyssa's and Bonnie's use ERE syntax.
;;; (BRE is supposedly more basic and should be supported by more programs.
;;; This might become important if we try to unmarry ourselves from using grep
;;; for string matching.) The regular expressions coming out of Ben's method
;;; are shorter than all the others. A human can actually see what the meaning
;;; is immediately when reading an interval expression, rather than reading a
;;; complicated alternation construct and saying "Oh, this is their way of
;;; conveying intervals." Also, the number of additional characters tacked on
;;; by a repeat is small, whereas it was on the order of (max-min) * length of
;;; expr for Bonnie's suggestion, and even bigger for Alyssa's suggestion.
(define (r:repeat min max expr)
(string-append "\\("
expr
"\\{"
(number->string min)
","
(if max
(number->string max)
"")
"\\}"
"\\)"))
#|
(pp (r:grep* (r:seq (r:quote "I ")
(r:repeat 2 5 (r:quote "really "))
(r:quote "like Scheme"))
"tests.txt"))
("[18]. I really really really like Scheme"
"[20]. I really really really really really like Scheme")
(pp (r:grep* (r:seq (r:quote "I ")
(r:repeat 2 #f (r:quote "really "))
(r:quote "like Scheme"))
"tests.txt"))
("[18]. I really really really like Scheme"
"[20]. I really really really really really like Scheme"
"[22]. I really really really really really really really really really like Scheme")
(pp (r:grep* (r:seq (r:quote "I ")
(r:repeat 2 5 (r:alt (r:quote "really ")
(r:quote "truly ")
(r:quote "don't ")))
(r:quote "like Scheme"))
"tests.txt"))
("[18]. I really really really like Scheme"
"[19]. I really really don't really like Scheme"
"[20]. I really really really really really like Scheme"
"[21]. I really truly really like Scheme")
|#
;;; PROBLEM 3
;;;
;;; My strategy is basically an object-oriented or data-directed one.
;;; Following the tag system used in chapter 2 of SICP, I created tagged data
;;; types for the various basic forms and means of combination found in regular
;;; expressions, and define a "render" method for each, as well as a
;;; "render-atomic" method which places parentheses around the object if such
;;; parentheses would be needed for grouping in the context of a larger
;;; expression. There is then a generic "render" procedure (as well as
;;; "render-atomic") which dispatches to the appropriate method, and this
;;; generic procedure is used in some of the methods to render "child elements"
;;; as subexpressions of the parent (e.g., alternation relies on being able to
;;; generically render each of the subexpressions being alternated between).
;;;
;;; Now the task of eliminating unnecessary parentheses comes down to making
;;; sure the renderer knows whether parentheses are necessary. This is the
;;; purpose of having both render and render-atomic: when grouping is
;;; necessary, render-atomic is called, and parentheses are only added if the
;;; grouping needs additional parentheses. So for example, the atomic
;;; rendering of c\{2,3\} is (c\{2,3\}), but the atomic rendering of [ab] is
;;; simply [ab]. Subtle cases such as this one and many more can be seen in
;;; the output of regexptest.scm, which shows all generated regular expressions
;;; and matches for the test cases. See for example the test case "Repeated
;;; repeat."
;;; PROBLEM 4
;;;
;;; My strategy for implementing backreferences is to allow names (symbols) to
;;; be attached to the objects representing regular expressions. When the
;;; regular expression is rendered, the result is a "labelled string", in which
;;; the first character of each parenthesized chunk is labelled with the name
;;; given to the object which produced the chunk. (We ensure that chunks
;;; generated by named objects are parenthesized.) If backreferences occur
;;; within the regular expression to be rendered, they are rendered as
;;; placeholders which record (in their labels, remember this is a labelled
;;; string) the name of the object to be backreferenced.
;;;
;;; After this initial rendering, we wish to resolve backreferences. Doing so
;;; will not introduce any new parentheses, so we can now start counting
;;; parentheses. We create a table which maps names to backreference numbers
;;; (i.e. the number which would be used in a backreference to the chunk havin
;;; that name). Then we use this table to replace backreference placeholders
;;; with the appropriate backreference number and preceding backslash.
;;;
;;; This strategy requires the notion of a "labelled string," which is fleshed
;;; out in labelled-string.scm. Note that in the above, the labels associated
;;; to placeholders are not symbols, and so they can be distinguished from the
;;; labels associated to named chunks. Since each character in a labelled
;;; string can have at most one label, this implies that backreferences
;;; themselves cannot have names. This is reasonable, since backreferences to
;;; backreferences can (and probably should) always be simplified away.
;;;
;;; I should note that the way in which I deal with labelled strings is sloppy
;;; and, if I planned to extend this software further, should be reworked to
;;; have have more careful type checking. It is important that each render
;;; method return a labelled string (not just a string), but to keep the code
;;; short and readable, I did this by having the labelled-string-append and
;;; labelled-string->string procedures take in either strings or labelled
;;; strings as arguments, and always return labelled-strings. So, it's
;;; important that each render method does something that eventually converts
;;; the result into a labelled string, but this is not explicitly stated in the
;;; code and the place at which conversion happens is not always obvious if you
;;; don't read carefully. It's probably okay for programs of this length, but
;;; if the program got bigger I'd definitely want to fix this up a bit.
;;; PROBLEM 5
;;;
;;; ISSUES
;;;
;;; The main weirdness is that a lot of the special characters, which in BREs
;;; are treated literally when unescaped and are treated specially when
;;; escaped, are treated precisely the reverse way in EREs. So for example \(
;;; \) are grouping symbols in BREs whereas they're literal parentheses in
;;; EREs, and () are grouping symbols in ERES whereas they're literal
;;; parentheses in BREs. Of course, this only holds outside of bracketed
;;; expressions []. And by the way, [] are considered special without escaping
;;; in both BREs and EREs. To test the new (and old) backends, there should be
;;; some test cases with lots of literal parentheses, braces and brackets, as
;;; well as grouping, backreferences, repetition, and char-froms that result in
;;; the rendering of syntactic prantheses, braces and brackts.
;;;
;;; Also, there are a bunch of new additions to the lineup of characters that
;;; must be escaped in EREs, such as ?, +, |. We should include test cases
;;; that have literal versions of these characters in addition to the syntactic
;;; constructs requiring them. (Actually, that will only be possible for |.
;;; Since we are building a language on top of REs, we do not have to use the
;;; syntactic sugar ?, +, *; we can use interval expressions instead.)
;;;
;;; It's worth noting that the implementation of backreferences has to change
;;; in order to accommodate an ERE backend, since the same expression may have
;;; different backreference numbers as a BRE vs. as an ERE. This is likely to
;;; cause problems for an incorrect implementation if there are backreferences
;;; in an RE that has lots of grouping and literal parentheses. I will attempt
;;; to increase the level of abstraction so that the code differences between
;;; BREs and EREs are small and transparent, without having so much abstraction
;;; that it becomes hard to read and understand the code.
;;;
;;; SOLUTIONS
;;;
;;; In this problem I created my third draft of the code. In this draft, a new
;;; boolean constant USE_ERE is defined, and the constants
;;; chars-needing-quoting and chars-needing-quoting-in-brackets are re-branded
;;; to procedures which check the value of USE_ERE and return the appropriate
;;; constant. Such a procedure is also created for pipe |, as well as open and
;;; close parentheses and braces. The procedure surround-with-braces and the
;;; new version of parenthesize use these refactored constant-giving
;;; procedures, and would work even if the future values of these constants
;;; changed.
;;;
;;; The procedure make-backreference-table needed surprisingly few changes to
;;; work for EREs. I mostly just had to change has-initial-open-paren? to work
;;; with the new constant-giving procedures. However the new version of
;;; make-backreference-table is not completely general: The notion of
;;; backslash, and the fact that it escapes exactly one character, is
;;; hard-coded in. Thus, only two front two characters of the labelled string
;;; in question are remembered at a given time, since only one character is to
;;; be skipped. I decided not to go more general because this meaning of
;;; backslash holds for all REs and there is never a time when more than two
;;; characters would together be treated as a single token for the purposes of
;;; an RE. Generalizing would just make the code longer and harder to read.