Tue Jun 30 17:05:12 CEST 2009
- Replace all remaining uses of TR1 classes (primarily shared_ptr) with
Qt equivalents. The miner now requires Qt 4.5, but works without
TR1 support.
Thu Jan 15 10:14:05 CET 2009
- Fix some user interface glitches.
Fri Jan 9 16:37:12 CET 2009
- Improvements to the mining viewer, including underlining of forms
within the list of sentences, and a preferences dialog where
various thresholds can be set.
- Include a small sample corpus, and a Makefile that automates mining.
- Speed up searching through suffix arrays by determining both the
upper and lower bound of an n-gram with binary search.
Previously, an n-gram was located through binary search, but its
upper and lower bounds were then found with a linear search.
- Many small fixes.
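
The bound search described above can be sketched as follows. This is a
minimal illustration, not the miner's actual code: the names Tokens,
buildSuffixArray(), comparePrefix() and ngramFrequency() are hypothetical,
and the suffix array is built with a plain std::sort for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

typedef std::vector<std::string> Tokens;

// Build a token-level suffix array: start offsets, sorted by the suffix
// that begins at each offset (naive construction, for illustration only).
std::vector<std::size_t> buildSuffixArray(const Tokens &corpus)
{
    std::vector<std::size_t> sa(corpus.size());
    for (std::size_t i = 0; i < sa.size(); ++i)
        sa[i] = i;
    std::sort(sa.begin(), sa.end(), [&corpus](std::size_t a, std::size_t b) {
        return std::lexicographical_compare(
            corpus.begin() + a, corpus.end(),
            corpus.begin() + b, corpus.end());
    });
    return sa;
}

// Compare the first ngram.size() tokens of the suffix at 'pos' with 'ngram'.
// Returns <0, 0 or >0. A suffix that ends before the n-gram does sorts first.
int comparePrefix(const Tokens &corpus, std::size_t pos, const Tokens &ngram)
{
    for (std::size_t i = 0; i < ngram.size(); ++i) {
        if (pos + i >= corpus.size())
            return -1;
        int c = corpus[pos + i].compare(ngram[i]);
        if (c != 0)
            return c;
    }
    return 0;
}

// Frequency of an n-gram: two binary searches, no linear scan.
std::size_t ngramFrequency(const Tokens &corpus,
    const std::vector<std::size_t> &sa, const Tokens &ngram)
{
    assert(!ngram.empty());

    // Lower bound: first suffix whose prefix is not smaller than the n-gram.
    std::size_t lo = 0, hi = sa.size();
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (comparePrefix(corpus, sa[mid], ngram) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    std::size_t lower = lo;

    // Upper bound: first suffix whose prefix is greater than the n-gram.
    hi = sa.size();
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (comparePrefix(corpus, sa[mid], ngram) <= 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo - lower;
}
```

Both loops take O(log n) comparisons, so the cost of counting no longer
depends on how often the n-gram occurs.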
Wed Dec 10 11:41:53 CET 2008
- Integrate the error miner and viewer in the same build
infrastructure. Simply invoking 'qmake && make' in the top-level
directory will build the miner, viewer, and evaluator.
Tue Nov 11 10:35:42 CET 2008
- Add the '-e val' option for enabling sparseness correction, and
specifying the alpha variable. In case of doubt: 1.0 is a sensible
value for alpha.
Tue Nov 4 10:45:49 CET 2008
- Don't pass all sentence handlers as a vector to the constructor
of TokenizedSentenceReader, but provide addHandler() and
removeHandler() methods.
- Add smoothing, as described by Sagot and de la Clergerie.
Smoothing is enabled with the '-b val' flag, where 'val' is the
value used for the beta parameter.
- Remove the -a (all n-grams) option. It's not really useful for
*error* mining, and does not really make much sense now that we
have n-gram expansion.
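
The addHandler()/removeHandler() change mentioned above might look roughly
like this. Only the two method names come from this changelog; the
SentenceHandler interface, the SentenceReader stand-in class, dispatch()
and CountingHandler are assumptions made for the sake of the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical handler interface: called once per tokenized sentence.
class SentenceHandler
{
public:
    virtual ~SentenceHandler() {}
    virtual void handleSentence(const std::vector<std::string> &tokens) = 0;
};

// Hypothetical stand-in for a reader that notifies registered handlers.
class SentenceReader
{
public:
    // Register a handler; the reader does not take ownership.
    void addHandler(SentenceHandler *handler)
    {
        d_handlers.push_back(handler);
    }

    // Unregister a handler (erase-remove idiom); unknown handlers are ignored.
    void removeHandler(SentenceHandler *handler)
    {
        d_handlers.erase(
            std::remove(d_handlers.begin(), d_handlers.end(), handler),
            d_handlers.end());
    }

    // Pass one sentence to every registered handler.
    void dispatch(const std::vector<std::string> &tokens)
    {
        for (std::size_t i = 0; i < d_handlers.size(); ++i)
            d_handlers[i]->handleSentence(tokens);
    }

private:
    std::vector<SentenceHandler *> d_handlers;
};

// Example handler that counts the sentences and tokens it has seen.
class CountingHandler : public SentenceHandler
{
public:
    CountingHandler() : sentences(0), tokens(0) {}

    void handleSentence(const std::vector<std::string> &sentence)
    {
        ++sentences;
        tokens += sentence.size();
    }

    std::size_t sentences;
    std::size_t tokens;
};
```

Compared with passing a vector to the constructor, handlers can be attached
and detached while the reader is alive.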
Thu Oct 23 12:01:48 CEST 2008
- Simplify Miner::handleSentence().
- Add the '-c' option to disable ngram expansion.
- Simplify SuffixArray::compare(). As a bonus, since it now performs
fewer operations that involve iterators, this gives a small
performance gain.
0.1.6 (October 6, 2008)
- Cache unigram ratios. Although binary search is used to locate
a sequence in the suffix array, counting the frequency of a
suffix requires a linear count. Since short n-grams occur very
frequently, this can take a large amount of time. Caching unigrams
takes relatively little memory, and gives a considerable speedup
(from 54 to 12 seconds for the whole mining process on my test set).
- Use the ssort algorithm by McIlroy and McIlroy. This speeds
up suffix sorting considerably.
- Use perfect hashing and suffix arrays to look up arbitrary
length n-grams in the parsable and unparsable sentence lists.
- Discard the '-m' option for mining a range of n-grams. Instead
use a new method that can extend n-grams (normally unigrams)
when a longer n-gram has a higher 'error rate' than its parts.
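
The expansion criterion from the last item can be sketched as below. This
is only an illustration under stated assumptions: 'error rate' is taken to
be the number of occurrences in unparsable sentences divided by the total
number of occurrences, which this changelog does not define, and Counts,
errorRate() and shouldExpand() are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Occurrence counts of one n-gram (hypothetical structure).
struct Counts
{
    unsigned unparsable; // occurrences in unparsable sentences
    unsigned total;      // occurrences in all sentences
};

// Assumed definition: fraction of occurrences in unparsable sentences.
double errorRate(const Counts &c)
{
    return c.total == 0 ? 0.0 : static_cast<double>(c.unparsable) / c.total;
}

// Expand to the longer n-gram only if its error rate is strictly higher
// than the error rate of every one of its parts.
bool shouldExpand(const Counts &longer, const std::vector<Counts> &parts)
{
    double rate = errorRate(longer);
    for (std::size_t i = 0; i < parts.size(); ++i)
        if (errorRate(parts[i]) >= rate)
            return false;
    return true;
}
```

For example, a bigram seen 9 times in unparsable sentences out of 10
occurrences would be preferred over unigram parts with rates of 0.1 and 0.4.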
0.1.5 (September 5, 2008)
- Add a new '-m' option to mine a range of n-grams in combination
with '-n'. For instance, '-n 1 -m 2' will mine unigrams and
bigrams. This option is still experimental, and probably doesn't
produce good results yet.
- Rename the previous '-m' option (minimal frequency threshold) to '-u'.
- If the '-s' option is used to exclude forms with a near-zero
suspicion, remove the form from the set of forms as well. This
frees up more memory, and excludes these forms from the results.
0.1.4 (August 31, 2008)
- Check if the conversion of an option argument was correct.
- Add the '-s t' option. This option removes observations that
have dropped below the threshold t. If t is near-zero, this
has little effect on the analysis, while it speeds up the
analysis considerably.
- Avoid unnecessary map lookups, giving a slight speed-up.
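
Checking an option argument conversion, as in the first item, can be done
along these lines. This is a sketch using the standard std::strtod, not
necessarily the approach the miner takes; parseDouble() is a hypothetical
helper name.

```cpp
#include <cerrno>
#include <cstdlib>

// Convert an option argument such as the 't' in '-s t' to a double.
// Returns false, leaving 'result' untouched, when the argument is empty,
// is not a number, has trailing characters, or is out of range.
bool parseDouble(const char *arg, double &result)
{
    if (arg == nullptr || *arg == '\0')
        return false;

    errno = 0;
    char *end = nullptr;
    double value = std::strtod(arg, &end);

    // No characters consumed, trailing garbage, or a range error.
    if (end == arg || *end != '\0' || errno == ERANGE)
        return false;

    result = value;
    return true;
}
```

A caller would typically print the usage message and exit when this
returns false.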
0.1.3 (July 15, 2008)
- Add the '-v' option for verbose output.
- Make the miner observable.
0.1.2 (July 13, 2008)
- Add the '-a' option to include all ngrams in the analysis. With
this option, forms that only occur in parsable sentences are also
included.
- Switch to TR1 unordered_set for storing forms, giving a nice
performance win at the cost of support for older g++ versions.
- Fix usage information a bit.
0.1.1 (July 9, 2008)
- Move source files and internal headers to the src/ subdirectory.
0.1.0 (July 9, 2008)
- Port to C++.
0.0.4 (June 30, 2008)
- By default, restrict mining to forms that have suspicion. Add the
'-a' option for mining of all forms. Mining just the forms with
suspicion requires less memory (and time).
- Combine observation and form suspicion calculations in one single
method. As a result, observation suspicions can be stored temporarily,
making the Observation class unnecessary. Sentences now just store
an array of observed forms, and Forms do not have a list of
observations. This reduces memory use quite a bit.
- Wrap mining results in a MineResults instance, which contains a
reference to the list of sentences as well. This is more convenient
when we want to show sample sentences for suspect n-grams in the
future.
0.0.3 (June 12, 2008)
- Add the ngram observation frequencies (total and unparsable) to
the default output.
0.0.2 (June 12, 2008)
- Add the '-m freq' option to specify a minimal frequency threshold
for observed ngrams in unparsable sentences.