<document xmlns="http://maven.apache.org/changes/1.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/changes/1.0.0 http://maven.apache.org/xsd/changes-1.0.0.xsd">
<properties>
<title>Norconex Collector Core Project</title>
<author email="[email protected]">Norconex Inc.</author>
</properties>
<body>
<release version="2.1.0-SNAPSHOT" date="2024-??-??" description="Minor release.">
<action dev="essiembre" type="update">
Minimum Java Version is now 11.
</action>
<action dev="essiembre" type="fix">
Fixed crawler throwing error when issuing a stop command.
</action>
</release>
<release version="2.0.2" date="2023-07-09" description="Maintenance release.">
<action dev="essiembre" type="add">
New "deferredShutdownDuration" collector configuration option to delay
the collector shutdown when it's done executing.
</action>
<action dev="essiembre" type="update">
Maven dependency updates: norconex-commons-maven-parent 1.0.2,
H2 2.2.220, JSoup 1.15.3.
</action>
<action dev="essiembre" type="update">
JMX crawler MBeans are now unregistered as the last thing before
collector shutdown.
</action>
</release>
<release version="2.0.1" date="2022-08-30" description="Maintenance release.">
<action dev="essiembre" type="add">
New MDC attributes which can be used in supporting logging frameworks:
"ctx:crawler.id", "ctx:crawler.id.safe", "ctx:collector.id",
and "ctx:collector.id.safe".
</action>
<action dev="essiembre" type="fix">
Fixed occasional concurrency issue when crawler terminates.
</action>
<action dev="essiembre" type="fix">
Fixed the crawler sometimes not exiting when done.
</action>
</release>
<release version="2.0.0" date="2022-01-02"
description="Major release. NOT a drop-in replacement for 1.x.">
<!-- 2.0.0 (GA) -->
<action dev="essiembre" type="update">
Updated transitive dependencies with known vulnerabilities.
</action>
<action dev="essiembre" type="update">
The name of the data store engine "storetypes" collection/table has
been shortened to just the class "simple" name + "--storetypes".
</action>
<action dev="essiembre" type="update">
Updated dependencies to avoid logging library detection conflict.
</action>
<action dev="essiembre" type="update">
Updated JdbcDataStoreEngine table name automatic creation to take into
account more special characters.
</action>
<action dev="essiembre" type="fix">
Fixed "maxConcurrentCrawlers" throwing IllegalStateException with
"Connection Pool Shut Down" message when set to a value other than the
default or one that does not match the number of crawlers.
</action>
<action dev="essiembre" type="fix">
Fixed JdbcDataStoreEngine#getStoreNames not returning proper names.
</action>
<action dev="essiembre" type="fix">
Fixed JdbcDataStoreEngine XML configuration being loaded twice.
</action>
<action dev="essiembre" type="fix">
Fixed MongoDataStore#deleteFirst not successfully deleting and
returning the first record.
</action>
<action dev="essiembre" type="fix">
Fixed data store deserialization not taking into account sub-types,
affecting JDBC and MongoDB implementations.
</action>
<action dev="essiembre" type="fix">
Fixed data store engine resources not being included as part
of the crawler resource cleaning process.
</action>
<action dev="essiembre" type="fix">
Fixed throwing an error when trying to log the execution summary
after the data store engine was closed.
</action>
<!-- 2.0.0-RC1 -->
<action dev="essiembre" type="add">
New StopCrawlerOnMaxEventListener class to stop crawlers upon reaching
a maximum number of specific crawler events.
</action>
<action dev="essiembre" type="add">
New DeleteRejectedEventListener class to delete documents matching
specific document "rejected" events.
</action>
<action dev="essiembre" type="add">
Added deduplication configuration options via
CrawlerConfig#setMetadataDeduplicate and
CrawlerConfig#setDocumentDeduplicate.
</action>
<action dev="essiembre" type="add">
New crawler event: REJECTED_DUPLICATE.
</action>
<action dev="essiembre" type="update">
Maven dependency updates: MongoDB Driver 4.3.2, Testcontainers 1.16.0.
</action>
<action dev="essiembre" type="update">
Launching crawler now sets crawler name as thread name even
before starting to process references.
</action>
<action dev="essiembre" type="update">
Metadata checksummer now an element of CrawlerConfig.
</action>
<action dev="essiembre" type="update">
Checksummers "targetField", "sourceFields", and "sourceFieldsRegex"
are deprecated in favor of "toField" and "fieldMatcher".
</action>
<action dev="essiembre" type="update">
RegexMetadataFilter and RegexReferenceFilter have been deprecated
in favor of MetadataFilter and ReferenceFilter.
</action>
<action dev="essiembre" type="update">
Checksummers "disabled" flag deprecated in favor of setting a null
checksummer or using a self-closed checksummer tag in config.
</action>
<action dev="essiembre" type="fix">
Fixed invalid configuration in POM "maven-dependency-plugin".
</action>
<!-- 2.0.0-M2 -->
<action dev="essiembre" type="add">
Added JdbcDataStoreEngine as a data store implementation.
</action>
<action dev="essiembre" type="add">
Added "crawlersStartInterval" configuration option.
</action>
<action dev="essiembre" type="add">
New crawler events:
DOCUMENT_QUEUED, DOCUMENT_PROCESSED.
</action>
<action dev="essiembre" type="add">
JMX reporting now returns active references and event counts.
</action>
<action dev="essiembre" type="add">
Now provides an execution summary at the end of a crawler execution.
</action>
<action dev="essiembre" type="remove">
Removed JEF dependency in favor of improved JMX for tracking.
</action>
<!-- 2.0.0-M1 -->
<action dev="essiembre" type="add">
Now supports providing multiple committers.
</action>
<action dev="essiembre" type="add">
New collector events: COLLECTOR_RUN_BEGIN, COLLECTOR_RUN_END,
COLLECTOR_STOP_BEGIN, COLLECTOR_STOP_END,
COLLECTOR_CLEAN_BEGIN, COLLECTOR_CLEAN_END,
COLLECTOR_STORE_EXPORT_BEGIN, COLLECTOR_STORE_EXPORT_END,
COLLECTOR_STORE_IMPORT_BEGIN, COLLECTOR_STORE_IMPORT_END.
</action>
<action dev="essiembre" type="add">
New crawler events: CRAWLER_INIT_BEGIN, CRAWLER_INIT_END,
CRAWLER_RUN_BEGIN, CRAWLER_RUN_END,
CRAWLER_STOP_BEGIN, CRAWLER_STOP_END,
CRAWLER_CLEAN_BEGIN, CRAWLER_CLEAN_END.
</action>
<action dev="essiembre" type="add">
New method on CrawlerEvent: isCrawlerShutdown.
</action>
<action dev="essiembre" type="add">
New UNSUPPORTED crawl state.
</action>
<action dev="essiembre" type="add">
New Collector#clean() method and related events.
</action>
<action dev="essiembre" type="add">
New Collector#exportDataStore(), Collector#importDataStore() methods
and related events.
</action>
<action dev="essiembre" type="add">
New .core.reference package along with new .core.store package
for storing URL crawling information.
</action>
<action dev="essiembre" type="add">
New IDataStoreEngine accessible from crawler to store any kind
of objects by implementors in their own extensions.
</action>
<action dev="essiembre" type="add">
AbstractDocumentChecksummer and AbstractMetadataChecksummer classes
(and their subclasses) now have an "onSet" configurable option for
dictating how values are set: append, prepend, replace, optional.
</action>
<action dev="essiembre" type="add">
New CrawlDoc, CrawlDocInfo, and CrawlDocMetadata (either new
or renamed).
</action>
<action dev="essiembre" type="add">
New Crawler#isQueueInitialized() method to support asynchronous
reference queueing.
</action>
<action dev="essiembre" type="add">
Now logging throughput (documents per second) and estimated remaining
time.
</action>
<action dev="essiembre" type="update">
Now always resumes previous incomplete executions. Can now "clean"
to start fresh.
</action>
<action dev="essiembre" type="update">
Now using XML class from Norconex Commons Lang for loading/saving
configuration.
</action>
<action dev="essiembre" type="update">
Now using SLF4J for logging.
</action>
<action dev="essiembre" type="update">
Lists are now replacing arrays in most places.
</action>
<action dev="essiembre" type="update">
ICollector, ICollectorConfig, ICrawler, ICrawlerConfig were all
replaced with Collector, CollectorConfig, Crawler, and CrawlerConfig.
</action>
<action dev="essiembre" type="update">
Default working directory structure has been modified.
</action>
<action dev="essiembre" type="update">
Path is used in addition to or instead of File in many places.
</action>
<action dev="essiembre" type="update">
Configurable CollectorLifeCycleListener, IJobLifeCycleListener,
IJobErrorListener, ISuiteLifeCycleListener, ICrawlerEventListener
all replaced with IEventListener. These new listeners can be set on
the collector configuration, or be implemented on configuration objects
and be detected automatically.
</action>
<action dev="essiembre" type="update">
Dependency updates: Norconex Importer 3.0.0, Norconex JEF 5.0.0,
Norconex Commons Lang 2.0.0, Norconex Committer 3.0.0, H2 1.4.197.
</action>
<action dev="essiembre" type="update">
CrawlerConfig#OrphanStrategy is now public.
</action>
<action dev="essiembre" type="update">
Now requires Java 8 or higher.
</action>
<action dev="essiembre" type="update">
Command-line arguments have changed, with more options such
as "cleaning" previous executions, importing/exporting the crawl store,
forcing a commit of anything remaining in the committer queue,
and rendering the configuration file once interpreted.
</action>
<action dev="essiembre" type="update">
Now uses simple file locks to prevent running conflicting
commands concurrently.
</action>
<action dev="essiembre" type="update">
Dates now take the zone into consideration.
</action>
<action dev="essiembre" type="update">
Collector "maxParallelCrawlers" is now deprecated in favor of
"maxConcurrentCrawlers".
</action>
<action dev="essiembre" type="remove">
Removed "data" package in favor of "reference" package.
</action>
<action dev="essiembre" type="remove">
Removed some of the deprecated code from 1.x.
</action>
<action dev="essiembre" type="remove">
Removed CRAWLER_RESUMED crawler event.
</action>
<action dev="essiembre" type="remove">
Removed CollectorConfigLoader, CollectorLifeCycleListener,
CrawlerLifeCycleListener, IJobLifeCycleListener, IJobErrorListener,
ISuiteLifeCycleListener, ICrawlerEventListener
(replaced by IEventListener).
</action>
<action dev="essiembre" type="remove">
Removed all previously available crawl store implementations in favor
of new MVStoreDataStore.
</action>
</release>
</body>
</document>