Releases: apache/incubator-stormcrawler
Apache StormCrawler 3.1.0 (Incubating)
Disclaimer
Apache StormCrawler is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.
Release Summary
This is our second release since joining the ASF Incubator as a podling. It adds the new Playwright module, which can be used to scrape dynamic content.
What's Changed
- send email if CI build fails by @pjfanning in #1217
- Fixes #1214 - "Update Release Docs with Feedback from 3.0 RC2 Vote" by @rzo1 in #1218
- Fix #1223 - Remove declareOutputFields from Solr StatusUpdaterBolt by @mvolikas in #1224
- Apache StormCrawler 3.0 (Incubating) by @rzo1 in #1225
- Fix #1226 "Add FileSpout TestCase for Custom Metadata Injections" by @rzo1 in #1227
- 1024 Playwright protocol implementation, fixes #1024 by @jnioche in #1228
- Fix #1230: Set sitemap key before outlink processing by @mvolikas in #1231
- #1220 - Add disclaimer for binary test artifacts by @rzo1 in #1234
- #1221 - Switch Source to tar.gz by @rzo1 in #1233
- #1215 - Update RAT exclusions. Fixes licenses by @rzo1 in #1235
- #1236 - Fix Typos in StormCrawler by @rzo1 in #1237
- #1222 - Fix Release Docs by @rzo1 in #1232
- #1238 - Avoid use of star imports by @rzo1 in #1239
- Fix #1244 "Migrate to JUnit 5" by @rzo1 in #1245
- Fix #1216 - Add RAT Exclusion File for standalone RAT by @rzo1 in #1243
- #1248 - Use pre-compiled patterns for mime type matching in TikaParser by @rzo1 in #1249
- #1251 - Update to Storm 2.6.3 by @rzo1 in #1252
- #626: Add routing field in metadata - Solr StatusUpdaterBolt by @mvolikas in #1242
- #851 Merge branch 851 into main by @mvolikas in #1256
- #1259 - Enable Dependabot by @rzo1 in #1260
- #1261 - Automatically generate THIRD-PARTY.txt via GitHub Action by @rzo1 in #1262
- #1257 - Update to Storm 2.6.4 by @rzo1 in #1258
- #1162 - Replace Coveralls with JaCoCo by @sigee in #1255
- Bump testcontainers.version from 1.19.7 to 1.20.1 by @dependabot in #1277
- Bump org.apache.maven.plugins:maven-javadoc-plugin from 3.5.0 to 3.10.0 by @dependabot in #1267
- Bump actions/setup-java from 3 to 4 by @dependabot in #1264
- Bump actions/checkout from 3 to 4 by @dependabot in #1265
- Bump org.jsoup:jsoup from 1.17.2 to 1.18.1 by @dependabot in #1271
- Regenerated License file after dependency upgrades by @github-actions in #1280
- Bump tika.version from 2.9.1 to 2.9.2 by @dependabot in #1269
- Bump com.ibm.icu:icu4j from 74.2 to 75.1 by @dependabot in #1272
- Bump org.apache.maven.plugins:maven-enforcer-plugin from 3.4.1 to 3.5.0 by @dependabot in #1289
- Bump org.apache.maven.plugins:maven-jar-plugin from 3.3.0 to 3.4.2 by @dependabot in #1288
- Bump org.apache.maven.plugins:maven-compiler-plugin from 3.11.0 to 3.13.0 by @dependabot in #1285
- Bump org.apache.rat:apache-rat-plugin from 0.15 to 0.16.1 by @dependabot in #1283
- Bump org.apache:apache from 31 to 33 by @dependabot in #1275
- Bump junit.version from 5.10.2 to 5.11.0 by @dependabot in #1278
- Bump org.apache.solr:solr-solrj from 9.5.0 to 9.6.1 by @dependabot in #1281
- Bump org.apache.maven.archetype:archetype-packaging from 2.4 to 3.2.1 by @dependabot in #1287
- Bump org.mockito:mockito-core from 5.10.0 to 5.13.0 by @dependabot in #1279
- Bump com.microsoft.playwright:playwright from 1.43.0 to 1.46.0 by @dependabot in #1268
- Bump selenium.version from 4.18.1 to 4.24.0 by @dependabot in #1266
- Bump log4j2.version from 2.23.0 to 2.24.0 by @dependabot in #1284
- Regenerated License file after dependency upgrades by @github-actions in #1282
- Fix #1290 "Add close/cleanup method to ParseFilters" by @rzo1 in #1291
- Bump opensearch.version from 2.12.0 to 2.16.0 by @dependabot in #1276
- Regenerated License file after dependency upgrades by @github-actions in #1292
- Aligned version of OpenSearch in test with recent upgrade to 2.16 by @jnioche in #1293
- Bump actions/cache from 3 to 4 by @dependabot in #1263
- Revert "Bump log4j2.version from 2.23.0 to 2.24.0" by @rzo1 in #1294
- #1295 - Add workflow to publish SNAPSHOTS to repository.a.o by @rzo1 in #1296
- Regenerated License file after dependency upgrades by @github-actions in #1297
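The pre-compiled pattern change (#1249) in the list above follows a common Java idiom: compile a regular expression once and reuse it, rather than recompiling it on every call. The following is a minimal sketch of that idiom only, not the actual TikaParser code; the class name `MimeTypeMatcher`, the method `isHtml`, and the regular expression are hypothetical and chosen purely for illustration.

```java
import java.util.regex.Pattern;

public class MimeTypeMatcher {
    // Hypothetical pattern, compiled once when the class is loaded
    // instead of on every call.
    private static final Pattern HTML_MIME =
            Pattern.compile(
                    "^(text/html|application/xhtml\\+xml)\\b.*",
                    Pattern.CASE_INSENSITIVE);

    /** Returns true if the given Content-Type value looks like HTML or XHTML. */
    public static boolean isHtml(String mimeType) {
        return mimeType != null && HTML_MIME.matcher(mimeType.trim()).matches();
    }

    public static void main(String[] args) {
        System.out.println(isHtml("text/html; charset=UTF-8"));
        System.out.println(isHtml("application/pdf"));
    }
}
```

Calling `String.matches(regex)` in a hot path recompiles the expression each time, so a shared `static final Pattern` is usually the cheaper choice when the same check runs for every fetched document.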
New Contributors
- @sigee made their first contribution in #1255
- @github-actions made their first contribution in #1280
Full Changelog: stormcrawler-3.0...stormcrawler-3.1.0
Apache StormCrawler 3.0 (Incubating)
Disclaimer
Apache StormCrawler is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.
Release Summary
This is our first release since joining the ASF Incubator as a podling. It is a breaking release: the Maven group IDs have been renamed and the Elasticsearch module has been removed.
What's Changed
- Handling of DateTimeParseException in WARCSpout by @michaeldinzinger in #1140
- Generate THIRD-PARTY.txt file, fixes #1145 by @jnioche in #1146
- Remove coveralls maven plugin, fixes #1148 by @jnioche in #1149
- OpenSearch - better handling of mappings by @jnioche in #1155
- Delete CODE_OF_CONDUCT.md by @pjfanning in #1158
- Create DISCLAIMER by @pjfanning in #1159
- Update NOTICE by @pjfanning in #1160
- Changed package names to org.apache by @jnioche in #1165
- Create .asf.yaml by @pjfanning in #1161
- Fix #1174 - Exclude optional artifact from storm-hdfs by @rzo1 in #1175
- Fix #1164 - Change license headers by @rzo1 in #1173
- Removed devs section from pom.xml by @jnioche in #1181
- Fix #1167 - Remove Elasticsearch module by @rzo1 in #1182
- Remove hyphens in storm-crawler by @jnioche in #1177
- Fixes #1178 "Set version to 3.0-SNAPSHOT" by @rzo1 in #1183
- Fixes #1169 - Use Apache Parent POM & Enable RAT by @rzo1 in #1180
- Removed ref to Discord in README by @jnioche in #1184
- Fix #1168 - Add a modified version of CONTRIBUTING.md by @rzo1 in #1186
- Fix #1163 - Change the GitHub templates for PRs to be more ASF specific by @rzo1 in #1185
- Upgrade to Storm 2.6.2, fix #1188 by @jnioche in #1189
- link to ASF web site .asf.yaml by @pjfanning in #1192
- Update README.md by @jnioche in #1195
- 1200 - Fix license headers by @jnioche in #1201
- #1197 - Allow to disable SSL/TLS verification in OpenSearchConnection by @rzo1 in #1199
- Fix #1202 - Add release documentation and comply with source package naming requirements by @rzo1 in #1203
- #1207 -- add forbidden-apis by @tballison in #1208
- #1209 fix for emulation error in tests run on silicon by @joshfischer1108 in #1210
- Resolves #1211 "Fix License Header" by @rzo1 in #1212
- #1205 update archetype in README by @joshfischer1108 in #1206
- Introduce "skip.format.code" to skip code formatting by default by @rzo1 in #1213
New Contributors
- @pjfanning made their first contribution in #1158
- @tballison made their first contribution in #1208
- @joshfischer1108 made their first contribution in #1210
Full Changelog: 2.11...stormcrawler-3.0
StormCrawler 2.11
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
What's Changed
- Upgrade to OpenSearch 2.11 #1113 by @jnioche in #1114
- Use mock server for selenium tests, fix #1116 by @jnioche in #1119
- Issue #728: Adding asterisk for metadata transfer by @michaeldinzinger in #1117
- WARCSpout loads inputs using HDFS by @jnioche in #1122
- Fix wrong most recent date was set by @chhsiao90 in #1126
- Glob field mapping for indexer.md.mapping by @jnioche in #1130
- Add committer statement by @michaeldinzinger in #1134
- Implement configurable getDocumentID in DeletionBolt by @chhsiao90 in #1135
- Add two tests for SiteMapParserBolt by @michaeldinzinger in #1138
- dependency upgrades by @jnioche in #1139
New Contributors
- @chhsiao90 made their first contribution in #1126
Full Changelog: 2.10...2.11
What's new in StormCrawler 2.10
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
What's Changed
- Selenium test by @jnioche in #1093
- refactoring timeouts Selenium by @jnioche in #1102
- Improvements and fixes to HttpRobotRulesParser when following redirects by @sebastian-nagel in #1103
and a lot more!
Full Changelog: 2.9...2.10
See https://digitalpebble.blogspot.com/2023/10/focus-on-protocol-improvements-in.html for more details on the protocol improvements
What's new in StormCrawler 2.9
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
What's Changed
- Change HttpProtocol to defer to configured values for retryOnConnectionFailure and followRedirects by @ndtreviv in #1056
- Cache redirected robots.txt for target host only if path is /robots.txt and query is empty by @sebastian-nagel in #1057
- Issue #1043: Fixing problems after restart of Frontier service by @michaeldinzinger in #1054
- #1049 Replace "Collapse and Expand Results" Solr query with "Result Grouping" query. by @syefimov in #1053
- OpenSearch 2.7.0 + renamed OpenSearchConnection by @jnioche in #1064
- BasicURLNormalizer .unmangleQueryString() returns invalid results if "&" symbol in a parents path #1059 by @syefimov in #1062
- Dependency upgrades. fixes #1066 by @jnioche in #1067
- Automatic creation of index definitions should use the bolt type by @jnioche in #1069
- mechanism to retrieve more generic value of configuration by @jnioche in #1071
- Create DeletionBolt.java for Solr. #1050 by @syefimov in #1073
- Increase the number of redirects to 5 for Robots.txt fetching by @michaeldinzinger in #1074
- Issue #1042: Adapt parsing of robots.txt files by @michaeldinzinger in #1055
- Test URL Filtering from the command line by @jnioche in #1081
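The condition described in #1057 above (cache a redirected robots.txt for the target host only if the path is /robots.txt and the query is empty) can be expressed as a small predicate. This is a sketch under assumptions, not the code from the pull request: the class name `RobotsRedirectCheck` and the method `isCacheableForTargetHost` are hypothetical, and the actual implementation may apply further checks.

```java
import java.net.URI;

public class RobotsRedirectCheck {
    /**
     * Returns true if the redirect target has exactly the path /robots.txt
     * and no (or an empty) query string.
     */
    public static boolean isCacheableForTargetHost(URI redirectTarget) {
        String query = redirectTarget.getQuery();
        boolean emptyQuery = query == null || query.isEmpty();
        return "/robots.txt".equals(redirectTarget.getPath()) && emptyQuery;
    }

    public static void main(String[] args) {
        System.out.println(
                isCacheableForTargetHost(URI.create("https://www.example.org/robots.txt")));
        System.out.println(
                isCacheableForTargetHost(URI.create("https://www.example.org/robots.txt?lang=en")));
    }
}
```

The intent of such a check is that a redirect to some unrelated page on another host should not be stored as that host's robots.txt rules.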
New Contributors
- @michaeldinzinger made their first contribution in #1054
- @syefimov made their first contribution in #1053
Full Changelog: 2.8...2.9
What's new in StormCrawler 2.8
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
What's Changed
- Enforce Java 11 in archetypes by @msghasan in #1029
- Fix #1027: Ensure SC can be build with Java 17 by @rzo1 in #1030
- Indexer ES document id by @Mikwiss in #1028
- JsoupFilter as Interface by @Mikwiss in #1026
- Create method to add SearchHit info to metadata by @Mikwiss in #1034
- Status ES document id by @Mikwiss in #1036
- Limit the amount of text to be returned by the text extraction, #1038 by @jnioche in #1039
- Allow override on HttpProtocol's method addHeadersToRequest by @Mikwiss in #1041
- Fixes #1045. Remove range syntax from snakeyaml by @rzo1 in #1046
- Fix #1032: Catch the exception inside the loop to avoid breaking if one remote instance is misbehaving by @rzo1 in #1047
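The fix for #1032 above describes a general pattern: catch the exception inside the loop body so that one failing remote instance does not stop the remaining ones from being processed. A minimal sketch of that pattern follows; it is not the code from the pull request, and the class `ResilientLoop`, the method `collect`, and the use of `Supplier` to stand in for a remote call are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class ResilientLoop {
    /**
     * Calls every source and collects the results. A source that throws is
     * skipped instead of aborting the whole loop.
     */
    public static List<String> collect(List<Supplier<String>> sources) {
        List<String> results = new ArrayList<>();
        for (Supplier<String> source : sources) {
            try {
                results.add(source.get());
            } catch (RuntimeException e) {
                // The try/catch sits inside the loop, so the next source is still tried.
                System.err.println("Skipping misbehaving instance: " + e.getMessage());
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<Supplier<String>> sources = new ArrayList<>();
        sources.add(() -> "a");
        sources.add(() -> { throw new IllegalStateException("unreachable host"); });
        sources.add(() -> "c");
        System.out.println(collect(sources)); // prints [a, c]
    }
}
```

With the try/catch wrapped around the whole loop instead, the second source's failure would end the iteration and "c" would never be collected.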
Full Changelog: 2.7...2.8
What's new in StormCrawler 2.7
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
What's Changed
- Dependency upgrades #1016
- OpenSearch module in #1011
- Maven archetype for OpenSearch
- [WARC] Backward compatible storage of HTTP/2 headers by @sebastian-nagel in #1010
- Ignore empty fields indexer in #1019
- Handle single quotes in value of http-equiv="refresh" #1020
Full Changelog: 2.6...2.7
What's new in StormCrawler 2.6
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
Highlights
- Using URLFrontier in archetype
- URLFilter becomes an abstract class
- Fixed deactivation of maxDepthFilter
- JSoupParserBolt improve performance of link extraction
- Multiple dependency upgrades
Full Changelog: storm-crawler-2.5...2.6
What's new in StormCrawler 2.5
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
In a nutshell
- various dependency upgrades (JSoup, CrawlerCommons, Tika, Elasticsearch)
- Java 11
- bugfix: AggregationSpout sometimes does not release the IsInQuery boolean
- various improvements to URLFrontier module
In more detail
- FEATURE-964: custom crawl delay per page by @juli-alvarez in #967
- Issue 970 HttpProtocol doesn't consider http.content.limit in test for filesize by @wowasa in #972
- Add ChannelManager for local channel management and constants to Spout.java by @FelixEngl in #982
- Fix error when spaces in path to test-resources of StatusBoltTest in ElasticSearch-Module by @FelixEngl in #985
- Add unit test basics for URLFrontier. by @FelixEngl in #984
- Fix starvation and busy waiting of StatusUpdaterBolt.java, add Constants. by @FelixEngl in #983
- Fix starvation and busy waiting of ES StatusUpdaterBolt (Fixes #986) by @FelixEngl in #988
- Fix starvation and busy waiting of ES IndexerBolt by @FelixEngl in #989
- HttpProtocol use the md protocol.set-headers to add custom header by url by @Mikwiss in #993
Full Changelog: 2.4...storm-crawler-2.5
StormCrawler 2.4
Disclaimer
This is a Pre-ASF release and did not undergo a formal review by the PMC.
- Upgrade to Apache Storm 2.4
- Upgrade to Elasticsearch 7.17.2
- Bugfix: setting "maxDepth": 0 in urlfilter.json prevents ES seed injection #959
- Allow compatibility.mode for rest client to connect to ES8+ #962
Full Changelog: 2.3...2.4