Releases: Norconex/crawlers
norconex-collector-http-2.1.0
Release notes, binary downloads, and documentation can all be found on the Norconex HTTP Collector website: http://www.norconex.com/collectors/collector-http/
norconex-collector-http-2.0.2
Bug fix release:
- Fixed the collector "stop" action having no effect (github #49).
- Fixed crawl data wrongfully applied as metadata after the import phase.
- Fixed NullPointerException when sitemap support is disabled.
- Fixed incorrect deletion behavior for embedded orphan documents.
- Improved log4j.properties logging options for crawler events.
- Upgraded Norconex Collector Core dependency to 1.0.2.
Binary download: http://www.norconex.com/collectors/collector-http/download
norconex-collector-http-2.0.1
- From collector-core-1.0.1: When keepDownloads is true, saved files and directories are now prefixed with "f." and "d." respectively to avoid collisions.
- Fixed errors in example configuration files.
Binary download: http://www.norconex.com/collectors/collector-http/download
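The "f." and "d." prefixing noted above for keepDownloads can be pictured with a minimal sketch. This is not the Collector Core code: the class and method names are hypothetical, and the way a reference is turned into a relative path is an assumption. It only shows why the prefixes prevent a saved file and a saved directory of the same name from colliding.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Norconex Collector Core.
final class PrefixedPathSketch {

    // Prefixes every intermediate segment with "d." (directory) and the
    // last segment with "f." (file), e.g. "docs/page.html" -> "d.docs/f.page.html".
    static String toPrefixedPath(String relativePath) {
        List<String> segments = new ArrayList<>();
        for (String s : relativePath.split("/")) {
            if (!s.isEmpty()) {
                segments.add(s);
            }
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < segments.size(); i++) {
            boolean last = (i == segments.size() - 1);
            if (i > 0) {
                out.append('/');
            }
            out.append(last ? "f." : "d.").append(segments.get(i));
        }
        return out.toString();
    }
}
```

With this scheme a document saved from ".../docs" and another saved from ".../docs/page.html" end up as "f.docs" and "d.docs/f.page.html", so the name "docs" never has to be both a file and a directory.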
norconex-collector-http-2.0.0
- Upgraded Norconex Importer to version 2.0.0, which brings to Norconex HTTP Collector a lot of new features, such as:
  - Document content splitting
  - Splitting of embedded documents into individual documents
  - New taggers for language detection, changing character case, parsing and formatting dates, providing content statistics, and more.
  - Read the Norconex Importer release notes for a complete list of changes at: http://www.norconex.com/product/importer/changes-report.html#a2.0.0
- Can now supply a "urlsFile" as part of the startURLs, acting as a seed list.
- New fast MVStore database implementation for URL database (from Norconex Collector Core).
- New H2 database implementation for URL database (crawl data store).
- Now keeps track of parent references (for embedded/split documents).
- Link extraction now also captures a link's title and text (github #23) and supports the "nofollow" robot rule.
- It is now possible to configure multiple link extraction classes, each taking effect on particular URLs and/or content-types.
- IHtmlLinkExtractor can be configured to use specified HTML tags and attributes to find URLs.
- Now licensed under The Apache License, Version 2.0.
- Replaced the configuration option "deleteOrphans(true|false)" with "orphansStrategy(DELETE|PROCESS|IGNORE)".
- The collector now references document content as reusable InputStream with memory caching instead of relying only on files. This saves a great deal of disk I/O and improves performance in most cases.
- Refactored to use the new Norconex Collector Core library.
- New and more scalable crawler event model along with new listeners.
- More...
Binary download: http://www.norconex.com/collectors/collector-http/download
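The "reusable InputStream with memory caching" item above can be illustrated with a minimal sketch. It is not the collector's implementation: the class name, the memory threshold parameter, and the spill-to-temp-file behavior are all assumptions made for illustration. The idea shown is that content is read once, kept in memory when small enough, and can then be re-read any number of times without fetching it again.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical class for illustration only; not the Norconex implementation.
final class ReusableContentSketch {
    private final byte[] memory;   // non-null when the content fit in memory
    private final Path spillFile;  // non-null when the content was written to disk

    // Reads the source once. Content up to maxMemoryBytes stays in memory;
    // anything larger is spilled to a temporary file (assumed behavior).
    ReusableContentSketch(InputStream source, int maxMemoryBytes) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        Path file = null;
        OutputStream fileOut = null;
        try {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = source.read(chunk)) != -1) {
                if (fileOut == null && buffer.size() + n > maxMemoryBytes) {
                    file = Files.createTempFile("content-", ".cache");
                    file.toFile().deleteOnExit();
                    fileOut = Files.newOutputStream(file);
                    buffer.writeTo(fileOut); // move what was buffered so far to disk
                }
                if (fileOut != null) {
                    fileOut.write(chunk, 0, n);
                } else {
                    buffer.write(chunk, 0, n);
                }
            }
        } finally {
            if (fileOut != null) {
                fileOut.close();
            }
        }
        this.memory = (file == null) ? buffer.toByteArray() : null;
        this.spillFile = file;
    }

    boolean isInMemory() {
        return memory != null;
    }

    // Returns a fresh stream positioned at the start of the content on every call.
    InputStream newInputStream() throws IOException {
        return isInMemory()
                ? new ByteArrayInputStream(memory)
                : Files.newInputStream(spillFile);
    }
}
```

Small documents never touch the disk in this sketch, which is where the disk I/O savings would come from; only content above the threshold pays for a temporary file.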
norconex-collector-http-1.3.4
- MongoCrawlURLDatabase now supports user authentication.
- Now requires Java 7 or higher.
- Fixed DefaultRobotsTxtProvider failing to parse some robots.txt patterns.
Binary download: http://www.norconex.com/product/collector-http/download.html
norconex-collector-http-1.3.3
- Upgraded JEF to 3.0.1 to fix the stop action not working.
- Fixed NullPointerException in robots.txt resolution under some circumstances.
Binary download: http://www.norconex.com/product/collector-http/download.html
norconex-collector-http-1.3.2
- DefaultURLExtractor no longer treats an empty href as a URL ending with a double-quote.
- GenericURLNormalizer no longer rejects URLs with spaces in them. It now logs a warning instead.
Binary download: http://www.norconex.com/product/collector-http/download.html
norconex-collector-http-1.3.1
- Header and document checksum values are no longer added by default, to prevent the issue described in github issue #24. Instead, adding checksums is now an optional feature of DefaultHttpDocumentChecksummer and DefaultHttpHeadersChecksummer.
Binary download: http://www.norconex.com/product/collector-http/download.html
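The optional checksum storage described above can be sketched as follows. This does not reproduce DefaultHttpDocumentChecksummer: the class name, the use of MD5, and the idea of passing a target field name (or null to skip storing) are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical class for illustration only; not the Norconex checksummer.
final class ChecksumSketch {

    // Returns the MD5 digest of the content as a lowercase hex string.
    static String md5Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // Computes the checksum and stores it in the metadata only when a
    // target field is given (assumed opt-in behavior).
    static String checksum(byte[] content, Map<String, String> metadata, String targetField) {
        String sum = md5Hex(content);
        if (targetField != null) {
            metadata.put(targetField, sum);
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        byte[] content = "hello".getBytes(StandardCharsets.UTF_8);
        // "document.checksum" is a made-up field name.
        System.out.println(checksum(content, metadata, "document.checksum"));
        System.out.println(metadata);
    }
}
```

Passing null as the target field still yields the checksum for change detection but leaves the document metadata untouched.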
norconex-collector-http-1.3.0
- Now supports NTLM authentication. Experimental support added for SPNEGO and Kerberos.
- Document checksums are added to each document's metadata.
- Refactoring of HTTPClient creation with many new configuration options added (connection timeout, charset, maximum redirects, and several more).
- Can now optionally trust all SSL certificates.
- Integrates new features of Norconex Importer 1.2.0 such as support for WordPerfect document parsing, new filters and transformers, etc.
- Integrates new features of Norconex Committer 1.2.0 such as defining multiple committers, retrying upon commit failure, etc.
- Other third-party library upgrades.
Binary download: http://www.norconex.com/product/collector-http/download.html
norconex-collector-http-1.2.0
Feature release:
- New optional Mongo URL Database implementation.
- New TikaURLExtractor class providing an alternate IURLExtractor implementation based on Apache Tika HTMLParser.
- New SegmentCountURLFilter class for filtering URLs having a specified number of segments (can check duplicate segments too).
- New unit tests.
- MapDB URL Database classes moved to their own "mapdb" package. DefaultCrawlURLDatabaseFactory still exists, but is just a pointer to MapDBCrawlURLDatabaseFactory.
- Example configurations now point to Norconex test pages to ensure their stability.
- Upgraded dependent libraries: Norconex Committer 1.1.0, Norconex Commons Lang 1.2.0, MapDB 0.9.8, and other third-party libraries.
- Improved Javadoc.
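The segment-based filtering mentioned for SegmentCountURLFilter can be illustrated with a minimal sketch. It does not reproduce that class's options or exact matching rules: the class and method names are hypothetical, "/" is assumed as the segment separator, and "exceeds a maximum" is an assumed way of applying the count.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class for illustration only; not the Norconex filter.
final class SegmentCountSketch {

    // Splits the URL path on "/" and returns the non-empty segments.
    static List<String> pathSegments(String url) {
        String path = URI.create(url).getPath();
        List<String> segments = new ArrayList<>();
        if (path == null) {
            return segments;
        }
        for (String s : path.split("/")) {
            if (!s.isEmpty()) {
                segments.add(s);
            }
        }
        return segments;
    }

    // Returns how many times the most frequent segment occurs.
    static int maxDuplicateCount(List<String> segments) {
        Map<String, Integer> counts = new HashMap<>();
        int max = 0;
        for (String s : segments) {
            int c = counts.merge(s, 1, Integer::sum);
            max = Math.max(max, c);
        }
        return max;
    }

    // True when the URL has more than maxCount segments, or, with
    // countDuplicates set, when any one segment repeats more than maxCount times.
    static boolean exceeds(String url, int maxCount, boolean countDuplicates) {
        List<String> segments = pathSegments(url);
        int measured = countDuplicates ? maxDuplicateCount(segments) : segments.size();
        return measured > maxCount;
    }
}
```

Checking duplicate segments is a common guard against crawler traps such as ".../a/b/a/b/a/b/...", where the same path pieces repeat indefinitely.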