Releases: Norconex/crawlers

norconex-collector-http-2.1.0

09 Apr 02:34

Release notes, binary downloads, and documentation can all be found on the Norconex HTTP Collector website: http://www.norconex.com/collectors/collector-http/

norconex-collector-http-2.0.2

04 Feb 19:55

Bug fix release:

  • Fixed the collector "stop" action having no effect (github #49).
  • Fixed crawl data being wrongly applied as metadata after the import phase.
  • Fixed NullPointerException when sitemap support is disabled.
  • Fixed incorrect deletion behavior for embedded orphan documents.
  • Improved log4j.properties logging options for crawler events.
  • Upgraded Norconex Collector Core dependency to 1.0.2.

Binary download: http://www.norconex.com/collectors/collector-http/download

norconex-collector-http-2.0.1

03 Dec 22:55
  • From collector-core-1.0.1: When keepDownloads is true, saved files and directories are now prefixed with "f." and "d." respectively to avoid collisions.
  • Fixed errors in example configuration files.
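
The collision that the "f." and "d." prefixes avoid can be illustrated with a minimal sketch. It assumes a collision arises when one URL path (for example /docs) is also the parent of another (/docs/intro), so the same name would be needed as both a file and a directory. The class and method names are hypothetical and are not taken from Collector Core:

```java
public class DownloadPathSketch {

    // Hypothetical helper, NOT the Collector Core implementation:
    // prefixes every intermediate segment with "d." (directory) and the
    // last segment with "f." (file).
    static String toPrefixedPath(String urlPath) {
        // Trim leading/trailing slashes, then split on one or more slashes.
        String[] segments = urlPath.replaceAll("^/+|/+$", "").split("/+");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < segments.length; i++) {
            boolean last = (i == segments.length - 1);
            if (i > 0) {
                sb.append('/');
            }
            sb.append(last ? "f." : "d.").append(segments[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Without prefixes, "docs" would have to be both a file and a directory.
        System.out.println(toPrefixedPath("/docs"));        // f.docs
        System.out.println(toPrefixedPath("/docs/intro"));  // d.docs/f.intro
    }
}
```

With the prefixes, the file saved for /docs (f.docs) and the directory holding /docs/intro (d.docs) can coexist in the same parent directory.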

Binary download: http://www.norconex.com/collectors/collector-http/download

norconex-collector-http-2.0.0

27 Nov 20:00
  • Upgraded Norconex Importer to version 2.0.0, which brings many new features to Norconex HTTP Collector, such as:
    • Document content splitting
    • Splitting of embedded documents into individual documents
    • New taggers for language detection, changing character case, parsing and formatting dates, providing content statistics, and more.
    • Read the Norconex Importer release notes for a complete list of changes at: http://www.norconex.com/product/importer/changes-report.html#a2.0.0
  • Can now supply a "urlsFile" as part of the startURLs, acting as a seed list.
  • New fast MVStore database implementation for URL database (from Norconex Collector Core).
  • New H2 database implementation for URL database (crawl data store).
  • Now keeps track of parent references (for embedded/split documents).
  • Link extractors now also extract a link's title and text (github #23), and they support the "nofollow" robot rule.
  • It is now possible to configure multiple link extraction classes, each taking effect on particular URLs and/or content-types.
  • IHtmlLinkExtractor can be configured to use specified HTML tags and attributes to find URLs.
  • Now licensed under The Apache License, Version 2.0.
  • Replaced the configuration option "deleteOrphans(true|false)" with "orphansStrategy(DELETE|PROCESS|IGNORE)".
  • The collector now references document content as reusable InputStream with memory caching instead of relying only on files. This saves a great deal of disk I/O and improves performance in most cases.
  • Refactored to use the new Norconex Collector Core library.
  • New and more scalable crawler event model along with new listeners.
  • More...

Binary download: http://www.norconex.com/collectors/collector-http/download

norconex-collector-http-1.3.4

25 Aug 00:19
  • MongoCrawlURLDatabase now supports user authentication.
  • Now requires Java 7 or higher.
  • Fixed DefaultRobotsTxtProvider failing to parse some robots.txt patterns.

Binary download: http://www.norconex.com/product/collector-http/download.html

norconex-collector-http-1.3.3

08 Aug 03:30
  • Upgraded JEF to 3.0.1 to fix stop action not working.
  • Fixed NullPointerException in robots.txt resolution under some circumstances.

Binary download: http://www.norconex.com/product/collector-http/download.html

norconex-collector-http-1.3.2

17 Jun 15:21
  • DefaultURLExtractor no longer treats an empty href as a URL ending with a double-quote.
  • GenericURLNormalizer no longer rejects URLs with spaces in them. It now logs a warning instead.

Binary download: http://www.norconex.com/product/collector-http/download.html

norconex-collector-http-1.3.1

10 Apr 15:35
  • Header and document checksum values are no longer added by default, to prevent the issue described in github issue #24. Instead, adding checksums is now an optional feature of DefaultHttpDocumentChecksummer and DefaultHttpHeadersChecksummer.

Binary download: http://www.norconex.com/product/collector-http/download.html

norconex-collector-http-1.3.0

24 Mar 18:33
  • Now supports NTLM authentication. Experimental support added for SPNEGO and Kerberos.
  • Document checksums are added to each document's metadata.
  • Refactoring of HTTPClient creation with many new configuration options added (connection timeout, charset, maximum redirects, and several more).
  • Can now optionally trust all SSL certificates.
  • Integrates new features of Norconex Importer 1.2.0 such as support for WordPerfect document parsing, new filters and transformers, etc.
  • Integrates new features of Norconex Committer 1.2.0 such as defining multiple committers, retrying upon commit failure, etc.
  • Other third-party library upgrades.

Binary download: http://www.norconex.com/product/collector-http/download.html

norconex-collector-http-1.2.0

11 Jan 08:17

Feature release:

  • New optional Mongo URL Database implementation.
  • New TikaURLExtractor class providing an alternate IURLExtractor implementation based on Apache Tika HTMLParser.
  • New SegmentCountURLFilter class for filtering URLs having a specified number of segments (can check duplicate segments too).
  • New unit tests.
  • MapDB URL Database classes moved to their own "mapdb" package. DefaultCrawlURLDatabaseFactory still exists, but is just a pointer to MapDBCrawlURLDatabaseFactory.
  • Example configurations now point to Norconex test pages to ensure their stability.
  • Upgraded dependent libraries: Norconex Committer 1.1.0, Norconex Commons Lang 1.2.0, MapDB 0.9.8 and other third-party libraries.
  • Improved Javadoc.
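
To illustrate the idea behind segment-based URL filtering mentioned above, here is a minimal sketch of counting URL path segments and detecting repeated segments. It is not the SegmentCountURLFilter implementation: the class and method names are hypothetical, and it assumes a segment is any non-empty part of the URL path between slashes:

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class SegmentCountSketch {

    // Hypothetical helper: number of non-empty path segments in a URL.
    static int countSegments(String url) {
        String path = URI.create(url).getPath();
        if (path == null || path.isEmpty()) {
            return 0;
        }
        int count = 0;
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Hypothetical helper: highest number of times any single segment
    // appears in the URL path (0 when the path has no segments).
    static int maxDuplicateSegments(String url) {
        String path = URI.create(url).getPath();
        Map<String, Integer> occurrences = new HashMap<>();
        int max = 0;
        if (path != null) {
            for (String segment : path.split("/")) {
                if (!segment.isEmpty()) {
                    max = Math.max(max, occurrences.merge(segment, 1, Integer::sum));
                }
            }
        }
        return max;
    }

    public static void main(String[] args) {
        System.out.println(countSegments("http://example.com/a/b/c"));          // 3
        System.out.println(maxDuplicateSegments("http://example.com/a/b/a/a")); // 3
    }
}
```

A filter built on such counts could, for example, reject URLs whose paths repeat the same segment many times, a common symptom of crawler traps.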