Releases: Norconex/crawlers

norconex-collector-http-2.1.0

09 Apr 02:34

essiembre

Release notes, binary downloads, and documentation are all available on the Norconex HTTP Collector website: http://www.norconex.com/collectors/collector-http/

norconex-collector-http-2.0.2

04 Feb 19:55

essiembre

Bug fix release:

Fixed the collector "stop" action having no effect (github #49).
Fixed crawl data wrongfully applied as metadata after the import phase.
Fixed NullPointerException when sitemap support is disabled.
Fixed incorrect deletion behavior for embedded orphan documents.
Improved log4j.properties logging options for crawler events.
Upgraded Norconex Collector Core dependency to 1.0.2.

Binary download: http://www.norconex.com/collectors/collector-http/download

norconex-collector-http-2.0.1

03 Dec 22:55

essiembre

From collector-core-1.0.1: When keepDownloads is true, saved files and directories are now prefixed with "f." and "d." respectively to avoid collisions.
Fixed errors in example configuration files.

Binary download: http://www.norconex.com/collectors/collector-http/download
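
The "f." and "d." prefixes matter because a URL such as /docs can be a page in its own right and also the parent of /docs/intro, so a single name on disk would have to be both a file and a directory. The sketch below illustrates the idea in Python; it is not the collector's Java code. The function name `download_path`, the "index" placeholder for the root page, and the use of the host name as the top directory are assumptions made for the example.

```python
from urllib.parse import urlsplit


def download_path(url: str) -> str:
    """Map a URL to a relative save path (hypothetical layout).

    Directories get a "d." prefix and the final file gets an "f." prefix,
    so a name used as both a page and a parent path never collides.
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if not segments:
        segments = ["index"]  # assumption: placeholder name for the root page
    *dirs, leaf = segments
    # Assumption: the host name is treated as the top-level directory.
    pieces = [f"d.{parts.netloc}"] + [f"d.{d}" for d in dirs] + [f"f.{leaf}"]
    return "/".join(pieces)
```

With this layout, http://example.com/docs is saved as the file "f.docs" while http://example.com/docs/intro is saved under the directory "d.docs", so the two can coexist.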

norconex-collector-http-2.0.0

27 Nov 20:00

essiembre

Upgraded Norconex Importer to version 2.0.0, which brings to Norconex HTTP Collector a lot of new features, such as:
- Document content splitting
- Splitting of embedded documents into individual documents
- New taggers for language detection, changing character case, parsing and formatting dates, providing content statistics, and more.
- Read the Norconex Importer release notes for a complete list of changes at: http://www.norconex.com/product/importer/changes-report.html#a2.0.0
Can now supply a "urlsFile" as part of the startURLs, acting as a seed list.
New fast MVStore database implementation for URL database (from Norconex Collector Core).
New H2 database implementation for URL database (crawl data store).
Now keeps track of parent references (for embedded/split documents).
Link extractors can now also extract a link's title and text (github #23), and they support the "nofollow" robot rule.
It is now possible to configure multiple link extraction classes, each taking effect on particular URLs and/or content-types.
IHtmlLinkExtractor can be configured to use specified HTML tags and attributes to find URLs.
Now licensed under The Apache License, Version 2.0.
Replaced the configuration option "deleteOrphans(true|false)" with "orphansStrategy(DELETE|PROCESS|IGNORE)".
The collector now references document content as reusable InputStream with memory caching instead of relying only on files. This saves a great deal of disk I/O and improves performance in most cases.
Refactored to use the new Norconex Collector Core library.
New and more scalable crawler event model along with new listeners.
More...

Binary download: http://www.norconex.com/collectors/collector-http/download
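
The memory-cached, reusable content stream mentioned above can be illustrated with a rough Python analogue. This is not the collector's Java implementation; it only shows the general technique of keeping content in memory up to a threshold, spilling to disk beyond it, and rewinding the stream so it can be read more than once. The function name `cache_content` and the 1 MiB default threshold are assumptions made for the example.

```python
import tempfile


def cache_content(chunks, max_memory_bytes=1024 * 1024):
    """Buffer content so it can be re-read without downloading it again.

    SpooledTemporaryFile keeps the data in memory until it grows past
    max_memory_bytes, then transparently moves it to a temporary file.
    """
    buf = tempfile.SpooledTemporaryFile(max_size=max_memory_bytes)
    for chunk in chunks:
        buf.write(chunk)
    buf.seek(0)  # rewind so the caller can read from the start
    return buf
```

Each processing step can call `seek(0)` and read the same buffer again, which avoids disk I/O entirely for small documents.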

norconex-collector-http-1.3.4

25 Aug 00:19

essiembre

MongoCrawlURLDatabase now supports user authentication.
Now requires Java 7 or higher.
Fixed DefaultRobotsTxtProvider failing to parse some robots.txt patterns.

Binary download: http://www.norconex.com/product/collector-http/download.html

norconex-collector-http-1.3.3

08 Aug 03:30

essiembre

Upgraded JEF to 3.0.1 to fix stop action not working.
Fixed NullPointerException in robots.txt resolution under some circumstances.

Binary download: http://www.norconex.com/product/collector-http/download.html

norconex-collector-http-1.3.2

17 Jun 15:21

essiembre

DefaultURLExtractor no longer treats an empty href as a URL ending with a double-quote.
GenericURLNormalizer no longer rejects URLs with spaces in them. It now logs a warning instead.

Binary download: http://www.norconex.com/product/collector-http/download.html

norconex-collector-http-1.3.1

10 Apr 15:35

essiembre

Header and document checksum values are no longer added by default, to prevent the issue described in github issue #24. Adding checksums is now an optional feature of DefaultHttpDocumentChecksummer and DefaultHttpHeadersChecksummer.

Binary download: http://www.norconex.com/product/collector-http/download.html

norconex-collector-http-1.3.0

24 Mar 18:33

essiembre

Now supports NTLM authentication. Experimental support added for SPNEGO and Kerberos.
Document checksums are added to each document's metadata.
Refactoring of HTTPClient creation with many new configuration options added (connection timeout, charset, maximum redirects, and several more).
Can now optionally trust all SSL certificates.
Integrates new features of Norconex Importer 1.2.0 such as support for WordPerfect document parsing, new filters and transformers, etc.
Integrates new features of Norconex Committer 1.2.0 such as defining multiple committers, retrying upon commit failure, etc.
Other third-party library upgrades.

Binary download: http://www.norconex.com/product/collector-http/download.html

norconex-collector-http-1.2.0

11 Jan 08:17

essiembre

Feature release:

New optional Mongo URL Database implementation.
New TikaURLExtractor class providing an alternate IURLExtractor implementation based on Apache Tika HTMLParser.
New SegmentCountURLFilter class for filtering URLs having a specified number of segments (can check duplicate segments too).
New unit tests.
MapDB URL Database classes moved to their own "mapdb" package. DefaultCrawlURLDatabaseFactory still exists, but is just a pointer to MapDBCrawlURLDatabaseFactory.
Example configurations now point to Norconex test pages to ensure their stability.
Upgraded dependent libraries: Norconex Committer 1.1.0, Norconex Commons Lang 1.2.0, MapDB 0.9.8 and other third-party libraries.
Improved Javadoc.
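
To make the segment-count idea concrete, here is a minimal Python sketch of filtering URLs by their number of path segments, with an optional check for repeated segments. It is an illustration only, not the SegmentCountURLFilter class or its configuration. The function names, the `duplicates_only` flag, and the "count reaches the threshold" matching rule are all assumptions made for the example.

```python
from collections import Counter
from urllib.parse import urlsplit


def path_segments(url: str) -> list:
    """Return the non-empty path segments of a URL."""
    return [s for s in urlsplit(url).path.split("/") if s]


def reaches_segment_count(url: str, count: int, duplicates_only: bool = False) -> bool:
    """Return True when the URL has at least `count` path segments.

    With duplicates_only=True, only the most frequently repeated segment
    is counted, which helps spot looping paths such as /a/b/a/b/a.
    """
    segments = path_segments(url)
    if duplicates_only:
        most_repeated = max(Counter(segments).values(), default=0)
        return most_repeated >= count
    return len(segments) >= count
```

A crawler could use such a check to skip overly deep URLs, or URLs whose paths repeat the same segment many times, which is a common sign of a crawler trap.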
