Commit

The OrphansStrategy default in crawler config is now PROCESS.
essiembre committed Jul 17, 2015
1 parent e0fac92 commit ab6a92a
Showing 3 changed files with 27 additions and 7 deletions.
6 changes: 6 additions & 0 deletions norconex-collector-core/src/changes/changes.xml
@@ -35,6 +35,12 @@
  given once chance to recover before a deletion request gets sent.
  This can be overwritten.
  </action>
+ <action dev="essiembre" type="update">
+ The OrphansStrategy default in crawler config is now PROCESS
+ to get around cases where temporary conditions prevent accessing
+ some documents that normally should (and should not avoid re-processing
+ on incremental crawls).
+ </action>
  <action dev="essiembre" type="update">
  MD5DocumentChecksummer#setField(String) has been deprecated in favor
  of MD5DocumentChecksummer#setFields(String...).
@@ -1,4 +1,4 @@
- /* Copyright 2014 Norconex Inc.
+ /* Copyright 2014-2015 Norconex Inc.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -67,7 +67,7 @@ public abstract class AbstractCrawlerConfig implements ICrawlerConfig {
  private int numThreads = 2;
  private File workDir = new File("./work");
  private int maxDocuments = -1;
- private OrphansStrategy orphansStrategy = OrphansStrategy.IGNORE;
+ private OrphansStrategy orphansStrategy = OrphansStrategy.PROCESS;

  private ICrawlDataStoreFactory crawlDataStoreFactory =
  new MapDBCrawlDataStoreFactory();
@@ -1,4 +1,4 @@
- /* Copyright 2014 Norconex Inc.
+ /* Copyright 2014-2015 Norconex Inc.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -80,14 +80,28 @@ enum OrphansStrategy {
  int getMaxDocuments();

  /**
- * Gets the strategy to adopt when there are orphans. Orphans are
+ * <p>Gets the strategy to adopt when there are orphans. Orphans are
  * references that were processed in a previous run, but were not in the
  * current run. In other words, they are leftovers from a previous run
  * that were not re-encountered in the current.
- * <br><br>
+ * </p><p>
  * Unless explicitly stated otherwise by an implementing class, the default
- * strategy is to DELETE orphans. Setting a <code>null</code> value is
- * the same as setting IGNORE.
+ * strategy is to <code>PROCESS</code> orphans.
+ * Setting a <code>null</code> value is the same as setting
+ * <code>IGNORE</code>.
+ * </p><p>
+ * Since 1.2.0, unless otherwise stated in implementing classes,
+ * the default orphan strategy is now <code>PROCESS</code>.
+ * </p><p>
+ * <b>Be careful:</b> Setting the orphan strategy to <code>DELETE</code>
+ * is NOT recommended in most cases. With some collectors, a temporary
+ * failure such as a network outage or a web page timing out, may cause
+ * some documents not to be crawled. When this happens, unreachable
+ * documents would be considered "orphans" and be deleted while under
+ * normal circumstances, they should be kept. Re-processing them
+ * (default), is usually the safest approach to confirm they still
+ * exist before deleting or updating them.
+ * </p>
  * @return orphans strategy
  */
  OrphansStrategy getOrphansStrategy();
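
To make the Javadoc above concrete, here is a minimal, self-contained sketch of how the three strategies could treat orphans. It is not the collector's actual implementation: the class `OrphanSketch`, the methods `findOrphans` and `resolveDeletions`, and the `stillExists` predicate (a stand-in for re-fetching a document) are all hypothetical names introduced for illustration. Only the strategy names and the rule that `null` behaves like `IGNORE` come from the commit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

public class OrphanSketch {

    /** Local copy of the strategy names used in the Javadoc (not the library enum). */
    public enum OrphansStrategy { PROCESS, IGNORE, DELETE }

    /** Orphans: references processed in the previous run but not seen in the current one. */
    public static Set<String> findOrphans(Set<String> previousRun, Set<String> currentRun) {
        Set<String> orphans = new LinkedHashSet<>(previousRun);
        orphans.removeAll(currentRun);
        return orphans;
    }

    /**
     * Returns the references for which a deletion would be sent.
     * stillExists is a hypothetical stand-in for re-processing (re-fetching) a document.
     */
    public static List<String> resolveDeletions(Set<String> orphans,
            OrphansStrategy strategy, Predicate<String> stillExists) {
        // Per the Javadoc, a null strategy is treated the same as IGNORE.
        OrphansStrategy effective =
                (strategy == null) ? OrphansStrategy.IGNORE : strategy;
        List<String> deletions = new ArrayList<>();
        for (String ref : orphans) {
            switch (effective) {
                case DELETE:
                    // Deleted without checking, even if only temporarily unreachable.
                    deletions.add(ref);
                    break;
                case PROCESS:
                    // Re-processed first; deleted only if confirmed gone.
                    if (!stillExists.test(ref)) {
                        deletions.add(ref);
                    }
                    break;
                case IGNORE:
                default:
                    // Left untouched.
                    break;
            }
        }
        return deletions;
    }

    public static void main(String[] args) {
        Set<String> previousRun = new LinkedHashSet<>(Arrays.asList("a", "b", "c"));
        Set<String> currentRun = new LinkedHashSet<>(Arrays.asList("a"));
        Set<String> orphans = findOrphans(previousRun, currentRun);
        // Assume "b" was only temporarily unreachable and "c" is really gone.
        Predicate<String> stillExists = ref -> ref.equals("b");
        for (OrphansStrategy s : OrphansStrategy.values()) {
            System.out.println(s + " -> delete "
                    + resolveDeletions(orphans, s, stillExists));
        }
    }
}
```

In this sketch, `DELETE` would remove both `b` and `c`, while `PROCESS` removes only `c`, which is the situation the "Be careful" note describes: a document missed because of a temporary failure survives under `PROCESS` but not under `DELETE`.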
