GitHub - cguzel/nutch-sitemapCrawler: Nutch Sitemap Crawler

Apache Nutch Sitemap Crawler README

For information about the Sitemap Crawler for Nutch, please visit our wiki:

   https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler

For the latest information about Nutch, please visit our website at:

   http://nutch.apache.org

and our wiki, at:

   http://wiki.apache.org/nutch/

To get started using Nutch, read the tutorial:

   http://wiki.apache.org/nutch/Nutch2Tutorial
   
Export Control

This distribution includes cryptographic software.  The country in which you 
currently reside may have restrictions on the import, possession, use, and/or 
re-export to another country, of encryption software.  BEFORE using any encryption 
software, please check your country's laws, regulations and policies concerning the
import, possession, or use, and re-export of encryption software, to see if this is 
permitted.  See <http://www.wassenaar.org/> for more information. 

The U.S. Government Department of Commerce, Bureau of Industry and Security (BIS), has 
classified this software as Export Commodity Control Number (ECCN) 5D002.C.1, which 
includes information security software using or performing cryptographic functions with 
asymmetric algorithms.  The form and manner of this Apache Software Foundation 
distribution makes it eligible for export under the License Exception ENC Technology 
Software Unrestricted (TSU) exception (see the BIS Export Administration Regulations, 
Section 740.13) for both object code and source code.

The following provides more details on the included cryptographic software:

Apache Nutch uses the PDFBox API in its parse-tika plugin for extracting textual content 
and metadata from encrypted PDF files. See http://pdfbox.apache.org for more 
details on PDFBox.