chore(deps): update dependency org.apache.nutch:nutch to v2 #62

Open: anaconda-renovate[bot] wants to merge 1 commit into base: master

@anaconda-renovate anaconda-renovate bot commented Sep 26, 2023

This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| org.apache.nutch:nutch (source) | compile | major | 1.18 -> 2.4 |
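
The "major" label follows from the first version component changing from 1 to 2. The sketch below is one way to reproduce that classification; it is an illustration only, not Renovate's actual versioning logic. The function names `parse_version` and `classify_update` are hypothetical, and the code assumes purely numeric dotted versions such as those listed in this PR.

```python
def parse_version(text):
    """Split a dotted version such as "1.18" into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def classify_update(current, new):
    """Name the first of major/minor/patch that differs between two versions.

    Hypothetical helper: only the first three components are compared.
    """
    labels = ("major", "minor", "patch")
    old_parts = parse_version(current)
    new_parts = parse_version(new)
    # Pad with zeros so that "2.3" and "2.3.1" line up component by component.
    old_parts += (0,) * (len(labels) - len(old_parts))
    new_parts += (0,) * (len(labels) - len(new_parts))
    for label, old, updated in zip(labels, old_parts, new_parts):
        if old != updated:
            return label
    return "none"


print(classify_update("1.18", "2.4"))   # major
print(classify_update("1.18", "1.20"))  # minor

# Versions named in the release notes of this PR.
listed = ["2.4", "2.3.1", "2.3", "2.2.1", "2.2", "2.0", "1.20", "1.19"]
print(max(listed, key=parse_version))   # 2.4
```

Comparing components as integers matters here: 1.20 sorts above 1.18 because 20 > 18, and 2.4 is the highest version in the list on numeric ordering alone. That ordering reflects version numbers only, not release dates.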

Release Notes

apache/nutch (org.apache.nutch:nutch)

v2.4

Compare Source

v2.3.1

Compare Source

v2.3

Compare Source

v2.2.1

Compare Source

v2.2

Compare Source

v2.0

Compare Source

v1.20

Compare Source

Release Report: https://s.apache.org/ovjf3

Sub-task
Bug
  • NUTCH-2634 - Some links marked as "nofollow" are followed anyway.
  • NUTCH-2820 - Review sample files used in any23 unit tests
  • NUTCH-2924 - Generate maxCount expr evaluated only once
  • NUTCH-2937 - parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode
  • NUTCH-2973 - Single domain names (eg https://localnet) can't be crawled - filtering fails
  • NUTCH-2974 - Ant build fails with "Unparseable date" on certain platforms
  • NUTCH-2979 - Upgrade Commons Text to 1.10.0
  • NUTCH-2982 - Generator: parameter for URL normalization not passed forward
  • NUTCH-2985 - Disable plugin urlfilter-validator by default
  • NUTCH-2992 - Fetcher: always block fetch queues when exceptions threshold is reached
  • NUTCH-3000 - protocol-selenium returns only the body, strips off the <head/> element
  • NUTCH-3001 - protocol-selenium requires Content-Type header
  • NUTCH-3002 - Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive
  • NUTCH-3008 - indexer-elastic: downgrade to ES 7.10.2 to address licensing issues
  • NUTCH-3012 - SegmentReader when dumping with option -recode: NPE on unparsed documents
  • NUTCH-3027 - Trivial resource leak patch in DomainSuffixes.java
  • NUTCH-3035 - Update license and notice file for release of 1.20
New Feature
  • NUTCH-2832 - Create tutorial on sending Nutch logs to Elasticsearch
  • NUTCH-2888 - Selenium Protocol: Support for Selenium 4
  • NUTCH-2920 - Implement an indexer-opensearch plugin
  • NUTCH-2991 - Support HTTP/S Header Authorization for Solr connections
  • NUTCH-3029 - Host specific max. and min. intervals in adaptive scheduler
Improvement
  • NUTCH-2853 - bin/nutch: remove deprecated commands solrindex, solrdedup, solrclean
  • NUTCH-2883 - Provide means to run server as a persistent service in Docker container
  • NUTCH-2897 - Do not suppress deprecated API warnings
  • NUTCH-2961 - Upgrade dependencies of parsefilter-naivebayes
  • NUTCH-2980 - Upgrade Selenium Java to 4.7.2
  • NUTCH-2983 - nutch-default.xml improvements
  • NUTCH-2990 - HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309
  • NUTCH-2993 - ScoringDepth plugin to skip depth check based on URL Pattern
  • NUTCH-2995 - Upgrade to crawler-commons 1.4
  • NUTCH-2996 - Use new SimpleRobotRulesParser API entry point (crawler-commons 1.4)
  • NUTCH-2997 - Add Override annotations where applicable
  • NUTCH-3004 - Avoid NPE in HttpResponse
  • NUTCH-3005 - Upgrade selenium as needed
  • NUTCH-3009 - Upgrade to Hadoop 3.3.6
  • NUTCH-3010 - Injector: count unique number of injected URLs
  • NUTCH-3011 - HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx)
  • NUTCH-3013 - Employ commons-lang3's StopWatch to simplify timing logic
  • NUTCH-3014 - Standardize Job names
  • NUTCH-3015 - Add more CI steps to GitHub master-build.yml
  • NUTCH-3017 - Allow fast-urlfilter to load from HDFS/S3 and support gzipped input
  • NUTCH-3025 - urlfilter-fast to filter based on the length of the URL
  • NUTCH-3031 - ProtocolFactory host mapper to support domains
  • NUTCH-3032 - Indexing plugin as an adapter for end user's own POJO instances
  • NUTCH-3036 - Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium
Task
  • NUTCH-2959 - Upgrade to Apache Tika 2.9.0
  • NUTCH-2977 - Support for showing dependency tree
  • NUTCH-2978 - Move to slf4j2 and remove log4j1 and reload4j
  • NUTCH-2984 - Drop test proxy server and benchmark tool
  • NUTCH-2989 - Can't have username/pw AND https on elastic-indexer?!
  • NUTCH-2998 - Remove the Any23 plugin
  • NUTCH-2999 - Update Lucene version to latest 8.x
  • NUTCH-3016 - Upgrade Apache Ivy to 2.5.2
  • NUTCH-3019 - Upgrade to Apache Tika 2.9.1
  • NUTCH-3020 - ParseSegment should check for protocol's flags for truncation
  • NUTCH-3024 - Remove flaky 'dependency check' target
  • NUTCH-3033 - Upgrade Ivy to v2.5.2
  • NUTCH-3037 - Upgrade org.apache.kafka:kafka_2.12: to v3.7.0
  • NUTCH-3038 - Address issues discovered during 1.20 release management dryrun

v1.19

Compare Source

Release Report: https://s.apache.org/lf6li

Breaking Changes
Sub-task
  • NUTCH-2819 - Move spotbugs "installation" directory to avoid that spotbugs is shipped in Nutch runtime
  • NUTCH-2846 - Fix various bugs spotted by NUTCH-2815
  • NUTCH-2850 - Method ignores exceptional return value
  • NUTCH-2851 - Random object created and used only once
  • NUTCH-2855 - Update org.elasticsearch.client
Bug
  • NUTCH-2290 - Update licenses of bundled libraries
  • NUTCH-2512 - Nutch does not build under JDK9
  • NUTCH-2821 - Deduplicate licenses in LICENSE.txt file
  • NUTCH-2822 - Split the LICENSE.txt file into two files, one for source releases and one for binary releases
  • NUTCH-2831 - Elastic indexer does not support SSL
  • NUTCH-2843 - Duplicate declaration of dependencies in ivy.xml
  • NUTCH-2858 - urlnormalizer-protocol: URL port is lost during normalization
  • NUTCH-2862 - Do not include Ivy jar in source release package
  • NUTCH-2863 - Injector to parse command-line flags case-insensitive
  • NUTCH-2866 - MetaData.toString() should return "key=value ..."
  • NUTCH-2868 - urlnormalizer-protocol fails with StringIndexOutOfBoundsException when reading invalid line in configuration file
  • NUTCH-2881 - bug in 'nutch' symlink in docker container
  • NUTCH-2889 - nutch indexer-elasticsearch plugin, doesn't work with https protocol
  • NUTCH-2890 - Protocol-okhttp: upgrade okhttp to 4.9.1 to address infinite connection retries
  • NUTCH-2894 - Java plugin compilation classpath: prioritize plugin dependencies
  • NUTCH-2899 - Remove needless warning about missing o/a/rat/anttasks/antlib.xml
  • NUTCH-2902 - Jexl parsing error on statements
  • NUTCH-2905 - Mask sensitive strings in log output of index writers
  • NUTCH-2910 - FetchItemQueues overloaded constructor also interprets fetcher timeout as -1, i.e. no timeout.
  • NUTCH-2915 - Upgrade to log4j 2.15.0
  • NUTCH-2916 - Fix log file rotation / rename default log file
  • NUTCH-2917 - Remove transitive dependency to log4j 1.x
  • NUTCH-2922 - Upgrade to log4j 2.17.0
  • NUTCH-2935 - DeduplicationJob: failure on URLs with invalid percent encoding
  • NUTCH-2936 - Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used
  • NUTCH-2945 - Solr Index Writer plugin schema.xml missing a copyToField
  • NUTCH-2947 - Fetcher: keep state of empty fetch queues unless queue feeder is finished
  • NUTCH-2949 - Tasks of a multi-threaded map runner may fail because of slow creation of URL stream handlers
  • NUTCH-2951 - Crawl datum with metadata WRITABLE_GENERATE_TIME_KEY awaits fetching forever
  • NUTCH-2955 - indexer-solr: replace deprecated/removed field type solr.LatLonType
  • NUTCH-2969 - Javadoc: Javascript search is not working when built on JDK 11
New Feature
Improvement
  • NUTCH-1403 - Add default ScoringFilter for manipulating metadata
  • NUTCH-2429 - Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers
  • NUTCH-2449 - Usage of Tika LanguageIdentifier in language-identifier plugin
  • NUTCH-2573 - Suspend crawling if robots.txt fails to fetch with 5xx status
  • NUTCH-2795 - CrawlDbReader: compress CrawlDb dumps if configured
  • NUTCH-2807 - SitemapProcessor to warn that ignoring robots.txt affects detection of sitemaps
  • NUTCH-2808 - Document side effects of ignoring robots.txt
  • NUTCH-2840 - Fix 'report-vulnerabilities' ant target in build.xml
  • NUTCH-2842 - Fix Javadoc warnings, errors and add Javadoc check to Github Action and Jenkins
  • NUTCH-2845 - Update urlfilter-suffix rules
  • NUTCH-2847 - HttpDateFormat: Simplify based on new Java 8 DateTime API
  • NUTCH-2849 - Replace remaining package.html files with package-info.java
  • NUTCH-2857 - Upgrade from JDK1.8 --> JDK11
  • NUTCH-2859 - urlnormalizer-protocol: allow to normalize domains
  • NUTCH-2861 - Remove parse-swf
  • NUTCH-2864 - Upgrade Dockerfile to use JDK 11
  • NUTCH-2865 - WARC exporter support for metadata and dropping empty responses
  • NUTCH-2867 - Support for custom HostDb aggregators
  • NUTCH-2869 - Add @Override annotations to Nutch plugins
  • NUTCH-2879 - fireant upgrade dependency hadoop-hdfs in ivy/ivy.xml from 3.1.3 to 3.3.1
  • NUTCH-2882 - Configure NutchUiServer for DEPLOYMENT and improve logging
  • NUTCH-2885 - Upgrade to Log4j2
  • NUTCH-2886 - Move Nutch WebApp to separate repository
  • NUTCH-2891 - Upgrade to Tika 2.1
  • NUTCH-2892 - Upgrade to Any23 2.5
  • NUTCH-2893 - fireant upgrade dependency elasticsearch-rest-high-level-client in src/plugin/indexer-elastic/ivy.xml from 7.11.1 to 7.13.2
  • NUTCH-2896 - Protocol-okhttp: make connection pool configurable
  • NUTCH-2898 - IDE Setup for nutch with Intellij IDEA is not well documented
  • NUTCH-2903 - Unable to Connect to Elasticsearch over HTTPS
  • NUTCH-2904 - Upgrade to crawler-commons 1.2
  • NUTCH-2908 - Log mapreduce job messages and counters in local mode
  • NUTCH-2911 - Add cleanup call in Fetcher.java
  • NUTCH-2914 - nutch-default.xml: remove obsolete and unused properties
  • NUTCH-2918 - Upgrade to log4j 2.16.0
  • NUTCH-2919 - Upgrade to Tika 2.2.1 and Any23 2.6
  • NUTCH-2923 - Add Job Id in Job Failure messages
  • NUTCH-2929 - Fetcher: start threads slowly to avoid that resources are temporarily exhausted
  • NUTCH-2930 - Protocol-okhttp: implement IP filter
  • NUTCH-2946 - Fetcher: optionally slow down fetching from hosts with repeated exceptions
  • NUTCH-2948 - Upgrade dependencies to Any23 2.7 and Tika 2.3.0
  • NUTCH-2950 - UpdateHostDb: performance improvements
  • NUTCH-2952 - Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)
  • NUTCH-2953 - Indexer Elastic to ignore SSL issues
  • NUTCH-2956 - index-geoip: dependency upgrades and improvements
  • NUTCH-2957 - indexer-solr / Solr schema: add fall-back field definitions for unknown index fields
  • NUTCH-2958 - Upgrade to crawler-commons 1.3
  • NUTCH-2962 - Update and complete package info of protocol plugins
  • NUTCH-2963 - Upgrade dependencies before release of 1.19
Task
  • NUTCH-2826 - Migrate Nutch Site from Apache CMS to Hugo
  • NUTCH-2870 - fireant upgrade dependency junit in ivy/ivy.xml from 4.13.1 to 4.13.2

Configuration

📅 Schedule: Branch creation - "every weekday" in timezone UTC, Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR has been generated by Renovate Bot.

Edited/Blocked Notification

Renovate will not automatically rebase this PR, because it does not recognize the last commit author and assumes somebody else may have edited the PR.

You can manually request rebase by checking the rebase/retry box above.

⚠️ Warning: custom changes will be lost.
