NUTCH-3108 Fix SLF4J Class Loader Conflict in language-identifier #849

Open
wants to merge 1 commit into
base: master

maciejpuzianowski
Contributor

When running Apache Nutch 1.20 on a distributed Hadoop cluster with the language-identifier plugin enabled, a class loader conflict occurs during the parse process. This results in the following error:

2025-02-24 08:58:59,152 INFO mapreduce.Job: Task Id : attempt_1740061418437_0135_m_000001_0, Status : FAILED
Error: loader constraint violation: when resolving method 'org.slf4j.ILoggerFactory org.slf4j.impl.StaticLoggerBinder.getLoggerFactory()' the class loader org.apache.nutch.plugin.PluginClassLoader @4c5228e7 of the current class, org/slf4j/LoggerFactory, and the class loader 'app' for the method's defining class, org/slf4j/impl/StaticLoggerBinder, have different Class objects for the type org/slf4j/ILoggerFactory used in the signature (org.slf4j.LoggerFactory is in unnamed module of loader org.apache.nutch.plugin.PluginClassLoader @4c5228e7, parent loader 'app'; org.slf4j.impl.StaticLoggerBinder is in unnamed module of loader 'app')

I managed to resolve this issue by modifying the following files:
ivy.xml ->

<dependency org="org.apache.tika" name="tika-langdetect-optimaize" rev="2.9.0" conf="*->default">
  <!-- exclusions of dependencies provided in Nutch core (ivy/ivy.xml) -->
  <exclude org="org.apache.tika" name="tika-core" />
  <exclude org="com.google.guava" name="guava" />
  <exclude org="org.slf4j" name="slf4j-api" />
</dependency>

and plugin.xml ->

<library name="annotations-12.0.jar"/>
<library name="checker-qual-3.33.0.jar"/>
<library name="error_prone_annotations-2.18.0.jar"/>
<library name="failureaccess-1.0.1.jar"/>
<library name="j2objc-annotations-2.8.jar"/>
<library name="jsonic-1.2.11.jar"/>
<library name="jsr305-3.0.2.jar"/>
<library name="language-detector-0.6.jar"/>
<library name="listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar"/>
<library name="tika-langdetect-optimaize-2.9.0.jar"/>

@lewismc requested a review from tballison February 25, 2025 23:41
@lewismc
Member

lewismc commented Feb 25, 2025

@maciejpuzianowski, thanks for the PR.

@tballison, any comments on the tika dependency chain in Nutch? Should we be looking to upgrade tika to 3.1.0? Would that address the slf4j issue?
Can you please link to the PR which introduced your shaded jar in place of the official tika jar(s)?
Also, from memory, I believe that an upgrade to the core Hadoop libraries may impact this dependency tree.

@sebastian-nagel
Contributor

> upgrade tika to 3.1.0

We currently use a shaded Tika package (2.9.1.0, thanks @tballison!) because of a conflict with the commons-io version required by Tika (or POI) and provided by Hadoop, see NUTCH-2959. Upgrading will force everybody to use at least Hadoop 3.4.0 in distributed mode.

@maciejpuzianowski, could you provide the Hadoop version of your cluster? This may help to reproduce the issue and test alternative solutions, such as an upgrade to a more recent version of Tika. Thanks!

@maciejpuzianowski
Contributor Author

Sure @sebastian-nagel,
I am using Hadoop 3.4.1.

@tballison
Contributor

Thank you @sebastian-nagel for beating me to it. Yes, we had to shade commons-io because Hadoop was using an old version, and Tika and POI were using some of the newer API calls.

I just released this shim for 2.9.3 and 3.1.0. Maybe give those a try?

I did notice this: https://github.com/tballison/hadoop-safe-tika/blob/main/tika-parsers-standard-package-shaded/pom.xml#L67

That may be causing the problems. If we have to downgrade slf4j to match Hadoop, we can do that... or maybe we shade logging too?

I'm happy to make those updates and release a 2.9.3.1 and/or 3.1.0.1. :D

Let me know what makes sense.
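
For reference, the "shade logging too" option could look roughly like the fragment below. This is only a sketch of a maven-shade-plugin relocation, not the actual hadoop-safe-tika configuration: the `shaded.example` package prefix is a made-up placeholder, and the surrounding pom.xml is assumed to exist already.

```xml
<!-- Sketch only: relocates org.slf4j inside the shaded jar so it cannot
     clash with the slf4j classes that Hadoop puts on the 'app' class loader.
     The shadedPattern prefix is a hypothetical placeholder. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <!-- every class and reference under org.slf4j is rewritten -->
            <pattern>org.slf4j</pattern>
            <shadedPattern>shaded.example.org.slf4j</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

One likely trade-off: a relocated slf4j API would no longer find the logging binding provided by Hadoop, so log output from the shaded code would probably be lost unless a binding is bundled and relocated along with it. Downgrading slf4j to match Hadoop, or excluding slf4j-api as this PR does, avoids that.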
