Issue: Spark-Solr Connection via spark executor #363

Open
Rems143 opened this issue Nov 1, 2024 · 0 comments

We're attempting to execute a basic Spark job to read/write data from Solr, using the following environment:

CDP version: 7.1.9
Spark: Spark3
Solr: 8.11
Spark-Solr Connector: opt/cloudera/parcels/SPARK3/lib/spark3/spark-solr/spark-solr-3.9.3000.3.3.7191000.0-78-shaded.jar

When we try to interact with Solr through Spark, the job hangs indefinitely and produces neither errors nor results. Other components, such as Hive and HBase, integrate smoothly with Spark, and we’re using a valid Kerberos ticket that authenticates successfully with other Hadoop components. We’ve also tested REST API calls to Solr with both curl and Python’s requests library, and we can retrieve data with the same Kerberos ticket.

The problem appears to be isolated to Spark’s connection with Solr, since all other systems interact as expected. Has anyone experienced a similar issue, or does anyone have ideas about what might be causing this?

solr_options = {
    "zkhost": "zkURL-01.orgis.ie:2181,zkURL-02.orgis.ie:2181,zkURL.orgis.ie:2181/solr",
    "collection": "collection_phonectic_test2"
}

# Read data from Solr
df = spark.read.format("solr").options(**solr_options).load()
df.show()
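
The `zkhost` value above follows the usual ZooKeeper connect-string shape: a comma-separated list of `host:port` pairs, with one optional chroot path (`/solr` here) at the end that applies to the whole ensemble. A minimal sketch for splitting it, which may help when checking each ZooKeeper node separately. `parse_zkhost` is a hypothetical helper, not part of spark-solr, and it assumes every entry carries an explicit port:

```python
from typing import List, Tuple


def parse_zkhost(zkhost: str) -> Tuple[List[Tuple[str, int]], str]:
    """Split a ZooKeeper connect string into (host, port) pairs and a chroot.

    Hypothetical helper for illustration; assumes every entry has an explicit port.
    """
    # The chroot, if present, starts at the first "/" and runs to the end of the string.
    slash = zkhost.find("/")
    if slash == -1:
        ensemble, chroot = zkhost, ""
    else:
        ensemble, chroot = zkhost[:slash], zkhost[slash:]

    servers = []
    for entry in ensemble.split(","):
        host, _, port = entry.strip().rpartition(":")
        servers.append((host, int(port)))
    return servers, chroot


servers, chroot = parse_zkhost(
    "zkURL-01.orgis.ie:2181,zkURL-02.orgis.ie:2181,zkURL.orgis.ie:2181/solr"
)
print(servers)
print(chroot)
```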

Interestingly, if I specify a non-existent Solr collection, I get an error stating that the collection doesn’t exist. This leads me to believe that ZooKeeper handles the initial lookup, since it holds the metadata for the Solr collections. So the Spark executor seems able to reach ZooKeeper but may be failing to establish a working connection from the Spark executor nodes to the Solr nodes.

Spark executor logs:
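
One way to test this hypothesis is to check plain TCP reachability of a Solr node from wherever the code runs. The sketch below uses only the Python standard library. `can_connect` is a hypothetical helper, and the host and port in the comments are taken from the `Host` header in the logs. A successful TCP connect says nothing about authentication, so this only helps separate network problems from Kerberos/SPNEGO problems.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Hypothetical usage on the driver:
#   can_connect("worker-02.xx", 8985)
#
# To run the same check on executors (assumes an active SparkSession named `spark`):
#   spark.sparkContext.parallelize(range(4), 4) \
#       .map(lambda _: (socket.gethostname(), can_connect("worker-02.xx", 8985))) \
#       .collect()
```

If the check returns True from the executors as well, the network path is probably fine and the authentication step is the more likely suspect.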

DEBUG http.wire: [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)]: http-outgoing-0 >> "POST /solr/collection_phonectic_test2_shard1_replica_n1/select HTTP/1.1[\r][\n]"
 DEBUG http.wire: [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)]: http-outgoing-0 >> "Content-Type: application/x-www-form-urlencoded; charset=UTF-8[\r][\n]"
 DEBUG http.wire: [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)]: http-outgoing-0 >> "User-Agent: Solr[org.apache.solr.client.solrj.impl.HttpSolrClient] 1.0[\r][\n]"
 DEBUG http.wire: [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)]: http-outgoing-0 >> "Content-Length: 652[\r][\n]"
 DEBUG http.wire: [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)]: http-outgoing-0 >> "Host: worker-02.xx:8985[\r][\n]"
 DEBUG http.wire: [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)]: http-outgoing-0 >> "Connection: Keep-Alive[\r][\n]"
 DEBUG http.wire: [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)]: http-outgoing-0 >> "[\r][\n]"
 DEBUG http.wire: [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)]: http-outgoing-0 >> "q=*%3A*&rows=5000&qt=%2Fselect&fq=_version_%3A%5B*+TO+1812630352655548416%5D&fq=%7B%21hash+workers%3D2+worker%3D0%7D&collection=collection_phonectic_test2&fl=address%2Cmade%2Ccategory%2Ccompanyname%2Cuserfeedback&distrib=false&start=0&sort=id+asc&partitionKeys=_version_&cursorMark=*&wt=javabin&version=2"
 DEBUG http.wire: [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)]: http-outgoing-0 << "HTTP/1.1 401 Authentication required[\r][\n]"
 DEBUG http.wire: [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)]: http-outgoing-0 << "Content-Security-Policy: default-src 'none'; base-uri 'none'; connect-src 'self'; form-action 'self'; font-src 'self'; frame-ancestors 'none'; img-src 'self'; media-src 'self'; style-src 'self' 'unsafe-inline'; script-src 'self'; worker-src 'self';[\r][\n]"
 DEBUG http.wire: [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)]: http-outgoing-0 << "X-Content-Type-Options: nosniff[\r][\n]"
 DEBUG http.wire: [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)]: http-outgoing-0 << "X-Frame-Options: SAMEORIGIN[\r][\n]"
 DEBUG http.wire: [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)]: http-outgoing-0 << "X-XSS-Protection: 1; mode=block[\r][\n]"
 DEBUG http.wire: [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)]: http-outgoing-0 << "Strict-Transport-Security: max-age=31536000; includeSubDomains[\r][\n]"
 DEBUG http.wire: [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)]: http-outgoing-0 << "WWW-Authenticate: Negotiate[\r][\n]"
 DEBUG http.wire: [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)]: http-outgoing-0 << "Set-Cookie: hadoop.auth=; Secure; HttpOnly[\r][\n]"
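
Reading the wire log by eye is tedious, so here is a minimal sketch that summarizes one exchange. In the excerpt above, the request carries no `Authorization` header and the server answers `401` with `WWW-Authenticate: Negotiate`, which is a SPNEGO (Kerberos) challenge. The excerpt ends at the challenge, so it does not show whether the client ever retries with a token. `summarize_wire_log` is a hypothetical helper written for this log format, the sample lines are a subset of the log above, and the body string is a subset of the logged POST body.

```python
import re
from urllib.parse import parse_qs

# Matches the tail of an Apache HttpClient wire-log line: direction marker and quoted payload.
WIRE_RE = re.compile(r'http-outgoing-\d+ (>>|<<) "(.*)"\s*$')


def summarize_wire_log(lines):
    """Summarize one request/response exchange from http.wire DEBUG lines."""
    sent, received = [], []
    for line in lines:
        match = WIRE_RE.search(line)
        if not match:
            continue
        direction, payload = match.groups()
        # Drop the literal "[\r][\n]" markers the wire logger prints for CRLF.
        payload = payload.replace("[\\r][\\n]", "")
        (sent if direction == ">>" else received).append(payload)

    return {
        "request_line": sent[0] if sent else None,
        "status_line": received[0] if received else None,
        "sent_authorization": any(p.lower().startswith("authorization:") for p in sent),
        "challenge": next(
            (p.split(":", 1)[1].strip() for p in received
             if p.lower().startswith("www-authenticate:")),
            None,
        ),
    }


PREFIX = ("DEBUG http.wire: [Executor task launch worker for task 0.0 "
          "in stage 0.0 (TID 0)]: http-outgoing-0 ")
sample = [
    PREFIX + r'>> "POST /solr/collection_phonectic_test2_shard1_replica_n1/select HTTP/1.1[\r][\n]"',
    PREFIX + r'>> "Host: worker-02.xx:8985[\r][\n]"',
    PREFIX + r'>> "Connection: Keep-Alive[\r][\n]"',
    PREFIX + r'<< "HTTP/1.1 401 Authentication required[\r][\n]"',
    PREFIX + r'<< "WWW-Authenticate: Negotiate[\r][\n]"',
]
summary = summarize_wire_log(sample)
print(summary)

# Decode part of the URL-encoded POST body to see what the connector asked for.
body = "q=*%3A*&rows=5000&fq=%7B%21hash+workers%3D2+worker%3D0%7D&distrib=false"
params = parse_qs(body)
print(params)
```

The decoded body shows a per-shard, non-distributed query (`distrib=false`) with a hash-partition filter, which suggests the executor is contacting the shard replica directly rather than going through ZooKeeper at that point.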