Unable to run on Kubernetes Cluster #42

Open
randomthought opened this issue Jan 15, 2019 · 2 comments

Comments

randomthought commented Jan 15, 2019

Firstly, thanks for the great work!

I am having difficulty getting simple_tensorflow_serving to work on a Kubernetes cluster. The problem seems to be related to H2O, but the logs are not descriptive enough for me to pinpoint it. The server just hangs, repeating the connection refused error below.

01-15 20:11:07.286 10.0.0.41:54321       180    main      INFO: H2O started in 2983ms
01-15 20:11:07.286 10.0.0.41:54321       180    main      INFO:
01-15 20:11:07.286 10.0.0.41:54321       180    main      INFO: Open H2O Flow in your web browser: http://10.0.0.41:54321
01-15 20:11:07.287 10.0.0.41:54321       180    main      INFO:
01-15 20:11:09.699 10.0.0.41:54321       180    FJ-126-3  INFO: Cloud of size 2 formed [/10.0.0.5:54321, /10.0.0.41:54321]
2019-01-15 20:11:14 INFO     Try to get function from file: ./models/h2o_prostate_model/preprocess_function.marshal
2019-01-15 20:11:14 INFO     Try to get function from file: ./models/h2o_prostate_model/postprocess_function.marshal
2019-01-15 20:11:14 INFO     Try to initialize and connect the h2o server
Checking whether there is an H2O instance running at http://localhost:54321. connected.
Warning: Your H2O cluster version is too old (8 months and 27 days)! Please download and install the latest version from http://h2o.ai/download/
01-15 20:11:14.371 10.0.0.41:54321       180    #28758-13 INFO: POST /4/sessions, parms: {}
01-15 20:11:14.377 10.0.0.41:54321       180    #28758-13 INFO: Locking cloud to new members, because water.api.schemas4.SessionIdV4
01-15 20:11:14.414 10.0.0.41:54321       180    #.5:54321 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused
01-15 20:11:14.717 10.0.0.41:54321       180    #.5:54321 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused
01-15 20:11:15.020 10.0.0.41:54321       180    #.5:54321 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused
01-15 20:11:15.322 10.0.0.41:54321       180    #.5:54321 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused
01-15 20:11:15.625 10.0.0.41:54321       180    #.5:54321 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused
01-15 20:11:15.928 10.0.0.41:54321       180    #.5:54321 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused

tobegit3hub (Owner) commented

Thanks for reporting.

Have you set up the H2O cluster to run with a single H2O instance? It looks like a network problem, but I'm not sure why it fails to connect to the localhost service.
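
One way to narrow down whether this is a network problem is to check, from inside the pod that logs the errors, whether each member of the H2O cloud accepts connections on its port. The snippet below is only a minimal sketch under stated assumptions: it uses the Python standard library rather than any H2O API, the helper names `can_connect` and `check_members` are hypothetical, and it assumes the failing traffic is TCP (the message mentions UDP, but `java.net.ConnectException` is typically raised on a TCP connect).

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable hosts, and timeouts.
        return False


def check_members(members, timeout=2.0):
    """Map each "host:port" string to whether it accepted a TCP connection."""
    return {
        f"{host}:{port}": can_connect(host, port, timeout)
        for host, port in members
    }


# Example call, using the addresses from the "Cloud of size 2 formed" log line:
#   check_members([("10.0.0.5", 54321), ("10.0.0.41", 54321)])
```

If the local address is reachable but the other member is not, that would point to pod-to-pod networking or an unwanted second H2O node joining the cloud, rather than to the localhost service itself.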

DivyaMereddy007 commented

I ran into the same issue. Error log:

11-19 14:40:18.737 10.237.73.201:54321 18656 #80:54323 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused

Below is the config I tried:

conf$spark.executor.instances <- 171
spark.yarn.executor.memoryOverhead <- 2048
conf$spark.executor.memory <- "18g"
conf$spark.executor.cores <- 5

spark.yarn.driver.memoryOverhead <- 39936
conf$spark.driver.memory <- "57.6g"
conf$spark.driver.cores <- 5

conf$'sparklyr.shell.executor-memory' <- "32g"
conf$'sparklyr.shell.driver-memory' <- "32g"
conf$spark.yarn.am.memory <- "32g"
conf$spark.dynamicAllocation.enabled <- "false"
