Jaeger (and presumably similar) testcontainers test flaky #1871

Closed · anuraaga opened this issue Oct 23, 2020 · 10 comments · Fixed by #2426
Labels: Bug (Something isn't working), help wanted

Comments

@anuraaga (Contributor) commented Oct 23, 2020:

Lately I see the Jaeger tests fail with a timeout relatively frequently - I've even noticed this on my machine, not just on CI. Docker Hub has been introducing pull limits, from what I understand, and perhaps that's affecting us. We probably want to cache .docker or whatever to reduce Docker pulls because of the rate limiting, though that wouldn't solve the problem for a new contributor. Another option is to rehost on ghcr or bintray, but that doesn't seem great either.

anuraaga added the Bug label Oct 23, 2020
@pavolloffay (Member) commented:

Or migrate to quay.io, which should not have pull limits.
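
(For illustration only, not from the original thread: a minimal sketch of pointing testcontainers at a mirrored image while still treating it as compatible with the official one. The quay.io path and tag below are assumptions, and this is a sketch against the testcontainers Java API, not the actual test code.)

import org.testcontainers.containers.GenericContainer;
import org.testcontainers.utility.DockerImageName;

class JaegerMirrorSketch {
    // Pull a mirrored copy of the image, but declare it compatible with the
    // official name so testcontainers' compatibility checks still pass.
    static GenericContainer<?> jaegerFromMirror() {
        DockerImageName mirrored = DockerImageName
                .parse("quay.io/jaegertracing/all-in-one:1.21") // hypothetical mirror location/tag
                .asCompatibleSubstituteFor("jaegertracing/all-in-one");
        return new GenericContainer<>(mirrored);
    }
}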

@anuraaga (Contributor, Author) commented:

Yeah, there are a few options for a container registry - the problem (maybe not a big one) is that we'd effectively be rehosting the official image, and it'd be nice to avoid that if possible, I think.

@anuraaga (Contributor, Author) commented:

I hit this on the instrumentation repo locally, and it's an image we host on bintray, I believe. Maybe a bug in testcontainers? /cc @iNikem

io.opentelemetry.smoketest.SpringBootSmokeTest > spring boot smoke test on JDK 8 FAILED
    org.testcontainers.containers.ContainerLaunchException: Container startup failed
        at org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:331)
        at org.testcontainers.containers.GenericContainer.start(GenericContainer.java:312)
        at io.opentelemetry.smoketest.SmokeTest.startTarget(SmokeTest.groovy:95)
        at io.opentelemetry.smoketest.SpringBootSmokeTest.spring boot smoke test on JDK #jdk(SpringBootSmokeTest.groovy:26)

        Caused by:
        org.rnorth.ducttape.RetryCountExceededException: Retry limit hit with exception
            at org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:88)
            at org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:324)
            ... 3 more

            Caused by:
            org.testcontainers.containers.ContainerLaunchException: Could not create/start container
                at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:498)
                at org.testcontainers.containers.GenericContainer.lambda$doStart$0(GenericContainer.java:326)
                at org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:81)
                ... 4 more

                Caused by:
                org.testcontainers.containers.ContainerLaunchException: Timed out waiting for container port to open (localhost ports: [33437] should be listening)
                    at org.testcontainers.containers.wait.strategy.HostPortWaitStrategy.waitUntilReady(HostPortWaitStrategy.java:49)
                    at org.testcontainers.containers.wait.strategy.AbstractWaitStrategy.waitUntilReady(AbstractWaitStrategy.java:35)
                    at org.testcontainers.containers.GenericContainer.waitUntilContainerStarted(GenericContainer.java:893)
                    at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:441)
                    ... 6 more

@iNikem (Contributor) commented Oct 24, 2020:

Without further digging, it seems to me like the problem is with the container start, not the image pull.

@anuraaga (Contributor, Author) commented Oct 24, 2020:

Yeah, you're right - I realized it doesn't usually pull this image since it's already present locally (which was why my tests were reliably failing before 😅 - it's probably good to set the image pull policy to always).
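
(A minimal sketch of what "pull policy always" could look like with testcontainers; the image name and tag here are just the Jaeger defaults used for illustration, not copied from the actual test.)

import org.testcontainers.containers.GenericContainer;
import org.testcontainers.images.PullPolicy;

class AlwaysPullSketch {
    // Force testcontainers to re-pull the image on every run instead of reusing
    // a locally cached copy, so pull/rate-limit failures show up in local runs too.
    static GenericContainer<?> jaegerAlwaysPulled() {
        return new GenericContainer<>("jaegertracing/all-in-one:1.21")
                .withImagePullPolicy(PullPolicy.alwaysPull());
    }
}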

@anuraaga (Contributor, Author) commented:

Noticed that even for the ones where I raised the timeout from 1 to 2 minutes, it still fails pretty frequently, both on CI and on my MacBook. Wonder what's up.
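
(For context, raising the startup timeout in testcontainers looks roughly like this; a sketch with an assumed image tag, not the actual test code.)

import java.time.Duration;
import org.testcontainers.containers.GenericContainer;

class StartupTimeoutSketch {
    // Allow the container up to 2 minutes to start instead of the default 60 seconds.
    static GenericContainer<?> jaegerWithLongerTimeout() {
        return new GenericContainer<>("jaegertracing/all-in-one:1.21")
                .withStartupTimeout(Duration.ofMinutes(2));
    }
}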

@jkwatson (Contributor) commented:

Interesting. These never fail for me locally on my MBP. I wonder what's different.

@jkwatson (Contributor) commented:

And now I've jinxed it and they're failing for me a ton as well. 🤕

@breedx-splk (Contributor) commented:

I've been looking into this and thought that I would add some color for future us.

To reproduce, I have been doing:

$ cd exporters/jaeger
$ ../../gradlew cleanTest test --tests JaegerIntegrationTest

If you run that a bunch, you might end up seeing an error like the one above. It's very inconsistent. To run in a loop until failure:

while true; do
    ../../gradlew cleanTest test --tests JaegerIntegrationTest
    if [ "$?" != "0" ]; then
        break
    fi
done

When there are failures, the container logs look like this:

{"level":"info","ts":1609786804.5741603,"caller":"app/server.go:123","msg":"Starting HTTP server","port":16686}
{"level":"info","ts":1609786804.5741765,"caller":"app/server.go:146","msg":"Starting CMUX server","port":16686}
{"level":"info","ts":1609786804.5741982,"caller":"app/server.go:136","msg":"Starting GRPC server","port":16686}
{"level":"warn","ts":1609786805.566884,"caller":"[email protected]/server.go:669","msg":"grpc: Server.Serve failed to create ServerTransport: connection error: desc = \"transport: http2Server.HandleStreams received bogus greeting from client: \\\"GET / HTTP/1.1\\\\r\\\\nUser-Age\\\"\"","system":"grpc","grpc_log":true}
{"level":"warn","ts":1609786806.5768697,"caller":"[email protected]/server.go:669","msg":"grpc: Server.Serve failed to create ServerTransport: connection error: desc = \"transport: http2Server.HandleStreams received bogus greeting from client: \\\"GET / HTTP/1.1\\\\r\\\\nUser-Age\\\"\"","system":"grpc","grpc_log":true}
{"level":"warn","ts":1609786807.586088,"caller":"[email protected]/server.go:669","msg":"grpc: Server.Serve failed to create ServerTransport: connection error: desc = \"transport: http2Server.HandleStreams received bogus greeting from client: \\\"GET / HTTP/1.1\\\\r\\\\nUser-Age\\\"\"","system":"grpc","grpc_log":true}
[potentially many more]

I don't know what's going on with that, but it almost looks like the client is sending a broken/truncated User-Agent?
There is something similar here: jaegertracing/jaeger-kubernetes#124, where it looks like the host was being mangled?

Anyway, just sharing in case this triggers ideas in others.

@anuraaga (Contributor, Author) commented Jan 5, 2021:

@breedx-splk Thanks a lot for the detailed investigation! I have a hunch the problem is sending an HTTP/1 health check to the gRPC port rather than the HTTP port. It's interesting that it sometimes works in a sporadic way, but let me try changing the port and see what happens.
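
(A sketch of one way that hunch could be addressed - waiting on the Jaeger query UI over plain HTTP instead of whatever port the default host-port check probes. The port numbers are the Jaeger defaults and the image tag is an assumption; this is not necessarily what #2426 ended up doing.)

import org.testcontainers.containers.GenericContainer;
import org.testcontainers.containers.wait.strategy.Wait;

class JaegerWaitStrategySketch {
    private static final int QUERY_HTTP_PORT = 16686;      // Jaeger query UI (HTTP)
    private static final int COLLECTOR_GRPC_PORT = 14250;  // gRPC collector

    // Wait for the HTTP query endpoint to respond, so the readiness probe speaks
    // HTTP/1 to an HTTP listener rather than to the gRPC one.
    static GenericContainer<?> jaeger() {
        return new GenericContainer<>("jaegertracing/all-in-one:1.21")
                .withExposedPorts(QUERY_HTTP_PORT, COLLECTOR_GRPC_PORT)
                .waitingFor(Wait.forHttp("/").forPort(QUERY_HTTP_PORT));
    }
}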
