Jaeger (and presumably similar) testcontainers test flaky #1871

Closed · anuraaga opened this issue Oct 23, 2020 · 10 comments · Fixed by #2426
Labels: Bug (Something isn't working), help wanted

Comments

@anuraaga (Contributor) commented Oct 23, 2020:

Lately I see the Jaeger tests fail with a timeout relatively frequently - I've even noticed this on my machine, not just on CI. Docker Hub has been introducing pull limits, from what I understand, and perhaps that's affecting us. We probably want to cache .docker or whatever to reduce Docker pulls because of the rate limiting, though that wouldn't solve the problem for a new contributor. Another option is to rehost on ghcr or bintray, but that doesn't seem great either.

anuraaga added the Bug label Oct 23, 2020
@pavolloffay (Member) commented:

Or migrate to quay.io, which should not have pull limits.
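
(For illustration only, not from the original thread: a minimal sketch of pointing testcontainers at a mirrored image while still treating it as compatible with the official one. The quay.io path and tag below are assumptions, and this is a sketch against the testcontainers Java API, not the actual test code.)

import org.testcontainers.containers.GenericContainer;
import org.testcontainers.utility.DockerImageName;

class JaegerMirrorSketch {
    // Pull a mirrored copy of the image, but declare it compatible with the
    // official name so testcontainers' compatibility checks still pass.
    static GenericContainer<?> jaegerFromMirror() {
        DockerImageName mirrored = DockerImageName
                .parse("quay.io/jaegertracing/all-in-one:1.21") // hypothetical mirror location/tag
                .asCompatibleSubstituteFor("jaegertracing/all-in-one");
        return new GenericContainer<>(mirrored);
    }
}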

@anuraaga (Contributor, Author) commented:

Yeah, there are a few options for a container registry - the problem (maybe not a big one) is that we'd effectively be rehosting the official image, and it'd be nice to avoid that if possible, I think.

@anuraaga (Contributor, Author) commented:

I hit this on the instrumentation repo locally, and it's an image we host on bintray, I believe. Maybe a bug in testcontainers? /cc @iNikem

io.opentelemetry.smoketest.SpringBootSmokeTest > spring boot smoke test on JDK 8 FAILED
    org.testcontainers.containers.ContainerLaunchException: Container startup failed
        at org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:331)
        at org.testcontainers.containers.GenericContainer.start(GenericContainer.java:312)
        at io.opentelemetry.smoketest.SmokeTest.startTarget(SmokeTest.groovy:95)
        at io.opentelemetry.smoketest.SpringBootSmokeTest.spring boot smoke test on JDK #jdk(SpringBootSmokeTest.groovy:26)

        Caused by:
        org.rnorth.ducttape.RetryCountExceededException: Retry limit hit with exception
            at org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:88)
            at org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:324)
            ... 3 more

            Caused by:
            org.testcontainers.containers.ContainerLaunchException: Could not create/start container
                at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:498)
                at org.testcontainers.containers.GenericContainer.lambda$doStart$0(GenericContainer.java:326)
                at org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:81)
                ... 4 more

                Caused by:
                org.testcontainers.containers.ContainerLaunchException: Timed out waiting for container port to open (localhost ports: [33437] should be listening)
                    at org.testcontainers.containers.wait.strategy.HostPortWaitStrategy.waitUntilReady(HostPortWaitStrategy.java:49)
                    at org.testcontainers.containers.wait.strategy.AbstractWaitStrategy.waitUntilReady(AbstractWaitStrategy.java:35)
                    at org.testcontainers.containers.GenericContainer.waitUntilContainerStarted(GenericContainer.java:893)
                    at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:441)
                    ... 6 more

@iNikem (Contributor) commented Oct 24, 2020:

Without further digging, it seems to me like the problem is with the container start, not the image pull.

@anuraaga (Contributor, Author) commented Oct 24, 2020:

Yeah, you're right - I realized it doesn't usually pull this image since it's already present locally (which was why my tests were reliably failing before 😅 - it's probably good to set the image pull policy to always).
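
(A minimal sketch of what "pull policy always" could look like with testcontainers; the image name and tag here are just the Jaeger defaults used for illustration, not copied from the actual test.)

import org.testcontainers.containers.GenericContainer;
import org.testcontainers.images.PullPolicy;

class AlwaysPullSketch {
    // Force testcontainers to re-pull the image on every run instead of reusing
    // a locally cached copy, so pull/rate-limit failures show up in local runs too.
    static GenericContainer<?> jaegerAlwaysPulled() {
        return new GenericContainer<>("jaegertracing/all-in-one:1.21")
                .withImagePullPolicy(PullPolicy.alwaysPull());
    }
}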

@anuraaga (Contributor, Author) commented:

Noticed that even for the ones where I raised the timeout from 1 to 2 minutes, it still fails pretty frequently, both on CI and on my MacBook. Wonder what's up.
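
(For context, raising the startup timeout in testcontainers looks roughly like this; a sketch with an assumed image tag, not the actual test code.)

import java.time.Duration;
import org.testcontainers.containers.GenericContainer;

class StartupTimeoutSketch {
    // Allow the container up to 2 minutes to start instead of the default 60 seconds.
    static GenericContainer<?> jaegerWithLongerTimeout() {
        return new GenericContainer<>("jaegertracing/all-in-one:1.21")
                .withStartupTimeout(Duration.ofMinutes(2));
    }
}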

@jkwatson (Contributor) commented:

Interesting. These never fail for me locally on my MBP. I wonder what's different.

@jkwatson (Contributor) commented:

And now I've jinxed it and they're failing for me a ton as well. 🤕

@breedx-splk (Contributor) commented:

I've been looking into this and thought that I would add some color for future us.

To reproduce, I have been doing:

$ cd exporters/jaeger
$ ../../gradlew cleanTest test --tests JaegerIntegrationTest

If you run that a bunch, you might end up seeing an error like the one above. It's very inconsistent. To run in a loop until failure:

while true; do
    ../../gradlew cleanTest test --tests JaegerIntegrationTest
    if [ "$?" != "0" ]; then
        break
    fi
done

When there are failures, the container logs look like this:

{"level":"info","ts":1609786804.5741603,"caller":"app/server.go:123","msg":"Starting HTTP server","port":16686}
{"level":"info","ts":1609786804.5741765,"caller":"app/server.go:146","msg":"Starting CMUX server","port":16686}
{"level":"info","ts":1609786804.5741982,"caller":"app/server.go:136","msg":"Starting GRPC server","port":16686}
{"level":"warn","ts":1609786805.566884,"caller":"[email protected]/server.go:669","msg":"grpc: Server.Serve failed to create ServerTransport: connection error: desc = \"transport: http2Server.HandleStreams received bogus greeting from client: \\\"GET / HTTP/1.1\\\\r\\\\nUser-Age\\\"\"","system":"grpc","grpc_log":true}
{"level":"warn","ts":1609786806.5768697,"caller":"[email protected]/server.go:669","msg":"grpc: Server.Serve failed to create ServerTransport: connection error: desc = \"transport: http2Server.HandleStreams received bogus greeting from client: \\\"GET / HTTP/1.1\\\\r\\\\nUser-Age\\\"\"","system":"grpc","grpc_log":true}
{"level":"warn","ts":1609786807.586088,"caller":"[email protected]/server.go:669","msg":"grpc: Server.Serve failed to create ServerTransport: connection error: desc = \"transport: http2Server.HandleStreams received bogus greeting from client: \\\"GET / HTTP/1.1\\\\r\\\\nUser-Age\\\"\"","system":"grpc","grpc_log":true}
[potentially many more]

I don't know what's going on with that, but it almost looks like the client is sending a broken/truncated User-Agent?
There is something similar here: jaegertracing/jaeger-kubernetes#124, where it looks like the host was being mangled?

Anyway, just sharing in case this triggers ideas in others.

@anuraaga (Contributor, Author) commented Jan 5, 2021:

@breedx-splk Thanks a lot for the detailed investigation! I have a hunch the problem is sending an HTTP/1 health check to the gRPC port rather than the HTTP port. It's interesting that it sometimes works in a sporadic way, but let me try changing the port and see what happens.
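
(A sketch of one way that hunch could be addressed - waiting on the Jaeger query UI over plain HTTP instead of whatever port the default host-port check probes. The port numbers are the Jaeger defaults and the image tag is an assumption; this is not necessarily what #2426 ended up doing.)

import org.testcontainers.containers.GenericContainer;
import org.testcontainers.containers.wait.strategy.Wait;

class JaegerWaitStrategySketch {
    private static final int QUERY_HTTP_PORT = 16686;      // Jaeger query UI (HTTP)
    private static final int COLLECTOR_GRPC_PORT = 14250;  // gRPC collector

    // Wait for the HTTP query endpoint to respond, so the readiness probe speaks
    // HTTP/1 to an HTTP listener rather than to the gRPC one.
    static GenericContainer<?> jaeger() {
        return new GenericContainer<>("jaegertracing/all-in-one:1.21")
                .withExposedPorts(QUERY_HTTP_PORT, COLLECTOR_GRPC_PORT)
                .waitingFor(Wait.forHttp("/").forPort(QUERY_HTTP_PORT));
    }
}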
