Can't run megalinter container image on OpenShift since v7.4.0 #3176

Closed
very-doge-wow opened this issue Nov 30, 2023 · 14 comments
Labels
bug Something isn't working

Comments

@very-doge-wow

We're using GitLab and GitLab CI pipelines in our organisation. We have defined a linter job which uses the MegaLinter image provided by this repository. Jobs are executed by GitLab runners of type kubernetes executor, which spawn a pod that then runs the aforementioned job.

Since v7.4.0, however, we can't run those jobs anymore, because the pod which executes the job no longer starts successfully. It exits with the following error message:

Log Output
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x3c0d0 pc=0x3c0d0]
runtime stack:
runtime.throw({0x5586b5e7d789?, 0x7fffd3bf2f80?})
	/usr/lib/golang/src/runtime/panic.go:992 +0x71 fp=0x7fffd3bf27c8 sp=0x7fffd3bf2798 pc=0x5586b5a9c291
runtime.sigpanic()
	/usr/lib/golang/src/runtime/signal_unix.go:802 +0x389 fp=0x7fffd3bf2818 sp=0x7fffd3bf27c8 pc=0x5586b5ab1be9
goroutine 1 [syscall, locked to thread]:
runtime.cgocall(0x5586b5e5ee30, 0xc0000dd5c8)
	/usr/lib/golang/src/runtime/cgocall.go:157 +0x5c fp=0xc0000dd5a0 sp=0xc0000dd568 pc=0x5586b5a6bc5c
crypto/internal/boring._Cfunc__goboringcrypto_DLOPEN_OPENSSL()
	_cgo_gotypes.go:304 +0x4d fp=0xc0000dd5c8 sp=0xc0000dd5a0 pc=0x5586b5ca270d
crypto/internal/boring.init.0()
	/usr/lib/golang/src/crypto/internal/boring/boring.go:52 +0x45 fp=0xc0000dd600 sp=0xc0000dd5c8 pc=0x5586b5ca3825
runtime.doInit(0x5586b64c3c00)
	/usr/lib/golang/src/runtime/proc.go:6230 +0x128 fp=0xc0000dd730 sp=0xc0000dd600 pc=0x5586b5aab8e8
runtime.doInit(0x5586b64c0e60)
	/usr/lib/golang/src/runtime/proc.go:6207 +0x71 fp=0xc0000dd860 sp=0xc0000dd730 pc=0x5586b5aab831
runtime.doInit(0x5586b64c4d40)
	/usr/lib/golang/src/runtime/proc.go:6207 +0x71 fp=0xc0000dd990 sp=0xc0000dd860 pc=0x5586b5aab831
runtime.doInit(0x5586b64c5bc0)
	/usr/lib/golang/src/runtime/proc.go:6207 +0x71 fp=0xc0000ddac0 sp=0xc0000dd990 pc=0x5586b5aab831
runtime.doInit(0x5586b64c36c0)
	/usr/lib/golang/src/runtime/proc.go:6207 +0x71 fp=0xc0000ddbf0 sp=0xc0000ddac0 pc=0x5586b5aab831
runtime.doInit(0x5586b64c3ca0)
	/usr/lib/golang/src/runtime/proc.go:6207 +0x71 fp=0xc0000ddd20 sp=0xc0000ddbf0 pc=0x5586b5aab831
runtime.doInit(0x5586b64c7aa0)
	/usr/lib/golang/src/runtime/proc.go:6207 +0x71 fp=0xc0000dde50 sp=0xc0000ddd20 pc=0x5586b5aab831
runtime.doInit(0x5586b64c72a0)
	/usr/lib/golang/src/runtime/proc.go:6207 +0x71 fp=0xc0000ddf80 sp=0xc0000dde50 pc=0x5586b5aab831
runtime.main()
	/usr/lib/golang/src/runtime/proc.go:233 +0x1d4 fp=0xc0000ddfe0 sp=0xc0000ddf80 pc=0x5586b5a9e974
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc0000ddfe8 sp=0xc0000ddfe0 pc=0x5586b5acc581
goroutine 2 [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	/usr/lib/golang/src/runtime/proc.go:361 +0xd6 fp=0xc00005cfb0 sp=0xc00005cf90 pc=0x5586b5a9ed76
runtime.goparkunlock(...)
	/usr/lib/golang/src/runtime/proc.go:367
runtime.forcegchelper()
	/usr/lib/golang/src/runtime/proc.go:301 +0xad fp=0xc00005cfe0 sp=0xc00005cfb0 pc=0x5586b5a9ec0d
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc00005cfe8 sp=0xc00005cfe0 pc=0x5586b5acc581
created by runtime.init.7
	/usr/lib/golang/src/runtime/proc.go:289 +0x25
goroutine 3 [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	/usr/lib/golang/src/runtime/proc.go:361 +0xd6 fp=0xc00005d790 sp=0xc00005d770 pc=0x5586b5a9ed76
runtime.goparkunlock(...)
	/usr/lib/golang/src/runtime/proc.go:367
runtime.bgsweep(0x0?)
	/usr/lib/golang/src/runtime/mgcsweep.go:278 +0x8e fp=0xc00005d7c8 sp=0xc00005d790 pc=0x5586b5a8bbae
runtime.gcenable.func1()
	/usr/lib/golang/src/runtime/mgc.go:177 +0x26 fp=0xc00005d7e0 sp=0xc00005d7c8 pc=0x5586b5a81766
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc00005d7e8 sp=0xc00005d7e0 pc=0x5586b5acc581
created by runtime.gcenable
	/usr/lib/golang/src/runtime/mgc.go:177 +0x6b
goroutine 4 [GC scavenge wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	/usr/lib/golang/src/runtime/proc.go:361 +0xd6 fp=0xc00005df20 sp=0xc00005df00 pc=0x5586b5a9ed76
runtime.goparkunlock(...)
	/usr/lib/golang/src/runtime/proc.go:367
runtime.bgscavenge(0x0?)
	/usr/lib/golang/src/runtime/mgcscavenge.go:272 +0xec fp=0xc00005dfc8 sp=0xc00005df20 pc=0x5586b5a8984c
runtime.gcenable.func2()
	/usr/lib/golang/src/runtime/mgc.go:178 +0x26 fp=0xc00005dfe0 sp=0xc00005dfc8 pc=0x5586b5a81706
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc00005dfe8 sp=0xc00005dfe0 pc=0x5586b5acc581
created by runtime.gcenable
	/usr/lib/golang/src/runtime/mgc.go:178 +0xaa
goroutine 18 [finalizer wait]:
runtime.gopark(0xc0000924e0?, 0x0?, 0x70?, 0xc7?, 0x5586b5aab831?)
	/usr/lib/golang/src/runtime/proc.go:361 +0xd6 fp=0xc00005c630 sp=0xc00005c610 pc=0x5586b5a9ed76
runtime.goparkunlock(...)
	/usr/lib/golang/src/runtime/proc.go:367
runtime.runfinq()
	/usr/lib/golang/src/runtime/mfinal.go:177 +0xb3 fp=0xc00005c7e0 sp=0xc00005c630 pc=0x5586b5a80813
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc00005c7e8 sp=0xc00005c7e0 pc=0x5586b5acc581
created by runtime.createfing
	/usr/lib/golang/src/runtime/mfinal.go:157 +0x45
time="[202](XXX#L202)3-11-30T14:54:52Z" level=error msg="exec failed: unable to start container process: read init-p: connection reset by peer"

On AWS, with GitLab runners of type docker executor running on EC2 instances, however, it still works as expected.
On AWS, in an EKS cluster with runners of type kubernetes executor, it works as well.

I therefore suspect a problem with the underlying container runtime, which is Docker for the docker executors, containerd for EKS, and cri-o for OpenShift. It could also be permission-related, as a lot of capabilities are prohibited in our OpenShift clusters. However, we haven't changed those restrictions, and past versions of the image still work.
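For reference, one way to inspect what OpenShift actually applies to the job pod (a minimal sketch; `<job-pod>` and `<namespace>` are placeholders for the runner-spawned pod):

```bash
# Minimal sketch: inspect the SCC and security context OpenShift applied
# to the failing job pod. <job-pod> and <namespace> are placeholders.
oc get pod <job-pod> -n <namespace> -o yaml \
  | grep -E -A 2 'openshift.io/scc|runAsUser|runAsGroup|capabilities'
```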

This happens only from v7.4.0 onwards; all versions below that work just fine.

To Reproduce
Steps to reproduce the behavior:

  1. Set up a GitLab project and create a job running the current MegaLinter image
  2. Use a GitLab k8s executor runner in OpenShift to run it (or see the sketch below for exercising the container start without a full pipeline)
  3. Watch the log output
  4. See the error
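A minimal sketch of exercising the same container start without a full pipeline (image tag, pod name and the high UID are just examples; the UID mimics OpenShift's arbitrary-UID assignment):

```bash
# Sketch 1: start the image directly on the OpenShift cluster, using the
# namespace's restricted defaults (same code path as the runner-spawned job pod).
oc run megalinter-repro --image=docker.io/oxsecurity/megalinter:v7.4.0 --restart=Never
oc logs -f pod/megalinter-repro

# Sketch 2: locally, mimic OpenShift's arbitrary high UID and dropped capabilities.
podman run --rm --user 1000680000:0 --cap-drop=ALL docker.io/oxsecurity/megalinter:v7.4.0
```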

Expected behavior
Image works as it did before in all runtimes.

@very-doge-wow very-doge-wow added the bug Something isn't working label Nov 30, 2023
@echoix
Collaborator

echoix commented Nov 30, 2023

In all cases, that is a real bug for the runner/platform you are using, since it throws a segmentation violation where other runners don't.

The only change I can suspect is that the base image changed from an alpine3.17 python image to an alpine3.18 python image. v7.3.0...v7.4.0#diff-dd2c0eb6ea5cfc6c4bd4eac30934e2d5746747af48fef6da689e85b752f39557R52

It has a newer kernel version, 6.1. Is there anything here that raises flags? https://wiki.alpinelinux.org/wiki/Release_Notes_for_Alpine_3.18.0

If you suspect permission errors, we had an issue once, but it was before these releases. Some node packages have files with user/group ids that are too big for some environments where they are more limited. The "fix" was to map the uid/gid when building the image. When I encountered it, I didn't have the problem when building the image inside the problematic environment, only when pulling the image. You may want to try building the image inside that environment to make sure.

But maybe before that, just to make sure: does 7.3.0 still work as of today in that environment, while 7.4.0/7.5.0/7.6.0 all don't? And what about the beta images (they are built from the main branch)? Is there a specific flavor that works? Or even a very simple single-linter image, something like https://hub.docker.com/r/oxsecurity/megalinter-only-go_revive, which has nothing complicated inside.
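As a rough sketch (using docker here; substitute podman or whatever client the problematic environment provides, and /tmp/lint is MegaLinter's default workspace mount), something like this could bisect the tags:

```bash
# Rough bisection sketch: run the same repository against several tags.
# /tmp/lint is MegaLinter's default workspace mount; adjust if your setup differs.
for tag in v7.3.0 v7.4.0 v7.5.0 v7.6.0 beta; do
  echo "=== oxsecurity/megalinter:${tag} ==="
  docker run --rm -v "$(pwd)":/tmp/lint "oxsecurity/megalinter:${tag}" || echo "FAILED: ${tag}"
done

# And a very simple single-linter image with nothing complicated inside:
docker run --rm -v "$(pwd)":/tmp/lint oxsecurity/megalinter-only-go_revive:latest
```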

@very-doge-wow
Author

I don't think it's caused by the runtime per se, since all versions up to v7.3.0 are working without any problems. I have also taken a look at the changes from v7.3.0 to v7.4.0 (especially the Dockerfile, of course) but couldn't identify anything in particular that might cause a segmentation fault during startup. The beta image doesn't work either.

You might actually be on to something with the uid/gid stuff, since OpenShift assigns rather large ids in that respect. However, this is done for all images at deploy time, so I don't think setting any uids during the image build would solve the issue in this particular case. Can you provide any sort of issue/PR for that past problem so I can dig a little deeper?

I will test the single linter images next week and report back here.

@echoix
Collaborator

echoix commented Nov 30, 2023

#2348

(End of: #2318)

#2434

#2435

But I'm not so sure it would be that: if your environment is able to pull (and extract) the image, then it's probably not it. One test for this is to have a container that you can go into, with docker available inside, in order to pull.

There was also this, which was useful for understanding it at the time:
https://circleci.com/docs/high-uid-error/
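A hedged sketch of checking whether an image actually ships files owned by very large uids/gids (the container name is arbitrary, and it assumes docker and GNU tar are available):

```bash
# Sketch: export the image filesystem and list the unique owner uid/gid pairs,
# to see whether any files are owned by very large ids.
# "ml-uid-check" is just an arbitrary container name.
docker create --name ml-uid-check oxsecurity/megalinter:v7.4.0
docker export ml-uid-check | tar -tvf - | awk '{print $2}' | sort -u
docker rm ml-uid-check
```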

@echoix
Collaborator

echoix commented Dec 1, 2023

Other than that, I would have difficulty helping you out, as you didn't mention any version numbers of the failing software, and all the forums/docs for OpenShift are paywalled, so it's quite difficult to figure out what it is. None of your three environments is something I have access to or experience with :S. Maybe someone else can help.

@very-doge-wow
Author

very-doge-wow commented Dec 4, 2023

Okay, so I have tested with all 50 megalinter-only-XXX images that I found on Docker Hub. All of those ran successfully except for one; that one, however, exited with a different error than the original error message I posted above. The failing one is megalinter-only-repository_devskim. Here's the error message:

[Sarif] ERROR: there is no SARIF output file found, and stdout doesn't contain SARIF
[Sarif] stdout: Fatal error while calling devskim: [Errno 13] Permission denied: 'devskim'
Error while getting total errors from SARIF output.
Error:while parsing a block mapping
  in "<unicode string>", line 1, column 1:
    Fatal error while calling devski ... 
    ^
expected <block end>, but found '<scalar>'
  in "<unicode string>", line 1, column 47:
     ... ile calling devskim: [Errno 13] Permission denied: 'devskim'
                                         ^
stdout: Fatal error while calling devskim: [Errno 13] Permission denied: 'devskim'
Unable to process reporter CONSOLE[Errno 13] Permission denied: 'devskim'

This should be unrelated to the original issue, because in this job the script execution actually started, whereas with the full image we can't even enter the script section.
Since all of the others worked, I don't think the high uid/gid idea is of interest anymore...

The original problem only occurs when using the normal, i.e. the full image. I am at a loss

@echoix
Collaborator

echoix commented Dec 4, 2023

> Okay, so I have tested with all 50 megalinter-only-XXX images that I found on Docker Hub. All of those ran successfully except for one [...] The failing one is megalinter-only-repository_devskim. [...]
>
> Since all of the others worked, I don't think the high uid/gid idea is of interest anymore...

Well, the "_only" images are the linters that are configured to output sarif, so there are some more linters possible that could fail. So at least it shows us that the base image isn't the problem. If we try normal flavors, that include a little more: cupcake (tradeoff of most common/useful linters and size), ci_light, do some of them work?

> The original problem only occurs when using the normal, i.e. the full image. I am at a loss

We'll get there by elimination I think :)

@very-doge-wow
Author

True, I didn't see those images because I was searching for megalinter-only-.
So I have now tested with these images additionally:

  • oxsecurity/megalinter-javascript
  • oxsecurity/megalinter-terraform
  • oxsecurity/megalinter-python
  • oxsecurity/megalinter-dotnet
  • oxsecurity/megalinter-ci_light
  • oxsecurity/megalinter-security
  • oxsecurity/megalinter-java
  • oxsecurity/megalinter-cupcake
  • oxsecurity/megalinter-php
  • oxsecurity/megalinter-go
  • oxsecurity/megalinter-salesforce
  • oxsecurity/megalinter-rust
  • oxsecurity/megalinter-dotnetweb
  • oxsecurity/megalinter-ruby
  • oxsecurity/megalinter-swift

These all ran successfully as well 🤔

@echoix
Collaborator

echoix commented Dec 4, 2023

Ok, that's a good thing. Since cupcake works and full doesn't, it narrows the scope down even more ;).

Just to check before going too far: does the full image work now, in these conditions?
And the fact that swift works is another odd one, so it's a good sign.

@very-doge-wow
Author

The full image still doesn't work, same error as before.

@echoix
Collaborator

echoix commented Dec 4, 2023

And do any of the beta tags, created like 2 hours ago, work?

Depending on your workloads, does the cupcake flavor cover enough for you to use it temporarily?

@very-doge-wow
Author

Tested again using the beta tag, and now it actually works. Do you know what the exact changes were?
How long does it usually take until the beta codebase is released?

@echoix
Collaborator

echoix commented Dec 4, 2023

> Tested again using the beta tag, and now it actually works. Do you know what the exact changes were?
>
> How long does it usually take until the beta codebase is released?

The changes are all the commits in main since the last release.
For now, it's 32 commits.
v7.6.0...main

If you want to pinpoint the commit, to know whether it was the last one (this morning) that largely changed the way dotnet is installed, or any other, you can take the sha256 from the action logs here https://github.com/oxsecurity/megalinter/actions/workflows/deploy-BETA.yml (more specifically from this output for each run: https://github.com/oxsecurity/megalinter/actions/runs/7085248249/job/19281122693#step:15:7209) and use something like this to get the older betas:

oxsecurity/megalinter:beta@sha256:69eae5a79d450b18180bc5b84234306ced91288939f55dbfc525b356b508ee88

The one you tried, and which worked, was:

oxsecurity/megalinter:beta@sha256:9d0a148de04f0f889d40003382958b888d7a4d5d1cea7d9cdb25bc0fe1782714

Commits of interest:

ce82f95

75df660

1c3420e

The others are really just continuous updates of the tool versions, not changes to the structure or installed packages. That means that if the breaking point falls between the commits of interest, a linter update should be what caused your bug.

Did your environment really pull the latest images when you tried the latest, i.e. really pulling 7.6.0 instead of an old cached copy? (Like when you run docker locally, you have to run docker pull ... to have an already pulled image get updated locally.)
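For example, something like this (the digest is a placeholder to take from the deploy-BETA action logs above):

```bash
# Sketch: pin an older beta build by digest to bisect which commit broke things.
# <digest> is a placeholder taken from the deploy-BETA workflow logs.
docker pull oxsecurity/megalinter:beta@sha256:<digest>

# And to verify which build a tag actually resolves to locally:
docker pull oxsecurity/megalinter:v7.6.0
docker image inspect --format '{{.RepoDigests}}' oxsecurity/megalinter:v7.6.0
```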

@nvuillam
Member

nvuillam commented Dec 4, 2023

Thanks for reporting the issue, and thanks @echoix for this great analysis and support as usual :)

If the issue is solved in beta, I suggest we release a new minor version next weekend :)

@very-doge-wow
Author

Yes, I pinned the image using both tag and digest, so it must be the correct one.
Thanks for your support; we can close the issue now and wait for the next release.
