Building a base Almalinux 9 image for WMCore services; plus specific build for MSUnmerged #1452

amaltaro · 2024-03-05T22:15:21Z

[PLEASE DO NOT MERGE]

This PR provides 2 dockerfiles:

for a base Almalinux 9 image (uploaded to harbor with cmsweb/pypi/alma-base:alma9-20240305)
for MSUnmerged based on the fresh Almalinux 9 image (uploaded to harbor with cmsweb/pypi/reqmgr2ms-unmerged:2.3.1-20240305)

This is still not finished, and these images rely on the latest OS version (which likely explains the lack of security vulnerabilities), but current stats for these images are:

alma-base has no vulnerabilities and compressed size (in harbor) of 109MiB (while dmwm-base image has 384MiB, with no security scan available)
reqmgr2ms-unmerged has a total of 4 vulnerabilities, with an image size of 142MiB (while the debian-based image has 2468 vulnerabilities and size of 880MiB)

For the moment, I copied all the manage/run/monitor scripts from the pypi/dmwm-base folder to the pypi/alma-base one. The only change to those scripts is that, for now, they do not use rotatelogs to start the service up. This needs further discussion.

NOTE though that rotatelogs is not available in Almalinux (it is provided by the apache2-utils package). So this is something that we must change if we adopt Almalinux; otherwise we need to find an alternative way to deploy that package.

========== Update as of Apr/18 ===========
The reqmgr2ms-unmerged Dockerfile has been updated with the gfal2-plugins, and a new image has been created with tag 2.3.2rc6-20240419. These are the plugins now available to GFAL2:

>>> ctx.get_plugin_names()
['dcap-2.22.2', 'file-2.22.2', 'gridftp-2.22.2', 'http-2.22.2', 'sftp-2.22.2', 'srm-2.22.2', 'xrootd-2.22.2']
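As an illustration only, a small sketch of how such a plugin list could be checked against the protocols a service needs. The helper name `missing_plugins` and the required set are assumptions, not part of this PR; the input is simply the list returned by `ctx.get_plugin_names()`.

```python
def missing_plugins(plugin_names, required):
    """Return required plugin base names absent from versioned names like 'http-2.22.2'.

    Hypothetical helper: strips the trailing '-<version>' from each reported name.
    """
    available = {name.rsplit("-", 1)[0] for name in plugin_names}
    return set(required) - available


# Plugin list as reported above
names = ['dcap-2.22.2', 'file-2.22.2', 'gridftp-2.22.2', 'http-2.22.2',
         'sftp-2.22.2', 'srm-2.22.2', 'xrootd-2.22.2']
print(missing_plugins(names, ["http", "xrootd", "srm"]))
```

An empty set means every required plugin is present in the image.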

Important references:

GFAL2 documentation
EPEL8 repository

vkuznet · 2024-03-06T13:09:39Z

Alan, thanks for putting this together. May I suggest a few things:

please move the alma9 and debian areas you created from the docker directory into the pypi area, since those are not base images per se but rather base images required for the WM pypi distribution. I would like to avoid misleading directories.
please put in the description a list of the bare minimum packages we need for the base image
please put in the description stats about image sizes and the number of vulnerabilities
please specify the python version that almalinux:latest brings; it is python 3.9.18, which is different from the current WM requirement (3.8)
consider using a specific tag for almalinux instead of latest, which will allow reproducibility of the software stack.

Regarding apache rotatelogs: as you are well aware, it is our legacy approach from the VM-based deployment. In k8s the logs can easily be handled by kubernetes itself if we send them to stdout. We'll need to decide whether this is still a mandatory requirement. If it is, I suggest creating another base image only for this package, building it there from source and installing it into a custom area, and then using the COPY approach to copy tools/areas from one image to another. In this case, we'll build it from source (which requires installing gcc, make, autoconf, etc.), install it into a local area, then copy the local area to the final image and properly set up LD_LIBRARY_PATH and PATH to locate the executable and its libraries.

amaltaro · 2024-03-06T13:51:39Z

We still have to investigate whether rotatelogs is really needed for the services. Writing logs to stdout will be a problem for multi-threaded services, where multiple logs are supposed to be created for each thread.

I refactored the PR description and also cleaned up this development branch. @vkuznet I didn't address all your points yet, but you might want to revisit the description. Thanks

vkuznet · 2024-03-06T14:08:16Z

Alan, thanks for the description update, it looks good and properly states the issue.

d-ylee · 2024-04-01T21:38:43Z

@amaltaro I made use of WMCore.WMLogging.getTimeRotatingLogger. I passed None as the name of the logger, assuming we are using the root logger. This seems to be the case in getMSLogger:

https://github.com/dmwm/WMCore/blob/4bc4a58a6c86a2206131ad23da66a58dc889539c/src/python/WMCore/MicroService/Tools/Common.py#L63

I haven't tested this yet, since I'm still figuring out how to use the test8 cluster.
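For readers unfamiliar with the approach, here is a minimal standard-library sketch of a time-rotating logger, which is the kind of in-process rotation that could replace rotatelogs. This is not the WMCore.WMLogging implementation; the function name, log format and rotation settings below are assumptions chosen for illustration.

```python
import logging
from logging.handlers import TimedRotatingFileHandler


def make_time_rotating_logger(name, log_file, when="midnight", backup_count=7):
    """Hypothetical helper: attach a time-based rotating file handler to a logger.

    Passing name=None configures the root logger.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = TimedRotatingFileHandler(log_file, when=when, backupCount=backup_count)
    # Including the thread name keeps records from different threads distinguishable
    handler.setFormatter(
        logging.Formatter("%(asctime)s:%(levelname)s:%(threadName)s:%(message)s"))
    logger.addHandler(handler)
    return logger
```

With `name=None` the handler is attached to the root logger, so any module-level logger that propagates to the root writes to the same rotating file.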

amaltaro · 2024-04-17T14:54:32Z

@arooshap Aroosha, can you please review a new yaml file that I pushed in (named reqmgr2ms-unmerged-cern.yaml) and confirm if that is all that I need in order to install another service flavor under the reqmgr2ms-unmerged umbrella?

I want to test a new docker image (based on Alma9), so I wanted to have a specific configuration for it for the moment.

amaltaro · 2024-04-19T03:47:34Z

I updated the Dockerfile and built/uploaded a new image for MSUnmerged, with tag 2.3.2rc6-20240419. Initial description has been updated with this.

The service crashes (and is automatically restarted by CherryPy) whenever it tries to remove a non-empty directory. This happened for all 4 RSEs that I have enabled for the service so far, and the log signature looks like:

2024-04-19 03:27:31,429:INFO:MSUnmerged: Trying to remove nonempty directory: /store/unmerged/Run3Summer22EENanoAODv12/GluGlutoBulkGravitontoHHto2G2Vto2G4Q_M-700_narrow_TuneCP5_13p6TeV_madgraph-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_postEE_v6-v2
[19/Apr/2024:03:27:32]  WATCHDOG: server exited with exit code signal 11 (core dumped)... restarting

A quick search suggests this is due to a SEGV (signal 11 is SIGSEGV, a segmentation fault).

In addition, I also got the container killed with:

>>> command terminated with exit code 137

which seems to be caused by excessive memory usage. The current memory limit is set to 2GB, so I really doubt this is the actual problem...
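As a side note on reading these codes: by shell convention, an exit code above 128 means the process was terminated by signal (code - 128), so 137 corresponds to signal 9 and the watchdog message above reports signal 11 directly. A small sketch (the helper name `describe_exit_code` is an assumption):

```python
import signal


def describe_exit_code(code):
    """Return the signal name for a shell-style exit code above 128, else None."""
    if code <= 128:
        return None
    try:
        return signal.Signals(code - 128).name
    except ValueError:
        # No signal with that number on this platform
        return None


print(describe_exit_code(137))  # 128 + 9
print(describe_exit_code(128 + 11))
```

Exit code 137 maps to SIGKILL, which is what the kernel OOM killer sends, but it is also what any other forced kill of the container would produce, so the code alone does not prove a memory problem.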

For the record, I did not manage to reproduce this interactively inside the POD with the following script:

import gfal2

def createGfal2Context(logLevel="trace"):
    # creat_context() is the actual function name in the gfal2 Python bindings
    ctx = gfal2.creat_context()
    gfal2.set_verbose(gfal2.verbose_level.names[logLevel])
    return ctx

ctx = createGfal2Context()

# PFN of the directory to be removed (hostname redacted)
dirPfn = "davs://***/store/unmerged/Run3Summer22EEMiniAODv3/ZZto2L2Nu_TuneCP5_13p6TeV_powheg-pythia8/MINIAODSIM/124X_mcRun3_2022_realistic_postEE_v1-v2"
ctx.rmdir(dirPfn)

vkuznet · 2024-04-30T16:30:16Z

I want to provide additional insight into the pod failure on the k8s cluster:

With the Prometheus process_exporter I observed that the pod is interrupted after some time under load: the CPU load on the pod increases and then the pod is killed.
In the k8s events for that specific pod we see liveness probe failures, which are what trigger the kill:

k get event --namespace dmwm --field-selector involvedObject.name=ms-unmer-cern-75bbffc947-w85gh
LAST SEEN   TYPE      REASON      OBJECT                               MESSAGE
3m8s        Warning   Unhealthy   pod/ms-unmer-cern-75bbffc947-w85gh   Liveness probe failed: command "cmsweb-ping --url=http://localhost:8242/ms-unmerged/data/status --authz=/etc/hmac/hmac -verbose 0" timed out
13m         Normal    Pulled      pod/ms-unmer-cern-75bbffc947-w85gh   Container image "registry.cern.ch/cmsweb/reqmgr2ms-unmerged:2.3.2rc6-20240419" already present on machine
33m         Warning   Unhealthy   pod/ms-unmer-cern-75bbffc947-w85gh   Liveness probe failed: Unable to get response from http://localhost:8242/ms-unmerged/data/status, error: Get "http://localhost:8242/ms-unmerged/data/status": read tcp 127.0.0.1:39376->127.0.0.1:8242: read: connection reset by peer

Finally, I tried to measure the response time of the service's /status end-point, and it varies, e.g.

[_reqmgr2ms@ms-unmer-cern-75bbffc947-w85gh data]$ time curl -v http://localhost:8242/ms-unmerged/data/status
*   Trying ::1:8242...
* connect to ::1 port 8242 failed: Connection refused
*   Trying 127.0.0.1:8242...
* Connected to localhost (127.0.0.1) port 8242 (#0)
> GET /ms-unmerged/data/status HTTP/1.1
> Host: localhost:8242
> User-Agent: curl/7.76.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Type: text/plain;charset=utf-8
< Server: CherryPy/18.8.0
< Date: Tue, 30 Apr 2024 15:14:56 GMT
< Vary: Accept
< Pragma: no-cache
< Expires: Sun, 19 Nov 1978 05:00:00 GMT
< Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
< X-Rest-Status: 100
< Etag: "6ff77790d9559c19ec59cc65b3f5fab4b9e9d0e1"
< Content-Length: 138
< X-Rest-Time: 2100.468 us
<
{"result": [
 {
  "wmcore_version": "2.3.2rc6",
  "microservice_version": "2.3.2rc6",
  "microservice": "MSManager",
  "status": "OK"
}]}
* Connection #0 to host localhost left intact

real    0m8.620s
user    0m0.001s
sys     0m0.004s

In the output above, it took about 8.6 seconds to respond to the /status HTTP GET request.

According to the deployed k8s manifest file, we have

        livenessProbe:
          exec:
            command:
            - cmsweb-ping
            - --url=http://localhost:8242/ms-unmerged/data/status
            - --authz=/etc/hmac/hmac
            - -verbose
            - "0"
          failureThreshold: 3
          initialDelaySeconds: 120
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 5
        name: ms-unmer-cern

which means that the liveness probe will time out after 5 seconds. Therefore, I conclude that the pod restarts are due to the poor performance of the /status end-point. We have two options to fix the situation:

increase the liveness probe timeout in the k8s manifest file to a larger value (to be determined)
improve the /status end-point to reduce its latency.
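To make the reasoning concrete, here is a simplified model of the probe settings quoted above: an attempt fails when it takes longer than timeoutSeconds, and the container is restarted after failureThreshold consecutive failures. This is only a sketch of that rule, not the kubelet's implementation; the function name and the sample latencies are assumptions.

```python
def restart_index(latencies_s, timeout_s=5.0, failure_threshold=3):
    """Return the index of the probe attempt that would trigger a restart, or None.

    latencies_s: observed response times (seconds) of consecutive probe attempts.
    """
    consecutive_failures = 0
    for i, latency in enumerate(latencies_s):
        if latency > timeout_s:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                return i
        else:
            # successThreshold is 1, so a single success resets the count
            consecutive_failures = 0
    return None


# Three slow responses in a row (like the 8.62 s measured above) exceed the 5 s timeout
print(restart_index([8.62, 8.62, 8.62]))
# The same latencies with a much larger timeout never trigger a restart
print(restart_index([8.62, 8.62, 8.62], timeout_s=600))
```

With periodSeconds set to 30, three consecutive slow responses are enough for a restart roughly a minute after the first failed attempt.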

vkuznet · 2024-04-30T17:14:37Z

To test the effect of the liveness probe, I adjusted the ms-unmer-cern deployment and scaled it back to 1 replica. The new liveness probe is set to 600 sec. I'll report whether it affects pod stability.

vkuznet · 2024-04-30T17:55:23Z

I checked the pod and found no restarts after 40 min:

ms-unmer-cern-5b9bcd4f44-4kztq          2/2     Running     0          40m

So, it seems we correctly identified the cause of the pod crashes.

amaltaro changed the title ~~Investigating dockerfiles for Debian vs Alma9; plus MSUnmerged based on Alma9~~ Building a base Almalinux 9 image for WMCore services; plus specific build for MSUnmerged Mar 6, 2024

amaltaro mentioned this pull request Mar 6, 2024

Build a MSUnmerged docker image based on Almalinux dmwm/WMCore#11922

Open

d-ylee mentioned this pull request Apr 1, 2024

Use getTimeRotatingLogger for WMCore REST dmwm/WMCore#11955

Open

amaltaro force-pushed the alma9-images branch from f054735 to 3e32e13 Compare May 31, 2024 03:42

amaltaro added 9 commits June 1, 2024 08:07

be15daa Dockerfiles for Alma9 and Debian OS
08c377a Fix dockerfiles and create brand new MSUnmerged based on Alma9
4981ff0 Further fixes to the MSUnmerged alma9 Dockerfile
8124d3f Separate Almalinux into base and specific service dockerfiles
82af813 Remove test dockerfiles and streamline alma-base and MSUnmerged
13cf628 Create another service for MSUnmerged
72a5240 Install GFAL2 plugins
99df92e Add gridftp and sftp plugins as well
d957873 CMSWEB base image for Alma9 (Install EPEL repository and a few CA-related packages. Use latest image)

amaltaro force-pushed the alma9-images branch from 3e32e13 to d957873 Compare June 1, 2024 12:07
