Building a base Almalinux 9 image for WMCore services; plus specific build for MSUnmerged #1452

Open: wants to merge 9 commits into base: master

amaltaro
Contributor

@amaltaro amaltaro commented Mar 5, 2024

[PLEASE DO NOT MERGE]

This PR provides 2 dockerfiles:

  1. for a base Almalinux 9 image (uploaded to harbor with cmsweb/pypi/alma-base:alma9-20240305)
  2. for MSUnmerged based on the fresh Almalinux 9 image (uploaded to harbor with cmsweb/pypi/reqmgr2ms-unmerged:2.3.1-20240305)

This is still not finished, and these images rely on the latest OS version (which likely explains the lack of security vulnerabilities), but current stats for these images are:

  • alma-base has no vulnerabilities and compressed size (in harbor) of 109MiB (while dmwm-base image has 384MiB, with no security scan available)
  • reqmgr2ms-unmerged has a total of 4 vulnerabilities, with an image size of 142MiB (while the debian-based image has 2468 vulnerabilities and size of 880MiB)

For the moment, I copied all the manage/run/monitor scripts from the pypi/dmwm-base folder to the pypi/alma-base one. The only change to those scripts is that, for now, they do not use rotatelogs to start the service up. This needs further discussion.

NOTE though that rotatelogs is not available in Almalinux (on Debian it is provided by the apache2-utils package). So this is something that we must change if we adopt Almalinux; otherwise we need to find an alternative way to deploy that package.

========== Update as of Apr/18 ===========
The reqmgr2ms-unmerged Dockerfile has been updated with the gfal2-plugins and a new image created with 2.3.2rc6-20240419. These are the plugins available to GFAL2 now:

>>> ctx.get_plugin_names()
['dcap-2.22.2', 'file-2.22.2', 'gridftp-2.22.2', 'http-2.22.2', 'sftp-2.22.2', 'srm-2.22.2', 'xrootd-2.22.2']

Important references:

@vkuznet
Collaborator

vkuznet commented Mar 6, 2024

Alan, thanks for putting this together. May I suggest a few things:

  • please move the alma9 and debian areas you created from the docker directory into the pypi area, since those are not base images per se but rather base images required for the WM pypi distribution. I would like to avoid misleading directories.
  • please put in the description the list of bare-minimum packages we need for the base image
  • please put in the description stats about image sizes and number of vulnerabilities
  • please specify the python version that almalinux:latest brings; it is python 3.9.18, which is different from the current WM requirement (3.8)
  • consider using a specific tag for almalinux instead of latest, which will allow reproducibility of the software stack.

Regarding apache rotatelogs: as you are well aware, it is our legacy approach from the VM-based deployment. In k8s the logs can easily be handled by kubernetes itself if we yield them to stdout. We'll need to decide if this is still a mandatory requirement. If it is, I suggest creating another base image only for this package: build it there from source (which requires installing gcc, make, autoconf, etc.), install it into a custom local area, then use the COPY approach (used to copy tools/areas from one image to another) to copy that local area into the final image, and properly set up LD_LIBRARY_PATH and PATH to locate the executable and its libraries.

@amaltaro amaltaro changed the title Investigating dockerfiles for Debian vs Alma9; plus MSUnmerged based on Alma9 Building a base Almalinux 9 image for WMCore services; plus specific build for MSUnmerged Mar 6, 2024
@amaltaro
Contributor Author

amaltaro commented Mar 6, 2024

We still have to investigate whether rotatelogs is really needed for the services. Spitting logs out to stdout will be a problem for multi-threaded services, where multiple logs are supposed to be created for each thread.

I refactored the PR description and also cleaned up this development branch. @vkuznet I didn't address all your points yet, but you might want to revisit the description. Thanks

@vkuznet
Collaborator

vkuznet commented Mar 6, 2024

Alan, thanks for the description update; it looks good and properly states the issue.

@d-ylee
Contributor

d-ylee commented Apr 1, 2024

@amaltaro I made use of WMCore.WMLogging.getTimeRotatingLogger. I passed None as the name of the logger, assuming we are using the root logger. This seems to be the case in getMSLogger:

https://github.com/dmwm/WMCore/blob/4bc4a58a6c86a2206131ad23da66a58dc889539c/src/python/WMCore/MicroService/Tools/Common.py#L63

I haven't tested this yet, since I'm still figuring out how to use the test8 cluster.
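
For readers who do not have WMCore at hand, the idea of replacing rotatelogs with in-process rotation can be sketched with the Python standard library alone. This is not the WMCore.WMLogging.getTimeRotatingLogger implementation; the function name `get_rotating_logger`, the log format and the retention settings below are assumptions made for illustration. Passing None as the name selects the root logger, which matches the behaviour described above.

```python
import logging
import logging.handlers
import os


def get_rotating_logger(name, log_dir, when="midnight", backup_count=7):
    """Return a logger that rotates its own file, so no external rotatelogs is needed.

    Hypothetical helper: name, format and defaults are illustrative only.
    """
    os.makedirs(log_dir, exist_ok=True)
    # logging.getLogger(None) returns the root logger
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching a second handler on repeated calls
        path = os.path.join(log_dir, f"{name or 'root'}.log")
        handler = logging.handlers.TimedRotatingFileHandler(
            path, when=when, backupCount=backup_count
        )
        # threadName in the format lets one file serve several threads
        handler.setFormatter(
            logging.Formatter("%(asctime)s:%(levelname)s:%(threadName)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

If one file per thread is really required, calling the helper with a distinct name per thread would produce separate files; whether that is desirable is part of the open discussion above.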

@amaltaro
Contributor Author

@arooshap Aroosha, can you please review a new yaml file that I pushed in (named reqmgr2ms-unmerged-cern.yaml) and confirm if that is all that I need in order to install another service flavor under the reqmgr2ms-unmerged umbrella?

I want to test a new docker image (based on Alma9), so I wanted to have a specific configuration for it for the moment.

@amaltaro
Contributor Author

I updated the Dockerfile and built/uploaded a new image for MSUnmerged, with tag 2.3.2rc6-20240419. Initial description has been updated with this.

The service crashes (and is automatically restarted by CherryPy) whenever it tries to remove a non-empty directory. This happened for all 4 RSEs that I have enabled for the service so far, and the log signature looks like:

2024-04-19 03:27:31,429:INFO:MSUnmerged: Trying to remove nonempty directory: /store/unmerged/Run3Summer22EENanoAODv12/GluGlutoBulkGravitontoHHto2G2Vto2G4Q_M-700_narrow_TuneCP5_13p6TeV_madgraph-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_postEE_v6-v2
[19/Apr/2024:03:27:32]  WATCHDOG: server exited with exit code signal 11 (core dumped)... restarting

Signal 11 is a SEGV (segmentation fault).

In addition, I also got the container killed with:

>>> command terminated with exit code 137

which seems to be caused by excessive memory usage. The current memory limit is set to 2GB, so I really doubt this is the actual problem...
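
As a side note on reading these two codes: by shell convention an exit code above 128 means the process was killed by signal (code - 128), so 137 corresponds to signal 9 (SIGKILL) and "signal 11" is SIGSEGV. SIGKILL is what the kernel OOM killer sends, but it is also what a container runtime sends when it forcibly stops a container, so 137 alone does not prove a memory problem. A minimal sketch of the decoding, with the hypothetical helper name `describe_exit`:

```python
import signal


def describe_exit(code):
    """Translate a shell-style exit code into a readable description."""
    if code > 128:
        # Shell convention: 128 + N means "terminated by signal N"
        signum = code - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            return f"killed by unknown signal {signum}"
    return f"exited with status {code}"


print(describe_exit(137))
print(signal.Signals(11).name)
```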

For the record, I did not manage to reproduce this interactively inside the POD with the following script:

import os
import gfal2

def createGfal2Context(logLevel="trace"):
    ctx = gfal2.creat_context()
    gfal2.set_verbose(gfal2.verbose_level.names[logLevel])
    return ctx

ctx = createGfal2Context()

dirPfn = "davs://***/store/unmerged/Run3Summer22EEMiniAODv3/ZZto2L2Nu_TuneCP5_13p6TeV_powheg-pythia8/MINIAODSIM/124X_mcRun3_2022_realistic_postEE_v1-v2"
ctx.rmdir(dirPfn)
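
One possible way to chase a crash like this, sketched here under assumptions rather than taken from the service code, is to run the suspicious call in a child interpreter with faulthandler enabled. A segmentation fault then kills only the child, the parent sees a negative return code, and the child's stderr carries the Python traceback at the point of the fault. The helper names `run_isolated` and `crashed_with_segv` are hypothetical.

```python
import signal
import subprocess
import sys


def run_isolated(snippet, timeout=60):
    """Run a Python snippet in a child interpreter with faulthandler enabled.

    Returns (returncode, stderr). Hypothetical helper for debugging only.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stderr


def crashed_with_segv(returncode):
    # On POSIX, subprocess reports death by signal N as return code -N
    return returncode == -signal.SIGSEGV
```

The text of the script above could be passed as `snippet`; since the crash was not reproduced interactively, this would only help if the fault can be triggered outside the running service.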

@vkuznet
Collaborator

vkuznet commented Apr 30, 2024

I want to provide additional insight into the pod failure on the k8s cluster:

  • I observed with the Prometheus process_exporter that the pod is interrupted after some time under some load. The CPU load on the pod increases and then it is killed
  • I found that the k8s events for that specific pod show a liveness probe failure, which basically triggers the kill process
k get event --namespace dmwm --field-selector involvedObject.name=ms-unmer-cern-75bbffc947-w85gh
LAST SEEN   TYPE      REASON      OBJECT                               MESSAGE
3m8s        Warning   Unhealthy   pod/ms-unmer-cern-75bbffc947-w85gh   Liveness probe failed: command "cmsweb-ping --url=http://localhost:8242/ms-unmerged/data/status --authz=/etc/hmac/hmac -verbose 0" timed out
13m         Normal    Pulled      pod/ms-unmer-cern-75bbffc947-w85gh   Container image "registry.cern.ch/cmsweb/reqmgr2ms-unmerged:2.3.2rc6-20240419" already present on machine
33m         Warning   Unhealthy   pod/ms-unmer-cern-75bbffc947-w85gh   Liveness probe failed: Unable to get response from http://localhost:8242/ms-unmerged/data/status, error: Get "http://localhost:8242/ms-unmerged/data/status": read tcp 127.0.0.1:39376->127.0.0.1:8242: read: connection reset by peer

Finally, I tried to measure the response time of the /status end-point of the service and it varies, e.g.

[_reqmgr2ms@ms-unmer-cern-75bbffc947-w85gh data]$ time curl -v http://localhost:8242/ms-unmerged/data/status
*   Trying ::1:8242...
* connect to ::1 port 8242 failed: Connection refused
*   Trying 127.0.0.1:8242...
* Connected to localhost (127.0.0.1) port 8242 (#0)
> GET /ms-unmerged/data/status HTTP/1.1
> Host: localhost:8242
> User-Agent: curl/7.76.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Type: text/plain;charset=utf-8
< Server: CherryPy/18.8.0
< Date: Tue, 30 Apr 2024 15:14:56 GMT
< Vary: Accept
< Pragma: no-cache
< Expires: Sun, 19 Nov 1978 05:00:00 GMT
< Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
< X-Rest-Status: 100
< Etag: "6ff77790d9559c19ec59cc65b3f5fab4b9e9d0e1"
< Content-Length: 138
< X-Rest-Time: 2100.468 us
<
{"result": [
 {
  "wmcore_version": "2.3.2rc6",
  "microservice_version": "2.3.2rc6",
  "microservice": "MSManager",
  "status": "OK"
}]}
* Connection #0 to host localhost left intact

real    0m8.620s
user    0m0.001s
sys     0m0.004s

In the above output it took 8 seconds to respond to the /status HTTP GET request.
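
Since the latency varies, a single curl run says little about how often the 5 second limit is exceeded. Note also that the X-Rest-Time header above reports about 2.1 ms while the wall clock shows 8.6 s, which suggests (an inference, not something confirmed here) that the time is spent before the handler runs. A small sketch for collecting several samples follows; the function names `sample_latency` and `summarize` and the default values are assumptions for illustration.

```python
import time
import urllib.request


def sample_latency(url, samples=5, timeout=30.0):
    """Return wall-clock seconds for `samples` sequential GET requests."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings, probe_timeout=5.0):
    """Summarize samples against a probe timeout (in seconds)."""
    return {
        "min": min(timings),
        "max": max(timings),
        "over_timeout": sum(1 for t in timings if t > probe_timeout),
    }


# Example call inside the pod (not executed here):
# summarize(sample_latency("http://localhost:8242/ms-unmerged/data/status"))
```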

According to the deployed k8s manifest file we have

        livenessProbe:
          exec:
            command:
            - cmsweb-ping
            - --url=http://localhost:8242/ms-unmerged/data/status
            - --authz=/etc/hmac/hmac
            - -verbose
            - "0"
          failureThreshold: 3
          initialDelaySeconds: 120
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 5
        name: ms-unmer-cern

which means that the liveness probe will time out after 5 seconds. Therefore, I conclude that we see pod restarts due to poor performance of the /status end-point. Here we have two options to fix the situation:

  1. increase the timeout of the liveness probe in the k8s manifest file to a larger value (to be determined)
  2. improve the /status end-point to reduce its latency.
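
The interplay of timeoutSeconds and failureThreshold can be illustrated with a simplified model: a probe counts as failed when the response takes longer than the timeout, and the container is restarted once failureThreshold consecutive probes fail. This is only a sketch of the documented probe semantics, not the kubelet's code, and `restarts_container` is a hypothetical name. The defaults are the values from the manifest above.

```python
def restarts_container(latencies, timeout_seconds=5.0, failure_threshold=3):
    """Return True if `failure_threshold` consecutive probes exceed the timeout."""
    consecutive = 0
    for latency in latencies:
        if latency > timeout_seconds:
            consecutive += 1
            if consecutive >= failure_threshold:
                return True
        else:
            # a single successful probe resets the failure counter
            consecutive = 0
    return False


# Three slow answers in a row (like the 8.6 s measured above) trigger a restart
print(restarts_container([8.6, 8.6, 8.6]))
# A much larger timeout tolerates the same latencies
print(restarts_container([8.6, 8.6, 8.6], timeout_seconds=600.0))
```

With periodSeconds set to 30, three consecutive failures take on the order of a minute or more to accumulate, so option 1 hides the slow responses while option 2 addresses them.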

@vkuznet
Collaborator

vkuznet commented Apr 30, 2024

To test the effect of the liveness probe, I adjusted the ms-unmer-cern deployment and scaled it back to 1 replica. The new liveness probe is set to 600sec. I'll report whether it affects pod stability or not.

@vkuznet
Collaborator

vkuznet commented Apr 30, 2024

I checked the pod and found no restarts after 40min:

ms-unmer-cern-5b9bcd4f44-4kztq          2/2     Running     0          40m

So, it seems we correctly identified the cause of the pod crashes.
