
Deprecation of lxplus7 #3730

Open
DickyChant opened this issue Jun 27, 2024 · 9 comments


@DickyChant
Contributor

```
qiansitian@sqmbp16 ~> ssh lxplus7
ssh: Could not resolve hostname lxplus7.cern.ch: nodename nor servname provided, or not known
```

Today I realized that lxplus7 is no longer there...

  • So we can still run gridpack generation in "local" mode with containers.
  • But I foresee some issues if we try to use condor on lxplus with an el7 OS.

For MadGraph gridpack generation, the issue is that we need to set up CMSSW as the working environment on the fly, which means:

  1. if we use the run-in-one-go option, we need to run everything inside an environment matching the target scram_arch, e.g. "el7" for UL (see the sketch after this list).
  2. we could split the steps and try some trick to submit condor jobs from an environment different from the target scram_arch, e.g. run "CODEGEN" first in a container, then exit and submit... but this option does not seem to work with the current implementation (it aborts with "Reusing an existing process directory ${name} is not actually supported in production at the moment. Please clean or move the directory and start from scratch."; at least I cannot make it work).
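For option 1, a minimal sketch of what I mean, assuming the cmssw-el7 container wrapper from /cvmfs and the standard gridpack_generation.sh entry point (the process name and card path are placeholders):

```bash
# run the whole generation inside an el7 container so the environment
# matches the target scram_arch; "local" mode keeps everything on this node
cmssw-el7 --command-to-run ./gridpack_generation.sh \
    my_process cards/examples/my_process local
```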

I have 3 solutions in mind right now:

  • We could use cmssw#44900 ("Use singularity for gridpack generation"), which actually runs event generation from the gridpack inside a container; this is already there and we just need to bump up the versions, so no action is needed.
  • I started a container today, built on top of 'dask-lxplus', that allows us to submit condor jobs to the CERN condor pool from inside it, with the necessary libraries: https://gitlab.cern.ch/cms-genprod-containers/lxplus_genprod_condor. We can submit jobs from it, and I can give detailed instructions if anyone is interested. But I got weird issues when querying jobs from inside the container due to some IP issue (a typical container thing; there should be a workaround).
  • We could also rely on CMSConnect; @celia-lo made some progress, but that is also not quite reliable...

Those are the options that I feel are feasible (some are already available, some need a little bit of work), but I'd like to go with the recommendation from GEN, since some of them may not really fit the roadmap.

@DickyChant
Contributor Author

DickyChant commented Jun 27, 2024

I am also not sure whether we could have a dirty workaround by setting scram_arch to be different from the OS we are working with. I won't say it is not worth trying, though. It would also need some tackling of the condor wrap-up, but that is a relatively easy thing.
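For the condor side of such a workaround, a hedged sketch of what the wrap-up could look like, assuming the CERN pool's WantOS mechanism for choosing the worker-node OS (the executable and job flavour are placeholders):

```bash
# hypothetical submit file: request an el7 environment on the worker node
# even though we submit from a host running a different OS
cat > gridpack.sub <<'EOF'
universe    = vanilla
executable  = run_gridpack.sh
MY.WantOS   = "el7"
+JobFlavour = "longlunch"
output      = job.out
error       = job.err
log         = job.log
queue
EOF
condor_submit gridpack.sub
```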

@lviliani
Contributor

Hi @DickyChant, thanks a lot for these checks!

I think another solution is to use https://gitlab.cern.ch/cms-cat/cmssw-lxplus/ , which emulates lxplus7 with condor support.

@lviliani
Contributor

There is also a slightly modified version of the above (https://gitlab.cern.ch/lviliani/mg_cmssw_docker/) which I started working on. It also includes genproductions, with the idea of having a container that ships everything we need to run gridpacks, but it's just a preliminary test for now.

@DickyChant
Contributor Author

Thanks for the heads up!

Do we have the container from CAT unpacked to cvmfs? If so, that's a nice addition! I suffered a lot getting mine set up on lxplus...

One thing that is actually worrisome is the support for the condor Python API... if it does a strict IP address check, the problem is inevitable when using a container (I never checked this part for singularity, but for docker it is a well-known mess... I never thought I'd run into this, because I had decided to avoid docker as much as possible...)
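For reference, this is the kind of query that hit the IP issue for me; a minimal sketch using the htcondor Python bindings (the schedd name is a placeholder):

```bash
# query a CERN schedd through the python bindings; from inside a container
# this can fail if the schedd does not like the container's (NATed) IP
python3 -c '
import htcondor
coll = htcondor.Collector()  # collector taken from the condor config inside the image
ad = coll.locate(htcondor.DaemonTypes.Schedd, "bigbird08.cern.ch")  # placeholder schedd
print(htcondor.Schedd(ad).query(projection=["ClusterId", "JobStatus"]))
'
```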

Having the container setup is actually nice. I started with the dask-lxplus container because I thought it would have better Python API support and be usable out of the box; then I had to add a full set of dependencies copied from the CMSSW containers... Having genproductions be part of it is actually not a bad idea, though I am afraid we need at least two things:

  1. For NLO we often have libraries compiled and installed on the fly... which does seem ridiculous, because we basically lose many of the good parts of using a container... therefore I believe Dominic's new PR should come before this!
  2. And... we are also downloading MG on the fly...

@sihyunjeon I thought you told me about making a release of genproductions. Actually, instead of making a legacy-style release of genproductions, i.e. a code tarball, we could release containers through ghcr. Getting a release needs at least a download and an untar/unzip etc., but publishing a container seems easier to access and to maintain as well (I could imagine the natural setup being one CI job that builds the container and a follow-up CI job that tests whether it is usable).
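A minimal sketch of the ghcr idea, assuming docker is available on the CI runner (the image name and tag are made up):

```bash
# build the genproductions container and publish it to ghcr; a follow-up CI
# job could then pull this exact tag and run a light gridpack test against it
docker build -t ghcr.io/cms-sw/genproductions:latest .
echo "$GITHUB_TOKEN" | docker login ghcr.io -u "$GITHUB_ACTOR" --password-stdin
docker push ghcr.io/cms-sw/genproductions:latest
```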

@lviliani
Contributor

Yes the container is unpacked on cvmfs:
/cvmfs/unpacked.cern.ch/gitlab-registry.cern.ch/cms-cat/cmssw-lxplus/cmssw-el7-lxplus
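For anyone who wants to try it, a minimal sketch of entering the unpacked image, assuming apptainer is available on lxplus (the bind mounts may need adjusting for your setup):

```bash
# open a shell in the lxplus7-like environment straight from cvmfs
apptainer shell -B /afs -B /cvmfs -B /eos \
  /cvmfs/unpacked.cern.ch/gitlab-registry.cern.ch/cms-cat/cmssw-lxplus/cmssw-el7-lxplus
```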

I agree with you in case we decide to include genproductions.

The nice thing about such a container is that it can easily be used within a CI job as well, to produce gridpacks using gitlab runners.
I tested it for a light local gridpack production and it worked.
In principle it could be extended to also use condor within the CI job, but that requires more work, I guess.

@DickyChant
Contributor Author

DickyChant commented Jun 27, 2024

> Yes the container is unpacked on cvmfs: /cvmfs/unpacked.cern.ch/gitlab-registry.cern.ch/cms-cat/cmssw-lxplus/cmssw-el7-lxplus
>
> I agree with you in case we decide to include genproductions.
>
> The nice thing about such a container is that it can easily be used within a CI job as well, to produce gridpacks using gitlab runners. I tested it for a light local gridpack production and it worked. In principle it could be extended to also use condor within the CI job, but that requires more work, I guess.

Actually, I tested my container on a desktop I had at CERN, and since it is inside the CERN network it has access to the CERN HTCondor schedds: condor_q worked successfully, so in principle it should also be able to submit condor jobs. Basically, dask-lxplus does the same thing as instructed in the CERN ABP twiki, especially the part on how to get a local HTCondor setup that can reach the CERN HTCondor pool. After a quick glance, I think CAT's image is doing the same.

Now, if we think of the powheg CI jobs in this very repo, which depend on a VM from CERN OpenStack (I guess @mseidel42 knows more details): it would have afs as long as we activate it via locmap, as well as the CERN internal web environment, so I do not see a technical issue with having a CI job able to use htcondor, provided we make the interactive solution work on lxplus. Such a thing should not be technically impossible if we could get an account with condor authorization, like the pdmvserv account. Note that you can basically achieve the same thing with REANA.

@DickyChant
Contributor Author

And for sure there is no issue doing it with CI; in fact, I would imagine our common background team would benefit even more, since what has been done there is also basically a CI: new cards are pushed first, then a machine picks them up and executes them by submitting condor jobs.

@lviliani
Contributor

Right, technically it is possible indeed. We just have to figure out some details in case we want to do that.
Yes, REANA could also be an option, and I think it can be integrated with gitlab CI, but I'm not familiar with it.

@DickyChant
Contributor Author

> Right, technically it is possible indeed. We just have to figure out some details in case we want to do that. Yes, REANA could also be an option, and I think it can be integrated with gitlab CI, but I'm not familiar with it.

I could give a short report in GEN at some point to show how to scale up to 1000 jobs on REANA with gitlab CI, in the context of tuning, but I can give a spoiler here: it doesn't scale up well. We've been trying to fine-tune it (with @sihyunjeon and @shimashimarin).

And I won't waste this chance to comment on its super inconvenient condor submission, which requires you to upload a krb5 keytab by hand!
