MSUnmerged T1 service crashing when pulling large storage json dump #12061
Relevant tickets so far:
The main issue with large JSON is the nested dictionary representation of the data. It forces any parser, in any programming language, to load the entire document into memory as a single object. Instead, the data should be represented as a list of smaller objects. Here are two approaches to representing the same data:
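For illustration, a minimal sketch of the two representations; the keys and values are hypothetical, not taken from any actual WM document:

```python
# Approach 1: one big nested dictionary -- any parser has to load the
# whole object into memory before anything can be processed.
nested = {
    "T1_US_FNAL_Disk": {
        "/store/unmerged/dirA": ["file_1.root", "file_2.root"],
        "/store/unmerged/dirB": ["file_3.root"],
    }
}

# Approach 2: a flat list of small, self-contained records -- each record
# can be parsed and processed independently of the others.
records = [
    {"rse": "T1_US_FNAL_Disk", "dir": "/store/unmerged/dirA", "file": "file_1.root"},
    {"rse": "T1_US_FNAL_Disk", "dir": "/store/unmerged/dirA", "file": "file_2.root"},
    {"rse": "T1_US_FNAL_Disk", "dir": "/store/unmerged/dirB", "file": "file_3.root"},
]
```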
Here the memory footprint of the parser can be as small as the size of a single record. Moreover, such a data representation allows data streaming, e.g. using the NDJSON data format:
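A minimal sketch of consuming NDJSON one record at a time; the file name and the `process` helper are hypothetical:

```python
import json

# In NDJSON each line is a complete JSON document, so only one record
# needs to be held in memory at any given moment.
with open("unmerged_dump.ndjson") as fd:  # hypothetical file name
    for line in fd:
        record = json.loads(line)
        process(record)  # hypothetical per-record handler
```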
Observation I: quite often WM objects are represented (dumped from CouchDB) as nested dictionaries. There is no JSON schema, i.e. the JSON is not defined with a set of pre-defined keys.

Observation II: we often use CouchDB to store Python dictionaries and then look them up. Dictionaries cannot be parsed incrementally, i.e. they must be loaded into memory all at once.

Therefore, quite often the main problem is the lack of a data-representation schema. CouchDB (or MongoDB) allows us to store unstructured data, but that does not mean we should not care about a data-representation schema. We should put a layer (a data service) in front of any database which consumes and yields data in a specific format (JSON or NDJSON) and translates incoming/outgoing data to/from the database, if necessary. Instead, we often use the database directly for data storage, and this leads to the problem discussed here when the size of the stored objects grows and the data is stored as dictionaries.
Valentin, I think we are all aware of these limitations and of the legacy/habit of storing nested dictionaries with non-static keys (which does not have only disadvantages, btw). Nonetheless, I see you have not even looked at the structure that MSUnmerged actually deals with, which is what is responsible for this blow-up of the memory footprint. Just in case, we are talking about a flat list, e.g.:
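For illustration only, a flat list along these lines; the paths below are made up, not taken from the actual FNAL dump:

```python
# A flat list of path strings, one entry per unmerged directory/file
# (all paths are hypothetical).
[
    "/store/unmerged/Run3Summer22/SomeDataset/NANOAODSIM/.../file_1.root",
    "/store/unmerged/Run3Summer22/SomeDataset/NANOAODSIM/.../file_2.root",
    "/store/unmerged/Run3Summer22/SomeDataset/NANOAODSIM/.../file_3.root",
]
```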
In addition, we do not control this data, as it comes from one of the Rucio services, as mentioned in the initial description.
If the data is represented as a flat list of strings, the amount of memory required to parse it can be as small as the size of the longest string in the list. If you do not control the data, you can still write a parser that reads it one line at a time (instead of using JSON load(s)). This reduces the parser's memory requirement to the size of a single string. That said, I doubt such a list requires 8GB of RAM unless you are dealing with billions of such strings. Here is a basic proof:
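A sketch of the kind of measurement behind the numbers quoted below; the LFN is hypothetical and the exact byte counts vary slightly with the Python version:

```python
import sys

# A hypothetical LFN of realistic length (~175 characters).
lfn = "/store/unmerged/RunIISummer20UL18/SomeDataset/NANOAODSIM/" + "x" * 115

# Size of a single string object: ~49 bytes of overhead plus 1 byte per
# ASCII character, i.e. a couple of hundred bytes.
print(sys.getsizeof(lfn))

# Size of the list object itself (one 8-byte pointer per entry plus overhead);
# the string objects it references are accounted for separately.
print(sys.getsizeof([lfn] * 10_000))     # tens of KB
print(sys.getsizeof([lfn] * 1_000_000))  # a few MB
```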
So, in other words, a single filename costs 224 bytes, a list of 10K such strings has a size of 85KB, and a list of 1M entries will cost 8MB. That is still three orders of magnitude lower than 8GB. But if you use such a list within a nested dictionary, then your memory footprint may blow up significantly, depending on the structure and nesting level of the dictionary.
I just managed to resume working on this item and I tried to delete the following 2 directories:
from FNAL_Disk with the gfal2 plugin command. Initially, this is the error I was getting inside the ms-unmer-t1 pod:
It turns out the CAs are pretty old/stale in this pod. After copying them from lxplus into the ms-unmer-t1 pod, the same command succeeded without any issues. I will follow this up with Aroosha and see if we can start mounting /etc/grid-security/certificates into the backend pods (or at least the WM ones).
For my own record, the current snapshot from Rucio Consistency Enforcement from
I decided to bump the resource requirements for the production
The first ~100 deletions failed because I still did not have up-to-date CAs on the node, but after fixing those I see that directory deletions are successfully going through, for example:
Impact of the bug
MSUnmerged
Describe the bug
After bumping the resource limit up to 8GB of RAM, ms-unmerged-t1 is still unable to load the T1_US_FNAL_Disk data into memory, which makes the service crash every minute and repeat this endlessly.
In addition, several T1 storage sites forbid the WMCore robot certificate from deleting unneeded data from the unmerged area. RSEs identified so far are: CNAF and JINR.
How to reproduce it
Fetch the FNAL unmerged dump and load it into memory in Python, e.g. as sketched below.
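A minimal sketch of the reproducer, assuming the dump has already been downloaded to a local file (the file name is hypothetical):

```python
import json

# json.load materializes the full list of LFNs (plus per-string overhead)
# in memory at once, which is the step that blows up the memory footprint.
with open("fnal_unmerged_dump.json") as fd:  # hypothetical local copy of the dump
    lfns = json.load(fd)

print(f"Loaded {len(lfns)} entries")
```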
Expected behavior
Ideally, we should have better control over the data that is loaded into memory, either by streaming it or by loading it in chunks; one possible approach is sketched below.
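One possible approach, sketched with the third-party ijson library and assuming the dump is a single top-level JSON array (the file name is hypothetical):

```python
import ijson  # third-party incremental JSON parser

# ijson yields one array element at a time, so the full list of LFNs
# never has to be resident in memory.
with open("fnal_unmerged_dump.json", "rb") as fd:
    for lfn in ijson.items(fd, "item"):
        handle(lfn)  # hypothetical placeholder for the per-entry deletion logic
```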
We also have to follow up on those permission issues with the site admins.
Additional context and error message
We seem to have been crashing this service at least since Mar 21st, 2024 (!!!)