[cleaner] Option for distributed cleaning #3476

TrevorBenson · 2024-01-15T21:45:04Z

TrevorBenson
Jan 15, 2024

Subject: Proposal for Distributed Cleaner in Light of Performance Issues

Hi everyone,

I've been reflecting on the cleaner's performance and open issues, and the idea of a distributed version caught my attention. I haven't delved deep into the obfuscation functions to assess their compatibility, and I'm aware of the ongoing feature request #3097 to enhance parallelism.

Considering potential obstacles, including in syncing parsers, a few thoughts:

Potential Resource Discrepancy
- The collector system might have significantly fewer resources than the systems it collects from, impacting overall cleaning performance.
- In clusters, the master node often requires only a fraction of resources compared to cluster nodes, leading to sub-optimal cleaning speed as this system has a high probability of being the collector.
Performance Insights
- Feedback in Run cleaner concurrently also inside one archive #3097 highlights severe performance impact on single archives compared to the current cleaning process (comment link).
- Negative impact on single archives caused by concurrent cleaning could be avoided in distributed cleaning operations. At least if the current cleaner without more parallism was distributed, it would have the same (or better performance depending on the nodes resources) and the work would be divided among the entire cluster.

I'm not suggesting this be the default cleaner option, but exploring distributed cleaning could address scenarios where waiting for the cleaner on an under-powered system is impractical. Therefore I feel it's worth considering the requirements for a distributed approach which could also ensure consistent obfuscation across all archives and the produced mapping file.

Looking forward to your thoughts and insights.

pmoravec · 2024-01-16T14:20:49Z

pmoravec
Jan 16, 2024
Maintainer

Running cleaner concurrently on individual cluster nodes can gain significant performance gain (while implementing just this feature shall be easy, esp. comparing to concurrent cleaner inside one archive). But the cost is "incompatible" mapping where e.g. the same cluster node name is mapped once as host1 and once as host12, since the mapping on individual cluster nodes is being built independently.

Still this feature has sense even when having the full feature "cleaner runnng concurrently over one archive" for the performance reasons - user shall have an option to that.

2 replies

TrevorBenson Jan 16, 2024
Author

Yeah, when I was thinking of this I was contemplating obfuscation functions which took a hash/salt used to create the obfuscated data. Basically the collector would define it, pass it along, and then with each node using the same hash/salt result in the same transformation of host1.

I suspected the current functions did not support anything like this, but didn't review them yet. It was also just a first thought when considering the concept of a distributed obfuscation with a unified mapping file in the end.

I suspect there are better, more well thought out, ideas among the group on how this might be accomplished.

pmoravec Jan 17, 2024
Maintainer

Using a global salt sounds a great idea (if I interpret it right that sos generates a secret salt that encodes a hostname (or MAC address etc) to a hash to constitute host<HASH> uniform on any system). This approach can be beneficial also for concurrent cleaner on one system (thought Jake started preparations for another approach - but that should not be in conflict here..?).

The salt should be stored somewhere on the disk to re-use it on subsequent calls of the cleaner (currently we store the mappings "only" for the same purposes). And also storing it in a file can be the mechanism how collector tells individual cluster nodes what salt to use.. Sadly this will mean if one runs sosreport locally on a node (which will generate a first salt) and then run it from a cluster, the cluster's salt will overwrite the local one and mappings from those sosreports wont be compatible. But I expect this use case as rare.

TurboTurtle · 2024-01-23T22:07:41Z

TurboTurtle
Jan 23, 2024
Maintainer

I'm certainly interested in hearing more about a 'distributed' cleaner, but first let me recap where the performance discussion was heading as of ~6 months ago.

I was looking at transitioning the flow to use a set of processes (as opposed to threads) that would be responsible for reading files in parallel and finding items for obfuscation. This obfuscation would be handled by having the process add the to-be-obfuscated item to a Queue() in the main process, and the main process would solely be dedicated to retrieving and answering the items in this queue.

This should, in theory at least, allow our performance to now be limited by how fast the main thread can pull and respond to items in the queue, rather than how many files we can read concurrently.

This involves at minimum decoupling parsers from holding the obfuscation maps, and instead having them literally just as fancy abstractions for regex patterns. The mappings would be what the main process keeps control over to ensure obfuscations are consistent between processes.

Depending on the process management design, this could either be "1 process per archive" - akin to what the idea is today (but what we don't actually achieve) or it could be "just throw a bunch of processors at this huge list of files". The latter makes for an easier implementation to support multi-process single archive cleaning, but may not be the most efficient design as we'd probably need to enumerate all the files we'll be scanning before starting the process to ensure we're actively using all our desired processes at any given time. This in turn would likely mean we'd need to (or at least want to) unpack every archive before beginning the obfuscation process, which isn't great either. There's a lot of options here in the overall design to talk about.

All that said, we're a ways off right now. I was not able to finish decoupling the parsers and mappings before I left Red Hat. Then there is the fact that the queue-based workflow, while "simple" in design, is not a trivial amount of work on its own and would need significant testing to make sure our obfuscations are consistent regardless of scale.

0 replies

pmoravec · 2024-02-17T16:11:29Z

pmoravec
Feb 17, 2024
Maintainer

There might be an "elegant" alternative approach, though it has its cons as well.

The problem in cleaning concurrently is to maintain a sequence of IDs in obfuscateduser1-like strings, sequentially. What about replacing the numerical ID by a hash (of length, say, 6-8 chars)? The hash would be deterministically computed from the original string that we obfuscate. And it would allow to create the mapping absolutely concurrently / independently. Just a merge at the end would have to happen.

There are a few gotchas, however.

First, the length of the hash. It must be reasonably short to be human-understandable. Gladly a typical string to be obfuscated is relatively short length, that can be represented by relatively short hash. Very rule-of-thumb estimation is 6-8 chars, imho, to prevent hash collisions. Some better estimation is welcomed ("we obfuscate words usually up to X chars, so a hash of 6/7/8 chars would mean probability of Y% of a collision = that is too much / adequate / perfect").

Then, we would need adding some salt, private and semi-unique for the system that triggers the cleaner. Otherwise, one can run a vocabulary attack to generate list of such short hashes and get the obfuscated word from the given hash.

This salt would bring a new problem: running sosreports from system A with saltA would be in conflict of running sos collect from system B with saltB (that will collect sosreport also from system A) - since we need to have one salt for whole sos collect. But I think this type of problems are common to the current approach with sequential IDs as well..?

Last problem I am aware is incompatibility of this mapping with the current one. This new mapping would have to throw away the current existing /etc/sos/cleaner/default_mapping from current "sequential IDs" mapping - once for every such system. Not a big deal, but e.g. it will make difficult to correlate obfuscated strings from old mapping (obfuscateduser12) to the new one (obfuscateduser-asdf123).

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sosreport

[cleaner] Option for distributed cleaning #3476

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 3 comments 2 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

sosreport

[cleaner] Option for distributed cleaning #3476

TrevorBenson Jan 15, 2024

Replies: 3 comments · 2 replies

pmoravec Jan 16, 2024 Maintainer

TrevorBenson Jan 16, 2024 Author

pmoravec Jan 17, 2024 Maintainer

TurboTurtle Jan 23, 2024 Maintainer

pmoravec Feb 17, 2024 Maintainer

TrevorBenson
Jan 15, 2024

Replies: 3 comments 2 replies

pmoravec
Jan 16, 2024
Maintainer

TrevorBenson Jan 16, 2024
Author

pmoravec Jan 17, 2024
Maintainer

TurboTurtle
Jan 23, 2024
Maintainer

pmoravec
Feb 17, 2024
Maintainer