Theses & Reports

Machine learning for compression of high-energy physics data

Sam Hill, University of Manchester, January 2024

Semester 1, Mphys

Host link: https://doi.org/10.5281/zenodo.10777008

TL;DR:

An analysis was carried out on the performance of Baler [1], a machine-learning-based data compression model, when compressing and reconstructing image data. The investigation focused on several factors that determine the applicability of Baler to high-energy physics experiments: the compression and reconstruction quality of images of different types of particle traversing a liquid argon chamber, and the quality of real-time online image compression. Baler was found to perform high-quality online compression on diffusion images from the ExaFEL X-ray free electron laser project [2]. The mean squared error of images from the LArIAT experiment [3] showed that reconstruction was improved by approximately two orders of magnitude for datasets containing only one particle type compared to those containing a mix of particles. Visualisations of the compressed LArIAT data were also produced to investigate the effect of clustering on image compression and reconstruction.
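
For readers reproducing this kind of study, the per-image figure of merit referred to above is the mean squared error between an original image and its reconstruction. A minimal sketch, assuming NumPy arrays and hypothetical loaders for the originals and the Baler round-trip outputs:

```python
import numpy as np

def mse(original: np.ndarray, reconstructed: np.ndarray) -> float:
    """Mean squared error between an image and its reconstruction."""
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    return float(np.mean(diff ** 2))

# Hypothetical usage on a batch of detector images and their reconstructions:
# per_image_mse = [mse(o, r) for o, r in zip(originals, reconstructions)]
```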

Deep autoencoders for ATLAS data compression

George Dialektakis, August 2021

Google Summer of Code 2021 Report

Host link: https://github.com/Autoencoders-compression-anomaly/Deep-Autoencoders-Data-Compression-GSoC-2021/blob/main/report.pdf

TL;DR:

Storage is one of the main limiting factors to the recording of information from proton-proton collision events at the Large Hadron Collider (LHC) at CERN in Geneva. Hence, the ATLAS experiment at the LHC uses a so-called trigger system, which selects and transfers interesting events to the data storage system while filtering out the rest. However, if interesting events are buried in very large backgrounds and are difficult for the trigger system to identify as signal, they will be discarded together with the background. To alleviate this problem, different compression algorithms are already in use to reduce the size of the data that is recorded. One of these state-of-the-art algorithms is an autoencoder network, which tries to implement an approximation to the identity, f(x) = x: given some input data, its goal is to create a lower-dimensional representation of those data in a latent space using an encoder network. This way, when collisions happen in the ATLAS detector, we run the encoder on the produced data and save only the latent-space representation. Using this latent representation, the decoder network can then reconstruct the original data offline. The goal of this project is to experiment in depth with different types of autoencoders for data compression and to optimize their performance in reconstructing the ATLAS event data. For this reason, three kinds of autoencoders are proposed, specifically the Standard Autoencoder, the Variational Autoencoder, and the Sparse Autoencoder. These autoencoders are thoroughly tested using different parameters and data normalization techniques, as our ultimate goal is to obtain the best possible reconstructions of the original event data. The proposed implementations will be a decisive contribution towards future testing and analysis for the ATLAS experiment at CERN and will help overcome the obstacle of needing much more storage space than in the past due to the increasing size of the data generated by the continuous proton-proton collision events in CERN’s Large Hadron Collider.
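
As an illustration of the encoder/decoder idea described above, a minimal standard autoencoder might look like the following PyTorch sketch; the layer sizes, input dimension, and latent dimension are placeholders rather than the configurations studied in the report:

```python
import torch
import torch.nn as nn

class StandardAutoencoder(nn.Module):
    """Minimal autoencoder approximating the identity f(x) = x
    through a lower-dimensional latent space."""

    def __init__(self, n_features: int = 24, latent_dim: int = 12):
        super().__init__()
        # Encoder: map the event variables down to the latent space.
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )
        # Decoder: reconstruct the original variables from the latent vector.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, n_features),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

# Online: store only the latent representation, z = model.encoder(x).
# Offline: reconstruct the event variables, x_hat = model.decoder(z).
```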

Evaluation of float-truncation based compression techniques for the ATLAS jet trigger

Love Kildetoft, Lund University, June 2021

Bachelors thesis

Host link: https://lup.lub.lu.se/luur/download?func=downloadFile&recordOId=9049610&fileOId=9049621

TL;DR:

Data compression methods allow more data to be stored within a given storage framework while still retaining the characteristics of the data in question. At the Large Hadron Collider on the grounds of CERN in Switzerland, limited data storage capability is, and has always been, an urgent problem. At the ATLAS experiment, one technique which allows researchers to save more data within the same storage framework is so-called trigger level analysis (TLA). This thesis work explores float-truncation-based data compression as an improvement to TLA. It is shown that this compression technique is promising for compressing several variables from TLA datasets, although it generates artifacts in the compressed distributions. This phenomenon is known as double quantization. It is explained how this effect is more or less unavoidable, as it is always present when a continuous distribution is discretized several times in succession. Furthermore, this thesis work explores the applicability of chaining float-truncation techniques with machine learning techniques (so-called autoencoder compression). It is shown that the original dataset is still well represented after applying the two techniques in succession.
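
As a rough illustration of the float-truncation idea (not necessarily the exact scheme evaluated in the thesis), the low-order mantissa bits of an IEEE 754 float32 value can be zeroed, shortening the stored representation at the cost of discretizing the values:

```python
import numpy as np

def truncate_mantissa(values: np.ndarray, keep_bits: int = 10) -> np.ndarray:
    """Zero all but the `keep_bits` most significant mantissa bits of
    IEEE 754 float32 values (float32 has a 23-bit mantissa)."""
    bits = values.astype(np.float32).view(np.uint32)
    mask = np.uint32((0xFFFFFFFF << (23 - keep_bits)) & 0xFFFFFFFF)
    return (bits & mask).view(np.float32)

# Illustrative usage on a made-up jet-pT-like spectrum.
pt = np.random.default_rng(0).uniform(20.0, 400.0, 100_000).astype(np.float32)
pt_truncated = truncate_mantissa(pt, keep_bits=10)
print(np.max(np.abs(pt - pt_truncated) / pt))  # maximum relative truncation error
```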

Deep compression for High Energy Physics data

Honey Gupta, Lund University, August 2020

Google Summer of Code 2020 Report

Host link: https://drive.google.com/file/d/159QRCM8-c3FUy-y6c0-SHkgtbNJNr9H8/view

TL;DR:

At CERN’s Large Hadron Collider (LHC), proton collisions are performed to study the fundamental particles and their interactions. To detect and record the outcome of these collisions, multiple detectors with different focus points have been built. The ATLAS detector is one such general-purpose detector at the LHC. There are approximately 1.7 billion events, or collisions, occurring inside the ATLAS detector each second, and storage is one of the main limiting factors to the recording of information from these events. To filter out irrelevant information, the ATLAS experiment uses trigger systems, which select and send interesting events to the data storage system while throwing away the rest. Storage of these events is limited by the amount of information to be stored, and a reduction of the event size can allow for searches that were not previously possible. This project aims to investigate the use of deep neural autoencoders to compress event-level data generated by the HEP detector. The existing preliminary work investigates deep-compression algorithms on jets, which are the most common type of particle, and shows promising results towards using deep compression on HEP data. We build upon the existing work and extend the compression algorithm to event-level data, which means that the data contain information for multiple particles rather than just jets. We experiment with two open-sourced datasets and perform ablation studies to investigate the effect of deep compression on different particles from multiple processes.

Deep Autoencoders for Data Compression in High Energy Physics

Eric Wulff, Lund University, February 2020

Masters thesis

Host link: https://lup.lub.lu.se/luur/download?func=downloadFile&recordOId=9004751&fileOId=9004752

TL;DR:

Current technological limitations make it impossible to store the enormous amount of data produced from proton-proton collisions by the ATLAS detector at CERN’s Large Hadron Collider. Therefore, specialised hardware and software are used to decide which data, or which proton-proton collision events, to save and which to discard. A reduction in the storage size of each collision event is desirable, as it would allow for more data to be saved and thereby make a larger set of physics analyses possible. The focus of this thesis is to understand whether it is possible to reduce the storage size of the previously mentioned collision events using machine learning techniques for dimensionality reduction. This has never before been tried within the ATLAS experiment and is an interesting forward-looking study for future experiments. Specifically, autoencoder neural networks (AEs) are used to compress a number of variables into a smaller latent space used for storage. Different neural network architectures with varying width and depth are explored. The AEs are trained and validated on experimental data from the ATLAS detector and their performance is tested on an independent signal Monte-Carlo sample. The AEs are shown to successfully compress and decompress simple hadron jet data, and preliminary results indicate that the reconstruction quality is good enough for certain applications where high precision is not paramount. The AEs are evaluated by their reconstruction error, the relative error of each compressed variable, and the ability to retain good resolution of a dijet mass signal (from the previously mentioned Monte-Carlo sample) after encoding and decoding hadron jet data.
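
One of the evaluation metrics mentioned above, the relative error of each compressed variable, can be sketched as follows; the array names are illustrative and the variables are assumed to be non-zero:

```python
import numpy as np

def relative_residuals(original: np.ndarray, reconstructed: np.ndarray) -> np.ndarray:
    """Per-variable relative residual (x_reco - x_orig) / x_orig for
    arrays of shape (n_jets, n_variables); variables assumed non-zero."""
    return (reconstructed - original) / original

# Hypothetical usage with arrays of jet variables and the AE round-trip output:
# residuals = relative_residuals(jets, autoencoder_output)
# rms_per_variable = np.sqrt(np.mean(residuals ** 2, axis=0))
```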

Tests of Autoencoder Compression of Trigger Jets in the ATLAS Experiment

Erik Wallin, Lund University, June 2020

Bachelors thesis

Host link: https://lup.lub.lu.se/luur/download?func=downloadFile&recordOId=9012882&fileOId=9012886

TL;DR:

Limited data storage capability is a large obstacle to saving data in high-energy particle physics. One method of partially circumventing these limitations is trigger level analysis (TLA), as used by the ATLAS experiment. The efficiency of TLA can be further increased by effective data compression. Autoencoders, one class of artificial neural network, may be used for such data compression. This thesis further tests the use of autoencoders for the compression of TLA data, while showing that it may be difficult to generalize between different datasets. The processing resources needed to compress TLA data in real time fit well within the available computing constraints, and the memory usage is predictable. The sequential use of different compression techniques, so-called float truncation followed by autoencoder compression, is also evaluated. It is shown that autoencoders have the same potential when applied to both uncompressed and float-truncated data. Compression artifacts from float truncation, called double quantization, are also explained and analytically predicted.
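
The double-quantization effect mentioned above can be illustrated with a toy numerical sketch (the step sizes are arbitrary, chosen only to make the artifact visible): values that were already discretized once and are then discretized again with an incommensurate step populate the second grid unevenly, which is what produces the spikes in the compressed distributions.

```python
import numpy as np

# Toy demonstration of double quantization: data discretised once (e.g. by the
# detector readout or a previous rounding step) are discretised again with an
# incommensurate step, so the second grid is populated unevenly.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 1_000_000)

first = np.round(x / 0.010) * 0.010       # first quantization, step 0.010
second = np.round(first / 0.017) * 0.017  # second quantization, step 0.017

_, counts = np.unique(np.round(second, 6), return_counts=True)
print(counts.min(), counts.max())  # uneven bin occupancy = the artifact
```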
