Theses & Reports

CTRL-F for names, keywords, etc. is encouraged!

Machine learning for compression of high-energy physics data

Sam Hill, University of Manchester, January 2024

Semester 1, MPhys

Host link: https://doi.org/10.5281/zenodo.10777008

TL;DR: Online and offline compression give comparable results on the ExaFEL dataset; reconstruction quality is better for datasets containing images of a single particle type.

Abstract: An analysis was carried out on the performance of Baler [1], a machine-learning-based data compression model, when compressing and reconstructing image data. The investigation focused on several factors that determine the applicability of Baler to high-energy physics experiments, namely the compression and reconstruction quality of images of different types of particles traversing a liquid argon chamber and the quality of real-time online image compression. Baler was found to perform high-quality online compression on diffraction images from the ExaFEL X-ray free electron laser project [2]. The mean squared error of images from the LArIAT experiment [3] showed that reconstruction was improved by approximately two orders of magnitude for datasets containing only one particle type compared to those containing a mix of particles. Visualisations of the compressed LArIAT data were also produced to investigate the effect of clustering on image compression and reconstruction.
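
For reference, the figure of merit quoted above is the per-image mean squared error between the original and reconstructed images; a minimal NumPy sketch (the arrays below are random placeholders, not LArIAT data):

```python
import numpy as np

def mse(original: np.ndarray, reconstructed: np.ndarray) -> float:
    """Mean squared error between an original image and its reconstruction."""
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    return float(np.mean(diff ** 2))

# Random placeholder images standing in for an original/reconstructed pair.
rng = np.random.default_rng(0)
original = rng.random((256, 256))
reconstructed = original + rng.normal(scale=0.01, size=original.shape)

print(f"per-image MSE: {mse(original, reconstructed):.3e}")
# Averaging this over a dataset gives the per-dataset numbers compared in the thesis.
```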

An Open-Source Autoencoder Compression Tool for High Energy Physics

Axel Gallén, Lund University, June 2023

Masters thesis

Host link: https://lup.lub.lu.se/student-papers/search/publication/9117991

TL;DR: Trained an autoencoder model whose reconstructed distributions are sufficiently good for HEP data at a compression ratio of 1.6, with negligible auxiliary data. At CFD-scale compression ratios we beat gzip substantially, but the auxiliary data file size becomes an issue.

Abstract: A common problem across scientific fields and industries is data storage. This thesis presents an open-source lossy data compression tool with its foundation in machine learning: Baler. Baler has been used to compress High Energy Physics (HEP) data, and initial compression tests on Computational Fluid Dynamics (CFD) toy data have been performed. For HEP, a compression ratio of R = 1.6 has generated reconstructions that can be deemed sufficiently accurate for physics analysis. In contrast, CFD data compression has successfully yielded sufficient results at a significantly higher compression ratio, R = 88. Baler’s reconstruction accuracy at different compression ratios has been compared to a lossless compression method, gzip, and a lossy compression method, Principal Component Analysis (PCA), showing case-wise larger compression ratios than gzip and, at the same compression ratio, accuracy overall exceeding that of PCA.
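
To make the compression-ratio comparison concrete, here is a rough sketch of how such ratios can be computed on toy data; the gzip baseline is real, but the latent and auxiliary file sizes are made-up placeholders, not Baler's actual output:

```python
import gzip
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100_000, 8)).astype(np.float32)  # toy stand-in for a HEP table
original_bytes = data.nbytes

# Lossless baseline: gzip the raw bytes.
gzip_bytes = len(gzip.compress(data.tobytes()))

# Lossy autoencoder-style bookkeeping: stored latent space plus auxiliary files.
# The sizes below are illustrative placeholders (e.g. 8 -> 5 variables gives R = 1.6).
latent_bytes = 100_000 * 5 * 4        # float32 latent representation
auxiliary_bytes = 200_000             # placeholder for model weights / normalisation data

r_gzip = original_bytes / gzip_bytes
r_ae = original_bytes / (latent_bytes + auxiliary_bytes)
print(f"gzip: R = {r_gzip:.2f}   autoencoder incl. auxiliary data: R = {r_ae:.2f}")
```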

Deep autoencoders for ATLAS data compression

George Dialektakis, August 2021

Google Summer of Code 2021 Report

Host link: https://github.com/Autoencoders-compression-anomaly/Deep-Autoencoders-Data-Compression-GSoC-2021/blob/main/report.pdf

TL;DR: Sparse autoencoders with min-max normalization have the best performance.

Abstract: Storage is one of the main limiting factors to the recording of information from proton-proton collision events at the Large Hadron Collider (LHC), at CERN in Geneva. Hence, the ATLAS experiment at the LHC uses a so-called trigger system, which selects and transfers interesting events to the data storage system while filtering out the rest. However, if interesting events are buried in very large backgrounds and difficult to identify as a signal by the trigger system, they will also be discarded together with the background. To alleviate this problem, different compression algorithms are already in use to reduce the size of the data that is recorded. One of those state-of-the-art algorithms is an autoencoder network, which tries to implement an approximation to the identity, f(x) = x: given some input data, its goal is to create a lower-dimensional representation of those data in a latent space using an encoder network. This way, when collisions happen in the ATLAS detector, we run the encoder on the produced data and save only the latent space representation. Then, using this latent representation offline, the decoder network can reconstruct the original data. The goal of this project is to experiment in depth with different types of autoencoders for data compression and optimize their performance in reconstructing the ATLAS event data. For this reason, three kinds of autoencoders are proposed: the Standard Autoencoder, the Variational Autoencoder, and the Sparse Autoencoder. These autoencoders are thoroughly tested using different parameters and data normalization techniques, as our ultimate goal is to obtain the best possible reconstructions of the original event data. The proposed implementations will be a decisive contribution towards future testing and analysis for the ATLAS experiment at CERN and will help overcome the obstacle of needing much more storage space than in the past due to the increase in the size of the data generated by the continuous proton-proton collision events at CERN’s Large Hadron Collider.
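
As a hedged sketch of the general approach described in this report (not the project's actual code), here is a minimal PyTorch autoencoder that min-max normalises its inputs and adds an L1 sparsity penalty on the latent activations; the layer sizes and penalty weight are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy network approximating the identity f(x) = x through a smaller latent space."""
    def __init__(self, n_features: int = 24, latent_dim: int = 12):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, n_features))

    def forward(self, x):
        z = self.encoder(x)        # compressed latent representation (what would be stored)
        return self.decoder(z), z  # reconstruction (recovered offline) and the latent itself

# Min-max normalise toy data to [0, 1] per variable.
x = torch.rand(1024, 24) * 100.0
x = (x - x.min(dim=0).values) / (x.max(dim=0).values - x.min(dim=0).values + 1e-12)

model = SparseAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5):  # a handful of steps, purely for illustration
    reconstruction, latent = model(x)
    # Reconstruction loss plus an L1 term encouraging sparse latent activations.
    loss = nn.functional.mse_loss(reconstruction, x) + 1e-4 * latent.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```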

Evaluation of float-truncation based compression techniques for the ATLAS jet trigger

Love Kildetoft, Lund University, June 2021

Bachelors thesis

Host link: https://lup.lub.lu.se/student-papers/search/publication/9049610

TL;DR: Float truncation was tested and evaluated: for three variables, the float-truncated values represent the original dataset well; double quantization unavoidably leads to artefacts in all compressed distributions.

Abstract: Data compression methods allow more data to be stored within a given storage framework while still keeping the characteristics of the data in question. At the Large Hadron Collider on the grounds of CERN in Switzerland, limited data storage capability has always been an urgent problem. At the ATLAS experiment, one technique that allows researchers to save more data within the same storage framework is so-called trigger-level analysis (TLA). This thesis work explores float-truncation-based data compression as an improvement to TLA. It is shown that this compression technique is promising for compressing several variables from TLA datasets, although it generates artefacts in the compressed distributions. This phenomenon is known as double quantization. It is explained how this effect is more or less unavoidable, as it is always present when discretizing a continuous distribution several times in succession. Furthermore, this thesis work explores the applicability of chaining float-truncation techniques with machine learning techniques (so-called autoencoder compression). It is shown that the original dataset is still well represented after applying the two techniques in succession.
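
For concreteness, float truncation here means zeroing low-order mantissa bits of IEEE-754 single-precision values; a small sketch (the number of dropped bits is an arbitrary choice, not the one used in the thesis):

```python
import numpy as np

def truncate_mantissa(values: np.ndarray, bits_to_drop: int = 13) -> np.ndarray:
    """Zero the lowest `bits_to_drop` mantissa bits of float32 values (lossy truncation)."""
    as_bits = np.asarray(values, dtype=np.float32).view(np.uint32)
    mask = np.uint32(((1 << 32) - 1) ^ ((1 << bits_to_drop) - 1))  # keep sign, exponent, high mantissa bits
    return (as_bits & mask).view(np.float32)

x = np.array([123.456789, 0.000123456, 98765.4321], dtype=np.float32)
print(truncate_mantissa(x))
# Truncating a distribution that has already been quantised once (e.g. by an earlier
# compression step) quantises it a second time, which is the "double quantization"
# effect responsible for the artefacts discussed above.
```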

Deep Compression for High Energy Physics Data

Honey Gupta, Lund University, August 2020

Google Summer of Code 2020 Report

Host links:
https://medium.com/@hn.gpt1/deep-compression-for-high-energy-physics-data-google-summer-of-code20-3dea5acc7bcf
https://drive.google.com/file/d/159QRCM8-c3FUy-y6c0-SHkgtbNJNr9H8/view

TL;DR: Event-level compression instead of jet-level; provided scripts for running on the cluster.

Abstract: At CERN’s Large Hadron Collider (LHC), proton collisions are performed to study the fundamental particles and their interactions. To detect and record the outcome of these collisions, multiple detectors with different focus points have been built. The ATLAS detector is one such general-purpose detector at the LHC. There are approximately 1.7 billion events or collisions occurring inside the ATLAS detector each second, and storage is one of the main limiting factors to the recording of information from these events. To filter out irrelevant information, the ATLAS experiment uses trigger systems, which select and send interesting events to the data storage system while throwing away the rest. Storage of these events is limited by the amount of information to be stored, and a reduction of the event size can allow for searches that were not previously possible. This project aims to investigate the use of deep neural autoencoders to compress event-level data generated by the HEP detector. The existing preliminary work investigates deep-compression algorithms on jets, which are the most common type of particle. The work shows promising results towards using deep compression on HEP data. We build upon the existing work and extend the compression algorithm to event-level data, which means that the data contain information for multiple particles rather than just jets. We experiment with two open-source datasets and perform ablation studies to investigate the effect of deep compression on different particles from multiple processes.

Deep Autoencoders for Data Compression in High Energy Physics

Eric Wulff, Lund University, February 2020

Masters thesis

Host link: https://lup.lub.lu.se/luur/download?func=downloadFile&recordOId=9004751&fileOId=9004752

TL;DR: Initial work on the idea of jet-level compression and definition of the network parameters using a rough hyperparameter scan.

Abstract: Current technological limitations make it impossible to store the enormous amount of data produced from proton-proton collisions by the ATLAS detector at CERN’s Large Hadron Collider. Therefore, specialised hardware and software are being used to decide what data, or which proton-proton collision events, to save and which to discard. A reduction in the storage size of each collision event is desirable, as it would allow for more data to be saved and thereby make a larger set of physics analyses possible. The focus of this thesis is to understand whether it is possible to reduce the storage size of the previously mentioned collision events using machine learning techniques for dimensionality reduction. This has never before been tried within the ATLAS experiment and is an interesting forward-looking study for future experiments. Specifically, autoencoder (AE) neural networks are used to compress several variables into a smaller latent space used for storage. Different neural network architectures with varying width and depth are explored. The AEs are trained and validated on experimental data from the ATLAS detector, and their performance is tested on an independent signal Monte Carlo sample. The AEs are shown to successfully compress and decompress simple hadron jet data, and preliminary results indicate that the reconstruction quality is good enough for certain applications where high precision is not paramount. The AEs are evaluated by their reconstruction error, the relative error of each compressed variable, and the ability to retain good resolution of a dijet mass signal (from the previously mentioned Monte Carlo sample) after encoding and decoding hadron jet data.
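
As an illustration of the per-variable relative error used in the evaluation above, a short sketch on toy arrays (not the thesis code or data):

```python
import numpy as np

def relative_error(original: np.ndarray, decoded: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Per-entry relative reconstruction error, (decoded - original) / original."""
    return (decoded - original) / (original + eps)

rng = np.random.default_rng(0)
jets = rng.normal(loc=100.0, scale=10.0, size=(10_000, 4))        # toy stand-in for jet variables
decoded = jets * (1.0 + rng.normal(scale=0.01, size=jets.shape))  # pretend AE output

err = relative_error(jets, decoded)
print("mean |relative error| per variable:", np.abs(err).mean(axis=0))
```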

Tests of Autoencoder Compression of Trigger Jets in the ATLAS Experiment

Erik Wallin, Lund University, June 2020

Bachelors thesis

Host link: https://lup.lub.lu.se/student-papers/search/publication/9012882

TL;DR: We have good results from the float truncation as well, so this is one of our “benchmarks to beat”.

Abstract: Limited data storage capability is a large obstacle to saving data in high-energy particle physics. One method of partially circumventing these limitations is trigger-level analysis (TLA), as used by the ATLAS experiment. The efficiency of TLA can be further increased by effective data compression. One class of artificial neural networks, called autoencoders, may be used for data compression. This thesis further tests the use of autoencoders for the compression of TLA data, while showing that it may be difficult to generalize between different datasets. The processing resources needed to compress TLA data in real time fit well within the available computing constraints, and the memory usage is predictable. The sequential use of different compression techniques, so-called float truncation followed by autoencoder compression, is also evaluated. It is shown that autoencoders show the same potential to be used on both uncompressed and float-truncated data. Compression artifacts from float truncation, called double quantization, are also explained and analytically predicted.

Autoencoder Compression in High Energy Physics

Sten Åstrand, Lund University, March 2022

Bachelors thesis

Host link: https://lup.lub.lu.se/student-papers/search/publication/9075881

TL;DR: Jet correlations don’t help that much with compression.

Abstract: Situated in Geneva, Switzerland, the Large Hadron Collider is the largest particle accelerator in the world, and as such, its operation carries with it some of the greatest technical challenges ever faced. Among them are the huge demands put on data storage capacity by experiments in particle physics, both in terms of rate and volume of data. Several systems are employed to manage and reduce the flow of data generated at the collider experiment stations. This comes at the cost of a reduced amount of material available for study.

This thesis analyses a relatively novel method of compressing, and thereby reducing the storage requirements of, data describing jets: showers of particles created in collisions between protons in the ATLAS experiment at the Large Hadron Collider. The main tool used for this compression is an artificial neural network of a type called an autoencoder. Such compression has previously been shown to be possible on single jets. As a continuation of that work, this thesis investigates whether it is possible to compress groups of jets with better results than when compressing them individually.

To that end, several autoencoder models are trained on jet groups of different configurations. These autoencoders are shown to be able to replicate the results of previous, single-jet studies, but the errors introduced during compression increase when jets are compressed in a group. This holds true for jets from the same proton-proton collision as well as jets randomly selected from a larger dataset. It is demonstrated that groups specifically made to contain jets with almost identical values of one variable can be compressed at a higher ratio than individual jets, with only slightly increased errors. However, this process carries with it the requirement of access to a large dataset, which is not possible if applied in a particle physics experiment, where data is gathered detection by detection.

Investigation of Autoencoders for Jet Images in Particle Physics

Jessica Lastow, Lund University, April 2021

Bachelors thesis

Host link: https://lup.lub.lu.se/student-papers/search/publication/9041079

TL;DR: Work is at a very early stage, with no outstanding results in distinguishing signal from background, but we now have the tools to pre-process jet images.

Abstract: Dark matter is an invisible type of matter believed to make up 85 % of the matter in the universe, but it has not yet been identified by experiments. According to certain particle physics theories, possible signatures of dark matter are so-called dark jets. They are the dark matter equivalent of classic jets, i.e. collimated streams of particles. These jets could be produced in the proton-proton collisions at the Large Hadron Collider (LHC). The ATLAS experiment at the LHC is currently searching for such jets.

There are two challenges in discovering dark jets. Firstly, the traces left in the detector by dark jets are not well understood, so the best lead is to search for something anomalous. Secondly, current data storage limitations force us to discard data, and this reduces the probability of registering a dark jet.

Using machine learning techniques, more specifically autoencoders, is a proposed solution, as autoencoders can perform compression and anomaly detection simultaneously. This thesis investigates the use of autoencoders for jet images, a two-dimensional jet representation. A group of different autoencoders are trained separately to learn inherent structures in QCD jets (background, ordinary events). They are then used to recognize boosted W-boson jets (signal, anomalous events) with a different signature. The separation of boosted W-boson jets and QCD jets is a simplified version of the problem of separating possible dark jets from QCD jets.

The autoencoders were able to compress the background jet images threefold with an error of less than 5 % for over 95 % of the data. However, for signal jet images the error was found to be even smaller. This made anomaly detection impossible since the opposite is required for the method to work. The difference between signal and background could be too small for the simple autoencoders to distinguish.
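
The anomaly-detection logic described above amounts to thresholding the autoencoder reconstruction error: it only works if signal images reconstruct worse than the background the network was trained on. A schematic sketch with random placeholder error values (not results from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder per-image reconstruction errors; in the thesis these would come from
# autoencoders trained on QCD (background) jet images.
background_errors = rng.exponential(scale=0.03, size=10_000)
signal_errors = rng.exponential(scale=0.02, size=1_000)  # signal reconstructing *better*, as found above

# Flag as anomalous anything above, say, the 95th percentile of the background error.
threshold = np.quantile(background_errors, 0.95)
signal_efficiency = float(np.mean(signal_errors > threshold))
background_kept = float(np.mean(background_errors <= threshold))

# When signal errors are smaller than background errors, the signal efficiency
# stays low and this kind of anomaly detection fails, as the thesis concludes.
print(f"threshold = {threshold:.4f}, signal efficiency = {signal_efficiency:.2%}, background kept = {background_kept:.2%}")
```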