Theses & Reports
Sam Hill, University of Manchester, January 2024
Semester 1, MPhys
Host link: https://doi.org/10.5281/zenodo.10777008
TL;DR:
An analysis was carried out on the performance of Baler [1], a machine-learning-based data compression model, when compressing and reconstructing image data. The investigation focused on several factors that determine the applicability of Baler to high-energy physics experiments: the compression and reconstruction quality of images of different types of particle traversing a liquid argon chamber, and the quality of real-time online image compression. Baler was found to perform high-quality online compression on diffusion images from the ExaFEL X-ray free electron laser project [2]. The mean squared error of images from the LArIAT experiment [3] showed that reconstruction was improved by approximately two orders of magnitude for datasets containing only one particle type compared to those containing a mix of particles. Visualisations of the compressed LArIAT data were also produced to investigate the effect of clustering on image compression and reconstruction.
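The mean squared error comparison described above can be illustrated with a minimal sketch. The arrays below are synthetic stand-ins rather than LArIAT images, and the helper name `per_image_mse` is hypothetical; it is not part of Baler.

```python
import numpy as np

def per_image_mse(original, reconstructed):
    """Return the mean squared error of each image in a batch.

    Both inputs are assumed to have shape (n_images, height, width).
    """
    original = np.asarray(original, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    if original.shape != reconstructed.shape:
        raise ValueError("original and reconstructed must have the same shape")
    squared_error = (original - reconstructed) ** 2
    # Average over every axis except the first one (the image index).
    return squared_error.reshape(len(original), -1).mean(axis=1)

rng = np.random.default_rng(0)
images = rng.random((4, 8, 8))  # synthetic "original" images

# Synthetic "reconstructions": the noise amplitudes differ by a factor of 10,
# so the resulting MSE values differ by roughly a factor of 100.
good = images + rng.normal(0.0, 0.001, images.shape)
poor = images + rng.normal(0.0, 0.01, images.shape)

ratio = per_image_mse(images, poor).mean() / per_image_mse(images, good).mean()
orders_of_magnitude = np.log10(ratio)
```

Comparing the logarithm of the MSE ratio, as in `orders_of_magnitude`, is one simple way to express a statement such as "improved by approximately two orders of magnitude".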
George Dialektakis, August 2021
Google Summer of Code 2021 Report
TL;DR: sparse autoencoders with min-max normalization have the best performance
Storage is one of the main limiting factors to the recording of information from proton-proton collision events at the Large Hadron Collider (LHC), at CERN in Geneva. Hence, the ATLAS experiment at the LHC uses a so-called trigger system, which selects and transfers interesting events to the data storage system while filtering out the rest. However, if interesting events are buried in very large backgrounds and are difficult for the trigger system to identify as signal, they will be discarded together with the background. To alleviate this problem, different compression algorithms are already in use to reduce the size of the data that is recorded. One of those state-of-the-art algorithms is an autoencoder network, which tries to approximate the identity function, f(x) = x: given some input data, its encoder network creates a lower-dimensional representation of those data in a latent space. This way, when collisions happen in the ATLAS detector, the encoder is run on the produced data and only the latent space representation is saved. The decoder network can then use this latent representation offline to reconstruct the original data. The goal of this project is to experiment in depth with different types of autoencoders for data compression and to optimize their performance in reconstructing the ATLAS event data. To this end, three kinds of autoencoders are proposed: the Standard Autoencoder, the Variational Autoencoder, and the Sparse Autoencoder. These autoencoders are thoroughly tested using different parameters and data normalization techniques, as our ultimate goal is to obtain the best possible reconstructions of the original event data.
The proposed implementations will be a decisive contribution towards future testing and analysis for the ATLAS experiment at CERN and will help overcome the obstacle of needing much more storage space than in the past due to the increase in the size of the data generated by the continuous proton-proton collision events in CERN’s Large Hadron Collider.
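The encode, store and decode workflow with min-max normalization can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, all function names are hypothetical, and a linear autoencoder fitted with a singular value decomposition (equivalent to PCA) stands in for the neural networks studied in the report.

```python
import numpy as np

def min_max_normalize(data):
    """Scale each column to [0, 1]; also return the offset and span to undo it."""
    lo = data.min(axis=0)
    hi = data.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return (data - lo) / span, lo, span

def fit_linear_autoencoder(normalized, latent_dim):
    """Fit the best linear encoder/decoder pair (PCA) for a given latent size."""
    mean = normalized.mean(axis=0)
    # The rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(normalized - mean, full_matrices=False)
    return mean, vt[:latent_dim]

def encode(normalized, mean, components):
    """Project normalized data onto the latent space."""
    return (normalized - mean) @ components.T

def decode(latent, mean, components):
    """Map latent vectors back to the normalized input space."""
    return latent @ components + mean

rng = np.random.default_rng(1)
# Synthetic 4-variable "events" that lie close to a 3-dimensional subspace.
hidden = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 4))
events = (hidden @ mixing) * 50.0 + 200.0 + rng.normal(scale=0.01, size=(500, 4))

normalized, lo, span = min_max_normalize(events)
mean, components = fit_linear_autoencoder(normalized, latent_dim=3)
latent = encode(normalized, mean, components)  # the part that would be stored
restored = decode(latent, mean, components) * span + lo  # undo the normalization
max_abs_error = np.abs(restored - events).max()
```

Only `latent` (three numbers per event instead of four) would need to be stored, together with the decoder parameters and the normalization constants, which are shared by all events.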
Love Kildetoft, Lund University, June 2021
Bachelor's thesis
Host link: https://lup.lub.lu.se/luur/download?func=downloadFile&recordOId=9049610&fileOId=9049621
TL;DR:
Data compression methods allow more data to be stored within a certain storage framework while still keeping the characteristics of the data in question. At the Large Hadron Collider on the grounds of CERN in Switzerland, limited data storage capability is and has always been an urgent problem. At the ATLAS experiment, one technique which allows researchers to save more data within the same storage framework is so-called trigger level analysis (TLA). This thesis work explores float-truncation-based data compression as an improvement to TLA. It is shown that this compression technique is promising for compressing several variables from TLA datasets, although it generates artifacts in the compressed distributions. This phenomenon is known as double quantization. It is explained how this effect is more or less unavoidable, as it is always present when discretizing a continuous distribution several times in succession. Furthermore, this thesis work explores the applicability of chaining float-truncation techniques with machine learning techniques (so-called autoencoder compression). It is shown that the original dataset is still well represented after applying the two techniques in succession.
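Float truncation and the double quantization effect can be illustrated with a short sketch. It assumes that truncation means zeroing the least significant mantissa bits of IEEE 754 single-precision values, and that histogram binning acts as the second discretization; the thesis itself may use a different scheme. The function name `truncate_float32` is hypothetical.

```python
import numpy as np

MANTISSA_BITS = 23  # explicit mantissa bits in an IEEE 754 float32

def truncate_float32(values, keep_bits):
    """Zero all but the `keep_bits` most significant mantissa bits."""
    if not 0 <= keep_bits <= MANTISSA_BITS:
        raise ValueError("keep_bits must be between 0 and 23")
    bits = np.asarray(values, dtype=np.float32).view(np.uint32)
    drop = MANTISSA_BITS - keep_bits
    # Mask with ones everywhere except the `drop` lowest bits.
    mask = np.uint32((0xFFFFFFFF << drop) & 0xFFFFFFFF)
    return (bits & mask).view(np.float32)

rng = np.random.default_rng(2)
samples = rng.uniform(1.0, 2.0, size=10_000).astype(np.float32)

# First discretization: keep only 3 mantissa bits (a grid of step 1/8 in [1, 2)).
truncated = truncate_float32(samples, keep_bits=3)

# Second discretization: a histogram whose bin width (0.1) does not match the grid.
counts_before, _ = np.histogram(samples, bins=10, range=(1.0, 2.0))
counts_after, _ = np.histogram(truncated, bins=10, range=(1.0, 2.0))
empty_bins_after = int(np.count_nonzero(counts_after == 0))
max_relative_error = float(np.max((samples - truncated) / samples))
```

The untruncated histogram is roughly flat, whereas the truncated one has empty and overfilled bins because the truncation grid and the bin edges do not line up. The relative error of each value stays below 2^-3 when 3 mantissa bits are kept.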
Honey Gupta, Lund University, August 2020
Google Summer of Code 2020 Report
Host link: https://medium.com/@hn.gpt1/deep-compression-for-high-energy-physics-data-google-summer-of-code20-3dea5acc7bcf https://drive.google.com/file/d/159QRCM8-c3FUy-y6c0-SHkgtbNJNr9H8/view
TL;DR: event-level compression instead of jet-level, provided scripts for running on cluster
Abstract: At CERN’s Large Hadron Collider (LHC), proton collisions are performed to study the fundamental particles and their interactions. To detect and record the outcome of these collisions, multiple detectors with different focus points have been built. The ATLAS detector is one such general-purpose detector at the LHC. There are approximately 1.7 billion events or collisions occurring inside the ATLAS detector each second, and storage is one of the main limiting factors to the recording of information from these events. To filter out irrelevant information, the ATLAS experiment uses trigger systems which select and send interesting events to the data storage system while throwing away the rest. Storage of these events is limited by the amount of information to be stored, and a reduction of the event size can allow for searches that were not previously possible. This project aims to investigate the use of deep neural autoencoders to compress event-level data generated by the HEP detector. The existing preliminary work investigates deep-compression algorithms on jets, which are the most common type of particle. The work shows promising results towards using deep compression on HEP data. We build upon the existing work and extend the compression algorithm to event-level data, which means that the data contains information for multiple particles rather than just jets. We experiment with two open-sourced datasets and perform ablation studies to investigate the effect of deep compression on different particles from multiple processes.
Eric Wulff, Lund University, February 2020
Master's thesis
Host link: https://lup.lub.lu.se/luur/download?func=downloadFile&recordOId=9004751&fileOId=9004752
TL;DR: Initial work on the idea of jet-level compression and definition of the network parameters using rough hyperparameter scan
Abstract: Current technological limitations make it impossible to store the enormous amount of data produced from proton-proton collisions by the ATLAS detector at CERN’s Large Hadron Collider. Therefore, specialised hardware and software are being used to decide what data, or which proton-proton collision events, to save and which to discard. A reduction in the storage size of each collision event is desirable as it would allow for more data to be saved and thereby make a larger set of physics analyses possible. The focus of this thesis is to understand whether it is possible to reduce the storage size of the previously mentioned collision events using machine learning techniques for dimensionality reduction. This has never before been tried within the ATLAS experiment and is an interesting forward-looking study for future experiments. Specifically, autoencoder neural networks are used to compress a number of variables into a smaller latent space used for storage. Different neural network architectures with varying width and depth are explored. The AEs are trained and validated on experimental data from the ATLAS detector and their performance is tested on an independent signal Monte-Carlo sample. The AEs are shown to successfully compress and decompress simple hadron jet data, and preliminary results indicate that the reconstruction quality is good enough for certain applications where high precision is not paramount. The AEs are evaluated by their reconstruction error, the relative error of each compressed variable and the ability to retain good resolution of a dijet mass signal (from the previously mentioned Monte-Carlo sample) after encoding and decoding hadron jet data.
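Two of the evaluation quantities named above, the relative error of a variable and the dijet mass, can be sketched as follows. The jet parametrisation (transverse momentum, pseudorapidity, azimuthal angle, energy) and all names are assumptions made for illustration, and the "reconstructed" jet is a hand-made 1% distortion rather than the output of an autoencoder.

```python
import numpy as np

def relative_error(original, reconstructed):
    """Return (reconstructed - original) / original, assuming non-zero originals."""
    original = np.asarray(original, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    return (reconstructed - original) / original

def four_vector(jet):
    """Convert a (pt, eta, phi, energy) jet into (E, px, py, pz)."""
    pt, eta, phi, energy = jet
    return np.array([energy, pt * np.cos(phi), pt * np.sin(phi), pt * np.sinh(eta)])

def dijet_mass(jet_a, jet_b):
    """Invariant mass of a two-jet system, in the units of the inputs."""
    total = four_vector(jet_a) + four_vector(jet_b)
    mass_squared = total[0] ** 2 - np.sum(total[1:] ** 2)
    # Clip tiny negative values caused by floating-point rounding.
    return float(np.sqrt(max(mass_squared, 0.0)))

# Two back-to-back massless jets: the invariant mass is the summed energy, 100.
jet_a = (50.0, 0.0, 0.0, 50.0)
jet_b = (50.0, 0.0, np.pi, 50.0)
true_mass = dijet_mass(jet_a, jet_b)

# Hypothetical reconstruction of jet_a with pt and energy shifted by 1%.
jet_a_reconstructed = (50.5, 0.0, 0.0, 50.5)
reconstructed_mass = dijet_mass(jet_a_reconstructed, jet_b)

variable_errors = relative_error(jet_a[::3], jet_a_reconstructed[::3])  # pt and energy
mass_error = float(relative_error(true_mass, reconstructed_mass))
```

In this toy case a 1% shift in one jet's transverse momentum and energy changes the dijet mass by about 0.5%, which shows how per-variable errors propagate into a derived physics quantity.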
Erik Wallin, Lund University, June 2020
Bachelor's thesis
Host link: https://lup.lub.lu.se/luur/download?func=downloadFile&recordOId=9012882&fileOId=9012886
TL;DR: we have good results from the float truncation as well, so this is one of our “benchmarks to beat”
Abstract: Limited data storage capability is a large obstacle to saving data in high-energy particle physics. One method of partially circumventing these limitations is trigger level analysis (TLA) as used by the ATLAS experiment. The efficiency of TLA can be further increased by effective data compression. One class of artificial neural networks, called autoencoders, may be used for data compression. This thesis further tests the use of autoencoders for the compression of TLA data, while showing that it may be difficult to generalize between different datasets. The processing resources needed to compress TLA data in real time fit well within the available computing constraints, and the memory usage is predictable. The sequential use of different compression techniques, namely so-called float truncation followed by autoencoder compression, is evaluated. It is shown that autoencoders have the same potential on both uncompressed and float-truncated data. Compression artifacts from float truncation, called double quantization, are also explained and analytically predicted.
Steb Åstrand, Lund University, March 2022
Bachelor's thesis
Host link: https://lup.lub.lu.se/student-papers/search/publication/9075881
TL;DR: jet correlations don’t help that much with compression
Abstract: Situated in Geneva, Switzerland, the Large Hadron Collider is the largest particle accelerator in the world, and as such, its operation carries with it some of the greatest technical challenges ever faced. Among them are the huge demands put on data storage capacity by experiments in particle physics, both in terms of rate and volume of data. Several systems are employed to manage and reduce the flow of data generated at the collider experiment stations. This comes at the cost of a reduced amount of material available for study.
This thesis analyses a relatively novel method of compressing, and thereby reducing the storage requirements of, data describing jets - showers of particles created in collisions between protons in the ATLAS experiment at the Large Hadron Collider. The main tool used for this compression is an artificial neural network of a type called an autoencoder. Such compression has previously been shown to be possible on single jets. As a continuation of that work, this thesis investigates whether it is possible to compress groups of jets with better results than when compressing them individually.
To that end, several autoencoder models are trained on jet groups of different configurations. These autoencoders are shown to be able to replicate the results of previous, single-jet studies, but the errors introduced during compression increase when jets are compressed in a group. This holds true for jets from the same proton-proton collision as well as jets randomly selected from a larger dataset. It is demonstrated that groups specifically made to contain jets with almost identical values of one variable can be compressed at a higher ratio than individual jets, with only slightly increased errors. However, this process carries with it the requirement of access to a large dataset, which is not possible if applied in a particle physics experiment, where data is gathered detection by detection.