This repository contains the project carried out during Google Summer of Code (GSoC) 2024 by Lucas Martin Garcia under the mentorship of Jessica Krick and Shoubaneh Hemmati. The main goal of the project is to fill in missing values in light curves of Active Galactic Nuclei (AGN). These light curves are essential for astronomical research but often contain gaps caused by observing constraints and equipment limitations. The approach uses a custom Bidirectional Recurrent Neural Network (BRNN) designed to predict and fill in the missing values in AGN light curves collected by the Zwicky Transient Facility (ZTF) and WISE telescopes.
This project was developed during Google Summer of Code 2024 by contributor Lucas Martin Garcia and mentors Jessica Krick and Shoubaneh Hemmati.
A blog post that covers the development process of this project has been published. You can read the blog post through the following link:
Read the Blog Post About the GSOC 2024 Project
All the data used for training, the datasets generated by the models and methods, and the best-trained BRNN model are available for download. You can access these resources through the following Google Drive link:
Access Project Data and Models on Google Drive
Please refer to the following Python notebooks in the repository for detailed code, analysis, and visualization related to the BRNN model:
- Kauffman_custom_BRNN_HyperParameter_Selection.py: Contains the code for hyperparameter selection and fine-tuning of the model.
- Kauffman_custom_BRNN_train.py: Contains the implementation of the BRNN model and the code for the training process of the final model.
- Kauffman_custom_BRNN_test.py: Contains the evaluation metrics and testing procedures used to assess the performance of the BRNN model.
- Kauffman_custom_BRNN_data_generation.py: Contains the code for generating the final AGN dataset with missing values.
This custom Bidirectional Recurrent Neural Network (BRNN) is specifically designed for handling and imputing missing values in light curve data of Active Galactic Nuclei (AGN) from the Kauffman dataset. The light curves in this dataset are obtained from the Zwicky Transient Facility (ZTF) and the WISE telescopes, capturing observations across multiple bands, specifically zg, zr, zi, W1, and W2.
Each input sample to the model represents the multi-band light curves of an AGN. These light curves encapsulate the brightness (flux) of the AGN over time, recorded in different bands. The model processes these sequences to predict the flux values at missing time points.
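As an illustration of such an input sample, the sketch below stores one AGN as a (time steps x bands) array, with NaN marking epochs that have no observation, plus a boolean mask of observed points. The shared time grid, the array layout, the zero placeholder at the gaps, and all flux values are assumptions made for illustration, not the repository's actual data format.

```python
import numpy as np

BANDS = ["zg", "zr", "zi", "W1", "W2"]  # bands named in the text

# Hypothetical light curve on a shared time grid: one row per time step,
# one column per band. NaN marks an epoch with no observation in that band.
# All flux values are made up for illustration.
flux = np.array([
    [1.00,   1.20,   np.nan, 0.80,   0.90],
    [1.10,   np.nan, np.nan, 0.82,   np.nan],
    [np.nan, 1.25,   1.30,   np.nan, 0.95],
    [1.05,   1.22,   np.nan, 0.85,   np.nan],
    [np.nan, np.nan, 1.28,   np.nan, 0.97],
    [1.08,   1.18,   1.26,   0.83,   0.92],
])

observed = ~np.isnan(flux)                   # True where a measurement exists
model_input = np.where(observed, flux, 0.0)  # assumed placeholder value at the gaps

# Number of gaps that would have to be filled in each band.
missing_per_band = {
    band: int(n) for band, n in zip(BANDS, (~observed).sum(axis=0))
}
print(missing_per_band)
```

Keeping the mask alongside the filled array lets later steps distinguish real measurements from placeholders.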
The core of the model consists of Long Short-Term Memory (LSTM) layers. These layers are adept at processing time series data due to their ability to remember long-term dependencies in the data, making them ideal for the sequential and temporal nature of light curves. Here’s a breakdown of how data flows through these layers:
- Forward LSTM Layer: Processes the light curve data from the beginning to the end. This layer captures the forward temporal dynamics, meaning it learns from past and present data to make predictions about the future.
- Backward LSTM Layer: Processes the light curve data from the end to the beginning. In contrast to the forward layer, it captures backward temporal dynamics, learning from later data points to predict earlier values. This is particularly useful for imputing a missing observation when data recorded after it are available.
The outputs from these two layers are then concatenated. This concatenated output contains a rich representation of the light curve that incorporates both past and future context, enhancing the model's ability to predict missing values accurately.
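The two passes and the concatenation can be sketched in NumPy as follows. This is a minimal illustration with a hand-written LSTM cell and random, untrained weights. The hidden size, sequence length, and helper names (`lstm_pass`, `init_params`) are assumptions, and the repository's own implementation may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_pass(x, W, U, b, hidden):
    """Run one LSTM over x of shape (T, n_features); return states of shape (T, hidden)."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    outputs = []
    for x_t in x:
        gates = W @ x_t + U @ h + b        # stacked pre-activations, shape (4 * hidden,)
        i, f, g, o = np.split(gates, 4)    # input, forget, candidate, output
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)

def init_params(rng, n_features, hidden):
    """Random (untrained) weights, for illustration only."""
    W = rng.normal(scale=0.1, size=(4 * hidden, n_features))
    U = rng.normal(scale=0.1, size=(4 * hidden, hidden))
    b = np.zeros(4 * hidden)
    return W, U, b

rng = np.random.default_rng(0)
T, n_bands, hidden = 10, 5, 8              # hypothetical sizes
x = rng.normal(size=(T, n_bands))          # stand-in for one multi-band light curve

fwd_params = init_params(rng, n_bands, hidden)
bwd_params = init_params(rng, n_bands, hidden)

h_fwd = lstm_pass(x, *fwd_params, hidden)
# Backward pass: feed the reversed sequence, then flip the states back
# so that row t of both outputs refers to the same time step.
h_bwd = lstm_pass(x[::-1], *bwd_params, hidden)[::-1]

h_bi = np.concatenate([h_fwd, h_bwd], axis=1)   # shape (T, 2 * hidden)
print(h_bi.shape)
```

At each time step, the first half of `h_bi` depends only on that step and earlier ones, and the second half only on that step and later ones.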
After the Bidirectional LSTM layers, the data flows through the additional network components listed below. The last two items describe how the network is trained rather than layers of the network:
- Concatenation: The outputs of the forward and backward LSTMs are concatenated along the feature dimension. This step merges the information from both time directions, providing a comprehensive view of the time series.
- Intermediate Fully Connected (FC) Layer: This layer acts as an additional processing step that transforms the concatenated features into a new feature space, potentially making the features more suitable for the final prediction task.
- Output Fully Connected Layer: The final FC layer maps the features from the intermediate layer to the predicted output flux values for the missing time points.
- Activation Function (ReLU): The Rectified Linear Unit (ReLU) activation function introduces non-linearity into the network, helping to model more complex patterns in the data.
- Loss Function: Mean Squared Error (MSE) is employed to quantify the difference between the predicted flux values and the actual flux values at the known time points. This function is effective for regression tasks like ours, where the goal is to minimize the error between predicted and true values.
- Optimizer: The Adam optimizer is used for adjusting the network weights based on the loss gradients.
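The components listed above can be sketched as a small NumPy forward pass followed by an MSE restricted to the known time points. The layer sizes, the placement of the ReLU between the two FC layers, and the names `head` and `masked_mse` are assumptions for illustration. The Adam update itself is omitted, since it is normally provided by the deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)
T, bi_dim, fc_dim, n_bands = 10, 16, 32, 5   # hypothetical sizes

# Stand-in for the concatenated forward/backward LSTM output.
h_bi = rng.normal(size=(T, bi_dim))

# Random (untrained) weights for the two fully connected layers.
W1 = rng.normal(scale=0.1, size=(bi_dim, fc_dim))
b1 = np.zeros(fc_dim)
W2 = rng.normal(scale=0.1, size=(fc_dim, n_bands))
b2 = np.zeros(n_bands)

def head(h):
    """Intermediate FC + ReLU (placement assumed), then the output FC layer."""
    hidden = np.maximum(h @ W1 + b1, 0.0)
    return hidden @ W2 + b2          # one predicted flux per band and time step

def masked_mse(pred, target, observed):
    """Mean squared error computed only where a true flux value is known."""
    diff = (pred - target)[observed]
    return float(np.mean(diff ** 2))

pred = head(h_bi)

# Made-up targets with roughly 30% of the points missing (NaN).
target = rng.normal(size=(T, n_bands))
observed = rng.random((T, n_bands)) > 0.3
target[~observed] = np.nan

loss = masked_mse(pred, target, observed)
print(pred.shape, loss)
```

Restricting the loss to observed points keeps the NaN entries from propagating into the training signal.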
During training, the model learns by adjusting its parameters to minimize the loss function. The training involves feeding batches of data through the model, calculating the loss, and updating the model parameters via backpropagation. For prediction, the trained model takes incomplete light curves as input and outputs the imputed full light curves, filling in the missing values based on the learned temporal dynamics.
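The prediction step can be illustrated with a short sketch that merges model output into an incomplete light curve. It assumes that observed values are kept unchanged and only the gaps are replaced by predictions. The `impute` helper and all numbers are hypothetical, and `predicted` stands in for the output of the trained model.

```python
import numpy as np

def impute(flux, predicted):
    """Keep observed fluxes and take the model's prediction only at the gaps."""
    observed = ~np.isnan(flux)
    return np.where(observed, flux, predicted)

# Hypothetical 3-step, 2-band light curve; NaN marks a missing value.
flux = np.array([[1.0,    np.nan],
                 [np.nan, 2.0],
                 [3.0,    np.nan]])

# Stand-in for the trained model's output over the full grid.
predicted = np.array([[0.9, 1.5],
                      [2.1, 1.9],
                      [3.2, 2.5]])

completed = impute(flux, predicted)
print(completed)
```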
Please refer to the following Python notebooks in the repository for detailed code, analysis, and visualization related to the OIII Luminosity model:
- Kauffman_O3_prediction.py: Contains the code for the neural network model for OIII luminosity prediction using the generated datasets of AGNs.
- Kauffman_Basic_Interpolations.py: Contains the implementation of basic models such as Linear interpolation and KNN to evaluate their performance and obtain the final dataset of AGNs.
This section of the project develops a neural network model that predicts the OIII luminosity of AGN from light curve data interpolated by both linear interpolation and our custom BRNN model. The goal is to assess how well interpolated light curves serve as input features for predicting astrophysical quantities such as the OIII luminosity.
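For reference, the linear interpolation baseline can be sketched for a single band with `numpy.interp`. The `linear_fill` helper, the epochs, and the flux values are hypothetical. Holding the nearest observed value constant beyond the first and last observation is the behavior of `numpy.interp`, not necessarily what the repository does at the edges.

```python
import numpy as np

def linear_fill(times, flux):
    """Fill NaN gaps in one band by linear interpolation between observed points."""
    observed = ~np.isnan(flux)
    # np.interp holds the first/last observed value constant outside the observed range.
    return np.interp(times, times[observed], flux[observed])

times = np.array([0.0, 1.0, 2.0, 3.0, 4.0])          # hypothetical epochs
flux = np.array([1.0, np.nan, 3.0, np.nan, np.nan])  # made-up single-band fluxes

filled = linear_fill(times, flux)
print(filled)  # [1. 2. 3. 3. 3.]
```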
The neural network designed for OIII luminosity prediction consists of several layers structured to process the interpolated light curve data effectively:
- Input Layer: Accepts the interpolated flux values from AGN light curves as input.
- Hidden Layers: Multiple dense layers with ReLU activations to introduce non-linearity and facilitate feature learning.
- Output Layer: A single neuron outputting the predicted OIII luminosity value.
The architecture aims to capture the underlying patterns in the light curve data that correlate with the OIII luminosity, using the following configuration:
- First Hidden Layer: 4096 neurons
- Second Hidden Layer: 4096 neurons
- Output Layer: 1 neuron
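To give a sense of scale, the sketch below counts the trainable parameters of this configuration and runs a forward pass through a narrower network with the same structure. The input size of 500 flux values and the helper names are assumptions, since the input dimension is not stated here. The demo uses 64-unit hidden layers only to keep the example lightweight.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random (untrained) weights and zero biases for consecutive dense layers."""
    return [
        (rng.normal(scale=0.01, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

def forward(params, x):
    """Dense layers with ReLU on the hidden layers and a linear output."""
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = params[-1]
    return x @ W + b

def param_count(sizes):
    """Weights plus biases for each pair of consecutive layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

n_inputs = 500                                  # hypothetical input size
total = param_count([n_inputs, 4096, 4096, 1])  # configuration from the text
print(total)  # 18837505

# Forward pass through a narrower stand-in network with the same layout.
rng = np.random.default_rng(0)
demo_params = init_mlp(rng, [n_inputs, 64, 64, 1])
batch = rng.normal(size=(8, n_inputs))          # 8 made-up interpolated light curves
pred = forward(demo_params, batch)              # one luminosity value per sample
print(pred.shape)
```

Most of the parameters sit in the 4096 x 4096 connection between the two hidden layers, regardless of the input size.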
The model is trained using the Mean Squared Error (MSE) loss function and optimized with the Adam optimizer. Training involves feeding batches of interpolated light curve data (both from linear interpolation and the BRNN model) and adjusting model weights to minimize prediction errors.
Please refer to the following Python notebooks in the repository for detailed code, analysis, and visualization related to the Autoencoder model:
- Kauffman_Autoencoder.py: Contains the code for the training process of the Autoencoder model (encoder and decoder) to obtain and save the necessary encoder parameters that will be integrated into the BRNN model.
- Kauffman_custom_BRNN_Autoencoder.py: Contains the implementation of training and testing the new custom BRNN model that incorporates the pretrained encoder.
This section discusses the implementation of an autoencoder designed to extract features from AGN light curve data. The extracted features are aimed at improving the efficiency of downstream tasks such as classification and prediction of AGN properties.
The autoencoder is structured into two main components:
- Encoder: Compresses the input light curve data into a lower-dimensional feature space.
- Decoder: Attempts to reconstruct the input data from the compressed feature space, ensuring that the most significant features are retained.
The features extracted by the pretrained encoder are fed into the Bidirectional Recurrent Neural Network (BRNN), giving it a richer representation of the input data. This is a hybrid approach: the autoencoder learns its features without labels (unsupervised), and the BRNN then uses them in its supervised prediction task, with the aim of improving performance.
The autoencoder is trained using a subset of AGN light curves, with the objective to minimize reconstruction error, measured by the Mean Squared Error (MSE) between the original and reconstructed data.
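The encoder/decoder structure and the reconstruction objective can be illustrated with a deliberately small example. The sketch below uses a linear encoder and decoder trained by plain gradient descent on random stand-in data. These choices, the sizes, and the function names are assumptions made for brevity, as the actual architecture and optimizer of the repository's autoencoder are not described here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_curves, n_inputs, n_latent = 64, 20, 4     # hypothetical sizes
X = rng.normal(size=(n_curves, n_inputs))    # stand-in for flattened light curves

W_enc = rng.normal(scale=0.1, size=(n_inputs, n_latent))  # encoder weights
W_dec = rng.normal(scale=0.1, size=(n_latent, n_inputs))  # decoder weights

def encode(x):
    return x @ W_enc      # compress to the lower-dimensional feature space

def decode(z):
    return z @ W_dec      # reconstruct the input from the features

def reconstruction_mse(x):
    return float(np.mean((decode(encode(x)) - x) ** 2))

losses = [reconstruction_mse(X)]
lr = 0.1
for _ in range(500):
    Z = encode(X)
    grad_out = 2.0 * (decode(Z) - X) / X.size   # gradient of the MSE w.r.t. the reconstruction
    grad_dec = Z.T @ grad_out
    grad_enc = X.T @ (grad_out @ W_dec.T)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
    losses.append(reconstruction_mse(X))

features = encode(X)   # what a downstream model would receive
print(features.shape)
```

After training, only `encode` is needed downstream; the decoder exists to make the reconstruction error measurable.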