This repository contains the code and documentation for my entry in the Elucidata Global AI Hackathon '25 Kaggle competition. The primary focus is a deep dive into building, debugging, and optimizing a complete PyTorch pipeline for a complex multi-target regression task on histology images.
The most significant achievement of this project was improving the initial competition score by 28.7% through rigorous post-competition analysis and code refactoring.
View the Polished Kaggle Notebook | Read the Full Story on my Blog
| Metric | Score |
|---|---|
| Initial Score (at Deadline) | 0.19677 |
| Final Score (Post-Debugging) | 0.25311 |
| Relative Improvement | +28.7% |
This project serves as a case study in the importance of data integrity and systematic debugging. The final, high-performing solution was achieved by identifying and fixing critical bugs in the initial data loading and processing pipeline.
The goal of the competition was to predict the spatial cellular organization from histology images. This translated to a multi-target regression problem: given a small image patch from a larger tissue slide, the model had to predict the values of 35 distinct biological markers.
The primary metric was the average per-spot Spearman correlation coefficient.
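A minimal sketch of how that metric can be computed, assuming predictions and ground truth arrive as `(n_spots, 35)` NumPy arrays; the authoritative scoring is done by Kaggle's evaluator, so treat this as illustrative only.

```python
import numpy as np
from scipy.stats import spearmanr

def mean_per_spot_spearman(preds: np.ndarray, targets: np.ndarray) -> float:
    """Average Spearman correlation across spots (rows).

    Both arrays are assumed to have shape (n_spots, n_markers),
    e.g. (N, 35) for this competition.
    """
    scores = []
    for p, t in zip(preds, targets):
        rho, _ = spearmanr(p, t)  # rank correlation over the 35 markers
        if not np.isnan(rho):     # guard against constant rows
            scores.append(rho)
    return float(np.mean(scores))
```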
This project showcases a complete, end-to-end deep learning workflow:
- Data Pipeline:
  - Efficient Data Handling: A custom `DataFactory` class was built to efficiently process metadata from large HDF5 files using vectorized Pandas operations.
  - On-the-Fly Patch Extraction: A custom PyTorch `Dataset` class (`HESpotDataset`) extracts image patches on the fly, avoiding the need to store millions of small files. (A minimal sketch of this dataset appears after this list.)
  - Robust Caching: An LRU (Least Recently Used) cache was implemented to keep frequently accessed whole-slide images in memory, significantly speeding up epoch times.
- Modeling & Training:
  - Model Architecture: Fine-tuned a pre-trained `ResNet18` model, replacing the final layer with a regression head that includes Dropout for regularization.
  - Modern Training Practices: Utilized the `AdamW` optimizer, `MSELoss` for regression, and a `ReduceLROnPlateau` learning rate scheduler. (A condensed training sketch also follows the list.)
  - Robust Training Loop: A modular `Trainer` class encapsulates the entire training and validation process, including early stopping based on the validation Spearman score and gradient clipping for stability.
- Systematic Debugging:
  - The core of this project was a deep dive into debugging a "silent" failure: the model trained without raising errors, yet was not learning effectively.
  - Diagnosed and fixed critical data pipeline bugs related to case-sensitive column names, incorrect data type handling (`numpy.object_`), and flawed data normalization.
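As referenced above, here is a minimal sketch of the on-the-fly patch extraction with LRU slide caching. The HDF5 layout, column names (`slide_id`, `x`, `y`), patch size, and `load_slide` helper are illustrative assumptions; the notebook's actual `HESpotDataset` differs in detail.

```python
import functools
import h5py
import numpy as np
import torch
from torch.utils.data import Dataset

@functools.lru_cache(maxsize=4)  # keep the few most recently used slides in memory
def load_slide(h5_path: str, slide_id: str) -> np.ndarray:
    # Assumed layout: one dataset per slide, keyed by slide_id.
    with h5py.File(h5_path, "r") as f:
        return f[slide_id][...]  # full H&E image as an (H, W, 3) array

class HESpotDataset(Dataset):
    """Extracts fixed-size patches around spot coordinates on the fly."""

    def __init__(self, h5_path, spots_df, targets=None, patch=64):
        self.h5_path = h5_path
        self.spots = spots_df.reset_index(drop=True)  # slide_id, x, y per spot
        self.targets = targets                        # (n_spots, 35) or None
        self.half = patch // 2

    def __len__(self):
        return len(self.spots)

    def __getitem__(self, idx):
        row = self.spots.iloc[idx]
        slide = load_slide(self.h5_path, row["slide_id"])
        y, x = int(row["y"]), int(row["x"])
        # Boundary handling for spots near slide edges omitted for brevity.
        patch = slide[y - self.half:y + self.half,
                      x - self.half:x + self.half]
        img = torch.from_numpy(patch.astype(np.float32) / 255.0).permute(2, 0, 1)
        if self.targets is None:
            return img
        return img, torch.from_numpy(self.targets[idx].astype(np.float32))
```

Caching at the slide level works because consecutive spots usually come from the same slide, so most `__getitem__` calls hit an image that is already in memory.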
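And a condensed sketch of the model and optimization setup described above. Hyperparameter values (learning rate, dropout probability, clipping norm, scheduler patience) are placeholders, and the weights API assumes torchvision ≥ 0.13.

```python
import torch
import torch.nn as nn
from torchvision import models

N_TARGETS = 35  # biological markers per spot

# Pre-trained ResNet18 with the classifier swapped for a regression head.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Sequential(
    nn.Dropout(p=0.3),                        # regularization before the head
    nn.Linear(model.fc.in_features, N_TARGETS),
)

criterion = nn.MSELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=2  # monitors validation Spearman
)

def train_step(batch):
    imgs, targets = batch
    optimizer.zero_grad()
    loss = criterion(model(imgs), targets)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # stability
    optimizer.step()
    return loss.item()

# After each validation pass: scheduler.step(val_spearman)
```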
- Clone the Repository:

  ```bash
  git clone https://github.com/your-username/kaggle-histology-image-regression.git
  cd kaggle-histology-image-regression
  ```

- Set up the Environment: It is recommended to use a virtual environment.

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  ```

  Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```

- Download the Data: Download the competition data from the Elucidata Global AI Hackathon '25 Kaggle Page and place it in a `/kaggle/input/` directory, or update the file paths in the notebook accordingly.

- Run the Notebook: Launch Jupyter and open the `your-notebook-name.ipynb` file, then run the cells sequentially to see the data preparation, training, and prediction process.

  ```bash
  jupyter notebook
  ```
This project was a powerful lesson in moving beyond baseline models and into the critical, real-world skill of pipeline optimization. The main takeaways were:
- Never Trust Your Data Pipeline: Always write explicit diagnostic checks to verify data types, ranges, and shapes at every step; a short example follows this list.
- Quantifiable Improvement is a Powerful Story: The journey from a working model to a high-performing model is where the most valuable learning occurs.
- Persistence Pays Off: The final 28.7% score improvement was achieved after the competition had already ended, driven by a desire to understand the "why" behind the initial results.
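To make the first takeaway concrete, below is the kind of lightweight sanity check that would have caught this project's silent failures early. The batch shapes and the `[0, 1]` normalization range are assumptions matching the pipeline sketched above, not the notebook's exact code.

```python
import torch

def sanity_check(imgs: torch.Tensor, targets: torch.Tensor) -> None:
    """Fail loudly on the kinds of silent bugs described above."""
    # Dtype checks: numpy.object_ columns can masquerade as numeric data.
    assert imgs.dtype == torch.float32, f"images are {imgs.dtype}"
    assert targets.dtype == torch.float32, f"targets are {targets.dtype}"
    # Range check: assumes images were scaled to [0, 1], not left in [0, 255].
    assert imgs.min() >= 0.0 and imgs.max() <= 1.0, "images not normalized"
    # Shape checks: (batch, 3, H, W) patches and 35 markers per spot.
    assert imgs.ndim == 4 and imgs.shape[1] == 3, f"bad image shape: {tuple(imgs.shape)}"
    assert targets.shape[1] == 35, f"expected 35 markers, got {targets.shape[1]}"
    # NaN check: a single NaN silently poisons MSELoss.
    assert not torch.isnan(targets).any(), "NaNs in targets"

# Example: run on the first batch of each epoch.
# imgs, targets = next(iter(train_loader))
# sanity_check(imgs, targets)
```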