## Project Overview

This repository contains the code and resources for a machine learning project focused on wafer fault detection. The goal of the project is to build a predictive model that classifies wafers as either working (+1) or faulty (-1) based on sensor data. The project involves data validation, database operations, model training, and deployment on AWS Elastic Beanstalk.
## Table of Contents

- Project Overview
- Table of Contents
- Problem Statement
- Data Description
- Data Validation
- Data Insertion in Database
- Model Training
- Prediction
- Deployment
- Project Structure
- Getting Started
- Dependencies
- Usage
- Contributing
- License
## Problem Statement

The aim of this project is to develop a machine learning model capable of predicting whether a wafer is working (+1) or faulty (-1) based on sensor data. The project involves data preprocessing, model training, and deployment of the predictive model on AWS Elastic Beanstalk.
## Data Description

The dataset contains wafer sensor data in CSV format. Each file represents a batch of wafers, with each row representing a wafer and its corresponding sensor values. The last column of the data indicates whether the wafer is working (+1) or faulty (-1).
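As a concrete illustration, a batch file of this shape can be loaded and split into features and labels as follows (the column names here are hypothetical; the real schema lives in `schema.json`):

```python
import io
import pandas as pd

# Hypothetical mini-batch in the same shape as a wafer CSV:
# one row per wafer, sensor columns, and a final +1/-1 label column.
sample_csv = io.StringIO(
    "Wafer,Sensor-1,Sensor-2,Sensor-3,Good/Bad\n"
    "Wafer-801,0.12,-1.45,0.03,1\n"
    "Wafer-802,0.98,0.22,-0.77,-1\n"
)

batch = pd.read_csv(sample_csv)
X = batch.drop(columns=["Wafer", "Good/Bad"])  # sensor readings
y = batch["Good/Bad"]                          # +1 = working, -1 = faulty
```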
## Data Validation

The data goes through rigorous validation steps, including checks for file name patterns, number of columns, column names, data types, and null values. Invalid files are moved to the "Bad_Data_Folder".
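A minimal sketch of the file-level checks, assuming a date-and-time file-name convention and a fixed column count (the actual pattern and column list come from `schema.json`):

```python
import re

# Assumed naming convention: wafer_<DDMMYYYY>_<HHMMSS>.csv
FILENAME_PATTERN = re.compile(r"^wafer_\d{8}_\d{6}\.csv$")
EXPECTED_COLUMNS = 5  # illustrative; the real count comes from schema.json

def is_valid_batch(filename: str, columns: list) -> bool:
    """Return True if the file name and column count match the schema.

    Files failing either check would be moved to Bad_Data_Folder.
    """
    if not FILENAME_PATTERN.match(filename):
        return False
    return len(columns) == EXPECTED_COLUMNS

print(is_valid_batch("wafer_08012020_120000.csv",
                     ["Wafer", "S1", "S2", "S3", "Good/Bad"]))  # True
print(is_valid_batch("data.csv", ["Wafer"]))                    # False
```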
## Data Insertion in Database

Validated data is inserted into a database table named "Good_Data" for further processing and model training.
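The insertion step can be sketched with an in-memory SQLite database (the repository's `database_operations.py` handles the real connection; the table columns here are illustrative):

```python
import sqlite3

# In-memory SQLite database standing in for the project database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS Good_Data "
    "(wafer TEXT, sensor_1 REAL, sensor_2 REAL, label INTEGER)"
)

# Rows that passed validation (values made up for illustration).
validated_rows = [
    ("Wafer-801", 0.12, -1.45, 1),
    ("Wafer-802", 0.98, 0.22, -1),
]
conn.executemany("INSERT INTO Good_Data VALUES (?, ?, ?, ?)", validated_rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM Good_Data").fetchone()[0]
print(count)  # 2
```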
## Model Training

The preprocessed data is clustered with a KMeans model. For each cluster, both a Random Forest and an XGBoost classifier are trained, and the one with the higher AUC score is selected as that cluster's model.
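The cluster-then-select idea can be sketched as follows, on synthetic data; `GradientBoostingClassifier` stands in for XGBoost so the sketch needs only scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))               # stand-in sensor features
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)  # stand-in +1/-1 labels

# Cluster the training data, then pick the best classifier per cluster.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
best_models = {}
for c in range(kmeans.n_clusters):
    Xc, yc = X[kmeans.labels_ == c], y[kmeans.labels_ == c]
    X_tr, X_te, y_tr, y_te = train_test_split(
        Xc, yc, random_state=0, stratify=yc
    )
    candidates = {
        "random_forest": RandomForestClassifier(random_state=0),
        "boosting": GradientBoostingClassifier(random_state=0),
    }
    scores = {
        name: roc_auc_score(y_te, m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
        for name, m in candidates.items()
    }
    best_models[c] = candidates[max(scores, key=scores.get)]
```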
## Prediction

Incoming data goes through validation similar to the training data. Validated data is then preprocessed, clustered using the pre-trained KMeans model, and predictions are made using the appropriate model for each cluster.
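Routing new wafers to their cluster's model can be sketched like this (a freshly trained toy setup stands in for the saved KMeans model and per-cluster classifiers):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

# Toy training run standing in for the models saved by the training stage.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 4))
y_train = np.where(X_train[:, 0] > 0, 1, -1)  # +1 working, -1 faulty

kmeans = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X_train)
models = {
    c: RandomForestClassifier(random_state=1).fit(
        X_train[kmeans.labels_ == c], y_train[kmeans.labels_ == c]
    )
    for c in range(kmeans.n_clusters)
}

def predict_wafers(X_new):
    """Assign each wafer to a cluster, then apply that cluster's model."""
    clusters = kmeans.predict(X_new)
    preds = np.empty(len(X_new), dtype=int)
    for c in np.unique(clusters):
        preds[clusters == c] = models[c].predict(X_new[clusters == c])
    return preds

preds = predict_wafers(rng.normal(size=(5, 4)))
```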
## Deployment

The project is deployed on AWS Elastic Beanstalk for easy scalability and management. The Flask framework is used to create a web application for the project's front end.
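A minimal sketch of the Flask front end (the route name and payload shape are hypothetical; the real application lives in `app.py`):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # The real handler would validate, preprocess, cluster, and predict;
    # here we only echo how many wafers were submitted.
    payload = request.get_json(force=True)
    return jsonify({"received": len(payload.get("wafers", []))})

# To serve locally: app.run(host="0.0.0.0", port=5000)
# On Elastic Beanstalk, the platform's WSGI server imports `app` directly.
```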
## Project Structure

The project is structured as follows:

- `data_validation.py`: Contains data validation functions.
- `database_operations.py`: Handles database operations.
- `model_training.py`: Implements model training and selection.
- `prediction.py`: Handles prediction using the trained models.
- `app.py`: Implements the Flask web application for the project's front end.
- `requirements.txt`: Lists project dependencies.
- `config.json`: Contains configuration settings.
- `schema.json`: Defines the schema for data validation.
- `logo.png`: Project logo.
## Getting Started

- Clone this repository.
- Install the required dependencies using `pip install -r requirements.txt`.
- Configure the `config.json` and `schema.json` files according to your data and requirements.
- Run the Flask application using `python app.py`.
## Dependencies

The project uses the following main dependencies:

- Flask
- Pandas
- Scikit-learn
- XGBoost

Refer to `requirements.txt` for a complete list of dependencies.
## Usage

- Upload your training data files to the designated folder.
- Configure the `config.json` and `schema.json` files.
- Run the data validation, insertion, and model training scripts.
- Deploy the project on AWS Elastic Beanstalk.
- Use the provided web interface to interact with the deployed project.
## Contributing

Contributions are welcome! If you find any issues or want to enhance the project, feel free to open a pull request.
## License

This project is licensed under the MIT License.