Develop Machine Learning Model for Personally Identifiable Information (PII) Detection #10

Open
akshit-g opened this issue Jul 3, 2023 · 2 comments
Labels
Advanced backend hacktoberfest Issues open for contribution under Hacktoberfest 2020

Comments

@akshit-g
Contributor

akshit-g commented Jul 3, 2023

We need to implement a machine learning model capable of identifying regions within documents or images containing Personally Identifiable Information (PII). PII, including names, addresses, social security numbers, and email addresses, must be accurately detected to enhance user privacy and data security.

Data Collection and Annotation:
Collect a diverse dataset containing examples of documents or images with annotated regions of PII.
Ensure accurate and consistent annotations, marking the exact boundaries of PII regions.
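One possible shape for an annotation record is sketched below. The label set, the `PIIRegion` class, and the JSON-lines layout are all assumptions made for illustration, not a format this project has agreed on. Validating each box at creation time is one way to keep annotations consistent.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical label set, taken from the PII types named in this issue.
PII_LABELS = {"name", "address", "ssn", "email"}


@dataclass(frozen=True)
class PIIRegion:
    """One annotated PII region in pixel coordinates (origin at top-left)."""
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def __post_init__(self):
        # Reject unknown labels and degenerate or inverted boxes early.
        if self.label not in PII_LABELS:
            raise ValueError(f"unknown PII label: {self.label!r}")
        if not (0 <= self.x_min < self.x_max and 0 <= self.y_min < self.y_max):
            raise ValueError("box must have positive width and height")


def to_record(image_path, regions):
    """Serialize one image and its regions as a single JSON line."""
    return json.dumps({"image": image_path,
                       "regions": [asdict(r) for r in regions]})
```

For example, `to_record("doc_001.png", [PIIRegion("email", 10, 20, 110, 40)])` yields one line that can be appended to an annotation file (the file name here is made up).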

Model Selection:
Choose or design a suitable object detection model architecture (e.g., Faster R-CNN, SSD, YOLO) for accurate and efficient region detection.

Data Preprocessing:
Preprocess the dataset, including resizing, normalization, and data augmentation, to prepare it for model training.
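When images are resized, the annotated boxes have to be rescaled by the same factors, otherwise the labels no longer line up with the pixels. A minimal sketch of that bookkeeping follows, with hypothetical helper names and an assumed normalization of mean 0.5 and standard deviation 0.5; the actual values would depend on the chosen model.

```python
def scale_box(box, src_size, dst_size):
    """Rescale an (x_min, y_min, x_max, y_max) box after an image resize.

    src_size and dst_size are (width, height) tuples.
    """
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    x_min, y_min, x_max, y_max = box
    return (x_min * sx, y_min * sy, x_max * sx, y_max * sy)


def normalize_pixel(value, mean=0.5, std=0.5):
    """Map an 8-bit pixel value (0..255) to a zero-centred float."""
    return (value / 255.0 - mean) / std
```

Geometric augmentations need the same care: any transform applied to the image has to be applied to its boxes too.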

Model Training and Evaluation:
Split the annotated dataset into training and testing sets.
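A reproducible split could look like the sketch below. The function name, the 20% test fraction, and the fixed seed are assumptions for illustration. If several images come from the same source document, splitting by document rather than by image would help avoid leakage between the two sets.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle items with a fixed seed and return (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    items = list(items)
    # A private Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(items)
    n_test = max(1, round(len(items) * test_fraction))
    return items[n_test:], items[:n_test]
```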

@shradiphylleia
Contributor

This is my first contribution to an ML model project, so please review my approach before I go ahead with it.
Approach for the training part:
1. Collect the data: I am searching for public, pre-existing labeled datasets. If the dataset is not diverse enough, then use data augmentation techniques.
2. Clean the data.
3. OCR: use an OCR technique to extract text from the images. The extracted information can then help us label which data should be considered sensitive.
4. Train the model: I will then use the collected data and the OCR-extracted data to train the ML model, using a CNN to recognize and classify sensitive information.
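The OCR labeling step could be bootstrapped with simple pattern matching, sketched below. It assumes the OCR engine returns `(text, box)` pairs per word, which is a hypothetical format, and the two regular expressions are deliberately simplified (US-style social security numbers, common email shapes). Names and addresses would need something stronger than a regex.

```python
import re

# Simplified patterns for illustration; real-world PII needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def label_ocr_words(words):
    """Return (label, box) pairs for OCR words that look like PII.

    `words` is a list of (text, box) pairs, where box is the word's
    bounding box as reported by the OCR engine.
    """
    labeled = []
    for text, box in words:
        if EMAIL_RE.search(text):
            labeled.append(("email", box))
        elif SSN_RE.search(text):
            labeled.append(("ssn", box))
    return labeled
```

The resulting boxes could serve as draft annotations for a human to review before training, rather than as final labels.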

@akshit-g
Contributor Author

akshit-g commented Jul 7, 2023

Hey!
The approach checks out, although you also need to check whether there is PII in the image itself, such as a group picture or a screenshot of a chat where my profile picture is visible.

Other than that, the approach looks good.

akshit-g changed the title from "Develop ML model" to "Develop ML model to identify PII regions" on Jul 14, 2023
akshit-g changed the title from "Develop ML model to identify PII regions" to "Develop Machine Learning Model for Personally Identifiable Information (PII) Detection" on Sep 29, 2023
akshit-g added the hacktoberfest label (Issues open for contribution under Hacktoberfest 2020) and removed the OSoC’23 label on Sep 29, 2023