The goal is to enable prospective graduate students to evaluate their profiles.
In every graduate admission cycle, students are expected to furnish a Statement of Purpose (SoP) as part of the application. This project aims to help students self-evaluate their SoPs for the particular program they are applying to.
- Data Collection: Samples of admitted and rejected SoPs are collected and collated continuously in the pos_samples.txt and neg_samples.txt files, respectively.
- Data Clean-up:
- Natural Language Processing: Each profile from the above files is tokenized and tagged using nltk to extract the important words (nouns, verbs).
For example, student profile: I have always enjoyed science. I studied computer science at XYZ University.
Tokenized and tagged profile: [('have', 'VBP'), ('enjoyed', 'VBN'), ('science', 'NN'), ('studied', 'VBD'), ('computer', 'NN'), ('science', 'NN'), ('XYZ', 'NNP'), ('University', 'NNP')]
- Lemmatize: The extracted words are lemmatized so that each student profile can be compared for the presence or absence of these words.
Same example: ['have', 'enjoy', 'science', 'study', 'computer', 'science', 'xyz', 'university']
- Uniquifying and sorting: The reduced profile is further simplified by removing duplicate occurrences of lemmatized words and sorting the remainder alphabetically.
Same example: ['computer', 'enjoy', 'have', 'science', 'study', 'university', 'xyz']
- Storing: Each sorted and simplified student profile is stored in clean_file.txt.
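The clean-up steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names (`tag_profile`, `keep_nouns_and_verbs`, `lemmatize`, `simplify`) are hypothetical, and the two NLTK-based functions assume NLTK is installed with its tokenizer, tagger, and WordNet data already downloaded.

```python
def tag_profile(text):
    """Tokenize and POS-tag a raw profile (assumes NLTK and its data are available)."""
    import nltk  # imported lazily so the pure-Python helpers below work without NLTK
    return nltk.pos_tag(nltk.word_tokenize(text))


def keep_nouns_and_verbs(tagged):
    """Keep only tokens whose Penn Treebank tag marks a noun (NN*) or a verb (VB*)."""
    return [(word, tag) for word, tag in tagged if tag.startswith(("NN", "VB"))]


def lemmatize(tagged):
    """Lowercase each word and lemmatize it as a verb or a noun, based on its tag."""
    from nltk.stem import WordNetLemmatizer  # assumes the WordNet corpus is downloaded
    lemmatizer = WordNetLemmatizer()
    return [
        lemmatizer.lemmatize(word.lower(), pos="v" if tag.startswith("VB") else "n")
        for word, tag in tagged
    ]


def simplify(lemmas):
    """Remove duplicate lemmas and sort the remainder alphabetically."""
    return sorted(set(lemmas))
```

Chaining the functions as `simplify(lemmatize(keep_nouns_and_verbs(tag_profile(text))))` would yield one simplified profile per SoP. Writing the result to clean_file.txt is omitted here.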
- Clean Data to Binary Vectors:
- Vector Definition: Simplified profiles are combined as follows to obtain vectors.
Simplified Profile 1: ['computer', 'enjoy', 'have', 'science', 'study', 'university', 'xyz']
Simplified Profile 2: ['abc', 'aim', 'become', 'computer', 'learning', 'machine', 'scientist', 'study', 'university']
Combined + uniquified: ['abc', 'aim', 'become', 'computer', 'enjoy', 'have', 'learning', 'machine', 'science', 'scientist', 'study', 'university', 'xyz'] <-- has length 13
Assuming combined + uniquified as a vector: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] <-- also has length 13
From this,
Simplified Profile 1 vector: [0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1] <-- based on the words present in Simplified Profile 1
Simplified Profile 2 vector: [1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0]
- Storing: The vectors are stored in vector_file.txt.
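The vector definition above can be reproduced with a short sketch. The helper names `build_vocabulary` and `to_binary_vector` are illustrative assumptions, not names taken from the project.

```python
def build_vocabulary(profiles):
    """Combine all simplified profiles into one sorted list of unique words."""
    return sorted(set().union(*profiles))


def to_binary_vector(profile, vocabulary):
    """Mark each vocabulary word with 1 if it occurs in the profile, else 0."""
    present = set(profile)
    return [1 if word in present else 0 for word in vocabulary]


profile_1 = ['computer', 'enjoy', 'have', 'science', 'study', 'university', 'xyz']
profile_2 = ['abc', 'aim', 'become', 'computer', 'learning', 'machine',
             'scientist', 'study', 'university']

vocabulary = build_vocabulary([profile_1, profile_2])
print(len(vocabulary))                          # 13
print(to_binary_vector(profile_1, vocabulary))  # [0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1]
print(to_binary_vector(profile_2, vocabulary))  # [1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0]
```

Because the vocabulary is sorted, every profile vector uses the same word order, so position i always refers to the same word across profiles.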
- Binary Vectors to Labelled Data: Each line in vector_file.txt, which contains the binary vector of one profile, is converted into a row of the labelled data matrix for a neural network. The labelled data contains both input features and labels.
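One possible way to attach labels is sketched below. It rests on two assumptions that the description above does not confirm: that each vector can be traced back to pos_samples.txt or neg_samples.txt, and that admitted profiles are labelled 1 and rejected profiles 0. The function name `label_vectors` is hypothetical.

```python
def label_vectors(pos_vectors, neg_vectors):
    """Stack positive and negative vectors and build a matching list of labels.

    Assumption: label 1 marks an admitted profile, label 0 a rejected one.
    """
    pos_vectors = list(pos_vectors)
    neg_vectors = list(neg_vectors)
    features = pos_vectors + neg_vectors
    labels = [1] * len(pos_vectors) + [0] * len(neg_vectors)
    return features, labels


features, labels = label_vectors(
    pos_vectors=[[0, 1, 1], [1, 1, 0]],
    neg_vectors=[[1, 0, 0]],
)
print(features)  # [[0, 1, 1], [1, 1, 0], [1, 0, 0]]
print(labels)    # [1, 1, 0]
```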
- Neural Network:
- Random Selection: Training (90%), cross-validation (5%), and test (5%) samples are randomly selected.
- Neural Network Structure: Hidden layers and corresponding parameters are generated in tensorflow.
- Cross-Validation: Tensorflow model is trained to evaluate cross-validation error.
- Testing: The trained Tensorflow model is evaluated on the test samples to measure test error.
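The random selection and network structure might look like the sketch below. The 90/5/5 split follows the description above. Everything else is an assumption made for illustration: the function names, the fixed seed, the single hidden layer of 16 units, the Adam optimizer, and the use of the Keras API.

```python
import random


def split_samples(samples, seed=0):
    """Shuffle the samples and split them 90% / 5% / 5% (train / CV / test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = len(shuffled) * 90 // 100
    n_cv = len(shuffled) * 5 // 100
    train = shuffled[:n_train]
    cv = shuffled[n_train:n_train + n_cv]
    test = shuffled[n_train + n_cv:]  # remainder, so no sample is dropped
    return train, cv, test


def build_model(n_features, hidden_units=(16,)):
    """Build a small binary classifier; layer sizes are illustrative assumptions."""
    import tensorflow as tf  # imported lazily; assumes TensorFlow 2.x is installed
    layers = [tf.keras.Input(shape=(n_features,))]
    layers += [tf.keras.layers.Dense(units, activation="relu") for units in hidden_units]
    # A sigmoid output gives the probability that a profile is positive.
    layers.append(tf.keras.layers.Dense(1, activation="sigmoid"))
    model = tf.keras.Sequential(layers)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

With a model built this way, `model.fit` on the training split and `model.evaluate` on the cross-validation and test splits would give the corresponding errors.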
Once the neural network is satisfactorily trained, the user profile is added to profile.txt. This file undergoes the steps described in Method above, and the probability of it being a positive profile is returned.