## Problem Background

Imagine you received a dataset from a client containing a list of names and their potential classifications. The classifications are generated by an upstream system, but they are not always correct. Your task is to classify the names correctly and to evaluate your solution critically.

## Problem Approach

This problem was approached by first tokenising the words in the names column and mapping each word to an integer. Before this integer mapping, all words were lowercased and special characters such as `!"#$%&()*+,-./:;<=>?@[\\]^_{|}~\t\n` were filtered out. These integers formed the vocabulary of the model.
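
A minimal sketch of this preprocessing step is shown below. The example names, the sequence length (`maxlen=10`), and the printed output are illustrative assumptions, not values from the client data; the filter string is the Keras default, which matches the characters listed above.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Illustrative names only; the real data set comes from the client.
names = ["Dr John Smith", "Acme Holdings Pty Ltd", "University of Cape Town"]

# Lowercase each name, strip the special characters listed above,
# and map every remaining word to an integer.
tokenizer = Tokenizer(lower=True, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
tokenizer.fit_on_texts(names)

# Convert names to integer sequences and pad them to a common length
# (maxlen=10 is an assumed value).
sequences = pad_sequences(tokenizer.texts_to_sequences(names),
                          maxlen=10, padding="post")

print(tokenizer.word_index)  # the model's vocabulary, e.g. {'dr': 1, ...}
```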

Using the tokeniser, each name was then converted into a sequence of integers. This sequence was passed through an embedding layer so that words were represented as vectors in an abstract 𝐷-dimensional space. The embedding layer then passed its sequence of word-vectors to an LSTM layer. Finally, the LSTM layer was connected to a Dense layer with three neurons and a softmax activation function, outputting a probability distribution over the three possible classes (Person, Company, University).
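
A minimal Keras sketch of this architecture follows. The vocabulary size, embedding dimension 𝐷, sequence length, and number of LSTM units are assumed values, not the trained model's actual hyperparameters.

```python
import tensorflow as tf

VOCAB_SIZE = 5000  # assumed: tokeniser vocabulary size + 1 (0 is reserved for padding)
EMBED_DIM = 64     # assumed: the abstract D-dimensional embedding size
MAX_LEN = 10       # assumed: padded sequence length

model = tf.keras.Sequential([
    # Map each integer token to a D-dimensional word-vector.
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAX_LEN),
    # Read the sequence of word-vectors.
    tf.keras.layers.LSTM(64),
    # Three neurons + softmax: one probability each for Person, Company, University.
    tf.keras.layers.Dense(3, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```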

In order to achieve higher accuracy, the training data set was augmented by up-sampling the minority classes (University and Company) so that the training data set was balanced.
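
One way to implement this up-sampling is sketched below with `sklearn.utils.resample`. The toy DataFrame and its `name`/`label` column names are assumptions about the data layout.

```python
import pandas as pd
from sklearn.utils import resample

# Assumed layout: a 'name' column and a 'label' column.
df = pd.DataFrame({
    "name":  ["Dr John Smith", "Acme Holdings Pty Ltd", "University of Cape Town"],
    "label": ["Person", "Company", "University"],
})

majority = df[df["label"] == "Person"]
parts = [majority]
for cls in ("Company", "University"):
    minority = df[df["label"] == cls]
    # Sample with replacement until the class matches the majority count.
    parts.append(resample(minority, replace=True,
                          n_samples=len(majority), random_state=42))

balanced = pd.concat(parts).sample(frac=1, random_state=42)  # shuffle rows
```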

## Summary

### Exploratory Data Analysis

This involved inspecting the given data set and finding that the classes were imbalanced.

For manual feature engineering, I would have looked at whether the names included common words associated with persons (mr, dr, ms), companies (llc, pty, cc), or universities (college, university, campus). I would then have constructed binary categorical features indicating whether these words were present in the name. However, it is easy to miss some of these common words, and this approach is limited to what I consider common. Instead, I opted for an embedding approach that places words in a 𝐷-dimensional abstract space with contextual proximity. This method uses all the words present in the training set and automatically groups similar words together.
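
For reference, this rejected manual approach might have looked like the sketch below. The keyword lists are the hypothetical ones above and are deliberately incomplete, which is exactly the weakness being described.

```python
PERSON_WORDS = {"mr", "dr", "ms"}
COMPANY_WORDS = {"llc", "pty", "cc"}
UNIVERSITY_WORDS = {"college", "university", "campus"}

def keyword_flags(name):
    """Binary categorical features: does the name contain a known keyword?"""
    words = set(name.lower().split())
    return {
        "has_person_word": int(bool(words & PERSON_WORDS)),
        "has_company_word": int(bool(words & COMPANY_WORDS)),
        "has_university_word": int(bool(words & UNIVERSITY_WORDS)),
    }

print(keyword_flags("Dr John Smith"))          # {'has_person_word': 1, ...}
print(keyword_flags("Acme Holdings Pty Ltd"))  # flags 'pty' as a company word
```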

### Limitations

As seen in the confusion matrix for the test set, this model was able to classify universities and persons reasonably well. However, only 72% of companies were correctly classified; the remaining 28% were misclassified as persons. The model's ability to classify companies could be improved by collecting more data on company names and adding it to the data set before training.
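
A confusion matrix of this kind can be produced with scikit-learn and seaborn, as sketched below. The `y_true` and `y_pred` lists are dummy placeholders for the actual test-set labels and model predictions.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

labels = ["Person", "Company", "University"]

# Placeholders; in practice these come from the held-out test set.
y_true = ["Person", "Company", "Company", "University", "Person"]
y_pred = ["Person", "Person", "Company", "University", "Person"]

# Row-normalised so each row shows the fraction of that true class.
cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")

sns.heatmap(cm, annot=True, fmt=".2f",
            xticklabels=labels, yticklabels=labels, cmap="Blues")
plt.xlabel("Predicted class")
plt.ylabel("True class")
plt.show()
```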

### Deployment

To deploy this model, a container environment such as Docker can be set up with all the necessary libraries installed, and the trained model can be saved inside it. Whenever someone wants to use the model, they can simply spin up an instance of the specified container, supply it with names, and have each name classified as Person, Company, or University.
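
A sketch of the serving step inside such a container follows. The artefact paths, the class ordering, and `MAX_LEN` are assumptions; the tokeniser fitted during training would need to be persisted alongside the model.

```python
import pickle
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

CLASSES = ["Person", "Company", "University"]  # assumed label order
MAX_LEN = 10  # assumed: must match the length used at training time

# Load the artefacts baked into the container image (paths are illustrative).
model = tf.keras.models.load_model("artifacts/name_classifier")
with open("artifacts/tokenizer.pkl", "rb") as f:
    tokenizer = pickle.load(f)

def classify(names):
    """Return the predicted class for each incoming name."""
    seqs = pad_sequences(tokenizer.texts_to_sequences(names),
                         maxlen=MAX_LEN, padding="post")
    probs = model.predict(seqs)
    return [CLASSES[i] for i in probs.argmax(axis=1)]

print(classify(["Acme Holdings Pty Ltd"]))  # e.g. ['Company']
```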

## Libraries Used

- Python 3.7.6
- tensorflow 2.4.1
- numpy 1.19.5
- pandas 1.0.1
- matplotlib 3.1.3
- seaborn 0.10.0
- scikit-learn 0.22.1