IV. Feature Generation
Natural language is messy and unstructured, and it requires several preprocessing steps to extract the main meaning behind a question. In order to extract the portions of text that best describe the question, preprocessing steps are applied in three phases:
- In the cleaning phase, all numbers, punctuation, and hyperlinks are removed from the text. In this work, questions/answers are handled at the word level as a set of words, so numbers, punctuation, and hyperlinks carry little information on their own once the question is split into words.
- In the filtering phase, only the words that convey the most information are kept. As the first step, high-frequency words are removed from the text. These words, called stopwords, carry little information by themselves when the text is represented as a bag of words, e.g. articles and prepositions. In addition to the standard stopwords, upon inspecting the data we noticed that some uninformative words, such as ‘LAW’ and ‘barefootlaw’, were used repeatedly in the questions. We decided to create a custom stopword list and remove these words from the text as well. As the third step in the filtering phase, we assumed that most of the information in a question is carried by its nouns and verbs. Therefore, a Part-of-Speech tagging step was carried out and only the nouns and verbs were retained.
- In the combining phase, the focus was on reducing the size of the vocabulary by combining similar words. First, a stemming step was taken to combine different forms of the same verb, e.g. ‘walked’ and ‘walking’ become ‘walk’. Further, a lemmatization step was taken to convert words to their root form and combine different forms of the same word (e.g. plurals). Another step that could be taken in this phase is combining synonyms; however, synonym combination was not implemented in this work.
The pre-processing functions can be found in the src/barefoot_winnie/d00_utils/preprocessing.py script.
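As a rough illustration of the three phases, the sketch below uses NLTK (a library choice assumed here for illustration; the actual functions in preprocessing.py may be organized differently):

```python
# A minimal sketch of the cleaning, filtering and combining phases using NLTK.
# Illustrative only; requires the usual NLTK data packages (tokenizer, stopwords,
# POS tagger, WordNet) to be downloaded.
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

CUSTOM_STOPWORDS = {"law", "barefootlaw"}  # example custom stop words


def clean(text: str) -> str:
    """Cleaning phase: drop hyperlinks, numbers and punctuation."""
    text = re.sub(r"https?://\S+", " ", text)   # hyperlinks
    text = re.sub(r"\d+", " ", text)            # numbers
    text = re.sub(r"[^\w\s]", " ", text)        # punctuation
    return text.lower()


def filter_words(words):
    """Filtering phase: remove stop words, keep only nouns and verbs."""
    stops = set(stopwords.words("english")) | CUSTOM_STOPWORDS
    words = [w for w in words if w not in stops]
    tagged = nltk.pos_tag(words)
    return [w for w, tag in tagged if tag.startswith(("NN", "VB"))]


def combine(words):
    """Combining phase: stem and lemmatize to merge word variants."""
    stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
    return [lemmatizer.lemmatize(stemmer.stem(w)) for w in words]


def preprocess(question: str):
    words = nltk.word_tokenize(clean(question))
    return combine(filter_words(words))


print(preprocess("My landlord walked away with 500 dollars, what can I do?"))
```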
Machine learning algorithms require numerical representations of data to learn and make predictions. Further, most machine learning algorithms require a fixed number of dimensions in the numerical representations. Therefore, the preprocessed questions need to be converted to a structured representation with fixed dimensionality across the dataset.
One simple method of converting the text into such a representation is a word frequency matrix called the Term-Document Matrix (TDM). As a first step in creating the TDM, a vocabulary is built from the set of training questions (the corpus) in the database. The vocabulary is the set of unique words in the corpus after all the preprocessing steps from the previous section have been performed. In the TDM, each row represents a question (document) and each column represents a word (term) in the vocabulary; therefore, the dimensionality of the TDM is determined by the size of the vocabulary. Each cell of the TDM holds the frequency of that word in the question. The TDM is usually a very sparse matrix, so the filtering and combining phases of preprocessing are crucial for controlling its dimensionality.
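As a minimal sketch of a TDM, the snippet below uses scikit-learn's CountVectorizer on a toy corpus of already-preprocessed questions (the library choice and corpus are assumptions for illustration, not the project's exact construction):

```python
# Build a small Term-Document Matrix: rows are questions, columns are
# vocabulary words, and each cell holds a word count.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "landlord refuse return deposit",   # already-preprocessed questions
    "employer refuse pay salary",
    "landlord increase rent notice",
]

vectorizer = CountVectorizer()
tdm = vectorizer.fit_transform(corpus)      # sparse matrix, shape (3, |vocabulary|)

print(vectorizer.get_feature_names_out())   # the vocabulary (one column per word)
print(tdm.toarray())                        # dense view of the TDM
```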
Another limitation of the TDM is that it assumes every word is equally important in the corpus. To counter this, Term Frequency-Inverse Document Frequency (TF-IDF) weighting can be used. First, the term frequency (TF) is calculated for each word: its count divided by the total number of words in the document/question, so each document has a TF vector. Then, the inverse document frequency (IDF) is calculated for each word in the vocabulary: words that appear in many documents receive a low IDF, while rarer words receive a high IDF. The TF-IDF weight of a word in a document is the product of its TF and IDF.
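To make the weighting concrete, the sketch below computes TF and IDF by hand for a toy corpus (illustrative only; libraries such as scikit-learn apply additional smoothing, so exact values differ):

```python
# Manual TF-IDF for a tiny corpus of preprocessed questions.
import math
from collections import Counter

corpus = [
    ["landlord", "refuse", "return", "deposit"],
    ["employer", "refuse", "pay", "salary"],
    ["landlord", "increase", "rent", "notice"],
]


def tf(word, document):
    # term frequency: share of the document's words that are `word`
    return Counter(document)[word] / len(document)


def idf(word, documents):
    # inverse document frequency: rare words across the corpus score higher
    containing = sum(1 for d in documents if word in d)
    return math.log(len(documents) / containing)


# 'deposit' appears in one question and 'refuse' in two, so 'deposit'
# receives a higher weight within its question.
print(tf("deposit", corpus[0]) * idf("deposit", corpus))  # 0.25 * log(3/1) ≈ 0.27
print(tf("refuse", corpus[0]) * idf("refuse", corpus))    # 0.25 * log(3/2) ≈ 0.10
```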
The stepwise process is given below:
- Denoise the question
- Remove punctuation
- Remove numbers
- Replace contractions (see the sketch after this list)
- Remove stop words
- Stem and lemmatize the words
- Filter part of speech tags
- Remove custom stop words
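The contraction-replacement step can be illustrated with a small hand-rolled mapping (a minimal sketch; in practice a fuller dictionary or a dedicated library would be used):

```python
# Replace common English contractions before tokenization.
import re

CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am"}


def replace_contractions(text: str) -> str:
    pattern = re.compile("|".join(map(re.escape, CONTRACTIONS)), re.IGNORECASE)
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


print(replace_contractions("I'm sure he can't pay"))  # "i am sure he cannot pay"
```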
The TF-IDF class is created in the src/barefoot_winnie/d04_modelling/tf_idf.py script.
As the second approach, pre-trained word embeddings were used to obtain fixed-length structured data. In this version, the pre-trained Google Word2Vec model was used, which contains numeric vector embeddings for three million words. The complete set can be downloaded here. The idea is to obtain the word embedding for each word in the question and combine them by averaging across the words, then use the result as the numeric embedding of the question. This is a naive approach, as it ignores the order of the words and the semantic context in which they appear.
The stepwise process is given below.
- Denoise the question
- Remove punctuation
- Remove numbers
- Split the question into words
- For each word, get the word vector from the trained set (if word not in the set, return a vector of 0s)
- Average the vectors of all the words in the question to get the embedding for the question
Note that stemming and lemmatization are not performed here. The idea is that the word embeddings should already account for different forms of a word and reflect them in the numerical vector.
The pretrained word2vec class is created in the src/barefoot_winnie/d04_modelling/word2vec_pretrained.py script.
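A minimal sketch of this averaging approach using gensim and the pretrained Google News vectors is shown below (the file name, path, and gensim usage are assumptions for illustration, not the project's exact implementation):

```python
# Average pretrained Word2Vec embeddings to get one fixed-length vector per question.
import numpy as np
from gensim.models import KeyedVectors

# Multi-gigabyte download; 3 million words, 300-dimensional vectors.
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)


def question_vector(question: str, dim: int = 300) -> np.ndarray:
    """Average the word vectors of a question; unknown words map to zeros."""
    vectors = [
        w2v[word] if word in w2v else np.zeros(dim)
        for word in question.lower().split()
    ]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)


features = question_vector("landlord refused to return my deposit")
print(features.shape)  # (300,)
```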