Data handlers

amin saied edited this page Sep 15, 2017 · 1 revision

Supervised machine learning algorithms require paired data of the form (X, Y) to train. This data is (usually) split into three parts: training data, test data and validation data. Our TrainingData object has three fields, .training, .test and (optionally) .validation, each of which has .X and .Y subfields. For example,

data.training.X = np.array(...) # training data
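The container described above can be sketched as follows. The `Split` helper class is an assumption for illustration; the real implementation may store the splits differently, but the access pattern (`data.training.X`, etc.) matches the text.

```python
import numpy as np

class Split:
    """One split of the data: paired inputs X and labels Y."""
    def __init__(self, X, Y):
        self.X = X
        self.Y = Y

class TrainingData:
    """Sketch of the general container: .training, .test, optional .validation."""
    def __init__(self, training, test, validation=None):
        self.training = training
        self.test = test
        self.validation = validation

data = TrainingData(
    training=Split(np.array([[1.0, 0.0]]), np.array([1])),
    test=Split(np.array([[0.0, 1.0]]), np.array([0])),
)
# data.training.X and data.test.Y are now available as described above.
```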

Different machine learning algorithms require training data in different forms. The general format of training/validation/test data is consistent, however, so each algorithm's training data class inherits from the most general TrainingData class. We provide the following specialised training data classes:

  1. UnigramTrainingData
  2. RNNTrainingData
  3. SVMTrainingData
  4. TextToVecTrainingData
  5. WordToVecTrainingData

Each of these classes has a build method that returns a training data object ready to be passed to the corresponding model for training.
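The inheritance-plus-build pattern might look like the minimal sketch below. The `Split` helper, the `train_fraction` argument and the splitting logic are assumptions; the real build methods also perform model-specific preprocessing (e.g. taking a vocab and msc_bank, as in the unigram example that follows).

```python
import numpy as np

class Split:
    """Paired inputs X and labels Y for one split (hypothetical helper)."""
    def __init__(self, X, Y):
        self.X = X
        self.Y = Y

class TrainingData:
    """The most general container that specialised classes inherit from."""
    def __init__(self, training, test, validation=None):
        self.training = training
        self.test = test
        self.validation = validation

class SVMTrainingData(TrainingData):
    @classmethod
    def build(cls, X, Y, train_fraction=0.8):
        # Split the paired data into training and test portions,
        # then wrap them in the shared container.
        n = int(len(X) * train_fraction)
        return cls(training=Split(X[:n], Y[:n]),
                   test=Split(X[n:], Y[n:]))

svm_data = SVMTrainingData.build(np.arange(10).reshape(5, 2), np.arange(5))
```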

Example: Unigram Training Data

As an example, given a pandas Series X of abstracts and a Series Y of MSC codes,

|   | X                                                  | Y                |
|---|----------------------------------------------------|------------------|
| 1 | [paper, show, compute, norm, dyadic, grid, res...  | [42]             |
| 2 | [prove, pfaffian, versions, Lieb, inequalities...  | [15, 46]         |
| 3 | [paper, Hardy, Lorentz, Lorentz_spaces, spaces...  | [42]             |
| 4 | [goal, paper, construct, invariant, dynamical,...  | [32, 18, 37, 55] |
| 5 | [treat, Koll, injectivity, theorem, analytic, ...  | [32]             |
| ... | ... | ... |

and given a vocab object and an msc_bank we can create training data for the unigram model with,

vocab = Vocab.load([...]) # load default vocab object from `containers`
msc_bank = MSC.load(2) # 2-digit MSC codes

unigram_data = UnigramTrainingData.build(X, Y, vocab, msc_bank)

One can then access unigram_data.training.X or unigram_data.test.Y, for example. Better still, this object is exactly what our unigram model expects down the line!

Remark. It is worth noting that the unigram training data is somewhat anomalous in that the training data and the test data take different forms. To train the unigram model we, put simply, count the occurrences of words appearing under each MSC code and use these counts to build a matrix. The training data is therefore a pandas Series, which lends itself well to constructing this matrix. The model then accepts vectors (of length equal to the size of the vocab) and makes a prediction by multiplication with this matrix, so the test data is a numpy array. In all our other examples, the training, test and validation data share the same form.
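The count-then-multiply idea in the remark can be sketched on toy data. The vocab, codes and scoring-by-argmax rule here are illustrative assumptions, not the actual implementation:

```python
import numpy as np

# Toy setup: a 3-word vocab and two 2-digit MSC codes.
vocab = ["norm", "grid", "prove"]
msc_codes = ["42", "15"]

# Training side: abstracts as word lists paired with their MSC codes.
training = [
    (["norm", "grid", "norm"], ["42"]),
    (["prove", "norm"], ["15"]),
]

# counts[i, j] = occurrences of vocab[j] in abstracts labelled msc_codes[i].
counts = np.zeros((len(msc_codes), len(vocab)))
for words, codes in training:
    for code in codes:
        i = msc_codes.index(code)
        for w in words:
            counts[i, vocab.index(w)] += 1

# Test side: an abstract as a vector of word counts over the vocab.
x = np.array([1, 1, 0])  # one "norm", one "grid"

# Prediction: multiply by the count matrix and take the best-scoring code.
scores = counts @ x
prediction = msc_codes[int(np.argmax(scores))]  # "42"
```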
