Data handlers
Supervised machine learning algorithms require paired data of the form (X, Y) to train on. This data is (usually) split into three parts: training data, test data and validation data. Our `TrainingData` object has three fields, `.training`, `.test` and (optionally) `.validation`, each of which has `.X` and `.Y` subfields. For example,

```python
data.training.X = np.array(...)  # training data
```
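To make the structure concrete, here is a minimal sketch of how such a container might look. This is an illustration only, not the library's actual implementation; the `DataSplit` helper class is a hypothetical name introduced here.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class DataSplit:
    """One split of the data: paired inputs X and labels Y."""
    X: np.ndarray
    Y: np.ndarray


@dataclass
class TrainingData:
    """Holds the training, test and (optional) validation splits."""
    training: DataSplit
    test: DataSplit
    validation: Optional[DataSplit] = None


# Build a toy instance and access its subfields as described above.
data = TrainingData(
    training=DataSplit(X=np.arange(6).reshape(3, 2), Y=np.array([0, 1, 0])),
    test=DataSplit(X=np.arange(4).reshape(2, 2), Y=np.array([1, 0])),
)
print(data.training.X.shape)  # (3, 2)
```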
Different machine learning algorithms require training data in different forms. The general format of the training/(validation)/test splits is consistent, however, so each algorithm's specialised training data class inherits from the most general `TrainingData` class. We provide the following specialised training data classes:
- `UnigramTrainingData`
- `RNNTrainingData`
- `SVMTrainingData`
- `TextToVecTrainingData`
- `WordToVecTrainingData`
Each of these classes has a `build` method that returns a training data object ready to be passed on to train the corresponding model.
As an example, given a pandas series `X` of abstracts and a series `Y` of MSC codes,

```
                                                   X                 Y
1  [paper, show, compute, norm, dyadic, grid, res...              [42]
2  [prove, pfaffian, versions, Lieb, inequalities...          [15, 46]
3  [paper, Hardy, Lorentz, Lorentz_spaces, spaces...              [42]
4  [goal, paper, construct, invariant, dynamical,...  [32, 18, 37, 55]
5  [treat, Koll, injectivity, theorem, analytic, ...              [32]
...
```
and given a `vocab` object and an `msc_bank`, we can create training data for the unigram model with

```python
vocab = Vocab.load([...])  # load default vocab object from `containers`
msc_bank = MSC.load(2)     # 2-digit MSC codes
unigram_data = UnigramTrainingData.build(X, Y, vocab, msc_bank)
```
on which one can access, for example, `unigram_data.training.X` or `unigram_data.test.Y`. Better still, this object is exactly what our unigram model expects down the line!
Remark. It is perhaps worth noting that the unigram training data is somewhat anomalous in that the training data and the test data take different forms. To train the unigram model we, put simply, count the occurrences of words appearing under different MSC codes and use these counts to build a matrix. The training data is therefore a pandas series, as this lends itself well to the construction of said matrix. The model accepts vectors (whose length equals the size of the vocabulary) and makes a prediction by multiplying them with this matrix, so the test data takes the form of a numpy array. In all our other examples, the training, test and validation data share the same form as one another.
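The counting scheme described in the remark can be sketched as follows. This is a toy illustration with a made-up vocabulary and MSC codes, not the project's actual `UnigramTrainingData` pipeline.

```python
import numpy as np

# Hypothetical vocabulary and 2-digit MSC codes for illustration.
vocab = ["paper", "norm", "inequality", "dynamical"]
msc_codes = ["42", "15", "37"]
word_index = {w: i for i, w in enumerate(vocab)}
code_index = {c: i for i, c in enumerate(msc_codes)}

# Training data as (tokens, codes) pairs, mimicking the series above.
training = [
    (["paper", "norm"], ["42"]),
    (["inequality"], ["15"]),
    (["paper", "dynamical"], ["37"]),
]

# Count occurrences of each word under each MSC code to build the matrix.
counts = np.zeros((len(msc_codes), len(vocab)))
for tokens, codes in training:
    for code in codes:
        for tok in tokens:
            counts[code_index[code], word_index[tok]] += 1

# At test time an abstract is a word-count vector (length = vocab size);
# multiplying it by the matrix scores each MSC code.
x = np.zeros(len(vocab))
for tok in ["paper", "norm"]:
    x[word_index[tok]] += 1
scores = counts @ x
predicted = msc_codes[int(np.argmax(scores))]
print(predicted)  # "42"
```

This also shows why the two splits differ in form: the training side is naturally a series of token lists, while the test side must already be vectorised into numpy arrays for the matrix multiplication.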