Classifying Papers with Neural Networks

goldturtle edited this page Jan 13, 2022 · 43 revisions

These instructions show you how to classify papers using the Textpresso Neural Networks framework. If any instructions are unclear, feel free to email us at textpresso (at) caltech (dot) edu.

Prepare Data and Image, then Run Image

Prepare the Image

docker load < tpc-hmm-tpnn.tgz

Prepare Directories on the Host System

  • Make your working directory and some sub-directories therein:

mkdir -p <full path of your classifier directory>

mkdir -p <full path of your classifier directory>/tmp

mkdir -p <full path of your classifier directory>/raw_files/pdf

mkdir -p <full path of your classifier directory>/tpcas-1
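The directory layout above can also be created in one call. The following is a minimal sketch, not part of the Textpresso distribution; the function name `make_classifier_dirs` and its argument are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: make_classifier_dirs is a hypothetical helper, not a Textpresso script.
# Usage: make_classifier_dirs <full path of your classifier directory>
make_classifier_dirs() {
    local base="$1"
    if [ -z "$base" ]; then
        echo "usage: make_classifier_dirs <classifier directory>" >&2
        return 1
    fi
    # Same directories as the individual mkdir commands above.
    mkdir -p "$base/tmp" "$base/raw_files/pdf" "$base/tpcas-1"
}
```

Because `mkdir -p` creates missing parent directories and does not fail on existing ones, the helper can be re-run safely.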

Prepare the Corpus on the Host System

The classifying algorithm relies on a pool of articles that are positive and negative with respect to the classification that is to be performed, a set of papers to be classified, and other articles that will make up a 'background' of articles. All three sets are necessary to compute robust word vectors from which document vectors are built. Perform the following steps in order to prepare the corpus for processing:

  • Create the directory <full path of your classifier directory>/tmp/corpus.

mkdir -p <full path of your classifier directory>/tmp/corpus

  • Copy PDFs to <full path of your classifier directory>/tmp/corpus.
  • Execute the following commands:

cd <full path of your classifier directory>/tmp/corpus

ls > <full path of your classifier directory>/tmp/pdflist

mkdir -p <full path of your classifier directory>/raw_files/pdf/corpus

for i in $(cat <full path of your classifier directory>/tmp/pdflist); do mkdir -p <full path of your classifier directory>/raw_files/pdf/corpus/$i; mv $i <full path of your classifier directory>/raw_files/pdf/corpus/$i/.; done
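The commands above place each PDF in its own sub-directory named after the file. The same steps can be sketched as a single function that also tolerates file names containing spaces, which the `for i in $(cat ...)` loop does not. The function name `stage_corpus` is hypothetical and the sketch assumes the PDFs have already been copied to `tmp/corpus`.

```shell
#!/usr/bin/env bash
# Sketch only: stage_corpus is a hypothetical helper, not a Textpresso script.
# It mirrors the commands above: every file in <base>/tmp/corpus is moved to
# <base>/raw_files/pdf/corpus/<file name>/<file name>.
stage_corpus() {
    local base="$1" src dest f name
    src="$base/tmp/corpus"
    dest="$base/raw_files/pdf/corpus"
    if [ ! -d "$src" ]; then
        echo "no such directory: $src" >&2
        return 1
    fi
    # Keep the file list, as the original commands do.
    ls "$src" > "$base/tmp/pdflist"
    mkdir -p "$dest"
    for f in "$src"/*; do
        [ -f "$f" ] || continue      # skip directories and an unmatched glob
        name="$(basename "$f")"
        mkdir -p "$dest/$name"
        mv "$f" "$dest/$name/"
    done
}
```

Call it with the classifier directory as the only argument, for example `stage_corpus <full path of your classifier directory>`.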

Run the Docker Image

  • Start the image. A bash shell within the container will be started:

docker run -v <full path of your classifier directory>:/data/textpresso -it tpc-hmm-tpnn bash

PDF to Text Conversion

  • In the bash shell of the container, type the following command to convert PDFs to machine-readable text:

03pdf2cas.sh

  • Wait for the process to finish.

Calculate Word Embeddings and Document Vectors

The algorithm first computes word vectors (word embeddings) from the corpus. These word vectors are then used to compute one document vector for each PDF.

Word Embeddings

  • To compute word vectors type the following command in the container bash shell:

01computewordmodel.sh -m /data/textpresso/corpus.word -c /data/textpresso/tpcas-1/corpus

Document Vectors

  • Similarly, for computing the document vectors, type the following command in the container bash shell:

02createdocvectors.sh -m /data/textpresso/corpus.word.vec -c /data/textpresso/tpcas-1/corpus -d /data/textpresso/corpus.doc
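Before moving on to training, it can help to confirm that the vector files exist. The sketch below checks only the two file names that this page itself refers to (`corpus.word.vec` and `corpus.doc.lbs`); the scripts may write additional files that are not checked here. The function name `check_vector_files` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: check_vector_files is a hypothetical helper, not a Textpresso script.
# corpus.word.vec and corpus.doc.lbs are the file names referenced on this page;
# any other output of the scripts is not checked.
check_vector_files() {
    local base="${1:-/data/textpresso}" f missing=0
    for f in corpus.word.vec corpus.doc.lbs; do
        if [ ! -s "$base/$f" ]; then
            echo "missing or empty: $base/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Inside the container, `check_vector_files` with no argument checks `/data/textpresso`; on the host, pass the classifier directory instead.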

Train and Test the Model

Downloading JSON Files and Scripts

JSON files and scripts are not part of the Docker image because they change frequently. Keeping them outside the image also helps familiarize the user with the system and enables her to make changes confidently.

JSON files

  • On the host system create the json directory:

mkdir -p <full path of your classifier directory>/json

Scripts

  • On the host system create the script directory:

mkdir -p <full path of your classifier directory>/scripts

  • Download all scripts in the directory textmining.textpresso.org/tpnn/scripts and place them in the directory <full path of your classifier directory>/scripts.

  • Change the permissions for all scripts so they can be executed:

chmod 755 <full path of your classifier directory>/scripts/*.sh

Training

  • On the host system create a training and testing directory:

mkdir -p <full path of your classifier directory>/training

mkdir -p <full path of your classifier directory>/testing

  • Split the positive set into a training set and a testing set at a ratio of 1:1 or higher (training:testing). Randomize the assignment to avoid biases. Put the larger set into <full path of your classifier directory>/training/positive and the smaller set into <full path of your classifier directory>/testing/positive. The files describing the sets should consist of unique identifiers, and each identifier should uniquely (but possibly partially) match the corresponding entry in /data/textpresso/corpus.doc.lbs.
  • Repeat the last step with the negative set and put them into <full path of your classifier directory>/training/negative and <full path of your classifier directory>/testing/negative, respectively.
  • In the container shell, train the model:

/data/textpresso/scripts/mcc.train.sh /data/textpresso
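The randomized split described in the steps above has to be done on the host before the training command is run. One way to do it is sketched below. The function name `split_ids` is hypothetical, and the sketch assumes one identifier per line in the input file and that GNU coreutils `shuf` is installed.

```shell
#!/usr/bin/env bash
# Sketch only: split_ids is a hypothetical helper, not a Textpresso script.
# Assumes one identifier per line and that GNU coreutils `shuf` is available.
# Usage: split_ids <id file> <training file> <testing file> [training percent, default 50]
split_ids() {
    local ids="$1" train_out="$2" test_out="$3" pct="${4:-50}"
    local tmp total ntrain
    tmp="$(mktemp)"
    # Drop blank lines and duplicate IDs, then shuffle so the assignment is random.
    awk 'NF' "$ids" | sort -u | shuf > "$tmp"
    total="$(wc -l < "$tmp")"
    # Round up so that the training set is never the smaller one at 50 percent.
    ntrain=$(( (total * pct + 99) / 100 ))
    mkdir -p "$(dirname "$train_out")" "$(dirname "$test_out")"
    head -n "$ntrain" "$tmp" > "$train_out"
    tail -n +"$(( ntrain + 1 ))" "$tmp" > "$test_out"
    rm -f "$tmp"
}
```

For example, `split_ids all_positive <full path of your classifier directory>/training/positive <full path of your classifier directory>/testing/positive 70` would place about 70% of the IDs in the training file; `all_positive` is a hypothetical file holding the complete positive set. Run it once more for the negative set.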

Testing

  • Test the model by typing in the container shell:

/data/textpresso/scripts/mcc.test.sh /data/textpresso

Snapshot of MCC Test Run

The second and third lines from the bottom of the screenshot describe the performance of the model. All negative samples are identified as negative with an average percentage of 90.75%. All positive samples are identified as positive with an average percentage of 96.46%.

Tuning

Sometimes the performance of the model is not satisfactory. There are two remedies to try: get a better training set, or tune some of the parameters the algorithm needs. The relevant parameters are set in /data/textpresso/json/mcc_train_template.json and are:

beta

The algorithm includes L2 regularization. beta (called lambda in other treatments of this subject) is the coefficient by which the sum of the squared weights is multiplied before being added to the objective function. It is used to prevent overfitting. With beta=0 there is no penalty, so the model fits the training data as closely as it can and may overfit. A large beta causes underfitting and renders the model useless, i.e., the true positive rate and true negative rate will both be around 50%.
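Written out, this description corresponds to an objective of the following form. This is a sketch based only on the wording above: the symbols $E$, $E_0$, and $w_i$ are introduced here for illustration, and the exact normalization used by the framework (for example, a factor of 1/2 in front of the sum) is not specified on this page.

```latex
E(\mathbf{w}) = E_0(\mathbf{w}) + \beta \sum_i w_i^2
```

Here $E_0$ is the unregularized objective and the $w_i$ are the network weights. With $\beta = 0$ the penalty vanishes and $E = E_0$; as $\beta$ grows, the penalty term dominates and pushes all weights toward zero.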

configuration

The neural network consists of an input layer of 200 neurons (as the document vectors are 200-dimensional) and an output layer containing two neurons (for the two classes 'positive' and 'negative'). In between those two layers are hidden layers, and the configuration string describes their layout. For example, "100" sets one hidden layer with 100 neurons, connected to the outputs of the 200 neurons of the input layer on one side and to the inputs of the 2 neurons of the output layer on the other. "100 50" sets a layer of 100 neurons that is connected to the outputs of the 200 neurons of the input layer on one side and to 50 neurons on the other side (the second hidden layer). The outputs of the second hidden layer (50 neurons) are then automatically connected to the inputs of the 2 neurons of the output layer. And so on.
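To get a feel for how a configuration string changes the size of the network, the layer sizes can be expanded and the connections counted. The sketch below is illustrative only: both function names are hypothetical, and it assumes that consecutive layers are fully connected and counts weights without bias terms.

```shell
#!/usr/bin/env bash
# Sketch only: both functions are hypothetical helpers, not part of the framework.
# They assume fully connected layers, a 200-neuron input layer and a
# 2-neuron output layer, as described above.

# Print all layer sizes, from input to output, for a configuration string.
layer_chain() {
    # $1 is intentionally unquoted so the shell splits it on whitespace.
    echo 200 $1 2
}

# Count the connections (weights, excluding bias terms) between consecutive layers.
count_weights() {
    local prev="" size total=0
    for size in $(layer_chain "$1"); do
        if [ -n "$prev" ]; then
            total=$(( total + prev * size ))
        fi
        prev="$size"
    done
    echo "$total"
}

layer_chain "100 50"     # prints: 200 100 50 2
count_weights "100 50"   # prints: 25100 (200*100 + 100*50 + 50*2)
```

A single hidden layer "100" gives 200*100 + 100*2 = 20200 weights under the same assumptions, so adding the second layer of 50 neurons increases the count only moderately.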

maximal number of iterations and tolerance

The algorithm stops training when either a maximal number of iterations is reached or the value of the objective function drops below a certain threshold (called tolerance here). Set your desired values with these two parameters.

Classify New Papers

Preparing the data

  • On the host system create a predicting and output directory:

mkdir -p <full path of your classifier directory>/predicting

mkdir -p <full path of your classifier directory>/output

  • Put the IDs of the papers to be classified into <full path of your classifier directory>/predicting/new_papers. The file should consist of unique identifiers, and each identifier should uniquely (but possibly partially) match the corresponding entry in /data/textpresso/corpus.doc.lbs.
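The unique-match requirement can be checked before running the prediction. The sketch below is hedged accordingly: the function name `check_ids` is hypothetical, and it assumes one identifier per line in the ID file and one entry per line in corpus.doc.lbs, which this page does not confirm.

```shell
#!/usr/bin/env bash
# Sketch only: check_ids is a hypothetical helper, not a Textpresso script.
# Assumes one identifier per line in the ID file and one entry per line
# in corpus.doc.lbs; the actual file layout may differ.
# Usage: check_ids <id file> <path to corpus.doc.lbs>
check_ids() {
    local ids="$1" lbs="$2" id n bad=0
    if [ ! -r "$lbs" ]; then
        echo "cannot read: $lbs" >&2
        return 2
    fi
    while IFS= read -r id; do
        [ -n "$id" ] || continue
        # Count entries that contain the ID as a fixed substring (partial match).
        n="$(grep -c -F -- "$id" "$lbs" || true)"
        if [ "$n" -ne 1 ]; then
            echo "$id matches $n entries"
            bad=1
        fi
    done < "$ids"
    return "$bad"
}
```

On the host, the second argument would be <full path of your classifier directory>/corpus.doc.lbs, which is the file the container sees as /data/textpresso/corpus.doc.lbs. The function prints every ID that matches zero or several entries and returns a non-zero status if it finds any. The same check applies to the training and testing ID files.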

Classifying

  • In the container shell, run:

/data/textpresso/scripts/mcc.predict.sh /data/textpresso

Snapshot of MCC Predict Run

In each line the predictions are sorted according to probabilities. For example, the last line (WBPaper00060155) predicts the paper to be negative with a 100% probability. For this run, there is only one paper classified as positive, WBPaper00035486 with a probability of 99.9995%. For each paper the probabilities of being identified as positive and negative are also stored in /data/textpresso/output/positive and /data/textpresso/output/negative, respectively.

What if I Don't Have a Negative Training Set?

We are currently working on a procedure to generate a negative training set if only a positive training set is available. The results will not be as good as with a model that has been trained with a given positive and negative set, but they are still useful. Contact us at textpresso (at) caltech (dot) edu for more info.