Classifying Papers with Neural Networks
These instructions will show you how to classify papers using the Textpresso Neural Networks framework. If any instructions are unclear, feel free to drop us an email at textpresso (at) caltech (dot) edu.
- Go to textmining.textpresso.org/tpnn/images, choose the version directory, and download the image tpc-hmm-tpnn.tgz.
- Load the Docker image:
docker load < tpc-hmm-tpnn.tgz
- Make your working directory and some sub-directories therein:
mkdir -p <full path of your classifier directory>
mkdir -p <full path of your classifier directory>/tmp
mkdir -p <full path of your classifier directory>/raw_files/pdf
mkdir -p <full path of your classifier directory>/tpcas-1
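To avoid retyping the full path in every command, it can help to keep it in a shell variable. This is purely a convenience sketch; CLASSDIR is an arbitrary name, not something the framework requires:

```shell
# CLASSDIR is an arbitrary convenience variable, not part of the framework.
# Replace /tmp/classifier with the full path of your classifier directory.
export CLASSDIR=/tmp/classifier

mkdir -p "$CLASSDIR"/tmp
mkdir -p "$CLASSDIR"/raw_files/pdf
mkdir -p "$CLASSDIR"/tpcas-1
```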
The classifying algorithm relies on a pool of articles that are positive and negative with respect to the classification that is to be performed, a set of papers to be classified, and other articles that will make up a 'background' of articles. All three sets are necessary to compute robust word vectors from which document vectors are built. Perform the following steps in order to prepare the corpus for processing:
- Create the directory <full path of your classifier directory>/tmp/corpus:
mkdir -p <full path of your classifier directory>/tmp/corpus
- Copy PDFs to <full path of your classifier directory>/tmp/corpus.
- Execute the following commands:
cd <full path of your classifier directory>/tmp/corpus
ls > <full path of your classifier directory>/tmp/pdflist
mkdir -p <full path of your classifier directory>/raw_files/pdf/corpus
for i in $(cat <full path of your classifier directory>/tmp/pdflist); do mkdir -p <full path of your classifier directory>/raw_files/pdf/corpus/$i; mv $i <full path of your classifier directory>/raw_files/pdf/corpus/$i/.; done
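The for-loop above breaks if a PDF filename contains spaces. A sketch of the same step using a while-read loop instead, assuming the directory layout from the previous commands (the base path and filenames here are demo placeholders):

```shell
#!/bin/sh
# Demo setup: a mock corpus with one awkward filename (hypothetical names).
BASE=/tmp/tpnn_demo
mkdir -p "$BASE"/tmp/corpus "$BASE"/raw_files/pdf/corpus
touch "$BASE"/tmp/corpus/"paper one.pdf" "$BASE"/tmp/corpus/paper2.pdf

cd "$BASE"/tmp/corpus
ls > "$BASE"/tmp/pdflist

# One sub-directory per PDF, as in the original loop, but safe for spaces.
while IFS= read -r i; do
    mkdir -p "$BASE"/raw_files/pdf/corpus/"$i"
    mv "$i" "$BASE"/raw_files/pdf/corpus/"$i"/
done < "$BASE"/tmp/pdflist
```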
- Start the image. A bash shell within the container will be started:
docker run -v <full path of your classifier directory>:/data/textpresso -it tpc-hmm-tpnn bash
- In the bash shell of the container, type the following command to convert PDFs to machine-readable text:
03pdf2cas.sh
- Wait for the process to finish.
The algorithm first uses word embeddings to calculate word vectors. These are then used to compute a document vector for each PDF.
- To compute word vectors type the following command in the container bash shell:
01computewordmodel.sh -m /data/textpresso/corpus.word -c /data/textpresso/tpcas-1/corpus
- Similarly, for computing the document vectors, type the following command in the container bash shell:
02createdocvectors.sh -m /data/textpresso/corpus.word.vec -c /data/textpresso/tpcas-1/corpus -d /data/textpresso/corpus.doc
JSON files and scripts are not part of the Docker image as they change frequently. This also helps familiarize users with the system and enables them to make changes confidently.
- On the host system create the json directory:
mkdir -p <full path of your classifier directory>/json
- Download all JSON files in the directory textmining.textpresso.org/tpnn/json and place them in the directory <full path of your classifier directory>/json.
- On the host system create the script directory:
mkdir -p <full path of your classifier directory>/scripts
- Download all scripts in the directory textmining.textpresso.org/tpnn/scripts and place them in the directory <full path of your classifier directory>/scripts.
- Change the permissions for all scripts so they can be executed:
chmod 755 <full path of your classifier directory>/scripts/*.sh
- On the host system create a training and a testing directory:
mkdir -p <full path of your classifier directory>/training
mkdir -p <full path of your classifier directory>/testing
- Split the positive set into a training and a testing set at a 1:1 or higher ratio, and make sure the sets are randomized (to avoid biases). Put the larger set into <full path of your classifier directory>/training/positive and the smaller set into <full path of your classifier directory>/testing/positive. The files describing the sets should consist of unique identifiers that uniquely (but possibly partially) match the corresponding entries in /data/textpresso/corpus.doc.lbs.
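The randomized split described above can be sketched with shuf. The file names here (positive_ids, shuffled, the demo base path) are hypothetical; the framework only needs the final training/positive and testing/positive files:

```shell
#!/bin/sh
# Hypothetical input: positive_ids, one unique identifier per line.
BASE=/tmp/split_demo
mkdir -p "$BASE"/training "$BASE"/testing
printf 'WBPaper%05d\n' 1 2 3 4 5 6 7 8 > "$BASE"/positive_ids

# Shuffle, then send the first (larger) half to training and the rest to testing.
shuf "$BASE"/positive_ids > "$BASE"/shuffled
n=$(wc -l < "$BASE"/positive_ids)
half=$(( (n + 1) / 2 ))
head -n "$half" "$BASE"/shuffled > "$BASE"/training/positive
tail -n +"$((half + 1))" "$BASE"/shuffled > "$BASE"/testing/positive
```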
- Repeat the last step with the negative set and put the files into <full path of your classifier directory>/training/negative and <full path of your classifier directory>/testing/negative, respectively.
- In the container shell, train the model:
/data/textpresso/scripts/mcc.train.sh /data/textpresso
- Test the model by typing in the container shell:
/data/textpresso/scripts/mcc.test.sh /data/textpresso
The second and third lines from the bottom of the test output describe the performance of the model. In this example, all negative samples are identified as negative with an average percentage of 90.75%, and all positive samples are identified as positive with an average percentage of 96.46%.
Sometimes the performance of the model is not satisfactory. There are two remedies that can be tried: get a better training set, or tune some of the parameters the algorithm needs. The relevant parameters are set in /data/textpresso/json/mcc_train_template.json and are:
The algorithm includes L2 regularization. beta (called lambda in other treatments of this subject) is the coefficient by which the sum of the squared weights is multiplied before being added to the objective function. It is used to prevent overfitting: beta=0 can cause overfitting, as the model is then fit as closely as possible, while a large beta causes underfitting and renders the model useless, i.e., the true positive rate and true negative rate drop to around 50%.
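In symbols (the notation here is ours, not taken from the configuration file): with J the unregularized objective and w the weights, the regularized objective is

```latex
J_{\mathrm{reg}}(w) = J(w) + \beta \sum_i w_i^2
```

so beta = 0 recovers the plain objective, and larger beta penalizes large weights more strongly.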
The neural network consists of an input layer of 200 neurons (as the document vectors are 200-dimensional) and an output layer of two neurons (for the two classes 'positive' and 'negative'). In between those two layers are the hidden layers, described by the configuration string. For example, "100" sets one hidden layer of 100 neurons, connected to the outputs of the 200 input-layer neurons on one side and to the inputs of the 2 output-layer neurons on the other. "100 50" sets a layer of 100 neurons connected to the outputs of the 200 input-layer neurons on one side and to 50 neurons (the second hidden layer) on the other; the outputs of the second hidden layer are then automatically connected to the inputs of the 2 output-layer neurons. And so on.
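For the configuration string "100 50", the implied weight-matrix shapes can be written as (our notation, with x the 200-dimensional input and y the 2-dimensional output):

```latex
x \in \mathbb{R}^{200}
\xrightarrow{\;W_1 \in \mathbb{R}^{100 \times 200}\;} h_1 \in \mathbb{R}^{100}
\xrightarrow{\;W_2 \in \mathbb{R}^{50 \times 100}\;} h_2 \in \mathbb{R}^{50}
\xrightarrow{\;W_3 \in \mathbb{R}^{2 \times 50}\;} y \in \mathbb{R}^{2}
```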
The algorithm stops training when either a maximal number of iterations is reached or the value of the objective function drops below a certain threshold (called tolerance here). Set your desired values with these two parameters.
- On the host system create a predicting and an output directory:
mkdir -p <full path of your classifier directory>/predicting
mkdir -p <full path of your classifier directory>/output
- Put the IDs of the papers to be classified into <full path of your classifier directory>/predicting/new_papers. The file should consist of unique identifiers that uniquely (but possibly partially) match the corresponding entries in /data/textpresso/corpus.doc.lbs.
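Since each identifier must match exactly one entry in corpus.doc.lbs, a quick pre-flight check can save a failed run. This is a sketch with hypothetical demo files; in practice, point it at your own new_papers and /data/textpresso/corpus.doc.lbs:

```shell
#!/bin/sh
# Demo files; the identifiers and label format are illustrative only.
BASE=/tmp/idcheck_demo
mkdir -p "$BASE"
printf 'WBPaper00035486\nWBPaper00060155\n' > "$BASE"/new_papers
printf 'corpus/WBPaper00035486/main\ncorpus/WBPaper00060155/main\ncorpus/WBPaper00012345/main\n' > "$BASE"/corpus.doc.lbs

# Warn about identifiers that match zero or more than one corpus entry.
while IFS= read -r id; do
    hits=$(grep -c "$id" "$BASE"/corpus.doc.lbs)
    [ "$hits" -eq 1 ] || echo "WARNING: $id matched $hits entries"
done < "$BASE"/new_papers
```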
- In the container shell, run:
/data/textpresso/scripts/mcc.predict.sh /data/textpresso
In the output, the predictions on each line are sorted by probability. For example, the last line (WBPaper00060155) predicts the paper to be negative with a probability of 100%. For this run, only one paper is classified as positive, WBPaper00035486, with a probability of 99.9995%. For each paper, the probabilities of being identified as positive and negative are also stored in /data/textpresso/output/positive and /data/textpresso/output/negative, respectively.
We are currently working on a procedure to generate a negative training set if only a positive training set is available. The results will not be as good as with a model that has been trained with a given positive and negative set, but they are still useful. Contact us at textpresso (at) caltech (dot) edu for more info.
© 2024 Hans-Michael Müller