
Dockerized TPC Documentation


Build Images

Build ubuntu-tpc-hmm

  1. Clone the libtpc repository.
  2. Enter the libtpc directory and switch to the hmm branch.
  3. Build the image.

docker build --no-cache -t ubuntu-tpc-hmm .
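
If the build succeeds, the new image shows up in the local image list; a quick sanity check:

docker images | grep ubuntu-tpc-hmm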

Build tpc-full-hmm and tpc-lite-hmm

  1. Clone the docker-tpc-hmm repository.
  2. Clone the textpressocentral, tpctools and textpressoapi repositories. For all three repositories, switch to branch hmm.
  3. Enter the docker-tpc-hmm directory and build the tpc-full-hmm image.

docker build -f Dockerfile-full -t tpc-full-hmm .

  4. Edit build-lite.sh to point to the correct directories for the cloned repositories textpressocentral, tpctools and textpressoapi.
  5. Build the tpc-lite-hmm image.

./build-lite.sh tpc-lite-hmm .
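
What the edit in step 4 amounts to depends on the version of build-lite.sh; a purely illustrative sketch, with hypothetical variable names:

# hypothetical excerpt of build-lite.sh; use the actual variable names in the script
TEXTPRESSOCENTRAL_DIR=/path/to/textpressocentral
TPCTOOLS_DIR=/path/to/tpctools
TEXTPRESSOAPI_DIR=/path/to/textpressoapi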

Build a Site

Start a tpc-full-hmm Instance

  1. Enter the docker-tpc-hmm directory.
  2. Edit run_tpc_full.sh to point to the correct directories for the cloned repositories textpressocentral, tpctools and textpressoapi.
  3. Start the instance by typing

./run_tpc_full.sh <data directory> <port for website> <port for api>
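
For example, with the data tree under /data/textpresso and two illustrative host ports (any free ports work):

./run_tpc_full.sh /data/textpresso 8080 8081

If the script starts a container, as its name suggests, docker ps should then list it.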

Build and Install Software, Start Postgres and Load www-data Database, Populate Database with Obofiles

  1. Build and install the software by running:

sudo su

/root/run.sh -t

  2. Make sure that the file /data/textpresso/postgres/www-data.tar.gz is present. Then start Postgres and load the www-data database by running:

/root/run.sh -p
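
To verify the precondition of step 2, for example:

ls -lh /data/textpresso/postgres/www-data.tar.gz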

  3. Make sure obofiles exist in /data/textpresso/obofiles4production/ and /data/textpresso/oboheaderfiles/. Then populate the database by running:

/root/run.sh -l
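
A quick way to check that both obofile directories are populated (paths as in step 3):

ls /data/textpresso/obofiles4production/ | head

ls /data/textpresso/oboheaderfiles/ | head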

Run Pipeline

  1. Download C. elegans PDFs: Make sure that /data/textpresso/raw_files and /data/textpresso/tmp exist and type

01downloadpdfs.sh &>/data/textpresso/tmp/01.out

  2. Download NXMLs from PMCOA: Make sure that /data/textpresso/raw_files and /data/textpresso/tmp exist and type

02downloadxmls.sh &>/data/textpresso/tmp/02.out

  3. Convert PDFs to tpcas-1 files:

03pdf2cas.sh &>/data/textpresso/tmp/03.out

Because of the way batch jobs are run, and as faulty PDFs might cause a segmentation fault that cannot be caught, this step needs to be repeated until no more PDFs are converted.
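
A minimal sketch of that repetition, assuming converted files accumulate under /data/textpresso/tpcas-1 (adjust the path to the actual layout):

# rerun 03pdf2cas.sh until the number of converted files stops growing
while true; do
    before=$(find /data/textpresso/tpcas-1 -type f | wc -l)
    03pdf2cas.sh &>>/data/textpresso/tmp/03.out
    after=$(find /data/textpresso/tpcas-1 -type f | wc -l)
    [ "$after" -eq "$before" ] && break
done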

The PDFs that caused a segmentation fault can be converted and processed by other means, i.e., through plain-text conversion. To do that, type

03catch.non-conv.pdfs.4.cas1.sh &>/data/textpresso/tmp/03a.out

  4. Convert NXMLs to tpcas-1 files:

04xml2cas.sh &>/data/textpresso/tmp/04.out

Because of the way batch jobs are run, and as faulty NXMLs might cause a segmentation fault that cannot be caught, this step needs to be repeated until no more NXMLs are converted. This also applies if an article2cas process runs away and needs to be killed.
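
A runaway article2cas process can be terminated by name, for example:

pkill -f article2cas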

  5. Add images from PMCOA to tpcas-1 files:

05addimages2cas1.sh &>/data/textpresso/tmp/05.out

  6. Perform lexical markup of tpcas-1 files, resulting in tpcas-2 files:

07cas1tocas2.sh &>/data/textpresso/tmp/07.out

Check for completeness by comparing tpcas-1 and tpcas-2 files.
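
One simple check, assuming the two trees mirror each other, is to compare file counts:

find /data/textpresso/tpcas-1 -type f | wc -l

find /data/textpresso/tpcas-2 -type f | wc -l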

  7. Get bibliographical information for the PMCOA corpus:

09getpmcoabib.sh &>/data/textpresso/tmp/09.out

  8. Get bibliographical information for the C. elegans corpus:

10getcelegansbib.sh &>/data/textpresso/tmp/10.out

  9. Convert the images extracted from PDF files:

11invertimages.sh &>/data/textpresso/tmp/11.out

  10. Index tpcas-2 files:

12index.sh &>/data/textpresso/tmp/12.out

Check for segmentation faults. If they occur, remove tpcas-2 files that cause them.
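
To spot them, grep the indexing log, for example:

grep -in 'segmentation fault' /data/textpresso/tmp/12.out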

Start Web Services

/root/run.sh -w
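
Once the services are up, a quick reachability check against the website port chosen when the instance was started:

curl -sI http://localhost:<port for website>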

Build NN Classifier Models

This step only has to be done once. Reuse the models until they are out of date.

Compile and Install tpneuralnet

cd /data/textpresso/tpneuralnets/

mkdir build && cd build

cmake -DCMAKE_BUILD_TYPE=Release ..

make -j 8 && make install

Compile and Install wordembeddings

cd /data/textpresso/wordembeddings/

mkdir build && cd build

cmake -DCMAKE_BUILD_TYPE=Release ..

make -j 8 && make install

Compute Word Vectors

mkdir -p /data/textpresso/classifiers/nn/tpcas-1/

cd /data/textpresso/classifiers/nn/

rsync -av --exclude 'images' /data/textpresso/tpcas-1/C.\ elegans tpcas-1/.

01computeceleganswordmodel.sh &>../../tmp/01computeceleganswordmodel.out

Compute Document Vectors

02createcelegansdocvectors.sh &>../../tmp/02createcelegansdocvectors.out

Make Models

  1. Make directories for training sets and models

mkdir models sets4makingmodels

  2. Deposit lists of paper IDs that serve as positive training examples into sets4makingmodels, one for each model. The file name of each list will be the model name.

  3. Edit a list of paper IDs that serve as negative training examples and save it as negative.list in /data/textpresso/classifiers/nn/. This list serves as the negative training set for all models that are to be trained.

  4. Make a json directory and deposit a template file

mkdir json

Then edit a file named json/crossvalidate_mm_template.json. An example file is as follows:

{ "task" : "crossvalidate", "document model" : "WRKDIR/celegans.doc", "class 1 list" : "WRKDIR/list1", "class 2 list" : "WRKDIR/list2", "model name" : "WRKDIR/model", "cross validation factor" : 5, "number of iterations" : 1, "nn configuration" : "23 11" }.

  5. Compute models

tpnn-makemodels-high-recall.sh /data/textpresso/classifiers/nn &> ../../tmp/tmhr.out &

Run NN Classifiers

  1. Update the directory of papers that are to be classified:

cd /data/textpresso/classifiers/nn/tpcas-1

rsync -av --exclude 'images' ../../../tpcas-1/C.\ elegans .

  2. Create document vectors for updated papers:

02createcelegansdocvectors.sh &>../../../tmp/02createcelegansdocvectors.out &

  3. Make an (incremental) list of new papers to be classified:

03makelist.sh

  4. Edit a file named /data/textpresso/classifiers/nn/json/predict_pr_template.json. An example file is as follows:

{ "task" : "predict", "document model" : "WRKDIR/celegans.doc", "document list" : "WRKDIR/pool4predictions", "model name" : "WRKDIR/model" }.

  5. Classify papers:

mkdir /data/textpresso/classifiers/nn/predictions

04classify.sh &>../../../tmp/04classify.out &

  6. Make HTML Pages for WormBase Curators:

cd /data/textpresso/classifiers/nn

mkdir results

makehtmls.sh predictions results

rsync -av /data/textpresso/classifiers/nn/results/ /data/textpresso/classifiers/NNClassification/.