
Dockerized TPC Documentation

goldturtle edited this page Feb 9, 2021 · 42 revisions

Build Images

Build ubuntu-tpc-hmm

  1. Clone the libtpc repository.
  2. Enter the libtpc directory and switch to the hmm branch.
  3. Build the image:

docker build --no-cache -t ubuntu-tpc-hmm .
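Taken together, the three steps can be sketched as a single script. The repository URL is not given above, so it is left as a placeholder (LIBTPC_REPO_URL), and the commands are only printed unless DRY_RUN=0:

```shell
#!/bin/bash
# Sketch of the ubuntu-tpc-hmm build. LIBTPC_REPO_URL is a placeholder for
# the real libtpc repository URL. Commands are printed by default; set
# DRY_RUN=0 to execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"          # dry run: show the command
    else
        "$@"                 # real run: execute it
    fi
}

run git clone "${LIBTPC_REPO_URL:-<libtpc-repo-url>}" libtpc
run cd libtpc
run git checkout hmm
run docker build --no-cache -t ubuntu-tpc-hmm .
```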

Build tpc-full-hmm and tpc-lite-hmm

  1. Clone the docker-tpc-hmm repository.
  2. Clone the textpressocentral, tpctools and textpressoapi repositories. For all three repositories, switch to branch hmm.
  3. Enter the docker-tpc-hmm directory and build the tpc-full-hmm image:

docker build -f Dockerfile-full -t tpc-full-hmm .

  4. Edit build-lite.sh to point to the correct directories for the cloned repositories textpressocentral, tpctools and textpressoapi.
  5. Build the tpc-lite-hmm image:

./build-lite.sh tpc-lite-hmm .
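Switching all three repositories to the hmm branch (step 2 above) can be scripted. The sketch below assumes the repositories sit side by side under $SRC, which is an assumed layout:

```shell
#!/bin/bash
# Switch textpressocentral, tpctools and textpressoapi to the hmm branch.
# SRC is an assumption: the directory the three repositories were cloned into.
SRC=${SRC:-$PWD}
TPC_REPOS="textpressocentral tpctools textpressoapi"
for repo in $TPC_REPOS; do
    if [ -d "$SRC/$repo/.git" ]; then
        git -C "$SRC/$repo" checkout hmm
    else
        echo "warning: $SRC/$repo is not a git checkout; clone it first" >&2
    fi
done
```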

Build a Site

Start a tpc-full-hmm Instance

  1. Enter the docker-tpc-hmm directory.
  2. Edit run_tpc_full.sh to point to the correct directories for the cloned repositories textpressocentral, tpctools and textpressoapi.
  3. Start the instance by typing:

./run_tpc_full.sh <data directory> <port for website> <port for api>
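For example, with the data under /data/textpresso, the website on port 8080 and the API on port 18080 (all three values are illustrative, not requirements):

```shell
#!/bin/bash
# Illustrative values only; substitute your own data directory and ports.
DATA_DIR=/data/textpresso
WEB_PORT=8080
API_PORT=18080
if [ -x ./run_tpc_full.sh ]; then
    ./run_tpc_full.sh "$DATA_DIR" "$WEB_PORT" "$API_PORT"
else
    echo "run this from the docker-tpc-hmm directory" >&2
fi
```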

Build and Install Software, Start Postgres and Load www-data Database, Populate Database with Obofiles

  1. Build and install the software by running:

sudo su

/root/run.sh -t

  2. Make sure that the file /data/textpresso/postgres/www-data.tar.gz is present. Then start Postgres and load the www-data database by running:

/root/run.sh -p

  3. Make sure obofiles exist in /data/textpresso/obofiles4production/ and /data/textpresso/oboheaderfiles/. Then populate the database by running:

/root/run.sh -l
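The three setup stages above run in order inside the container as root, so they can be chained; the sketch below also checks for the www-data dump first (the path is the one given above):

```shell
#!/bin/bash
# Chain the three setup stages (-t, -p, -l). Run as root inside the
# container; the later stages depend on the earlier ones succeeding.
DUMP=/data/textpresso/postgres/www-data.tar.gz
if [ ! -x /root/run.sh ]; then
    echo "run this inside the tpc-full-hmm container" >&2
elif [ ! -e "$DUMP" ]; then
    echo "missing $DUMP; provide it before running -p" >&2
else
    /root/run.sh -t && /root/run.sh -p && /root/run.sh -l
fi
```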

Run Pipeline

  1. Download C. elegans PDFs: Make sure that /data/textpresso/raw_files and /data/textpresso/tmp exist and type

01downloadpdfs.sh &>/data/textpresso/tmp/01.out

  2. Download NXMLs from PMCOA: Make sure that /data/textpresso/raw_files and /data/textpresso/tmp exist and type

02downloadxmls.sh &>/data/textpresso/tmp/02.out

  3. Convert PDFs to tpcas-1 files:

03pdf2cas.sh &>/data/textpresso/tmp/03.out

Because of the way batch jobs are run, and as faulty PDFs might cause a segmentation fault that cannot be caught, this step needs to be repeated until no more PDFs are converted.

  4. Convert NXMLs to tpcas-1 files:

04xml2cas.sh &>/data/textpresso/tmp/04.out

Because of the way batch jobs are run, and as faulty NXMLs might cause a segmentation fault that cannot be caught, this step needs to be repeated until no more NXMLs are converted.
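The "repeat until nothing new is converted" rule for steps 03 and 04 can be automated by counting output files between passes. A sketch for step 04 (the same loop works for 03 with the script and log names swapped; the *.tpcas* name pattern is an assumption about the output files):

```shell
#!/bin/bash
# Re-run 04xml2cas.sh until a full pass produces no new tpcas-1 files.
count_cas() { find /data/textpresso/tpcas-1 -type f -name '*.tpcas*' 2>/dev/null | wc -l; }
prev=-1
cur=$(count_cas)
while [ "$cur" != "$prev" ]; do
    04xml2cas.sh &>/data/textpresso/tmp/04.out || true
    prev=$cur
    cur=$(count_cas)
done
echo "converged at $cur tpcas-1 files"
```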

  5. Add images from PMCOA to tpcas-1 files:

05addimages2cas1.sh &>/data/textpresso/tmp/05.out

  6. Perform lexical markup of tpcas-1 files, resulting in tpcas-2 files:

07cas1tocas2.sh &>/data/textpresso/tmp/07.out

Check for completeness by comparing tpcas-1 and tpcas-2 files.
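One way to do that comparison is by file count per tree. This assumes the two trees mirror each other and that tpcas files match the *.tpcas* pattern:

```shell
#!/bin/bash
# Compare file counts between the tpcas-1 and tpcas-2 trees.
t1=$(find /data/textpresso/tpcas-1 -type f -name '*.tpcas*' 2>/dev/null | wc -l)
t2=$(find /data/textpresso/tpcas-2 -type f -name '*.tpcas*' 2>/dev/null | wc -l)
echo "tpcas-1: $t1 files, tpcas-2: $t2 files"
if [ "$t1" -ne "$t2" ]; then
    echo "incomplete: re-run 07cas1tocas2.sh for the missing files" >&2
fi
```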

  7. Get bibliographical information for the PMCOA corpus:

09getpmcoabib.sh &>/data/textpresso/tmp/09.out

  8. Get bibliographical information for the C. elegans corpus:

10getcelegansbib.sh &>/data/textpresso/tmp/10.out

  9. Invert the images extracted from PDF files:

11invertimages.sh &>/data/textpresso/tmp/11.out

  10. Index tpcas-2 files:

12index.sh &>/data/textpresso/tmp/12.out

Check for segmentation faults. If they occur, remove tpcas-2 files that cause them.
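Finding the offending files by hand can be tedious; truncated archives are a common cause of indexer crashes, so a first pass might quarantine any tpcas-2 file that fails an integrity check. The .tpcas.gz layout and gzip compression are assumptions about the corpus:

```shell
#!/bin/bash
# Quarantine tpcas-2 archives that fail gzip's integrity test, then
# re-run 12index.sh on the remaining files.
QUARANTINE=/data/textpresso/tmp/bad-tpcas-2
mkdir -p "$QUARANTINE" 2>/dev/null || true
find /data/textpresso/tpcas-2 -type f -name '*.tpcas.gz' 2>/dev/null |
while read -r f; do
    gzip -t "$f" 2>/dev/null || mv -v "$f" "$QUARANTINE"/
done
```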

Start Web Services

/root/run.sh -w

Build NN Classifier Models

This step only has to be done once. Reuse the models until they are out of date.

Compile and Install tpneuralnet

cd /data/textpresso/tpneuralnets/

mkdir build && cd build

cmake -DCMAKE_BUILD_TYPE=Release ..

make -j 8 && make install

Compile and Install wordembeddings

cd /data/textpresso/wordembeddings/

mkdir build && cd build

cmake -DCMAKE_BUILD_TYPE=Release ..

make -j 8 && make install

Compute Word Vectors

mkdir -p /data/textpresso/classifiers/nn/tpcas-1/

cd /data/textpresso/classifiers/nn/

rsync -av --exclude 'images' /data/textpresso/tpcas-1/C.\ elegans tpcas-1/.

01computeceleganswordmodel.sh &>../../tmp/01computeceleganswordmodel.out

Compute Document Vectors

02createcelegansdocvectors.sh &>../../tmp/02createcelegansdocvectors.out

Run NN Classifiers

  1. rsync -av --exclude 'images' ../../../tpcas-1/C.\ elegans
  2. 02createcelegansdocvectors.sh
  3. 03makelist.sh
  4. 04classify.sh
  5. makehtmls.sh predictions results

Rsync with textpressocentral.org

  1. rsync -av --delete-after celeganstpc/ textpressocentral.org:/data/celeganstpc/
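Because --delete-after removes remote files, it is worth previewing the transfer first. The sketch below wraps the command so that -n (dry run) can be passed once before the real sync:

```shell
#!/bin/bash
# Wrap the sync so it can be previewed. rsync -n lists what would change,
# including deletions from --delete-after, without transferring anything.
sync_site() {
    rsync "$@" -av --delete-after celeganstpc/ textpressocentral.org:/data/celeganstpc/
}
# sync_site -n    # preview the transfer
# sync_site       # real transfer
```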