
Dockerized TPC Documentation

goldturtle edited this page Feb 9, 2021 · 42 revisions

Build Images

Build ubuntu-tpc-hmm

  1. Clone the libtpc repository.
  2. Enter the libtpc directory and switch to the hmm branch.
  3. Build the image:

docker build --no-cache -t ubuntu-tpc-hmm .
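Taken together, the three steps can be sketched as a single script. The repository URL is not given above, so it is left as a placeholder (LIBTPC_REPO_URL), and the commands are only printed unless DRY_RUN=0:

```shell
#!/bin/bash
# Sketch of the ubuntu-tpc-hmm build. LIBTPC_REPO_URL is a placeholder for
# the real libtpc repository URL. Commands are printed by default; set
# DRY_RUN=0 to execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"          # dry run: show the command
    else
        "$@"                 # real run: execute it
    fi
}

run git clone "${LIBTPC_REPO_URL:-<libtpc-repo-url>}" libtpc
run cd libtpc
run git checkout hmm
run docker build --no-cache -t ubuntu-tpc-hmm .
```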

Build tpc-full-hmm and tpc-lite-hmm

  1. Clone the docker-tpc-hmm repository.
  2. Clone the textpressocentral, tpctools and textpressoapi repositories. For all three repositories, switch to branch hmm.
  3. Enter the docker-tpc-hmm directory and build the tpc-full-hmm image:

docker build -f Dockerfile-full -t tpc-full-hmm .

  4. Edit build-lite.sh to point to the correct directories for the cloned repositories textpressocentral, tpctools and textpressoapi.
  5. Build the tpc-lite-hmm image:

./build-lite.sh tpc-lite-hmm .
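Switching all three repositories to the hmm branch (step 2 above) can be scripted. The sketch below assumes the repositories sit side by side under $SRC, which is an assumed layout:

```shell
#!/bin/bash
# Switch textpressocentral, tpctools and textpressoapi to the hmm branch.
# SRC is an assumption: the directory the three repositories were cloned into.
SRC=${SRC:-$PWD}
TPC_REPOS="textpressocentral tpctools textpressoapi"
for repo in $TPC_REPOS; do
    if [ -d "$SRC/$repo/.git" ]; then
        git -C "$SRC/$repo" checkout hmm
    else
        echo "warning: $SRC/$repo is not a git checkout; clone it first" >&2
    fi
done
```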

Build a Site

Start a tpc-full-hmm Instance

  1. Enter the docker-tpc-hmm directory.
  2. Edit run_tpc_full.sh to point to the correct directories for the cloned repositories textpressocentral, tpctools and textpressoapi.
  3. Start the instance by typing:

./run_tpc_full.sh <data directory> <port for website> <port for api>
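For example, with the data under /data/textpresso, the website on port 8080 and the API on port 18080 (all three values are illustrative, not requirements):

```shell
#!/bin/bash
# Illustrative values only; substitute your own data directory and ports.
DATA_DIR=/data/textpresso
WEB_PORT=8080
API_PORT=18080
if [ -x ./run_tpc_full.sh ]; then
    ./run_tpc_full.sh "$DATA_DIR" "$WEB_PORT" "$API_PORT"
else
    echo "run this from the docker-tpc-hmm directory" >&2
fi
```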

Build and Install Software, Start Postgres and Load www-data Database, Populate Database with Obofiles

  1. Build and install the software by running:

sudo su

/root/run.sh -t

  2. Make sure that the file /data/textpresso/postgres/www-data.tar.gz is present. Then start Postgres and load the www-data database by running:

/root/run.sh -p

  3. Make sure obofiles exist in /data/textpresso/obofiles4production/ and /data/textpresso/oboheaderfiles/. Then populate the database by running:

/root/run.sh -l
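The three setup stages above run in order inside the container as root, so they can be chained; the sketch below also checks for the www-data dump first (the path is the one given above):

```shell
#!/bin/bash
# Chain the three setup stages (-t, -p, -l). Run as root inside the
# container; the later stages depend on the earlier ones succeeding.
DUMP=/data/textpresso/postgres/www-data.tar.gz
if [ ! -x /root/run.sh ]; then
    echo "run this inside the tpc-full-hmm container" >&2
elif [ ! -e "$DUMP" ]; then
    echo "missing $DUMP; provide it before running -p" >&2
else
    /root/run.sh -t && /root/run.sh -p && /root/run.sh -l
fi
```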

Run Pipeline

  1. Download C. elegans PDFs: Make sure that /data/textpresso/raw_files and /data/textpresso/tmp exist and type

01downloadpdfs.sh &>/data/textpresso/tmp/01.out

  2. Download NXMLs from PMCOA: Make sure that /data/textpresso/raw_files and /data/textpresso/tmp exist and type

02downloadxmls.sh &>/data/textpresso/tmp/02.out

  3. Convert PDFs to tpcas-1 files:

03pdf2cas.sh &>/data/textpresso/tmp/03.out

Because of the way batch jobs are run, and as faulty PDFs might cause a segmentation fault that cannot be caught, this step needs to be repeated until no more PDFs are converted.

  4. Convert NXMLs to tpcas-1 files:

04xml2cas.sh &>/data/textpresso/tmp/04.out

Because of the way batch jobs are run, and as faulty NXMLs might cause a segmentation fault that cannot be caught, this step needs to be repeated until no more NXMLs are converted.
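The "repeat until nothing new is converted" rule for steps 03 and 04 can be automated by counting output files between passes. A sketch for step 04 (the same loop works for 03 with the script and log names swapped; the *.tpcas* name pattern is an assumption about the output files):

```shell
#!/bin/bash
# Re-run 04xml2cas.sh until a full pass produces no new tpcas-1 files.
count_cas() { find /data/textpresso/tpcas-1 -type f -name '*.tpcas*' 2>/dev/null | wc -l; }
prev=-1
cur=$(count_cas)
while [ "$cur" != "$prev" ]; do
    04xml2cas.sh &>/data/textpresso/tmp/04.out || true
    prev=$cur
    cur=$(count_cas)
done
echo "converged at $cur tpcas-1 files"
```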

  5. Add images from PMCOA to tpcas-1 files:

05addimages2cas1.sh &>/data/textpresso/tmp/05.out

  6. Perform lexical markup of tpcas-1 files, resulting in tpcas-2 files:

07cas1tocas2.sh &>/data/textpresso/tmp/07.out

Check for completeness by comparing tpcas-1 and tpcas-2 files.
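One way to do that comparison is by file count per tree. This assumes the two trees mirror each other and that tpcas files match the *.tpcas* pattern:

```shell
#!/bin/bash
# Compare file counts between the tpcas-1 and tpcas-2 trees.
t1=$(find /data/textpresso/tpcas-1 -type f -name '*.tpcas*' 2>/dev/null | wc -l)
t2=$(find /data/textpresso/tpcas-2 -type f -name '*.tpcas*' 2>/dev/null | wc -l)
echo "tpcas-1: $t1 files, tpcas-2: $t2 files"
if [ "$t1" -ne "$t2" ]; then
    echo "incomplete: re-run 07cas1tocas2.sh for the missing files" >&2
fi
```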

  7. Get bibliographical information for the PMCOA corpus:

09getpmcoabib.sh &>/data/textpresso/tmp/09.out

  8. Get bibliographical information for the C. elegans corpus:

10getcelegansbib.sh &>/data/textpresso/tmp/10.out

  9. Invert the images extracted from PDF files:

11invertimages.sh &>/data/textpresso/tmp/11.out

  10. Index tpcas-2 files:

12index.sh &>/data/textpresso/tmp/12.out

Check for segmentation faults. If they occur, remove tpcas-2 files that cause them.
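Finding the offending files by hand can be tedious; truncated archives are a common cause of indexer crashes, so a first pass might quarantine any tpcas-2 file that fails an integrity check. The .tpcas.gz layout and gzip compression are assumptions about the corpus:

```shell
#!/bin/bash
# Quarantine tpcas-2 archives that fail gzip's integrity test, then
# re-run 12index.sh on the remaining files.
QUARANTINE=/data/textpresso/tmp/bad-tpcas-2
mkdir -p "$QUARANTINE" 2>/dev/null || true
find /data/textpresso/tpcas-2 -type f -name '*.tpcas.gz' 2>/dev/null |
while read -r f; do
    gzip -t "$f" 2>/dev/null || mv -v "$f" "$QUARANTINE"/
done
```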

Start Web Services

/root/run.sh -w

Build NN Classifier Models

This step only has to be done once. Reuse the models until they are out of date.

Compile and Install tpneuralnet

cd /data/textpresso/tpneuralnets/

mkdir build && cd build

cmake -DCMAKE_BUILD_TYPE=Release ..

make -j 8 && make install

Compile and Install wordembeddings

cd /data/textpresso/wordembeddings/

mkdir build && cd build

cmake -DCMAKE_BUILD_TYPE=Release ..

make -j 8 && make install

Compute Word Vectors

mkdir -p /data/textpresso/classifiers/nn/tpcas-1/

cd /data/textpresso/classifiers/nn/

rsync -av --exclude 'images' /data/textpresso/tpcas-1/C.\ elegans tpcas-1/.

01computeceleganswordmodel.sh &>../../tmp/01computeceleganswordmodel.out

Compute Document Vectors

02createcelegansdocvectors.sh &>../../tmp/02createcelegansdocvectors.out

Run NN Classifiers

  1. rsync -av --exclude 'images' ../../../tpcas-1/C.\ elegans
  2. 02createcelegansdocvectors.sh
  3. 03makelist.sh
  4. 04classify.sh
  5. makehtmls.sh predictions results

Rsync with textpressocentral.org

  1. rsync -av --delete-after celeganstpc/ textpressocentral.org:/data/celeganstpc/
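Because --delete-after removes remote files, it is worth previewing the transfer first. The sketch below wraps the command so that -n (dry run) can be passed once before the real sync:

```shell
#!/bin/bash
# Wrap the sync so it can be previewed. rsync -n lists what would change,
# including deletions from --delete-after, without transferring anything.
sync_site() {
    rsync "$@" -av --delete-after celeganstpc/ textpressocentral.org:/data/celeganstpc/
}
# sync_site -n    # preview the transfer
# sync_site       # real transfer
```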