Dockerized TPC Documentation
- Clone the libtpc repository.
- Enter the libtpc directory and switch to branch hmm.
- Build the image:
docker build --no-cache -t ubuntu-tpc-hmm .
- Clone the docker-tpc-hmm repository.
- Clone the textpressocentral, tpctools and textpressoapi repositories. For all three repositories, switch to branch hmm.
- Enter the docker-tpc-hmm directory and build the full image, tagged tpc-hmm-full:
docker build -f Dockerfile-full -t tpc-hmm-full .
- Edit build-lite.sh to point to the correct directories for the cloned repositories textpressocentral, tpctools and textpressoapi.
- Build the tpc-lite-hmm image:
./build-lite.sh tpc-lite-hmm .
- Enter the docker-tpc-hmm directory.
- Edit run_tpc_full.sh to point to the correct directories for the cloned repositories textpressocentral, tpctools and textpressoapi.
- Start the instance by running:
./run_tpc_full.sh <data directory> <port for website> <port for api>
Build and Install Software, Start Postgres and Load www-data Database, Populate Database with Obofiles
- Build and install the software by running:
sudo su
/root/run.sh -t
- Make sure that the file /data/textpresso/postgres/www-data.tar.gz is present. Then start Postgres and load the www-data database by running:
/root/run.sh -p
- Make sure obofiles exist in /data/textpresso/obofilesobofiles4production/ and /data/textpresso/oboheaderfiles/. Then populate the database by running:
/root/run.sh -l
- Download C. elegans PDFs: make sure that /data/textpresso/raw_files and /data/textpresso/tmp exist, then run:
01downloadpdfs.sh &>/data/textpresso/tmp/01.out
- Download NXMLs from PMCOA: make sure that /data/textpresso/raw_files and /data/textpresso/tmp exist, then run:
02downloadxmls.sh &>/data/textpresso/tmp/02.out
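Both download steps assume that their directories already exist. The following is a minimal preflight sketch for checking this before a step is started. The helper name `require_dirs` and the `TPC_DATA` variable are hypothetical and not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the repository: verify that every
# directory given as an argument exists before a download step is started.
# Prints each missing directory to stderr and returns 1 if any is missing.
require_dirs() {
    local missing=0 dir
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "missing directory: $dir" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Assumed data root; the documentation uses /data/textpresso.
TPC_DATA="${TPC_DATA:-/data/textpresso}"

# Example use (commented out so that sourcing this file has no side effects):
# require_dirs "$TPC_DATA/raw_files" "$TPC_DATA/tmp" \
#     && 01downloadpdfs.sh >"$TPC_DATA/tmp/01.out" 2>&1
```

Failing early this way avoids a run whose log redirection fails only because the tmp directory is absent.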
- Convert PDFs to tpcas-1 files:
03pdf2cas.sh &>/data/textpresso/tmp/03.out
Because of the way batch jobs are run, and because a faulty PDF might cause a segmentation fault that cannot be caught, this step needs to be repeated until no more PDFs are converted.
- Convert NXMLs to tpcas-1 files:
04xml2cas.sh &>/data/textpresso/tmp/04.out
Because of the way batch jobs are run, and because a faulty NXML might cause a segmentation fault that cannot be caught, this step needs to be repeated until no more NXMLs are converted.
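The advice to repeat the two conversion steps can be sketched as a loop that reruns a conversion script until the number of output files stops growing. This is only an illustration. The function name `repeat_until_stable`, the use of a file count as the stopping criterion, and the output directory passed to it are assumptions, not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rerun a command until the number of files under an
# output directory no longer increases between two consecutive runs.
# Usage: repeat_until_stable <output dir> <max runs> <command> [args...]
repeat_until_stable() {
    local outdir=$1 max_runs=$2
    shift 2
    local before after run=0
    while [ "$run" -lt "$max_runs" ]; do
        before=$(($(find "$outdir" -type f | wc -l)))
        # A crashed batch (for example a segmentation fault) must not
        # stop the loop, so the exit status of the command is ignored.
        "$@" || true
        after=$(($(find "$outdir" -type f | wc -l)))
        run=$((run + 1))
        if [ "$after" -le "$before" ]; then
            echo "no new files after run $run"
            return 0
        fi
    done
    echo "gave up after $max_runs runs" >&2
    return 1
}

# Example use (commented out); the tpcas-1 location is an assumption:
# repeat_until_stable /data/textpresso/tpcas-1 20 \
#     sh -c '03pdf2cas.sh >>/data/textpresso/tmp/03.out 2>&1'
```

The run limit guards against an endless loop in case the script keeps producing files for reasons unrelated to conversion. The example appends to the log so that output from earlier runs is kept.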
- Add images from PMCOA to tpcas-1 files:
05addimages2cas1.sh &>/data/textpresso/tmp/05.out
- Perform lexical markup of tpcas-1 files, resulting in tpcas-2 files:
07cas1tocas2.sh &>/data/textpresso/tmp/07.out
Check for completeness by comparing tpcas-1 and tpcas-2 files.
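One way to carry out the completeness check is to list files that exist under the tpcas-1 tree but have no counterpart at the same relative path under tpcas-2. The sketch below assumes that both trees share the same layout and file names, which this documentation does not state. `missing_in_second` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print relative paths of files that exist under the
# first directory but not under the second.
# Assumption: both trees use identical relative paths and file names.
missing_in_second() {
    local first=$1 second=$2 list1 list2
    list1=$(mktemp) && list2=$(mktemp) || return 1
    (cd "$first" && find . -type f | LC_ALL=C sort) > "$list1"
    (cd "$second" && find . -type f | LC_ALL=C sort) > "$list2"
    # comm -23 keeps the lines that are unique to the first sorted list.
    LC_ALL=C comm -23 "$list1" "$list2"
    rm -f "$list1" "$list2"
}

# Example use (commented out); both locations are assumptions:
# missing_in_second /data/textpresso/tpcas-1 /data/textpresso/tpcas-2
```

An empty result would suggest that every tpcas-1 file has been marked up. If tpcas-1 holds auxiliary files with no tpcas-2 counterpart, they are reported as well and need to be filtered out.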
- Get bibliographical information for PMCOA corpus:
09getpmcoabib.sh &>/data/textpresso/tmp/09.out
- Get bibliographical information for C. elegans corpus:
10getcelegansbib.sh &>/data/textpresso/tmp/10.out
- Perform conversion of images extracted from PDF files:
11invertimages.sh &>/data/textpresso/tmp/11.out
- Index tpcas-2 files:
12index.sh &>/data/textpresso/tmp/12.out
Check the output for segmentation faults. If any occur, remove the tpcas-2 files that cause them and run the step again.
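A quick way to look for crashes is to search the step's log for the shell's crash message. The sketch assumes that the message contains the words "Segmentation fault", as bash reports it, and that it ends up in the redirected log. Neither is guaranteed by this documentation, and `count_segfaults` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the number of lines in a log file that mention
# a segmentation fault (case-insensitive match).
count_segfaults() {
    local log=$1
    # grep -c prints 0 and exits with status 1 when nothing matches, so the
    # exit status is neutralised to keep scripts using "set -e" alive.
    grep -ci 'segmentation fault' "$log" || true
}

# Example use (commented out); shows the matching lines with line numbers:
# count_segfaults /data/textpresso/tmp/12.out
# grep -n -i 'segmentation fault' /data/textpresso/tmp/12.out
```

The lines around each match may help to identify which tpcas-2 file was being indexed when the crash happened, depending on what the indexing script logs.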
/root/run.sh -w
This step only has to be done once. Reuse the models until they are out of date.
cd /data/textpresso/tpneuralnets/
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=release ..
make && make install
- rsync -av --exclude 'images' ../../../tpcas-1/C.\ elegans
- 02createcelegansdocvectors.sh
- 03makelist.sh
- 04classify.sh
- makehtmls.sh predictions results
- rsync -av --delete-after celeganstpc/ textpressocentral.org:/data/celeganstpc/
© 2024 Hans-Michael Müller