Dockerized TPC Documentation
Build the Docker Images
- Clone the libtpc repository.
- Enter the libtpc directory and switch to the hmm branch.
- Build the image.
docker build --no-cache -t ubuntu-tpc-hmm .
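For example, assuming the repositories are hosted on GitHub under the TextpressoDevelopers organization (the URL is an assumption; adjust it to your actual remote), the clone, branch switch, and build look like this:
git clone https://github.com/TextpressoDevelopers/libtpc.git   # assumed URL
cd libtpc
git checkout hmm
docker build --no-cache -t ubuntu-tpc-hmm .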
- Clone the docker-tpc-hmm repository.
- Clone the textpressocentral, tpctools and textpressoapi repositories. For all three repositories, switch to branch hmm.
- Enter the docker-tpc-hmm directory and build the tpc-hmm-full image.
docker build -f Dockerfile-full -t tpc-hmm-full .
- Edit build-lite.sh to point to the correct directories for the cloned repositories textpressocentral, tpctools and textpressoapi.
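The variable names inside build-lite.sh depend on the script itself; as a purely hypothetical sketch, the edited lines might look like:
TEXTPRESSOCENTRAL_DIR=/home/user/textpressocentral   # hypothetical variable names and paths
TPCTOOLS_DIR=/home/user/tpctools
TEXTPRESSOAPI_DIR=/home/user/textpressoapi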
- Build the tpc-lite-hmm image.
./build-lite.sh tpc-lite-hmm .
Start the Instance
- Enter the docker-tpc-hmm directory.
- Edit run_tpc_full.sh to point to the correct directories for the cloned repositories textpressocentral, tpctools and textpressoapi.
- Start the instance by typing
./run_tpc_full.sh <data directory> <port for website> <port for api>
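For example, with hypothetical values (substitute your own data directory and two free host ports):
./run_tpc_full.sh /data/textpresso 8080 18080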
Build and Install Software, Start Postgres and Load www-data Database, Populate Database with Obofiles
- Build and install the software by running:
sudo su
/root/run.sh -t
- Make sure that the file /data/textpresso/postgres/www-data.tar.gz is present. Then start Postgres and load the www-data database by running:
/root/run.sh -p
- Make sure that obofiles exist in /data/textpresso/obofiles4production/ and /data/textpresso/oboheaderfiles/. Then populate the database by running:
/root/run.sh -l
- Download C. elegans PDFs: make sure that /data/textpresso/raw_files and /data/textpresso/tmp exist and type
01downloadpdfs.sh &>/data/textpresso/tmp/01.out
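If the two directories do not exist yet, create them first:
mkdir -p /data/textpresso/raw_files /data/textpresso/tmp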
- Download NXMLs from PMCOA: make sure that /data/textpresso/raw_files and /data/textpresso/tmp exist and type
02downloadxmls.sh &>/data/textpresso/tmp/02.out
- Convert PDFs to tpcas-1 files:
03pdf2cas.sh &>/data/textpresso/tmp/03.out
Because of the way batch jobs are run, and because faulty PDFs might cause a segmentation fault that cannot be caught, this step needs to be repeated until no more PDFs are converted (see the loop sketch below). The PDFs that caused a segmentation fault can be converted and processed by other means, i.e., through plain-text conversion. To do that, type
03catch.non-conv.pdfs.4.cas1.sh &>/data/textpresso/tmp/03a.out
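A minimal sketch of the repetition, assuming converted files accumulate under /data/textpresso/tpcas-1 (the file-name pattern is an assumption; adjust it to the actual extension):
prev=-1
curr=$(find /data/textpresso/tpcas-1 -name '*.tpcas*' | wc -l)
while [ "$curr" -gt "$prev" ]; do
    03pdf2cas.sh &>>/data/textpresso/tmp/03.out   # append so earlier runs' output is kept
    prev=$curr
    curr=$(find /data/textpresso/tpcas-1 -name '*.tpcas*' | wc -l)
done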
- Convert NXMLs to tpcas-1 files:
04xml2cas.sh &>/data/textpresso/tmp/04.out
Because of the way batch jobs are run, and because faulty NXMLs might cause a segmentation fault that cannot be caught, this step needs to be repeated until no more NXMLs are converted. This also applies if an article2cas process runs away and needs to be killed (see below).
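A runaway article2cas process can be listed and killed by name before rerunning the script:
pgrep -af article2cas   # list matching processes
pkill -f article2cas    # kill them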
- Add images from PMCOA to tpcas-1 files:
05addimages2cas1.sh &>/data/textpresso/tmp/05.out
- Perform lexical markup of tpcas-1 files, resulting in tpcas-2 files:
07cas1tocas2.sh &>/data/textpresso/tmp/07.out
Check for completeness by comparing the tpcas-1 and tpcas-2 files, as sketched below.
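A minimal sketch of the completeness check, assuming the two trees mirror each other's layout (the file-name pattern is an assumption); comm prints the files that are present in only one of the two trees:
comm -3 <(cd /data/textpresso/tpcas-1 && find . -name '*.tpcas*' | sort) \
        <(cd /data/textpresso/tpcas-2 && find . -name '*.tpcas*' | sort)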
- Get bibliographical information for PMCOA corpus:
09getpmcoabib.sh &>/data/textpresso/tmp/09.out
- Get bibliographical information for C. elegans corpus:
10getcelegansbib.sh &>/data/textpresso/tmp/10.out
- Perform conversion of images extracted from PDF files:
11invertimages.sh &>/data/textpresso/tmp/11.out
- Index tpcas-2 files:
12index.sh &>/data/textpresso/tmp/12.out
Check for segmentation faults, e.g. as shown below. If they occur, remove the tpcas-2 files that cause them.
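Segmentation faults typically leave a trace in the indexing log written above; a quick check:
grep -i 'segmentation fault' /data/textpresso/tmp/12.out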
- Start the web service by running:
/root/run.sh -w
Train Neural Network Models and Classify Papers
This step only has to be done once. Reuse the models until they are out of date.
- Build and install the tpneuralnets package:
cd /data/textpresso/tpneuralnets/
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j 8 && make install
- Build and install the wordembeddings package:
cd /data/textpresso/wordembeddings/
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j 8 && make install
- Prepare the classifier working directory and compute the C. elegans word model:
mkdir -p /data/textpresso/classifiers/nn/tpcas-1/
cd /data/textpresso/classifiers/nn/
rsync -av --exclude 'images' /data/textpresso/tpcas-1/C.\ elegans tpcas-1/.
01computeceleganswordmodel.sh &>../../tmp/01computeceleganswordmodel.out
- Create C. elegans document vectors:
02createcelegansdocvectors.sh &>../../tmp/02createcelegansdocvectors.out
- Make directories for training sets and models:
mkdir models sets4makingmodels
- Deposit lists of paper IDs that serve as positive training examples into sets4makingmodels, one for each model. The file name of each list will be the model name.
- Edit a list of paper IDs that serve as negative training examples and save it as negative.list in /data/textpresso/classifiers/nn/. The list serves as negative training examples for all models that are to be trained.
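For example, to train two hypothetical models named rnai and antibody (the file names and WBPaper IDs below are placeholders):
ls /data/textpresso/classifiers/nn/sets4makingmodels
# antibody  rnai           <- one positive-example list per model
head -2 /data/textpresso/classifiers/nn/sets4makingmodels/rnai
# WBPaper00044567
# WBPaper00046790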
- Make a json directory and deposit a template file in it:
mkdir json
Then edit a file named json/crossvalidate_mm_template.json. An example file is as follows:
{ "task" : "crossvalidate", "document model" : "WRKDIR/celegans.doc", "class 1 list" : "WRKDIR/list1", "class 2 list" : "WRKDIR/list2", "model name" : "WRKDIR/model", "cross validation factor" : 5, "number of iterations" : 1, "nn configuration" : "23 11" }
.
- Compute models:
tpnn-makemodels-high-recall.sh /data/textpresso/classifiers/nn &> ../../tmp/tmhr.out &
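The job runs in the background; its progress can be followed in the log file:
tail -f /data/textpresso/tmp/tmhr.out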
- Update directory of papers that are to be classified:
cd /data/textpresso/classifiers/nn/tpcas-1
rsync -av --exclude 'images' ../../../tpcas-1/C.\ elegans .
- Create document vectors for updated papers:
02createcelegansdocvectors.sh &>../../../tmp/02createcelegansdocvectors.out&
- Make (incremental) list of new papers to be classified:
03makelist.sh
- Edit a file named /data/textpresso/classifiers/nn/json/predict_pr_template.json. An example file is as follows:
{ "task" : "predict", "document model" : "WRKDIR/celegans.doc", "document list" : "WRKDIR/pool4predictions", "model name" : "WRKDIR/model" }
.
- Classify papers:
mkdir /data/textpresso/classifiers/nn/predictions
04classify.sh &>../../../tmp/04classify.out&
- Make HTML Pages for WormBase Curators:
cd /data/textpresso/classifiers/nn
mkdir results
makehtmls.sh predictions results
rsync -av /data/textpresso/classifiers/nn/results/ /data/textpresso/classifiers/NNClassification/
© 2024 Hans-Michael Müller