Merge pull request #21 from ramanathanlab/develop
Develop
Showing 76 changed files with 674,703 additions and 128 deletions.
@@ -233,8 +233,11 @@ conda create -n pdfwf python=3.10 -y
conda activate pdfwf
mamba install pytorch torchvision pytorch-cuda=11.8 -c pytorch -c nvidia -y
pip install -r requirements/oreo_requirements.txt
git clone git@github.com:ultralytics/yolov5.git
```

Then set the `yolov5_path` option to the path of the cloned `yolov5` repository.

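For reference, a minimal sketch of wiring the cloned checkout into an oreo run; the destination path and the exact placement of the `yolov5_path` key are illustrative assumptions, not taken from the repository:

```bash
# Clone yolov5 to a known location (path is illustrative).
git clone git@github.com:ultralytics/yolov5.git "$HOME/src/yolov5"

# Then point the oreo parser at it, e.g. in a workflow YAML (hypothetical snippet):
#   parser_settings:
#     name: oreo
#     yolov5_path: /home/<user>/src/yolov5
```
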
## CLI
For running smaller jobs without using the Parsl workflow, the CLI can be used.
The CLI provides different commands for running the various parsers. The CLI
@@ -316,9 +319,9 @@ $ pdfwf oreo [OPTIONS]

* `-p, --pdf_path PATH`: The directory containing the PDF files to convert (recursive glob). [required]
* `-o, --output_dir PATH`: The directory to write the output JSON lines file to. [required]
* `-d, --detection_weights_path PATH`: Weights to layout detection model. [required]
* `-t, --text_cls_weights_path PATH`: Model weights for (meta) text classifier. [required]
* `-s, --spv05_category_file_path PATH`: Path to the SPV05 category file. [required]
* `-d, --detection_weights_path PATH`: Weights to layout detection model.
* `-t, --text_cls_weights_path PATH`: Model weights for (meta) text classifier.
* `-s, --spv05_category_file_path PATH`: Path to the SPV05 category file.
* `-d, --detect_only`: File type to be parsed (ignores other files in the input_dir)
* `-m, --meta_only`: Only parse PDFs for meta data
* `-e, --equation`: Include equations into the text categories

@@ -329,7 +332,7 @@ $ pdfwf oreo [OPTIONS]
* `-b, --batch_yolo INTEGER`: Main batch size for detection/# of images loaded per batch [default: 128]
* `-v, --batch_vit INTEGER`: Batch size of pre-processed patches for ViT pseudo-OCR inference [default: 512]
* `-c, --batch_cls INTEGER`: Batch size K for subsequent text processing [default: 512]
* `-o, --bbox_offset INTEGER`: Number of pixels by which to offset detected bounding boxes [default: 2]
* `-x, --bbox_offset INTEGER`: Number of pixels by which to offset detected bounding boxes [default: 2]
* `-nc, --num_conversions INTEGER`: Number of pdfs to convert (useful for debugging, by default convert every document) [default: 0].
* `--help`: Show this message and exit.

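As a concrete illustration of the options above, a small debugging run with the oreo parser might look like the following; all paths are placeholders, and only `--pdf_path` and `--output_dir` are required after this change:

```bash
# Hedged sketch: parse a small batch of PDFs with the oreo CLI.
# Every path below is an illustrative placeholder.
pdfwf oreo \
  --pdf_path ./pdfs \
  --output_dir ./parsed_output \
  --detection_weights_path ./weights/yolov5_layout.pt \
  --text_cls_weights_path ./weights/text_classifier.pt \
  --spv05_category_file_path ./meta/spv05_categories.yaml \
  --num_conversions 10   # only convert 10 PDFs while debugging
```
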
@@ -357,3 +360,20 @@ To generate the CLI documentation, run:
```
typer pdfwf.cli utils docs --output CLI.md
```

## Installation on HPC systems

### Leonardo
```bash
module load python/3.11.6--gcc--8.5.0
python -m venv pdfwf-venv
source pdfwf-venv/bin/activate
pip install -U pip setuptools wheel
pip install torch
pip install numpy
pip install -r requirements/nougat_requirements.txt
pip install -r requirements/oreo_requirements.txt
pip install -e .
python -m nltk.downloader words
python -m nltk.downloader punkt
```
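An optional sanity check after the Leonardo install; these commands are a suggestion rather than part of the documented procedure:

```bash
# Confirm the environment resolves the package and sees a GPU before submitting jobs.
source pdfwf-venv/bin/activate
python -c "import pdfwf, torch; print('cuda available:', torch.cuda.is_available())"
pdfwf --help   # list the available parser commands
```
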
51 changes: 51 additions & 0 deletions
examples/nougat/leonardo-scaling/nougat.leonardo.nodes1.yaml
@@ -0,0 +1,51 @@
# The directory containing the pdfs to convert
pdf_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/raw_pdfs/scaling-data/zip-scaling-data

# The output directory of the workflow
out_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes1

# Search for zip files in the pdf_dir
iszip: true

# Temporary storage directory for unzipping files
tmp_storage: /dev/shm

# The settings for the pdf parser
parser_settings:
  name: nougat
  # Recommended batch size of pages (not pdfs) is 12, the maximum that fits into an A100 64GB.
  batchsize: 12
  # Path to download the checkpoint to. If it already exists, the existing checkpoint is used.
  checkpoint: /leonardo_scratch/large/userexternal/abrace00/.cache/nougat/base
  # Set mmd_out to null if you don't want to write mmd files as a byproduct.
  mmd_out: null
  # If set to false, pdfs that already have a parsed mmd file in mmd_out are skipped.
  recompute: false
  # Set to true if you want to use fp32 instead of bfloat16 (false is recommended).
  full_precision: false
  # If set to true, output text will be formatted as a markdown file.
  markdown: true
  # Preempt processing a paper if mode collapse causes repetitions (true is recommended).
  skipping: true
  # Path for the nougat-specific logs for the run.
  nougat_logs_path: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes1/nougat_logs

# The compute settings for the workflow
compute_settings:
  # The name of the compute platform to use
  name: leonardo
  # The number of compute nodes to use
  num_nodes: 1
  # Make sure to update the path to your Python environment and HF cache
  worker_init: "module load python/3.11.6--gcc--8.5.0; source /leonardo_scratch/large/userexternal/abrace00/venvs/pdfwf-venv/bin/activate; export HF_HOME=/leonardo_scratch/large/userexternal/abrace00/.cache"
  # The scheduler options to use when submitting jobs
  scheduler_options: "#SBATCH --reservation=s_res_gb"
  # Partition to use.
  partition: boost_usr_prod
  # Quality of service.
  qos: qos_resv
  # Account to charge compute to.
  account: try24_Genomics_0
  # The amount of time to request for your job
  walltime: 01:00:00
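To give a sense of how such a config is consumed, here is a hedged sketch of a single-node launch; the `pdfwf.convert` entry-point module is an assumption and should be checked against the repository's usage instructions:

```bash
# Hedged sketch: launch the Parsl workflow with the 1-node Leonardo config.
# The entry-point module name is an assumption; verify it in the project README.
source /path/to/pdfwf-venv/bin/activate   # illustrative path
python -m pdfwf.convert --config examples/nougat/leonardo-scaling/nougat.leonardo.nodes1.yaml
```
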
52 changes: 52 additions & 0 deletions
examples/nougat/leonardo-scaling/nougat.leonardo.nodes1024.yaml
@@ -0,0 +1,52 @@
# The directory containing the pdfs to convert
pdf_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/raw_pdfs/scaling-data/zip-scaling-data

# The output directory of the workflow
out_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes1024

# Search for zip files in the pdf_dir
iszip: true

# Temporary storage directory for unzipping files
tmp_storage: /dev/shm

# The settings for the pdf parser
parser_settings:
  name: nougat
  # Recommended batch size of pages (not pdfs) is 12, the maximum that fits into an A100 64GB.
  batchsize: 12
  # Path to download the checkpoint to. If it already exists, the existing checkpoint is used.
  checkpoint: /leonardo_scratch/large/userexternal/abrace00/.cache/nougat/base
  # Set mmd_out to null if you don't want to write mmd files as a byproduct.
  mmd_out: null
  # If set to false, pdfs that already have a parsed mmd file in mmd_out are skipped.
  recompute: false
  # Set to true if you want to use fp32 instead of bfloat16 (false is recommended).
  full_precision: false
  # If set to true, output text will be formatted as a markdown file.
  markdown: true
  # Preempt processing a paper if mode collapse causes repetitions (true is recommended).
  skipping: true
  # Path for the nougat-specific logs for the run.
  nougat_logs_path: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes1024/nougat_logs

# The compute settings for the workflow
compute_settings:
  # The name of the compute platform to use
  name: leonardo
  # The number of compute nodes to use
  num_nodes: 1024
  # Make sure to update the path to your Python environment and HF cache
  worker_init: "module load python/3.11.6--gcc--8.5.0; source /leonardo_scratch/large/userexternal/abrace00/venvs/pdfwf-venv/bin/activate; export HF_HOME=/leonardo_scratch/large/userexternal/abrace00/.cache"
  # The scheduler options to use when submitting jobs
  scheduler_options: "#SBATCH --reservation=s_res_gb"
  # Partition to use.
  partition: boost_usr_prod
  # Quality of service.
  qos: qos_resv
  #qos: boost_qos_dbg
  # Account to charge compute to.
  account: try24_Genomics_0
  # The amount of time to request for your job
  walltime: 01:00:00
52 changes: 52 additions & 0 deletions
examples/nougat/leonardo-scaling/nougat.leonardo.nodes1480.yaml
@@ -0,0 +1,52 @@
# The directory containing the pdfs to convert
pdf_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/raw_pdfs/scaling-data/zip-scaling-data

# The output directory of the workflow
out_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes1480

# Search for zip files in the pdf_dir
iszip: true

# Temporary storage directory for unzipping files
tmp_storage: /dev/shm

# The settings for the pdf parser
parser_settings:
  name: nougat
  # Recommended batch size of pages (not pdfs) is 12, the maximum that fits into an A100 64GB.
  batchsize: 12
  # Path to download the checkpoint to. If it already exists, the existing checkpoint is used.
  checkpoint: /leonardo_scratch/large/userexternal/abrace00/.cache/nougat/base
  # Set mmd_out to null if you don't want to write mmd files as a byproduct.
  mmd_out: null
  # If set to false, pdfs that already have a parsed mmd file in mmd_out are skipped.
  recompute: false
  # Set to true if you want to use fp32 instead of bfloat16 (false is recommended).
  full_precision: false
  # If set to true, output text will be formatted as a markdown file.
  markdown: true
  # Preempt processing a paper if mode collapse causes repetitions (true is recommended).
  skipping: true
  # Path for the nougat-specific logs for the run.
  nougat_logs_path: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes1480/nougat_logs

# The compute settings for the workflow
compute_settings:
  # The name of the compute platform to use
  name: leonardo
  # The number of compute nodes to use
  num_nodes: 1480
  # Make sure to update the path to your Python environment and HF cache
  worker_init: "module load python/3.11.6--gcc--8.5.0; source /leonardo_scratch/large/userexternal/abrace00/venvs/pdfwf-venv/bin/activate; export HF_HOME=/leonardo_scratch/large/userexternal/abrace00/.cache"
  # The scheduler options to use when submitting jobs
  scheduler_options: "#SBATCH --reservation=s_res_gb"
  # Partition to use.
  partition: boost_usr_prod
  # Quality of service.
  qos: qos_resv
  #qos: boost_qos_dbg
  # Account to charge compute to.
  account: try24_Genomics_0
  # The amount of time to request for your job
  walltime: 01:00:00
51 changes: 51 additions & 0 deletions
examples/nougat/leonardo-scaling/nougat.leonardo.nodes2.yaml
@@ -0,0 +1,51 @@
# The directory containing the pdfs to convert
pdf_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/raw_pdfs/scaling-data/zip-scaling-data

# The output directory of the workflow
out_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes2

# Search for zip files in the pdf_dir
iszip: true

# Temporary storage directory for unzipping files
tmp_storage: /dev/shm

# The settings for the pdf parser
parser_settings:
  name: nougat
  # Recommended batch size of pages (not pdfs) is 12, the maximum that fits into an A100 64GB.
  batchsize: 12
  # Path to download the checkpoint to. If it already exists, the existing checkpoint is used.
  checkpoint: /leonardo_scratch/large/userexternal/abrace00/.cache/nougat/base
  # Set mmd_out to null if you don't want to write mmd files as a byproduct.
  mmd_out: null
  # If set to false, pdfs that already have a parsed mmd file in mmd_out are skipped.
  recompute: false
  # Set to true if you want to use fp32 instead of bfloat16 (false is recommended).
  full_precision: false
  # If set to true, output text will be formatted as a markdown file.
  markdown: true
  # Preempt processing a paper if mode collapse causes repetitions (true is recommended).
  skipping: true
  # Path for the nougat-specific logs for the run.
  nougat_logs_path: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes2.test2/nougat_logs

# The compute settings for the workflow
compute_settings:
  # The name of the compute platform to use
  name: leonardo
  # The number of compute nodes to use
  num_nodes: 2
  # Make sure to update the path to your Python environment and HF cache
  worker_init: "module load python/3.11.6--gcc--8.5.0; source /leonardo_scratch/large/userexternal/abrace00/venvs/pdfwf-venv/bin/activate; export HF_HOME=/leonardo_scratch/large/userexternal/abrace00/.cache"
  # The scheduler options to use when submitting jobs
  scheduler_options: "#SBATCH --reservation=s_res_gb"
  # Partition to use.
  partition: boost_usr_prod
  # Quality of service.
  qos: qos_resv
  # Account to charge compute to.
  account: try24_Genomics_0
  # The amount of time to request for your job
  walltime: 01:00:00
52 changes: 52 additions & 0 deletions
examples/nougat/leonardo-scaling/nougat.leonardo.nodes256.yaml
@@ -0,0 +1,52 @@
# The directory containing the pdfs to convert
pdf_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/raw_pdfs/scaling-data/zip-scaling-data

# The output directory of the workflow
out_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes256

# Search for zip files in the pdf_dir
iszip: true

# Temporary storage directory for unzipping files
tmp_storage: /dev/shm

# The settings for the pdf parser
parser_settings:
  name: nougat
  # Recommended batch size of pages (not pdfs) is 12, the maximum that fits into an A100 64GB.
  batchsize: 12
  # Path to download the checkpoint to. If it already exists, the existing checkpoint is used.
  checkpoint: /leonardo_scratch/large/userexternal/abrace00/.cache/nougat/base
  # Set mmd_out to null if you don't want to write mmd files as a byproduct.
  mmd_out: null
  # If set to false, pdfs that already have a parsed mmd file in mmd_out are skipped.
  recompute: false
  # Set to true if you want to use fp32 instead of bfloat16 (false is recommended).
  full_precision: false
  # If set to true, output text will be formatted as a markdown file.
  markdown: true
  # Preempt processing a paper if mode collapse causes repetitions (true is recommended).
  skipping: true
  # Path for the nougat-specific logs for the run.
  nougat_logs_path: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes256/nougat_logs

# The compute settings for the workflow
compute_settings:
  # The name of the compute platform to use
  name: leonardo
  # The number of compute nodes to use
  num_nodes: 256
  # Make sure to update the path to your Python environment and HF cache
  worker_init: "module load python/3.11.6--gcc--8.5.0; source /leonardo_scratch/large/userexternal/abrace00/venvs/pdfwf-venv/bin/activate; export HF_HOME=/leonardo_scratch/large/userexternal/abrace00/.cache"
  # The scheduler options to use when submitting jobs
  scheduler_options: "#SBATCH --reservation=s_res_gb"
  # Partition to use.
  partition: boost_usr_prod
  # Quality of service.
  qos: qos_resv
  #qos: boost_qos_dbg
  # Account to charge compute to.
  account: try24_Genomics_0
  # The amount of time to request for your job
  walltime: 01:00:00
52 changes: 52 additions & 0 deletions
examples/nougat/leonardo-scaling/nougat.leonardo.nodes512.yaml
@@ -0,0 +1,52 @@
# The directory containing the pdfs to convert
pdf_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/raw_pdfs/scaling-data/zip-scaling-data

# The output directory of the workflow
out_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes512

# Search for zip files in the pdf_dir
iszip: true

# Temporary storage directory for unzipping files
tmp_storage: /dev/shm

# The settings for the pdf parser
parser_settings:
  name: nougat
  # Recommended batch size of pages (not pdfs) is 12, the maximum that fits into an A100 64GB.
  batchsize: 12
  # Path to download the checkpoint to. If it already exists, the existing checkpoint is used.
  checkpoint: /leonardo_scratch/large/userexternal/abrace00/.cache/nougat/base
  # Set mmd_out to null if you don't want to write mmd files as a byproduct.
  mmd_out: null
  # If set to false, pdfs that already have a parsed mmd file in mmd_out are skipped.
  recompute: false
  # Set to true if you want to use fp32 instead of bfloat16 (false is recommended).
  full_precision: false
  # If set to true, output text will be formatted as a markdown file.
  markdown: true
  # Preempt processing a paper if mode collapse causes repetitions (true is recommended).
  skipping: true
  # Path for the nougat-specific logs for the run.
  nougat_logs_path: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes512/nougat_logs

# The compute settings for the workflow
compute_settings:
  # The name of the compute platform to use
  name: leonardo
  # The number of compute nodes to use
  num_nodes: 512
  # Make sure to update the path to your Python environment and HF cache
  worker_init: "module load python/3.11.6--gcc--8.5.0; source /leonardo_scratch/large/userexternal/abrace00/venvs/pdfwf-venv/bin/activate; export HF_HOME=/leonardo_scratch/large/userexternal/abrace00/.cache"
  # The scheduler options to use when submitting jobs
  scheduler_options: "#SBATCH --reservation=s_res_gb"
  # Partition to use.
  partition: boost_usr_prod
  # Quality of service.
  qos: qos_resv
  #qos: boost_qos_dbg
  # Account to charge compute to.
  account: try24_Genomics_0
  # The amount of time to request for your job
  walltime: 01:00:00
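The remaining configs differ only in `num_nodes` and the derived output/log paths, so one way to drive the full scaling sweep is a simple loop; this sketch reuses the assumed `pdfwf.convert` entry point noted earlier:

```bash
# Hedged sketch: run the Leonardo scaling sweep one configuration at a time.
# Each invocation is expected to block until its Parsl run finishes.
for cfg in examples/nougat/leonardo-scaling/nougat.leonardo.nodes{1,2,256,512,1024,1480}.yaml; do
    echo "Launching ${cfg}"
    python -m pdfwf.convert --config "${cfg}"
done
```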