Merge pull request #21 from ramanathanlab/develop

braceal authored Apr 29, 2024
2 parents bfe7dbb + 18a4370 commit 19bf1bf
Showing 76 changed files with 674,703 additions and 128 deletions.
28 changes: 24 additions & 4 deletions README.md
@@ -233,8 +233,11 @@ conda create -n pdfwf python=3.10 -y
conda activate pdfwf
mamba install pytorch torchvision pytorch-cuda=11.8 -c pytorch -c nvidia -y
pip install -r requirements/oreo_requirements.txt
git clone git@github.com:ultralytics/yolov5.git
```

Then set the `yolov5_path` option to the path of the cloned `yolov5` repository.
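
For example, a minimal sketch of the workflow YAML, assuming the `yolov5_path` option sits under the oreo parser settings (only the key name comes from the README; the surrounding layout and path are placeholders):

```yaml
parser_settings:
  name: oreo
  # Path to the cloned yolov5 repository
  yolov5_path: /path/to/yolov5
```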

## CLI
For running smaller jobs without using the Parsl workflow, the CLI can be used.
The CLI provides different commands for running the various parsers. The CLI
@@ -316,9 +319,9 @@ $ pdfwf oreo [OPTIONS]

* `-p, --pdf_path PATH`: The directory containing the PDF files to convert (recursive glob). [required]
* `-o, --output_dir PATH`: The directory to write the output JSON lines file to. [required]
* `-d, --detection_weights_path PATH`: Weights to layout detection model. [required]
* `-t, --text_cls_weights_path PATH`: Model weights for (meta) text classifier. [required]
* `-s, --spv05_category_file_path PATH`: Path to the SPV05 category file. [required]
* `-d, --detection_weights_path PATH`: Weights to layout detection model.
* `-t, --text_cls_weights_path PATH`: Model weights for (meta) text classifier.
* `-s, --spv05_category_file_path PATH`: Path to the SPV05 category file.
* `-d, --detect_only`: File type to be parsed (ignores other files in the input_dir)
* `-m, --meta_only`: Only parse PDFs for meta data
* `-e, --equation`: Include equations into the text categories
@@ -329,7 +332,7 @@ $ pdfwf oreo [OPTIONS]
* `-b, --batch_yolo INTEGER`: Main batch size for detection/# of images loaded per batch [default: 128]
* `-v, --batch_vit INTEGER`: Batch size of pre-processed patches for ViT pseudo-OCR inference [default: 512]
* `-c, --batch_cls INTEGER`: Batch size K for subsequent text processing [default: 512]
* `-o, --bbox_offset INTEGER`: Number of pixels along which [default: 2]
* `-x, --bbox_offset INTEGER`: Number of pixels along which [default: 2]
* `-nc, --num_conversions INTEGER`: Number of pdfs to convert (useful for debugging, by default convert every document) [default: 0].
* `--help`: Show this message and exit.
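
A hypothetical invocation assembling the options above (every path here is a placeholder, not a real location):

```bash
pdfwf oreo \
  --pdf_path /path/to/pdfs \
  --output_dir /path/to/output \
  --detection_weights_path /path/to/detection_weights.pt \
  --text_cls_weights_path /path/to/text_cls_weights.pt \
  --spv05_category_file_path /path/to/spv05_categories.yaml \
  --num_conversions 10
```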

@@ -357,3 +360,20 @@ To generate the CLI documentation, run:
```
typer pdfwf.cli utils docs --output CLI.md
```

## Installation on HPC systems

### Leonardo
```bash
module load python/3.11.6--gcc--8.5.0
python -m venv pdfwf-venv
source pdfwf-venv/bin/activate
pip install -U pip setuptools wheel
pip install torch
pip install numpy
pip install -r requirements/nougat_requirements.txt
pip install -r requirements/oreo_requirements.txt
pip install -e .
python -m nltk.downloader words
python -m nltk.downloader punkt
```
51 changes: 51 additions & 0 deletions examples/nougat/leonardo-scaling/nougat.leonardo.nodes1.yaml
@@ -0,0 +1,51 @@
# The directory containing the pdfs to convert
pdf_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/raw_pdfs/scaling-data/zip-scaling-data

# The output directory of the workflow
out_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes1

# Search for zip files in the pdf_dir
iszip: true

# Temporary storage directory for unzipping files
tmp_storage: /dev/shm

# The settings for the pdf parser
parser_settings:
name: nougat
# Recommended batch size of pages (not pdfs) is 12, the maximum that fits on an A100 64GB.
batchsize: 12
# Path to download the checkpoint to. If it already exists, the existing checkpoint is used.
checkpoint: /leonardo_scratch/large/userexternal/abrace00/.cache/nougat/base
# Set mmd_out to null if you don't want to write mmd files as a byproduct.
mmd_out: null
# If set to false, pdfs that already have a parsed mmd file in mmd_out are skipped.
recompute: false
# Set to true if you want to use fp32 instead of bfloat16 (false is recommended)
full_precision: false
# If set to true, output text will be formatted as a markdown file.
markdown: true
# Preempt processing a paper if mode collapse causes repetitions (true is recommended)
skipping: true
# Path for the nougat-specific logs for the run.
nougat_logs_path: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes1/nougat_logs


# The compute settings for the workflow
compute_settings:
# The name of the compute platform to use
name: leonardo
# The number of compute nodes to use
num_nodes: 1
# Make sure to update the path to your conda environment and HF cache
worker_init: "module load python/3.11.6--gcc--8.5.0; source /leonardo_scratch/large/userexternal/abrace00/venvs/pdfwf-venv/bin/activate; export HF_HOME=/leonardo_scratch/large/userexternal/abrace00/.cache"
# The scheduler options to use when submitting jobs
scheduler_options: "#SBATCH --reservation=s_res_gb"
# Partition to use.
partition: boost_usr_prod
# Quality of service.
qos: qos_resv
# Account to charge compute to.
account: try24_Genomics_0
# The amount of time to request for your job
walltime: 01:00:00
52 changes: 52 additions & 0 deletions examples/nougat/leonardo-scaling/nougat.leonardo.nodes1024.yaml
@@ -0,0 +1,52 @@
# The directory containing the pdfs to convert
pdf_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/raw_pdfs/scaling-data/zip-scaling-data

# The output directory of the workflow
out_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes1024

# Search for zip files in the pdf_dir
iszip: true

# Temporary storage directory for unzipping files
tmp_storage: /dev/shm

# The settings for the pdf parser
parser_settings:
name: nougat
# Recommended batch size of pages (not pdfs) is 12, the maximum that fits on an A100 64GB.
batchsize: 12
# Path to download the checkpoint to. If it already exists, the existing checkpoint is used.
checkpoint: /leonardo_scratch/large/userexternal/abrace00/.cache/nougat/base
# Set mmd_out to null if you don't want to write mmd files as a byproduct.
mmd_out: null
# If set to false, pdfs that already have a parsed mmd file in mmd_out are skipped.
recompute: false
# Set to true if you want to use fp32 instead of bfloat16 (false is recommended)
full_precision: false
# If set to true, output text will be formatted as a markdown file.
markdown: true
# Preempt processing a paper if mode collapse causes repetitions (true is recommended)
skipping: true
# Path for the nougat-specific logs for the run.
nougat_logs_path: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes1024/nougat_logs


# The compute settings for the workflow
compute_settings:
# The name of the compute platform to use
name: leonardo
# The number of compute nodes to use
num_nodes: 1024
# Make sure to update the path to your conda environment and HF cache
worker_init: "module load python/3.11.6--gcc--8.5.0; source /leonardo_scratch/large/userexternal/abrace00/venvs/pdfwf-venv/bin/activate; export HF_HOME=/leonardo_scratch/large/userexternal/abrace00/.cache"
# The scheduler options to use when submitting jobs
scheduler_options: "#SBATCH --reservation=s_res_gb"
# Partition to use.
partition: boost_usr_prod
# Quality of service.
qos: qos_resv
#qos: boost_qos_dbg
# Account to charge compute to.
account: try24_Genomics_0
# The amount of time to request for your job
walltime: 01:00:00
52 changes: 52 additions & 0 deletions examples/nougat/leonardo-scaling/nougat.leonardo.nodes1480.yaml
@@ -0,0 +1,52 @@
# The directory containing the pdfs to convert
pdf_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/raw_pdfs/scaling-data/zip-scaling-data

# The output directory of the workflow
out_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes1480

# Search for zip files in the pdf_dir
iszip: true

# Temporary storage directory for unzipping files
tmp_storage: /dev/shm

# The settings for the pdf parser
parser_settings:
name: nougat
# Recommended batch size of pages (not pdfs) is 12, the maximum that fits on an A100 64GB.
batchsize: 12
# Path to download the checkpoint to. If it already exists, the existing checkpoint is used.
checkpoint: /leonardo_scratch/large/userexternal/abrace00/.cache/nougat/base
# Set mmd_out to null if you don't want to write mmd files as a byproduct.
mmd_out: null
# If set to false, pdfs that already have a parsed mmd file in mmd_out are skipped.
recompute: false
# Set to true if you want to use fp32 instead of bfloat16 (false is recommended)
full_precision: false
# If set to true, output text will be formatted as a markdown file.
markdown: true
# Preempt processing a paper if mode collapse causes repetitions (true is recommended)
skipping: true
# Path for the nougat-specific logs for the run.
nougat_logs_path: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes1480/nougat_logs


# The compute settings for the workflow
compute_settings:
# The name of the compute platform to use
name: leonardo
# The number of compute nodes to use
num_nodes: 1480
# Make sure to update the path to your conda environment and HF cache
worker_init: "module load python/3.11.6--gcc--8.5.0; source /leonardo_scratch/large/userexternal/abrace00/venvs/pdfwf-venv/bin/activate; export HF_HOME=/leonardo_scratch/large/userexternal/abrace00/.cache"
# The scheduler options to use when submitting jobs
scheduler_options: "#SBATCH --reservation=s_res_gb"
# Partition to use.
partition: boost_usr_prod
# Quality of service.
qos: qos_resv
#qos: boost_qos_dbg
# Account to charge compute to.
account: try24_Genomics_0
# The amount of time to request for your job
walltime: 01:00:00
51 changes: 51 additions & 0 deletions examples/nougat/leonardo-scaling/nougat.leonardo.nodes2.yaml
@@ -0,0 +1,51 @@
# The directory containing the pdfs to convert
pdf_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/raw_pdfs/scaling-data/zip-scaling-data

# The output directory of the workflow
out_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes2

# Search for zip files in the pdf_dir
iszip: true

# Temporary storage directory for unzipping files
tmp_storage: /dev/shm

# The settings for the pdf parser
parser_settings:
name: nougat
# Recommended batch size of pages (not pdfs) is 12, the maximum that fits on an A100 64GB.
batchsize: 12
# Path to download the checkpoint to. If it already exists, the existing checkpoint is used.
checkpoint: /leonardo_scratch/large/userexternal/abrace00/.cache/nougat/base
# Set mmd_out to null if you don't want to write mmd files as a byproduct.
mmd_out: null
# If set to false, pdfs that already have a parsed mmd file in mmd_out are skipped.
recompute: false
# Set to true if you want to use fp32 instead of bfloat16 (false is recommended)
full_precision: false
# If set to true, output text will be formatted as a markdown file.
markdown: true
# Preempt processing a paper if mode collapse causes repetitions (true is recommended)
skipping: true
# Path for the nougat-specific logs for the run.
nougat_logs_path: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes2.test2/nougat_logs


# The compute settings for the workflow
compute_settings:
# The name of the compute platform to use
name: leonardo
# The number of compute nodes to use
num_nodes: 2
# Make sure to update the path to your conda environment and HF cache
worker_init: "module load python/3.11.6--gcc--8.5.0; source /leonardo_scratch/large/userexternal/abrace00/venvs/pdfwf-venv/bin/activate; export HF_HOME=/leonardo_scratch/large/userexternal/abrace00/.cache"
# The scheduler options to use when submitting jobs
scheduler_options: "#SBATCH --reservation=s_res_gb"
# Partition to use.
partition: boost_usr_prod
# Quality of service.
qos: qos_resv
# Account to charge compute to.
account: try24_Genomics_0
# The amount of time to request for your job
walltime: 01:00:00
52 changes: 52 additions & 0 deletions examples/nougat/leonardo-scaling/nougat.leonardo.nodes256.yaml
@@ -0,0 +1,52 @@
# The directory containing the pdfs to convert
pdf_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/raw_pdfs/scaling-data/zip-scaling-data

# The output directory of the workflow
out_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes256

# Search for zip files in the pdf_dir
iszip: true

# Temporary storage directory for unzipping files
tmp_storage: /dev/shm

# The settings for the pdf parser
parser_settings:
name: nougat
# Recommended batch size of pages (not pdfs) is 12, the maximum that fits on an A100 64GB.
batchsize: 12
# Path to download the checkpoint to. If it already exists, the existing checkpoint is used.
checkpoint: /leonardo_scratch/large/userexternal/abrace00/.cache/nougat/base
# Set mmd_out to null if you don't want to write mmd files as a byproduct.
mmd_out: null
# If set to false, pdfs that already have a parsed mmd file in mmd_out are skipped.
recompute: false
# Set to true if you want to use fp32 instead of bfloat16 (false is recommended)
full_precision: false
# If set to true, output text will be formatted as a markdown file.
markdown: true
# Preempt processing a paper if mode collapse causes repetitions (true is recommended)
skipping: true
# Path for the nougat-specific logs for the run.
nougat_logs_path: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes256/nougat_logs


# The compute settings for the workflow
compute_settings:
# The name of the compute platform to use
name: leonardo
# The number of compute nodes to use
num_nodes: 256
# Make sure to update the path to your conda environment and HF cache
worker_init: "module load python/3.11.6--gcc--8.5.0; source /leonardo_scratch/large/userexternal/abrace00/venvs/pdfwf-venv/bin/activate; export HF_HOME=/leonardo_scratch/large/userexternal/abrace00/.cache"
# The scheduler options to use when submitting jobs
scheduler_options: "#SBATCH --reservation=s_res_gb"
# Partition to use.
partition: boost_usr_prod
# Quality of service.
qos: qos_resv
#qos: boost_qos_dbg
# Account to charge compute to.
account: try24_Genomics_0
# The amount of time to request for your job
walltime: 01:00:00
52 changes: 52 additions & 0 deletions examples/nougat/leonardo-scaling/nougat.leonardo.nodes512.yaml
@@ -0,0 +1,52 @@
# The directory containing the pdfs to convert
pdf_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/raw_pdfs/scaling-data/zip-scaling-data

# The output directory of the workflow
out_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes512

# Search for zip files in the pdf_dir
iszip: true

# Temporary storage directory for unzipping files
tmp_storage: /dev/shm

# The settings for the pdf parser
parser_settings:
name: nougat
# Recommended batch size of pages (not pdfs) is 12, the maximum that fits on an A100 64GB.
batchsize: 12
# Path to download the checkpoint to. If it already exists, the existing checkpoint is used.
checkpoint: /leonardo_scratch/large/userexternal/abrace00/.cache/nougat/base
# Set mmd_out to null if you don't want to write mmd files as a byproduct.
mmd_out: null
# If set to false, pdfs that already have a parsed mmd file in mmd_out are skipped.
recompute: false
# Set to true if you want to use fp32 instead of bfloat16 (false is recommended)
full_precision: false
# If set to true, output text will be formatted as a markdown file.
markdown: true
# Preempt processing a paper if mode collapse causes repetitions (true is recommended)
skipping: true
# Path for the nougat-specific logs for the run.
nougat_logs_path: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes512/nougat_logs


# The compute settings for the workflow
compute_settings:
# The name of the compute platform to use
name: leonardo
# The number of compute nodes to use
num_nodes: 512
# Make sure to update the path to your conda environment and HF cache
worker_init: "module load python/3.11.6--gcc--8.5.0; source /leonardo_scratch/large/userexternal/abrace00/venvs/pdfwf-venv/bin/activate; export HF_HOME=/leonardo_scratch/large/userexternal/abrace00/.cache"
# The scheduler options to use when submitting jobs
scheduler_options: "#SBATCH --reservation=s_res_gb"
# Partition to use.
partition: boost_usr_prod
# Quality of service.
qos: qos_resv
#qos: boost_qos_dbg
# Account to charge compute to.
account: try24_Genomics_0
# The amount of time to request for your job
walltime: 01:00:00
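
The six `nougat.leonardo.nodes*.yaml` files above are identical except for `num_nodes` and the two paths that embed the node count. A minimal sketch (a hypothetical helper, not part of pdfwf) that renders the varying fields from one template:

```python
from string import Template

# Template for the fields that vary across the scaling configs; the base
# output directory is a placeholder, not the real scratch path.
TEMPLATE = Template(
    "out_dir: $base/nougat.leonardo.nodes$n\n"
    "num_nodes: $n\n"
    "nougat_logs_path: $base/nougat.leonardo.nodes$n/nougat_logs\n"
)

def render_configs(node_counts, base="/path/to/parsed_pdfs/scaling"):
    """Map config filename -> rendered YAML fragment, one per node count."""
    return {
        f"nougat.leonardo.nodes{n}.yaml": TEMPLATE.substitute(base=base, n=n)
        for n in node_counts
    }

configs = render_configs([1, 2, 256, 512, 1024, 1480])
```

The static keys (parser settings, partition, qos, account, walltime) would be appended unchanged to each fragment.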