Merge pull request #21 from ramanathanlab/develop

braceal authored Apr 29, 2024
2 parents bfe7dbb + 18a4370 commit 19bf1bf
Showing 76 changed files with 674,703 additions and 128 deletions.
28 changes: 24 additions & 4 deletions README.md
@@ -233,8 +233,11 @@ conda create -n pdfwf python=3.10 -y
conda activate pdfwf
mamba install pytorch torchvision pytorch-cuda=11.8 -c pytorch -c nvidia -y
pip install -r requirements/oreo_requirements.txt
git clone git@github.com:ultralytics/yolov5.git
```

Then set the `yolov5_path` option to the path of the cloned `yolov5` repository.
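
For example, a minimal sketch of the workflow YAML, assuming the `yolov5_path` option sits under the oreo parser settings (only the key name comes from the README; the surrounding layout and path are placeholders):

```yaml
parser_settings:
  name: oreo
  # Path to the cloned yolov5 repository
  yolov5_path: /path/to/yolov5
```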

## CLI
For running smaller jobs without using the Parsl workflow, the CLI can be used.
The CLI provides different commands for running the various parsers. The CLI
@@ -316,9 +319,9 @@ $ pdfwf oreo [OPTIONS]

* `-p, --pdf_path PATH`: The directory containing the PDF files to convert (recursive glob). [required]
* `-o, --output_dir PATH`: The directory to write the output JSON lines file to. [required]
* `-d, --detection_weights_path PATH`: Weights to layout detection model. [required]
* `-t, --text_cls_weights_path PATH`: Model weights for (meta) text classifier. [required]
* `-s, --spv05_category_file_path PATH`: Path to the SPV05 category file. [required]
* `-d, --detection_weights_path PATH`: Weights to layout detection model.
* `-t, --text_cls_weights_path PATH`: Model weights for (meta) text classifier.
* `-s, --spv05_category_file_path PATH`: Path to the SPV05 category file.
* `-d, --detect_only`: File type to be parsed (ignores other files in the input_dir)
* `-m, --meta_only`: Only parse PDFs for meta data
* `-e, --equation`: Include equations into the text categories
@@ -329,7 +332,7 @@ $ pdfwf oreo [OPTIONS]
* `-b, --batch_yolo INTEGER`: Main batch size for detection/# of images loaded per batch [default: 128]
* `-v, --batch_vit INTEGER`: Batch size of pre-processed patches for ViT pseudo-OCR inference [default: 512]
* `-c, --batch_cls INTEGER`: Batch size K for subsequent text processing [default: 512]
* `-o, --bbox_offset INTEGER`: Number of pixels along which [default: 2]
* `-x, --bbox_offset INTEGER`: Number of pixels along which [default: 2]
* `-nc, --num_conversions INTEGER`: Number of pdfs to convert (useful for debugging, by default convert every document) [default: 0].
* `--help`: Show this message and exit.
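
A hypothetical invocation assembling the options above (every path here is a placeholder, not a real location):

```bash
pdfwf oreo \
  --pdf_path /path/to/pdfs \
  --output_dir /path/to/output \
  --detection_weights_path /path/to/detection_weights.pt \
  --text_cls_weights_path /path/to/text_cls_weights.pt \
  --spv05_category_file_path /path/to/spv05_categories.yaml \
  --num_conversions 10
```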

@@ -357,3 +360,20 @@ To generate the CLI documentation, run:
```
typer pdfwf.cli utils docs --output CLI.md
```

## Installation on HPC systems

### Leonardo
```bash
module load python/3.11.6--gcc--8.5.0
python -m venv pdfwf-venv
source pdfwf-venv/bin/activate
pip install -U pip setuptools wheel
pip install torch
pip install numpy
pip install -r requirements/nougat_requirements.txt
pip install -r requirements/oreo_requirements.txt
pip install -e .
python -m nltk.downloader words
python -m nltk.downloader punkt
```
51 changes: 51 additions & 0 deletions examples/nougat/leonardo-scaling/nougat.leonardo.nodes1.yaml
@@ -0,0 +1,51 @@
# The directory containing the pdfs to convert
pdf_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/raw_pdfs/scaling-data/zip-scaling-data

# The output directory of the workflow
out_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes1

# Search for zip files in the pdf_dir
iszip: true

# Temporary storage directory for unzipping files
tmp_storage: /dev/shm

# The settings for the pdf parser
parser_settings:
name: nougat
# Recommended batch size of pages (not pdfs) is 12, the maximum that fits on an A100 64GB.
batchsize: 12
# Path to download the checkpoint to. If it already exists, the existing checkpoint is used.
checkpoint: /leonardo_scratch/large/userexternal/abrace00/.cache/nougat/base
# Set mmd_out to null if you don't want to write mmd files as a byproduct.
mmd_out: null
# If set to false, pdfs that already have a parsed mmd file in mmd_out are skipped.
recompute: false
# Set to true if you want to use fp32 instead of bfloat16 (false is recommended)
full_precision: false
# If set to true, output text will be formatted as a markdown file.
markdown: true
# Preempt processing a paper if mode collapse causes repetitions (true is recommended)
skipping: true
# Path for the nougat-specific logs for the run.
nougat_logs_path: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes1/nougat_logs


# The compute settings for the workflow
compute_settings:
# The name of the compute platform to use
name: leonardo
# The number of compute nodes to use
num_nodes: 1
# Make sure to update the path to your conda environment and HF cache
worker_init: "module load python/3.11.6--gcc--8.5.0; source /leonardo_scratch/large/userexternal/abrace00/venvs/pdfwf-venv/bin/activate; export HF_HOME=/leonardo_scratch/large/userexternal/abrace00/.cache"
# The scheduler options to use when submitting jobs
scheduler_options: "#SBATCH --reservation=s_res_gb"
# Partition to use.
partition: boost_usr_prod
# Quality of service.
qos: qos_resv
# Account to charge compute to.
account: try24_Genomics_0
# The amount of time to request for your job
walltime: 01:00:00
52 changes: 52 additions & 0 deletions examples/nougat/leonardo-scaling/nougat.leonardo.nodes1024.yaml
@@ -0,0 +1,52 @@
# The directory containing the pdfs to convert
pdf_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/raw_pdfs/scaling-data/zip-scaling-data

# The output directory of the workflow
out_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes1024

# Search for zip files in the pdf_dir
iszip: true

# Temporary storage directory for unzipping files
tmp_storage: /dev/shm

# The settings for the pdf parser
parser_settings:
name: nougat
# Recommended batch size of pages (not pdfs) is 12, the maximum that fits on an A100 64GB.
batchsize: 12
# Path to download the checkpoint to. If it already exists, the existing checkpoint is used.
checkpoint: /leonardo_scratch/large/userexternal/abrace00/.cache/nougat/base
# Set mmd_out to null if you don't want to write mmd files as a byproduct.
mmd_out: null
# If set to false, pdfs that already have a parsed mmd file in mmd_out are skipped.
recompute: false
# Set to true if you want to use fp32 instead of bfloat16 (false is recommended)
full_precision: false
# If set to true, output text will be formatted as a markdown file.
markdown: true
# Preempt processing a paper if mode collapse causes repetitions (true is recommended)
skipping: true
# Path for the nougat-specific logs for the run.
nougat_logs_path: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes1024/nougat_logs


# The compute settings for the workflow
compute_settings:
# The name of the compute platform to use
name: leonardo
# The number of compute nodes to use
num_nodes: 1024
# Make sure to update the path to your conda environment and HF cache
worker_init: "module load python/3.11.6--gcc--8.5.0; source /leonardo_scratch/large/userexternal/abrace00/venvs/pdfwf-venv/bin/activate; export HF_HOME=/leonardo_scratch/large/userexternal/abrace00/.cache"
# The scheduler options to use when submitting jobs
scheduler_options: "#SBATCH --reservation=s_res_gb"
# Partition to use.
partition: boost_usr_prod
# Quality of service.
qos: qos_resv
#qos: boost_qos_dbg
# Account to charge compute to.
account: try24_Genomics_0
# The amount of time to request for your job
walltime: 01:00:00
52 changes: 52 additions & 0 deletions examples/nougat/leonardo-scaling/nougat.leonardo.nodes1480.yaml
@@ -0,0 +1,52 @@
# The directory containing the pdfs to convert
pdf_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/raw_pdfs/scaling-data/zip-scaling-data

# The output directory of the workflow
out_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes1480

# Search for zip files in the pdf_dir
iszip: true

# Temporary storage directory for unzipping files
tmp_storage: /dev/shm

# The settings for the pdf parser
parser_settings:
name: nougat
# Recommended batch size of pages (not pdfs) is 12, the maximum that fits on an A100 64GB.
batchsize: 12
# Path to download the checkpoint to. If it already exists, the existing checkpoint is used.
checkpoint: /leonardo_scratch/large/userexternal/abrace00/.cache/nougat/base
# Set mmd_out to null if you don't want to write mmd files as a byproduct.
mmd_out: null
# If set to false, pdfs that already have a parsed mmd file in mmd_out are skipped.
recompute: false
# Set to true if you want to use fp32 instead of bfloat16 (false is recommended)
full_precision: false
# If set to true, output text will be formatted as a markdown file.
markdown: true
# Preempt processing a paper if mode collapse causes repetitions (true is recommended)
skipping: true
# Path for the nougat-specific logs for the run.
nougat_logs_path: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes1480/nougat_logs


# The compute settings for the workflow
compute_settings:
# The name of the compute platform to use
name: leonardo
# The number of compute nodes to use
num_nodes: 1480
# Make sure to update the path to your conda environment and HF cache
worker_init: "module load python/3.11.6--gcc--8.5.0; source /leonardo_scratch/large/userexternal/abrace00/venvs/pdfwf-venv/bin/activate; export HF_HOME=/leonardo_scratch/large/userexternal/abrace00/.cache"
# The scheduler options to use when submitting jobs
scheduler_options: "#SBATCH --reservation=s_res_gb"
# Partition to use.
partition: boost_usr_prod
# Quality of service.
qos: qos_resv
#qos: boost_qos_dbg
# Account to charge compute to.
account: try24_Genomics_0
# The amount of time to request for your job
walltime: 01:00:00
51 changes: 51 additions & 0 deletions examples/nougat/leonardo-scaling/nougat.leonardo.nodes2.yaml
@@ -0,0 +1,51 @@
# The directory containing the pdfs to convert
pdf_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/raw_pdfs/scaling-data/zip-scaling-data

# The output directory of the workflow
out_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes2

# Search for zip files in the pdf_dir
iszip: true

# Temporary storage directory for unzipping files
tmp_storage: /dev/shm

# The settings for the pdf parser
parser_settings:
name: nougat
# Recommended batch size of pages (not pdfs) is 12, the maximum that fits on an A100 64GB.
batchsize: 12
# Path to download the checkpoint to. If it already exists, the existing checkpoint is used.
checkpoint: /leonardo_scratch/large/userexternal/abrace00/.cache/nougat/base
# Set mmd_out to null if you don't want to write mmd files as a byproduct.
mmd_out: null
# If set to false, pdfs that already have a parsed mmd file in mmd_out are skipped.
recompute: false
# Set to true if you want to use fp32 instead of bfloat16 (false is recommended)
full_precision: false
# If set to true, output text will be formatted as a markdown file.
markdown: true
# Preempt processing a paper if mode collapse causes repetitions (true is recommended)
skipping: true
# Path for the nougat-specific logs for the run.
nougat_logs_path: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes2.test2/nougat_logs


# The compute settings for the workflow
compute_settings:
# The name of the compute platform to use
name: leonardo
# The number of compute nodes to use
num_nodes: 2
# Make sure to update the path to your conda environment and HF cache
worker_init: "module load python/3.11.6--gcc--8.5.0; source /leonardo_scratch/large/userexternal/abrace00/venvs/pdfwf-venv/bin/activate; export HF_HOME=/leonardo_scratch/large/userexternal/abrace00/.cache"
# The scheduler options to use when submitting jobs
scheduler_options: "#SBATCH --reservation=s_res_gb"
# Partition to use.
partition: boost_usr_prod
# Quality of service.
qos: qos_resv
# Account to charge compute to.
account: try24_Genomics_0
# The amount of time to request for your job
walltime: 01:00:00
52 changes: 52 additions & 0 deletions examples/nougat/leonardo-scaling/nougat.leonardo.nodes256.yaml
@@ -0,0 +1,52 @@
# The directory containing the pdfs to convert
pdf_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/raw_pdfs/scaling-data/zip-scaling-data

# The output directory of the workflow
out_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes256

# Search for zip files in the pdf_dir
iszip: true

# Temporary storage directory for unzipping files
tmp_storage: /dev/shm

# The settings for the pdf parser
parser_settings:
name: nougat
# Recommended batch size of pages (not pdfs) is 12, the maximum that fits on an A100 64GB.
batchsize: 12
# Path to download the checkpoint to. If it already exists, the existing checkpoint is used.
checkpoint: /leonardo_scratch/large/userexternal/abrace00/.cache/nougat/base
# Set mmd_out to null if you don't want to write mmd files as a byproduct.
mmd_out: null
# If set to false, pdfs that already have a parsed mmd file in mmd_out are skipped.
recompute: false
# Set to true if you want to use fp32 instead of bfloat16 (false is recommended)
full_precision: false
# If set to true, output text will be formatted as a markdown file.
markdown: true
# Preempt processing a paper if mode collapse causes repetitions (true is recommended)
skipping: true
# Path for the nougat-specific logs for the run.
nougat_logs_path: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes256/nougat_logs


# The compute settings for the workflow
compute_settings:
# The name of the compute platform to use
name: leonardo
# The number of compute nodes to use
num_nodes: 256
# Make sure to update the path to your conda environment and HF cache
worker_init: "module load python/3.11.6--gcc--8.5.0; source /leonardo_scratch/large/userexternal/abrace00/venvs/pdfwf-venv/bin/activate; export HF_HOME=/leonardo_scratch/large/userexternal/abrace00/.cache"
# The scheduler options to use when submitting jobs
scheduler_options: "#SBATCH --reservation=s_res_gb"
# Partition to use.
partition: boost_usr_prod
# Quality of service.
qos: qos_resv
#qos: boost_qos_dbg
# Account to charge compute to.
account: try24_Genomics_0
# The amount of time to request for your job
walltime: 01:00:00
52 changes: 52 additions & 0 deletions examples/nougat/leonardo-scaling/nougat.leonardo.nodes512.yaml
@@ -0,0 +1,52 @@
# The directory containing the pdfs to convert
pdf_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/raw_pdfs/scaling-data/zip-scaling-data

# The output directory of the workflow
out_dir: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes512

# Search for zip files in the pdf_dir
iszip: true

# Temporary storage directory for unzipping files
tmp_storage: /dev/shm

# The settings for the pdf parser
parser_settings:
name: nougat
# Recommended batch size of pages (not pdfs) is 12, the maximum that fits on an A100 64GB.
batchsize: 12
# Path to download the checkpoint to. If it already exists, the existing checkpoint is used.
checkpoint: /leonardo_scratch/large/userexternal/abrace00/.cache/nougat/base
# Set mmd_out to null if you don't want to write mmd files as a byproduct.
mmd_out: null
# If set to false, pdfs that already have a parsed mmd file in mmd_out are skipped.
recompute: false
# Set to true if you want to use fp32 instead of bfloat16 (false is recommended)
full_precision: false
# If set to true, output text will be formatted as a markdown file.
markdown: true
# Preempt processing a paper if mode collapse causes repetitions (true is recommended)
skipping: true
# Path for the nougat-specific logs for the run.
nougat_logs_path: /leonardo_scratch/large/userexternal/abrace00/metric-rag/data/parsed_pdfs/scaling/nougat.leonardo.nodes512/nougat_logs


# The compute settings for the workflow
compute_settings:
# The name of the compute platform to use
name: leonardo
# The number of compute nodes to use
num_nodes: 512
# Make sure to update the path to your conda environment and HF cache
worker_init: "module load python/3.11.6--gcc--8.5.0; source /leonardo_scratch/large/userexternal/abrace00/venvs/pdfwf-venv/bin/activate; export HF_HOME=/leonardo_scratch/large/userexternal/abrace00/.cache"
# The scheduler options to use when submitting jobs
scheduler_options: "#SBATCH --reservation=s_res_gb"
# Partition to use.
partition: boost_usr_prod
# Quality of service.
qos: qos_resv
#qos: boost_qos_dbg
# Account to charge compute to.
account: try24_Genomics_0
# The amount of time to request for your job
walltime: 01:00:00
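
The six `nougat.leonardo.nodes*.yaml` files above are identical except for `num_nodes` and the two paths that embed the node count. A minimal sketch (a hypothetical helper, not part of pdfwf) that renders the varying fields from one template:

```python
from string import Template

# Template for the fields that vary across the scaling configs; the base
# output directory is a placeholder, not the real scratch path.
TEMPLATE = Template(
    "out_dir: $base/nougat.leonardo.nodes$n\n"
    "num_nodes: $n\n"
    "nougat_logs_path: $base/nougat.leonardo.nodes$n/nougat_logs\n"
)

def render_configs(node_counts, base="/path/to/parsed_pdfs/scaling"):
    """Map config filename -> rendered YAML fragment, one per node count."""
    return {
        f"nougat.leonardo.nodes{n}.yaml": TEMPLATE.substitute(base=base, n=n)
        for n in node_counts
    }

configs = render_configs([1, 2, 256, 512, 1024, 1480])
```

The static keys (parser settings, partition, qos, account, walltime) would be appended unchanged to each fragment.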