This is a simple script that creates a pretraining dataset from a folder of input files (txt, md, pdf, docx, epub, html).

Usage Instructions

  1. Clone the repository and navigate to the project directory:

    git clone https://github.com/agi-dude/pretraining-generator
    cd pretraining-generator
  2. Install the required dependencies:

    pip install -r requirements.txt
  3. Run the script:

    python main.py
  4. Follow the GUI prompts to select the input folder and output file.
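
For readers who want a rough idea of what such a script does, here is a minimal sketch. It is not taken from `main.py`: the function names `collect_input_files` and `build_dataset`, the blank-line separator, and the choice to read only `.txt` and `.md` files are all assumptions made for illustration. The other listed formats (pdf, docx, epub, html) would need dedicated parsers and are skipped here.

```python
from pathlib import Path

# Extensions listed in this README. Only the plain-text ones are read in this sketch.
SUPPORTED = {".txt", ".md", ".pdf", ".docx", ".epub", ".html"}
PLAIN_TEXT = {".txt", ".md"}


def collect_input_files(folder):
    """Return all supported files under `folder`, sorted for a stable order."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )


def build_dataset(folder, output_file, separator="\n\n"):
    """Concatenate plain-text inputs into one file and return how many were written.

    Hypothetical helper: the real script may extract and join text differently.
    """
    output_path = Path(output_file).resolve()
    # Collect inputs first and exclude the output file itself,
    # in case it lives inside the input folder.
    inputs = [p for p in collect_input_files(folder) if p.resolve() != output_path]

    written = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in inputs:
            if path.suffix.lower() not in PLAIN_TEXT:
                continue  # pdf/docx/epub/html need format-specific parsers
            text = path.read_text(encoding="utf-8", errors="replace").strip()
            if text:
                out.write(text + separator)
                written += 1
    return written
```

Calling `build_dataset("my_docs", "dataset.txt")` would write the contents of every `.txt` and `.md` file under `my_docs` into `dataset.txt`, separated by blank lines.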