This is a simple script that creates a pretraining dataset from a folder of input files (txt, md, pdf, docx, epub, html).
- Clone the repository and navigate to the project directory:
  git clone https://github.com/agi-dude/pretraining-generator
  cd pretraining-generator
- Install the required dependencies:
  pip install -r requirements.txt
- Run the script:
  python main.py
- Follow the GUI prompts to select the input folder and output file.
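
For a rough idea of what this kind of conversion involves, here is a minimal sketch. It is not the code in main.py. The function name `build_dataset`, the one-JSON-object-per-line output, and the `"text"` field are all assumptions made for illustration, and only txt and md files are handled, since the other formats need third-party parsers.

```python
import json
from pathlib import Path

# Hypothetical: only plain-text formats are handled here; PDF, DOCX, EPUB and
# HTML need third-party parsers that this sketch leaves out.
TEXT_SUFFIXES = {".txt", ".md"}


def build_dataset(input_dir, output_file):
    """Write one JSON line per non-empty text file and return the record count."""
    count = 0
    with open(output_file, "w", encoding="utf-8") as out:
        # Walk the folder recursively in a stable (sorted) order.
        for path in sorted(Path(input_dir).rglob("*")):
            if not path.is_file() or path.suffix.lower() not in TEXT_SUFFIXES:
                continue
            text = path.read_text(encoding="utf-8", errors="replace").strip()
            if not text:
                continue  # skip empty files
            # Assumed record layout: {"text": "..."} per line (JSONL).
            out.write(json.dumps({"text": text}) + "\n")
            count += 1
    return count
```

Calling `build_dataset("my_docs", "dataset.jsonl")` would write one record per txt or md file found under `my_docs`. Giving the output a `.jsonl` name keeps it from being picked up as an input if it lives in the same folder.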