
This is a simple script that creates a pretraining dataset from a folder of input files (txt, md, pdf, docx, epub, html).

Usage Instructions

Clone the repository and navigate to the project directory:

```
git clone https://github.com/agi-dude/pretraining-generator
cd pretraining-generator
```

Install the required dependencies:
```
pip install -r requirements.txt
```
Run the script:
```
python main.py
```
Follow the GUI prompts to select the input folder and output file.
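
Before pointing the GUI at a large folder, it can help to check which files match the supported formats. The snippet below is a minimal standalone sketch and is not part of `main.py`. The function names `find_input_files` and `preview_inputs` are hypothetical, and the recursive search into subfolders is an assumption, since the script's own traversal behaviour is not documented here.

```python
from pathlib import Path

# Extensions taken from the format list in this README. How main.py itself
# detects file types is not documented here, so matching by suffix is an assumption.
SUPPORTED_EXTENSIONS = {".txt", ".md", ".pdf", ".docx", ".epub", ".html"}


def find_input_files(folder):
    """Return supported files under `folder` (searched recursively), sorted."""
    root = Path(folder)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )


def preview_inputs(folder):
    """Count supported files per lowercase extension (hypothetical helper)."""
    counts = {}
    for path in find_input_files(folder):
        ext = path.suffix.lower()
        counts[ext] = counts.get(ext, 0) + 1
    return counts
```

For example, `print(preview_inputs("my_documents"))` prints a dictionary mapping each extension found to its file count, where `my_documents` is a placeholder folder name.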