This project generates datasets from a corpus for training and evaluating LLMs. It uses OpenAI's GPT models.
Follow the instructions below to get the project up and running on your local machine:
- Clone the repository:
git clone https://github.com/Perseus14/llm-dataset-generator.git
- Navigate to the project directory:
cd llm-dataset-generator
- Create a virtual environment:
virtualenv -p /usr/bin/python3 venv
- Activate the virtual environment:
source venv/bin/activate
- Install dependencies:
pip install -r requirements.txt
To use this project, follow these steps:
- Create a folder and add the corpus to that folder (txt and pdf files are supported)
- Create a .env file and add your OpenAI API key (use .env_local as a template)
- Edit main.py and update the required paths
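As a rough illustration of the first two usage steps, the sketch below reads key/value pairs from a .env file and lists the txt and pdf files in a corpus folder. It is not the project's actual code: the function names `load_env_file` and `list_corpus_files` are hypothetical, and the sketch assumes the .env file uses plain `KEY=VALUE` lines. Check .env_local for the variable name the project expects.

```python
from pathlib import Path

# File types the usage steps list as supported.
SUPPORTED_SUFFIXES = {".txt", ".pdf"}


def load_env_file(env_path):
    """Parse simple KEY=VALUE lines from a .env file into a dict (hypothetical helper)."""
    values = {}
    for raw_line in Path(env_path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def list_corpus_files(corpus_dir):
    """Return supported files directly inside corpus_dir, sorted by path (hypothetical helper)."""
    return sorted(
        path
        for path in Path(corpus_dir).iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

For example, `load_env_file(".env")` returns a dictionary of the entries in the file, and `list_corpus_files("corpus")` returns the txt and pdf paths in a folder named `corpus` (an assumed folder name). Subfolders are not searched in this sketch.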
Planned improvements:
- Generate conversational datasets
- Add support for other LLMs
- Give users more control over generation
Contributions are welcome! Follow the steps below to contribute to this project:
- Fork the repository
- Create a new branch:
git checkout -b new-feature
- Make your changes and commit them:
git commit -m 'Add new feature'
- Push the changes to your forked repository:
git push origin new-feature
- Open a pull request on the original repository
Please ensure that your contributions align with the project's coding style and guidelines.
This project is licensed under the Apache License 2.0.