StudySetCreator

Description

StudySetCreator is a Python tool that generates study sets (flashcards) from PDF files using OpenAI's language models. It processes the content of a PDF file, extracts text and images, and uses the OpenAI API to create question-answer pairs suitable for studying or revision purposes. The generated study set is saved as a CSV file, ready to be imported into flashcard applications or used directly.

Features

PDF Processing: Extracts text and images from PDF files.
OpenAI Integration: Utilizes OpenAI's GPT models to generate study cards from the extracted content.
Batch Processing: Supports processing in chunks to handle large PDF files efficiently.
Resume Capability: Can resume processing from where it left off in case of interruptions.
Language Support: Generates study sets in the specified language.
Customization: Allows customization of various parameters like model selection, output file name, chunk size, etc.

Prerequisites

Python: Version 3.7 or higher.
OpenAI API Key: Required to access OpenAI's language models.

Installation

Clone the Repository

git clone https://github.com/jaylann/StudySetCreator.git
cd StudySetCreator

Create a Virtual Environment (optional but recommended)

python3 -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Install Dependencies
```
pip install -r requirements.txt
```

Configuration

Set Up the `.env` File

The application requires an OpenAI API key to function. This key should be stored in a .env file in the project's root directory.

Create the .env File

There is a .env.template file provided in the project. Copy this template to create your .env file:
```
cp .env.template .env
```
Edit the .env File

Open the .env file in a text editor and add your OpenAI API key:
```
OPENAI_API_KEY=your-openai-api-key-here
```
Replace your-openai-api-key-here with your actual OpenAI API key.

Usage

Run the main.py script with the required arguments to generate a study set from a PDF file.

python main.py [options]

Optional Arguments

--model: OpenAI model to use (default: gpt-4o-mini).
--output: Output CSV file name (default: study_set.csv).
--input: Input PDF file to process (required).
--in_dir: Input directory containing PDF files to process.
--out_dir: Output directory to save the study sets.
--chunk_size: Number of pages to process at once (default: 10).
--use_batch: Use OpenAI Batch API for processing.
--text_only: Extract text only, ignore images.
--language: Language for the study set (default: english).
--no_resume: Whether to resume processing from the last checkpoint. WARNING: If set and a progress file exists, it will be overwritten.

Examples

Basic Usage

Generate a study set from document.pdf using the default settings.
```
python main.py --input document.pdf --output document.csv
```
Specify Output File and Model

Generate a study set from lecture_notes.pdf, using the gpt-4o model, and save the output to flashcards.csv.
```
python main.py --model gpt-4o --output flashcards.csv --input lecture_notes.pdf
```
Process Only Text Content

Generate a study set ignoring images in the PDF.
```
python main.py --text_only --input textbook.pdf --output textbook.csv
```
Use Batch Processing

Use OpenAI's Batch API to process the PDF (suitable for large PDFs. Reduces cost by ~50% but may take longer).
```
python main.py --use_batch --input large_document.pdf --output large_document.csv
```

Specify Language

Generate a study set in Spanish.

python main.py --language spanish --input notas_de_clase.pdf --output notas_de_clase.csv

Customization

Modifying the Prompt

The system prompt used by the OpenAI API can be customized to change how study cards are generated.

Prompt File: ./storage/prompt.txt

Edit this file to modify the prompt. The placeholder [LANGUAGE] in the prompt will be replaced with the language specified via the --language argument.

Modifying the Schema

The JSON schema defines the expected structure of the API responses.

Schema File: ./storage/schema.json

Edit this file to change the schema if you need the responses in a different format.

Logging

The application uses logging to provide information about its operation.

Log Output: The application outputs logs to the console. You can modify the logging configuration in src/utils/logging.py if you need to change log levels or output formats.

Error Handling

Resume Processing: If the processing is interrupted, the application can resume from where it left off using the progress saved in progress.json.
Progress File: The file progress.json is used to keep track of progress. It can be deleted to start processing from the beginning.
Batch Processing Errors: If a batch job fails or is still in progress, an error message will be logged.

Dependencies

All required Python packages are listed in requirements.txt. Install them using:

pip install -r requirements.txt

Notes

API Usage: Be mindful of your OpenAI API usage and billing.
Supported Models: Ensure that the model you specify (e.g., gpt-4) is available to your OpenAI account.

Contributing

Contributions are welcome! Please open an issue or submit a pull request for any bugs or feature requests.

License

This project is licensed under the MIT License.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

StudySetCreator

Description

Features

Prerequisites

Installation

Configuration

Set Up the `.env` File

Usage

Optional Arguments

Examples

Customization

Modifying the Prompt

Modifying the Schema

Logging

Error Handling

Dependencies

Notes

Contributing

License

Files

README.md

Latest commit

History

README.md

File metadata and controls

StudySetCreator

Description

Features

Prerequisites

Installation

Configuration

Set Up the .env File

Usage

Optional Arguments

Examples

Customization

Modifying the Prompt

Modifying the Schema

Logging

Error Handling

Dependencies

Notes

Contributing

License

Set Up the `.env` File