This repository aims to build an LLM with the help of RAG (Retrieval Augmented Generation) that can answer questions based on the public talks of the Gulaschprogrammiernacht (GPN).
The Gulaschprogrammiernacht is an annual four-day event organized by Entropia e.V., part of the Chaos Computer Club (CCC) in Karlsruhe, Germany. Originally conceived by computer science students in Karlsruhe in 2002, it has grown into one of Europe's major gatherings for the hacker community. The event, which is held at the ZKM | Center for Art and Media as well as the State Academy of Design Karlsruhe, includes hacking, programming, lectures, workshops, and as the name suggests, a significant amount of goulash cooking to feed participants.
There are many interesting lectures; however, they are all quite long (about an hour each), and I currently don't have the time to listen to them. My idea was therefore to create an LLM that I could ask questions about all the talks.
There are five stages to achieve this:
- Crawling all the data from the GPN archive
  - This is done in the crawler.
  - It gets all the metadata, such as speaker, date, and length of the talk, and writes it to `data/metadata/name_of_the_talk.json`.
  - It downloads the corresponding audio file and saves it to `data/audio/name_of_the_talk.mp3`. (See the crawler sketch after this list.)
- Transcribing the audio files
  - This is done in the transcriber.
  - The `Transcriber` uses a speech-to-text model from OpenAI's Whisper.
  - It iterates over all audio files in `data/audio/` and loads them.
  - It splits them into smaller chunks and then uses multithreading to transcribe them.
  - Afterward, it combines all parts of the transcription into one large file and writes it to `data/transcriptions/name_of_the_talk.txt`. (See the transcription sketch after this list.)
- Translating the transcriptions
  - This is done in the translator.
  - It iterates over all metadata files and checks whether the corresponding transcription is already in the target language.
  - The target language can be specified with `--translation-target-language` in `main.py`.
  - When a transcription is not in the target language, it is translated and written back to the original file. (See the translation sketch after this list.)
- Creating the Indexing-Pipeline
  - This is done in the `IndexingPipeline`.
  - This project uses RAG: passages relevant to the current question are retrieved from the indexed transcriptions and handed to the LLM (the same kind of next-word-prediction model that powers ChatGPT) as additional context for generating an answer.
  - The pipeline has multiple steps (see the indexing sketch after this list):
    - Loading the data (transcriptions and metadata): a custom component matches each transcription with its metadata file and loads both.
    - Splitting the data: the transcriptions are split at the sentence level so that retrieval works on small, coherent pieces of text.
    - Embedding the data: this converts the text into high-dimensional vectors that capture semantic meaning, enabling efficient comparison.
    - Writing the data into a `QdrantDocumentStore`: the embedded documents are stored in a vector database so that, at query time, the passages most similar to the question can be retrieved and passed to the LLM.
- Interacting with the Pipeline
  - This is done in the `ChatUI`.
  - The `ChatUI` creates a browser application with the help of Streamlit in which the user can interact with the pipeline, i.e. ask questions. (See the UI sketch after this list.)
  - The `ChatUI` internally uses the `GPNChatPipeline` to generate an answer to the user's prompt.
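As a rough illustration of the crawling stage, the sketch below shows only the output side: writing one talk's metadata to `data/metadata/` and its audio to `data/audio/`. The `talk` dictionary, its keys, and the download call are hypothetical placeholders; the actual crawler in this repository may be structured quite differently.

```python
import json
from pathlib import Path

import requests


def save_talk(talk: dict) -> None:
    """Persist one crawled talk. `talk` and its keys ("slug", "audio_url", ...)
    are hypothetical placeholders for whatever the crawler extracts."""
    name = talk["slug"]

    # Write the metadata next to the other talks.
    metadata_path = Path("data/metadata") / f"{name}.json"
    metadata_path.parent.mkdir(parents=True, exist_ok=True)
    metadata_path.write_text(json.dumps(talk, indent=2), encoding="utf-8")

    # Download the corresponding audio file.
    audio_path = Path("data/audio") / f"{name}.mp3"
    audio_path.parent.mkdir(parents=True, exist_ok=True)
    audio_path.write_bytes(requests.get(talk["audio_url"], timeout=60).content)
```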
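The transcription stage can be sketched as follows. This is only an illustration: it assumes `pydub` for splitting the audio, loads one Whisper model per worker, and uses a process pool where the repository's `Transcriber` uses multithreading, so the actual implementation will differ in detail.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

import whisper
from pydub import AudioSegment

CHUNK_MS = 5 * 60 * 1000  # assumed chunk length: five minutes of audio


def transcribe_chunk(chunk_path: str) -> str:
    # Each worker loads its own model; "base" mirrors the CLI default.
    model = whisper.load_model("base")
    return model.transcribe(chunk_path)["text"]


def transcribe_talk(audio_path: Path, workers: int = 3) -> None:
    audio = AudioSegment.from_mp3(audio_path)

    # Split the talk into smaller chunks that can be transcribed in parallel.
    chunk_paths = []
    for i, start in enumerate(range(0, len(audio), CHUNK_MS)):
        chunk_path = audio_path.with_name(f"{audio_path.stem}_chunk{i}.mp3")
        audio[start:start + CHUNK_MS].export(str(chunk_path), format="mp3")
        chunk_paths.append(str(chunk_path))

    with ProcessPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(transcribe_chunk, chunk_paths))

    # Combine the partial transcriptions into one large file.
    out_path = Path("data/transcriptions") / f"{audio_path.stem}.txt"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(" ".join(parts), encoding="utf-8")
```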
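A sketch of the language check and translation step is shown below. It assumes `langdetect` for language detection and `deep-translator` for the translation itself, and it walks the transcription files directly instead of the metadata files; the libraries and flow actually used by the translator are not specified here, so treat all of this as an assumption.

```python
from pathlib import Path

from deep_translator import GoogleTranslator
from langdetect import detect

TARGET_LANGUAGE = "de"  # mirrors the --translation-target-language default


def translate_if_needed(transcription_path: Path) -> None:
    text = transcription_path.read_text(encoding="utf-8")

    # Skip transcriptions that are already in the target language.
    if detect(text) == TARGET_LANGUAGE:
        return

    # The translation backend only accepts limited-size inputs,
    # so translate the text piece by piece.
    translator = GoogleTranslator(source="auto", target=TARGET_LANGUAGE)
    translated = "\n".join(
        translator.translate(part) for part in text.split("\n") if part.strip()
    )

    # Write the translation back to the original file.
    transcription_path.write_text(translated, encoding="utf-8")


for path in Path("data/transcriptions").glob("*.txt"):
    translate_if_needed(path)
```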
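The component names used above (custom components, pipelines, `QdrantDocumentStore`) suggest the Haystack framework, so the indexing sketch below assumes Haystack 2.x and a local Qdrant instance. The pairing loop stands in for the project's custom loader component, and the embedding model, collection name, split settings, and `embedding_dim` are guesses rather than the project's actual configuration.

```python
import json
from pathlib import Path

from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

# Pair each transcription with its metadata file
# (this loop stands in for the custom loader component).
documents = []
for metadata_path in Path("data/metadata").glob("*.json"):
    transcription = Path("data/transcriptions", f"{metadata_path.stem}.txt").read_text(encoding="utf-8")
    metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
    documents.append(Document(content=transcription, meta=metadata))

# Vector store backed by the Qdrant container started via `docker compose up -d`.
document_store = QdrantDocumentStore(url="http://localhost:6333", index="gpn_talks", embedding_dim=384)

indexing = Pipeline()
indexing.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=3))
indexing.add_component("embedder", SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"))
indexing.add_component("writer", DocumentWriter(document_store=document_store))
indexing.connect("splitter.documents", "embedder.documents")
indexing.connect("embedder.documents", "writer.documents")

indexing.run({"splitter": {"documents": documents}})
```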
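To give a feel for the interaction layer, here is a heavily simplified Streamlit sketch. The import path and the `run()` interface of `GPNChatPipeline` are assumptions made for illustration; only the general shape (a question typed in the browser, an answer produced by the chat pipeline) reflects the description above.

```python
import streamlit as st

# Assumed import path and interface; the real ChatUI wires up GPNChatPipeline itself.
from gpn_chat_pipeline import GPNChatPipeline

st.title("Gulaschprogrammiernacht Chat")

question = st.text_input("Ask a question about the GPN talks")
if question:
    # Hypothetical call: the actual pipeline API may take the prompt differently.
    answer = GPNChatPipeline().run(question)
    st.write(answer)
```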
- Setup
  - Create a virtual environment: `python -m venv .venv`
  - Activate the virtual environment: `source .venv/bin/activate`
  - Install the requirements: `poetry install`
- Running
  - Run `main.py` to crawl, transcribe, and translate the data.

    ```
    $ python main.py --help
    usage: Gulaschprogrammiernacht Chat [-h] [--crawl] [--transcribe]
                                        [--transcription-model {tiny,base,small,medium,large}]
                                        [--transcription-cpu-count TRANSCRIPTION_CPU_COUNT]
                                        [--overwrite-existing-transcriptions]
                                        [--translation-target-language TRANSLATION_TARGET_LANGUAGE]
                                        [--loglevel {debug,info,warning,error,critical}]

    A GPT that is trained on the Gulaschprogrammiernacht Talks

    options:
      -h, --help            show this help message and exit
      --crawl               Crawl the audio files and metadata from the GPN archive. This is slow and only
                            has to be done once, the data is written to disk - Default: False
      --transcribe          Transcribe the audio files. This is slow and only has to be done once, the data
                            is written to disk - Default: False
      --transcription-model {tiny,base,small,medium,large}
                            The Whisper model to be used to transcribe the audio files. The larger the model
                            the more accurate the transcriptions become but the slower it gets. See
                            https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages
                            for more information - Default: base
      --transcription-cpu-count TRANSCRIPTION_CPU_COUNT
                            The amount of CPU cores to use for transcribing - Default: 3/4 of the available
                            CPU cores (15)
      --overwrite-existing-transcriptions
                            Overwrite existing transcriptions - Default: False
      --translation-target-language TRANSLATION_TARGET_LANGUAGE
                            Language to translate the transcriptions to. Specify a ISO 639 language code -
                            Default: de
      --loglevel {debug,info,warning,error,critical}
                            Set the logging level - Default: info
    ```
  - Start the `QdrantDocumentStore` container by running `docker compose up -d`.
  - Run `indexing_pipeline.py` once to process all the data and store it in a `QdrantDocumentStore`.
  - Finally, call `chatui.py` to start the browser interface to query the LLM. (A condensed command sequence is shown below.)
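Putting the steps together, a typical end-to-end run might look like the following. Launching `chatui.py` via `streamlit run` is an assumption based on the Streamlit UI described above; the script may also have its own entry point.

```
# one-time data preparation
python main.py --crawl --transcribe --translation-target-language de

# vector store and indexing
docker compose up -d
python indexing_pipeline.py

# browser interface (assumed to be a Streamlit entry point)
streamlit run chatui.py
```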