GPN-Chat

This repository builds an LLM-based assistant that uses RAG (Retrieval-Augmented Generation) to answer questions about the public talks of the Gulaschprogrammiernacht (GPN).

The Gulaschprogrammiernacht is an annual four-day event organized by Entropia e.V., part of the Chaos Computer Club (CCC) in Karlsruhe, Germany. Originally conceived by computer science students in Karlsruhe in 2002, it has grown into one of Europe's major gatherings for the hacker community. The event, which is held at the ZKM | Center for Art and Media as well as the State Academy of Design Karlsruhe, includes hacking, programming, lectures, workshops, and, as the name suggests, a significant amount of goulash cooking to feed participants.

There are many interesting lectures; however, they are all quite long (about an hour each) and I currently don't have the time to listen to them. Thus, my idea was to create an LLM that I can ask questions about all the talks.

Inner workings

There are five stages to achieve this (an illustrative code sketch for each stage follows the list):

  1. Crawling all the data from the GPN archive
    • This is done in the crawler.
    • It gathers the metadata of each talk, such as the speaker, date, and length, and writes it to data/metadata/name_of_the_talk.json.
    • It downloads the corresponding audio file and saves it to data/audio/name_of_the_talk.mp3.
  2. Transcribing the audio files
    • This is done in the transcriber.
    • The Transcriber uses a speech-to-text model from OpenAI's Whisper.
      • It iterates over all audio files in data/audio/ and loads them.
      • It splits each file into smaller chunks and uses multithreading to transcribe them in parallel.
      • Afterwards, it combines all parts of the transcription into one large file and writes it to data/transcriptions/name_of_the_talk.txt.
  3. Translating the transcriptions
    • This is done in the translator.
    • It iterates over all metadata files and checks whether the corresponding transcription is already in the target language.
    • The target language can be specified with --translation-target-language in main.py.
    • If a transcription is not in the target language, it is translated and written back to the original file.
  4. Creating the Indexing-Pipeline
    • This is done in the IndexingPipeline.
    • This project uses RAG: instead of relying only on what the LLM learned during training, relevant passages from the transcriptions are retrieved at query time and handed to the model as context, so its answers are grounded in the talks.
    • The pipeline has several steps:
      1. Loading the data (audio file and metadata): a custom component matches the corresponding files and loads them.
      2. Splitting the data: the transcriptions are split at sentence level so that retrieval operates on small, semantically coherent units.
      3. Embedding the data: the text is converted into high-dimensional vectors that capture semantic meaning, enabling efficient similarity comparison.
      4. Writing the data into a QdrantDocumentStore: the embedded documents are stored in a Qdrant vector database so that, at query time, the passages closest to a question can be retrieved and passed to the LLM.
  5. Interacting with the Pipeline
    • This is done in the ChatUI.
    • The ChatUI creates a browser application with the help of streamlit in which the user can interact with the Pipeline, i.e. ask questions.
    • Internally, the ChatUI uses the GPNChatPipeline to generate an answer to the user's prompt.
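
The following sketches illustrate, stage by stage, how such a setup can look in code. They are simplified approximations under stated assumptions, not the repository's actual implementation. For stage 1, the crawl step might fetch a talk's metadata and audio like this (the metadata fields and the URL are made up for illustration):

```python
import json
from pathlib import Path

import requests

# Both the metadata fields and the URL below are placeholders; the real
# crawler derives them from the GPN archive.
talk = {"title": "name_of_the_talk", "speaker": "...", "date": "...", "length": "..."}
audio_url = "https://example.org/gpn-archive/name_of_the_talk.mp3"

Path("data/metadata").mkdir(parents=True, exist_ok=True)
Path("data/audio").mkdir(parents=True, exist_ok=True)

# Metadata goes to data/metadata/<talk>.json ...
Path(f"data/metadata/{talk['title']}.json").write_text(json.dumps(talk, indent=2))

# ... and the audio file to data/audio/<talk>.mp3.
response = requests.get(audio_url, timeout=60)
response.raise_for_status()
Path(f"data/audio/{talk['title']}.mp3").write_bytes(response.content)
```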
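For stage 2, a chunk-and-transcribe step could look as follows. pydub for the splitting and a one-minute chunk size are assumptions here; the actual Transcriber may differ:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import whisper
from pydub import AudioSegment

model = whisper.load_model("base")  # matches the --transcription-model option
CHUNK_MS = 60_000  # assumed chunk length of one minute

def transcribe_chunk(chunk: AudioSegment, index: int) -> tuple[int, str]:
    # Whisper's transcribe() expects a file path (or an audio array), so the
    # chunk is written to a temporary file first.
    tmp = Path(f"/tmp/chunk_{index}.mp3")
    chunk.export(str(tmp), format="mp3")
    return index, model.transcribe(str(tmp))["text"]

audio = AudioSegment.from_mp3("data/audio/name_of_the_talk.mp3")
chunks = [audio[i : i + CHUNK_MS] for i in range(0, len(audio), CHUNK_MS)]

# Transcribe the chunks in parallel, cf. --transcription-cpu-count.
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(transcribe_chunk, chunks, range(len(chunks))))

# Reassemble the parts in their original order and write one large file.
text = " ".join(part for _, part in sorted(parts))
Path("data/transcriptions/name_of_the_talk.txt").write_text(text)
```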
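For stage 3, a translation pass might combine language detection with a translation service. langdetect and deep-translator are assumptions; the repository's translator may use different libraries:

```python
from pathlib import Path

from deep_translator import GoogleTranslator
from langdetect import detect

TARGET = "de"  # corresponds to --translation-target-language

for transcription in Path("data/transcriptions").glob("*.txt"):
    text = transcription.read_text()
    if detect(text) == TARGET:
        continue  # already in the target language
    # Translate and write the result back to the original file.
    translated = GoogleTranslator(source="auto", target=TARGET).translate(text)
    transcription.write_text(translated)
```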
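For stage 4, an indexing pipeline of this shape can be built with Haystack and its Qdrant integration; the component parameters below are illustrative, not the repository's exact configuration:

```python
from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

# Connects to the Qdrant container started via docker compose (see Usage).
store = QdrantDocumentStore(url="http://localhost:6333", embedding_dim=384)

pipeline = Pipeline()
pipeline.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=5))
pipeline.add_component("embedder", SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"))  # produces 384-dim vectors
pipeline.add_component("writer", DocumentWriter(document_store=store))
pipeline.connect("splitter.documents", "embedder.documents")
pipeline.connect("embedder.documents", "writer.documents")

# One Document per talk: the transcription as content, its metadata attached.
doc = Document(content="...transcription text...", meta={"speaker": "..."})
pipeline.run({"splitter": {"documents": [doc]}})
```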
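For stage 5, a minimal streamlit chat front end could look like this; the real interface of GPNChatPipeline is not documented in this README, so the call to it is a placeholder:

```python
import streamlit as st

st.title("GPN-Chat")

question = st.chat_input("Ask something about the GPN talks")
if question:
    st.chat_message("user").write(question)
    # Hypothetical call into the project's pipeline:
    # answer = GPNChatPipeline().run(question)
    answer = "..."  # placeholder for the pipeline's answer
    st.chat_message("assistant").write(answer)
```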

Usage

  1. Setup

    1. Create a virtual environment
      python -m venv .venv
    2. Activate the virtual environment
      source .venv/bin/activate
    3. Install the requirements
      poetry install
  2. Running

    1. Run main.py to crawl, transcribe and translate the data.
      $ python main.py --help

      usage: Gulaschprogrammiernacht Chat
             [-h] [--crawl] [--transcribe]
             [--transcription-model {tiny,base,small,medium,large}]
             [--transcription-cpu-count TRANSCRIPTION_CPU_COUNT]
             [--overwrite-existing-transcriptions]
             [--translation-target-language TRANSLATION_TARGET_LANGUAGE]
             [--loglevel {debug,info,warning,error,critical}]

      A GPT that is trained on the Gulaschprogrammiernacht Talks

      options:
        -h, --help            show this help message and exit
        --crawl               Crawl the audio files and metadata from the GPN
                              archive. This is slow and only has to be done once,
                              the data is written to disk - Default: False
        --transcribe          Transcribe the audio files. This is slow and only
                              has to be done once, the data is written to disk -
                              Default: False
        --transcription-model {tiny,base,small,medium,large}
                              The Whisper model to be used to transcribe the
                              audio files. The larger the model the more accurate
                              the transcriptions become but the slower it gets.
                              See https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages
                              for more information - Default: base
        --transcription-cpu-count TRANSCRIPTION_CPU_COUNT
                              The amount of CPU cores to use for transcribing -
                              Default: 3/4 of the available CPU cores (15)
        --overwrite-existing-transcriptions
                              Overwrite existing transcriptions - Default: False
        --translation-target-language TRANSLATION_TARGET_LANGUAGE
                              Language to translate the transcriptions to.
                              Specify an ISO 639 language code - Default: de
        --loglevel {debug,info,warning,error,critical}
                              Set the logging level - Default: info

    2. Start the Qdrant database container by running docker compose up -d (a sample compose file is sketched below).
    3. Run indexing_pipeline.py once to process all the data and store it in the QdrantDocumentStore.
    4. Finally, run chatui.py to start the browser interface and query the LLM.
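
For reference, a minimal compose file for a local Qdrant instance could look like the following sketch; the compose file shipped with the repository may differ:

```yaml
# Hypothetical docker-compose.yml for a local Qdrant instance (illustrative,
# not the repository's actual file).
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"   # REST API, the port the QdrantDocumentStore connects to
    volumes:
      - ./qdrant_storage:/qdrant/storage   # persist the vector data on disk
```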
