ClassGPT

ChatGPT for my lecture slides

Built with Streamlit, powered by LlamaIndex and LangChain.

Uses the latest ChatGPT API from OpenAI.

Inspired by AthensGPT

App Demo

demo.mp4

How this works

  1. Parses PDFs with pypdf
  2. Builds an index with LlamaIndex's GPTSimpleVectorIndex
  3. Stores indexes and files on S3
  4. Queries the index
    • uses the latest ChatGPT model, gpt-3.5-turbo
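
The four steps above can be pictured with a toy sketch that uses only the standard library. It does not call pypdf, LlamaIndex, or S3: the bag-of-words `embed` function, the `build_index` and `query` helpers, and the in-memory `store` dict are hypothetical stand-ins for text extraction, GPTSimpleVectorIndex, and the bucket. The final call to gpt-3.5-turbo is left out.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a real embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(pages):
    """Steps 1-2: pair each page of (already extracted) text with its vector."""
    return [(page, embed(page)) for page in pages]


def query(index, question, top_k=1):
    """Step 4: return the top_k pages most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine_similarity(q, item[1]), reverse=True)
    return [page for page, _ in ranked[:top_k]]


# Hypothetical lecture text standing in for pages parsed from a PDF.
pages = [
    "Gradient descent updates the weights using the gradient of the loss.",
    "A binary search tree keeps its keys in sorted order.",
]

# Step 3: a dict stands in for the S3 bucket that holds saved indexes.
store = {"lecture1.pdf": build_index(pages)}

context = query(store["lecture1.pdf"], "How does gradient descent update weights?")
print(context[0])
```

In the real app the retrieved text would presumably be passed to the chat model together with the question. Here it is only printed.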

Usage

Configuration and secrets

  1. Configure AWS (quickstart)
    aws configure
  2. Create an S3 bucket with a unique name

  3. Change the bucket name in the codebase (look for bucket_name = "classgpt") to whatever you created.

  4. Rename [.env.local.example] to .env and add your OpenAI credentials
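
Once the .env file exists, its values have to reach the process environment. Below is a minimal sketch of that step using only the standard library. The variable name `OPENAI_API_KEY` and the `load_env` helper are assumptions for illustration (check .env.local.example for the names the app actually expects), and the app itself may load the file differently, for example through Docker Compose.

```python
import os
import tempfile


def load_env(path):
    """Parse KEY=VALUE lines from a .env-style file into a dict."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            # Skip blanks, comments, and anything that is not KEY=VALUE.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Demo with a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, ".env")
    with open(env_path, "w", encoding="utf-8") as handle:
        handle.write("# placeholder credentials\nOPENAI_API_KEY=sk-placeholder\n")
    settings = load_env(env_path)

# Export without overwriting anything already set in the shell.
for key, value in settings.items():
    os.environ.setdefault(key, value)

print(sorted(settings))
```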

Locally

  1. Create a Python environment
    conda create -n classgpt python=3.9
    conda activate classgpt
  2. Install dependencies
    pip install -r requirements.txt
  3. Run the Streamlit app
    cd app/
    streamlit run app/01_❓_Ask.py

Docker

Alternatively, you can use Docker:

    docker compose up

Then open up a new tab and navigate to http://localhost:8501/

TODO

  • local mode for app (no s3)
    • global variable use_s3 to toggle between local and s3 mode
  • deploy app to streamlit cloud
    • have input box for openai key
    • uses pyarrow local FS to store files
  • update code for new langchain update
  • Custom prompts and tweak settings
    • create a settings page for tweaking model parameters and provide custom prompts example
  • Add ability to query on multiple files

FAQ

Tokens

Tokens can be thought of as pieces of words. Before the API processes the prompts, the input is broken down into tokens. These tokens are not cut up exactly where the words start or end - tokens can include trailing spaces and even sub-words. Here are some helpful rules of thumb for understanding tokens in terms of lengths:

  • 1 token ~= 4 chars in English
  • 1 token ~= ¾ of a word
  • 100 tokens ~= 75 words
  • 1-2 sentences ~= 30 tokens
  • 1 paragraph ~= 100 tokens
  • 1,500 words ~= 2048 tokens
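
These rules of thumb can be turned into a quick estimator. The sketch below averages the character-based and word-based rules. `estimate_tokens` is a hypothetical helper and only a heuristic for English text, so use an actual tokenizer when exact counts matter.

```python
def estimate_tokens(text):
    """Rough token estimate for English text from the rules of thumb above."""
    by_chars = len(text) / 4             # 1 token ~= 4 characters
    by_words = len(text.split()) / 0.75  # 1 token ~= 3/4 of a word
    return round((by_chars + by_words) / 2)


sample = "Tokens can be thought of as pieces of words."
print(estimate_tokens(sample))
```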

Try the OpenAI Tokenizer tool

Source

Embeddings

An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
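
To make "distance measures relatedness" concrete, here is a small sketch using cosine distance, one common choice of distance for embeddings. The three-dimensional vectors are made up for illustration, and real embeddings have many more dimensions.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0 means same direction, larger means less related."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Made-up vectors standing in for the embeddings of three pieces of text.
lecture_on_sorting = [0.9, 0.1, 0.0]
question_on_sorting = [0.8, 0.2, 0.1]
question_on_history = [0.0, 0.2, 0.9]

near = cosine_distance(lecture_on_sorting, question_on_sorting)
far = cosine_distance(lecture_on_sorting, question_on_history)
print(near < far)
```

The related pair ends up much closer than the unrelated pair, which is the property a vector index relies on when it picks text to answer a question.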

For text-embedding-ada-002, the cost is $0.0004 / 1K tokens, or roughly 3,000 pages per dollar

Models

For the gpt-3.5-turbo model (ChatGPT API), the cost is $0.002 / 1K tokens

For the text-davinci-003 model, the cost is $0.02 / 1K tokens
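
The per-token prices translate into dollar amounts with simple arithmetic. The sketch below uses the rates quoted in this FAQ, which may since have changed. The `cost_usd` helper and the workload numbers are hypothetical.

```python
# Prices in USD per 1K tokens, copied from the figures quoted in this FAQ.
PRICE_PER_1K = {
    "text-embedding-ada-002": 0.0004,
    "gpt-3.5-turbo": 0.002,
    "text-davinci-003": 0.02,
}


def cost_usd(model, tokens):
    """Cost of processing `tokens` tokens with `model` at the quoted rate."""
    return PRICE_PER_1K[model] * tokens / 1000


# Hypothetical workload: embed a 50,000-token slide deck, then ask 20 questions
# that each use about 1,500 tokens of prompt plus answer.
embedding_cost = cost_usd("text-embedding-ada-002", 50_000)
chat_cost = cost_usd("gpt-3.5-turbo", 20 * 1_500)
print(f"embedding: ${embedding_cost:.4f}, chat: ${chat_cost:.4f}")
```

At $0.0004 per 1K tokens, one dollar buys 2.5 million embedding tokens, so 3,000 pages per dollar works out to roughly 800 tokens per page.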

References

Streamlit

Deployment

LlamaIndex

Loading data

multimodal

ChatGPT

Langchain

Boto3

Docker stuff