
FastMindAPI

An easy-to-use, high-performance(?) backend for serving LLMs and other AI models, built on FastAPI.


✨ 1 Features

1.1 Model: Supports models with various backends (a configuration sketch follows this list)

  • Transformers

    • Transformers_CausalLM (AutoModelForCausalLM)
    • Peft_CausalLM (PeftModelForCausalLM)
  • llama.cpp

    • Llamacpp_LLM (Llama)
  • OpenAI

    • OpenAI_ChatModel (/chat/completions)
  • vLLM

    • vLLM_LLM (LLM)
  • MLC LLM

  • ...
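Each backend is selected via the model_type field in the model info passed to the server. A minimal sketch of what info entries for several backends might look like, assuming they all follow the pattern shown in the Quick Start below (the model_type strings come from the list above; the model names and paths are placeholders):

# Hypothetical model-info entries; only the model_type strings are
# documented above, the names and paths are placeholders.
model_info_list = [
    {"model_name": "gemma2", "model_type": "Transformers_CausalLM", "model_path": ".../PTM/gemma-2-2b"},
    {"model_name": "llama3", "model_type": "Llamacpp_LLM",          "model_path": ".../llama-3.gguf"},
    {"model_name": "qwen2",  "model_type": "vLLM_LLM",              "model_path": ".../PTM/Qwen2-7B"},
]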

1.2 Modules: More than just chatting with models

  • Function Calling (extra tools in Python; see the sketch after this list)
  • Retrieval
  • Agent
  • ...
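As noted above, tools for function calling are plain Python functions. A minimal sketch of what such a tool could look like; how a tool is registered with the server is not documented here, so no registration call is shown:

# A plain Python function that could serve as an extra tool for
# function calling; the registration mechanism is not shown here.
def get_weather(city: str) -> str:
    """Return a dummy weather report for the given city."""
    return f"The weather in {city} is sunny, 25 °C."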

1.3 Flexibility: Easy to Use & Highly Customizable

  • Load models at coding time or at runtime
  • Add any APIs you want (a hedged sketch follows below)
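Because the server is built on FastAPI, custom endpoints can in principle be attached to the underlying application. A hedged sketch, assuming the Server object exposes its FastAPI instance as server.app (a hypothetical attribute name, not confirmed by the documentation):

import fastmindapi as FM

server = FM.Server(API_KEY="sk-1999XXYY")

# Hypothetical: attach a custom route, assuming the underlying FastAPI
# app is reachable as `server.app` (attribute name not confirmed).
@server.app.get("/health")
def health():
    return {"status": "ok"}

server.run()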

🚀 2 Quick Start

2.1 Installation

pip install fastmindapi

2.2 Usage (C/S)

2.2.1 Run the server (S)

in Terminal
fastmindapi-server --port 8000 --apikey sk-1999XXYY
in Python
import fastmindapi as FM

# Run the server with an authentication key; port 8000 by default
server = FM.Server(API_KEY="sk-1999XXYY")
server.run()

2.2.2 Access the service (C)

via HTTP requests (curl)
# For concise documentation
curl http://IP:PORT/docs#/

# Use Case
# 1. add model info
curl http://127.0.0.1:8000/model/add_info \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-1999XXYY" \
  -d '{
  "model_name": "gemma2",
  "model_type": "Transformers_CausalLM",
  "model_path": ".../PTM/gemma-2-2b"
}'

# 2. load model
curl http://127.0.0.1:8000/model/load/gemma2 -H "Authorization: Bearer sk-1999XXYY"

# 3. run model inference
# 3.1 Generation API
curl http://127.0.0.1:8000/model/generate/gemma2 \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-1999XXYY" \
  -d '{
  "input_text": "Do you know something about Dota2?",
  "max_new_tokens": 100,
  "return_logits": true,
  "logits_top_k": 10,
  "stop_strings": ["\n"]
}'

# 3.2 OpenAI like API
curl http://127.0.0.1:8000/openai/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-1999XXYY" \
  -d '{
  "model": "gemma2",
  "messages": [
    {
      "role": "system",
      "content": "You are a test assistant."
    },
    {
      "role": "user",
      "content": "Do you know something about Dota2?"
    }
  ],
  "max_completion_tokens": 100,
  "logprobs": true,
  "top_logprobs": 10,
  "stop": ["\n"]
}'
via the Python client
import fastmindapi as FM

# The address defaults to 127.0.0.1:8000
client = FM.Client(IP="x.x.x.x", PORT=xxx, API_KEY="sk-1999XXYY")

# 1. add model info
model_info_list = [
  {
    "model_name": "gemma2",
    "model_type": "Transformers_CausalLM",
    "model_path": ".../PTM/gemma-2-2b"
  },
]
client.add_model_info_list(model_info_list)

# 2. load model
client.load_model("gemma2")

# 3. run model inference
generation_request={
  "input_text": "Do you know something about Dota2?",
  "max_new_tokens": 10,
  "return_logits": True,
  "logits_top_k": 10,
  "stop_strings": ["."]
}
client.generate("gemma2", generation_request)

🪧 We primarily maintain the backend server; the client is provided for reference only. The recommended usage is to send HTTP requests directly. (We may release FM-GUI in the future.)
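Since the intended usage is plain HTTP, the same flow can be driven from any HTTP library. A minimal sketch with Python's requests, mirroring the curl calls above:

import requests

BASE = "http://127.0.0.1:8000"
HEADERS = {"Authorization": "Bearer sk-1999XXYY"}

# 1. Add model info (same payload as the curl example above)
requests.post(f"{BASE}/model/add_info", headers=HEADERS, json={
    "model_name": "gemma2",
    "model_type": "Transformers_CausalLM",
    "model_path": ".../PTM/gemma-2-2b",
})

# 2. Load the model
requests.get(f"{BASE}/model/load/gemma2", headers=HEADERS)

# 3. Run inference via the generation API
response = requests.post(f"{BASE}/model/generate/gemma2", headers=HEADERS, json={
    "input_text": "Do you know something about Dota2?",
    "max_new_tokens": 100,
})
print(response.json())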
