diff --git a/search/search_index.json b/search/search_index.json
index 237c8590..2c79ddb8 100644
--- a/search/search_index.json
+++ b/search/search_index.json
@@ -1 +1 @@
-{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"FastMLX","text":"FastMLX is a high-performance, production-ready API for hosting MLX models, including Vision Language Models (VLMs) and Language Models (LMs). It provides an easy-to-use interface for integrating powerful machine learning capabilities into your applications.
"},{"location":"#key-features","title":"Key Features","text":" - OpenAI-compatible API: Easily integrate with existing applications that use OpenAI's API.
- Dynamic Model Loading: Load MLX models on-the-fly or use pre-loaded models for better performance.
- Support for Multiple Model Types: Compatible with various MLX model architectures.
- Image Processing Capabilities: Handle both text and image inputs for versatile model interactions.
- Efficient Resource Management: Optimized for high-performance and scalability.
- Error Handling: Robust error management for production environments.
- Customizable: Easily extendable to accommodate specific use cases and model types.
"},{"location":"#quick-start","title":"Quick Start","text":"Get started with FastMLX: Learn how to install and set up FastMLX in your environment.
Explore Examples: Hands-on guides, such as:
- Chatbot application
- Function calling
"},{"location":"#installation","title":"Installation","text":"Install FastMLX on your system by running the following command:
pip install -U fastmlx\n
"},{"location":"#running-the-server","title":"Running the Server","text":"Start the FastMLX server using the following command:
fastmlx\n
or with multiple workers for improved performance:
fastmlx --workers 4\n
"},{"location":"#making-api-calls","title":"Making API Calls","text":"Once the server is running, you can interact with the API. Here's an example using a Vision Language Model:
import requests\nimport json\n\nurl = \"http://localhost:8000/v1/chat/completions\"\nheaders = {\"Content-Type\": \"application/json\"}\ndata = {\n \"model\": \"mlx-community/nanoLLaVA-1.5-4bit\",\n \"image\": \"http://images.cocodataset.org/val2017/000000039769.jpg\",\n \"messages\": [{\"role\": \"user\", \"content\": \"What are these\"}],\n \"max_tokens\": 100\n}\n\nresponse = requests.post(url, headers=headers, data=json.dumps(data))\nprint(response.json())\n
"},{"location":"#whats-next","title":"What's Next?","text":" - Check out the Installation guide for detailed setup instructions.
- Learn more about the API usage in the Usage section.
- Explore advanced features and configurations in the API Reference.
- If you're interested in contributing, see our Contributing guidelines.
"},{"location":"#license","title":"License","text":"FastMLX is free software, licensed under the Apache Software License 2.0.
For more detailed information and advanced usage, please explore the rest of our documentation. If you encounter any issues or have questions, don't hesitate to report an issue on our GitHub repository.
Happy coding with FastMLX!
"},{"location":"changelog/","title":"Changelog","text":""},{"location":"changelog/#v010-11-july-2024","title":"[v0.1.0] - 11 July 2024","text":"What's Changed
- Add support for token streaming and custom CORS by @Blaizzy
- Add support for Parallel calls by @Blaizzy
- Add Parallel calls usage by @Blaizzy
Fixes:
- Cross origin Support #2
- Max tokens not overriding #5
"},{"location":"changelog/#v001-09-july-2024","title":"[v0.0.1] - 09 July 2024","text":"What's Changed
- Setup FastMLX by @Blaizzy
- Add support for VLMs by @Blaizzy
- Add support for LMs by @Blaizzy
New Contributors
- @Blaizzy made their first contribution in https://github.com/Blaizzy/fastmlx/pull/1
"},{"location":"cli_reference/","title":"CLI Reference","text":"The FastMLX API server can be configured using various command-line arguments. Here is a detailed reference for each available option.
"},{"location":"cli_reference/#usage","title":"Usage","text":"fastmlx [OPTIONS]\n
"},{"location":"cli_reference/#options","title":"Options","text":""},{"location":"cli_reference/#-allowed-origins","title":"--allowed-origins
","text":" - Type: List of strings
- Default:
[\"*\"]
- Description: List of allowed origins for CORS (Cross-Origin Resource Sharing).
"},{"location":"cli_reference/#-host","title":"--host
","text":" - Type: String
- Default:
\"0.0.0.0\"
- Description: Host to run the server on.
"},{"location":"cli_reference/#-port","title":"--port
","text":" - Type: Integer
- Default:
8000
- Description: Port to run the server on.
"},{"location":"cli_reference/#-reload","title":"--reload
","text":" - Type: Boolean
- Default:
False
- Description: Enable auto-reload of the server. Only works when 'workers' is set to None.
"},{"location":"cli_reference/#-workers","title":"--workers
","text":" - Type: Integer or Float
- Default: Calculated based on
FASTMLX_NUM_WORKERS
environment variable or 2 if not set. -
Description: Number of workers. This option overrides the FASTMLX_NUM_WORKERS
environment variable.
-
If an integer, it specifies the exact number of workers to use.
- If a float, it represents the fraction of available CPU cores to use (minimum 1 worker).
- To use all available CPU cores, set it to 1.0.
Examples: - --workers 1
: Use 1 worker - --workers 1.0
: Use all available CPU cores - --workers 0.5
: Use half of the available CPU cores - --workers 0.0
: Use 1 worker
"},{"location":"cli_reference/#environment-variables","title":"Environment Variables","text":" FASTMLX_NUM_WORKERS
: Sets the default number of workers if not specified via the --workers
argument.
"},{"location":"cli_reference/#examples","title":"Examples","text":" -
Run the server on localhost with default settings:
fastmlx\n
-
Run the server on a specific host and port:
fastmlx --host 127.0.0.1 --port 5000\n
-
Run the server with 4 workers:
fastmlx --workers 4\n
-
Run the server using half of the available CPU cores:
fastmlx --workers 0.5\n
-
Enable auto-reload (for development):
fastmlx --reload\n
Remember that the --reload
option is intended for development purposes and should not be used in production environments.
"},{"location":"community_projects/","title":"Community Projects","text":"Here are some projects built by the community that use FastMLX:
- FastMLX-MineCraft by Mathieu
- MLX Chat by Nils Durner
- AI Home Hub by Prince Canuma
"},{"location":"community_projects/#projects-in-detail","title":"PROJECTS IN DETAIL","text":""},{"location":"community_projects/#fastmlx-minecraft-by-mathieu","title":"FastMLX-MineCraft by Mathieu","text":""},{"location":"community_projects/#mlx-chat-by-nils-durner","title":"MLX Chat by Nils Durner","text":"Chat interface for MLX for on-device Language Model use on Apple Silicon. Built on FastMLX.
"},{"location":"community_projects/#home-hub-by-prince-canuma","title":"Home Hub by Prince Canuma","text":"Turning your Mac into an AI home server.
"},{"location":"contributing/","title":"Join us in making a difference!","text":"Your contributions are always welcome and we would love to see how you can make our project even better. Your input is invaluable to us, and we ensure that all contributors receive recognition for their efforts.
"},{"location":"contributing/#ways-to-contribute","title":"Ways to contribute","text":"Here\u2019s how you can get involved:
"},{"location":"contributing/#report-bugs","title":"Report Bugs","text":"Report bugs at https://github.com/Blaizzy/fastmlx/issues.
If you are reporting a bug, please include:
- Your operating system name and version.
- Any details about your local setup that might be helpful in troubleshooting.
- Detailed steps to reproduce the bug.
"},{"location":"contributing/#fix-bugs","title":"Fix Bugs","text":"Look through the GitHub issues for bugs. Anything tagged with bug
and help wanted
is open to whoever wants to implement it.
"},{"location":"contributing/#implement-features","title":"Implement Features","text":"Look through the GitHub issues for features. If anything tagged enhancement
and help wanted
catches your eye, dive in and start coding. Your ideas can become a reality in FastMLX!
"},{"location":"contributing/#write-documentation","title":"Write Documentation","text":"We\u2019re always in need of more documentation, whether it\u2019s for our official docs, adding helpful comments in the code, or writing blog posts and articles. Clear and comprehensive documentation empowers the community, and your contributions are crucial!
"},{"location":"contributing/#submit-feedback","title":"Submit Feedback","text":"The best way to share your thoughts is by filing an issue on our GitHub page: https://github.com/Blaizzy/fastmlx/issues. Whether you\u2019re suggesting a new feature or sharing your experience, we want to hear from you!
Proposing a feature?
- Describe in detail how it should work.
- Keep it focused and manageable to make implementation smoother.
- Remember, this is a volunteer-driven project, and your contributions are always appreciated!
"},{"location":"contributing/#how-to-get-started","title":"How to get Started!","text":"Ready to contribute? Follow these simple steps to set up FastMLX for local development and start making a difference.
-
Fork the repository.
- Head over to the fastmlx GitHub repo and click the Fork button to create your copy of the repository.
-
Clone your fork locally
- Open your terminal and run the following command to clone your forked repository:
$ git clone git@github.com:your_name_here/fastmlx.git\n
-
Set Up Your Development Environment
- Install your local copy of FastMLX into a virtual environment. If you\u2019re using
virtualenvwrapper
, follow these steps:
$ mkvirtualenv fastmlx\n$ cd fastmlx/\n$ python setup.py develop\n
Tip: If you don\u2019t have virtualenvwrapper
installed, you can install it with pip install virtualenvwrapper
.
-
Create a Development Branch
- Create a new branch to work on your bugfix or feature:
$ git checkout -b name-of-your-bugfix-or-feature\n
Now you\u2019re ready to make changes!
-
Run Tests and Code Checks
- When you're done making changes, check that your changes pass flake8 and the tests, including testing other Python versions with tox:
$ flake8 fastmlx tests\n$ pytest .\n
- To install flake8 and tox, simply run:
pip install flake8 tox\n
-
Commit and Push Your Changes
- Once everything looks good, commit your changes with a descriptive message:
$ git add .\n$ git commit -m \"Your detailed description of your changes.\"\n$ git push origin name-of-your-bugfix-or-feature\n
-
Submit a Pull Request
- Head back to the FastMLX GitHub repo and open a pull request. We\u2019ll review your changes, provide feedback, and merge them once everything is ready.
"},{"location":"contributing/#pull-request-guidelines","title":"Pull Request Guidelines","text":"Before you submit a pull request, check that it meets these guidelines:
- The pull request should include tests.
- If the pull request adds functionality, the docs should be updated. Put your new functionality into a function with a docstring, and add the feature to the list in README.rst.
- The pull request should work for Python 3.8 and later, and for PyPy. Check https://github.com/Blaizzy/fastmlx/pull_requests and make sure that the tests pass for all supported Python versions.
"},{"location":"endpoints/","title":"Endpoints","text":"Top-level package for fastmlx.
"},{"location":"endpoints/#fastmlx.add_model","title":"add_model(model_name)
async
","text":"Add a new model to the API.
Parameters:
Name Type Description Default model_name
str
The name of the model to add.
required Returns:
Name Type Description dict
dict
A dictionary containing the status of the operation.
Source code in fastmlx/fastmlx.py
@app.post(\"/v1/models\")\nasync def add_model(model_name: str):\n \"\"\"\n Add a new model to the API.\n\n Args:\n model_name (str): The name of the model to add.\n\n Returns:\n dict (dict): A dictionary containing the status of the operation.\n \"\"\"\n model_provider.load_model(model_name)\n return {\"status\": \"success\", \"message\": f\"Model {model_name} added successfully\"}\n
"},{"location":"endpoints/#fastmlx.chat_completion","title":"chat_completion(request)
async
","text":"Handle chat completion requests for both VLM and LM models.
Parameters:
Name Type Description Default request
ChatCompletionRequest
The chat completion request.
required Returns:
Name Type Description ChatCompletionResponse
ChatCompletionResponse
The generated chat completion response.
Raises:
Type Description HTTPException(str)
If MLX library is not available.
Source code in fastmlx/fastmlx.py
@app.post(\"/v1/chat/completions\", response_model=ChatCompletionResponse)\nasync def chat_completion(request: ChatCompletionRequest):\n \"\"\"\n Handle chat completion requests for both VLM and LM models.\n\n Args:\n request (ChatCompletionRequest): The chat completion request.\n\n Returns:\n ChatCompletionResponse (ChatCompletionResponse): The generated chat completion response.\n\n Raises:\n HTTPException (str): If MLX library is not available.\n \"\"\"\n if not MLX_AVAILABLE:\n raise HTTPException(status_code=500, detail=\"MLX library not available\")\n\n stream = request.stream\n model_data = model_provider.load_model(request.model)\n model = model_data[\"model\"]\n config = model_data[\"config\"]\n model_type = MODEL_REMAPPING.get(config[\"model_type\"], config[\"model_type\"])\n stop_words = get_eom_token(request.model)\n\n if model_type in MODELS[\"vlm\"]:\n processor = model_data[\"processor\"]\n image_processor = model_data[\"image_processor\"]\n\n image_url = None\n chat_messages = []\n\n for msg in request.messages:\n if isinstance(msg.content, str):\n chat_messages.append({\"role\": msg.role, \"content\": msg.content})\n elif isinstance(msg.content, list):\n text_content = \"\"\n for content_part in msg.content:\n if content_part.type == \"text\":\n text_content += content_part.text + \" \"\n elif content_part.type == \"image_url\":\n image_url = content_part.image_url[\"url\"]\n chat_messages.append(\n {\"role\": msg.role, \"content\": text_content.strip()}\n )\n\n if not image_url and model_type in MODELS[\"vlm\"]:\n raise HTTPException(\n status_code=400, detail=\"Image URL not provided for VLM model\"\n )\n\n prompt = \"\"\n if model.config.model_type != \"paligemma\":\n prompt = apply_vlm_chat_template(processor, config, chat_messages)\n else:\n prompt = chat_messages[-1][\"content\"]\n\n if stream:\n return StreamingResponse(\n vlm_stream_generator(\n model,\n request.model,\n processor,\n image_url,\n prompt,\n image_processor,\n request.max_tokens,\n request.temperature,\n stream_options=request.stream_options,\n ),\n media_type=\"text/event-stream\",\n )\n else:\n # Generate the response\n output = vlm_generate(\n model,\n processor,\n image_url,\n prompt,\n image_processor,\n max_tokens=request.max_tokens,\n temp=request.temperature,\n verbose=False,\n )\n\n else:\n # Add function calling information to the prompt\n if request.tools and \"firefunction-v2\" not in request.model:\n # Handle system prompt\n if request.messages and request.messages[0].role == \"system\":\n pass\n else:\n # Generate system prompt based on model and tools\n prompt, user_role = get_tool_prompt(\n request.model,\n [tool.model_dump() for tool in request.tools],\n request.messages[-1].content,\n )\n\n if user_role:\n request.messages[-1].content = prompt\n else:\n # Insert the system prompt at the beginning of the messages\n request.messages.insert(\n 0, ChatMessage(role=\"system\", content=prompt)\n )\n\n tokenizer = model_data[\"tokenizer\"]\n\n chat_messages = [\n {\"role\": msg.role, \"content\": msg.content} for msg in request.messages\n ]\n prompt = apply_lm_chat_template(tokenizer, chat_messages, request)\n\n if stream:\n return StreamingResponse(\n lm_stream_generator(\n model,\n request.model,\n tokenizer,\n prompt,\n request.max_tokens,\n request.temperature,\n stop_words=stop_words,\n stream_options=request.stream_options,\n ),\n media_type=\"text/event-stream\",\n )\n else:\n output, token_length_info = lm_generate(\n model,\n tokenizer,\n prompt,\n request.max_tokens,\n 
temp=request.temperature,\n stop_words=stop_words,\n )\n\n # Parse the output to check for function calls\n return handle_function_calls(output, request, token_length_info)\n
"},{"location":"endpoints/#fastmlx.get_supported_models","title":"get_supported_models()
async
","text":"Get a list of supported model types for VLM and LM.
Returns:
Name Type Description JSONResponse
json
A JSON response containing the supported models.
Source code in fastmlx/fastmlx.py
@app.get(\"/v1/supported_models\", response_model=SupportedModels)\nasync def get_supported_models():\n \"\"\"\n Get a list of supported model types for VLM and LM.\n\n Returns:\n JSONResponse (json): A JSON response containing the supported models.\n \"\"\"\n return JSONResponse(content=MODELS)\n
"},{"location":"endpoints/#fastmlx.list_models","title":"list_models()
async
","text":"List all available (loaded) models.
Returns:
Name Type Description dict
dict
A dictionary containing the list of available models.
Source code in fastmlx/fastmlx.py
@app.get(\"/v1/models\")\nasync def list_models():\n \"\"\"\n List all available (loaded) models.\n\n Returns:\n dict (dict): A dictionary containing the list of available models.\n \"\"\"\n return {\"models\": await model_provider.get_available_models()}\n
"},{"location":"endpoints/#fastmlx.lm_generate","title":"lm_generate(model, tokenizer, prompt, max_tokens=100, **kwargs)
","text":"Generate a complete response from the model.
Parameters:
Name Type Description Default model
Module
The language model.
required tokenizer
PreTrainedTokenizer
The tokenizer.
required prompt
str
The string prompt.
required max_tokens
int
The maximum number of tokens. Default: 100
.
100
verbose
bool
If True
, print tokens and timing information. Default: False
.
required formatter
Optional[Callable]
A function which takes a token and a probability and displays it.
required kwargs
The remaining options get passed to :func:generate_step
. See :func:generate_step
for more details.
{}
Source code in fastmlx/utils.py
def lm_generate(\n model,\n tokenizer,\n prompt: str,\n max_tokens: int = 100,\n **kwargs,\n) -> Union[str, Generator[str, None, None]]:\n \"\"\"\n Generate a complete response from the model.\n\n Args:\n model (nn.Module): The language model.\n tokenizer (PreTrainedTokenizer): The tokenizer.\n prompt (str): The string prompt.\n max_tokens (int): The maximum number of tokens. Default: ``100``.\n verbose (bool): If ``True``, print tokens and timing information.\n Default: ``False``.\n formatter (Optional[Callable]): A function which takes a token and a\n probability and displays it.\n kwargs: The remaining options get passed to :func:`generate_step`.\n See :func:`generate_step` for more details.\n \"\"\"\n if not isinstance(tokenizer, TokenizerWrapper):\n tokenizer = TokenizerWrapper(tokenizer)\n\n stop_words = kwargs.pop(\"stop_words\", [])\n\n stop_words_id = (\n tokenizer._tokenizer(stop_words)[\"input_ids\"][0] if stop_words else None\n )\n\n prompt_tokens = mx.array(tokenizer.encode(prompt))\n prompt_token_len = len(prompt_tokens)\n detokenizer = tokenizer.detokenizer\n\n detokenizer.reset()\n\n for (token, logprobs), n in zip(\n generate_step(prompt_tokens, model, **kwargs),\n range(max_tokens),\n ):\n if token == tokenizer.eos_token_id or (\n stop_words_id and token in stop_words_id\n ):\n break\n\n detokenizer.add_token(token)\n\n detokenizer.finalize()\n\n _completion_tokens = len(detokenizer.tokens)\n token_length_info: Usage = Usage(\n prompt_tokens=prompt_token_len,\n completion_tokens=_completion_tokens,\n total_tokens=prompt_token_len + _completion_tokens,\n )\n return detokenizer.text, token_length_info\n
"},{"location":"endpoints/#fastmlx.remove_model","title":"remove_model(model_name)
async
","text":"Remove a model from the API.
Parameters:
Name Type Description Default model_name
str
The name of the model to remove.
required Returns:
Name Type Description Response
str
A 204 No Content response if successful.
Raises:
Type Description HTTPException(str)
If the model is not found.
Source code in fastmlx/fastmlx.py
@app.delete(\"/v1/models\")\nasync def remove_model(model_name: str):\n \"\"\"\n Remove a model from the API.\n\n Args:\n model_name (str): The name of the model to remove.\n\n Returns:\n Response (str): A 204 No Content response if successful.\n\n Raises:\n HTTPException (str): If the model is not found.\n \"\"\"\n model_name = unquote(model_name).strip('\"')\n removed = await model_provider.remove_model(model_name)\n if removed:\n return Response(status_code=204) # 204 No Content - successful deletion\n else:\n raise HTTPException(status_code=404, detail=f\"Model '{model_name}' not found\")\n
"},{"location":"installation/","title":"Installation","text":""},{"location":"installation/#stable-release","title":"Stable release","text":"To install the latest stable release of FastMLX, use the following command:
pip install -U fastmlx\n
This is the recommended method to install FastMLX, as it will always install the most recent stable release.
If pip isn't installed, you can follow the Python installation guide to set it up.
"},{"location":"installation/#installation-from-sources","title":"Installation from Sources","text":"To install FastMLX directly from the source code, run this command in your terminal:
pip install git+https://github.com/Blaizzy/fastmlx\n
"},{"location":"installation/#running-the-server","title":"Running the Server","text":"There are two ways to start the FastMLX server:
Using the fastmlx
command:
fastmlx\n
or
Using uvicorn
directly:
uvicorn fastmlx:app --reload --workers 0\n
WARNING: The --reload
flag should not be used in production. It is only intended for development purposes.
"},{"location":"installation/#additional-notes","title":"Additional Notes","text":" - Dependencies: Ensure that you have the required dependencies installed. FastMLX relies on several libraries, which
pip
will handle automatically.
"},{"location":"models/","title":"Managing Models","text":""},{"location":"models/#listing-supported-models","title":"Listing Supported Models","text":"To see all vision and language models supported by MLX:
import requests\n\nurl = \"http://localhost:8000/v1/supported_models\"\nresponse = requests.get(url)\nprint(response.json())\n
"},{"location":"models/#listing-available-models","title":"Listing Available Models","text":"To see all available models:
import requests\n\nurl = \"http://localhost:8000/v1/models\"\nresponse = requests.get(url)\nprint(response.json())\n
"},{"location":"models/#deleting-models","title":"Deleting Models","text":"To remove any models loaded to memory:
import requests\n\nurl = \"http://localhost:8000/v1/models\"\nparams = {\n \"model_name\": \"hf-repo-or-path\",\n}\nresponse = requests.delete(url, params=params)\nprint(response)\n
"},{"location":"usage/","title":"Usage","text":"This guide covers the server setup, and usage of FastMLX, including making API calls and managing models.
"},{"location":"usage/#1-installation","title":"1. Installation","text":"Follow the installation guide to install FastMLX.
"},{"location":"usage/#2-running-the-server","title":"2. Running the server","text":"Start the FastMLX server with the following command:
fastmlx\n
or
Using uvicorn
directly:
uvicorn fastmlx:app --reload --workers 0\n
[!WARNING] The --reload
flag should not be used in production. It is only intended for development purposes.
"},{"location":"usage/#running-with-multiple-workers-parallel-processing","title":"Running with Multiple Workers (Parallel Processing)","text":"For improved performance and parallel processing capabilities, you can specify either the absolute number of worker processes or the fraction of CPU cores to use.
You can set the number of workers in three ways (listed in order of precedence):
-
Command-line argument:
fastmlx --workers 4\n
or uvicorn fastmlx:app --workers 4\n
-
Environment variable:
export FASTMLX_NUM_WORKERS=4\nfastmlx\n
-
Default value (2 workers)
To use all available CPU cores, set the value to 1.0:
fastmlx --workers 1.0\n
[!NOTE] - The --reload
flag is not compatible with multiple workers. - The number of workers should typically not exceed the number of CPU cores available on your machine for optimal performance.
"},{"location":"usage/#considerations-for-multi-worker-setup","title":"Considerations for Multi-Worker Setup","text":" - Stateless Application: Ensure your FastMLX application is stateless, as each worker process operates independently.
- Database Connections: If your app uses a database, make sure your connection pooling is configured to handle multiple workers.
- Resource Usage: Monitor your system's resource usage to find the optimal number of workers for your specific hardware and application needs.
- Load Balancing: When running with multiple workers, incoming requests are automatically load-balanced across the worker processes.
"},{"location":"usage/#3-making-api-calls","title":"3. Making API Calls","text":"Use the API similar to OpenAI's chat completions:
"},{"location":"usage/#vision-language-model","title":"Vision Language Model","text":""},{"location":"usage/#without-streaming","title":"Without Streaming","text":"Here's an example of how to use a Vision Language Model:
import requests\nimport json\n\nurl = \"http://localhost:8000/v1/chat/completions\"\nheaders = {\"Content-Type\": \"application/json\"}\ndata = {\n \"model\": \"mlx-community/nanoLLaVA-1.5-4bit\",\n \"image\": \"http://images.cocodataset.org/val2017/000000039769.jpg\",\n \"messages\": [{\"role\": \"user\", \"content\": \"What are these\"}],\n \"max_tokens\": 100\n}\n\nresponse = requests.post(url, headers=headers, data=json.dumps(data))\nprint(response.json())\n
"},{"location":"usage/#without-streaming_1","title":"Without Streaming","text":"import requests\nimport json\n\ndef process_sse_stream(url, headers, data):\n response = requests.post(url, headers=headers, json=data, stream=True)\n\n if response.status_code != 200:\n print(f\"Error: Received status code {response.status_code}\")\n print(response.text)\n return\n\n full_content = \"\"\n\n try:\n for line in response.iter_lines():\n if line:\n line = line.decode('utf-8')\n if line.startswith('data: '):\n event_data = line[6:] # Remove 'data: ' prefix\n if event_data == '[DONE]':\n print(\"\\nStream finished. \u2705\")\n break\n try:\n chunk_data = json.loads(event_data)\n content = chunk_data['choices'][0]['delta']['content']\n full_content += content\n print(content, end='', flush=True)\n except json.JSONDecodeError:\n print(f\"\\nFailed to decode JSON: {event_data}\")\n except KeyError:\n print(f\"\\nUnexpected data structure: {chunk_data}\")\n\n except KeyboardInterrupt:\n print(\"\\nStream interrupted by user.\")\n except requests.exceptions.RequestException as e:\n print(f\"\\nAn error occurred: {e}\")\n\nif __name__ == \"__main__\":\n url = \"http://localhost:8000/v1/chat/completions\"\n headers = {\"Content-Type\": \"application/json\"}\n data = {\n \"model\": \"mlx-community/nanoLLaVA-1.5-4bit\",\n \"image\": \"http://images.cocodataset.org/val2017/000000039769.jpg\",\n \"messages\": [{\"role\": \"user\", \"content\": \"What are these?\"}],\n \"max_tokens\": 500,\n \"stream\": True\n }\n process_sse_stream(url, headers, data)\n
"},{"location":"usage/#language-model","title":"Language Model","text":""},{"location":"usage/#without-streaming_2","title":"Without Streaming","text":"Here's an example of how to use a Language Model:
import requests\nimport json\n\nurl = \"http://localhost:8000/v1/chat/completions\"\nheaders = {\"Content-Type\": \"application/json\"}\ndata = {\n \"model\": \"mlx-community/gemma-2-9b-it-4bit\",\n \"messages\": [{\"role\": \"user\", \"content\": \"What is the capital of France?\"}],\n \"max_tokens\": 100\n}\n\nresponse = requests.post(url, headers=headers, data=json.dumps(data))\nprint(response.json())\n
"},{"location":"usage/#with-streaming","title":"With Streaming","text":"import requests\nimport json\n\ndef process_sse_stream(url, headers, data):\n response = requests.post(url, headers=headers, json=data, stream=True)\n\n if response.status_code != 200:\n print(f\"Error: Received status code {response.status_code}\")\n print(response.text)\n return\n\n full_content = \"\"\n\n try:\n for line in response.iter_lines():\n if line:\n line = line.decode('utf-8')\n if line.startswith('data: '):\n event_data = line[6:] # Remove 'data: ' prefix\n if event_data == '[DONE]':\n print(\"\\nStream finished. \u2705\")\n break\n try:\n chunk_data = json.loads(event_data)\n content = chunk_data['choices'][0]['delta']['content']\n full_content += content\n print(content, end='', flush=True)\n except json.JSONDecodeError:\n print(f\"\\nFailed to decode JSON: {event_data}\")\n except KeyError:\n print(f\"\\nUnexpected data structure: {chunk_data}\")\n\n except KeyboardInterrupt:\n print(\"\\nStream interrupted by user.\")\n except requests.exceptions.RequestException as e:\n print(f\"\\nAn error occurred: {e}\")\n\nif __name__ == \"__main__\":\n url = \"http://localhost:8000/v1/chat/completions\"\n headers = {\"Content-Type\": \"application/json\"}\n data = {\n \"model\": \"mlx-community/gemma-2-9b-it-4bit\",\n \"messages\": [{\"role\": \"user\", \"content\": \"Hi, how are you?\"}],\n \"max_tokens\": 500,\n \"stream\": True\n }\n process_sse_stream(url, headers, data)\n
For more detailed API documentation, please refer to the API Reference section.
"},{"location":"examples/chatbot/","title":"Multi-Modal Chatbot","text":"This example demonstrates how to create a chatbot application using FastMLX with a Gradio interface.
import argparse\nimport gradio as gr\nimport requests\nimport json\n\nimport asyncio\n\nasync def process_sse_stream(url, headers, data):\n response = requests.post(url, headers=headers, json=data, stream=True)\n if response.status_code != 200:\n raise gr.Error(f\"Error: Received status code {response.status_code}\")\n full_content = \"\"\n for line in response.iter_lines():\n if line:\n line = line.decode('utf-8')\n if line.startswith('data: '):\n event_data = line[6:] # Remove 'data: ' prefix\n if event_data == '[DONE]':\n break\n try:\n chunk_data = json.loads(event_data)\n content = chunk_data['choices'][0]['delta']['content']\n yield str(content)\n except (json.JSONDecodeError, KeyError):\n continue\n\nasync def chat(message, history, temperature, max_tokens):\n\n url = \"http://localhost:8000/v1/chat/completions\"\n headers = {\"Content-Type\": \"application/json\"}\n data = {\n \"model\": \"mlx-community/Qwen2.5-1.5B-Instruct-4bit\",\n \"messages\": [{\"role\": \"user\", \"content\": message['text']}],\n \"max_tokens\": max_tokens,\n \"temperature\": temperature,\n \"stream\": True\n }\n\n if len(message['files']) > 0:\n data[\"model\"] = \"mlx-community/nanoLLaVA-1.5-8bit\"\n data[\"image\"] = message['files'][-1][\"path\"]\n\n response = requests.post(url, headers=headers, json=data, stream=True)\n if response.status_code != 200:\n raise gr.Error(f\"Error: Received status code {response.status_code}\")\n\n full_content = \"\"\n for line in response.iter_lines():\n if line:\n line = line.decode('utf-8')\n if line.startswith('data: '):\n event_data = line[6:] # Remove 'data: ' prefix\n if event_data == '[DONE]':\n break\n try:\n chunk_data = json.loads(event_data)\n content = chunk_data['choices'][0]['delta']['content']\n full_content += content\n yield full_content\n except (json.JSONDecodeError, KeyError):\n continue\n\ndemo = gr.ChatInterface(\n fn=chat,\n title=\"FastMLX Chat UI\",\n additional_inputs_accordion=gr.Accordion(\n label=\"\u2699\ufe0f Parameters\", open=False, render=False\n ),\n additional_inputs=[\n gr.Slider(\n minimum=0, maximum=1, step=0.1, value=0.1, label=\"Temperature\", render=False\n ),\n gr.Slider(\n minimum=128,\n maximum=4096,\n step=1,\n value=200,\n label=\"Max new tokens\",\n render=False\n ),\n ],\n multimodal=True,\n)\n\ndemo.launch(inbrowser=True)\n
"},{"location":"examples/function_calling/","title":"Function Calling","text":""},{"location":"examples/function_calling/#function-calling","title":"Function Calling","text":"FastMLX now supports tool calling in accordance with the OpenAI API specification. This feature is available for the following models:
- Llama 3.1
- Arcee Agent
- C4ai-Command-R-Plus
- Firefunction
- xLAM
Supported modes:
- Without Streaming
- Parallel Tool Calling
Note: Tool choice and OpenAI-compliant streaming for function calling are currently under development.
This example demonstrates how to use the get_current_weather
tool with the Llama 3.1
model. The API will process the user's question and use the provided tool to fetch the required information.
import requests\nimport json\n\nurl = \"http://localhost:8000/v1/chat/completions\"\nheaders = {\"Content-Type\": \"application/json\"}\ndata = {\n \"model\": \"mlx-community/Meta-Llama-3.1-8B-Instruct-8bit\",\n \"messages\": [\n {\n \"role\": \"user\",\n \"content\": \"What's the weather like in San Francisco and Washington?\"\n }\n ],\n \"tools\": [\n {\n \"name\": \"get_current_weather\",\n \"description\": \"Get the current weather\",\n \"parameters\": {\n \"type\": \"object\",\n \"properties\": {\n \"location\": {\n \"type\": \"string\",\n \"description\": \"The city and state, e.g. San Francisco, CA\"\n },\n \"format\": {\n \"type\": \"string\",\n \"enum\": [\"celsius\", \"fahrenheit\"],\n \"description\": \"The temperature unit to use. Infer this from the user's location.\"\n }\n },\n \"required\": [\"location\", \"format\"]\n }\n }\n ],\n \"max_tokens\": 150,\n \"temperature\": 0.7,\n \"stream\": False,\n}\n\nresponse = requests.post(url, headers=headers, data=json.dumps(data))\nprint(response.json())\n
Note: Streaming is available for regular text generation, but the streaming implementation for function calling is still in development and does not yet fully comply with the OpenAI specification.
"}]}
\ No newline at end of file
+{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"FastMLX","text":"FastMLX is a high-performance, production-ready API for hosting MLX models, including Vision Language Models (VLMs) and Language Models (LMs). It provides an easy-to-use interface for integrating powerful machine learning capabilities into your applications.
"},{"location":"#key-features","title":"Key Features","text":" - OpenAI-compatible API: Easily integrate with existing applications that use OpenAI's API.
- Dynamic Model Loading: Load MLX models on-the-fly or use pre-loaded models for better performance.
- Support for Multiple Model Types: Compatible with various MLX model architectures.
- Image Processing Capabilities: Handle both text and image inputs for versatile model interactions.
- Efficient Resource Management: Optimized for high-performance and scalability.
- Error Handling: Robust error management for production environments.
- Customizable: Easily extendable to accommodate specific use cases and model types.
"},{"location":"#quick-start","title":"Quick Start","text":"Get started with FastMLX: Learn how to install and set up FastMLX in your environment.
Explore Examples: Hands-on guides, such as:
- Chatbot application
- Function calling
"},{"location":"#installation","title":"Installation","text":"Install FastMLX on your system by running the following command:
pip install -U fastmlx\n
"},{"location":"#running-the-server","title":"Running the Server","text":"Start the FastMLX server using the following command:
fastmlx\n
or with multiple workers for improved performance:
fastmlx --workers 4\n
"},{"location":"#making-api-calls","title":"Making API Calls","text":"Once the server is running, you can interact with the API. Here's an example using a Vision Language Model:
import requests\nimport json\n\nurl = \"http://localhost:8000/v1/chat/completions\"\nheaders = {\"Content-Type\": \"application/json\"}\ndata = {\n \"model\": \"mlx-community/nanoLLaVA-1.5-4bit\",\n \"image\": \"http://images.cocodataset.org/val2017/000000039769.jpg\",\n \"messages\": [{\"role\": \"user\", \"content\": \"What are these\"}],\n \"max_tokens\": 100\n}\n\nresponse = requests.post(url, headers=headers, data=json.dumps(data))\nprint(response.json())\n
"},{"location":"#whats-next","title":"What's Next?","text":" - Check out the Installation guide for detailed setup instructions.
- Learn more about the API usage in the Usage section.
- Explore advanced features and configurations in the API Reference.
- If you're interested in contributing, see our Contributing guidelines.
"},{"location":"#license","title":"License","text":"FastMLX is free software, licensed under the Apache Software License 2.0.
For more detailed information and advanced usage, please explore the rest of our documentation. If you encounter any issues or have questions, don't hesitate to report an issue on our GitHub repository.
Happy coding with FastMLX!
"},{"location":"changelog/","title":"Changelog","text":""},{"location":"changelog/#v010-11-july-2024","title":"[v0.1.0] - 11 July 2024","text":"What's Changed
- Add support for token streaming and custom CORS by @Blaizzy
- Add support for Parallel calls by @Blaizzy
- Add Parallel calls usage by @Blaizzy
Fixes:
- Cross origin Support #2
- Max tokens not overriding #5
"},{"location":"changelog/#v001-09-july-2024","title":"[v0.0.1] - 09 July 2024","text":"What's Changed
- Setup FastMLX by @Blaizzy
- Add support for VLMs by @Blaizzy
- Add support for LMs by @Blaizzy
New Contributors
- @Blaizzy made their first contribution in https://github.com/Blaizzy/fastmlx/pull/1
"},{"location":"cli_reference/","title":"CLI Reference","text":"The FastMLX API server can be configured using various command-line arguments. Here is a detailed reference for each available option.
"},{"location":"cli_reference/#usage","title":"Usage","text":"fastmlx [OPTIONS]\n
"},{"location":"cli_reference/#options","title":"Options","text":""},{"location":"cli_reference/#-allowed-origins","title":"--allowed-origins
","text":" - Type: List of strings
- Default:
[\"*\"]
- Description: List of allowed origins for CORS (Cross-Origin Resource Sharing).
"},{"location":"cli_reference/#-host","title":"--host
","text":" - Type: String
- Default:
\"0.0.0.0\"
- Description: Host to run the server on.
"},{"location":"cli_reference/#-port","title":"--port
","text":" - Type: Integer
- Default:
8000
- Description: Port to run the server on.
"},{"location":"cli_reference/#-reload","title":"--reload
","text":" - Type: Boolean
- Default:
False
- Description: Enable auto-reload of the server. Only works when 'workers' is set to None.
"},{"location":"cli_reference/#-workers","title":"--workers
","text":" - Type: Integer or Float
- Default: Calculated based on
FASTMLX_NUM_WORKERS
environment variable or 2 if not set. -
Description: Number of workers. This option overrides the FASTMLX_NUM_WORKERS
environment variable.
-
If an integer, it specifies the exact number of workers to use.
- If a float, it represents the fraction of available CPU cores to use (minimum 1 worker).
- To use all available CPU cores, set it to 1.0.
Examples: - --workers 1
: Use 1 worker - --workers 1.0
: Use all available CPU cores - --workers 0.5
: Use half of the available CPU cores - --workers 0.0
: Use 1 worker
"},{"location":"cli_reference/#environment-variables","title":"Environment Variables","text":" FASTMLX_NUM_WORKERS
: Sets the default number of workers if not specified via the --workers
argument.
"},{"location":"cli_reference/#examples","title":"Examples","text":" -
Run the server on localhost with default settings:
fastmlx\n
-
Run the server on a specific host and port:
fastmlx --host 127.0.0.1 --port 5000\n
-
Run the server with 4 workers:
fastmlx --workers 4\n
-
Run the server using half of the available CPU cores:
fastmlx --workers 0.5\n
-
Enable auto-reload (for development):
fastmlx --reload\n
Remember that the --reload
option is intended for development purposes and should not be used in production environments.
"},{"location":"community_projects/","title":"Community Projects","text":"Here are some projects built by the community that use FastMLX:
- FastMLX-MineCraft by Mathieu
- MLX Chat by Nils Durner
- AI Home Hub by Prince Canuma
"},{"location":"community_projects/#projects-in-detail","title":"PROJECTS IN DETAIL","text":""},{"location":"community_projects/#fastmlx-minecraft-by-mathieu","title":"FastMLX-MineCraft by Mathieu","text":""},{"location":"community_projects/#mlx-chat-by-nils-durner","title":"MLX Chat by Nils Durner","text":"Chat interface for MLX for on-device Language Model use on Apple Silicon. Built on FastMLX.
"},{"location":"community_projects/#home-hub-by-prince-canuma","title":"Home Hub by Prince Canuma","text":"Turning your Mac into an AI home server.
"},{"location":"contributing/","title":"Join us in making a difference!","text":"Your contributions are always welcome and we would love to see how you can make our project even better. Your input is invaluable to us, and we ensure that all contributors receive recognition for their efforts.
"},{"location":"contributing/#ways-to-contribute","title":"Ways to contribute","text":"Here\u2019s how you can get involved:
"},{"location":"contributing/#report-bugs","title":"Report Bugs","text":"Report bugs at https://github.com/Blaizzy/fastmlx/issues.
If you are reporting a bug, please include:
- Your operating system name and version.
- Any details about your local setup that might be helpful in troubleshooting.
- Detailed steps to reproduce the bug.
"},{"location":"contributing/#fix-bugs","title":"Fix Bugs","text":"Look through the GitHub issues for bugs. Anything tagged with bug
and help wanted
is open to whoever wants to implement it.
"},{"location":"contributing/#implement-features","title":"Implement Features","text":"Look through the GitHub issues for features. If anything tagged enhancement
and help wanted
catches your eye, dive in and start coding. Your ideas can become a reality in FastMLX!
"},{"location":"contributing/#write-documentation","title":"Write Documentation","text":"We\u2019re always in need of more documentation, whether it\u2019s for our official docs, adding helpful comments in the code, or writing blog posts and articles. Clear and comprehensive documentation empowers the community, and your contributions are crucial!
"},{"location":"contributing/#submit-feedback","title":"Submit Feedback","text":"The best way to share your thoughts is by filing an issue on our GitHub page: https://github.com/Blaizzy/fastmlx/issues. Whether you\u2019re suggesting a new feature or sharing your experience, we want to hear from you!
Proposing a feature?
- Describe in detail how it should work.
- Keep it focused and manageable to make implementation smoother.
- Remember, this is a volunteer-driven project, and your contributions are always appreciated!
"},{"location":"contributing/#how-to-get-started","title":"How to get Started!","text":"Ready to contribute? Follow these simple steps to set up FastMLX for local development and start making a difference.
-
Fork the repository.
- Head over to the fastmlx GitHub repo and click the Fork button to create your copy of the repository.
-
Clone your fork locally
- Open your terminal and run the following command to clone your forked repository:
$ git clone git@github.com:your_name_here/fastmlx.git\n
-
Set Up Your Development Environment
- Install your local copy of FastMLX into a virtual environment. If you\u2019re using
virtualenvwrapper
, follow these steps:
$ mkvirtualenv fastmlx\n$ cd fastmlx/\n$ python setup.py develop\n
Tip: If you don\u2019t have virtualenvwrapper
installed, you can install it with pip install virtualenvwrapper
.
-
Create a Development Branch
- Create a new branch to work on your bugfix or feature:
$ git checkout -b name-of-your-bugfix-or-feature\n
Now you\u2019re ready to make changes!
-
Run Tests and Code Checks
- When you're done making changes, check that your changes pass flake8 and the tests, including testing other Python versions with tox:
$ flake8 fastmlx tests\n$ pytest .\n
- To install flake8 and tox, simply run:
pip install flake8 tox\n
-
Commit and Push Your Changes
- Once everything looks good, commit your changes with a descriptive message:
$ git add .\n$ git commit -m \"Your detailed description of your changes.\"\n$ git push origin name-of-your-bugfix-or-feature\n
-
Submit a Pull Request
- Head back to the FastMLX GitHub repo and open a pull request. We\u2019ll review your changes, provide feedback, and merge them once everything is ready.
"},{"location":"contributing/#pull-request-guidelines","title":"Pull Request Guidelines","text":"Before you submit a pull request, check that it meets these guidelines:
- The pull request should include tests.
- If the pull request adds functionality, the docs should be updated. Put your new functionality into a function with a docstring, and add the feature to the list in README.rst.
- The pull request should work for Python 3.8 and later, and for PyPy. Check https://github.com/Blaizzy/fastmlx/pull_requests and make sure that the tests pass for all supported Python versions.
"},{"location":"endpoints/","title":"Endpoints","text":"Top-level package for fastmlx.
"},{"location":"endpoints/#fastmlx.add_model","title":"add_model(model_name)
async
","text":"Add a new model to the API.
Parameters:
Name Type Description Default model_name
str
The name of the model to add.
required Returns:
Name Type Description dict
dict
A dictionary containing the status of the operation.
Source code in fastmlx/fastmlx.py
@app.post(\"/v1/models\")\nasync def add_model(model_name: str):\n \"\"\"\n Add a new model to the API.\n\n Args:\n model_name (str): The name of the model to add.\n\n Returns:\n dict (dict): A dictionary containing the status of the operation.\n \"\"\"\n model_provider.load_model(model_name)\n return {\"status\": \"success\", \"message\": f\"Model {model_name} added successfully\"}\n
"},{"location":"endpoints/#fastmlx.chat_completion","title":"chat_completion(request)
async
","text":"Handle chat completion requests for both VLM and LM models.
Parameters:
Name Type Description Default request
ChatCompletionRequest
The chat completion request.
required Returns:
Name Type Description ChatCompletionResponse
ChatCompletionResponse
The generated chat completion response.
Raises:
Type Description HTTPException(str)
If MLX library is not available.
Source code in fastmlx/fastmlx.py
@app.post(\"/v1/chat/completions\", response_model=ChatCompletionResponse)\nasync def chat_completion(request: ChatCompletionRequest):\n \"\"\"\n Handle chat completion requests for both VLM and LM models.\n\n Args:\n request (ChatCompletionRequest): The chat completion request.\n\n Returns:\n ChatCompletionResponse (ChatCompletionResponse): The generated chat completion response.\n\n Raises:\n HTTPException (str): If MLX library is not available.\n \"\"\"\n if not MLX_AVAILABLE:\n raise HTTPException(status_code=500, detail=\"MLX library not available\")\n\n stream = request.stream\n model_data = model_provider.load_model(request.model)\n model = model_data[\"model\"]\n config = model_data[\"config\"]\n model_type = MODEL_REMAPPING.get(config[\"model_type\"], config[\"model_type\"])\n stop_words = get_eom_token(request.model)\n\n if model_type in MODELS[\"vlm\"]:\n processor = model_data[\"processor\"]\n image_processor = model_data[\"image_processor\"]\n\n image_url = None\n chat_messages = []\n\n for msg in request.messages:\n if isinstance(msg.content, str):\n chat_messages.append({\"role\": msg.role, \"content\": msg.content})\n elif isinstance(msg.content, list):\n text_content = \"\"\n for content_part in msg.content:\n if content_part.type == \"text\":\n text_content += content_part.text + \" \"\n elif content_part.type == \"image_url\":\n image_url = content_part.image_url[\"url\"]\n chat_messages.append(\n {\"role\": msg.role, \"content\": text_content.strip()}\n )\n\n if not image_url and model_type in MODELS[\"vlm\"]:\n raise HTTPException(\n status_code=400, detail=\"Image URL not provided for VLM model\"\n )\n\n prompt = \"\"\n if model.config.model_type != \"paligemma\":\n prompt = apply_vlm_chat_template(processor, config, chat_messages)\n else:\n prompt = chat_messages[-1][\"content\"]\n\n if stream:\n return StreamingResponse(\n vlm_stream_generator(\n model,\n request.model,\n processor,\n image_url,\n prompt,\n image_processor,\n request.max_tokens,\n request.temperature,\n stream_options=request.stream_options,\n ),\n media_type=\"text/event-stream\",\n )\n else:\n # Generate the response\n output = vlm_generate(\n model,\n processor,\n image_url,\n prompt,\n image_processor,\n max_tokens=request.max_tokens,\n temp=request.temperature,\n verbose=False,\n )\n\n else:\n # Add function calling information to the prompt\n if request.tools and \"firefunction-v2\" not in request.model:\n # Handle system prompt\n if request.messages and request.messages[0].role == \"system\":\n pass\n else:\n # Generate system prompt based on model and tools\n prompt, user_role = get_tool_prompt(\n request.model,\n [tool.model_dump() for tool in request.tools],\n request.messages[-1].content,\n )\n\n if user_role:\n request.messages[-1].content = prompt\n else:\n # Insert the system prompt at the beginning of the messages\n request.messages.insert(\n 0, ChatMessage(role=\"system\", content=prompt)\n )\n\n tokenizer = model_data[\"tokenizer\"]\n\n chat_messages = [\n {\"role\": msg.role, \"content\": msg.content} for msg in request.messages\n ]\n prompt = apply_lm_chat_template(tokenizer, chat_messages, request)\n\n if stream:\n return StreamingResponse(\n lm_stream_generator(\n model,\n request.model,\n tokenizer,\n prompt,\n request.max_tokens,\n request.temperature,\n stop_words=stop_words,\n stream_options=request.stream_options,\n ),\n media_type=\"text/event-stream\",\n )\n else:\n output, token_length_info = lm_generate(\n model,\n tokenizer,\n prompt,\n request.max_tokens,\n 
temp=request.temperature,\n stop_words=stop_words,\n )\n\n # Parse the output to check for function calls\n return handle_function_calls(output, request, token_length_info)\n
"},{"location":"endpoints/#fastmlx.get_supported_models","title":"get_supported_models()
async
","text":"Get a list of supported model types for VLM and LM.
Returns:
Name Type Description JSONResponse
json
A JSON response containing the supported models.
Source code in fastmlx/fastmlx.py
@app.get(\"/v1/supported_models\", response_model=SupportedModels)\nasync def get_supported_models():\n \"\"\"\n Get a list of supported model types for VLM and LM.\n\n Returns:\n JSONResponse (json): A JSON response containing the supported models.\n \"\"\"\n return JSONResponse(content=MODELS)\n
"},{"location":"endpoints/#fastmlx.list_models","title":"list_models()
async
","text":"Get list of models - provided in OpenAI API compliant format.
Source code in fastmlx/fastmlx.py
@app.get(\"/v1/models\")\nasync def list_models():\n \"\"\"\n Get list of models - provided in OpenAI API compliant format.\n \"\"\"\n models = await model_provider.get_available_models()\n models_data = []\n for model in models:\n models_data.append(\n {\n \"id\": model,\n \"object\": \"model\",\n \"created\": int(time.time()),\n \"owned_by\": \"system\",\n }\n )\n return {\"object\": \"list\", \"data\": models_data}\n
"},{"location":"endpoints/#fastmlx.lm_generate","title":"lm_generate(model, tokenizer, prompt, max_tokens=100, **kwargs)
","text":"Generate a complete response from the model.
Parameters:
Name Type Description Default model
Module
The language model.
required tokenizer
PreTrainedTokenizer
The tokenizer.
required prompt
str
The string prompt.
required max_tokens
int
The maximum number of tokens. Default: 100
.
100
verbose
bool
If True
, print tokens and timing information. Default: False
.
required formatter
Optional[Callable]
A function which takes a token and a probability and displays it.
required kwargs
The remaining options get passed to :func:generate_step
. See :func:generate_step
for more details.
{}
Source code in fastmlx/utils.py
def lm_generate(\n model,\n tokenizer,\n prompt: str,\n max_tokens: int = 100,\n **kwargs,\n) -> Union[str, Generator[str, None, None]]:\n \"\"\"\n Generate a complete response from the model.\n\n Args:\n model (nn.Module): The language model.\n tokenizer (PreTrainedTokenizer): The tokenizer.\n prompt (str): The string prompt.\n max_tokens (int): The maximum number of tokens. Default: ``100``.\n verbose (bool): If ``True``, print tokens and timing information.\n Default: ``False``.\n formatter (Optional[Callable]): A function which takes a token and a\n probability and displays it.\n kwargs: The remaining options get passed to :func:`generate_step`.\n See :func:`generate_step` for more details.\n \"\"\"\n if not isinstance(tokenizer, TokenizerWrapper):\n tokenizer = TokenizerWrapper(tokenizer)\n\n stop_words = kwargs.pop(\"stop_words\", [])\n\n stop_words_id = (\n tokenizer._tokenizer(stop_words)[\"input_ids\"][0] if stop_words else None\n )\n\n prompt_tokens = mx.array(tokenizer.encode(prompt))\n prompt_token_len = len(prompt_tokens)\n detokenizer = tokenizer.detokenizer\n\n detokenizer.reset()\n\n for (token, logprobs), n in zip(\n generate_step(prompt_tokens, model, **kwargs),\n range(max_tokens),\n ):\n if token == tokenizer.eos_token_id or (\n stop_words_id and token in stop_words_id\n ):\n break\n\n detokenizer.add_token(token)\n\n detokenizer.finalize()\n\n _completion_tokens = len(detokenizer.tokens)\n token_length_info: Usage = Usage(\n prompt_tokens=prompt_token_len,\n completion_tokens=_completion_tokens,\n total_tokens=prompt_token_len + _completion_tokens,\n )\n return detokenizer.text, token_length_info\n
"},{"location":"endpoints/#fastmlx.remove_model","title":"remove_model(model_name)
async
","text":"Remove a model from the API.
Parameters:
Name Type Description Default model_name
str
The name of the model to remove.
required Returns:
Name Type Description Response
str
A 204 No Content response if successful.
Raises:
Type Description HTTPException(str)
If the model is not found.
Source code in fastmlx/fastmlx.py
@app.delete(\"/v1/models\")\nasync def remove_model(model_name: str):\n \"\"\"\n Remove a model from the API.\n\n Args:\n model_name (str): The name of the model to remove.\n\n Returns:\n Response (str): A 204 No Content response if successful.\n\n Raises:\n HTTPException (str): If the model is not found.\n \"\"\"\n model_name = unquote(model_name).strip('\"')\n removed = await model_provider.remove_model(model_name)\n if removed:\n return Response(status_code=204) # 204 No Content - successful deletion\n else:\n raise HTTPException(status_code=404, detail=f\"Model '{model_name}' not found\")\n
"},{"location":"installation/","title":"Installation","text":""},{"location":"installation/#stable-release","title":"Stable release","text":"To install the latest stable release of FastMLX, use the following command:
pip install -U fastmlx\n
This is the recommended method to install FastMLX, as it will always install the most recent stable release.
If pip isn't installed, you can follow the Python installation guide to set it up.
"},{"location":"installation/#installation-from-sources","title":"Installation from Sources","text":"To install FastMLX directly from the source code, run this command in your terminal:
pip install git+https://github.com/Blaizzy/fastmlx\n
"},{"location":"installation/#running-the-server","title":"Running the Server","text":"There are two ways to start the FastMLX server:
Using the fastmlx
command:
fastmlx\n
or
Using uvicorn
directly:
uvicorn fastmlx:app --reload --workers 0\n
WARNING: The --reload
flag should not be used in production. It is only intended for development purposes.
"},{"location":"installation/#additional-notes","title":"Additional Notes","text":" - Dependencies: Ensure that you have the required dependencies installed. FastMLX relies on several libraries, which
pip
will handle automatically.
"},{"location":"models/","title":"Managing Models","text":""},{"location":"models/#listing-supported-models","title":"Listing Supported Models","text":"To see all vision and language models supported by MLX:
import requests\n\nurl = \"http://localhost:8000/v1/supported_models\"\nresponse = requests.get(url)\nprint(response.json())\n
"},{"location":"models/#listing-available-models","title":"Listing Available Models","text":"To see all available models:
import requests\n\nurl = \"http://localhost:8000/v1/models\"\nresponse = requests.get(url)\nprint(response.json())\n
"},{"location":"models/#deleting-models","title":"Deleting Models","text":"To remove any models loaded to memory:
import requests\n\nurl = \"http://localhost:8000/v1/models\"\nparams = {\n \"model_name\": \"hf-repo-or-path\",\n}\nresponse = requests.delete(url, params=params)\nprint(response)\n
"},{"location":"usage/","title":"Usage","text":"This guide covers the server setup, and usage of FastMLX, including making API calls and managing models.
"},{"location":"usage/#1-installation","title":"1. Installation","text":"Follow the installation guide to install FastMLX.
"},{"location":"usage/#2-running-the-server","title":"2. Running the server","text":"Start the FastMLX server with the following command:
fastmlx\n
or
Using uvicorn
directly:
uvicorn fastmlx:app --reload --workers 0\n
WARNING: The --reload
flag should not be used in production. It is only intended for development purposes.
"},{"location":"usage/#running-with-multiple-workers-parallel-processing","title":"Running with Multiple Workers (Parallel Processing)","text":"For improved performance and parallel processing capabilities, you can specify either the absolute number of worker processes or the fraction of CPU cores to use.
You can set the number of workers in three ways (listed in order of precedence):
-
Command-line argument:
fastmlx --workers 4\n
or uvicorn fastmlx:app --workers 4\n
-
Environment variable:
export FASTMLX_NUM_WORKERS=4\nfastmlx\n
-
Default value (2 workers)
To use all available CPU cores, set the value to 1.0:
fastmlx --workers 1.0\n
NOTE: - The --reload
flag is not compatible with multiple workers. - The number of workers should typically not exceed the number of CPU cores available on your machine for optimal performance.
"},{"location":"usage/#considerations-for-multi-worker-setup","title":"Considerations for Multi-Worker Setup","text":" - Stateless Application: Ensure your FastMLX application is stateless, as each worker process operates independently.
- Database Connections: If your app uses a database, make sure your connection pooling is configured to handle multiple workers.
- Resource Usage: Monitor your system's resource usage to find the optimal number of workers for your specific hardware and application needs (a small sketch of one way to pick a worker count follows this list).
- Load Balancing: When running with multiple workers, incoming requests are automatically load-balanced across the worker processes.
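As a rough illustration of the resource-usage point above, here is a small sketch; the half-the-cores heuristic is only an assumption, and FASTMLX_NUM_WORKERS is the environment variable documented in the previous section.
import os\nimport subprocess\n\n# Assumed heuristic: use half the available cores, never fewer than one\n# and never more than the machine actually has (see the note above).\ncores = os.cpu_count() or 2\nworkers = max(1, cores // 2)\n\nenv = dict(os.environ, FASTMLX_NUM_WORKERS=str(workers))\nprint(f\"Starting FastMLX with {workers} workers on a {cores}-core machine\")\n\n# Launch the server with the chosen worker count.\nsubprocess.run([\"fastmlx\"], env=env)\n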
"},{"location":"usage/#3-making-api-calls","title":"3. Making API Calls","text":"Use the API similar to OpenAI's chat completions:
"},{"location":"usage/#vision-language-model","title":"Vision Language Model","text":""},{"location":"usage/#without-streaming","title":"Without Streaming","text":"Here's an example of how to use a Vision Language Model:
import requests\nimport json\n\nurl = \"http://localhost:8000/v1/chat/completions\"\nheaders = {\"Content-Type\": \"application/json\"}\ndata = {\n \"model\": \"mlx-community/nanoLLaVA-1.5-4bit\",\n \"image\": \"http://images.cocodataset.org/val2017/000000039769.jpg\",\n \"messages\": [{\"role\": \"user\", \"content\": \"What are these\"}],\n \"max_tokens\": 100\n}\n\nresponse = requests.post(url, headers=headers, data=json.dumps(data))\nprint(response.json())\n
"},{"location":"usage/#without-streaming_1","title":"Without Streaming","text":"import requests\nimport json\n\ndef process_sse_stream(url, headers, data):\n response = requests.post(url, headers=headers, json=data, stream=True)\n\n if response.status_code != 200:\n print(f\"Error: Received status code {response.status_code}\")\n print(response.text)\n return\n\n full_content = \"\"\n\n try:\n for line in response.iter_lines():\n if line:\n line = line.decode('utf-8')\n if line.startswith('data: '):\n event_data = line[6:] # Remove 'data: ' prefix\n if event_data == '[DONE]':\n print(\"\\nStream finished. \u2705\")\n break\n try:\n chunk_data = json.loads(event_data)\n content = chunk_data['choices'][0]['delta']['content']\n full_content += content\n print(content, end='', flush=True)\n except json.JSONDecodeError:\n print(f\"\\nFailed to decode JSON: {event_data}\")\n except KeyError:\n print(f\"\\nUnexpected data structure: {chunk_data}\")\n\n except KeyboardInterrupt:\n print(\"\\nStream interrupted by user.\")\n except requests.exceptions.RequestException as e:\n print(f\"\\nAn error occurred: {e}\")\n\nif __name__ == \"__main__\":\n url = \"http://localhost:8000/v1/chat/completions\"\n headers = {\"Content-Type\": \"application/json\"}\n data = {\n \"model\": \"mlx-community/nanoLLaVA-1.5-4bit\",\n \"image\": \"http://images.cocodataset.org/val2017/000000039769.jpg\",\n \"messages\": [{\"role\": \"user\", \"content\": \"What are these?\"}],\n \"max_tokens\": 500,\n \"stream\": True\n }\n process_sse_stream(url, headers, data)\n
"},{"location":"usage/#language-model","title":"Language Model","text":""},{"location":"usage/#without-streaming_2","title":"Without Streaming","text":"Here's an example of how to use a Language Model:
import requests\nimport json\n\nurl = \"http://localhost:8000/v1/chat/completions\"\nheaders = {\"Content-Type\": \"application/json\"}\ndata = {\n \"model\": \"mlx-community/gemma-2-9b-it-4bit\",\n \"messages\": [{\"role\": \"user\", \"content\": \"What is the capital of France?\"}],\n \"max_tokens\": 100\n}\n\nresponse = requests.post(url, headers=headers, data=json.dumps(data))\nprint(response.json())\n
"},{"location":"usage/#with-streaming","title":"With Streaming","text":"import requests\nimport json\n\ndef process_sse_stream(url, headers, data):\n response = requests.post(url, headers=headers, json=data, stream=True)\n\n if response.status_code != 200:\n print(f\"Error: Received status code {response.status_code}\")\n print(response.text)\n return\n\n full_content = \"\"\n\n try:\n for line in response.iter_lines():\n if line:\n line = line.decode('utf-8')\n if line.startswith('data: '):\n event_data = line[6:] # Remove 'data: ' prefix\n if event_data == '[DONE]':\n print(\"\\nStream finished. \u2705\")\n break\n try:\n chunk_data = json.loads(event_data)\n content = chunk_data['choices'][0]['delta']['content']\n full_content += content\n print(content, end='', flush=True)\n except json.JSONDecodeError:\n print(f\"\\nFailed to decode JSON: {event_data}\")\n except KeyError:\n print(f\"\\nUnexpected data structure: {chunk_data}\")\n\n except KeyboardInterrupt:\n print(\"\\nStream interrupted by user.\")\n except requests.exceptions.RequestException as e:\n print(f\"\\nAn error occurred: {e}\")\n\nif __name__ == \"__main__\":\n url = \"http://localhost:8000/v1/chat/completions\"\n headers = {\"Content-Type\": \"application/json\"}\n data = {\n \"model\": \"mlx-community/gemma-2-9b-it-4bit\",\n \"messages\": [{\"role\": \"user\", \"content\": \"Hi, how are you?\"}],\n \"max_tokens\": 500,\n \"stream\": True\n }\n process_sse_stream(url, headers, data)\n
For more detailed API documentation, please refer to the API Reference section.
"},{"location":"examples/chatbot/","title":"Multi-Modal Chatbot","text":"This example demonstrates how to create a chatbot application using FastMLX with a Gradio interface.
import argparse\nimport gradio as gr\nimport requests\nimport json\n\nimport asyncio\n\nasync def process_sse_stream(url, headers, data):\n response = requests.post(url, headers=headers, json=data, stream=True)\n if response.status_code != 200:\n raise gr.Error(f\"Error: Received status code {response.status_code}\")\n full_content = \"\"\n for line in response.iter_lines():\n if line:\n line = line.decode('utf-8')\n if line.startswith('data: '):\n event_data = line[6:] # Remove 'data: ' prefix\n if event_data == '[DONE]':\n break\n try:\n chunk_data = json.loads(event_data)\n content = chunk_data['choices'][0]['delta']['content']\n yield str(content)\n except (json.JSONDecodeError, KeyError):\n continue\n\nasync def chat(message, history, temperature, max_tokens):\n\n url = \"http://localhost:8000/v1/chat/completions\"\n headers = {\"Content-Type\": \"application/json\"}\n data = {\n \"model\": \"mlx-community/Qwen2.5-1.5B-Instruct-4bit\",\n \"messages\": [{\"role\": \"user\", \"content\": message['text']}],\n \"max_tokens\": max_tokens,\n \"temperature\": temperature,\n \"stream\": True\n }\n\n if len(message['files']) > 0:\n data[\"model\"] = \"mlx-community/nanoLLaVA-1.5-8bit\"\n data[\"image\"] = message['files'][-1][\"path\"]\n\n response = requests.post(url, headers=headers, json=data, stream=True)\n if response.status_code != 200:\n raise gr.Error(f\"Error: Received status code {response.status_code}\")\n\n full_content = \"\"\n for line in response.iter_lines():\n if line:\n line = line.decode('utf-8')\n if line.startswith('data: '):\n event_data = line[6:] # Remove 'data: ' prefix\n if event_data == '[DONE]':\n break\n try:\n chunk_data = json.loads(event_data)\n content = chunk_data['choices'][0]['delta']['content']\n full_content += content\n yield full_content\n except (json.JSONDecodeError, KeyError):\n continue\n\ndemo = gr.ChatInterface(\n fn=chat,\n title=\"FastMLX Chat UI\",\n additional_inputs_accordion=gr.Accordion(\n label=\"\u2699\ufe0f Parameters\", open=False, render=False\n ),\n additional_inputs=[\n gr.Slider(\n minimum=0, maximum=1, step=0.1, value=0.1, label=\"Temperature\", render=False\n ),\n gr.Slider(\n minimum=128,\n maximum=4096,\n step=1,\n value=200,\n label=\"Max new tokens\",\n render=False\n ),\n ],\n multimodal=True,\n)\n\ndemo.launch(inbrowser=True)\n
"},{"location":"examples/function_calling/","title":"Function Calling","text":""},{"location":"examples/function_calling/#function-calling","title":"Function Calling","text":"FastMLX now supports tool calling in accordance with the OpenAI API specification. This feature is available for the following models:
- Llama 3.1
- Arcee Agent
- C4ai-Command-R-Plus
- Firefunction
- xLAM
Supported modes:
- Without Streaming
- Parallel Tool Calling
Note: Tool choice and OpenAI-compliant streaming for function calling are currently under development.
This example demonstrates how to use the get_current_weather
tool with the Llama 3.1
model. The API will process the user's question and use the provided tool to fetch the required information.
import requests\nimport json\n\nurl = \"http://localhost:8000/v1/chat/completions\"\nheaders = {\"Content-Type\": \"application/json\"}\ndata = {\n \"model\": \"mlx-community/Meta-Llama-3.1-8B-Instruct-8bit\",\n \"messages\": [\n {\n \"role\": \"user\",\n \"content\": \"What's the weather like in San Francisco and Washington?\"\n }\n ],\n \"tools\": [\n {\n \"name\": \"get_current_weather\",\n \"description\": \"Get the current weather\",\n \"parameters\": {\n \"type\": \"object\",\n \"properties\": {\n \"location\": {\n \"type\": \"string\",\n \"description\": \"The city and state, e.g. San Francisco, CA\"\n },\n \"format\": {\n \"type\": \"string\",\n \"enum\": [\"celsius\", \"fahrenheit\"],\n \"description\": \"The temperature unit to use. Infer this from the user's location.\"\n }\n },\n \"required\": [\"location\", \"format\"]\n }\n }\n ],\n \"max_tokens\": 150,\n \"temperature\": 0.7,\n \"stream\": False,\n}\n\nresponse = requests.post(url, headers=headers, data=json.dumps(data))\nprint(response.json())\n
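Because the endpoint follows the OpenAI specification, the tool calls should come back on the assistant message of the response above. The sketch below is only an illustration: the response layout is assumed to match OpenAI's tool_calls format, and get_current_weather is a stand-in stub rather than a real weather client.
import json\n\ndef get_current_weather(location: str, format: str) -> str:\n    # Stand-in stub: a real client would query a weather service here.\n    return f\"The weather in {location} is 22 degrees {format}.\"\n\n# Hypothetical dispatch table mapping tool names to local functions.\navailable_tools = {\"get_current_weather\": get_current_weather}\n\nresult = response.json()  # `response` comes from the request above\nmessage = result[\"choices\"][0][\"message\"]\n\n# Assumes an OpenAI-style tool_calls list on the assistant message.\nfor call in message.get(\"tool_calls\", []):\n    name = call[\"function\"][\"name\"]\n    args = json.loads(call[\"function\"][\"arguments\"])\n    if name in available_tools:\n        print(available_tools[name](**args))\n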
Note: Streaming is available for regular text generation, but the streaming implementation for function calling is still in development and does not yet fully comply with the OpenAI specification.
"}]}
\ No newline at end of file