fix: remove unnecessary backend support for CloudModel; doc: modify README.md

Signed-off-by: FuryMartin <[email protected]>
FuryMartin committed Feb 9, 2025
1 parent 9f3e847 commit 00bc01e
Showing 10 changed files with 61 additions and 227 deletions.
53 changes: 32 additions & 21 deletions examples/cloud-edge-collaborative-inference-for-llm/README.md
@@ -30,7 +30,7 @@ Additionally, Speculative Decoding $^{[3]}$ is another promising strategy to fur

The overall design is shown in the figure below.

![image-20240926143857223](./assets/image-20250115535482354.png)
![Architecture](./assets/Architecture.png)

When Ianvs starts the benchmarking job, the Test Env Manager first passes the data of the user-specified Dataset, one item at a time, to the Test Case Controller for Joint Inference.
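As a rough mental model of this flow (hedged pseudocode only, not Ianvs's actual control flow; all names below are illustrative stand-ins):

```python
# Hedged pseudocode of the benchmarking loop described above; the
# names are illustrative stand-ins, not Ianvs APIs.
def joint_inference(sample: dict) -> str:
    """Stand-in for the Test Case Controller's joint inference call."""
    return "answer"

dataset = [{"query": "q1"}, {"query": "q2"}]  # stand-in for the user dataset

results = []
for sample in dataset:                 # Test Env Manager feeds data one by one
    answer = joint_inference(sample)   # Test Case Controller runs Joint Inference
    results.append((sample, answer))   # later scored against the configured metrics
```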

@@ -144,16 +144,14 @@ Here is an example:
}
```



### Metric Configuration

*Note: If you just want to run this example quickly, you can skip this step.*

We have designed multiple metrics for edge-cloud collaborative inference, including:

| Metric | Description | Unit |
| :---------------------- | :------------------------------------------------------ | ------- |
| :---: | :---: | :---: |
| Accuracy | Accuracy on the test Dataset | - |
| Edge Ratio | Proportion of queries routed to the edge | - |
| Time to First Token | Time taken to generate the first token | s |
@@ -178,14 +176,12 @@ In the configuration file, there are two models available for configuration: `Ed

#### EdgeModel Configuration

The `EdgeModel` is designed to be deployed on your local machine, offering support for multiple serving backends including `huggingface`, `vllm`, `EAGLE`, and `LADE`. Additionally, it provides the flexibility to integrate with API-based model services.

The `CloudModel` represents the model on cloud. For extensibility, it supports both API-based models (which call LLM API via OpenAI API format) and local inference using backends like `huggingface`, `vllm`, `EAGLE`, and `LADE`. For API-based models, you need to set your `OPENAI_BASE_URL` and `OPENAI_API_KEY` in the environment variables yourself, for example:
The `EdgeModel` is designed to be deployed on your local machine, offering support for multiple serving backends including `huggingface`, `vllm`, `EagleSpecDec`, and `LADE`. Additionally, it provides the flexibility to integrate with API-based model services.

For both `EdgeModel` and `CloudModel`, the open parameters are:
For the `EdgeModel`, the arguments are:

| Parameter Name | Type | Description | Default |
| ---------------------- | ----- | ------------------------------------------------------------ | ------------------------ |
| :---: | :-----: | :---: | :---:|
| model | str | model name | Qwen/Qwen2-1.5B-Instruct |
| backend | str | model serving framework | huggingface |
| temperature | float | What sampling temperature to use, between 0 and 2 | 0.8 |
@@ -194,22 +190,37 @@ For both `EdgeModel` and `CloudModel`, the open parameters are:
| repetition_penalty | float | The parameter for repetition penalty | 1.05 |
| tensor_parallel_size | int | The size of tensor parallelism (Used for vLLM) | 1 |
| gpu_memory_utilization | float | The percentage of GPU memory utilization (Used for vLLM) | 0.9 |
| draft_model | str | The draft model used for Speculative Decoding | - |
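To make the table concrete, here is a minimal sketch of an `EdgeModel` parameter set (illustrative values only; the real configuration lives in the test YAML, and the `top_p`/`max_tokens` entries are assumed to match the `CloudModel` defaults below):

```python
# Hedged sketch: kwargs matching the EdgeModel parameter table above.
# Values are the documented defaults except `backend`, chosen here for
# illustration.
edge_model_kwargs = {
    "model": "Qwen/Qwen2-1.5B-Instruct",
    "backend": "vllm",               # one of: huggingface, vllm, EagleSpecDec, LADE
    "temperature": 0.8,
    "top_p": 0.8,
    "max_tokens": 512,
    "repetition_penalty": 1.05,
    "tensor_parallel_size": 1,       # used by vLLM
    "gpu_memory_utilization": 0.9,   # used by vLLM
    # "draft_model": "...",          # only for speculative-decoding backends
}
```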

#### CloudModel Configuration


If you want to call API-based models, you need to set your `OPENAI_BASE_URL` and `OPENAI_API_KEY` in the environment variables yourself, for example:
The `CloudModel` represents the model on the cloud; it calls the LLM API using the OpenAI API format. You need to set `OPENAI_BASE_URL` and `OPENAI_API_KEY` in your environment variables yourself, for example:

```bash
export OPENAI_BASE_URL="https://api.openai.com/v1"
export OPENAI_API_KEY=sk_xxxxxxxx
```
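Under the hood, an API-based client typically reads these variables at start-up. A hedged sketch of what that looks like (illustrative, not `APIBasedLLM`'s exact code; assumes the `openai` Python SDK is installed):

```python
import os

from openai import OpenAI  # assumption: the openai Python SDK is available

# Picks up the variables exported above; the SDK also reads
# OPENAI_API_KEY / OPENAI_BASE_URL from the environment by default.
client = OpenAI(
    base_url=os.environ["OPENAI_BASE_URL"],
    api_key=os.environ["OPENAI_API_KEY"],
)
```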

For `CloudModel`, the open parameters are:

| Parameter Name | Type | Description | Default |
| :---: | :---: | :---: | :---: |
| model | str | model name | gpt-4o-mini |
| temperature | float | What sampling temperature to use, between 0 and 2 | 0.8 |
| top_p | float | Nucleus sampling parameter | 0.8 |
| max_tokens | int | The maximum number of tokens that can be generated in the chat completion | 512 |
| repetition_penalty | float | The parameter for repetition penalty | 1.05 |
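For orientation, a hedged usage sketch (assuming the `CloudModel` class from this commit's `cloud_model.py` is importable, and using illustrative values from the table above):

```python
import os

# Assumption: CloudModel comes from the query-routing test algorithm
# module shown in the diff below; the import path is illustrative.
from cloud_model import CloudModel

# Credentials come from the environment, as described above.
os.environ.setdefault("OPENAI_BASE_URL", "https://api.openai.com/v1")
os.environ.setdefault("OPENAI_API_KEY", "sk_xxxxxxxx")

# CloudModel forwards these kwargs to APIBasedLLM and loads the model.
cloud = CloudModel(
    model="gpt-4o-mini",
    temperature=0.8,
    top_p=0.8,
    max_tokens=512,
    repetition_penalty=1.05,
)
```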


#### Router Configuration

The Router is a component that routes each query to either the edge model or the cloud model. It is configured via `hard_example_mining` in `examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/test_queryrouting.yaml`.

Currently, supported routers include:

| Router Type | Description | Parameters |
| ------------ | ------------------------------------------------------------ | ---------------- |
| :---: | :---: | :---: |
| EdgeOnly | Route all queries to the edge model. | - |
| CloudOnly | Route all queries to the cloud model. | - |
| OracleRouter | An oracle router that uses the `gold` label to route each query to wherever it can be answered correctly (a performance upper bound) | - |
@@ -226,7 +237,7 @@ The Data Processor allows you to custom your own data format after the dataset l
Currently, supported Data Processors include:

| Data Processor | Description | Parameters |
| ------------ | ------------------------------------------------------------ | ---------------- |
| :---: | :---: | :---: |
| OracleRouterDatasetProcessor | Exposes the `gold` label to OracleRouter | - |
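Since the Oracle pattern may be unfamiliar: below is a minimal sketch of what an Oracle-style router can do once the data processor exposes the `gold` label (illustrative logic, not the repository's implementation):

```python
# Hedged sketch of Oracle routing: send a query to the edge only when
# the edge model alone already answers it correctly; otherwise escalate.
def oracle_route(gold_label: str, edge_answer: str) -> str:
    """Return "edge" if the edge model suffices, else "cloud"."""
    return "edge" if edge_answer == gold_label else "cloud"
```

This is why OracleRouter serves as an upper bound: it spends cloud tokens only on queries the edge model would get wrong.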

## Step 3. Run Ianvs
@@ -283,18 +294,18 @@ Ianvs will output a `rank.csv` and `selected_rank.csv` in `ianvs/workspace`, whi

You can modify the relevant model parameters in `examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/test_queryrouting.yaml`, conduct multiple tests, and compare the results of different configurations.


Since MMLU-5-shot contains a large amount of data, we recommend using the GPQA dataset to test latency and throughput under different inference frameworks and the Oracle Router. Below are the test results for the two inference frameworks, `vllm` and `EAGLE`, under the Oracle Router:

```bash
+------+---------------+----------+------------+---------------------+------------+------------------------+---------------------+-------------------------+--------------------+------------------------+----------------+---------------------+------------------------+-------------------+------------------+---------------------+-------------------------------------------------------------------------------------+
| rank | algorithm | Accuracy | Edge Ratio | Time to First Token | Throughput | Internal Token Latency | Cloud Prompt Tokens | Cloud Completion Tokens | Edge Prompt Tokens | Edge Completion Tokens | paradigm | hard_example_mining | edgemodel-model | edgemodel-backend | cloudmodel-model | time | url |
+------+---------------+----------+------------+---------------------+------------+------------------------+---------------------+-------------------------+--------------------+------------------------+----------------+---------------------+------------------------+-------------------+------------------+---------------------+-------------------------------------------------------------------------------------+
| 1 | query-routing | 54.04 | 78.79 | 0.278 | 47.1 | 0.021 | 12081 | 20383 | 43636 | 64042 | jointinference | OracleRouter | Qwen/Qwen2-7B-Instruct | vllm | gpt-4o-mini | 2025-01-16 16:27:00 | ./workspace-gpqa/benchmarkingjob/query-routing/a5477f86-d3e3-11ef-aa28-0242ac110008 |
| 2 | query-routing | 39.39 | 0.0 | 1.388 | 57.48 | 0.017 | 52553 | 100395 | 0 | 0 | jointinference | CloudOnly | Qwen/Qwen2-7B-Instruct | vllm | gpt-4o-mini | 2025-01-16 16:13:12 | ./workspace-gpqa/benchmarkingjob/query-routing/e204bac6-d3dc-11ef-8dfe-0242ac110008 |
| 3 | query-routing | 32.83 | 100.0 | 0.059 | 44.95 | 0.022 | 0 | 0 | 56550 | 80731 | jointinference | EdgeOnly | Qwen/Qwen2-7B-Instruct | vllm | gpt-4o-mini | 2025-01-16 13:12:20 | ./workspace-gpqa/benchmarkingjob/query-routing/fdda7ce2-d3c1-11ef-8ea0-0242ac110008 |
| 4 | query-routing | 28.28 | 100.0 | 0.137 | 66.12 | 0.015 | 0 | 0 | 56550 | 67426 | jointinference | EdgeOnly | Qwen/Qwen2-7B-Instruct | EagleSpecDec | gpt-4o-mini | 2025-01-16 12:43:05 | ./workspace-gpqa/benchmarkingjob/query-routing/fdda7aa8-d3c1-11ef-8ea0-0242ac110008 |
+------+---------------+----------+------------+---------------------+------------+------------------------+---------------------+-------------------------+--------------------+------------------------+----------------+---------------------+------------------------+-------------------+------------------+---------------------+-------------------------------------------------------------------------------------+
+------+---------------+----------+------------+---------------------+------------+------------------------+---------------------+-------------------------+--------------------+------------------------+----------------+---------------------+---------------------------------+-------------------+------------------+---------------------+-------------------------------------------------------------------------------------+
| rank | algorithm | Accuracy | Edge Ratio | Time to First Token | Throughput | Internal Token Latency | Cloud Prompt Tokens | Cloud Completion Tokens | Edge Prompt Tokens | Edge Completion Tokens | paradigm | hard_example_mining | edgemodel-model | edgemodel-backend | cloudmodel-model | time | url |
+------+---------------+----------+------------+---------------------+------------+------------------------+---------------------+-------------------------+--------------------+------------------------+----------------+---------------------+---------------------------------+-------------------+------------------+---------------------+-------------------------------------------------------------------------------------+
| 1 | query-routing | 54.55 | 72.73 | 0.27 | 49.94 | 0.02 | 16777 | 30824 | 42823 | 66112 | jointinference | OracleRouter | NousResearch/Llama-2-7b-chat-hf | vllm | gpt-4o-mini | 2025-02-09 14:26:46 | ./workspace-gpqa/benchmarkingjob/query-routing/d393d334-e6ae-11ef-8ed1-0242ac110002 |
| 2 | query-routing | 53.54 | 74.24 | 0.301 | 89.44 | 0.011 | 16010 | 27859 | 43731 | 68341 | jointinference | OracleRouter | NousResearch/Llama-2-7b-chat-hf | EagleSpecDec | gpt-4o-mini | 2025-02-09 14:26:46 | ./workspace-gpqa/benchmarkingjob/query-routing/d393d0e6-e6ae-11ef-8ed1-0242ac110002 |
| 3 | query-routing | 40.91 | 0.0 | 0.762 | 62.57 | 0.016 | 52553 | 109922 | 0 | 0 | jointinference | CloudOnly | NousResearch/Llama-2-7b-chat-hf | vllm | gpt-4o-mini | 2025-02-09 14:26:33 | ./workspace-gpqa/benchmarkingjob/query-routing/cb8bae14-e6ae-11ef-bc17-0242ac110002 |
| 4 | query-routing | 27.78 | 100.0 | 0.121 | 110.61 | 0.009 | 0 | 0 | 62378 | 92109 | jointinference | EdgeOnly | NousResearch/Llama-2-7b-chat-hf | EagleSpecDec | gpt-4o-mini | 2025-02-09 14:26:16 | ./workspace-gpqa/benchmarkingjob/query-routing/c1afaa30-e6ae-11ef-8c1d-0242ac110002 |
| 5 | query-routing | 27.27 | 100.0 | 0.06 | 46.95 | 0.021 | 0 | 0 | 62378 | 92068 | jointinference | EdgeOnly | NousResearch/Llama-2-7b-chat-hf | vllm | gpt-4o-mini | 2025-02-09 14:26:16 | ./workspace-gpqa/benchmarkingjob/query-routing/c1afac74-e6ae-11ef-8c1d-0242ac110002 |
+------+---------------+----------+------------+---------------------+------------+------------------------+---------------------+-------------------------+--------------------+------------------------+----------------+---------------------+---------------------------------+-------------------+------------------+---------------------+-------------------------------------------------------------------------------------+
```


Binary file not shown.
@@ -18,7 +18,7 @@

from core.common.log import LOGGER
from sedna.common.class_factory import ClassType, ClassFactory
from models import APIBasedLLM, HuggingfaceLLM, VllmLLM, EagleSpecDecModel, LadeSpecDecLLM
from models import APIBasedLLM

os.environ['BACKEND_TYPE'] = 'TORCH'

@@ -32,41 +32,18 @@ def __init__(self, **kwargs):
"""Initialize the CloudModel. See `APIBasedLLM` for details about `kwargs`.
"""
LOGGER.info(kwargs)
self.kwargs = kwargs
self.model_name = kwargs.get("model", None)
self.backend = kwargs.get("backend", "huggingface")
self._set_config()
self.load()
self.model = APIBasedLLM(**kwargs)
self.load(kwargs.get("model", "gpt-4o-mini"))

    def _set_config(self):
        """Set the model path in our environment variables due to Sedna’s [check](https://github.com/kubeedge/sedna/blob/ac623ab32dc37caa04b9e8480dbe1a8c41c4a6c2/lib/sedna/core/base.py#L132).
        """
        pass
        #
        # os.environ["model_path"] = self.model_name

    def load(self, **kwargs):
        """Set the model backend to be used. Will be called by Sedna's JointInference interface.
    def load(self, model):
        """Set the model.

        Raises
        ------
        Exception
            When the backend is not supported.

        Parameters
        ----------
        model : str
            Existing model from your OpenAI provider. Example: `gpt-4o-mini`
        """
        if self.backend == "huggingface":
            self.model = HuggingfaceLLM(**self.kwargs)
        elif self.backend == "vllm":
            self.model = VllmLLM(**self.kwargs)
        elif self.backend == "api":
            self.model = APIBasedLLM(**self.kwargs)
        elif self.backend == "EagleSpecDec":
            self.model = EagleSpecDecModel(**self.kwargs)
        elif self.backend == "LadeSpecDec":
            self.model = LadeSpecDecLLM(**self.kwargs)
        else:
            raise Exception(f"Backend {self.backend} is not supported. Please use 'huggingface', 'vllm', or `api`")

        self.model._load(self.kwargs.get("model", None))
        self.model._load(model = model)

    def inference(self, data, **kwargs):
        """Inference the model with the given data.
@@ -2,5 +2,4 @@
from .huggingface_llm import HuggingfaceLLM
from .vllm_llm import VllmLLM
from .base_llm import BaseLLM
from .speculative_decoding_models.eagle_llm import EagleSpecDecModel
from .speculative_decoding_models.lade_llm import LadeSpecDecLLM
from .eagle_llm import EagleSpecDecModel
@@ -147,7 +147,7 @@ def inference(self, data):
        else:
            raise ValueError(f"DataType {type(data)} is not supported, it must be `dict`")

    def get_message_chain(self, question, system = None):
    def get_message_chain(self, question, system = "You are a helpful assistant."):
        """Get the OpenAI Chat style message chain
Parameters
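The signature change above gives `get_message_chain` a default system prompt instead of `None`. For illustration (assuming the standard OpenAI Chat message format used throughout this repo), the chain it builds would now look like:

```python
# Hypothetical output of get_message_chain("What is 2+2?") after this
# commit; the system entry comes from the new default argument.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
]
```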
@@ -41,7 +41,6 @@ def _load(self, model):
        # breakpoint()
        self.model = EaModel.from_pretrained(
            base_model_path=self.config.get("model", None),

            ea_model_path=self.config.get("draft_model", None),
            torch_dtype=torch.bfloat16,
            low_cpu_mem_usage=True,
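For context, EAGLE-style speculative decoding loads small draft weights next to the base model, which is what the `draft_model` entry documented in the README feeds into `EaModel.from_pretrained` above. A hedged pairing example (the model names are assumptions, not mandated by this commit):

```python
# Illustrative kwargs for the EagleSpecDec backend; the draft-weights
# repo name is an assumption, so substitute any compatible EAGLE pair.
edge_model_kwargs = {
    "model": "NousResearch/Llama-2-7b-chat-hf",     # base model (as in the results table)
    "backend": "EagleSpecDec",
    "draft_model": "yuhuili/EAGLE-llama2-chat-7B",  # assumed EAGLE draft weights
}
```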
@@ -73,9 +73,6 @@ def _infer(self, messages):
        most_recent_timestamp = st

        # messages = self.get_message_chain(question, system_prompt)

        streamer = TextIteratorStreamer(self.tokenizer)

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
@@ -131,3 +128,8 @@
        )

        return response

if __name__ == "__main__":
    model = HuggingfaceLLM()
    model._load("Qwen/Qwen2-7B-Instruct")
    # _infer expects an OpenAI-style message chain, not a bare string
    print(model._infer(model.get_message_chain("Hello, how are you?")))