diff --git a/examples/cloud-edge-collaborative-inference-for-llm/README.md b/examples/cloud-edge-collaborative-inference-for-llm/README.md
index 4fbe6a4d..1719d785 100644
--- a/examples/cloud-edge-collaborative-inference-for-llm/README.md
+++ b/examples/cloud-edge-collaborative-inference-for-llm/README.md
@@ -30,7 +30,7 @@ Additionally, Speculative Decoding $^{[3]}$ is another promising strategy to fur
 The overall design is shown in the figure below.
-![image-20240926143857223](./assets/image-20250115535482354.png)
+![Architecture](./assets/Architecture.png)
 When Ianvs starts the benchmarking job, the Test Env Manager will first pass the data of the user-specified Dataset to the Test Case Controller for Joint Inference one by one.
@@ -86,6 +86,8 @@ pip install -r requirements.txt
 python setup.py install
 ```
+If you want to use speculative decoding models like [EAGLE](https://github.com/SafeAILab/EAGLE), refer to the original repository for setup instructions.
+
 ## Step 2. Dataset and Model Preparation
 ### Dataset Configuration
@@ -144,8 +146,6 @@ Here is an example:
 }
 ```
-
-
 ### Metric Configuration
 *Note: If you just want to run this example quickly, you can skip this step.*
@@ -153,7 +153,7 @@
 We have designed multiple metrics for edge-cloud collaborative inference, including:
 | Metric | Description | Unit |
-| :---------------------- | :------------------------------------------------------ | ------- |
+| :---: | :---: | :---: |
 | Accuracy | Accuracy on the test Dataset | - |
 | Edge Ratio | proportion of queries router to edge | - |
 | Time to First Token | Time taken to generate the first token | s |
@@ -178,14 +178,12 @@ In the configuration file, there are two models available for configuration: `Ed
 #### EdgeModel Configuration
-The `EdgeModel` is designed to be deployed on your local machine, offering support for multiple serving backends including `huggingface`, `vllm`, `EAGLE`, and `LADE`. Additionally, it provides the flexibility to integrate with API-based model services.
+The `EdgeModel` is designed to be deployed on your local machine, offering support for multiple serving backends including `huggingface`, `vllm`, and `EagleSpecDec`. Additionally, it provides the flexibility to integrate with API-based model services.
-The `CloudModel` represents the model on cloud. For extensibility, it supports both API-based models (which call LLM API via OpenAI API format) and local inference using backends like `huggingface`, `vllm`, `EAGLE`, and `LADE`.
 For API-based models, you need to set your `OPENAI_BASE_URL` and `OPENAI_API_KEY` in the environment variables yourself, for example:
-
-For both `EdgeModel` and `CloudModel`, the open parameters are:
+For `EdgeModel`, the open parameters are:
 | Parameter Name | Type | Description | Defalut |
-| ---------------------- | ----- | ------------------------------------------------------------ | ------------------------ |
+| :---: | :---: | :---: | :---: |
 | model | str | model name | Qwen/Qwen2-1.5B-Instruct |
 | backend | str | model serving framework | huggingface |
 | temperature | float | What sampling temperature to use, between 0 and 2 | 0.8 |
@@ -194,14 +192,29 @@ For both `EdgeModel` and `CloudModel`, the open parameters are:
 | repetition_penalty | float | The parameter for repetition penalty | 1.05 |
 | tensor_parallel_size | int | The size of tensor parallelism (Used for vLLM) | 1 |
 | gpu_memory_utilization | float | The percentage of GPU memory utilization (Used for vLLM) | 0.9 |
+| draft_model | str | The draft model used for Speculative Decoding | - |
+
+#### CloudModel Configuration
-If you want to call API-based models, you need to set your `OPENAI_BASE_URL` and `OPENAI_API_KEY` in the environment variables yourself, for example:
+
+The `CloudModel` represents the model on the cloud; it calls the LLM API using the OpenAI API format. You need to set your `OPENAI_BASE_URL` and `OPENAI_API_KEY` environment variables yourself, for example:
 ```bash
 export OPENAI_BASE_URL="https://api.openai.com/v1"
 export OPENAI_API_KEY=sk_xxxxxxxx
 ```
+For `CloudModel`, the open parameters are:
+
+| Parameter Name | Type | Description | Default |
+| :---: | :---: | :---: | :---: |
+| model | str | model name | gpt-4o-mini |
+| temperature | float | What sampling temperature to use, between 0 and 2 | 0.8 |
+| top_p | float | nucleus sampling parameter | 0.8 |
+| max_tokens | int | The maximum number of tokens that can be generated in the chat completion | 512 |
+| repetition_penalty | float | The parameter for repetition penalty | 1.05 |
+
+
 #### Router Configuration
 Router is a component that routes the query to the edge or cloud model. The router is configured by `hard_example_mining` in `examples/cloud-edge-collaborative-inference-for-llm/testrouters/query-routing/test_queryrouting.yaml`.
@@ -209,7 +222,7 @@ Router is a component that routes the query to the edge or cloud model. The rout
 Currently, supported routers include:
 | Router Type | Description | Parameters |
-| ------------ | ------------------------------------------------------------ | ---------------- |
+| :---: | :---: | :---: |
 | EdgeOnly | Route all queries to the edge model. | - |
 | CloudOnly | Route all queries to the cloud model. | - |
 | OracleRouter | Optimal Router | |
@@ -226,7 +239,7 @@ The Data Processor allows you to custom your own data format after the dataset l
 Currently, supported routers include:
 | Data Processor | Description | Parameters |
-| ------------ | ------------------------------------------------------------ | ---------------- |
+| :---: | :---: | :---: |
 | OracleRouterDatasetProcessor | Expose `gold` label to OracleRouter | - |
 ## Step 3. Run Ianvs
@@ -283,18 +296,18 @@ Ianvs will output a `rank.csv` and `selected_rank.csv` in `ianvs/workspace`, whi
 You can modify the relevant model parameters in `examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/test_queryrouting.yaml`, conduct multiple tests, and compare the results of different configurations.
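As a reading aid for the result tables below: the latency metrics (Time to First Token, Internal Token Latency, Throughput) are produced by the streaming loops inside the example's model wrappers, the same pattern visible in `huggingface_llm.py` later in this diff. A simplified, self-contained sketch of that measurement (the `chunks` iterator stands in for the real token streamer and is an assumption, not the actual wrapper code):

```python
# Sketch of how the timing metrics in the results tables are derived:
# - Time to First Token: wall-clock time from request start to the first streamed chunk
# - Internal Token Latency: mean gap between subsequent chunks
# - Throughput: reciprocal of the internal token latency
import time

def measure_stream(chunks):
    start = time.perf_counter()
    previous = start
    time_to_first_token = 0.0
    gaps = []
    for _ in chunks:                      # `chunks` is any iterator of streamed tokens
        now = time.perf_counter()
        if time_to_first_token == 0.0:
            time_to_first_token = now - start
        else:
            gaps.append(now - previous)
        previous = now
    internal_token_latency = sum(gaps) / len(gaps) if gaps else 0.0
    throughput = 1 / internal_token_latency if internal_token_latency else 0.0
    return time_to_first_token, internal_token_latency, throughput
```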
- Since MMLU-5-shot has a large amount of data, we recommend using the GPQA dataset to test the latency and throughput performance under different inference frameworks and Oracle Router. Below are the test results for two inference frameworks `vllm` and `EAGLE` under Oracle Router: ```bash -+------+---------------+----------+------------+---------------------+------------+------------------------+---------------------+-------------------------+--------------------+------------------------+----------------+---------------------+------------------------+-------------------+------------------+---------------------+-------------------------------------------------------------------------------------+ -| rank | algorithm | Accuracy | Edge Ratio | Time to First Token | Throughput | Internal Token Latency | Cloud Prompt Tokens | Cloud Completion Tokens | Edge Prompt Tokens | Edge Completion Tokens | paradigm | hard_example_mining | edgemodel-model | edgemodel-backend | cloudmodel-model | time | url | -+------+---------------+----------+------------+---------------------+------------+------------------------+---------------------+-------------------------+--------------------+------------------------+----------------+---------------------+------------------------+-------------------+------------------+---------------------+-------------------------------------------------------------------------------------+ -| 1 | query-routing | 54.04 | 78.79 | 0.278 | 47.1 | 0.021 | 12081 | 20383 | 43636 | 64042 | jointinference | OracleRouter | Qwen/Qwen2-7B-Instruct | vllm | gpt-4o-mini | 2025-01-16 16:27:00 | ./workspace-gpqa/benchmarkingjob/query-routing/a5477f86-d3e3-11ef-aa28-0242ac110008 | -| 2 | query-routing | 39.39 | 0.0 | 1.388 | 57.48 | 0.017 | 52553 | 100395 | 0 | 0 | jointinference | CloudOnly | Qwen/Qwen2-7B-Instruct | vllm | gpt-4o-mini | 2025-01-16 16:13:12 | ./workspace-gpqa/benchmarkingjob/query-routing/e204bac6-d3dc-11ef-8dfe-0242ac110008 | -| 3 | query-routing | 32.83 | 100.0 | 0.059 | 44.95 | 0.022 | 0 | 0 | 56550 | 80731 | jointinference | EdgeOnly | Qwen/Qwen2-7B-Instruct | vllm | gpt-4o-mini | 2025-01-16 13:12:20 | ./workspace-gpqa/benchmarkingjob/query-routing/fdda7ce2-d3c1-11ef-8ea0-0242ac110008 | -| 4 | query-routing | 28.28 | 100.0 | 0.137 | 66.12 | 0.015 | 0 | 0 | 56550 | 67426 | jointinference | EdgeOnly | Qwen/Qwen2-7B-Instruct | EagleSpecDec | gpt-4o-mini | 2025-01-16 12:43:05 | ./workspace-gpqa/benchmarkingjob/query-routing/fdda7aa8-d3c1-11ef-8ea0-0242ac110008 | -+------+---------------+----------+------------+---------------------+------------+------------------------+---------------------+-------------------------+--------------------+------------------------+----------------+---------------------+------------------------+-------------------+------------------+---------------------+-------------------------------------------------------------------------------------+ ++------+---------------+----------+------------+---------------------+------------+------------------------+---------------------+-------------------------+--------------------+------------------------+----------------+---------------------+---------------------------------+-------------------+------------------+---------------------+-------------------------------------------------------------------------------------+ +| rank | algorithm | Accuracy | Edge Ratio | Time to First Token | Throughput | Internal Token Latency | Cloud Prompt Tokens | Cloud Completion Tokens | Edge Prompt Tokens | Edge Completion 
Tokens | paradigm | hard_example_mining | edgemodel-model | edgemodel-backend | cloudmodel-model | time | url | ++------+---------------+----------+------------+---------------------+------------+------------------------+---------------------+-------------------------+--------------------+------------------------+----------------+---------------------+---------------------------------+-------------------+------------------+---------------------+-------------------------------------------------------------------------------------+ +| 1 | query-routing | 54.55 | 72.73 | 0.27 | 49.94 | 0.02 | 16777 | 30824 | 42823 | 66112 | jointinference | OracleRouter | NousResearch/Llama-2-7b-chat-hf | vllm | gpt-4o-mini | 2025-02-09 14:26:46 | ./workspace-gpqa/benchmarkingjob/query-routing/d393d334-e6ae-11ef-8ed1-0242ac110002 | +| 2 | query-routing | 53.54 | 74.24 | 0.301 | 89.44 | 0.011 | 16010 | 27859 | 43731 | 68341 | jointinference | OracleRouter | NousResearch/Llama-2-7b-chat-hf | EagleSpecDec | gpt-4o-mini | 2025-02-09 14:26:46 | ./workspace-gpqa/benchmarkingjob/query-routing/d393d0e6-e6ae-11ef-8ed1-0242ac110002 | +| 3 | query-routing | 40.91 | 0.0 | 0.762 | 62.57 | 0.016 | 52553 | 109922 | 0 | 0 | jointinference | CloudOnly | NousResearch/Llama-2-7b-chat-hf | vllm | gpt-4o-mini | 2025-02-09 14:26:33 | ./workspace-gpqa/benchmarkingjob/query-routing/cb8bae14-e6ae-11ef-bc17-0242ac110002 | +| 4 | query-routing | 27.78 | 100.0 | 0.121 | 110.61 | 0.009 | 0 | 0 | 62378 | 92109 | jointinference | EdgeOnly | NousResearch/Llama-2-7b-chat-hf | EagleSpecDec | gpt-4o-mini | 2025-02-09 14:26:16 | ./workspace-gpqa/benchmarkingjob/query-routing/c1afaa30-e6ae-11ef-8c1d-0242ac110002 | +| 5 | query-routing | 27.27 | 100.0 | 0.06 | 46.95 | 0.021 | 0 | 0 | 62378 | 92068 | jointinference | EdgeOnly | NousResearch/Llama-2-7b-chat-hf | vllm | gpt-4o-mini | 2025-02-09 14:26:16 | ./workspace-gpqa/benchmarkingjob/query-routing/c1afac74-e6ae-11ef-8c1d-0242ac110002 | ++------+---------------+----------+------------+---------------------+------------+------------------------+---------------------+-------------------------+--------------------+------------------------+----------------+---------------------+---------------------------------+-------------------+------------------+---------------------+-------------------------------------------------------------------------------------+ ``` diff --git a/examples/cloud-edge-collaborative-inference-for-llm/assets/Architecture.png b/examples/cloud-edge-collaborative-inference-for-llm/assets/Architecture.png new file mode 100644 index 00000000..9a23f29d Binary files /dev/null and b/examples/cloud-edge-collaborative-inference-for-llm/assets/Architecture.png differ diff --git a/examples/cloud-edge-collaborative-inference-for-llm/assets/image-20250115535482354.png b/examples/cloud-edge-collaborative-inference-for-llm/assets/image-20250115535482354.png deleted file mode 100644 index c45bd942..00000000 Binary files a/examples/cloud-edge-collaborative-inference-for-llm/assets/image-20250115535482354.png and /dev/null differ diff --git a/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/cloud_model.py b/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/cloud_model.py index 751160ae..f466b367 100644 --- a/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/cloud_model.py +++ b/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/cloud_model.py @@ -18,7 +18,7 @@ from 
core.common.log import LOGGER from sedna.common.class_factory import ClassType, ClassFactory -from models import APIBasedLLM, HuggingfaceLLM, VllmLLM, EagleSpecDecModel, LadeSpecDecLLM +from models import APIBasedLLM os.environ['BACKEND_TYPE'] = 'TORCH' @@ -32,41 +32,18 @@ def __init__(self, **kwargs): """Initialize the CloudModel. See `APIBasedLLM` for details about `kwargs`. """ LOGGER.info(kwargs) - self.kwargs = kwargs - self.model_name = kwargs.get("model", None) - self.backend = kwargs.get("backend", "huggingface") - self._set_config() - self.load() + self.model = APIBasedLLM(**kwargs) + self.load(kwargs.get("model", "gpt-4o-mini")) - def _set_config(self): - """Set the model path in our environment variables due to Sedna’s [check](https://github.com/kubeedge/sedna/blob/ac623ab32dc37caa04b9e8480dbe1a8c41c4a6c2/lib/sedna/core/base.py#L132). - """ - pass - # - # os.environ["model_path"] = self.model_name - - def load(self, **kwargs): - """Set the model backend to be used. Will be called by Sedna's JointInference interface. + def load(self, model): + """Set the model. - Raises - ------ - Exception - When the backend is not supported. + Parameters + ---------- + model : str + Existing model from your OpenAI provider. Example: `gpt-4o-mini` """ - if self.backend == "huggingface": - self.model = HuggingfaceLLM(**self.kwargs) - elif self.backend == "vllm": - self.model = VllmLLM(**self.kwargs) - elif self.backend == "api": - self.model = APIBasedLLM(**self.kwargs) - elif self.backend == "EagleSpecDec": - self.model = EagleSpecDecModel(**self.kwargs) - elif self.backend == "LadeSpecDec": - self.model = LadeSpecDecLLM(**self.kwargs) - else: - raise Exception(f"Backend {self.backend} is not supported. Please use 'huggingface', 'vllm', or `api` ") - - self.model._load(self.kwargs.get("model", None)) + self.model._load(model = model) def inference(self, data, **kwargs): """Inference the model with the given data. 
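Since the simplified `CloudModel` above always wraps `APIBasedLLM`, every cloud query reduces to an OpenAI-format chat-completion request. A minimal illustration of such a request follows; it assumes the `openai` Python package and the `OPENAI_BASE_URL`/`OPENAI_API_KEY` variables from the README, and is not the actual `APIBasedLLM` implementation:

```python
# Minimal OpenAI-format call, illustrating what the CloudModel -> APIBasedLLM path amounts to.
# Assumes OPENAI_BASE_URL and OPENAI_API_KEY are exported as shown in the README.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_BASE_URL / OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",                       # the CloudModel default in this change
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why is the sky blue?"},   # placeholder query
    ],
    temperature=0.8,
    top_p=0.8,
    max_tokens=512,
)
print(response.choices[0].message.content)
```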
diff --git a/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/models/__init__.py b/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/models/__init__.py index e7733b09..3dfe9f24 100644 --- a/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/models/__init__.py +++ b/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/models/__init__.py @@ -2,5 +2,4 @@ from .huggingface_llm import HuggingfaceLLM from .vllm_llm import VllmLLM from .base_llm import BaseLLM -from .speculative_decoding_models.eagle_llm import EagleSpecDecModel -from .speculative_decoding_models.lade_llm import LadeSpecDecLLM \ No newline at end of file +from .eagle_llm import EagleSpecDecModel \ No newline at end of file diff --git a/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/models/base_llm.py b/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/models/base_llm.py index f80bb0eb..416bb46a 100644 --- a/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/models/base_llm.py +++ b/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/models/base_llm.py @@ -147,7 +147,7 @@ def inference(self, data): else: raise ValueError(f"DataType {type(data)} is not supported, it must be `dict`") - def get_message_chain(self, question, system = None): + def get_message_chain(self, question, system = "You are a helpful assistant."): """Get the OpenAI Chat style message chain Parameters diff --git a/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/models/speculative_decoding_models/eagle_llm.py b/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/models/eagle_llm.py similarity index 99% rename from examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/models/speculative_decoding_models/eagle_llm.py rename to examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/models/eagle_llm.py index f7d2aa72..ed3ba614 100644 --- a/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/models/speculative_decoding_models/eagle_llm.py +++ b/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/models/eagle_llm.py @@ -41,7 +41,6 @@ def _load(self, model): # breakpoint() self.model = EaModel.from_pretrained( base_model_path=self.config.get("model", None), - ea_model_path=self.config.get("draft_model", None), torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, diff --git a/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/models/huggingface_llm.py b/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/models/huggingface_llm.py index 8bf87385..0d3f3b6a 100644 --- a/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/models/huggingface_llm.py +++ b/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/models/huggingface_llm.py @@ -73,9 +73,6 @@ def _infer(self, messages): most_recent_timestamp = st # messages = self.get_message_chain(question, system_prompt) - - streamer = TextIteratorStreamer(self.tokenizer) - text = self.tokenizer.apply_chat_template( messages, tokenize=False, @@ -131,3 +128,8 @@ def _infer(self, messages): ) return response + +if __name__ == "__main__": + model = HuggingfaceLLM() + model._load("Qwen/Qwen2-7B-Instruct") + 
print(model._infer("Hello, how are you?")) \ No newline at end of file diff --git a/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/models/speculative_decoding_models/lade_llm.py b/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/models/speculative_decoding_models/lade_llm.py deleted file mode 100644 index 45b2eef5..00000000 --- a/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/models/speculative_decoding_models/lade_llm.py +++ /dev/null @@ -1,139 +0,0 @@ -# Copyright 2024 The KubeEdge Authors. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import os -import time -from threading import Thread - -from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer -from models.base_llm import BaseLLM - -device = "cuda" -os.environ["TOKENIZERS_PARALLELISM"] = "true" - -class LadeSpecDecLLM(BaseLLM): - def __init__(self, **kwargs) -> None: - import lade - - lade.augment_all() - #For a 7B model, set LEVEL=5, WINDOW_SIZE=7, GUESS_SET_SIZE=7 - lade.config_lade(LEVEL=7, WINDOW_SIZE=20, GUESS_SET_SIZE=20, DEBUG=1) - - """ Initialize the HuggingfaceLLM class - - Parameters - ---------- - kwargs : dict - Parameters that are passed to the model. Details can be found in the BaseLLM class. - """ - BaseLLM.__init__(self, **kwargs) - - def _load(self, model): - """Load the model via Hugging Face API - - Parameters - ---------- - model : str - Hugging Face style model name. Example: `Qwen/Qwen2.5-0.5B-Instruct` - """ - self.model = AutoModelForCausalLM.from_pretrained( - model, - torch_dtype="auto", - device_map="auto", - trust_remote_code=True - ) - self.tokenizer = AutoTokenizer.from_pretrained( - model, - trust_remote_code=True - ) - - def _infer(self, messages): - """Call the transformers inference API to get the response - - Parameters - ---------- - messages : list - OpenAI style message chain. Example: - ``` - [{"role": "user", "content": "Hello, how are you?"}] - ``` - - Returns - ------- - dict - Formatted Response. See `_format_response()` for more details. 
- """ - - st = time.perf_counter() - most_recent_timestamp = st - - # messages = self.get_message_chain(question, system_prompt) - - streamer = TextIteratorStreamer(self.tokenizer) - - text = self.tokenizer.apply_chat_template( - messages, - tokenize=False, - add_generation_prompt=True - ) - - model_inputs = self.tokenizer([text], return_tensors="pt").to(device) - - streamer = TextIteratorStreamer(self.tokenizer, skip_prompt=True) - - generation_kwargs = dict( - model_inputs, - streamer=streamer, - max_new_tokens=self.max_tokens, - temperature=self.temperature, - top_p=self.top_p, - repetition_penalty=self.repetition_penalty, - ) - - thread = Thread( - target=self.model.generate, - kwargs=generation_kwargs - ) - - thread.start() - time_to_first_token = 0 - internal_token_latency = [] - generated_text = "" - completion_tokens = 0 - - for chunk in streamer: - timestamp = time.perf_counter() - if time_to_first_token == 0: - time_to_first_token = time.perf_counter() - st - else: - internal_token_latency.append(timestamp - most_recent_timestamp) - most_recent_timestamp = timestamp - generated_text += chunk - completion_tokens += 1 - - text = generated_text.replace("<|im_end|>", "") - prompt_tokens = len(model_inputs.input_ids[0]) - internal_token_latency = sum(internal_token_latency) / len(internal_token_latency) - throughput = 1 / internal_token_latency - - response = self._format_response( - text, - prompt_tokens, - completion_tokens, - time_to_first_token, - internal_token_latency, - throughput - ) - - return response diff --git a/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/test_queryrouting.yaml b/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/test_queryrouting.yaml index 4240a03f..642e3763 100644 --- a/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/test_queryrouting.yaml +++ b/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/test_queryrouting.yaml @@ -21,7 +21,7 @@ algorithm: # name of the hyperparameter; string type; - model: values: - - "Qwen/Qwen2-7B-Instruct" + - "Qwen/Qwen2.5-7B-Instruct" - backend: # backend; string type; # currently the options of value are as follows: @@ -31,18 +31,20 @@ algorithm: # 4> "EagleSpecDec": EAGLE Speculative Decoding framework; # 5> "LadeSepcDec": Lookahead Decoding framework; values: - # - "EagleSpecDec" + - "huggingface" - "vllm" + # - "EagleSpecDec" - - draft_model: - values: - - "yuhuili/EAGLE-Qwen2-7B-Instruct" + # If you're using speculative models, uncomment the following lines: + # - draft_model: + # values: + # - "yuhuili/EAGLE-llama2-chat-7B" - temperature: # What sampling temperature to use, between 0 and 2; float type; # For reproducable results, the temperature should be set to 0; values: - - 0.9 + - 0.0000001 - top_p: # nucleus sampling parameter; float type; values: @@ -54,12 +56,12 @@ algorithm: - repetition_penalty: # The parameter for repetition penalty; float type; values: - - 1.05 + - 1 - use_cache: # Whether to use reponse cache; boolean type; values: - true - + - type: "cloudmodel" # name of python module; string type; name: "CloudModel" @@ -67,46 +69,29 @@ algorithm: url: "./examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/cloud_model.py" hyperparameters: - # name of the hyperparameter; string type; + # name of the hyperparameter; string type; - model: values: - "gpt-4o-mini" - - backend: - # backend; string type; - # currently the options of value are as follows: 
- # 1> "huggingface": transformers backend; - # 2> "vllm": vLLM backend; - # 3> "api": OpenAI API backend; - # 4> "EagleSpecDec": EAGLE Speculative Decoding framework; - # 5> "LadeSepcDec": Lookahead Decoding framework; - values: - # - "EagleSpecDec" - - "api" - temperature: - # What sampling temperature to use, between 0 and 2; float type; - # For reproducable results, the temperature should be set to 0; values: - 0.9 - top_p: - # nucleus sampling parameter; float type; values: - 0.9 - max_tokens: - # The maximum number of tokens that can be generated in the chat completion; int type; values: - 1024 - repetition_penalty: - # The parameter for repetition penalty; float type; values: - 1.05 - use_cache: - # Whether to use reponse cache; boolean type; values: - true - type: "hard_example_mining" # name of Router module; string type; # BERTRouter, EdgeOnly, CloudOnly, RandomRouter, OracleRouter - name: "CloudOnly" + name: "OracleRouter" # the url address of python module; string type; url: "./examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/hard_sample_mining.py"