fix: remove unnecessary backend support for CloudModel; doc: modify README.md

Signed-off-by: FuryMartin <[email protected]>
FuryMartin committed Feb 9, 2025
1 parent 9f3e847 commit 00bc01e
Showing 10 changed files with 61 additions and 227 deletions.
53 changes: 32 additions & 21 deletions examples/cloud-edge-collaborative-inference-for-llm/README.md
@@ -30,7 +30,7 @@ Additionally, Speculative Decoding $^{[3]}$ is another promising strategy to fur

The overall design is shown in the figure below.

![image-20240926143857223](./assets/image-20250115535482354.png)
![Architecture](./assets/Architecture.png)

When Ianvs starts the benchmarking job, the Test Env Manager first passes the data of the user-specified Dataset, one item at a time, to the Test Case Controller for Joint Inference.
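As a rough mental model of this flow (hedged pseudocode only, not Ianvs's actual control flow; all names below are illustrative stand-ins):

```python
# Hedged pseudocode of the benchmarking loop described above; the
# names are illustrative stand-ins, not Ianvs APIs.
def joint_inference(sample: dict) -> str:
    """Stand-in for the Test Case Controller's joint inference call."""
    return "answer"

dataset = [{"query": "q1"}, {"query": "q2"}]  # stand-in for the user dataset

results = []
for sample in dataset:                 # Test Env Manager feeds data one by one
    answer = joint_inference(sample)   # Test Case Controller runs Joint Inference
    results.append((sample, answer))   # later scored against the configured metrics
```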

@@ -144,16 +144,14 @@ Here is an example:
}
```



### Metric Configuration

*Note: If you just want to run this example quickly, you can skip this step.*

We have designed multiple metrics for edge-cloud collaborative inference, including:

| Metric | Description | Unit |
| :---------------------- | :------------------------------------------------------ | ------- |
| :---: | :---: | :---: |
| Accuracy | Accuracy on the test Dataset | - |
| Edge Ratio | Proportion of queries routed to the edge | - |
| Time to First Token | Time taken to generate the first token | s |
@@ -178,14 +176,12 @@ In the configuration file, there are two models available for configuration: `Ed

#### EdgeModel Configuration

The `EdgeModel` is designed to be deployed on your local machine, offering support for multiple serving backends including `huggingface`, `vllm`, `EAGLE`, and `LADE`. Additionally, it provides the flexibility to integrate with API-based model services.

The `CloudModel` represents the model on cloud. For extensibility, it supports both API-based models (which call LLM API via OpenAI API format) and local inference using backends like `huggingface`, `vllm`, `EAGLE`, and `LADE`. For API-based models, you need to set your `OPENAI_BASE_URL` and `OPENAI_API_KEY` in the environment variables yourself, for example:
The `EdgeModel` is designed to be deployed on your local machine, offering support for multiple serving backends including `huggingface`, `vllm`, `EagleSpecDec`, and `LADE`. Additionally, it provides the flexibility to integrate with API-based model services.

For both `EdgeModel` and `CloudModel`, the open parameters are:
For the `EdgeModel`, the arguments are:

| Parameter Name | Type | Description | Default |
| ---------------------- | ----- | ------------------------------------------------------------ | ------------------------ |
| :---: | :-----: | :---: | :---:|
| model | str | model name | Qwen/Qwen2-1.5B-Instruct |
| backend | str | model serving framework | huggingface |
| temperature | float | What sampling temperature to use, between 0 and 2 | 0.8 |
@@ -194,22 +190,37 @@ For both `EdgeModel` and `CloudModel`, the open parameters are:
| repetition_penalty | float | The parameter for repetition penalty | 1.05 |
| tensor_parallel_size | int | The size of tensor parallelism (Used for vLLM) | 1 |
| gpu_memory_utilization | float | The percentage of GPU memory utilization (Used for vLLM) | 0.9 |
| draft_model | str | The draft model used for Speculative Decoding | - |
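To make the table concrete, here is a minimal sketch of an `EdgeModel` parameter set (illustrative values only; the real configuration lives in the test YAML, and the `top_p`/`max_tokens` entries are assumed to match the `CloudModel` defaults below):

```python
# Hedged sketch: kwargs matching the EdgeModel parameter table above.
# Values are the documented defaults except `backend`, chosen here for
# illustration.
edge_model_kwargs = {
    "model": "Qwen/Qwen2-1.5B-Instruct",
    "backend": "vllm",               # one of: huggingface, vllm, EagleSpecDec, LADE
    "temperature": 0.8,
    "top_p": 0.8,
    "max_tokens": 512,
    "repetition_penalty": 1.05,
    "tensor_parallel_size": 1,       # used by vLLM
    "gpu_memory_utilization": 0.9,   # used by vLLM
    # "draft_model": "...",          # only for speculative-decoding backends
}
```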

#### CloudModel Configuration


If you want to call API-based models, you need to set your `OPENAI_BASE_URL` and `OPENAI_API_KEY` in the environment variables yourself, for example:
The `CloudModel` represents the model on the cloud; it calls the LLM API using the OpenAI API format. You need to set `OPENAI_BASE_URL` and `OPENAI_API_KEY` in your environment variables yourself, for example:

```bash
export OPENAI_BASE_URL="https://api.openai.com/v1"
export OPENAI_API_KEY=sk_xxxxxxxx
```
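Under the hood, an API-based client typically reads these variables at start-up. A hedged sketch of what that looks like (illustrative, not `APIBasedLLM`'s exact code; assumes the `openai` Python SDK is installed):

```python
import os

from openai import OpenAI  # assumption: the openai Python SDK is available

# Picks up the variables exported above; the SDK also reads
# OPENAI_API_KEY / OPENAI_BASE_URL from the environment by default.
client = OpenAI(
    base_url=os.environ["OPENAI_BASE_URL"],
    api_key=os.environ["OPENAI_API_KEY"],
)
```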

For `CloudModel`, the open parameters are:

| Parameter Name | Type | Description | Default |
| :---: | :---: | :---: | :---: |
| model | str | model name | gpt-4o-mini |
| temperature | float | What sampling temperature to use, between 0 and 2 | 0.8 |
| top_p | float | Nucleus sampling parameter | 0.8 |
| max_tokens | int | The maximum number of tokens that can be generated in the chat completion | 512 |
| repetition_penalty | float | The parameter for repetition penalty | 1.05 |
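For orientation, a hedged usage sketch (assuming the `CloudModel` class from this commit's `cloud_model.py` is importable, and using illustrative values from the table above):

```python
import os

# Assumption: CloudModel comes from the query-routing test algorithm
# module shown in the diff below; the import path is illustrative.
from cloud_model import CloudModel

# Credentials come from the environment, as described above.
os.environ.setdefault("OPENAI_BASE_URL", "https://api.openai.com/v1")
os.environ.setdefault("OPENAI_API_KEY", "sk_xxxxxxxx")

# CloudModel forwards these kwargs to APIBasedLLM and loads the model.
cloud = CloudModel(
    model="gpt-4o-mini",
    temperature=0.8,
    top_p=0.8,
    max_tokens=512,
    repetition_penalty=1.05,
)
```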


#### Router Configuration

The Router is a component that routes each query to either the edge model or the cloud model. It is configured via `hard_example_mining` in `examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/test_queryrouting.yaml`.

Currently, supported routers include:

| Router Type | Description | Parameters |
| ------------ | ------------------------------------------------------------ | ---------------- |
| :---: | :---: | :---: |
| EdgeOnly | Route all queries to the edge model. | - |
| CloudOnly | Route all queries to the cloud model. | - |
| OracleRouter | An oracle router that uses the `gold` label to route each query to wherever it can be answered correctly (a performance upper bound) | - |
@@ -226,7 +237,7 @@ The Data Processor allows you to custom your own data format after the dataset l
Currently, supported Data Processors include:

| Data Processor | Description | Parameters |
| ------------ | ------------------------------------------------------------ | ---------------- |
| :---: | :---: | :---: |
| OracleRouterDatasetProcessor | Exposes the `gold` label to OracleRouter | - |
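Since the Oracle pattern may be unfamiliar: below is a minimal sketch of what an Oracle-style router can do once the data processor exposes the `gold` label (illustrative logic, not the repository's implementation):

```python
# Hedged sketch of Oracle routing: send a query to the edge only when
# the edge model alone already answers it correctly; otherwise escalate.
def oracle_route(gold_label: str, edge_answer: str) -> str:
    """Return "edge" if the edge model suffices, else "cloud"."""
    return "edge" if edge_answer == gold_label else "cloud"
```

This is why OracleRouter serves as an upper bound: it spends cloud tokens only on queries the edge model would get wrong.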

## Step 3. Run Ianvs
@@ -283,18 +294,18 @@ Ianvs will output a `rank.csv` and `selected_rank.csv` in `ianvs/workspace`, whi

You can modify the relevant model parameters in `examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/test_queryrouting.yaml`, conduct multiple tests, and compare the results of different configurations.


Since MMLU-5-shot contains a large amount of data, we recommend using the GPQA dataset to test latency and throughput under different inference frameworks and the Oracle Router. Below are the test results for the two inference frameworks, `vllm` and `EAGLE`, under the Oracle Router:

```bash
+------+---------------+----------+------------+---------------------+------------+------------------------+---------------------+-------------------------+--------------------+------------------------+----------------+---------------------+------------------------+-------------------+------------------+---------------------+-------------------------------------------------------------------------------------+
| rank | algorithm | Accuracy | Edge Ratio | Time to First Token | Throughput | Internal Token Latency | Cloud Prompt Tokens | Cloud Completion Tokens | Edge Prompt Tokens | Edge Completion Tokens | paradigm | hard_example_mining | edgemodel-model | edgemodel-backend | cloudmodel-model | time | url |
+------+---------------+----------+------------+---------------------+------------+------------------------+---------------------+-------------------------+--------------------+------------------------+----------------+---------------------+------------------------+-------------------+------------------+---------------------+-------------------------------------------------------------------------------------+
| 1 | query-routing | 54.04 | 78.79 | 0.278 | 47.1 | 0.021 | 12081 | 20383 | 43636 | 64042 | jointinference | OracleRouter | Qwen/Qwen2-7B-Instruct | vllm | gpt-4o-mini | 2025-01-16 16:27:00 | ./workspace-gpqa/benchmarkingjob/query-routing/a5477f86-d3e3-11ef-aa28-0242ac110008 |
| 2 | query-routing | 39.39 | 0.0 | 1.388 | 57.48 | 0.017 | 52553 | 100395 | 0 | 0 | jointinference | CloudOnly | Qwen/Qwen2-7B-Instruct | vllm | gpt-4o-mini | 2025-01-16 16:13:12 | ./workspace-gpqa/benchmarkingjob/query-routing/e204bac6-d3dc-11ef-8dfe-0242ac110008 |
| 3 | query-routing | 32.83 | 100.0 | 0.059 | 44.95 | 0.022 | 0 | 0 | 56550 | 80731 | jointinference | EdgeOnly | Qwen/Qwen2-7B-Instruct | vllm | gpt-4o-mini | 2025-01-16 13:12:20 | ./workspace-gpqa/benchmarkingjob/query-routing/fdda7ce2-d3c1-11ef-8ea0-0242ac110008 |
| 4 | query-routing | 28.28 | 100.0 | 0.137 | 66.12 | 0.015 | 0 | 0 | 56550 | 67426 | jointinference | EdgeOnly | Qwen/Qwen2-7B-Instruct | EagleSpecDec | gpt-4o-mini | 2025-01-16 12:43:05 | ./workspace-gpqa/benchmarkingjob/query-routing/fdda7aa8-d3c1-11ef-8ea0-0242ac110008 |
+------+---------------+----------+------------+---------------------+------------+------------------------+---------------------+-------------------------+--------------------+------------------------+----------------+---------------------+------------------------+-------------------+------------------+---------------------+-------------------------------------------------------------------------------------+
+------+---------------+----------+------------+---------------------+------------+------------------------+---------------------+-------------------------+--------------------+------------------------+----------------+---------------------+---------------------------------+-------------------+------------------+---------------------+-------------------------------------------------------------------------------------+
| rank | algorithm | Accuracy | Edge Ratio | Time to First Token | Throughput | Internal Token Latency | Cloud Prompt Tokens | Cloud Completion Tokens | Edge Prompt Tokens | Edge Completion Tokens | paradigm | hard_example_mining | edgemodel-model | edgemodel-backend | cloudmodel-model | time | url |
+------+---------------+----------+------------+---------------------+------------+------------------------+---------------------+-------------------------+--------------------+------------------------+----------------+---------------------+---------------------------------+-------------------+------------------+---------------------+-------------------------------------------------------------------------------------+
| 1 | query-routing | 54.55 | 72.73 | 0.27 | 49.94 | 0.02 | 16777 | 30824 | 42823 | 66112 | jointinference | OracleRouter | NousResearch/Llama-2-7b-chat-hf | vllm | gpt-4o-mini | 2025-02-09 14:26:46 | ./workspace-gpqa/benchmarkingjob/query-routing/d393d334-e6ae-11ef-8ed1-0242ac110002 |
| 2 | query-routing | 53.54 | 74.24 | 0.301 | 89.44 | 0.011 | 16010 | 27859 | 43731 | 68341 | jointinference | OracleRouter | NousResearch/Llama-2-7b-chat-hf | EagleSpecDec | gpt-4o-mini | 2025-02-09 14:26:46 | ./workspace-gpqa/benchmarkingjob/query-routing/d393d0e6-e6ae-11ef-8ed1-0242ac110002 |
| 3 | query-routing | 40.91 | 0.0 | 0.762 | 62.57 | 0.016 | 52553 | 109922 | 0 | 0 | jointinference | CloudOnly | NousResearch/Llama-2-7b-chat-hf | vllm | gpt-4o-mini | 2025-02-09 14:26:33 | ./workspace-gpqa/benchmarkingjob/query-routing/cb8bae14-e6ae-11ef-bc17-0242ac110002 |
| 4 | query-routing | 27.78 | 100.0 | 0.121 | 110.61 | 0.009 | 0 | 0 | 62378 | 92109 | jointinference | EdgeOnly | NousResearch/Llama-2-7b-chat-hf | EagleSpecDec | gpt-4o-mini | 2025-02-09 14:26:16 | ./workspace-gpqa/benchmarkingjob/query-routing/c1afaa30-e6ae-11ef-8c1d-0242ac110002 |
| 5 | query-routing | 27.27 | 100.0 | 0.06 | 46.95 | 0.021 | 0 | 0 | 62378 | 92068 | jointinference | EdgeOnly | NousResearch/Llama-2-7b-chat-hf | vllm | gpt-4o-mini | 2025-02-09 14:26:16 | ./workspace-gpqa/benchmarkingjob/query-routing/c1afac74-e6ae-11ef-8c1d-0242ac110002 |
+------+---------------+----------+------------+---------------------+------------+------------------------+---------------------+-------------------------+--------------------+------------------------+----------------+---------------------+---------------------------------+-------------------+------------------+---------------------+-------------------------------------------------------------------------------------+
```


Binary file not shown.
@@ -18,7 +18,7 @@

from core.common.log import LOGGER
from sedna.common.class_factory import ClassType, ClassFactory
from models import APIBasedLLM, HuggingfaceLLM, VllmLLM, EagleSpecDecModel, LadeSpecDecLLM
from models import APIBasedLLM

os.environ['BACKEND_TYPE'] = 'TORCH'

@@ -32,41 +32,18 @@ def __init__(self, **kwargs):
"""Initialize the CloudModel. See `APIBasedLLM` for details about `kwargs`.
"""
LOGGER.info(kwargs)
self.kwargs = kwargs
self.model_name = kwargs.get("model", None)
self.backend = kwargs.get("backend", "huggingface")
self._set_config()
self.load()
self.model = APIBasedLLM(**kwargs)
self.load(kwargs.get("model", "gpt-4o-mini"))

    def _set_config(self):
        """Set the model path in our environment variables due to Sedna’s [check](https://github.com/kubeedge/sedna/blob/ac623ab32dc37caa04b9e8480dbe1a8c41c4a6c2/lib/sedna/core/base.py#L132).
        """
        pass
        #
        # os.environ["model_path"] = self.model_name

    def load(self, **kwargs):
        """Set the model backend to be used. Will be called by Sedna's JointInference interface.
    def load(self, model):
        """Set the model.

        Raises
        ------
        Exception
            When the backend is not supported.

        Parameters
        ----------
        model : str
            Existing model from your OpenAI provider. Example: `gpt-4o-mini`
        """
        if self.backend == "huggingface":
            self.model = HuggingfaceLLM(**self.kwargs)
        elif self.backend == "vllm":
            self.model = VllmLLM(**self.kwargs)
        elif self.backend == "api":
            self.model = APIBasedLLM(**self.kwargs)
        elif self.backend == "EagleSpecDec":
            self.model = EagleSpecDecModel(**self.kwargs)
        elif self.backend == "LadeSpecDec":
            self.model = LadeSpecDecLLM(**self.kwargs)
        else:
            raise Exception(f"Backend {self.backend} is not supported. Please use 'huggingface', 'vllm', or `api`")

        self.model._load(self.kwargs.get("model", None))
        self.model._load(model = model)

    def inference(self, data, **kwargs):
        """Inference the model with the given data.
@@ -2,5 +2,4 @@
from .huggingface_llm import HuggingfaceLLM
from .vllm_llm import VllmLLM
from .base_llm import BaseLLM
from .speculative_decoding_models.eagle_llm import EagleSpecDecModel
from .speculative_decoding_models.lade_llm import LadeSpecDecLLM
from .eagle_llm import EagleSpecDecModel
@@ -147,7 +147,7 @@ def inference(self, data):
        else:
            raise ValueError(f"DataType {type(data)} is not supported, it must be `dict`")

    def get_message_chain(self, question, system = None):
    def get_message_chain(self, question, system = "You are a helpful assistant."):
        """Get the OpenAI Chat style message chain
Parameters
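The signature change above gives `get_message_chain` a default system prompt instead of `None`. For illustration (assuming the standard OpenAI Chat message format used throughout this repo), the chain it builds would now look like:

```python
# Hypothetical output of get_message_chain("What is 2+2?") after this
# commit; the system entry comes from the new default argument.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
]
```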
@@ -41,7 +41,6 @@ def _load(self, model):
        # breakpoint()
        self.model = EaModel.from_pretrained(
            base_model_path=self.config.get("model", None),

            ea_model_path=self.config.get("draft_model", None),
            torch_dtype=torch.bfloat16,
            low_cpu_mem_usage=True,
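For context, EAGLE-style speculative decoding loads small draft weights next to the base model, which is what the `draft_model` entry documented in the README feeds into `EaModel.from_pretrained` above. A hedged pairing example (the model names are assumptions, not mandated by this commit):

```python
# Illustrative kwargs for the EagleSpecDec backend; the draft-weights
# repo name is an assumption, so substitute any compatible EAGLE pair.
edge_model_kwargs = {
    "model": "NousResearch/Llama-2-7b-chat-hf",     # base model (as in the results table)
    "backend": "EagleSpecDec",
    "draft_model": "yuhuili/EAGLE-llama2-chat-7B",  # assumed EAGLE draft weights
}
```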
@@ -73,9 +73,6 @@ def _infer(self, messages):
        most_recent_timestamp = st

        # messages = self.get_message_chain(question, system_prompt)

        streamer = TextIteratorStreamer(self.tokenizer)

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
@@ -131,3 +128,8 @@
        )

        return response

if __name__ == "__main__":
    model = HuggingfaceLLM()
    model._load("Qwen/Qwen2-7B-Instruct")
    # _infer expects an OpenAI-style message chain, not a bare string
    print(model._infer(model.get_message_chain("Hello, how are you?")))