refactor: estimate across multiple devices
Signed-off-by: thxCode <[email protected]>
thxCode committed Aug 26, 2024
1 parent da6f757 · commit 5c1f28d
Showing 9 changed files with 837 additions and 471 deletions.
README.md: 540 changes (384 additions, 156 deletions)

Large diffs are not rendered by default.

cmd/gguf-parser/README.md: 11 changes (6 additions, 5 deletions)
@@ -37,8 +37,9 @@ GLOBAL OPTIONS:
--no-mmap Specify disabling Memory-Mapped using, which is used to estimate the usage. Memory-Mapped can avoid loading the entire model weights into RAM. (default: false)
--parallel-size value, --parallel value, --np value Specify the number of parallel sequences to decode, which is used to estimate the usage. (default: 1)
--platform-footprint value Specify the platform footprint(RAM,VRAM) of running host in MiB, which is used to estimate the NonUMA usage, default is 150,250. Different platform always gets different RAM and VRAM footprints, for example, within CUDA, 'cudaMemGetInfo' would occupy some RAM and VRAM, see https://stackoverflow.com/questions/64854862/free-memory-occupied-by-cudamemgetinfo. (default: "150,250")
+ --rpc value Specify the RPC servers, which is used to estimate the usage; it is a comma-separated list of host:port. Works with --tensor-split.
--split-mode value, --sm value Specify how to split the model across multiple devices, which is used to estimate the usage, select from [layer, row, none]. Since gguf-parser always estimates the usage of VRAM, "none" is meaningless here, keep for compatibility. (default: "layer")
- --tensor-split value, --ts value Specify the fraction of the model to offload to each device, which is used to estimate the usage, it is a comma-separated list of integer. Since gguf-parser cannot recognize the host GPU devices or RPC servers, must explicitly set --tensor-split to indicate how many devices are used.
+ --tensor-split value, --ts value Specify the fraction of the model to offload to each device, which is used to estimate the usage; it is a comma-separated list of integers. Since gguf-parser cannot recognize the host GPU devices or RPC servers, --tensor-split must be set explicitly to indicate how many devices are used. To declare that the devices belong to RPC servers, set --rpc.
--ubatch-size value, --ub value Specify the physical maximum batch size, which is used to estimate the usage. (default: 512)
Load
@@ -95,10 +96,10 @@ GLOBAL OPTIONS:
--json-pretty Works with --json, to output pretty format JSON. (default: true)
--raw Output the GGUF file information as JSON only, skip anything. (default: false)
--raw-output value Works with --raw, to save the result to the file
- --skip-architecture Skip to display architecture metadata. (default: false)
- --skip-estimate Skip to estimate. (default: false)
- --skip-model Skip to display model metadata. (default: false)
- --skip-tokenizer Skip to display tokenizer metadata. (default: false)
+ --skip-architecture Skip to display architecture. (default: false)
+ --skip-estimate Skip to estimate. By default, gguf-parser always estimates files whose type is "model". (default: false)
+ --skip-metadata Skip to display metadata. (default: false)
+ --skip-tokenizer Skip to display tokenizer. By default, gguf-parser always displays the tokenizer of files whose type is "model". (default: false)
```
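For illustration, here is a hypothetical invocation combining the multi-device flags changed in this commit. It is a sketch, not an example from the commit: the model path and the RPC endpoint are placeholders, and `--path` (from the Load options, not shown in these hunks) is assumed to be how a local GGUF file is passed.

```sh
# A sketch under assumptions, not taken from the commit:
# estimate usage across three devices by declaring three --tensor-split
# entries (ratio 4:3:3), one of which is served by an RPC server.
# "./model.Q4_K_M.gguf" and "remote-host:50052" are placeholders.
gguf-parser \
  --path ./model.Q4_K_M.gguf \
  --rpc remote-host:50052 \
  --tensor-split 4,3,3 \
  --split-mode layer
```

Per the help text above, gguf-parser cannot detect devices on its own, so the number of `--tensor-split` entries is what declares how many devices exist, and `--rpc` declares that some of them are RPC servers; `--split-mode layer` is the default and is shown only for clarity, since `none` has no effect on estimation.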
