refactor: estimate across multiple devices
Signed-off-by: thxCode <[email protected]>
thxCode committed Aug 26, 2024
1 parent da6f757 · commit 5c1f28d
Showing 9 changed files with 837 additions and 471 deletions.
README.md: 540 changes (384 additions, 156 deletions)

Large diffs are not rendered by default.

cmd/gguf-parser/README.md: 11 changes (6 additions, 5 deletions)
@@ -37,8 +37,9 @@ GLOBAL OPTIONS:
--no-mmap Specify disabling Memory-Mapped using, which is used to estimate the usage. Memory-Mapped can avoid loading the entire model weights into RAM. (default: false)
--parallel-size value, --parallel value, --np value Specify the number of parallel sequences to decode, which is used to estimate the usage. (default: 1)
--platform-footprint value Specify the platform footprint(RAM,VRAM) of running host in MiB, which is used to estimate the NonUMA usage, default is 150,250. Different platform always gets different RAM and VRAM footprints, for example, within CUDA, 'cudaMemGetInfo' would occupy some RAM and VRAM, see https://stackoverflow.com/questions/64854862/free-memory-occupied-by-cudamemgetinfo. (default: "150,250")
+ --rpc value Specify the RPC servers, which is used to estimate the usage; it is a comma-separated list of host:port. Works with --tensor-split.
--split-mode value, --sm value Specify how to split the model across multiple devices, which is used to estimate the usage, select from [layer, row, none]. Since gguf-parser always estimates the usage of VRAM, "none" is meaningless here, keep for compatibility. (default: "layer")
- --tensor-split value, --ts value Specify the fraction of the model to offload to each device, which is used to estimate the usage, it is a comma-separated list of integer. Since gguf-parser cannot recognize the host GPU devices or RPC servers, must explicitly set --tensor-split to indicate how many devices are used.
+ --tensor-split value, --ts value Specify the fraction of the model to offload to each device, which is used to estimate the usage; it is a comma-separated list of integers. Since gguf-parser cannot recognize the host GPU devices or RPC servers, --tensor-split must be set explicitly to indicate how many devices are used. To declare that the devices belong to RPC servers, set --rpc.
--ubatch-size value, --ub value Specify the physical maximum batch size, which is used to estimate the usage. (default: 512)
Load
@@ -95,10 +96,10 @@ GLOBAL OPTIONS:
--json-pretty Works with --json, to output pretty format JSON. (default: true)
--raw Output the GGUF file information as JSON only, skip anything. (default: false)
--raw-output value Works with --raw, to save the result to the file
- --skip-architecture Skip to display architecture metadata. (default: false)
- --skip-estimate Skip to estimate. (default: false)
- --skip-model Skip to display model metadata. (default: false)
- --skip-tokenizer Skip to display tokenizer metadata. (default: false)
+ --skip-architecture Skip to display architecture. (default: false)
+ --skip-estimate Skip to estimate. By default, gguf-parser always estimates files whose type is "model". (default: false)
+ --skip-metadata Skip to display metadata. (default: false)
+ --skip-tokenizer Skip to display tokenizer. By default, gguf-parser always displays the tokenizer of files whose type is "model". (default: false)
```
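For illustration, here is a hypothetical invocation combining the multi-device flags changed in this commit. It is a sketch, not an example from the commit: the model path and the RPC endpoint are placeholders, and `--path` (from the Load options, not shown in these hunks) is assumed to be how a local GGUF file is passed.

```sh
# A sketch under assumptions, not taken from the commit:
# estimate usage across three devices by declaring three --tensor-split
# entries (ratio 4:3:3), one of which is served by an RPC server.
# "./model.Q4_K_M.gguf" and "remote-host:50052" are placeholders.
gguf-parser \
  --path ./model.Q4_K_M.gguf \
  --rpc remote-host:50052 \
  --tensor-split 4,3,3 \
  --split-mode layer
```

Per the help text above, gguf-parser cannot detect devices on its own, so the number of `--tensor-split` entries is what declares how many devices exist, and `--rpc` declares that some of them are RPC servers; `--split-mode layer` is the default and is shown only for clarity, since `none` has no effect on estimation.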
