
Option to split during conversion #6942

Merged
73 commits merged into ggerganov:master on Jun 24, 2024

Conversation

@christianazinn (Contributor) commented Apr 27, 2024

This PR introduces additional options to convert.py that let users split a model into shards during conversion rather than having to do it afterwards, including a default small first shard as outlined in #6463.

Other functionality we ought to have includes --split-max-size (so far it's just --split-max-tensors), displaying estimated shard sizes, dry running, and sharding support for the other convert-*-to-*.py scripts. This will remain a draft until those are worked out. It also needs considerable testing, but since this only touches the Python scripts, it can be tested easily.
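For reference, here is a minimal sketch of how these options might be declared; the flag names are the ones this PR adds, but the parser code and the parse_split_max_size helper are illustrative only, not the actual convert.py implementation.

```python
import argparse

def parse_split_max_size(value: str) -> int:
    # Turn a human-readable size such as "64M" or "4G" into a byte count
    # (base-1000, matching the later base-1024 -> base-1000 change in this PR).
    # Hypothetical helper, shown only to illustrate the option semantics.
    units = {"K": 1000, "M": 1000**2, "G": 1000**3}
    suffix = value[-1].upper()
    if suffix in units:
        return int(value[:-1]) * units[suffix]
    return int(value)

parser = argparse.ArgumentParser(description="convert a model to GGUF, optionally splitting it into shards")
parser.add_argument("--split", action="store_true",
                    help="split the converted model into multiple shards")
parser.add_argument("--split-max-tensors", type=int,
                    help="maximum number of tensors per shard")
parser.add_argument("--split-max-size", type=parse_split_max_size,
                    help="maximum shard size, e.g. 64M or 4G")
parser.add_argument("--dry-run", action="store_true",
                    help="only print the planned shards, do not write any files")
parser.add_argument("--large-first-shard", action="store_true",
                    help="put tensors into the first shard instead of keeping it metadata-only")
args = parser.parse_args()
```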

Usage

(examples are using zephyr-smol_llama-100m-sft-full)

Example: --split-max-size

python3 convert.py --outfile /path/to/outfile.gguf --outtype f16 /path/to/safetensors --split --split-max-size 64M

Output: identical to what master prints to stdout, followed by

Writing the following files:
    /path/to/outfile-00001-of-00005.gguf: n_tensors = 0, total_size = negligible - metadata only
    /path/to/outfile-00002-of-00005.gguf: n_tensors = 1, total_size = 47.1M
    /path/to/outfile-00003-of-00005.gguf: n_tensors = 11, total_size = 63.6M
    /path/to/outfile-00004-of-00005.gguf: n_tensors = 32, total_size = 63.4M
    /path/to/outfile-00005-of-00005.gguf: n_tensors = 13, total_size = 19.1M

Writing shard 2/5 with 1/57 tensors remaining (of 57 total)
[1/1] Writing tensor output.weight                          | size  32128 x    768  | type F16  | T+   2

Writing shard 3/5 with 11/56 tensors remaining (of 57 total)
[ 1/11] Writing tensor token_embd.weight                      | size  32128 x    768  | type F16  | T+   2
[ 2/11] Writing tensor blk.0.attn_norm.weight                 | size    768           | type F32  | T+   3
[ 3/11] Writing tensor blk.0.ffn_down.weight                  | size    768 x   3072  | type F16  | T+   3
[ 4/11] Writing tensor blk.0.ffn_gate.weight                  | size   3072 x    768  | type F16  | T+   3
[ 5/11] Writing tensor blk.0.ffn_up.weight                    | size   3072 x    768  | type F16  | T+   3
[ 6/11] Writing tensor blk.0.ffn_norm.weight                  | size    768           | type F32  | T+   3
[ 7/11] Writing tensor blk.0.attn_k.weight                    | size    256 x    768  | type F16  | T+   3
[ 8/11] Writing tensor blk.0.attn_output.weight               | size    768 x    768  | type F16  | T+   3
[ 9/11] Writing tensor blk.0.attn_q.weight                    | size    768 x    768  | type F16  | T+   3
[10/11] Writing tensor blk.0.attn_v.weight                    | size    256 x    768  | type F16  | T+   3
[11/11] Writing tensor blk.1.attn_norm.weight                 | size    768           | type F32  | T+   3

Writing shard 4/5 with 32/45 tensors remaining (of 57 total)
[ 1/32] Writing tensor blk.1.ffn_down.weight                  | size    768 x   3072  | type F16  | T+   0
[etc...]

With --split-max-size 200M (or any number greater than the total resultant size), it gives:

Model has smaller size than the split threshold, not splitting

Writing the following files:
    /path/to/outfile.gguf: n_tensors = 57, total_size = 193.2M

[the rest of output is the same as in master]

Example: --split-max-tensors with --dry-run and --large-first-shard

python3 convert.py --outfile /path/to/outfile.gguf --outtype f16 /path/to/safetensors --split --split-max-tensors 20 --dry-run --large-first-shard

Output: identical to what master prints to stdout, followed by

Writing the following files:
    /path/to/outfile-00001-of-00003.gguf: n_tensors = 20, total_size = 127.1M
    /path/to/outfile-00002-of-00003.gguf: n_tensors = 20, total_size = 37.5M
    /path/to/outfile-00003-of-00003.gguf: n_tensors = 17, total_size = 28.5M

Dry run, not writing files

With --split-max-tensors 64 (or any number greater than the total tensor count), it gives:

Model has fewer tensors than the split threshold, not splitting

Writing the following files:
    /path/to/outfile.gguf: n_tensors = 57, total_size = 193.2M

Dry run, not writing files
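
To summarize the behaviour shown above, here is a rough sketch of the planning step: decide whether to split at all, then assign tensors to numbered shards. The helper names are hypothetical; only the filename pattern and the "fewer tensors than the split threshold" behaviour are taken from the output shown.

```python
from pathlib import Path
from typing import Any

def plan_shards(tensors: list[Any], max_tensors: int, small_first_shard: bool = True) -> list[list[Any]]:
    # If the model is below the threshold, keep a single file.
    if len(tensors) <= max_tensors:
        return [tensors]
    # Otherwise reserve an optional metadata-only first shard, then chunk the tensors.
    shards: list[list[Any]] = [[]] if small_first_shard else []
    for i in range(0, len(tensors), max_tensors):
        shards.append(tensors[i:i + max_tensors])
    return shards

def shard_path(outfile: Path, index: int, total: int) -> Path:
    # outfile.gguf -> outfile-00001-of-00005.gguf, matching the listings above.
    return outfile.with_name(f"{outfile.stem}-{index:05d}-of-{total:05d}.gguf")
```

With 57 tensors and max_tensors=20 this plans a metadata-only shard followed by shards of 20, 20, and 17 tensors; passing small_first_shard=False reproduces the --large-first-shard example above.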

@christianazinn marked this pull request as a draft on April 27, 2024, 04:24
@christianazinn (Contributor, Author)

I've added support for --split-max-size and --dry-run, taking a page out of gguf-split.cpp. Faced with adding split functionality to the convert-*-to-*.py scripts, I wonder whether this should be added to the GGUFWriter class itself rather than to the convert scripts, since it would be tedious to rewrite every write_tensors method in convert-hf-to-gguf.py.

The counterpoint I can see is that GGUFWriter should only write one file, since it's GGUFWriter and not GGMLWriter. It would also be very annoying to rewrite GGUFWriter, and I'm hesitant to touch the gguf package as a novice. But it's also likely nobody considered this scenario when the class was first written, so perhaps there is good reason to make these changes in the GGUFWriter class. @phymbert thoughts?

@phymbert (Collaborator)

This is already a good start. Could you add end-to-end usage examples to the summary?

@christianazinn (Contributor, Author) commented Apr 28, 2024

Sure thing (I assume you mean examples of usage and expected outputs).

I also plan to rework the implementation by consolidating code into a new GGUFManager class that handles multiple file writes via multiple GGUFWriter instances, so GGUFWriter itself still writes to only one file. This is because each Model in convert-hf-to-gguf.py has only one instance of GGUFWriter, so splitting would be nearly impossible there. Usage should remain the same, but the code will be fundamentally altered. (I also imagine this could affect memory usage, so that will need to be tested heavily.)
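
To make the idea concrete, a minimal sketch of such a wrapper is shown below, assuming GGUFWriter's add_tensor / write_*_to_file / close interface from gguf-py; this is not the code that was eventually merged (the split logic later ended up inside GGUFWriter itself), and the class is heavily simplified.

```python
from gguf import GGUFWriter  # gguf-py package in this repository

class GGUFManager:
    """Fans queued tensors out to one GGUFWriter per shard (illustrative sketch)."""

    def __init__(self, path: str, arch: str, max_tensors_per_shard: int):
        self.path = path
        self.arch = arch
        self.max_tensors = max_tensors_per_shard
        self.tensors = []  # (name, data) pairs queued before writing

    def add_tensor(self, name, data):
        # Same call sites as GGUFWriter.add_tensor, but writing is deferred.
        self.tensors.append((name, data))

    def write_all(self):
        # Chunk the queued tensors and write each chunk with its own GGUFWriter.
        shards = [self.tensors[i:i + self.max_tensors]
                  for i in range(0, len(self.tensors), self.max_tensors)]
        total = len(shards)
        for idx, shard in enumerate(shards, start=1):
            fname = f"{self.path}-{idx:05d}-of-{total:05d}.gguf"
            writer = GGUFWriter(fname, self.arch)
            for name, data in shard:
                writer.add_tensor(name, data)
            writer.write_header_to_file()
            writer.write_kv_data_to_file()
            writer.write_tensors_to_file()
            writer.close()
```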

@christianazinn (Contributor, Author)

I'll need to implement for convert-llama-ggml-to-gguf.py and convert-persimmon-to-gguf.py soon - what are some models that require those scripts for conversion, so I can test? Also, I see convert-lora-to-ggml.py doesn't even use GGUFWriter - is that just for converting LoRA adapters? Is that something we should even add splitting for, considering the small size of LoRA adapters?

Anyway, GGUFManager is implemented as a near drop-in replacement for GGUFWriter that supports file splitting, so far only in convert.py (migrated from my previous commits); support for convert-hf-to-gguf.py is next up.

@slaren (Collaborator) commented Apr 28, 2024

convert-llama-ggml-to-gguf.py is for conversion of pre-gguf models. At this point it could be removed. convert-lora-to-ggml.py doesn't export to gguf format. convert-persimmon-to-gguf.py should probably be integrated into convert-hf-to-gguf.py, but I don't think it needs to be updated.

@christianazinn (Contributor, Author)

Got it - will only implement for convert-hf-to-gguf.py. Remind me to watch memory usage while converting. Since I'm making changes to the gguf package, how will I push those?

@slaren (Collaborator) commented Apr 29, 2024

You can modify the gguf package in the gguf-py directory in this repository. There are instructions for publishing new releases in https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/README.md.

@christianazinn (Contributor, Author)

> You can modify the gguf package in the gguf-py directory in this repository

That's what I've been doing so far; will check out instructions to contribute, thanks!

@christianazinn (Contributor, Author)

Testing on Mistral 7B Instruct, this branch's convert.py uses approximately the same amount of memory as master's: peak usage was 3.6 GB vs. 3.4 GB, a discrepancy of around 6%. I'll need to check larger models. Obviously memory plays a major role when splitting larger files, which is the entire point of this PR.

@christianazinn (Contributor, Author)

Running tests on my side for all convert-hf-to-gguf.py supported model architectures. What models fall under QWenLMHeadModel - is that just plain QWen 1?

@christianazinn (Contributor, Author) commented May 2, 2024

Will keep track of tests here as I go. Picking one model from each architecture in convert-hf-to-gguf.py as it exists in my branch and testing; will need assistance testing, say, vision models, which I'm not as familiar with. Also note that I went with smaller models to test the architecture; larger models should act the same, but again, tests will be needed.

It also seems like the current convert-hf-to-gguf.py doesn't print tensor status as it goes, which I intend to change.

  • GPTNeoX: EleutherAI/gpt-neox-20b - FAILED LOADING with "unknown architecture" (failed on master as well)
  • Bloom: bigscience/bloom-7b1 - WORKS
  • MPT: mosaicml/mpt-7b - WORKS
  • Orion: OrionStarAI/Orion-14B-Chat - WORKS
  • Baichuan: baichuan-inc/Baichuan2-7B-Chat - WORKS
  • Xverse: xverse/XVERSE-7B-Chat - FAILED CONVERSION with "data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 78 column 3" (failed on master as well)
  • Falcon: tiiuae/falcon-7b-instruct - WORKS
  • GPTBigCode: bigcode/gpt_bigcode-santacoder - FAILED LOADING with "tensor output.weight not found" (failed on master as well)
  • GPTRefact: smallcloudai/Refact-1_6B-fim - WORKS (incoherent code but I assume that's what it's used for)
  • Persimmon: adept/persimmon-8b-chat - Strictly "WORKS" but is incoherent - I assume this has to do with prompt formatting on master so I won't look further. It loads and generates.
  • StableLM: stabilityai/stablelm-2-1_6b-chat - WORKS
  • Mistral: mistralai/Mistral-7B-Instruct-v0.2
  • Llama2: meta-llama/Llama-2-7b-chat-hf
  • DBRX: databricks/dbrx-instruct
  • MiniCPM: openbmb/MiniCPM-V-2
  • Qwen1: Qwen/Qwen-1_8B
  • Qwen2: Qwen/Qwen1.5-1.8B
  • Qwen MoE: Qwen/Qwen1.5-MoE-A2.7B-Chat
  • GPT2: openai-community/gpt2
  • Phi2: microsoft/phi-2
  • Phi3: microsoft/Phi-3-mini-4k-instruct
  • Plamo: pfnet/plamo-13b-instruct
  • CodeShell: WisdomShell/CodeShell-7B-Chat
  • InternLM: internlm/internlm2-chat-7b
  • BERT: avsolatorio/GIST-Embedding-v0
  • NomicBERT: nomic-ai/nomic-embed-text-v1.5
  • Gemma: google/gemma-1.1-2b-it
  • StarCoder2: bigcode/starcoder2-3b
  • Mamba: TRI-ML/mamba-7b-rw
  • Cohere: CohereForAI/c4ai-command-r-v01
  • OLMo: allenai/OLMo-7B-Instruct

@christianazinn (Contributor, Author)

Leaving a note for myself to watch merge conflicts with #6511. Development on this branch has slowed down as I'm pretty busy.

@christianazinn (Contributor, Author)

Noting time to convert baichuan-inc/Baichuan2-7B-Chat.

New branch, --split, --split-max-size 4G:
real 6m27.788s
user 1m15.914s
sys 0m46.017s

New branch, no split:
real 7m17.661s
user 1m18.516s
sys 0m44.285s

master:
real 5m57.387s
user 1m14.567s
sys 0m48.403s

Note that these conversions were done writing the outfile over 2.5GbE, so there was considerable time spent just saving the file. Will test more later, but it doesn't seem like the change increases conversion time too significantly.

@mofosyne added the Review Complexity: Medium, python, and enhancement labels on May 9, 2024
@mofosyne (Collaborator) commented May 9, 2024

Merge attempted. There were some ambiguous lines, so @christianazinn should give this a look over to make sure the intent is still correct.

@christianazinn (Contributor, Author) commented May 9, 2024

I'll check in a few hours and fix conflicts.

@christianazinn (Contributor, Author)

The new get-vocab-base-pre functionality introduced to convert-hf-to-gguf.py by #6920 is throwing me off, but things look fine for the most part. Push incoming for conflict resolution; testing on Refact for convert-hf-to-gguf.py worked and no fundamental changes are required to convert.py. This will remain approximately dormant for another two weeks or so while I focus on finals, but since the code is already almost all implemented, if other people want to pick up and take this PR to the finish line I'd more than appreciate it.

@compilade (Collaborator) left a comment

I'm satisfied with how this turned out. I did not test this extensively, but from the conversions I tried (with --split-max-size and with no split, both with q8_0 and f16), this worked well.

A future PR to add split model support to GGUFReader would be nice.
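
For context, a rough sketch of what read-side support might look like, assuming GGUFReader exposes a tensors list whose entries carry a name attribute (as gguf-py's reader does); the shard discovery logic is purely illustrative and not an existing feature.

```python
from pathlib import Path
from gguf import GGUFReader

def read_split_model(first_shard: Path) -> list[str]:
    # Derive the base name from outfile-00001-of-00005.gguf and glob the sibling shards.
    base = first_shard.stem.rsplit("-of-", 1)[0].rsplit("-", 1)[0]
    shards = sorted(first_shard.parent.glob(f"{base}-*-of-*.gguf"))
    names: list[str] = []
    for shard in shards:
        reader = GGUFReader(shard)
        names.extend(t.name for t in reader.tensors)
    return names
```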

@christianazinn marked this pull request as ready for review on June 15, 2024, 15:29
@christianazinn (Contributor, Author)

Forgot to mark as ready for review. Can probably be merged.

@compilade added the merge ready label and removed the help wanted and examples labels on Jun 15, 2024
@mofosyne (Collaborator)

A few days have passed with the merge ready label; CI passed and the PR has an approval.

Consensus achieved? I'll presume it will be so by the end of the week.

@christianazinn (Contributor, Author)

It's been about a week and I see no dissent so far.

@mofosyne merged commit 52fc870 into ggerganov:master on Jun 24, 2024. 18 checks passed.
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jun 30, 2024
* support splits in convert.py

* Support split by size and dry run to write estimated shards/filesizes

* Move split functionality to new GGUFManager class

* fix improper function signature

* tentative push of convert-hf-to-gguf support

* resolve merge + SplitArguments for easier parsing

* Fix eager tensor memory leak and remove convert.py changes

Removed a memory leak caused by unexpected reference retention to eager tensors.

Also removed GGUFManager functionality in convert.py in favor of specializing for convert-hf-to-gguf.py.

* refactor SplitStrategy to be a deque

Instead of having SplitStrategy have a `data` field that is a deque, just have SplitStrategy be a subclass of deque itself.

* fix Q8 quantization

* remove unnecessary imports in gguf_manager

* fix final? merge issue

* fix gguf_writer placement and remove comments

* oops, actually fix gguf_writer placement

* reduce duplicated code from gguf_writer

* further simplify GGUFManager

* simplify even further and standardize with GGUFWriter

* reduce diffs with master

* form shards while adding tensors, SHA256 sums agree with master

* re-add type hint

Co-authored-by: compilade <[email protected]>

* GGUFWriter compatibility fix

Co-authored-by: compilade <[email protected]>

* Shard dataclass and un-negative dont_add_architecture

* type consistency in format_n_bytes_to_str

* move kv keys to constants.py

* make pathlib explicit

* base-1024 bytes to base-1000

* rename GGUFManager to GGUFWriterSplit

* Update gguf-py/gguf/constants.py

Co-authored-by: compilade <[email protected]>

* fix convert-hf-to-gguf.py permissions

* fix line endings

* Update gguf-py/gguf/gguf_writer_split.py

Co-authored-by: compilade <[email protected]>

* convert-hf : restore executable file permission

* examples/convert-legacy-llama.py: restore executable file permission

* reinstate original gguf package import and fix type annotation

* attempt to appease the linter

* attempt 2 to appease the linter

* attempt 3 to appease the linter

* comma consistency

* Update convert-hf-to-gguf.py

Co-authored-by: compilade <[email protected]>

* edit cmd line args

* use simplification from ggerganov#7827

* kv/ti data are still wrong

* try to refactor kv data (still fails)

* fix ti data messiness

* tidy up

* fix linting

* actually make the linter happy

* cleanup round 1

* remove SplitStrategy, SplitArguments

* appease linter

* fix typing and clean up

* fix linting

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <[email protected]>

* progress bar, fix split logic

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <[email protected]>

* catch oversights

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <[email protected]>

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <[email protected]>

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <[email protected]>

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <[email protected]>

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <[email protected]>

* swap bar orders

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <[email protected]>

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <[email protected]>

* compatibility fix

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <[email protected]>

* Update convert-hf-to-gguf.py

Co-authored-by: compilade <[email protected]>

---------

Co-authored-by: Brian <[email protected]>
Co-authored-by: compilade <[email protected]>
MagnusS0 pushed a commit to MagnusS0/llama.cpp-normistral-tokenizer that referenced this pull request Jul 1, 2024