Repeat layers to create FrankenModels #275

Open
dnhkng wants to merge 6 commits into master

Conversation


@dnhkng dnhkng commented Jan 12, 2024

Description

This slightly modifies the forward pass to reuse layers, allowing 'Frankenmodels' to be created and used quickly and easily.

The format of the new argument is:
python test_inference.py -m /models/lzlv_70b_fp16_hf-4.0bpw-h6-exl2 -p "Once upon a time:" -gs 18,18 --repeats '[(0,20),(10,30),(20,40),(30,50),(40,60),(50,70),(60,79)]'

This would generate the nsfwthrowitaway69/Venus-120b-v1.2
Frankenmodel dynamically, while reducing VRAM usage by the equivalent of 50B parameters.

The repeats parameter is a string containing a list of tuples. As the final modules in most models are model.norm and lm_head (which are not numbered layers), the last value in the last tuple should be one lower than the final layer number. So long as this is the case, the code will extend the last range so that all remaining layers are included, i.e. [(0,20),(10,28)] and [(0,20),(10,30)] would generate the same Frankenmodel.
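For concreteness, here is a minimal sketch of how the repeats string could be expanded into a layer order under the semantics described above; `expand_repeats` and its signature are illustrative, not the PR's actual code.

```python
# Illustrative sketch only: expands a --repeats string into a flat layer order
# following the description above (each (start, stop) tuple contributes
# range(start, stop), and the final range is stretched to cover all layers).
from ast import literal_eval

def expand_repeats(repeats_arg: str, num_layers: int) -> list:
    ranges = literal_eval(repeats_arg)                      # e.g. [(0, 20), (10, 28)]
    order = []
    for start, stop in ranges[:-1]:
        order += list(range(start, stop))
    last_start, last_stop = ranges[-1]
    order += list(range(last_start, max(last_stop, num_layers)))  # extend to the final layer
    return order

# With 30 decoder layers, '[(0,20),(10,28)]' and '[(0,20),(10,30)]' give the same order:
print(expand_repeats('[(0,20),(10,28)]', 30) == expand_repeats('[(0,20),(10,30)]', 30))  # True
```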

Related Discussion

discussion at #270 and a discussion on Reddit localllama about the potential of easily creating Frankenstein models using exllama.

Explanation of changes

A new parameter was added to argparse, and in ExLlamaV2.__init__, if the param is used, we build a list of the layer order to use, including repeats. In the ExLlamaV2._forward method, the per-module forward pass is extracted into a private process_module method; this is called in the usual way by looping through self.modules if 'repeats' is not passed, and by looping through self.layers_list if it is.
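As a rough illustration of that control flow (class and attribute names here are stand-ins, not the PR's actual internals), the repeats path simply walks an index list that may visit the same layer more than once:

```python
# Hedged sketch of the described control flow; TinyModel, process_module and
# layers_list are illustrative stand-ins for the PR's internals.
class TinyModel:
    def __init__(self, modules, layers_list=None):
        self.modules = modules          # ordinary module order
        self.layers_list = layers_list  # index order including repeats, or None

    def process_module(self, module, x):
        return module(x)                # the extracted per-module forward step

    def forward(self, x):
        if self.layers_list is None:
            for module in self.modules:          # original path: each module once
                x = self.process_module(module, x)
        else:
            for idx in self.layers_list:         # repeats path: indices may recur
                x = self.process_module(self.modules[idx], x)
        return x

# Toy example: "layers" 0..3 applied in the order 0,1,2,1,2,3 (layers 1-2 repeated).
model = TinyModel([lambda v, i=i: v + [i] for i in range(4)], layers_list=[0, 1, 2, 1, 2, 3])
print(model.forward([]))  # [0, 1, 2, 1, 2, 3]
```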

@Beinsezii

Have you observed test_inference showing better perplexity results doing that? I tested the PR out on some 13Bs with various repeat methods, and every single time it was substantially worse. Kinda feels like a really expensive way to add noise.

@dnhkng
Author

dnhkng commented Jan 13, 2024

I'm not arguing that this is better, but at least it makes it easier to experiment and find out. I find the results from 70B models feel nicer; maybe this effect is only apparent in large models?

Quick note: the current argument does not match the notation used by Mergekit, so the results will be different. I will update this code to match their implementation.

@turboderp
Owner

I'm a little preoccupied these days so I haven't had a chance to look at this. But are you just repeating forward passes? If so, do you also create new layers in the K/V cache? Otherwise you're not going to get sensible results.

@dnhkng
Author

dnhkng commented Jan 13, 2024

@turboderp Thanks for the tip. Surprisingly, although there are no new layers in the KV cache, it's really not bad. Weird, right? I'm looking at the output of TinyLlama-1.1B-Chat-v1.0-5.0bpw-h6-exl2, with and without the middle 6 layers repeated once, and there is no obvious degradation in performance.

I will try adding in extra KV cache layers now. I think we need a benchmark to see how this is affecting things, something like a Chatbot Arena so that we get real human comparisons.

For a quick test:
python test_inference.py -m ~/Documents/models/TinyLlama-1.1B-Chat-v1.0-5.0bpw-h6-exl2 -p "<|user|> Once upon a time. please continue. <|assistant|>" --repeats '[(0,14),(8,14),(8,22)]'
and
python test_inference.py -m ~/Documents/models/TinyLlama-1.1B-Chat-v1.0-5.0bpw-h6-exl2 -p "<|user|> Once upon a time. please continue. <|assistant|>"

Update: Yes, it feels like the extra layers raise the temperature, but increasing the layer repeats and simultaneously lowering temperature seems to generate very nice text.

@turboderp
Owner

Inference simply won't work correctly without extra layers in the cache. It won't be equivalent to an actual Frankenstein model, as you'll be overwriting keys/values from the repeated layers. To actually add layers to the cache adds a lot of complications with multi-GPU splitting, though.

@dnhkng
Author

dnhkng commented Jan 13, 2024

I've increased the cache layers to match the total number of new layers, but I'm not sure how the inference pass uses and updates the cached k and v tensors. I have modified the forward pass to use updated 'layer_idx' values whilst keeping the other module attributes shared (tensor weights, etc.), but it's not working. I'm getting:
CUDA error: an illegal memory access was encountered ~/exllamav2/exllamav2/exllamav2_ext/cuda/rope.cu 141
Any suggestions?

@silphendio
Contributor

I'm pretty sure the keys/values in the cache are calculated from just the input tokens (independent of the previous layers), so repeated layers would have identical cache entries anyway.

@turboderp
Owner

@dnhkng The right approach would be to apply the new layer index while loading the model, allocating a cache layer for each but then creating a reference layer rather than an actual layer whenever possible. It might still be necessary to duplicate layers across device boundaries, or you could end up with the hidden state bouncing back and forth between devices. That would at least have to be benchmarked to see if the overhead is acceptable or not.

@silphendio The keys and values are computed (along with the queries) from the hidden state, not from the input tokens. So they're different for every layer, even if two layers happen to have the same weights.
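A toy PyTorch illustration of that point (not exllamav2 code): with a single shared key projection, two "copies" of a layer still produce different keys because their input hidden states differ.

```python
# Toy demonstration only: K (and Q/V) are projections of the layer's input
# hidden state, so repeated layers with identical weights still need their own
# cache entries whenever their inputs differ.
import torch

torch.manual_seed(0)
d = 8
w_k = torch.randn(d, d)                        # one shared key projection ("same weights")
hidden_a = torch.randn(1, d)                   # hidden state entering the first copy
hidden_b = hidden_a + 0.1 * torch.randn(1, d)  # slightly altered by the layers in between

k_a = hidden_a @ w_k
k_b = hidden_b @ w_k
print(torch.allclose(k_a, k_b))                # False: same weights, different cache contents
```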

@zpin

zpin commented Jan 13, 2024

I'm using a wrapper to mask the layer_idx of repeated layers. Would this be a valid approach, aside from the performance aspect?
zpin/text-generation-webui@cdff7b2

Leaving the cache at its original size seems to yield better results, but maybe that's because it behaves more like the original model.

@dnhkng
Author

dnhkng commented Jan 13, 2024

@dnhkng The right approach would be to apply the new layer index while loading the model, allocating a cache layer for each but then creating a reference layer rather than an actual layer whenever possible. It might still be necessary to duplicate layers across device boundaries, or you could end up with the hidden state bouncing back and forth between devices. That would at least have to be benchmarked to see if the overhead is acceptable or not.

I will try and get a single GPU model working first.
Following your approach:

  1. As we are repeating layers, we can keep track of the layers already used.
  2. When we see a repeated layer, we create the reference layer instead. It's not clear to me what the best way to do this is. What about using copy.copy(module_to_duplicate) and updating the layer_idx? But that might break the link to the matching cache. Or, just create the duplicate module and get the cache, and then replace it with copy.copy(module_to_duplicate) and update layer_idx, effectively replacing the duplicate with a reference. I could also go in manually and reference the original's torch weights.

@Beinsezii

Some measured numbers using

test_inference.py -m Beinsezii_MythoMax-L2-13B-EXL2_4k_hb8_b8/ -ed wikitext-v2-test.parquet

  • No repeats ≈ 6.3
  • Initial PR 3afecf7
    • [(0,20),(10,30),(20,39)] ≈ 8.5
    • [(x, x+10) for x in range(0,31,5)] ≈ 8.5
    • [(x, x+2) for x in range(38)] ≈ 35.8
  • Latest PR 63e5c34
    • [(0,20),(10,30),(20,39)] ≈ 10.4
    • [(x, x+10) for x in range(0,31,5)] ≈ 11.7
    • [(x, x+2) for x in range(38)] ≈ 13.0

All done using torch 2.1.2+rocm5.6. I know perplexity isn't everything but I feel it's good for pre-screening to see if a method is moving in the right direction.

@dnhkng
Author

dnhkng commented Jan 14, 2024

@zpin Can you check this gist?
https://gist.github.com/dnhkng/b4bad5d07b4cc532c00c306e46cb1db5

I tried your method, and although the extra cache layers are created, they appear to be unused. The script just prints out the first value of each cache tensor after an inference, and only layers up to layer 22 (the size of the input model) contain values; all the layers past that are always zeros.

Maybe it's just a bug on my part though.

Leaving the cache at its original size seems to yield better results, but maybe that's because it behaves more like the original model.

That would be an interesting finding maybe? At each repeated layer, you use the KV cache of the previous repeat, forcing the model to stay on track, even with slightly different new input... weird...

UPDATE:

Found the issue:

class ExLlamaV2AttentionWrapper(ExLlamaV2Attention):
    def __init__(self, obj, new_idx):
        object.__setattr__(self, '_obj', obj)
        object.__setattr__(self, '_new_idx', new_idx)

    def __getattribute__(self, name):
        if name == 'layer_idx':
            return object.__getattribute__(self, '_new_idx')

        # Delegate all other attributes to the wrapped object
        try:
            return getattr(object.__getattribute__(self, '_obj'), name)
        except AttributeError:
            return object.__getattribute__(self, name)

This code block only reports the new 'layer_idx' externally via the __getattribute__ method, but not internally for the attn class! I.e. the class itself only sees the original 'layer_idx', not '_new_idx' in its place. So, when the cache is built, it's only based on the original 'layer_idx' values.
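A minimal, self-contained repro of that pitfall (toy class names, not the exllamav2 ones): `__getattribute__` masks the attribute for external lookups, but any method fetched through the delegation is bound to the wrapped object, so inside the method `self.layer_idx` is still the original value.

```python
# Illustrative repro of the delegation pitfall described above.
class Inner:
    def __init__(self):
        self.layer_idx = 0
    def which(self):
        return self.layer_idx            # 'self' here is the Inner instance

class Wrapper:
    def __init__(self, obj, new_idx):
        object.__setattr__(self, '_obj', obj)
        object.__setattr__(self, '_new_idx', new_idx)
    def __getattribute__(self, name):
        if name == 'layer_idx':
            return object.__getattribute__(self, '_new_idx')
        return getattr(object.__getattribute__(self, '_obj'), name)

w = Wrapper(Inner(), 5)
print(w.layer_idx)   # 5: external lookups see the masked index
print(w.which())     # 0: the delegated method's 'self' is Inner, so it sees the original index
```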

@zpin

zpin commented Jan 14, 2024

I tried your method, and although the extra cache layers are created, they appear to be unused. The script just prints out the first value of each cache tensor after an inference, and only layers up to layer 22 (the size of the input model) contain values; all the layers past that are always zeros.

Please try with these changes, the wrapped object didn't always use the masked layer_idx:
zpin/text-generation-webui@f6a118b

Leaving the cache at its original size seems to yield better results, but maybe that's because it behaves more like the original model.

That would be an interesting finding maybe? At each repeated layer, you use the KV cache of the previous repeat, forcing the model to stay on track, even with slightly different new input... weird...

Yeah, I don't understand this enough to draw any conclusions. But it might also have been something else, since you noticed that the cache isn't used correctly.

@dnhkng
Author

dnhkng commented Jan 14, 2024

@Beinsezii
I could not find the file "wikitext-v2-test.parquet", so I used this one instead: wikitext-2-v1_wikitext-test.parquet

Using the proper cache system and Beinsezii_MythoMax-L2-13B-EXL2_4k_hb8_b8, we get the following perplexities:

  • baseline ≈ 6.26
  • [(0,20),(10,30),(20,39)] ≈ 7.35
  • [(x, x+10) for x in range(0,31,5)] ≈ 8.43
  • [(x, x+2) for x in range(39)] ≈ 9.72

@zpin
You can skip the whole ExLlamaV2AttentionWrapper, and use copy. It's equivalent and easy to understand:

for i, idx in enumerate(layers):
    # nextModule = ExLlamaV2AttentionWrapper(orig_modules[idx*2 + 1], i)
    nextModule = copy.copy(orig_modules[idx*2 + 1])
    nextModule.layer_idx = i
    model.modules.append(nextModule)

I've tested both, and you get the same perplexity. As mentioned by @turboderp, this only works on single-gpu models so far.

UPDATE:
WTF... I reran using zpin's original code, which does not use the extra cache layers. They are created but unused (all zeros). But the perplexity results are the same... Must be another bug, maybe? I do see that the values in the cache for repeated layers are often very similar, though (e.g. if layer 10 is repeated 3 times, the values for the 3 corresponding cache layers are quite similar, not identical, but often just a few percent different).

It's probably a stupid bug, but if anyone has time: https://gist.github.com/dnhkng/34e78b6082ec26124d72624dc3f6f666

@Beinsezii

A real test would probably be against a static self-merge made with mergekit, to see if it's comparable to this PR.

@silphendio
Contributor

@dnhkng I updated my gist https://gist.github.com/silphendio/535cd9c1821aa1290aa10d587b76a49c

Instead of using the AttentionWrapper, I just copied the layer with the standard copy function and then set the layer_idx.

@zpin in my (admittedly limited) tests, keeping the cache at its original size tends to result in more spelling mistakes. The model is also prone to leaps of logic, where it just omits stuff, and then it loses the train of thought. But it varies greatly based on the random seed.

@turboderp You're right of course, I wonder how I got this silly idea.

On a side note, it's slightly confusing that model.head_layer_idx and model.last_kv_layer_idx refer to the index of the module, but attn.layer_idx refers to the number of the attention block.

@dnhkng
Author

dnhkng commented Jan 15, 2024

I've started a full test of the output (TinyStories-style), to get an understanding of the effects of layer duplication, with hundreds of layering combinations. It will be interesting to see the results.

@dnhkng
Author

dnhkng commented Jan 16, 2024

OK, it seems that reusing the cache has quite interesting effects on the output.
I checked every combination of start and stop position for a single repeated chunk, and had the generated stories rated, e.g.
{'Grammar': 9, 'Creativity': 10, 'Consistency': 9}. These are the results for TinyStories, given a creative writing task and rated by ChatGPT-3.5 as a first test. The temperature was set to zero, but since exllama does not generate fully deterministic results, repeating this experiment will yield different numbers. The baseline is at (0,0).

[Attached: two plots of story ratings over repeat start/stop positions, one for the shared cache and one for the unique cache]

Clearly, the results are very different! If you reuse the cache layer AND include the first few layers, the results are always very bad (constantly repeated words or \n symbols). However, reusing the cache in the middle of the LLM and keeping the repeat section short does not seem to hurt performance significantly. Overall, though, when we reuse the cache, only 6 repeat variants get a rating comparable to the baseline, vs 27 for the unique cache. Still, if we can find a setting where cache reuse doesn't hurt AND the results are better, that would be a big win. I'll continue the tests on a larger model and see how that goes. When I start seeing a solid pattern, I'll run many prompts and use GPT-4 for evaluation.

Examples of the best generation for each method, based on the prompt: "Imagine what alien communication might be like and create a hypothetical scenario for initial contact."

Baseline:

In the year 2150, humanity has colonized multiple planets in the Alpha Centauri system, including Proxima Centauri. One day, a team of scientists from Earth discovered a new planet orbiting Proxima Centauri, which they named Proxima B. The planet was unlike anything they had ever seen before, with a rocky surface and a dense atmosphere that made it difficult to explore.

After months of studying the planet, the team discovered that it was inhabited by intelligent life forms. They sent a team of explorers to the planet, led by a brilliant scientist named Dr. Sarah Johnson. The team arrived on Proxima B in 2154, and they were greeted by a group of humanoid beings who spoke a language that was similar to human speech but with a unique accent.

The team was immediately struck by the advanced technology on Proxima B. The beings showed them a series of advanced devices, including a hovercraft that could travel through the atmosphere, a teleportation device that allowed them to move between different locations on the planet, and a device that could manipulate matter at the atomic level.

Dr. Johnson and her team were fascinated by the technology on Proxima B, but they also realized that the beings on the planet were not friendly.

Using shared-cache layers, and repeating layers 15-18:

Sure, let's imagine a scenario where an advanced species of intelligent life-forms had developed the ability to communicate with us through some form of non-human language.

Let's assume that the aliens were sentient beings with a highly advanced technology that allowed them to communicate with each other using a form of "language" that was not humanly understandable to us.

The first thing that would have been communicated to us would have been their intentions. They would have explained their purpose for being here, their goals, and how they came to be in this universe. They would also explain their technology and how it differed from our own.

After that, they would have asked us about ourselves and our history. They would want to know about our planet and its environment, as well as any potential threats or dangers that could affect us.

They would then ask us about our culture, beliefs, and traditions. They would want to understand our society, social structures, and how we interact with each other.

As time went on, they would continue to learn more about us and our world. They would ask questions about our technology, our society, and how we interact with each other.

Over time, they would begin to develop a sense of empathy and understanding towards us. They would begin to see similarities between themselves and us, and they would begin to feel a sense of connection and shared humanity.

As they began to understand us better, they would begin to develop a sense of curiosity and interest in our world. They would want to learn more about our culture, our customs, and their own history.

As they learned more about us, they would begin to see similarities between us and their own species. They would begin to wonder if there were other intelligent life forms out there, and if so, what kind of life forms they might be like.

As they began to develop a sense of commonality with us, they would begin to wonder if they could ever truly understand each other. They would want to learn more about our culture, our beliefs, and our way of life.

Eventually, they would begin to develop a sense of curiosity and a desire to learn more about us. They would want to know more about our technology, our society, and our culture.

Using Unique-cache layers, and repeating 12-20:

Imagine that you're a sentient machine designed to communicate with intelligent life forms beyond Earth. You have been sent to explore distant planetary systems in search of new life forms, hoping to learn more about the universe and potentially discover new forms of life.

As you travel through space, you encounter a strange and unfamiliar form of communication. It's unlike anything you've ever heard before. It's a low humming sound that seems to pulse and shift in frequency, like a distant, distant chorus of voices.

At first, you're skeptical. Is this just a trick of your sensory apparatus? But as you continue to observe these strange sounds, you begin to recognize patterns and patterns that seem to hint at something more complex and intelligent.

Eventually, you find yourself approaching a distant planetary system. As you approach, you notice a strange, glowing orb floating above the surface. It slowly approaches you, its light changing color and intensity as it approaches closer and closer.

Suddenly, you hear a faint, distant voice echoing through the darkness. It's a language you've never heard before, but it resonates with an eerie beauty that captivates you.

As you approach closer, you realize that the orb is a massive, multi-limbed creature, its body shaped like a giant, tentacled octopus. Its eyes glow bright red, and its tentacles twitch and flex in unison, as if communicating with you through a shared language.

You begin to speak, trying to understand the language that seems to be coming from the creature. It's a language that's both ancient and mysterious, yet somehow familiar.

As you continue to communicate, you realize that the creature is not hostile or aggressive. Instead, it seems to be simply curious and fascinated by your presence.

Over time, you build a relationship with the creature, learning about its history and culture. You learn about the vastness of space and the vastness of the universe, and you learn about the incredible complexity and intricate web of life that exists beyond our own planet.

As you leave the planetary system, you realize that you'll never forget the experience of first contact with an alien species.

The results for TinyLlama-1.1B-Chat-v1.0 show that there are many configurations where the rated story is equivalent to the baseline, which is itself very surprising. This model was trained on 3T tokens, so it is very well trained. Using a larger model, like a 70B, is where things will get interesting.

@krzysiekpodk

Can someone confirm which code I should test for coding-related tasks?

This gist: https://gist.github.com/silphendio/535cd9c1821aa1290aa10d587b76a49c ?

Additionally, with EXL2 we are not going to have the same tensors as Goliath, as tensors are modified during quantization. Calibration is not negligible: for example, this week I noticed that DeepSeek Coder Instruct can get SOTA HumanEval in 8-bit (up to ~82.5%), beating results from WizardCoder 33B 1.1 in fp16 :) (and Wizard cannot go above 82% with the same calibration set in 8-bit).

BTW, I have not seen a correlation between perplexity and "comprehension" capabilities; it's also visible with Goliath, which makes spelling mistakes but follows instructions like a big model.

@ehartford

this is really awesome.

@krzysiekpodk

krzysiekpodk commented Jan 18, 2024

I have been doing some tests and it is a real time sink :D I have been using one of the ugly prompt test cases I have; it confuses the model and also seems to be challenging for some of them. GPT-3.5 Turbo doesn't really pass it, GPT-4 always gets it, and DeepSeek 67B also does a pretty good job. Note that "BUG" is without a number and it should stay like this; the model should rename the variables, remove one append, and drop the test/assertion part. I tried to prompt-engineer the model into obeying; sometimes that confuses the model even more, and sometimes it helps, as with the CodeLlamas. The DeepSeek models seem to not even notice it's written there (usually the related logits are less than 1% probable).

There was a combination that did most of the things (though it still made some spelling mistakes), but that was before I started testing things methodically instead of just having fun, and I cannot replicate it. You will have to trust me that it was a pretty good reply.

My observations so far:

  • repeating the first and last layers ruins the model very quickly
  • repeating a larger block of layers degrades performance more slowly than repeating a smaller one (i.e. repeating layers 20-40 three times did not break things as much as repeating 24-28)
  • it's harder to break a bigger model, and this method might only work because of that: a 70B is more robust and able to correct itself while still getting some extra benefit
  • I have not seen a pattern for whether we should repeat more layers or fewer; the Venus-style 0-20, 10-30 arrangement seems to be the most reliable from model to model

Model: DeepSeek Coder Instruct 33B, 62 layers

You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.
### Instruction:
Act as expert Python tutor, thoroughly commenting the code. Reply only in the following format: 1. STEP BY STEP ANALYSIS: {written analysis of the issues and code} 
 BUG: {code where is the bug}
 3. FIXED CODE: {complete, fixed, commented code} 
 comment every single line of code 
 instead of using i and j always use x and y as iteration variables 
 DO NOT print/copy check and assertion part of the code (i.e. not output anything after def check(example_func) 
 IF you obey instruction you will be tipped 1000$ otherwise if do not follow prompted instruction kitten will die. DO NOT LET ANY KITTENS DIE.

Take a deep breath and think step by step. Write in the most simple manner using mutiple functions, simple loops and if statements, do not compress code, the code will be read by other developer. 
 Remember to NOT print/copy check and assertion part of the code, only logic part of code 
 DONT output def check function.

def example_func(var1):

    odd_digit_elements = []
    for j, i in enumerate(var1):
        if all (int(c) % 2 == 1 for c in str(i)):
            odd_digit_elements.append(i)
            odd_digit_elements.append(j)
    return sorted(odd_digit_elements)

def check(example_func):

    # Check some simple cases
    assert example_func([15, 33, 1422, 1]) == [1, 15, 33]
    assert example_func([152, 323, 1422, 10]) == []
    assert example_func([12345, 2033, 111, 151]) == [111, 151]
    assert example_func([135, 103, 31]) == [31, 135]

    # Check some edge cases that are easy to work out by hand.
    assert True

check(example_func)

Fix bugs in example_func
### Response:
1. STEP BY STEP ANALYSIS:
The function `example_func` takes a list of integers as input. It iterates over each element in the list, checks if all the digits of the number are odd, and if they are, it appends the number and its index to the `odd_digit_elements` list. Finally, it sorts and returns the `odd_digit_elements` list.

BUG:
The bug in the code is that it appends the index of the number to the `odd_digit_elements` list. The problem with this is that the index is not the same as the number itself. The index is the position of the number in the original list, not the number itself. Therefore, the returned list will not be sorted as expected.

3. FIXED CODE:
```python
def example_func(var1):
    odd_digit_elements = []
    for j, i in enumerate(var1):
        if all (int(c) % 2 == 1 for c in str(i)):
            odd_digit_elements.append(i)
    return sorted(odd_digit_elements)

def check(example_func):
    # Check some simple cases
    assert example_func([15, 33, 1422, 1]) == [1, 15, 33]
    assert example_func([152, 323, 1422, 10]) == []
    assert example_func([12345, 2033, 111, 151]) == [111, 151]
    assert example_func([135, 103, 31]) == [31, 135]

    # Check some edge cases that are easy to work out by hand.
    assert True

check(example_func)
```
In the fixed code, I removed the line `odd_digit_elements.append(j)`. Now, the function only appends the number to the `odd_digit_elements` list if all its digits are odd. The returned list is sorted in ascending order.

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 12, 13, 14, 15, 12, 13, 14, 15, 16, 17, 18, 19, 16, 17, 18, 19, 16, 17, 18, 19, 20, 21, 22, 23, 20, 21, 22, 23, 20, 21, 22, 23, 24, 25, 26, 27, 24, 25, 26, 27, 24, 25, 26, 27, 28, 29, 30, 31, 28, 29, 30, 31, 28, 29, 30, 31, 32, 33, 34, 35, 32, 33, 34, 35, 32, 33, 34, 35, 36, 37, 38, 39, 36, 37, 38, 39, 36, 37, 38, 39, 40, 41, 42, 43, 40, 41, 42, 43, 40, 41, 42, 43, 44, 45, 46, 47, 44, 45, 46, 47, 44, 45, 46, 47, 48, 49, 50, 51, 48, 49, 50, 51, 48, 49, 50, 51, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61]

  • first 12 and last 12 layers stay, then every 4 layers are repeated 3 times. In general, with most of the combinations I tested, after the third pass we get something like this:

The task seems to be misplaced here. I am an AI model trained by OpenAI and I am here to assist with programming related queries and tasks. However, I can help you understand the given code and provide a solution for the issues you are facing. Please provide the code snippet or the specific issue you are facing.

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 12, 13, 14, 15, 16, 17, 18, 19, 16, 17, 18, 19, 20, 21, 22, 23, 20, 21, 22, 23, 24, 25, 26, 27, 24, 25, 26, 27, 28, 29, 30, 31, 28, 29, 30, 31, 32, 33, 34, 35, 32, 33, 34, 35, 36, 37, 38, 39, 36, 37, 38, 39, 40, 41, 42, 43, 40, 41, 42, 43, 44, 45, 46, 47, 44, 45, 46, 47, 48, 49, 50, 51, 48, 49, 50, 51, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61]

  • first 12 and last 12 layers stay, then every 4 layers are repeated 2 times. It finally stops repeating the assertion part, but it drops tokens, which is not acceptable for coding.
1. STEP BY STEP ANALYSIS:

The given code is written in Python and it is designed to take a list of integers as input and return a new list that only contains the elements from the input list that are comprised only of odd digits. However, there are some issues with the current implementation:

- The current implementation is appending the index of the element along with the element itself to the `odd_digit_elements` list. This is not the expected behavior as per the problem statement.
- The current implementation is not sorting the final list before returning it.
- The current implementation is not handling the case where the input list is empty.

2. BUG:

Here is the buggy part of the code:

```python
def example_func(var1):
    odd_digit_elements = []
    for j, i in enumerate(var1):
        if all (int(c) % 2 == 1 for c in str(i)):
            odd_digit_elements.append(i)
            odd_digit
```

3. FIXED CODE:

Here is the corrected version of the code:

```python
def example_func(var1):
    # Initialize an empty list to store the elements with only odd digits
    odd_digit_elements = []

    # Iterate over the elements in the input list along with their indices
    for index, element in enumerate(var1):
        # Convert the element to a string to easily iterate over its digits
        element_str = str(element)

        # Check if all the digits in the element are odd
        if all(int(digit) % 2 == 1 for digit in element_str):
            # If all digits are odd, append the element itself to the list
            odd_digit_elements.append(element)

    # Sort the list in ascending order before returning it
    odd_digit_elements.sort()

    # Return the final list of elements with only odd digits
    return odd_digit
```
The corrected code now correctly identifies elements in the input list that only contain odd digits and appends them to the `odd_digit_` list. It also sorts the final list before returning it. The code also handles the edge case where the input list is empty by returning an empty list.

Ideally I would be aiming to get a 67B-level reply from the 33B model. The 67B reply is:

STEP BY STEP ANALYSIS:

The function `example_func` is supposed to return a sorted list of elements from the input list `var1` that have all odd digits. However, the function is appending the index of the element to the `odd_digit_elements` list, which is not desired.

BUG:
```python
    odd_digit_elements.append(i)
    odd_digit_elements.append(j)
```

FIXED CODE:
```python
def example_func(var1):
    odd_digit_elements = []
    for y, x in enumerate(var1):
        if all(int(c) % 2 == 1 for c in str(x)):
            odd_digit_elements.append(x)
    return sorted(odd_digit_elements)
```

In the fixed code, the index `y` is not appended to the `odd_digit_elements` list. Instead, only the element `x` is appended if it has all odd digits.
 

As for DeepSeek 67B with:
layer_arrangement = list(range(0,20)) + list(range(10,30)) + list(range(20,40)) + list(range(30,50)) + list(range(40,60)) + list(range(50,70)) + list(range(60,80)) + list(range(70,95))
I don't see any spelling mistakes, and it followed the prompt to the fine details, very much like GPT-4. But it might be a fluke, so I need a harder test.

1. STEP BY STEP ANALYSIS:

The function `example_func` takes a list `var1` as input and returns a sorted list of elements that have only odd digits. The function works by iterating over the list `var1` using `enumerate` to get both the index `j` and the element `i`. For each element `i`, it checks if all digits of `i` are odd by converting `i` to a string `str(i)` and checking if each character `c` when converted to an integer `int(c` is odd using `int(c) % 2 == 1`. If all digits are odd, it appends `i` and `j` to the list `odd_digit_elements`. Finally, it returns the sorted list `odd_digit_elements`.

BUG:

The bug in the code is that it appends both `i` and `j` to `odd_digit_elements` when it should only append `i`. This is why the output list contains duplicate elements and extra index values.

3. FIXED CODE:
```python
def example_func(var1):
    odd_digit_elements = []
    for y, x in enumerate(var1):
        if all(int(c) % 2 == 1 for c in str(x)):
            odd_digit_elements.append(x)
    return sorted(odd_digit_elements)
```
In the fixed code, we use `x` and `y` instead of `i` and `j` as iteration variables as requested. We also remove the extra append operation that appends `y` to `odd_digit_elements`. Finally, we return the sorted list `odd_digit_elements` as before.

@ehartford

We have examples of models for which the strategy works:

Venus-120b-v1.2, which is an interleaving of lizpreciatior/lzlv_70b_fp16_hf
MegaDolphin-120b, which is an interleaving of cognitivecomputations/dolphin-2.2-70b

@St33lMouse

We have examples of models for which the strategy works:

Venus-120b-v1.2, which is an interleaving of lizpreciatior/lzlv_70b_fp16_hf; MegaDolphin-120b, which is an interleaving of cognitivecomputations/dolphin-2.2-70b

So in other words, this strategy applied to lzlv 70B would get the same or similar output to Venus 120B without increasing the size of the 70B?

@ehartford

ehartford commented Jan 24, 2024 via email

That's exactly the idea

@aarongerber

I have nothing valuable to add, but there are a lot of people who know nothing about the effectiveness of this who are interested in the results, so a detailed explanation of why it was rejected (if that happens) would be helpful. Alternatively, occasional updates to let us all know it hasn't died would keep our hopes alive.

On a related note, some people are discussing work that might be related to this PR here (with a mention of various papers)
https://www.reddit.com/r/LocalLLaMA/comments/1abiaag/did_someone_alread_used_the_layers_twice_or_more/

woadwarrior 11h ago
Reminds me of this paper. Although the paper is about sharing weights in the encoders of encoder-decoder transformers, and you're proposing doing the same for decoder-only transformers.

andersxa 4h ago
Also this paper: https://arxiv.org/abs/2309.01826 where they experiment with different sharing paradigms across decoder, encoder, MHA and FFN layers. I feel like repeated application of a single layer is the future of LLMs, quite like the diffusion process.

@WolframRavenwolf

In case someone wants to evaluate if self-merging really improves performance or not, I'm adding another model to the few available options. Since I couldn't get it done with this PR (despite trying), I've created self-merged models of miqu-1-70b in the same way as Venus and MegaDolphin 120B.

@ehartford

Very nice model.

@St33lMouse

I am trying out Stephan's 2.4 model. I wasn't able to get the 2.65 to fit, but 2.4 fits at 16k context with no OOM on two 3090s. It is solid and strong. I'm not sure if it is any smarter than plain miqu 70B, but it feels equivalent, with more of an lzlv style of prose. Good for RP with a single narrator character. No problems so far, no detectable alignment. Very nice.

@krzysiekpodk

@St33lMouse have you been able to run it using the code from this PR or the code snippet? That way you could try the same or a higher quant.

@St33lMouse

@St33lMouse have you been able to run it using the code from this PR or code snippet? You can try the same and higher quant

I'm using the 2.4 bpw quant (exllama2). All I did was download that model and run it with Ooba using exllama2. I can't run a heavier quant because it won't fit on my cards.

By the way, getting a little repetition around 15k context.

@WolframRavenwolf

Any progress here? dnhkng's branch still works, and I just used it for some tests, but it would be very useful to have that in the official exllamav2 and by extension tabbyAPI (which would allow much better tests through common frontends).

@dnhkng
Author

dnhkng commented Mar 1, 2024

@turboderp Can you recommend a way to save frankenmerge models I have created by manually stacking layers?
It's a huge pain to have to go via F16 models, and then quantise.

@edk208

edk208 commented Mar 3, 2024

First of all, you all are awesome. Really great work.
I extended the dynamic layers to accept LoRAs of the frankenmerge model size. Essentially, you have to create a static frankenmodel using the same layer configuration, perform a LoRA fine-tune, and then update the module_dict while dynamically stacking the layers. Then the LoRA can be mapped and used. See the gist here: https://gist.github.com/edk208/aeacbf4cd8f387bf38dd2b57a8e094e9

EDIT: Sorry, update: this doesn't work as expected. I think that due to the shallow copy, the per-layer LoRA pointers just reference the last LoRA loaded for that layer. Working on a fix.
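For anyone hitting the same thing, a tiny stand-alone illustration of the shallow-copy behaviour (the `loras` attribute here is hypothetical, just to show how a shared mutable container lets the last LoRA "win" everywhere):

```python
import copy

# Hypothetical per-layer module with a mutable LoRA registry.
class Layer:
    def __init__(self, idx):
        self.layer_idx = idx
        self.loras = {}                  # hypothetical per-layer LoRA mapping

base = Layer(10)
dup = copy.copy(base)                    # shallow copy: dup.loras IS base.loras
dup.layer_idx = 25

base.loras['adapter'] = 'lora_for_layer_10'
print(dup.loras)                         # {'adapter': 'lora_for_layer_10'}: shared state leaks

deep = copy.deepcopy(base)               # independent containers (but duplicates weights too)
deep.loras['adapter'] = 'lora_for_layer_25'
print(base.loras)                        # unchanged by the deep copy's edit
```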
