add_exllamav2 #1419
Conversation
def test_generate_quality(self):
    # don't need to test
    pass

def test_serialization(self):
    # don't need to test
    pass
why not?
Here, we quantize the model with the cuda-old kernel and save it so that we can later load it with exllamav2 in the test_exllama_serialization test. Since test_generate_quality and test_serialization would run on the cuda-old kernel, we don't need to run them again here: they are already covered by a previous test.
And how about generate_quality?
This is also tested in the GPTQTest class. The wording is confusing, but test_exllama_serialization in GPTQTestExllamav2 does two things: it tests loading the quantized weights with the exllamav2 kernels, and it tests inference correctness.
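To make the inheritance pattern concrete, here is a minimal self-contained sketch (class and method bodies are stand-ins, not the actual optimum test suite): a base test class exercises quality and serialization once with the cuda-old kernel, and the kernel-specific subclass overrides those inherited tests with `pass` while its own serialization test covers both loading and inference correctness.

```python
import unittest


class GPTQTestBase(unittest.TestCase):
    """Stand-in for the base GPTQTest class (runs the cuda-old kernel)."""

    def quantize_and_generate(self):
        # Placeholder for real quantization + generation; returns dummy text.
        return "hello world"

    def test_generate_quality(self):
        # The real suite compares generations against a reference output.
        self.assertIn("hello", self.quantize_and_generate())

    def test_serialization(self):
        # The real suite saves and reloads the quantized model.
        self.assertTrue(True)


class GPTQTestExllamav2(GPTQTestBase):
    """Kernel-specific subclass: skips tests already covered by the base class."""

    def test_generate_quality(self):
        # Already exercised with the cuda-old kernel in the base class.
        pass

    def test_serialization(self):
        # Same: covered by the base class.
        pass

    def test_exllama_serialization(self):
        # Does two things at once: (1) load the quantized weights with the
        # exllamav2 kernel, (2) check that inference output is correct.
        output = self.quantize_and_generate()
        self.assertIn("hello", output)
```

Because unittest collects inherited test methods, overriding them with `pass` is the simplest way to skip redundant runs without touching the base class.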
LGTM, that's great! As someone else commented on the transformers PR, it's true that the inflation of disable_* args is not very scalable; it was probably a bad idea on my end to use that pattern in AutoGPTQ.
What does this PR do?
This PR adds the possibility to choose the exllamav2 kernels for GPTQ models, following the integration of these kernels in auto-gptq. I've also added a test to check that we are able to load and run inference using the exllamav2 kernel. I will update the benchmark in a follow-up PR.
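As a usage sketch of what selecting the exllamav2 kernels can look like from the transformers side (a configuration fragment, not runnable here; the exact parameter names, in particular `exllama_config`, and the checkpoint name are assumptions to be checked against the optimum/transformers docs):

```python
from transformers import AutoModelForCausalLM, GPTQConfig

# Assumed interface: request exllamav2 kernels via the GPTQ quantization config.
quantization_config = GPTQConfig(bits=4, exllama_config={"version": 2})

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",  # hypothetical GPTQ-quantized checkpoint
    device_map="auto",
    quantization_config=quantization_config,
)
```

The appeal of a single versioned config over per-kernel disable flags is exactly the scalability point raised in the review: adding a future kernel version means one more accepted value, not one more argument.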