
Llamacpp with cpp backend #2527

Closed

Conversation

shrinath-suresh
Contributor

@shrinath-suresh shrinath-suresh commented Aug 16, 2023

Description

Benchmarking LLM deployment with CPP Backend

Setup and Test

  1. Follow the instructions from README.md to set up the environment

  2. Download the TheBloke/Llama-2-7B-Chat-GGML model.

cd serve/cpp/test/resources/torchscript_model/llm/llm_handler
wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_0.bin

and update the path of the model in the script here.

To control the number of tokens to be generated, set the max_context_size variable in the script to the desired value.

Note: In the next version, this step will be changed to read the LLM path from the config.

  3. Run the build
cd serve/cpp
./build.sh

Once the build succeeds, the libllm_handler.so shared object file is generated in the serve/cpp/test/resources/torchscript_model/llm/llm_handler folder.

  4. Copy the dummy.pt file to the llm_handler folder.
  5. Move to the llm_handler folder and run the following command to generate the mar file:
torch-model-archiver --model-name llm --version 1.0 --serialized-file dummy.pt --handler libllm_handler:LlmHandler --runtime LSP
  6. Move llm.mar to the model_store folder:
mkdir model_store
mv llm.mar model_store/llm.mar
  7. Create a new config.properties file and paste the following content:
default_response_timeout=300000

The default timeout is 120000 ms. With a context size of 512, LLM generation can take longer than that to complete a request on a single-GPU machine, so the timeout is raised.

  8. Start TorchServe:
torchserve --start --ncs --ts-config config.properties --model-store model_store/
  9. Register the model using the curl command:
curl -v -X POST "http://localhost:8081/models?initial_workers=1&url=llm.mar"
  10. Update the input in prompt.txt if needed and run:
curl http://localhost:8080/predictions/llm -T prompt.txt

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

  • Test A
    Logs for Test A

  • Test B
    Logs for Test B

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@chauhang chauhang added the c++ label Aug 24, 2023
Signed-off-by: Shrinath Suresh <[email protected]>
@shrinath-suresh shrinath-suresh changed the title [WIP] LLM with cpp backend Llamacpp with cpp backend Sep 13, 2023
Collaborator

@mreso mreso left a comment
Thanks for your contribution! Left a couple of comments.

Would be good to create a CMakeLists.txt in the llamacpp directory and use add_subdirectory() in the main file to keep the main one from getting too crowded.

@@ -5,3 +5,26 @@ list(APPEND MNIST_SOURCE_FILES ${MNIST_SRC_DIR}/mnist_handler.cc)
add_library(mnist_handler SHARED ${MNIST_SOURCE_FILES})
target_include_directories(mnist_handler PUBLIC ${MNIST_SRC_DIR})
target_link_libraries(mnist_handler PRIVATE ts_backends_torch_scripted ts_utils ${TORCH_LIBRARIES})

set(LLM_SRC_DIR "${torchserve_cpp_SOURCE_DIR}/src/examples/llamacpp")
set(LLAMACPP_SRC_DIR "/home/ubuntu/llama.cpp")
Good to avoid absolute paths. Is the file included in the PR? What is the license of llama.cpp? Do we need to include the license file?

target_link_libraries(llamacpp_handler PRIVATE ts_backends_torch_scripted ts_utils ${TORCH_LIBRARIES})


set(MY_OBJECT_FILES
Where are the source files for these object files?

@@ -0,0 +1,5 @@
{
"checkpoint_path" : "/home/ubuntu/llama-2-7b-chat.Q4_0.gguf"
Ditto here. Also: how big is this file?

namespace llm {

void LlamacppHandler::initialize_context() {
llama_ctx = llama_new_context_with_model(llamamodel, ctx_params);
Where is this defined?

std::shared_ptr<torch::Device>& device,
std::pair<std::string&, std::map<uint8_t, std::string>&>& idx_to_req_id,
std::shared_ptr<torchserve::InferenceResponseBatch>& response_batch) {
auto tokens_list_tensor = inputs[0].toTensor();
Can you implement processing of the whole batch? Serialized processing in a for loop would be fine for now; batched processing would be even better if it is possible with llama.cpp.
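The serialized fallback suggested above could look roughly like this minimal sketch. TokenSeq, process_batch, and generate_one are hypothetical stand-ins for the handler's real token vector type and per-request llama.cpp generation loop; only the batching structure is the point here.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical token sequence type standing in for the handler's real one.
using TokenSeq = std::vector<int>;

// Serialized batch processing: handle each request in the batch in turn,
// invoking one generation call per request. generate_one is a placeholder
// for the llama.cpp generation loop over a single prompt.
std::vector<std::string> process_batch(
    const std::vector<TokenSeq>& batch,
    const std::function<std::string(const TokenSeq&)>& generate_one) {
  std::vector<std::string> responses;
  responses.reserve(batch.size());
  for (const TokenSeq& tokens : batch) {  // serialized: one request at a time
    responses.push_back(generate_one(tokens));
  }
  return responses;
}
```

True batched decoding would share one forward pass across requests; the loop above only guarantees every request in the batch gets a response.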


std::vector<llama_token> tokens_list;

for (auto id : long_vector) {
Why do we jump through so many hoops here? Can't we write tokens_list directly from the tensor? Or can you create an array that uses the tensor's data_ptr as underlying storage, without making a copy?
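One way to read this suggestion, assuming the tensor holds int64 token ids while llama_token is a 32-bit int: the intermediate long_vector and the element-by-element loop can be replaced by a single iterator-range construction straight from the tensor's data pointer (tokens_list_tensor.data_ptr<int64_t>() in the real handler). This is a hedged sketch, not the handler's actual code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using llama_token = std::int32_t;  // llama.cpp defines llama_token as an int

// Builds the token list in one pass from the tensor's raw data pointer.
// Each element is narrowed from int64 to int32; a true zero-copy view is
// only possible if the tensor already stores 32-bit values.
std::vector<llama_token> tokens_from_tensor_data(const std::int64_t* data,
                                                 std::size_t n) {
  return std::vector<llama_token>(data, data + n);
}
```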

}
const int n_gen = std::min(32, max_context_size);

while (llama_get_kv_cache_token_count(llama_ctx) < n_gen) {
Do I read this correctly that the maximum number of tokens (including context) will be 32?
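The arithmetic behind that reading can be made explicit. Assuming the kv-cache token count includes the prompt tokens, `n_gen = std::min(32, max_context_size)` caps the total token count, not the number of newly generated tokens; this illustrative helper (not part of the handler) shows how many new tokens the loop can still produce.

```cpp
#include <algorithm>

// With the loop condition `kv_cache_token_count < n_gen`, the budget for
// newly generated tokens is n_gen minus everything already in the context
// (prompt plus tokens generated so far), clamped at zero.
int remaining_generation_budget(int prompt_tokens, int generated_tokens,
                                int max_context_size) {
  const int n_gen = std::min(32, max_context_size);  // cap from the handler
  return std::max(0, n_gen - (prompt_tokens + generated_tokens));
}
```

For example, a 30-token prompt under the default cap leaves room for only 2 generated tokens.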

llama_token new_token_id = 0;

auto logits = llama_get_logits(llama_ctx);
auto n_vocab = llama_n_vocab(llama_ctx);
Good to avoid auto for primitive data types when readability does not suffer.


torch::Tensor stacked_tensor = torch::stack(tensor_vector);
llama_print_timings(llama_ctx);
llama_free(llama_ctx);
Can the model be reused? Then this should be moved into the destructor.
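A hypothetical RAII sketch of this suggestion: keep the context alive across requests and release it exactly once when the handler is destroyed (e.g. a ~LlamacppHandler that calls llama_free), instead of freeing it at the end of every inference call. ContextGuard and count_releases are illustrative names, not part of the PR.

```cpp
#include <functional>

// Ties the release of a long-lived resource (the llama context) to object
// lifetime, so cleanup runs once in the destructor rather than per request.
struct ContextGuard {
  std::function<void()> release;  // e.g. [ctx] { llama_free(ctx); }
  ~ContextGuard() {
    if (release) release();  // runs once, when the handler is destroyed
  }
};

// Returns how many times the release hook fires for one guard's lifetime.
int count_releases() {
  int freed = 0;
  {
    ContextGuard guard{[&freed] { ++freed; }};
    // ... serve many requests reusing the same context ...
  }  // destructor fires here
  return freed;
}
```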

@shrinath-suresh
Contributor Author

@mreso Thanks for your review comments. I have already addressed a few of them (implementing the destructor, batch processing, removing auto) based on your previous comments in the babyllama PR. Will address the remaining ones and let you know.

@mreso mreso mentioned this pull request Jan 25, 2024
7 tasks
@lxning
Collaborator

lxning commented Mar 11, 2024

This feature was picked up in v0.10.0 task.

@lxning lxning closed this Mar 11, 2024