This PR contains reference implementations for Llama 3 ChatQA at both the 8B and 70B sizes.
These models use the basic interface, which assumes you are passing full documents or have already pre-processed any retrieval steps, for maximum compatibility. There is an alternative interface style that runs a built-in chunking model before applying context, but I'm assuming the straightforward implementation is better for testing and for integrating into existing systems.
Querying the LLM uses the data structure sketched below. This is a minimal illustration assuming an OpenAI-style chat endpoint; the model name, the `context` role, and the endpoint URL are placeholders rather than the exact schema from this PR:
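```python
import requests

# Minimal sketch of a query payload, assuming an OpenAI-style chat
# endpoint in front of the model; field names and the "context" role
# are illustrative, not confirmed by this PR.
payload = {
    "model": "llama3-chatqa-1.5-8b",  # or the 70B variant
    "messages": [
        # Full document text (or pre-retrieved passages) goes here,
        # matching the basic interface described above.
        {"role": "context", "content": "<full document or retrieved passages>"},
        {"role": "user", "content": "What does the document say about X?"},
    ],
    "max_tokens": 256,
    "temperature": 0.2,
}

# Hypothetical local endpoint for the reference implementation.
response = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
response.raise_for_status()
```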
Expected response (illustrative shape only; the exact fields depend on the serving stack):
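```python
# Continuing from the request above, and assuming an OpenAI-style
# completion object; the field names below are an assumption.
data = response.json()
# Example shape:
# {
#   "choices": [
#     {"message": {"role": "assistant", "content": "The document states ..."}}
#   ],
#   "usage": {"prompt_tokens": 812, "completion_tokens": 64}
# }
answer = data["choices"][0]["message"]["content"]
print(answer)
```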
Performance notes:
These models are running on A100 GPUs; you can switch the hardware to H100s for better performance if desired. This is not an optimized implementation using vLLM or TensorRT-LLM, so a production deployment could likely achieve higher tokens per second and throughput. Inference speed is quite usable as-is, though.