Releases · AI-Hypercomputer/jetstream-pytorch
jetstream-v0.2.4
Highlight
The new command-line interface jpt becomes the main interface.
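A minimal usage sketch follows; the subcommands and flags shown are assumptions based on typical jpt usage, not an authoritative reference, so consult jpt --help for the real interface:

```
# List the supported models (assumed subcommand).
jpt list

# Start the inference server for a chosen model (assumed flag name).
jpt serve --model_id meta-llama/Meta-Llama-3-8B-Instruct
```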
What's Changed
- Add test for Mixtral model. by @wang2yn84 in #131
- Fix Mixtral quantization scaler axis when dimension > 2 by @sixiang-google in #132
- Add layer id in scope for each TransformerBlock layer by @FanhaiLu1 in #136
- Update README.md to state the limitation of accessing GCS when conver… by @wang2yn84 in #139
- Add left aligned cache support. by @wang2yn84 in #133
- Add option to enable the JAX profiler in run_server by @bvrockwell in #140
- Update benchmark command in README.md by @bhavya01 in #141
- Add server tests by @bvrockwell in #142
- Set JAX_PLATFORMS to "tpu, cpu" for ray worker by @richardsliu in #145
- Fix exception in ray_worker by @richardsliu in #144
- Make prefilling return first token for loadgen integration by @sixiang-google in #143
- Jetstream + RayServe deployment for interleave mode by @richardsliu in #146
- Make Ray engine and worker prefill return the first token by @richardsliu in #147
- Prototype a better UX by @qihqi in #134
- Add mlperf benchmark scripts in-tree. by @qihqi in #148
- Set accumulate type to bf16 in activation quant by @lsy323 in #152
- Return np instead of jax array for prefill result tokens by @FanhaiLu1 in #158
- Correct typo enbedding -> embedding by @tengomucho in #157
- Add v5e-8 Ray support by @FanhaiLu1 in #159
- Add newest llama-3 benchmarks by @qihqi in #160
- Update Ray version in Dockerfile and add v5 configs by @richardsliu in #161
- Handle v5e-8 in run_ray_serve_interleave by @richardsliu in #162
- Fix Ray engine crash on multihost by @richardsliu in #164
- Fix TPU head resource name for v4 and v5e by @richardsliu in #165
- Fixed exhausted bug between head and workers by @FanhaiLu1 in #163
- Optimize cache update. by @wang2yn84 in #151
- Add page attention manager and kvcache manager by @FanhaiLu1 in #167
- Add a script to measure speed of basic ops by @qihqi in #168
- Replace repeat kv with proper GQA handling (see the sketch after this list). by @wang2yn84 in #171
- Fix Ray engine crashes on multihost by @sixiang-google in #170
- Fix the performance regression with ragged attention on for llama2 7b. by @wang2yn84 in #172
- Add mixtral support to new CLI by @qihqi in #174
- Use kwargs to simplify the call sites a bit by @yixinshi in #175
- Add gemma support in better cli by @qihqi in #176
- Update Jetstream, add optional sampler args. by @qihqi in #177
- Update README for new CLI by @qihqi in #178
- Support End To End PagedAttention in JetStream by @FanhaiLu1 in #180
- Add offline perf ci by @qihqi in #181
- Switch from JAX to NumPy to improve attention manager performance by @FanhaiLu1 in #184
- Fix too many positional arguments lint error by @FanhaiLu1 in #186
- Add model warmup and jax compilation cache flags by @vivianrwu in #187
- Fix ray recompilation and accuracy by @sixiang-google in #189
- Make jpt the default cli - remove other entry point scripts by @qihqi in #188
- Delete convert_checkpoints and helper classmethods. by @qihqi in #190
- Add local tokenizer option for automated testing without an HF token by @sixiang-google in #192
- feat: add quantize exclude layer flag by @tengomucho in #194
- Fix: correct quantization name filtering by @tengomucho in #196
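For the GQA change in #171 above, here is a minimal sketch of how grouped-query attention can replace repeat_kv: rather than tiling the KV heads up to the query-head count, the query heads are grouped so each group shares one KV head. All names and shapes are illustrative assumptions, not the jetstream-pytorch implementation.

```python
import torch

def gqa_scores(q, k):
    """q: [batch, n_q_heads, q_len, head_dim]; k: [batch, n_kv_heads, kv_len, head_dim]."""
    b, n_q, q_len, d = q.shape
    n_kv = k.shape[1]
    group = n_q // n_kv  # number of query heads that share one KV head
    # Group the query heads instead of materializing repeated copies of k.
    q = q.reshape(b, n_kv, group, q_len, d)
    scores = torch.einsum("bngsd,bntd->bngst", q, k) / d**0.5
    return scores.reshape(b, n_q, q_len, -1)  # [batch, n_q_heads, q_len, kv_len]
```

Grouping instead of repeating avoids materializing n_q copies of the K tensor, which is where the memory-bandwidth savings during decode come from.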
New Contributors
- @sixiang-google made their first contribution in #132
- @richardsliu made their first contribution in #145
- @tengomucho made their first contribution in #157
- @yixinshi made their first contribution in #175
- @vivianrwu made their first contribution in #187
Full Changelog: jetstream-v0.2.3...jetstream-v0.2.4
jetstream-v0.2.3
What's Changed
- Enable JAX profiler server when running with Ray by @FanhaiLu1 in #112
- Add README section on interleaved multi-host serving with Ray by @FanhaiLu1 in #114
- Fix conversion bug by @yeandy in #116
- Integrate disaggregated serving with JetStream by @FanhaiLu1 in #117
- Support HF LLaMA ckpt conversion by @lsy323 in #118
- Add guide on adding HF ckpt conversion support by @lsy323 in #119
- Add support for Llama3-70b by @bhavya01 in #101
- Fix convert_checkpoint.py for hf and gemma by @qihqi in #121
- Mixtral enablement. by @wang2yn84 in #120
- Add install script for GPU by @qihqi in #122
- Add activation quantization support to per-channel quantized linear layers by @lsy323 in #105
- Remove JSON config mangling for Gemma ckpt by @lsy323 in #124
- Add different token sampling algorithms to the decoder (see the sketch after this list). by @bvrockwell in #123
- Add lock in prefill and generate to prevent starvation by @FanhaiLu1 in #126
- Update submodules, prepare for releasing v0.2.4 by @qihqi in #127
- Update README.md by @qihqi in #128
- Update summary.md by @qihqi in #125
- Update README.md by @bhavya01 in #129
- Make sure GPU works by @qihqi in #130
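As a companion to the sampling change in #123 above, here is a minimal sketch of common decoder sampling strategies (greedy, temperature, top-k). It assumes plain NumPy; the actual function names and flags in jetstream-pytorch may differ.

```python
import numpy as np

def sample(logits, strategy="greedy", temperature=1.0, top_k=0, rng=None):
    """logits: 1-D array of unnormalized scores over the vocab; returns a token id."""
    rng = rng or np.random.default_rng()
    if strategy == "greedy":
        return int(np.argmax(logits))
    logits = logits / max(temperature, 1e-6)
    if top_k > 0:
        # Mask everything outside the k highest-scoring tokens.
        kth_largest = np.sort(logits)[-top_k]
        logits = np.where(logits < kth_largest, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```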
New Contributors
- @yeandy made their first contribution in #116
- @bvrockwell made their first contribution in #123
Full Changelog: jetstream-v0.2.2...jetstream-v0.2.3
jetstream-v0.2.2
jetstream-pytorch 0.2.2
- Miscellaneous bug fixes.
- Support for the Tiktoken tokenizer (see the example after this list)
- Support for the Gemma 2B model (running data parallel)
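For the Tiktoken support above, a standalone round-trip with the tiktoken package looks like the sketch below; the encoding name is illustrative, and jetstream-pytorch's own tokenizer wiring is not shown.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding name chosen for illustration
ids = enc.encode("JetStream PyTorch")       # text -> token ids
assert enc.decode(ids) == "JetStream PyTorch"
```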
jetstream-v0.2.1
Key Changes
- Support Llama3
- Support Gemma
- MVP for Ray multi-host serving on a single pod slice
- Enable unit tests and format checks
jetstream-v0.2.0
Release JetStream PyTorch with JetStream v0.2.0 for inference