jetstream-v0.2.4
Pre-release
Highlight
The new command-line interface, jpt, becomes the main interface.
What's Changed
- Add test for Mixtral model. by @wang2yn84 in #131
- fix mixtral quantization scaler axis when dimension > 2 by @sixiang-google in #132
- Add layer id in scope for each TransformerBlock layer by @FanhaiLu1 in #136
- Update README.md to state the limitation of accessing GCS when conver… by @wang2yn84 in #139
- Add left aligned cache support. by @wang2yn84 in #133
- add enable jax profiler to run_server by @bvrockwell in #140
- Update benchmark command in README.md by @bhavya01 in #141
- Add server tests by @bvrockwell in #142
- Set JAX_PLATFORMS to "tpu, cpu" for ray worker by @richardsliu in #145
- Fix exception in ray_worker by @richardsliu in #144
- Make prefilling return first token for loadgen integration by @sixiang-google in #143
- Jetstream + RayServe deployment for interleave mode by @richardsliu in #146
- Make Ray engine and worker process prefill returning first token by @richardsliu in #147
- prototyping better UX by @qihqi in #134
- Add mlperf benchmark scripts in-tree. by @qihqi in #148
- Set accumulate type to bf16 in activation quant by @lsy323 in #152
- Return np instead of jax array for prefill result tokens by @FanhaiLu1 in #158
- Correct typo enbedding -> embedding by @tengomucho in #157
- V5e8 ray by @FanhaiLu1 in #159
- Add newest llama-3 benchmarks by @qihqi in #160
- Update Ray version in Dockerfile and add v5 configs by @richardsliu in #161
- Handle v5e-8 in run_ray_serve_interleave by @richardsliu in #162
- Fix Ray engine crash on multihost by @richardsliu in #164
- Fix TPU head resource name for v4 and v5e by @richardsliu in #165
- Fixed exhausted bug between head and workers by @FanhaiLu1 in #163
- Optimize cache update. by @wang2yn84 in #151
- Add page attention manager and kvcache manager by @FanhaiLu1 in #167
- Add a script to measure speed of basic ops by @qihqi in #168
- Replace repeat kv with proper GQA handling. by @wang2yn84 in #171
- fix ray engine crashes on multihost by @sixiang-google in #170
- Fix the performance regression with ragged attention on for llama2 7b. by @wang2yn84 in #172
- Add mixtral support to new CLI by @qihqi in #174
- Use kwargs to simplify the call sites a bit by @yixinshi in #175
- Add gemma support in better cli by @qihqi in #176
- Update Jetstream, add optional sampler args. by @qihqi in #177
- Update README for new CLI by @qihqi in #178
- Support End To End PagedAttention in JetStream by @FanhaiLu1 in #180
- Add offline perf ci by @qihqi in #181
- Switch to NP from Jax to improve attention manager performance by @FanhaiLu1 in #184
- Fix too many positional arguments lint error by @FanhaiLu1 in #186
- Add model warmup and jax compilation cache flags by @vivianrwu in #187
- Fix ray recompilation and accuracy by @sixiang-google in #189
- Make jpt the default cli - remove other entry point scripts by @qihqi in #188
- Delete convert_checkpoints and helper classmethods. by @qihqi in #190
- add local tokenizer option for automated testing without hf token by @sixiang-google in #192
- feat: add quantize exclude layer flag by @tengomucho in #194
- Fix: correct quantization name filtering by @tengomucho in #196
New Contributors
- @sixiang-google made their first contribution in #132
- @richardsliu made their first contribution in #145
- @tengomucho made their first contribution in #157
- @yixinshi made their first contribution in #175
- @vivianrwu made their first contribution in #187
Full Changelog: jetstream-v0.2.3...jetstream-v0.2.4