Releases · AI-Hypercomputer/jetstream-pytorch
jetstream-v0.2.4
Highlight
The new command-line interface jpt becomes the main interface.
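A minimal usage sketch follows; the subcommands and flags shown are assumptions based on typical jpt usage, not an authoritative reference, so consult jpt --help for the real interface:

```
# List the supported models (assumed subcommand).
jpt list

# Start the inference server for a chosen model (assumed flag name).
jpt serve --model_id meta-llama/Meta-Llama-3-8B-Instruct
```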
What's Changed
- Add test for Mixtral model. by @wang2yn84 in #131
- Fix Mixtral quantization scaler axis when dimension > 2 by @sixiang-google in #132
- Add layer id in scope for each TransformerBlock layer by @FanhaiLu1 in #136
- Update README.md to state the limitation of accessing GCS when conver… by @wang2yn84 in #139
- Add left aligned cache support. by @wang2yn84 in #133
- Add option to enable the JAX profiler in run_server by @bvrockwell in #140
- Update benchmark command in README.md by @bhavya01 in #141
- Add server tests by @bvrockwell in #142
- Set JAX_PLATFORMS to "tpu, cpu" for ray worker by @richardsliu in #145
- Fix exception in ray_worker by @richardsliu in #144
- Make prefilling return first token for loadgen integration by @sixiang-google in #143
- Jetstream + RayServe deployment for interleave mode by @richardsliu in #146
- Make Ray engine and worker prefill return the first token by @richardsliu in #147
- Prototype a better UX by @qihqi in #134
- Add mlperf benchmark scripts in-tree. by @qihqi in #148
- Set accumulate type to bf16 in activation quant by @lsy323 in #152
- Return np instead of jax array for prefill result tokens by @FanhaiLu1 in #158
- Correct typo enbedding -> embedding by @tengomucho in #157
- Add v5e-8 Ray support by @FanhaiLu1 in #159
- Add newest llama-3 benchmarks by @qihqi in #160
- Update Ray version in Dockerfile and add v5 configs by @richardsliu in #161
- Handle v5e-8 in run_ray_serve_interleave by @richardsliu in #162
- Fix Ray engine crash on multihost by @richardsliu in #164
- Fix TPU head resource name for v4 and v5e by @richardsliu in #165
- Fixed exhausted bug between head and workers by @FanhaiLu1 in #163
- Optimize cache update. by @wang2yn84 in #151
- Add page attention manager and kvcache manager by @FanhaiLu1 in #167
- Add a script to measure speed of basic ops by @qihqi in #168
- Replace repeat kv with proper GQA handling (see the sketch after this list). by @wang2yn84 in #171
- Fix Ray engine crashes on multihost by @sixiang-google in #170
- Fix the performance regression with ragged attention on for llama2 7b. by @wang2yn84 in #172
- Add mixtral support to new CLI by @qihqi in #174
- Use kwargs to simplify the call sites a bit by @yixinshi in #175
- Add gemma support in better cli by @qihqi in #176
- Update Jetstream, add optional sampler args. by @qihqi in #177
- Update README for new CLI by @qihqi in #178
- Support End To End PagedAttention in JetStream by @FanhaiLu1 in #180
- Add offline perf ci by @qihqi in #181
- Switch from JAX to NumPy to improve attention manager performance by @FanhaiLu1 in #184
- Fix too many positional arguments lint error by @FanhaiLu1 in #186
- Add model warmup and jax compilation cache flags by @vivianrwu in #187
- Fix ray recompilation and accuracy by @sixiang-google in #189
- Make jpt the default cli - remove other entry point scripts by @qihqi in #188
- Delete convert_checkpoints and helper classmethods. by @qihqi in #190
- Add local tokenizer option for automated testing without an HF token by @sixiang-google in #192
- feat: add quantize exclude layer flag by @tengomucho in #194
- Fix: correct quantization name filtering by @tengomucho in #196
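For the GQA change in #171 above, here is a minimal sketch of how grouped-query attention can replace repeat_kv: rather than tiling the KV heads up to the query-head count, the query heads are grouped so each group shares one KV head. All names and shapes are illustrative assumptions, not the jetstream-pytorch implementation.

```python
import torch

def gqa_scores(q, k):
    """q: [batch, n_q_heads, q_len, head_dim]; k: [batch, n_kv_heads, kv_len, head_dim]."""
    b, n_q, q_len, d = q.shape
    n_kv = k.shape[1]
    group = n_q // n_kv  # number of query heads that share one KV head
    # Group the query heads instead of materializing repeated copies of k.
    q = q.reshape(b, n_kv, group, q_len, d)
    scores = torch.einsum("bngsd,bntd->bngst", q, k) / d**0.5
    return scores.reshape(b, n_q, q_len, -1)  # [batch, n_q_heads, q_len, kv_len]
```

Grouping instead of repeating avoids materializing n_q copies of the K tensor, which is where the memory-bandwidth savings during decode come from.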
New Contributors
- @sixiang-google made their first contribution in #132
- @richardsliu made their first contribution in #145
- @tengomucho made their first contribution in #157
- @yixinshi made their first contribution in #175
- @vivianrwu made their first contribution in #187
Full Changelog: jetstream-v0.2.3...jetstream-v0.2.4
jetstream-v0.2.3
What's Changed
- Enable JAX profiler server when running with Ray by @FanhaiLu1 in #112
- Add README section on interleaved multi-host serving with Ray by @FanhaiLu1 in #114
- Fix conversion bug by @yeandy in #116
- Integrate disaggregated serving with JetStream by @FanhaiLu1 in #117
- Support HF LLaMA ckpt conversion by @lsy323 in #118
- Add guide on adding HF ckpt conversion support by @lsy323 in #119
- Add support for Llama3-70b by @bhavya01 in #101
- Fix convert_checkpoint.py for hf and gemma by @qihqi in #121
- Mixtral enablement. by @wang2yn84 in #120
- Add install script for GPU by @qihqi in #122
- Add activation quantization support to per-channel quantized linear layers by @lsy323 in #105
- Remove JSON config mangling for Gemma ckpt by @lsy323 in #124
- Add different token sampling algorithms to the decoder (see the sketch after this list). by @bvrockwell in #123
- Add lock in prefill and generate to prevent starvation by @FanhaiLu1 in #126
- Update submodules, prepare for releasing v0.2.4 by @qihqi in #127
- Update README.md by @qihqi in #128
- Update summary.md by @qihqi in #125
- Update README.md by @bhavya01 in #129
- Make sure GPU works by @qihqi in #130
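As a companion to the sampling change in #123 above, here is a minimal sketch of common decoder sampling strategies (greedy, temperature, top-k). It assumes plain NumPy; the actual function names and flags in jetstream-pytorch may differ.

```python
import numpy as np

def sample(logits, strategy="greedy", temperature=1.0, top_k=0, rng=None):
    """logits: 1-D array of unnormalized scores over the vocab; returns a token id."""
    rng = rng or np.random.default_rng()
    if strategy == "greedy":
        return int(np.argmax(logits))
    logits = logits / max(temperature, 1e-6)
    if top_k > 0:
        # Mask everything outside the k highest-scoring tokens.
        kth_largest = np.sort(logits)[-top_k]
        logits = np.where(logits < kth_largest, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```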
New Contributors
- @yeandy made their first contribution in #116
- @bvrockwell made their first contribution in #123
Full Changelog: jetstream-v0.2.2...jetstream-v0.2.3
jetstream-v0.2.2
jetstream-pytorch 0.2.2
- Miscellaneous bug fixes.
- Support for the Tiktoken tokenizer (see the example after this list)
- Support for the Gemma 2B model (running data parallel)
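For the Tiktoken support above, a standalone round-trip with the tiktoken package looks like the sketch below; the encoding name is illustrative, and jetstream-pytorch's own tokenizer wiring is not shown.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding name chosen for illustration
ids = enc.encode("JetStream PyTorch")       # text -> token ids
assert enc.decode(ids) == "JetStream PyTorch"
```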
jetstream-v0.2.1
Key Changes
- Support Llama3
- Support Gemma
- MVP for Ray multi-host serving on a single pod slice
- Enable unit tests and format checks
jetstream-v0.2.0
Release JetStream PyTorch with JetStream v0.2.0 for inference