Recent updates
Version 0.35.0
- Update Molmo (tensorflow-cpu no longer required), and add autocast for faster, smaller types than float32.
- New option: `--use-double-quant` to enable double quantization with `--load-in-4bit`, a little slower for a little less VRAM.
- Molmo 72B will now run in under 48GB of VRAM using `--load-in-4bit --use-double-quant`.
- Add `completion_tokens` counts in API and logged tokens/s for most results, other compatibility improvements.
- Include sample tokens/s data (A100) in `vision.sample.env`
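
To give a rough sense of why double quantization saves "a little" VRAM, here is a back-of-envelope estimate of weight memory only. It assumes the scheme described in the QLoRA paper (one float32 scale per block of 64 weights, and with double quantization an 8-bit scale plus one float32 constant per 256 scales) and a round 72 billion parameters. These numbers are illustrative assumptions, not measurements from this project, and they ignore non-quantized layers, activations and the KV cache, so real usage will be higher.

```python
# Back-of-envelope weight-memory estimate for 4-bit quantization,
# with and without double quantization. All constants below are
# assumptions for illustration, not values read from this project.

PARAMS = 72e9        # assumed parameter count for a "72B" model
WEIGHT_BITS = 4      # bits per quantized weight
BLOCK = 64           # assumed weights per quantization block
SUPER_BLOCK = 256    # assumed scales per second-level block


def gigabytes(bits_per_param: float, params: float = PARAMS) -> float:
    """Convert a per-parameter bit cost into total decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9


# Without double quantization: one float32 scale per block of weights.
plain_overhead = 32 / BLOCK
# With double quantization: each scale is stored in 8 bits, plus one
# float32 constant shared by SUPER_BLOCK scales.
double_overhead = 8 / BLOCK + 32 / (BLOCK * SUPER_BLOCK)

plain_gb = gigabytes(WEIGHT_BITS + plain_overhead)
double_gb = gigabytes(WEIGHT_BITS + double_overhead)

print(f"4-bit weights, plain scales: {plain_gb:.1f} GB")
print(f"4-bit weights, double quant: {double_gb:.1f} GB")
print(f"estimated saving:            {plain_gb - double_gb:.1f} GB")
```

Under these assumptions the weights come to about 40.5 GB without double quantization and about 37.1 GB with it, which is consistent with the extra headroom needed to stay under 48GB.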
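
With `completion_tokens` reported, a client can work out its own tokens/s figure. The sketch below assumes the response carries an OpenAI-style `usage` object; the changelog only says that `completion_tokens` counts were added, so the exact response shape, the sample values and the helper name `tokens_per_second` are all hypothetical.

```python
from typing import Any


def tokens_per_second(usage: dict[str, Any], elapsed_s: float) -> float:
    """Generation rate from a usage block and wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return usage["completion_tokens"] / elapsed_s


# Hypothetical response body, shaped like an OpenAI chat completion.
sample_response = {
    "usage": {
        "prompt_tokens": 300,
        "completion_tokens": 120,
        "total_tokens": 420,
    }
}

# Pretend the request took 4 seconds end to end.
rate = tokens_per_second(sample_response["usage"], elapsed_s=4.0)
print(f"{rate:.1f} tokens/s")
```

A rate measured this way includes prompt processing and network time, so it will usually be lower than the tokens/s the server logs.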