Release 0.35.0 · matatonic/openedai-vision

Recent updates

Version 0.35.0

Update Molmo (tensorflow-cpu is no longer required) and add autocast so models can run in faster, smaller data types than float32.
New option: --use-double-quant enables double quantization with --load-in-4bit. It is slightly slower but uses slightly less VRAM.
Molmo 72B now runs in under 48GB of VRAM with --load-in-4bit --use-double-quant.
Add completion_tokens counts to the API, log tokens/s for most results, and make other compatibility improvements.
Include sample tokens/s data (A100) in vision.sample.env.
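Why double quantization saves a few GB can be shown with back-of-envelope arithmetic. The sketch below is an assumption-laden estimate, not how openedai-vision measures memory. It assumes the QLoRA/bitsandbytes-style scheme, in which 4-bit weights carry one fp32 scale per 64-weight block and double quantization re-quantizes those scales to 8 bits with one fp32 constant per 256 scales. The 72e9 parameter count is a round hypothetical figure. The estimate covers quantized weights only, not activations, the KV cache, the vision encoder, or any layers left unquantized.

```python
# Rough weight-memory estimate for 4-bit quantization with and without
# double quantization. Assumes the QLoRA-style block layout described above;
# real usage will be higher (activations, KV cache, unquantized layers).

def weight_gb(n_params: float, double_quant: bool) -> float:
    """Approximate quantized-weight storage in decimal GB (1e9 bytes)."""
    bits_per_weight = 4.0
    if double_quant:
        # 8-bit scale per 64 weights, plus one fp32 constant per 64*256 weights
        overhead_bits = 8 / 64 + 32 / (64 * 256)
    else:
        # one fp32 scale per 64 weights
        overhead_bits = 32 / 64
    total_bits = n_params * (bits_per_weight + overhead_bits)
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    n = 72e9  # hypothetical round parameter count for a "72B" model
    for dq in (False, True):
        print(f"double_quant={dq}: ~{weight_gb(n, dq):.1f} GB")
```

Under these assumptions the weights drop from about 40.5 GB to about 37.1 GB, a saving of roughly 0.37 bits per parameter. That is small compared with 4-bit quantization itself, which matches "a little less VRAM" and can be enough headroom to fit under a 48GB limit.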
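On the client side, a similar tokens/s figure can be derived from completion_tokens. The minimal sketch below assumes an OpenAI-style chat completion response with a `usage` object that contains `completion_tokens`. The helper names `tokens_per_second` and `timed_completion` are hypothetical, and `send` stands in for whatever function actually issues the request. Wall-clock timing includes network and prompt-processing time, so this figure will usually be lower than a rate measured on the server.

```python
import time
from typing import Any, Callable, Dict, Tuple


def tokens_per_second(response: Dict[str, Any], elapsed_s: float) -> float:
    """Completion tokens per second of wall-clock time.

    Assumes an OpenAI-style response body with usage.completion_tokens.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return response["usage"]["completion_tokens"] / elapsed_s


def timed_completion(
    send: Callable[[], Dict[str, Any]],
) -> Tuple[Dict[str, Any], float]:
    """Call a request function, time it, and return (response, tokens/s)."""
    start = time.perf_counter()
    response = send()
    elapsed = time.perf_counter() - start
    return response, tokens_per_second(response, elapsed)


if __name__ == "__main__":
    # Hypothetical response body, as an OpenAI-compatible server might return it.
    sample = {"usage": {"prompt_tokens": 500, "completion_tokens": 120, "total_tokens": 620}}
    print(tokens_per_second(sample, 4.0))  # 120 tokens / 4.0 s
```

With a real client, `send` would wrap the HTTP call. Keeping the timing outside the request function lets the same helper work with any OpenAI-compatible client library.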