- Worked at Kuaishou, Baidu, Meituan
- Beijing
- https://ageliss.github.io/gqjiang/
Starred repositories
HArmonizedSS / HASS
Forked from SafeAILab/EAGLE. Official implementation of "Learning Harmonized Representations for Speculative Sampling" (HASS)
No fortress, purely open ground. OpenManus is Coming.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators
FB (Facebook) + GEMM (General Matrix-Matrix Multiplication) - https://code.fb.com/ml-applications/fbgemm/
nvidia-modelopt is a unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for do…
[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".
[EMNLP'23, ACL'24] To speed up LLM inference and enhance LLMs' perception of key information, compresses the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
Awesome LLM compression research papers and tools.
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
Lets your Claude think
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
A throughput-oriented high-performance serving framework for LLMs
The official repo of Pai-Megatron-Patch for LLM & VLM large scale training developed by Alibaba Cloud.
An acceleration library that supports arbitrary bit-width combinatorial quantization operations
[ICML 2024 Best Paper] Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (https://arxiv.org/abs/2310.16834)
Qwen2.5-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
Code repo for the paper "SpinQuant: LLM quantization with learned rotations"
A framework for serving and evaluating LLM routers - save LLM costs without compromising quality
SGLang is a fast serving framework for large language models and vision language models.
📰 Must-read papers and blogs on Speculative Decoding ⚡️
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference