
Starred repositories


Official Implementation of "Learning Harmonized Representations for Speculative Sampling" (HASS)

Python 23 1 Updated Mar 14, 2025

No fortress, purely open ground. OpenManus is Coming.

Python 41,854 7,110 Updated Apr 1, 2025

Redis for LLMs

Python 703 81 Updated Apr 4, 2025

Tile primitives for speedy kernels

Cuda 2,223 133 Updated Apr 4, 2025

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 5,142 541 Updated Apr 3, 2025

FlashMLA: Efficient MLA decoding kernels

C++ 11,403 817 Updated Mar 1, 2025

Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators

C++ 374 169 Updated Apr 4, 2025

FB (Facebook) + GEMM (General Matrix-Matrix Multiplication) - https://code.fb.com/ml-applications/fbgemm/

C++ 1,286 556 Updated Apr 4, 2025
Python 185 24 Updated Oct 1, 2024

nvidia-modelopt is a unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for do…

Python 840 63 Updated Apr 3, 2025

[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".

Python 445 51 Updated Mar 28, 2025

[EMNLP'23, ACL'24] To speed up LLM inference and enhance LLMs' perception of key information, compress the prompt and KV-cache, achieving up to 20x compression with minimal performance loss.

Python 4,996 285 Updated Mar 11, 2025

Awesome LLM compression research papers and tools.

1,450 93 Updated Apr 4, 2025

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

C++ 2,976 193 Updated Apr 3, 2025

Primarily documents knowledge and interview questions relevant to large language model (LLM) algorithm (application) engineers.

HTML 6,638 751 Updated Oct 22, 2024

This repository primarily documents interview questions for large language model (LLM) algorithm engineers.

1,894 133 Updated Dec 26, 2024

Let your Claude think

TypeScript 14,886 1,732 Updated Mar 10, 2025
C++ 324 30 Updated Jan 20, 2025

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

Python 155 9 Updated Oct 30, 2024

Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM

Python 1,166 107 Updated Apr 3, 2025

A throughput-oriented high-performance serving framework for LLMs

Cuda 788 32 Updated Sep 21, 2024

The official repo of Pai-Megatron-Patch for LLM & VLM large scale training developed by Alibaba Cloud.

Python 974 146 Updated Apr 3, 2025

An acceleration library that supports arbitrary bit-width combinatorial quantization operations

C++ 221 21 Updated Sep 30, 2024

[ICML 2024 Best Paper] Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (https://arxiv.org/abs/2310.16834)

Python 540 61 Updated Feb 29, 2024

Qwen2.5-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.

Jupyter Notebook 9,493 657 Updated Mar 27, 2025

Code repo for the paper "SpinQuant LLM quantization with learned rotations"

Python 249 36 Updated Feb 14, 2025

A framework for serving and evaluating LLM routers - save LLM costs without compromising quality

Python 3,784 290 Updated Aug 10, 2024

SGLang is a fast serving framework for large language models and vision language models.

Python 12,858 1,432 Updated Apr 4, 2025

📰 Must-read papers and blogs on Speculative Decoding ⚡️

671 33 Updated Mar 27, 2025

[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Cuda 268 28 Updated Nov 22, 2024