- Sunnyvale, CA
- https://www.linkedin.com/in/junliume/
- @junliume
Stars
A high-throughput and memory-efficient inference and serving engine for LLMs
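For context, vLLM exposes a small offline Python API; a minimal sketch (the model checkpoint is chosen here purely for illustration):

```python
from vllm import LLM, SamplingParams

# Load a model and generate; the engine handles batching and paged KV-cache memory.
llm = LLM(model="facebook/opt-125m")  # illustrative model choice
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["The ROCm software stack is"], params)
for out in outputs:
    print(out.outputs[0].text)
```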
Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
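A rough sketch of how such a quantized kernel is typically used as a drop-in attention replacement; the `sageattn` entry point and its `tensor_layout` argument are assumptions based on the project's README, not verified here:

```python
import torch
from sageattention import sageattn  # assumed package/entry point

# q, k, v in (batch, heads, seq_len, head_dim) layout, as with FlashAttention-style kernels.
q = torch.randn(1, 32, 2048, 128, dtype=torch.float16, device="cuda")
k = torch.randn(1, 32, 2048, 128, dtype=torch.float16, device="cuda")
v = torch.randn(1, 32, 2048, 128, dtype=torch.float16, device="cuda")

# Drop-in replacement for scaled-dot-product attention; quantization happens inside the kernel.
out = sageattn(q, k, v, tensor_layout="HND", is_causal=False)  # assumed signature
```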
A tool for generating information about the matrix multiplication instructions in AMD Radeon™ and AMD Instinct™ accelerators
FlashInfer: Kernel Library for LLM Serving
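A hedged sketch of a single-request decode call, the core op in LLM serving; the `single_decode_with_kv_cache` helper is an assumption about FlashInfer's Python surface:

```python
import torch
import flashinfer  # assumed Python binding

num_heads, head_dim, kv_len = 32, 128, 4096
# One query token attending over an existing KV cache (NHD layout assumed).
q = torch.randn(num_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_heads, head_dim, dtype=torch.float16, device="cuda")

out = flashinfer.single_decode_with_kv_cache(q, k, v)  # assumed API name
```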
HIP version of dietGPU for the ROCm platform, featuring a GPU implementation of a fast generalized ANS (asymmetric numeral system) entropy encoder and decoder, with extensions for lossless compress…
Any model. Any hardware. Zero compromise. Built with @ziglang / @openxla / MLIR / @bazelbuild
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficie…
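A minimal sketch of that Python API, assuming the high-level `LLM` entry point from recent releases (checkpoint name illustrative):

```python
from tensorrt_llm import LLM, SamplingParams  # assumed high-level API

# Builds (or loads) a TensorRT engine for the model, then runs optimized inference.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # illustrative checkpoint
params = SamplingParams(temperature=0.8, max_tokens=32)

for out in llm.generate(["Hello, my name is"], params):
    print(out.outputs[0].text)
```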
Dockerfiles for the various software layers defined in the ROCm software platform
Microsoft Quantum Development Kit Samples
Development repository for the Triton language and compiler
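For a flavor of the language, the canonical Triton vector-add kernel (standard tutorial material, shown here as a sketch):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
print(add(x, y))
```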
Clang build analysis tool using -ftime-trace
AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators
atamazov/MIOpen
Forked from ROCm/MIOpen. AMD's Machine Intelligence Library
Helps with dual booting: an Ubuntu tray application to reboot into different OSes or UEFI/BIOS
Legacy ROCm Software Platform Documentation