Merge pull request #65 from Bowen12992/code_format
[format] Add pre-commit and format all the code
Bowen12992 committed Jun 14, 2024
2 parents 57d4612 + 168f5ac commit d3b5121
Showing 87 changed files with 475 additions and 362 deletions.
17 changes: 17 additions & 0 deletions .github/workflows/pre-commit.yml
@@ -0,0 +1,17 @@
name: code-format-check

on:
  push:
    branches: [ "master" ]
  pull_request:
    branches: [ "master" ]

jobs:
  pre-commit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - uses: pre-commit/action@v3.0.1
4 changes: 2 additions & 2 deletions .github/workflows/python-test.yaml
@@ -13,7 +13,7 @@ on:
jobs:
  container-unit-test:
    runs-on: [self-hosted, docker]
-    timeout-minutes: 30
+    timeout-minutes: 50
    container:
      image: localhost:5000/flag-gems-ci:v1.0
      ports:
@@ -30,7 +30,7 @@ jobs:
          CUDA_VISIBLE_DEVICES=2 pytest -s tests/test_blas_ops.py &
          CUDA_VISIBLE_DEVICES=3 pytest -s tests/test_reduction_ops.py &
          CUDA_VISIBLE_DEVICES=4 pytest -s tests/test_special_ops.py && wait
  container-model-test:
    runs-on: [self-hosted, docker]
    timeout-minutes: 5
30 changes: 30 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,30 @@
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
      - id: flake8
        language_version: python3.11
        args: ["--ignore=F405,E731,F403,W503,E722,E203", --max-line-length=120]
        # F405 : Name may be undefined, or defined from star imports: module
        # E731 : Do not assign a lambda expression, use a def
        # F403 : 'from module import *' used; unable to detect undefined names
        # W503 : Line break before binary operator
        # E722 : Do not use bare 'except'
        # E203 : Whitespace before ':'

  - repo: https://github.com/pycqa/isort
    rev: 5.12.0
    hooks:
      - id: isort
        language_version: python3.11
        args: ["--profile", "black"]

  - repo: https://github.com/psf/black.git
    rev: 23.7.0
    hooks:
      - id: black
        language_version: python3.11
      - id: black-jupyter
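With this configuration in place, the CI job above simply replays the same hooks. Contributors can reproduce the check locally with the standard pre-commit CLI (the commands below are ordinary pre-commit usage, not part of this diff):

```shell
pip install pre-commit
pre-commit install          # run the hooks automatically on each git commit
pre-commit run --all-files  # check the whole tree once, as CI does
```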
2 changes: 1 addition & 1 deletion LICENSE
@@ -175,4 +175,4 @@ Copyright © 2024 BAAI. All rights reserved.
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.

END OF TERMS AND CONDITIONS
56 changes: 28 additions & 28 deletions README.md
@@ -2,9 +2,9 @@

## Introduction

FlagGems is a high-performance general operator library implemented in [OpenAI Triton](https://github.com/openai/triton). It aims to provide a suite of kernel functions to accelerate LLM training and inference.

By registering with the ATen backend of PyTorch, FlagGems facilitates a seamless transition, allowing users to switch to the Triton function library without the need to modify their model code. Users can still utilize the ATen backend as usual while experiencing significant performance enhancement. The Triton language offers benefits in readability, user-friendliness and performance comparable to CUDA. This convenience allows developers to engage in the development of FlagGems with minimal learning investment.
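The ATen-level swap described above can be pictured with PyTorch's `torch.library` API. Below is a minimal illustrative sketch of the general registration mechanism, not FlagGems' actual code; the override function is a stand-in for a real Triton kernel launch.

```python
import torch

# Open the existing "aten" namespace to register implementation overrides.
_lib = torch.library.Library("aten", "IMPL")

def my_abs(x):
    # Stand-in for a Triton kernel: any callable matching the op's
    # signature can be registered for a dispatch key.
    return torch.where(x < 0, -x, x)

# aten::abs on CUDA tensors now dispatches to my_abs.
_lib.impl("abs", my_abs, "CUDA")

x = torch.randn(4, device="cuda")
print(torch.abs(x))  # routed through the override
```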


## Feature
@@ -49,50 +49,50 @@ def ge(x, y):
## Changelog

### v1.0
- support BLAS operators: addmm, bmm, mm
- support pointwise operators: abs, add, div, dropout, exp, gelu, mul, pow, reciprocal, relu, rsqrt, silu, sub, triu
- support reduction operators: cumsum, layernorm, mean, softmax

### v2.0
- support BLAS operator: mv, outer
- support pointwise operators: bitwise_and, bitwise_not, bitwise_or, cos, clamp, eq, ge, gt, isinf, isnan, le, lt, ne, neg, or, sin, tanh, sigmoid
- support reduction operators: all, any, amax, argmax, max, min, prod, sum, var_mean, vector_norm, cross_entropy_loss, group_norm, log_softmax, rms_norm
- support fused operators: skip_rms_norm, skip_layer_norm, gelu_and_mul, silu_and_mul, apply_rotary_position_embedding

## Quick Start

### Requirements

1. Triton >= 2.2.0
2. PyTorch >= 2.1.2
3. Transformers >= 4.40.2

### Installation

```shell
git clone https://github.com/FlagOpen/FlagGems.git
cd FlagGems
pip install .
```

## Usage

### Import

1. Enable permanently
```python
import flag_gems
flag_gems.enable()
```

2. Enable temporarily
```python
import flag_gems
with flag_gems.use_gems():
pass
```

3. Example
```python
import torch
import flag_gems
@@ -106,50 +106,50 @@ pip install .

### Execute

1. Test Operator Accuracy
- Run reference on cuda
```shell
cd tests
pytest test_xx_ops.py
```
- Run reference on cpu
```shell
cd tests
pytest test_xx_ops.py --device cpu
```

2. Test Model Accuracy
```shell
cd examples
pytest model_xx_test.py
```

3. Test Operator Performance
- Test CUDA performance
```shell
cd benchmark
pytest test_xx_perf.py -s
```
- Test end-to-end performance
```shell
cd benchmark
pytest test_xx_perf.py -s --mode cpu
```

4. Run tests with logging information
```shell
pytest program.py --log-cli-level debug
```
Not recommended in performance testing.

## Supported Operators

Operators will be implemented according to [OperatorList.md](https://github.com/FlagOpen/FlagGems/blob/master/OperatorList.md).

## Supported Models

- Bert-base-uncased
- Llama-2-7b

## Supported Platforms

56 changes: 28 additions & 28 deletions README_cn.md
@@ -2,9 +2,9 @@

## Introduction

FlagGems is a high-performance general operator library implemented in the [Triton programming language](https://github.com/openai/triton) released by OpenAI. It aims to provide large language models with a suite of operators usable from the PyTorch framework, accelerating model inference and training.

FlagGems overrides and rewrites PyTorch's backend aten operators to achieve a seamless, drop-in replacement of the operator library, allowing users to switch smoothly to the triton operator library without modifying their model code. FlagGems does not affect normal use of the aten backend and delivers a solid performance improvement. The Triton language gives the operator library better readability and ease of use while maintaining operator performance on par with CUDA, so developers can take part in developing and building FlagGems operators at a low learning cost.


## Feature
@@ -49,50 +49,50 @@ def ge(x, y):
## Changelog

### v1.0
- support BLAS operators: addmm, bmm, mm
- support pointwise operators: abs, add, div, dropout, exp, gelu, mul, pow, reciprocal, relu, rsqrt, silu, sub, triu
- support reduction operators: cumsum, layernorm, mean, softmax

### v2.0
- support BLAS operator: mv, outer
- support pointwise operators: bitwise_and, bitwise_not, bitwise_or, cos, clamp, eq, ge, gt, isinf, isnan, le, lt, ne, neg, or, sin, tanh, sigmoid
- support reduction operators: all, any, amax, argmax, max, min, prod, sum, var_mean, vector_norm, cross_entropy_loss, group_norm, log_softmax, rms_norm
- support fused operators: skip_rms_norm, skip_layer_norm, gelu_and_mul, silu_and_mul, apply_rotary_position_embedding

## Quick Start

### Requirements

1. Triton >= 2.2.0
2. PyTorch >= 2.1.2
3. Transformers >= 4.40.2

### Installation

```shell
git clone https://github.com/FlagOpen/FlagGems.git
cd FlagGems
pip install .
```

## Usage

### Import

1. Enable permanently (process-wide)
```python
import flag_gems
flag_gems.enable()
```

2. Enable temporarily
```python
import flag_gems
with flag_gems.use_gems():
pass
```

3. Example
```python
import torch
import flag_gems
@@ -106,49 +106,49 @@ pip install .

### Execute

1. Test operator accuracy
- Run the reference implementation on CUDA
```shell
cd tests/flag_gems
pytest op_accu_test.py
```
-CPU上运行参考实现
-CPU上运行参考实现
```shell
cd tests
pytest test_xx_ops.py --device cpu
```
2. Test model accuracy
```shell
cd examples
pytest model_xx_test.py
```

3. Test operator performance
- Test CUDA performance
```shell
cd benchmark
pytest test_xx_perf.py -s
```
- Test end-to-end performance
```shell
cd benchmark
pytest test_xx_perf.py -s --mode cpu
```

4. Run tests with logging information
```shell
pytest program.py --log-cli-level debug
```
Not recommended in performance testing.

## Supported Operators

Operators will be implemented step by step, following the order given in [OperatorList.md](https://github.com/FlagOpen/FlagGems/blob/master/OperatorList.md).

## Supported Models

- Bert-base-uncased
- Llama-2-7b

## Supported Platforms

10 changes: 6 additions & 4 deletions benchmark/performance_utils.py
@@ -1,9 +1,11 @@
+import time
+
import torch
import triton
-import time

import flag_gems
-from .conftest import CPU_MODE
+
+from .conftest import CPU_MODE

WARMUP = 10
REPETITION = 1000
@@ -42,8 +44,8 @@ def profile(self, op, *args):

    def run(self):
        print(f"Operator {self.op_name} Performance Test ({self.dtype})")
-        print(f"Size Torch Latency (ms) Gems Latency (ms)")
-        print(f"--------------------------------------------------")
+        print("Size Torch Latency (ms) Gems Latency (ms)")
+        print("--------------------------------------------------")
        for size in self.sizes:
            args = self.arg_func(self.dtype, self.batch, size)
            torch_perf = self.profile(self.torch_op, *args)
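The body of `profile()` is collapsed in this view. For orientation, a CUDA latency measurement consistent with the `WARMUP` and `REPETITION` constants above could look like the sketch below; the helper name and event-based timing are assumptions for illustration, not the file's actual implementation.

```python
import torch

WARMUP = 10
REPETITION = 1000

def cuda_latency_ms(op, *args):
    # Warm-up runs: trigger Triton JIT compilation and autotuning
    # outside the timed region.
    for _ in range(WARMUP):
        op(*args)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(REPETITION):
        op(*args)
    end.record()
    torch.cuda.synchronize()

    # elapsed_time() reports milliseconds for the whole loop; return the mean.
    return start.elapsed_time(end) / REPETITION
```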