draft : Caching device_info in device_ext #3

Open
wants to merge 120 commits into base: mixed_types_gemm
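The core change here, per the "caching device_info in device_ext to avoid extra queries…" and "rm get_work_group_size() by local cache for performance (#8286)" commits below, is to query the SYCL device once and keep the results on the device extension, so hot paths read a cached field instead of re-issuing device queries. A minimal sketch of that idea (the struct layout and member names are illustrative assumptions, not the upstream definitions):

```cpp
// Hedged sketch, not the upstream code: query device properties once at
// construction and reuse them afterwards.
#include <sycl/sycl.hpp>

struct device_info {
    size_t max_work_group_size = 0;
    size_t max_compute_units   = 0;
};

struct device_ext {
    sycl::device dev;
    device_info  info;   // filled once, reused on every kernel launch

    explicit device_ext(const sycl::device & d) : dev(d) {
        info.max_work_group_size = d.get_info<sycl::info::device::max_work_group_size>();
        info.max_compute_units   = d.get_info<sycl::info::device::max_compute_units>();
    }

    // replaces a per-call get_work_group_size() device query
    size_t work_group_size() const { return info.max_work_group_size; }
};
```

With something like this, work-group-size selection at kernel-launch time becomes a plain member read rather than a fresh `get_info` call.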
Commits (120)
07a3fc0
Removes multiple newlines at the end of files that is breaking the ed…
HanClinto Jul 2, 2024
3e2618b
Adding step to `clean` target to remove legacy binary names to reduce…
HanClinto Jul 2, 2024
a27152b
fix: add missing short command line argument -mli for multiline-input…
MistApproach Jul 2, 2024
b0536ed
caching device_info in device_ext to avoid extra queries + wg size ge…
OuadiElfarouki Jul 3, 2024
fadde67
Dequant improvements rebase (#8255)
Jul 3, 2024
6d5b0b4
minor updates to device_ext
OuadiElfarouki Jul 3, 2024
f8d6a23
fix typo (#8267)
foldl Jul 3, 2024
916248a
fix phi 3 conversion (#8262)
ngxson Jul 3, 2024
5f2d4e6
ppl : fix n_seq_max for perplexity (#8277)
slaren Jul 3, 2024
d23287f
Define and optimize RDNA1 (#8085)
daniandtheweb Jul 3, 2024
f619024
[SYCL] Remove unneeded semicolons (#8280)
Jul 4, 2024
20fc380
convert : fix gemma v1 tokenizer convert (#8248)
ggerganov Jul 4, 2024
402d6fe
llama : suppress unref var in Windows MSVC (#8150)
danbev Jul 4, 2024
f8c4c07
tests : add _CRT_SECURE_NO_WARNINGS for WIN32 (#8231)
danbev Jul 4, 2024
807b0c4
Inference support for T5 and FLAN-T5 model families (#5763)
fairydreaming Jul 4, 2024
985f03d
Merge branch 'master' into dev_ext_wg_query
OuadiElfarouki Jul 4, 2024
b0a4699
build(python): Package scripts with pip-0517 compliance
ditsuke Feb 27, 2024
b1c3f26
fix: Actually include scripts in build
ditsuke Feb 28, 2024
8219229
fix: Update script paths in CI scripts
ditsuke Mar 10, 2024
de14e2e
chore: ignore all __pychache__
ditsuke Jul 2, 2024
07786a6
chore: Fixup requirements and build
ditsuke Jul 2, 2024
01a5f06
chore: Remove rebase artifacts
ditsuke Jul 2, 2024
1e92001
doc: Add context for why we add an explicit pytorch source
ditsuke Jul 2, 2024
51d2eba
build: Export hf-to-gguf as snakecase
ditsuke Jul 4, 2024
6f63d64
tokenize : add --show-count (token) option (#8299)
danbev Jul 4, 2024
d7fd29f
llama : add OpenELM support (#7359)
icecream95 Jul 4, 2024
a38b884
cli: add EOT when user hit Ctrl+C (#8296)
ngxson Jul 4, 2024
f09b7cb
rm get_work_group_size() by local cache for performance (#8286)
NeoZhangJianyu Jul 5, 2024
e235b26
py : switch to snake_case (#8305)
ggerganov Jul 5, 2024
a9554e2
[SYCL] Fix WARP_SIZE=16 bug of Intel GPU (#8266)
luoyu-intel Jul 5, 2024
6c05752
contributing : update guidelines (#8316)
ggerganov Jul 5, 2024
aa5898d
llama : prefer n_ over num_ prefix (#8308)
ggerganov Jul 5, 2024
61ecafa
passkey : add short intro to README.md [no-ci] (#8317)
danbev Jul 5, 2024
5a7447c
readme : fix minor typos [no ci] (#8314)
pouwerkerk Jul 5, 2024
bcefa03
CUDA: fix MMQ stream-k rounding if ne00 % 128 != 0 (#8311)
JohannesGaessler Jul 5, 2024
d12f781
llama : streamline embeddings from "non-embedding" models (#8087)
iamlemec Jul 5, 2024
0a42380
CUDA: revert part of the RDNA1 optimizations (#8309)
daniandtheweb Jul 5, 2024
8e55830
CUDA: MMQ support for iq4_nl, iq4_xs (#8278)
JohannesGaessler Jul 5, 2024
2cccbaa
llama : minor indentation during tensor loading (#8304)
ggerganov Jul 5, 2024
148ec97
convert : remove AWQ remnants (#8320)
ggerganov Jul 5, 2024
1f3e1b6
Enabled more data types for oneMKL gemm_batch (#8236)
OuadiElfarouki Jul 5, 2024
1d894a7
cmake : add GGML_BUILD and GGML_SHARED macro definitions (#8281)
akemimadoka Jul 5, 2024
7ed03b8
llama : fix compile warning (#8304)
ggerganov Jul 5, 2024
be20e7f
Reorganize documentation pages (#8325)
ngxson Jul 5, 2024
213701b
Detokenizer fixes (#8039)
jaime-m-p Jul 5, 2024
87e25a1
llama : add early return for empty range (#8327)
danbev Jul 6, 2024
60d83a0
update main readme (#8333)
ngxson Jul 6, 2024
86e7299
added support for Authorization Bearer tokens when downloading model …
dwoolworth Jul 6, 2024
cb4d86c
server: Retrieve prompt template in /props (#8337)
bviksoe Jul 7, 2024
210eb9e
finetune: Rename an old command name in finetune.sh (#8344)
standby24x7 Jul 7, 2024
b81ba1f
finetune: Rename command name in README.md (#8343)
standby24x7 Jul 7, 2024
d39130a
py : use cpu-only torch in requirements.txt (#8335)
compilade Jul 7, 2024
b504008
llama : fix n_rot default (#8348)
ggerganov Jul 7, 2024
905942a
llama : support glm3 and glm4 (#8031)
youth123 Jul 7, 2024
f7cab35
gguf-hash: model wide and per tensor hashing using xxhash and sha1 (#…
mofosyne Jul 7, 2024
f1948f1
readme : update bindings list (#8222)
andy-tai Jul 7, 2024
4090ea5
ci : add checks for cmake,make and ctest in ci/run.sh (#8200)
AlexsCode Jul 7, 2024
a8db2a9
Update llama-cli documentation (#8315)
dspasyuk Jul 7, 2024
3fd62a6
py : type-check all Python scripts with Pyright (#8341)
compilade Jul 7, 2024
04ce3a8
readme : add supported glm models (#8360)
youth123 Jul 8, 2024
ffd0079
common : avoid unnecessary logits fetch (#8358)
kevmo314 Jul 8, 2024
6f0dbf6
infill : assert prefix/suffix tokens + remove old space logic (#8351)
ggerganov Jul 8, 2024
470939d
common : preallocate sampling token data vector (#8363)
kevmo314 Jul 8, 2024
fde13b3
feat: cuda implementation for `ggml_conv_transpose_1d` (ggml/854)
balisujohn Jul 2, 2024
6847d54
tests : fix whitespace (#0)
ggerganov Jul 8, 2024
2ee44c9
sync : ggml
ggerganov Jul 8, 2024
3f2d538
scripts : fix sync for sycl
ggerganov Jul 8, 2024
2ec846d
sycl : fix powf call in device code (#8368)
Alcpz Jul 8, 2024
c4dd11d
readme : fix web link error [no ci] (#8347)
b4b4o Jul 8, 2024
a130ecc
labeler : updated sycl to match docs and code refactor (#8373)
Alcpz Jul 8, 2024
7fdb6f7
flake.lock: Update (#8342)
ggerganov Jul 8, 2024
7d0e23d
gguf-py : do not use internal numpy types (#7472)
compilade Jul 9, 2024
9beb2dd
readme : fix typo [no ci] (#8389)
daghanerdonmez Jul 9, 2024
9925ca4
cmake : allow external ggml (#8370)
iboB Jul 9, 2024
5b0b8d8
sycl : Reenabled mmvq path for the SYCL Nvidia Backend (#8372)
Alcpz Jul 9, 2024
a03e8dd
make/cmake: LLAMA_NO_CCACHE -> GGML_NO_CCACHE (#8392)
JohannesGaessler Jul 9, 2024
e500d61
Deprecation warning to assist with migration to new binary names (#8283)
HanClinto Jul 9, 2024
fd560fe
Update README.md to fix broken link to docs (#8399)
andysalerno Jul 9, 2024
a59f8fd
Server: Enable setting default sampling parameters via command-line (…
HanClinto Jul 9, 2024
8f0fad4
py : fix extra space in convert_hf_to_gguf.py (#8407)
laik Jul 10, 2024
e4dd31f
py : fix converter for internlm2 (#8321)
RunningLeon Jul 10, 2024
a8be1e6
llama : add assert about missing llama_encode() call (#8400)
fairydreaming Jul 10, 2024
7a80710
msvc : silence codecvt c++17 deprecation warnings (#8395)
iboB Jul 10, 2024
cc61948
llama : C++20 compatibility for u8 strings (#8408)
iboB Jul 10, 2024
83321c6
gguf-py rel pipeline (#8410)
monatis Jul 10, 2024
0f1a39f
ggml : add AArch64 optimized GEMV and GEMM Q4 kernels (#5780)
Dibakar Jul 10, 2024
6b2a849
ggml : move sgemm sources to llamafile subfolder (#8394)
ggerganov Jul 10, 2024
f4444d9
[SYCL] Use multi_ptr to clean up deprecated warnings (#8256)
Jul 10, 2024
dd07a12
Name Migration: Build the deprecation-warning 'main' binary every tim…
HanClinto Jul 10, 2024
278d0e1
Initialize default slot sampling parameters from the global context. …
HanClinto Jul 11, 2024
7a221b6
llama : use F32 precision in Qwen2 attention and no FA (#8412)
ggerganov Jul 11, 2024
9a55ffe
tokenize : add --no-parse-special option (#8423)
compilade Jul 11, 2024
a977c11
gitignore : deprecated binaries
ggerganov Jul 11, 2024
808aba3
CUDA: optimize and refactor MMQ (#8416)
JohannesGaessler Jul 11, 2024
b078c61
cuda : suppress 'noreturn' warn in no_device_code (#8414)
danbev Jul 11, 2024
3686456
ggml : add NVPL BLAS support (#8329) (#8425)
nicholaiTukanov Jul 11, 2024
b549a1b
[SYCL] fix the mul_mat_id ut issues (#8427)
ClarkChin08 Jul 12, 2024
370b1f7
ggml : minor naming changes (#8433)
ggerganov Jul 12, 2024
71c1121
examples : sprintf -> snprintf (#8434)
ggerganov Jul 12, 2024
5aefbce
convert : remove fsep token from GPTRefactForCausalLM (#8237)
jpodivin Jul 12, 2024
8a4441e
docker : fix filename for convert-hf-to-gguf.py in tools.sh (#8441)
kriation Jul 12, 2024
c3ebcfa
server : ensure batches are either all embed or all completion (#8420)
iamlemec Jul 12, 2024
f532262
llama : suppress unary minus operator warning (#8448)
danbev Jul 12, 2024
6af51c0
main : print error on empty input (#8456)
ggerganov Jul 12, 2024
4e24cff
server : handle content array in chat API (#8449)
ggerganov Jul 12, 2024
c917b67
metal : template-ify some of the kernels (#8447)
ggerganov Jul 13, 2024
17eb6aa
vulkan : cmake integration (#8119)
bandoti Jul 13, 2024
fa79495
llama : fix pre-tokenization of non-special added tokens (#8228)
compilade Jul 14, 2024
e236528
gguf_hash.py: Add sha256 (#8470)
mofosyne Jul 14, 2024
73cf442
llama : fix Gemma-2 Query scaling factors (#8473)
ggerganov Jul 14, 2024
aaab241
flake.lock: Update (#8475)
ggerganov Jul 14, 2024
090fca7
pydantic : replace uses of __annotations__ with get_type_hints (#8474)
compilade Jul 14, 2024
bda62d7
Vulkan MMQ Fix (#8479)
0cc4m Jul 15, 2024
3dfda05
llama : de-duplicate deepseek2 norm
ggerganov Jul 15, 2024
16bdfa4
[SYCL] add concat through dim 1/2 (#8483)
airMeng Jul 15, 2024
fc690b0
docs: fix links in development docs [no ci] (#8481)
NikolaiLyssogor Jul 15, 2024
9104bc2
common : add --no-cont-batching arg (#6358)
ggerganov Jul 15, 2024
f17f39f
server: update README.md with llama-server --help output [no ci] (#8472)
maruel Jul 15, 2024
8fac431
ggml : suppress unknown pragma 'GCC' on windows (#8460)
danbev Jul 15, 2024
c3c57fb
Merge branch 'master' into dev_ext_wg_query
OuadiElfarouki Jul 16, 2024
18 changes: 18 additions & 0 deletions .devops/nix/package.nix
@@ -18,6 +18,7 @@
vulkan-headers,
vulkan-loader,
curl,
shaderc,
useBlas ? builtins.all (x: !x) [
useCuda
useMetalKit
@@ -89,6 +90,22 @@ let
ps.tiktoken
ps.torchWithoutCuda
ps.transformers

# server bench
ps.matplotlib

# server tests
ps.openai
ps.behave
ps.prometheus-client

# for examples/pydantic-models-to-grammar-examples.py
ps.docstring-parser
ps.pydantic

# for scripts/compare-llama-bench.py
ps.gitpython
ps.tabulate
]
);

@@ -130,6 +147,7 @@ let
vulkanBuildInputs = [
vulkan-headers
vulkan-loader
shaderc
];
in

2 changes: 1 addition & 1 deletion .devops/tools.sh
@@ -8,7 +8,7 @@ arg1="$1"
shift

if [[ "$arg1" == '--convert' || "$arg1" == '-c' ]]; then
python3 ./convert-hf-to-gguf.py "$@"
python3 ./convert_hf_to_gguf.py "$@"
elif [[ "$arg1" == '--quantize' || "$arg1" == '-q' ]]; then
./llama-quantize "$@"
elif [[ "$arg1" == '--run' || "$arg1" == '-r' ]]; then
2 changes: 0 additions & 2 deletions .github/ISSUE_TEMPLATE/config.yml
@@ -9,5 +9,3 @@ contact_links:
- name: Want to contribute?
url: https://github.com/ggerganov/llama.cpp/wiki/contribute
about: Head to the contribution guide page of the wiki for areas you can help with


4 changes: 3 additions & 1 deletion .github/labeler.yml
@@ -16,7 +16,9 @@ SYCL:
- any-glob-to-any-file:
- ggml/include/ggml-sycl.h
- ggml/src/ggml-sycl.cpp
- README-sycl.md
- ggml/src/ggml-sycl/**
- docs/backend/SYCL.md
- examples/sycl/**
Nvidia GPU:
- changed-files:
- any-glob-to-any-file:
6 changes: 4 additions & 2 deletions .github/workflows/build.yml
@@ -355,8 +355,10 @@ jobs:
- name: Dependencies
id: depends
run: |
sudo apt-get update
sudo apt-get install build-essential libvulkan-dev
wget -qO - https://packages.lunarg.com/lunarg-signing-key-pub.asc | sudo apt-key add -
sudo wget -qO /etc/apt/sources.list.d/lunarg-vulkan-jammy.list https://packages.lunarg.com/vulkan/lunarg-vulkan-jammy.list
sudo apt-get update -y
sudo apt-get install -y build-essential vulkan-sdk

- name: Build
id: cmake_build
38 changes: 38 additions & 0 deletions .github/workflows/python-type-check.yml
@@ -0,0 +1,38 @@
name: Python Type-Check

on:
push:
paths:
- '.github/workflows/python-type-check.yml'
- '**.py'
- '**/requirements*.txt'
pull_request:
paths:
- '.github/workflows/python-type-check.yml'
- '**.py'
- '**/requirements*.txt'

concurrency:
group: ${{ github.workflow }}-${{ github.head_ref && github.ref || github.run_id }}
cancel-in-progress: true

jobs:
python-type-check:
runs-on: ubuntu-latest
name: pyright type-check
steps:
- name: Check out source repository
uses: actions/checkout@v4
- name: Set up Python environment
uses: actions/setup-python@v5
with:
python-version: "3.11"
- name: Install Python dependencies
# TODO: use a venv
run: pip install -r requirements/requirements-all.txt
- name: Type-check with Pyright
uses: jakebailey/pyright-action@v2
with:
version: 1.1.370
level: warning
warnings: true
17 changes: 12 additions & 5 deletions .gitignore
@@ -47,6 +47,7 @@ build*
!build-info.cpp.in
!build-info.sh
!build.zig
!docs/build.md
/libllama.so
/llama-*
android-ndk-*
@@ -60,6 +61,11 @@ llama-batched-swift
out/
tmp/

# Deprecated

/main
/server

# CI

!.github/workflows/*.yml
@@ -98,13 +104,14 @@ examples/server/*.mjs.hpp

# Python

__pycache__
.venv
/Pipfile
dist
poetry.lock
/.venv
__pycache__/
*/poetry.lock
poetry.toml

# Nix
/result

# Test binaries
/tests/test-backend-ops
/tests/test-double-float
26 changes: 19 additions & 7 deletions CMakeLists.txt
@@ -42,13 +42,14 @@ endif()

option(BUILD_SHARED_LIBS "build shared libraries" ${BUILD_SHARED_LIBS_DEFAULT})

if (WIN32)
add_compile_definitions(_CRT_SECURE_NO_WARNINGS)
endif()

#
# option list
#

# general
option(LLAMA_CCACHE "llama: use ccache if available" ON)

# debug
option(LLAMA_ALL_WARNINGS "llama: enable all compiler warnings" ON)
option(LLAMA_ALL_WARNINGS_3RD_PARTY "llama: enable all compiler warnings in 3rd party libs" OFF)
@@ -73,7 +74,6 @@ option(LLAMA_CURL "llama: use libcurl to download model from an URL" OFF)
include(${CMAKE_CURRENT_SOURCE_DIR}/cmake/build-info.cmake)

# override ggml options
set(GGML_CCACHE ${LLAMA_CCACHE})
set(GGML_SANITIZE_THREAD ${LLAMA_SANITIZE_THREAD})
set(GGML_SANITIZE_ADDRESS ${LLAMA_SANITIZE_ADDRESS})
set(GGML_SANITIZE_UNDEFINED ${LLAMA_SANITIZE_UNDEFINED})
@@ -111,7 +111,10 @@ llama_option_depr(WARNING LLAMA_SYCL_F16 GGML_SYCL_F16)
# build the library
#

add_subdirectory(ggml)
if (NOT TARGET ggml)
add_subdirectory(ggml)
# ... otherwise assume ggml is added by a parent CMakeLists.txt
endif()
add_subdirectory(src)

#
@@ -129,7 +132,16 @@ set(LLAMA_INCLUDE_INSTALL_DIR ${CMAKE_INSTALL_INCLUDEDIR} CACHE PATH "Location o
set(LLAMA_LIB_INSTALL_DIR ${CMAKE_INSTALL_LIBDIR} CACHE PATH "Location of library files")
set(LLAMA_BIN_INSTALL_DIR ${CMAKE_INSTALL_BINDIR} CACHE PATH "Location of binary files")

get_directory_property(LLAMA_TRANSIENT_DEFINES COMPILE_DEFINITIONS)

# At the moment some compile definitions are placed within the ggml/src
# directory but not exported on the `ggml` target. This could be improved by
# determining _precisely_ which defines are necessary for the llama-config
# package.
#
get_directory_property(GGML_DIR_DEFINES DIRECTORY ggml/src COMPILE_DEFINITIONS)
get_target_property(GGML_TARGET_DEFINES ggml COMPILE_DEFINITIONS)
set(GGML_TRANSIENT_DEFINES ${GGML_TARGET_DEFINES} ${GGML_DIR_DEFINES})
get_target_property(GGML_LINK_LIBRARIES ggml LINK_LIBRARIES)

set_target_properties(llama PROPERTIES PUBLIC_HEADER ${CMAKE_CURRENT_SOURCE_DIR}/include/llama.h)
install(TARGETS llama LIBRARY PUBLIC_HEADER)
@@ -152,7 +164,7 @@ install(FILES ${CMAKE_CURRENT_BINARY_DIR}/llama-config.cmake
DESTINATION ${CMAKE_INSTALL_LIBDIR}/cmake/llama)

install(
FILES convert-hf-to-gguf.py
FILES convert_hf_to_gguf.py
PERMISSIONS
OWNER_READ
OWNER_WRITE
30 changes: 20 additions & 10 deletions CONTRIBUTING.md
@@ -1,14 +1,24 @@
# Contributing Guidelines
# Pull requests

## Checklist
- Always squash-merge the PR before merging
- Use the following format for your final commit: `<module> : <commit title> (#<issue_number>)`. For example: `utils : fix typo in utils.py (#1234)`
- Test your changes:
- Using the commands in the [`tests`](tests) folder. For instance, running the `./tests/test-backend-ops` command tests different backend implementations of the GGML library
- Execute [the full CI locally on your machine](ci/README.md) before publishing
- If the pull request contains only documentation changes (e.g., updating READMEs, adding new wiki pages), please add `[no ci]` to the commit title. This will skip unnecessary CI checks and help reduce build times
- Please rate the complexity of your PR (i.e. `Review Complexity : Low`, `Review Complexity : Medium`, `Review Complexity : High`). This makes it easier for maintainers to triage the PRs.
- The PR template has a series of review complexity checkboxes `[ ]` that [you can mark as](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/about-task-lists) `[X]` for your conveience

* Make sure your PR follows the [coding guidelines](https://github.com/ggerganov/llama.cpp/blob/master/README.md#coding-guidelines)
* Test your changes using the commands in the [`tests`](tests) folder. For instance, running the `./tests/test-backend-ops` command tests different backend implementations of the GGML library
* Execute [the full CI locally on your machine](ci/README.md) before publishing
# Coding guidelines

## PR formatting
- Avoid adding third-party dependencies, extra files, extra headers, etc.
- Always consider cross-compatibility with other operating systems and architectures
- Avoid fancy looking modern STL constructs, use basic `for` loops, avoid templates, keep it simple
- There are no strict rules for the code style, but try to follow the patterns in the code (indentation, spaces, etc.). Vertical alignment makes things more readable and easier to batch edit
- Clean-up any trailing whitespaces, use 4 spaces for indentation, brackets on the same line, `void * ptr`, `int & a`
- Naming usually optimizes for common prefix (see https://github.com/ggerganov/ggml/pull/302#discussion_r1243240963)
- Tensors store data in row-major order. We refer to dimension 0 as columns, 1 as rows, 2 as matrices
- Matrix multiplication is unconventional: [`C = ggml_mul_mat(ctx, A, B)`](https://github.com/ggerganov/llama.cpp/blob/880e352277fc017df4d5794f0c21c44e1eae2b84/ggml.h#L1058-L1064) means $C^T = A B^T \Leftrightarrow C = B A^T.$

![matmul](media/matmul.png)

* Please rate the complexity of your PR (i.e. `Review Complexity : Low`, `Review Complexity : Medium`, `Review Complexity : High`). This makes it easier for maintainers to triage the PRs.
- The PR template has a series of review complexity checkboxes `[ ]` that you can mark as `[X]` for your conveience. Refer to [About task lists](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/about-task-lists) for more information.
* If the pull request only contains documentation changes (e.g., updating READMEs, adding new wiki pages), please add `[no ci]` to the commit title. This will skip unnecessary CI checks and help reduce build times.
* When squashing multiple commits on merge, use the following format for your commit title: `<module> : <commit title> (#<issue_number>)`. For example: `utils : Fix typo in utils.py (#1234)`
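As a footnote to the `ggml_mul_mat` convention quoted in the CONTRIBUTING.md hunk above, here is a shape-only sketch using the public ggml C API; the sizes are arbitrary and chosen only for illustration:

```cpp
// Illustrates "C = ggml_mul_mat(ctx, A, B) means C^T = A B^T <=> C = B A^T":
// both operands share dimension 0 ("columns"); the result's ne is
// [rows of A, rows of B]. Error handling is omitted for brevity.
#include "ggml.h"
#include <cstdio>

int main() {
    struct ggml_init_params params = {
        /* .mem_size   = */ 16 * 1024 * 1024,
        /* .mem_buffer = */ nullptr,
        /* .no_alloc   = */ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // A: 4 columns x 3 rows,  B: 4 columns x 2 rows (dim 0 must match)
    struct ggml_tensor * A = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 3);
    struct ggml_tensor * B = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 2);

    // C gets ne = [3, 2], i.e. mathematically C = B * A^T
    struct ggml_tensor * C = ggml_mul_mat(ctx, A, B);
    printf("C: %lld x %lld\n", (long long) C->ne[0], (long long) C->ne[1]);

    ggml_free(ctx);
    return 0;
}
```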