Add extension near to genai library #461

Closed
239 commits
ba91fde
initial generate
pavel-esir Mar 26, 2024
9d85a0e
LLM pipeline
pavel-esir Mar 28, 2024
b21c6c1
Added calculating for several batches
pavel-esir Apr 2, 2024
e52e90d
Greedy search works
pavel-esir Apr 3, 2024
745a804
rename to GenerationConfig
pavel-esir Apr 4, 2024
8895ed0
Add fluent interface
pavel-esir Apr 5, 2024
b24977d
Update text_generation/causal_lm/cpp/generate_pipeline/generate_pipel…
pavel-esir Apr 5, 2024
c933ca0
cosmetic changes in main
pavel-esir Apr 5, 2024
c43e901
greedy search with batches and left padding works
pavel-esir Apr 10, 2024
5a914f6
combine LLModel with LLMPipeline
pavel-esir Apr 10, 2024
c1e0c9d
wip: enable calling tokenize/detokenize for LLMPipeline
pavel-esir Apr 10, 2024
8d66353
add callback to generate
pavel-esir Apr 11, 2024
fa12da7
cleanup generate_sample.cpp
pavel-esir Apr 11, 2024
5ceb9d5
add speculative decoding
pavel-esir Apr 16, 2024
a5083c7
separate Tokenizer
pavel-esir Apr 17, 2024
7692160
wip
pavel-esir Apr 23, 2024
d3f6339
add start/stop conversation
pavel-esir Apr 24, 2024
3776433
use text in streamer instead of raw tokens
pavel-esir Apr 23, 2024
964a5e8
add apply_chat_template
pavel-esir Apr 23, 2024
e57aa4c
fix difference between accumulating conversation as text and keeping …
pavel-esir Apr 26, 2024
d0c1341
cleanup
pavel-esir Apr 26, 2024
8dcea1f
add Jinja2cpp submodule
pavel-esir Apr 26, 2024
754a462
add ov namespace
pavel-esir May 2, 2024
9b19c6f
return scores for batched outputs
pavel-esir May 2, 2024
9bf6caa
add AnyMap
pavel-esir May 3, 2024
39fd73c
Merge remote-tracking branch 'upstream/master' into generate_pipeline
pavel-esir May 3, 2024
63d8f6d
cleanup
pavel-esir May 3, 2024
a833760
before moving to pimpl
pavel-esir May 6, 2024
1681654
move to separate include & src
pavel-esir May 6, 2024
9fe73c6
pimpl implementation
pavel-esir May 6, 2024
053708f
temporary disable jinja2cpp
pavel-esir May 6, 2024
bd6849a
add python api draft, hide implementations from user & refactor imple…
pavel-esir May 7, 2024
62c471e
extract decoding methods to separate files
pavel-esir May 7, 2024
f1d54f4
extended python api, added python api test
pavel-esir May 7, 2024
3c82e11
remove call method
pavel-esir May 8, 2024
5543cee
init
Wovchena May 6, 2024
abb8835
add_subdirectory
Wovchena May 7, 2024
0998abc
add files
Wovchena May 8, 2024
15492c4
add __init__.py
Wovchena May 8, 2024
005d3fb
removed set_streamer
pavel-esir May 8, 2024
cc44bc8
use std::optional
pavel-esir May 8, 2024
d8cab05
started to add Readme docs
pavel-esir May 8, 2024
2535394
reoder Readme
pavel-esir May 8, 2024
95c1bfb
rm generate_pipeline/python
Wovchena May 9, 2024
4510f71
update Readme; cleanup LLMPipeline and add docstring
pavel-esir May 9, 2024
507bc49
refactor folder structure
pavel-esir May 9, 2024
af747d4
cleanup generation_config and ov::Tokenizer
pavel-esir May 9, 2024
c6620d9
move includes to a separate openvino/genai folder
pavel-esir May 10, 2024
59c3e0b
Merge branch 'generate_pipeline' into package
Wovchena May 10, 2024
be84345
align names
Wovchena May 10, 2024
bced64a
Dont modify text_generation/causal_lm/cpp/CMakeLists.txt
Wovchena May 10, 2024
f4e82b6
rm -r text_generation/causal_lm/cpp/generate_pipeline/python-bindings/
Wovchena May 10, 2024
5b2b0ca
fix build
Wovchena May 10, 2024
0dd8f59
add tokenizers only once
Wovchena May 10, 2024
23638ff
change cmake.source-dir
Wovchena May 10, 2024
d8c5349
restore openvino/genai inits
Wovchena May 10, 2024
24faefe
Integrate JinjaCpp
ilya-lavrenov May 10, 2024
598dda3
install genai lib
Wovchena May 10, 2024
f274b93
Merge pull request #2 from ilya-lavrenov/jinja-integration-pavel
pavel-esir May 10, 2024
02d0eae
import openvino for win and lin
Wovchena May 10, 2024
e6695f3
Merge branch 'generate_pipeline' into package
Wovchena May 10, 2024
a27c5a7
put the line back
Wovchena May 10, 2024
0849c41
Added cmake build type before project clause
ilya-lavrenov May 10, 2024
34cddff
one line properties
Wovchena May 10, 2024
023cf1e
Merge pull request #3 from ilya-lavrenov/cmake-build-type
pavel-esir May 10, 2024
6a5d750
Export API symbols
ilya-lavrenov May 10, 2024
27f385e
Merge pull request #4 from ilya-lavrenov/generate_pipeline
pavel-esir May 10, 2024
a9332f0
Merge branch 'generate_pipeline' into package
Wovchena May 10, 2024
9ef488c
rename
Wovchena May 10, 2024
4fad7d5
add .github/workflows/genai_lib.yml
Wovchena May 10, 2024
51e03a2
on: pull_request
Wovchena May 10, 2024
e23a7bb
spelling
Wovchena May 10, 2024
fc5b753
install openvino
Wovchena May 10, 2024
09f8806
add syntacis sugar for geenrate, optimize value passing by reference
pavel-esir May 10, 2024
af22a8a
remove speculative decoding
pavel-esir May 11, 2024
e7db7e8
update
Wovchena May 13, 2024
f279363
add rpath
Wovchena May 13, 2024
83d77c8
add rpath to libopenvino.so
Wovchena May 13, 2024
167f924
py_generate_pipeline
Wovchena May 13, 2024
a111a3f
reorder tokenizer.cpp, add comments to BaseStreamer
pavel-esir May 11, 2024
813d80a
install centos7
Wovchena May 13, 2024
6227b65
install nightly
Wovchena May 13, 2024
74fc107
Merge branch 'generate_pipeline' into package
Wovchena May 13, 2024
9b83a7e
propagate _GLIBCXX_USE_CXX11_ABI
Wovchena May 13, 2024
2d15752
Populate python with the libraries to allow skipping wheel installation
Wovchena May 13, 2024
8025554
run setupvars
Wovchena May 13, 2024
2b14286
update .gitignore, install numpy
Wovchena May 13, 2024
1c11bc7
quotes
Wovchena May 13, 2024
e7fce82
fix PYTHONPATH
Wovchena May 13, 2024
64608d1
fix PYTHONPATH
Wovchena May 13, 2024
43b87c7
quotes
Wovchena May 13, 2024
fef9674
reorder vars
Wovchena May 14, 2024
b21286c
openvino.genai-
Wovchena May 14, 2024
d393f89
Merge pull request #1 from Wovchena/package
pavel-esir May 14, 2024
2b8954d
Merge branch 'master' into generate_pipeline
pavel-esir May 14, 2024
11e872b
Update CMakeLists.txt
pavel-esir May 14, 2024
442dcbf
move group beam searcher to src
pavel-esir May 13, 2024
53d534e
Update .gitignore (#5)
Wovchena May 15, 2024
dcb4b86
Merge remote-tracking branch 'origin/generate_pipeline' into generate…
pavel-esir May 15, 2024
72c045e
fixed difference between old greddy sample and generate
pavel-esir May 15, 2024
11fbaa2
tokenizer minor fixes
pavel-esir May 15, 2024
264e99f
apply comments
pavel-esir May 15, 2024
11032b4
remove accidentally added test_cpp_samples.py
pavel-esir May 15, 2024
7d0c80b
fix build
pavel-esir May 15, 2024
2e3cd73
fix causal_lm comparison error
pavel-esir May 15, 2024
e7fa974
fix different outputs
pavel-esir May 15, 2024
78d0b88
Archive (#7)
Wovchena May 20, 2024
5eb59ea
add tests
pavel-esir May 16, 2024
ce4eb00
Apply suggestions from code review
pavel-esir May 22, 2024
aa90e9d
names correction
pavel-esir May 22, 2024
d843229
enable
Wovchena May 22, 2024
2c1d1ef
libtbb-dev
Wovchena May 22, 2024
57ca2d4
move
Wovchena May 22, 2024
37844c9
slash
Wovchena May 22, 2024
5cff21e
install
Wovchena May 22, 2024
561b55a
core_genai_dev
Wovchena May 22, 2024
260d913
remove export
Wovchena May 22, 2024
54cbb52
update URL_HASH
Wovchena May 22, 2024
82a9449
remove submodules from .gitmodules
Wovchena May 22, 2024
5a0079b
install openvino_tokenizers for genai_python_lib
pavel-esir May 22, 2024
73e4312
Update Jinja2Cpp fork commit
Wovchena May 22, 2024
75b7c37
remove group_beam_searcher.hpp; copy fast_tokenizer
pavel-esir May 22, 2024
b6cf954
rreorganaise components
Wovchena May 22, 2024
aaf5c78
add SOVERSION, and requirements-build.txt
Wovchena May 22, 2024
5537d3b
repalce SKBUILD with EXCLUDE_FROM_ALL because the effect is the same
Wovchena May 22, 2024
9966be4
fix NAMELINK_COMPONENT
Wovchena May 22, 2024
2486e53
remove extraline
Wovchena May 22, 2024
7953c0f
Merge branch 'generate_pipeline' into fix-archive
Wovchena May 22, 2024
786eac7
add soft restrictions
Wovchena May 22, 2024
7324da9
Fix build to unblock packaging
Wovchena May 22, 2024
5577e84
improve naming
Wovchena May 23, 2024
b679fc7
install samples
Wovchena May 23, 2024
26f9fe1
remove quotes
Wovchena May 23, 2024
1dcd40b
use main target name because an alias can't be specified in cmake --t…
Wovchena May 23, 2024
8c00ccb
define CMAKE_BUILD_PARALLEL_LEVEL
Wovchena May 23, 2024
61fba58
Ensure ./requirements-build.txt won't outdate
Wovchena May 23, 2024
d78fa3b
Use ./requirements-build.txt in python lib build
Wovchena May 23, 2024
757b738
Add missing &&
Wovchena May 23, 2024
51ace23
Test Debug
Wovchena May 23, 2024
e53c525
add matrix for windows_genai_package
Wovchena May 23, 2024
73ac7b1
openvino_tokenizers from form
Wovchena May 23, 2024
e7e50cb
update openvino_tokenizers
Wovchena May 23, 2024
3339407
update openvino_tokenizers
Wovchena May 23, 2024
9b5b915
update openvino_tokenizers
Wovchena May 23, 2024
1fe85b9
revert openvino_tokenizers
Wovchena May 23, 2024
7e23930
tokenizers from fork
Wovchena May 23, 2024
62f5e34
update tokenizers
Wovchena May 23, 2024
63262d7
centos7_2024.2.0.dev
Wovchena May 23, 2024
2d5fc6f
copy target
Wovchena May 23, 2024
6f53005
revert tokenizers
Wovchena May 23, 2024
d8e5bf9
reapply useful changes
Wovchena May 23, 2024
9866f5c
copy so only
Wovchena May 23, 2024
2c691c3
Update tokenizers, centos7_2024.2.0.dev
Wovchena May 23, 2024
3507deb
single thread
Wovchena May 23, 2024
70f1177
Fix archive (#8)
Wovchena May 23, 2024
da729ba
Apply suggestions from code review
pavel-esir May 24, 2024
28c313b
add groups to GenerationConfig docstring
pavel-esir May 24, 2024
c395a8d
refactor namespace ov::* -> ov::genai::*
pavel-esir May 24, 2024
bbc8c25
removed ov_tokenizers_path when ov::gena::Tokenizer is passed to LLMP…
pavel-esir May 24, 2024
18f8ca8
ubuntu22
Wovchena May 24, 2024
3e914c5
nightyl
Wovchena May 24, 2024
ad49d94
--pre --extra-index-url
Wovchena May 24, 2024
963a520
update tokenizers
Wovchena May 24, 2024
72bede7
space
Wovchena May 24, 2024
e8f4cbe
move --pre --extra-index-url https://storage.openvinotoolkit.org/simp…
Wovchena May 24, 2024
5afd763
release tokenizers
Wovchena May 24, 2024
b47d6d5
merge
Wovchena May 24, 2024
7a28144
downgrade tokenizers
Wovchena May 24, 2024
b7493a1
downgrade
Wovchena May 24, 2024
ee97729
two steps
Wovchena May 24, 2024
0a5d765
downgrade tokenizers
Wovchena May 24, 2024
f4e444f
dont setupvars
Wovchena May 24, 2024
8bcf504
source
Wovchena May 24, 2024
f457faf
fix
Wovchena May 24, 2024
7a2986a
submodule
Wovchena May 24, 2024
25ea88c
releases/2024/2 tokenizers
Wovchena May 25, 2024
2f88d0a
fix-2
Wovchena May 25, 2024
829b40e
rebase
Wovchena May 25, 2024
3a7db44
use make
Wovchena May 25, 2024
b5e5800
comment
Wovchena May 25, 2024
72a041c
CMAKE_GENERATOR=Unix Makefiles
Wovchena May 25, 2024
6116bd1
update openvino
Wovchena May 27, 2024
959f0c2
space
Wovchena May 27, 2024
312e0ae
optimum-cli from fork
Wovchena May 27, 2024
0286c96
different commit
Wovchena May 27, 2024
78666da
from branch
Wovchena May 27, 2024
140b59c
Merge branch 'generate_pipeline' into fix-abi
Wovchena May 27, 2024
a413be8
remove exrtra-index for SD
Wovchena May 27, 2024
de3a17e
reorder pip install
Wovchena May 27, 2024
4adaa33
revert unwanted changes
Wovchena May 27, 2024
0d7f893
Ubuntu-22
Wovchena May 27, 2024
9e37273
Add sampling decoding (#6)
as-suvorov May 27, 2024
82a7823
openvino_tokenizers~=2024.2.0.0
Wovchena May 27, 2024
323e7ac
remove -pre . --extra-index-url https://storage.openvinotoolkit.org/s…
Wovchena May 27, 2024
95e5a01
upgrade to prerelease
Wovchena May 27, 2024
4f22d86
revert requirements.txt
Wovchena May 27, 2024
d94ba2e
remove --pre, setupvars
Wovchena May 27, 2024
501cb8b
get openvino_tokenizers._ext_path
Wovchena May 27, 2024
336036a
take release pybind, fix soversion, and tokenizers folder
Wovchena May 27, 2024
3fd374f
spelling
Wovchena May 27, 2024
2eaf369
dont copy libs
Wovchena May 27, 2024
07e2385
put ov_tokenizers_path back
Wovchena May 27, 2024
7a79f8d
GENAI_BUILD_DIR=../../build
Wovchena May 28, 2024
2705867
Add extension near to genai library
Wovchena May 28, 2024
ce79a0e
include openvino/util/file_util.hpp
Wovchena May 28, 2024
f4d6c1f
get_absolute_file_path
Wovchena May 28, 2024
d99aca1
remove namepsace
Wovchena May 28, 2024
e375901
# include <limits.h>
Wovchena May 28, 2024
f9d9b18
more than one .
Wovchena May 28, 2024
2ac081c
till next dot
Wovchena May 28, 2024
0e18c9c
_ext_path
Wovchena May 28, 2024
4c10755
-1
Wovchena May 28, 2024
38076fc
+1
Wovchena May 28, 2024
d9edf2d
+1
Wovchena May 28, 2024
8c44fdd
path
Wovchena May 28, 2024
3030852
ext name
Wovchena May 28, 2024
dc885bb
with_openvino_tokenizers
Wovchena May 28, 2024
6856b5e
char
Wovchena May 28, 2024
5b5fd01
revert test
Wovchena May 28, 2024
014e9ee
tokenizers from form
Wovchena May 28, 2024
adc1f72
update fork
Wovchena May 28, 2024
4b806c0
lib
Wovchena May 28, 2024
cb81756
fix cherry-pick
Wovchena May 28, 2024
0110e51
update fork
Wovchena May 28, 2024
c97f2f8
dont spoil source dir
Wovchena May 28, 2024
934e438
Generator expressions to disable appending a per-configuration subdir…
Wovchena May 28, 2024
c976ff8
remove versions
Wovchena May 28, 2024
9483cb6
fix path
Wovchena May 28, 2024
ebad130
try
Wovchena May 28, 2024
38cbffd
try
Wovchena May 28, 2024
dc80b54
verbose
Wovchena May 28, 2024
25059f0
spelling
Wovchena May 28, 2024
3c5e130
rename file
Wovchena May 28, 2024
60cb221
remove build.tool-args
Wovchena May 28, 2024
c52dc22
Release
Wovchena May 28, 2024
9ef686a
dont speciify targets
Wovchena May 28, 2024
81ec069
Fix library loading by updating dependencies (#10)
Wovchena May 28, 2024
13ad3d2
Merge branch 'generate_pipeline' into add-extension-near-to-genai-lib…
Wovchena May 29, 2024
24e0e41
revert 81ec069
Wovchena May 29, 2024
4 changes: 4 additions & 0 deletions .github/dependabot.yml
@@ -1,5 +1,9 @@
version: 2
updates:
- package-ecosystem: "pip"
directory: "./"
schedule:
interval: "weekly"
- package-ecosystem: "pip"
directory: "image_generation/stable_diffusion_1_5/cpp/scripts/"
schedule:
129 changes: 65 additions & 64 deletions .github/workflows/causal_lm_cpp.yml

Large diffs are not rendered by default.

63 changes: 63 additions & 0 deletions .github/workflows/genai_package.yml
@@ -0,0 +1,63 @@
name: genai_package
on: pull_request
jobs:
ubuntu_genai_package:
strategy:
matrix:
build-type: [Release, Debug]
runs-on: ubuntu-20.04
steps:
- uses: actions/checkout@v4
with:
submodules: recursive
- uses: actions/setup-python@v4
with:
python-version: 3.8
- run: mkdir ./ov/
- run: curl https://storage.openvinotoolkit.org/repositories/openvino/packages/pre-release/2024.2.0rc1/linux/l_openvino_toolkit_ubuntu20_2024.2.0.dev20240524_x86_64.tgz | tar --directory ./ov/ --strip-components 1 -xz
- run: sudo ./ov/install_dependencies/install_openvino_dependencies.sh
- run: sudo apt-get install libtbb-dev
- run: source ./ov/setupvars.sh && cmake -DCMAKE_BUILD_TYPE=${{ matrix.build-type }} -S ./ -B ./build/
- run: source ./ov/setupvars.sh && cmake --build ./build/ --config ${{ matrix.build-type }} --target package -j
- run: source ./ov/setupvars.sh && cmake --install ./build/ --config ${{ matrix.build-type }} --prefix ov
- run: ov/samples/cpp/build_samples.sh -i ${{ github.workspace }}/s\ pace
if: ${{ 'Release' == matrix.build-type }} # build_samples enforces Release build
- run: source ./ov/setupvars.sh && python -m pip install --upgrade-strategy eager -r text_generation/causal_lm/cpp/requirements.txt
if: ${{ 'Release' == matrix.build-type }}
- run: source ./ov/setupvars.sh && python -m pip install ./thirdparty/openvino_tokenizers/[transformers] --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/pre-release
if: ${{ 'Release' == matrix.build-type }}
- run: source ./ov/setupvars.sh && optimum-cli export openvino --trust-remote-code --weight-format fp16 --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 TinyLlama-1.1B-Chat-v1.0
if: ${{ 'Release' == matrix.build-type }}
- run: source ./ov/setupvars.sh && timeout 50s ${{ github.workspace }}/s\ pace/samples_bin/greedy_causal_lm ./TinyLlama-1.1B-Chat-v1.0/ ""
if: ${{ 'Release' == matrix.build-type }}

windows_genai_package:
strategy:
matrix:
build-type: [Release, Debug]
runs-on: windows-latest
defaults:
run:
shell: cmd
steps:
- uses: actions/checkout@v4
with:
submodules: recursive
- uses: actions/setup-python@v4
with:
python-version: 3.8
- run: curl --output ov.zip https://storage.openvinotoolkit.org/repositories/openvino/packages/pre-release/2024.2.0rc1/windows/w_openvino_toolkit_windows_2024.2.0.dev20240524_x86_64.zip
- run: unzip ov.zip
- run: call w_openvino_toolkit_windows_2024.2.0.dev20240524_x86_64\setupvars.bat && cmake -DCMAKE_BUILD_TYPE=${{ matrix.build-type }} -S ./ -B ./build/
- run: call w_openvino_toolkit_windows_2024.2.0.dev20240524_x86_64\setupvars.bat && cmake --build ./build/ --config ${{ matrix.build-type }} --target package -j
- run: call w_openvino_toolkit_windows_2024.2.0.dev20240524_x86_64\setupvars.bat && cmake --install ./build/ --config ${{ matrix.build-type }} --prefix w_openvino_toolkit_windows_2024.2.0.dev20240524_x86_64
- run: call w_openvino_toolkit_windows_2024.2.0.dev20240524_x86_64\samples\cpp\build_samples_msvc.bat -i "${{ github.workspace }}/samples_install"
if: ${{ 'Release' == matrix.build-type }} # build_samples enforces Release build
- run: call w_openvino_toolkit_windows_2024.2.0.dev20240524_x86_64\setupvars.bat && python -m pip install --upgrade-strategy eager -r text_generation/causal_lm/cpp/requirements.txt
if: ${{ 'Release' == matrix.build-type }}
- run: call w_openvino_toolkit_windows_2024.2.0.dev20240524_x86_64\setupvars.bat && python -m pip install ./thirdparty/openvino_tokenizers/[transformers] --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/pre-release
if: ${{ 'Release' == matrix.build-type }}
- run: call w_openvino_toolkit_windows_2024.2.0.dev20240524_x86_64\setupvars.bat && optimum-cli export openvino --trust-remote-code --weight-format fp16 --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 TinyLlama-1.1B-Chat-v1.0
if: ${{ 'Release' == matrix.build-type }}
- run: call w_openvino_toolkit_windows_2024.2.0.dev20240524_x86_64\setupvars.bat && "${{ github.workspace }}/samples_install/samples_bin/greedy_causal_lm" .\TinyLlama-1.1B-Chat-v1.0\ ""
if: ${{ 'Release' == matrix.build-type }}
59 changes: 59 additions & 0 deletions .github/workflows/genai_python_lib.yml
@@ -0,0 +1,59 @@
name: genai_python_lib
on: pull_request
jobs:
ubuntu_genai_python_lib:
# A tokenizers' dependency fails to compile on ubuntu-20 in the CentOS7 env
runs-on: ubuntu-22.04
env:
# A tokenizers' dependency fails to compile with Ninja in the CentOS7 env
CMAKE_GENERATOR: Unix Makefiles
steps:
- uses: actions/checkout@v4
with:
submodules: recursive
- uses: actions/setup-python@v4
with:
python-version: 3.8
- run: mkdir ./ov/
- run: curl https://storage.openvinotoolkit.org/repositories/openvino/packages/pre-release/2024.2.0rc1/linux/l_openvino_toolkit_centos7_2024.2.0.dev20240524_x86_64.tgz | tar --directory ./ov/ --strip-components 1 -xz # Install CentOS7 instead of Ubuntu to match PyPI distribution ABI
- run: sudo ./ov/install_dependencies/install_openvino_dependencies.sh
- run: source ./ov/setupvars.sh && cmake -DCMAKE_BUILD_TYPE=Release -S ./ -B ./build/
- run: source ./ov/setupvars.sh && cmake --build ./build/ --config Release -j
# GitHub Actions already provides what is listed in ./requirements-build.txt but the internal
# build system doesn't. Install ./requirements-build.txt to detect possible conflicts.
- run: source ./ov/setupvars.sh && python -m pip install ./thirdparty/openvino_tokenizers/[transformers] -r ./requirements-build.txt --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/pre-release
- run: source ./ov/setupvars.sh && PYTHONPATH=./build/ python -c "from openvino_genai import LLMPipeline"
- run: source ./ov/setupvars.sh && CMAKE_BUILD_PARALLEL_LEVEL="" python -m pip install .
- run: python -c "from openvino_genai import LLMPipeline"
- name: GenAI Python API tests
run: |
cd ./tests/python_tests/
python -m pip install -r requirements.txt
models=$(python list_test_models.py)
echo "$models" | while read -r model_name model_path; do
optimum-cli export openvino --trust-remote-code --weight-format fp16 --model "$model_name" "$model_path"
done
python -m pytest test_generate_api.py
windows_genai_python_lib:
runs-on: windows-latest
defaults:
run:
shell: cmd
steps:
- uses: actions/checkout@v4
with:
submodules: recursive
- uses: actions/setup-python@v4
with:
python-version: 3.8
- run: curl --output ov.zip https://storage.openvinotoolkit.org/repositories/openvino/packages/pre-release/2024.2.0rc1/windows/w_openvino_toolkit_windows_2024.2.0.dev20240524_x86_64.zip
- run: unzip ov.zip
- run: call w_openvino_toolkit_windows_2024.2.0.dev20240524_x86_64\setupvars.bat && cmake -DCMAKE_BUILD_TYPE=Release -S ./ -B ./build/
- run: call w_openvino_toolkit_windows_2024.2.0.dev20240524_x86_64\setupvars.bat && cmake --build ./build/ --config Release -j
# GitHub Actions already provides what is listed in ./requirements-build.txt but the internal
# build system doesn't. Install ./requirements-build.txt to detect possible conflicts.
- run: call w_openvino_toolkit_windows_2024.2.0.dev20240524_x86_64\setupvars.bat && python -m pip install ./thirdparty/openvino_tokenizers/[transformers] -r ./requirements-build.txt --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/pre-release
Review comment (Contributor):
Suggested change
- run: call w_openvino_toolkit_windows_2024.2.0.dev20240524_x86_64\setupvars.bat && python -m pip install ./thirdparty/openvino_tokenizers/[transformers] -r ./requirements-build.txt --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/pre-release
- run: call w_openvino_toolkit_windows_2024.2.0.dev20240524_x86_64\setupvars.bat && python -m pip install ./thirdparty/openvino_tokenizers/[transformers] -v -r ./requirements-build.txt --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/pre-release

to see logs, maybe they will help

- run: set "PYTHONPATH=./build/" && call w_openvino_toolkit_windows_2024.2.0.dev20240524_x86_64\setupvars.bat && python -c "from openvino_genai import LLMPipeline" # cmd evaluates variables in a different way. Setting PYTHONPATH before setupvars.bat instead of doing that after solves that.
- run: set CMAKE_BUILD_PARALLEL_LEVEL=&& call w_openvino_toolkit_windows_2024.2.0.dev20240524_x86_64\setupvars.bat && python -m pip install .
- run: python -c "from openvino_genai import LLMPipeline"
8 changes: 4 additions & 4 deletions .github/workflows/lcm_dreamshaper_cpp.yml
@@ -40,15 +40,15 @@ jobs:
run: |
conda activate openvino_lcm_cpp
conda update -c conda-forge --all
conda install -c conda-forge openvino=2024.1.0 c-compiler cxx-compiler git make cmake
conda install -c conda-forge -c conda-forge/label/openvino_dev openvino==2024.2.0.dev20240513 c-compiler cxx-compiler git make cmake
conda env config vars set LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
- name: Install python dependencies
working-directory: ${{ env.working_directory }}
run: |
conda activate openvino_lcm_cpp
python -m pip install -r requirements.txt
python -m pip install ../../../thirdparty/openvino_tokenizers/[transformers]
python -m pip install -r requirements.txt
- name: Download and convert model and tokenizer
working-directory: ${{ env.working_directory }}
@@ -85,15 +85,15 @@ jobs:
run: |
conda activate openvino_lcm_cpp
conda update -c conda-forge --all
conda install -c conda-forge openvino=2024.1.0 c-compiler cxx-compiler git make cmake
conda install -c conda-forge -c conda-forge/label/openvino_dev openvino==2024.2.0.dev20240513 c-compiler cxx-compiler git make cmake
conda env config vars set LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
- name: Install python dependencies
working-directory: ${{ env.working_directory }}
run: |
conda activate openvino_lcm_cpp
python -m pip install -r requirements.txt
python -m pip install ../../../thirdparty/openvino_tokenizers/[transformers]
python -m pip install -r requirements.txt
- name: Download and convert model and tokenizer
working-directory: ${{ env.working_directory }}
8 changes: 4 additions & 4 deletions .github/workflows/stable_diffusion_1_5_cpp.yml
@@ -39,15 +39,15 @@ jobs:
- name: Install OpenVINO and other conda dependencies
run: |
conda activate openvino_sd_cpp
conda install -c conda-forge openvino=2024.1.0 c-compiler cxx-compiler git make cmake
conda install -c conda-forge -c conda-forge/label/openvino_dev openvino==2024.2.0.dev20240513 c-compiler cxx-compiler git make cmake
conda env config vars set LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
- name: Install python dependencies
working-directory: ${{ env.working_directory }}
run: |
conda activate openvino_sd_cpp
python -m pip install -r requirements.txt
python -m pip install ../../../thirdparty/openvino_tokenizers/[transformers]
python -m pip install -r requirements.txt
- name: Download and convert model and tokenizer
working-directory: ${{ env.working_directory }}
@@ -83,14 +83,14 @@ jobs:
- name: Install OpenVINO and other conda dependencies
run: |
conda activate openvino_sd_cpp
conda install -c conda-forge openvino=2024.1.0 c-compiler cxx-compiler git make cmake
conda install -c conda-forge -c conda-forge/label/openvino_dev openvino==2024.2.0.dev20240513 c-compiler cxx-compiler git make cmake
- name: Install python dependencies
working-directory: ${{ env.working_directory }}
run: |
conda activate openvino_sd_cpp
python -m pip install -r requirements.txt
python -m pip install ../../../thirdparty/openvino_tokenizers/[transformers]
python -m pip install -r requirements.txt
- name: Download and convert model and tokenizer
working-directory: ${{ env.working_directory }}
52 changes: 52 additions & 0 deletions CMakeLists.txt
@@ -0,0 +1,52 @@
# Copyright (C) 2018-2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
#

cmake_minimum_required(VERSION 3.15)

# Multi config generators such as Visual Studio ignore CMAKE_BUILD_TYPE. Multi config generators are configured with
# CMAKE_CONFIGURATION_TYPES, but limiting options in it completely removes such build options
get_property(GENERATOR_IS_MULTI_CONFIG_VAR GLOBAL PROPERTY GENERATOR_IS_MULTI_CONFIG)
if(NOT GENERATOR_IS_MULTI_CONFIG_VAR AND NOT DEFINED CMAKE_BUILD_TYPE)
message(STATUS "CMAKE_BUILD_TYPE is not defined, 'Release' will be used")
# Setting CMAKE_BUILD_TYPE as CACHE must go before project(). Otherwise project() sets its value and set() doesn't take effect
set(CMAKE_BUILD_TYPE Release CACHE STRING "Choose the type of build, options are: None Debug Release RelWithDebInfo MinSizeRel ...")
endif()

project(OpenVINOGenAI VERSION 2024.2.0.0)

add_subdirectory(./thirdparty/openvino_tokenizers/ "${CMAKE_CURRENT_BINARY_DIR}/openvino_tokenizers/")
# Put binaries to a single dir to mimic package structure.
set_target_properties(openvino_tokenizers PROPERTIES
# Generator expressions to disable appending a per-configuration subdirectory (Release, Debug).
LIBRARY_OUTPUT_DIRECTORY "$<1:${CMAKE_BINARY_DIR}/openvino_genai/>"
RUNTIME_OUTPUT_DIRECTORY "$<1:${CMAKE_BINARY_DIR}/openvino_genai/>"
)
if(TARGET core_tokenizers)
set_target_properties(core_tokenizers PROPERTIES
LIBRARY_OUTPUT_DIRECTORY "$<1:${CMAKE_BINARY_DIR}/openvino_genai/>"
RUNTIME_OUTPUT_DIRECTORY "$<1:${CMAKE_BINARY_DIR}/openvino_genai/>"
)
else()
# Prebuilt dependencies
if(WIN32)
set(extra_libs "${CMAKE_BINARY_DIR}/_deps/fast_tokenizer-src/lib/core_tokenizers.dll"
"${CMAKE_BINARY_DIR}/_deps/fast_tokenizer-src/third_party/lib/icudt70.dll"
"${CMAKE_BINARY_DIR}/_deps/fast_tokenizer-src/third_party/lib/icuuc70.dll")
elseif(LINUX)
set(extra_libs "${CMAKE_BINARY_DIR}/_deps/fast_tokenizer-src/lib/libcore_tokenizers.so")
elseif(APPLE)
set(extra_libs "${CMAKE_BINARY_DIR}/_deps/fast_tokenizer-srclib/libcore_tokenizers.dylib")
Review comment (Contributor):
Suggested change
set(extra_libs "${CMAKE_BINARY_DIR}/_deps/fast_tokenizer-srclib/libcore_tokenizers.dylib")
set(extra_libs "${CMAKE_BINARY_DIR}/_deps/fast_tokenizer-src/lib/libcore_tokenizers.dylib")

endif()
add_custom_command(OUTPUT "${extra_libs}"
COMMAND "${CMAKE_COMMAND}" -E copy "${extra_libs}" "${CMAKE_BINARY_DIR}/openvino_genai/"
DEPENDS openvino_tokenizers)
endif()
add_subdirectory(src)
add_subdirectory(text_generation/causal_lm/cpp)

install(DIRECTORY text_generation/causal_lm/cpp/ DESTINATION samples/cpp/causal_lm COMPONENT cpp_samples_genai)
install(FILES LICENSE DESTINATION licensing COMPONENT licensing_genai RENAME LICENSE-GENAI)
install(FILES third-party-programs.txt DESTINATION licensing COMPONENT licensing_genai RENAME third-party-programs-genai.txt)
set(CPACK_GENERATOR "ZIP")
include(CPack)
2 changes: 1 addition & 1 deletion image_generation/lcm_dreamshaper_v7/cpp/README.md
@@ -18,7 +18,7 @@ Prepare a python environment and install dependencies:
conda create -n openvino_lcm_cpp python==3.10
conda activate openvino_lcm_cpp
conda update -c conda-forge --all
conda install -c conda-forge openvino=2024.1.0 c-compiler cxx-compiler git make cmake
conda install -c conda-forge openvino=2024.2.0 c-compiler cxx-compiler git make cmake
# Ensure that Conda standard libraries are used
conda env config vars set LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
```
2 changes: 1 addition & 1 deletion image_generation/stable_diffusion_1_5/cpp/README.md
@@ -18,7 +18,7 @@ Prepare a python environment and install dependencies:
```shell
conda create -n openvino_sd_cpp python==3.10
conda activate openvino_sd_cpp
conda install -c conda-forge openvino=2024.1.0 c-compiler cxx-compiler git make cmake
conda install -c conda-forge openvino=2024.2.0 c-compiler cxx-compiler git make cmake
# Ensure that Conda standard libraries are used
conda env config vars set LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
```
41 changes: 41 additions & 0 deletions pyproject.toml
@@ -0,0 +1,41 @@
[project]
name = "openvino_genai"
version = "2024.2.0.0"
description = "Python bindings for https://github.com/openvinotoolkit/openvino.genai"
requires-python = ">=3.8"
readme = {file = "text_generation/causal_lm/cpp/README.md", content-type="text/markdown"}
license = {text = "OSI Approved :: Apache Software License"}
authors = [
{ name = "OpenVINO Developers", email = "[email protected]" },
]
classifiers = [
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
]
dependencies = [
"openvino_tokenizers~=2024.2.0.0"
]

[tool.scikit-build]
cmake.source-dir = "./"
cmake.build-type = "Release"
cmake.targets = ["py_generate_pipeline", "genai"]
install.components = ["wheel_genai"]
sdist.cmake = true
wheel.packages = ["src/python/openvino_genai"]
wheel.install-dir = "openvino_genai"
wheel.build-tag = "000"
wheel.license-files = ["LICENSE", "SECURITY.md", "third-party-programs.txt"]

[[tool.scikit-build.generate]]
path = "openvino_genai/__version__.py"
template = '''
__version__ = "${version}"
'''

[build-system]
requires = ["scikit-build-core~=0.8.0", "cmake~=3.23"] # See https://github.com/openvinotoolkit/openvino_tokenizers/pull/123
build-backend = "scikit_build_core.build"
1 change: 1 addition & 0 deletions requirements-build.txt
@@ -0,0 +1 @@
build~=1.2.1
13 changes: 13 additions & 0 deletions src/CMakeLists.txt
@@ -0,0 +1,13 @@
# Copyright (C) 2018-2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
#

# Find OpenVINODeveloperPackage first to compile with SDL flags
find_package(OpenVINODeveloperPackage QUIET
PATHS "${OpenVINO_DIR}")
if(NOT OpenVINODeveloperPackage_FOUND)
find_package(OpenVINO REQUIRED COMPONENTS Runtime)
endif()

add_subdirectory(cpp)
add_subdirectory(python)
163 changes: 163 additions & 0 deletions src/README.md
@@ -0,0 +1,163 @@
# OpenVINO Generate API

## Usage

First of all, you need to convert your model with optimum-cli and install openvino-genai:
``` sh
optimum-cli export openvino --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --weight-format fp16 --trust-remote-code "TinyLlama-1.1B-Chat-v1.0"
pip install openvino-genai
```

`LLMPipeline` is the main object used for decoding. You can construct it straight away from the folder with the converted model. It will automatically load the main model, tokenizer, detokenizer and default generation configuration.

### Python

A minimalist example:
```python
import openvino_genai as ov_genai
pipe = ov_genai.LLMPipeline(model_path, "CPU")
print(pipe.generate("The Sun is yellow because"))
```

Calling generate() with custom generation config parameters, e.g. a config for grouped beam search:
```python
import openvino_genai as ov_genai
pipe = ov_genai.LLMPipeline(model_path, "CPU")

result = pipe.generate("The Sun is yellow because", max_new_tokens=30, num_groups=3, group_size=5, diversity_penalty=1.5)
print(result)
```

output:
```
'it is made up of carbon atoms. The carbon atoms are arranged in a linear pattern, which gives the yellow color. The arrangement of carbon atoms in'
```

A simple chat in Python:
```python
import openvino_genai as ov_genai
pipe = ov_genai.LLMPipeline(model_path)

config = {'num_groups': 3, 'group_size': 5, 'diversity_penalty': 1.5}
pipe.set_generation_config(config)

pipe.start_chat()
while True:
    print('question:')
    prompt = input()
    if prompt == 'Stop!':
        break
    print(pipe(prompt))
pipe.finish_chat()
```

To validate the pipeline, you can compare its output with Hugging Face Transformers output for the same prompt and decoding settings, as sketched below.
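This is only a sanity check, not a definitive test: it assumes the model was exported to the `TinyLlama-1.1B-Chat-v1.0` folder as shown above and that the reference weights are pulled from the Hugging Face Hub; since the exported model is fp16, small differences between the two continuations are expected.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import openvino_genai as ov_genai

hf_model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # original weights from the Hub
ov_model_path = "TinyLlama-1.1B-Chat-v1.0"          # folder produced by optimum-cli above
prompt = "The Sun is yellow because"

# Reference continuation from Hugging Face Transformers (greedy decoding).
hf_tokenizer = AutoTokenizer.from_pretrained(hf_model_id)
hf_model = AutoModelForCausalLM.from_pretrained(hf_model_id)
inputs = hf_tokenizer(prompt, return_tensors="pt")
output_ids = hf_model.generate(**inputs, max_new_tokens=30, do_sample=False)
hf_text = hf_tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Continuation from the OpenVINO GenAI pipeline with the same settings.
pipe = ov_genai.LLMPipeline(ov_model_path, "CPU")
ov_text = pipe.generate(prompt, max_new_tokens=30)

print("HF:", hf_text)
print("OV:", ov_text)
```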

### C++

A minimalistic example:
```cpp
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>

int main(int argc, char* argv[]) {
std::string model_path = argv[1];
ov::genai::LLMPipeline pipe(model_path, "CPU");
std::cout << pipe.generate("The Sun is yellow because");
}
```
Using Group Beam Search Decoding
```cpp
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
int main(int argc, char* argv[]) {
std::string model_path = argv[1];
ov::genai::LLMPipeline pipe(model_path, "CPU");
ov::genai::GenerationConfig config = pipe.get_generation_config();
config.max_new_tokens = 256;
config.num_groups = 3;
config.group_size = 5;
config.diversity_penalty = 1.0f;
std::cout << pipe.generate("The Sun is yellow because", config);
}
```

A simple chat in C++ using grouped beam search decoding
``` cpp
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>

int main(int argc, char* argv[]) {
std::string prompt;

std::string model_path = argv[1];
ov::genai::LLMPipeline pipe(model_path, "CPU");

ov::genai::GenerationConfig config = pipe.get_generation_config();
config.max_new_tokens = 256;
config.num_groups = 3;
config.group_size = 5;
config.diversity_penalty = 1.0f;

pipe.start_chat();
for (;;) {
std::cout << "question:\n";
std::getline(std::cin, prompt);
if (prompt == "Stop!")
break;

std::cout << "answer:\n";
auto answer = pipe(prompt, config);
std::cout << answer << std::endl;
}
pipe.finish_chat();
}
```
Streaming example with lambda function
``` cpp
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
int main(int argc, char* argv[]) {
std::string model_path = argv[1];
ov::genai::LLMPipeline pipe(model_path, "CPU");
auto streamer = [](std::string word) { std::cout << word << std::flush; };
std::cout << pipe.generate("The Sun is yellow because", streamer);
}
```

Streaming with a custom class
``` cpp
#include "openvino/genai/streamer_base.hpp"
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>

class CustomStreamer: public ov::genai::StreamerBase {
public:
void put(int64_t token) {
/* custom decoding/tokens processing code
tokens_cache.push_back(token);
std::string text = m_tokenizer.decode(tokens_cache);
...
*/
};

void end() {
/* custom finalization */
};
};

int main(int argc, char* argv[]) {
CustomStreamer custom_streamer;

std::string model_path = argv[1];
ov::genai::LLMPipeline pipe(model_path, "CPU");
std::cout << pipe.generate("The Sun is yellow because", custom_streamer);
}
```
101 changes: 101 additions & 0 deletions src/cpp/CMakeLists.txt
@@ -0,0 +1,101 @@
# Copyright (C) 2018-2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
#

# Dependencies

include(FetchContent)

FetchContent_Declare(nlohmann_json
URL https://github.com/nlohmann/json/archive/refs/tags/v3.11.3.tar.gz
URL_HASH SHA256=0d8ef5af7f9794e3263480193c491549b2ba6cc74bb018906202ada498a79406)
FetchContent_MakeAvailable(nlohmann_json)

function(ov_genai_build_jinja2cpp)
FetchContent_Declare(jinja2cpp
URL https://github.com/jinja2cpp/Jinja2Cpp/archive/9ae7e1fc45d707e1686dd425a154d30963801944.tar.gz
URL_HASH SHA256=aa41ae425225623ba91be5de3ef1e0d942e682d519311e6235b04b4e7d880e01)

FetchContent_GetProperties(jinja2cpp)
if(NOT jinja2cpp_POPULATED)
FetchContent_Populate(jinja2cpp)

set(BUILD_SHARED_LIBS OFF)
set(JINJA2CPP_INSTALL OFF CACHE BOOL "")
set(JINJA2CPP_CXX_STANDARD 17 CACHE STRING "")
set(JINJA2CPP_BUILD_SHARED OFF CACHE BOOL "")
set(JINJA2CPP_USE_REGEX "std" CACHE STRING "")
set(JINJA2CPP_WITH_JSON_BINDINGS "none" CACHE STRING "")
set(JINJA2CPP_STRICT_WARNINGS OFF CACHE BOOL "")
set(JINJA2CPP_PIC ON CACHE BOOL "")

add_subdirectory("${jinja2cpp_SOURCE_DIR}" "${jinja2cpp_BINARY_DIR}" EXCLUDE_FROM_ALL)
endif()
endfunction()

ov_genai_build_jinja2cpp()

# Library

file(GLOB SOURCE_FILES "${CMAKE_CURRENT_SOURCE_DIR}/src/*.cpp")

set(TARGET_NAME genai)
add_library(${TARGET_NAME} SHARED ${SOURCE_FILES})
add_library(openvino::${TARGET_NAME} ALIAS ${TARGET_NAME})

target_include_directories(${TARGET_NAME}
PUBLIC "$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/include>" "$<INSTALL_INTERFACE:runtime/include>")

target_link_libraries(${TARGET_NAME} PUBLIC openvino::runtime PRIVATE nlohmann_json::nlohmann_json jinja2cpp)

target_compile_features(${TARGET_NAME} PUBLIC cxx_std_17)

# Extract the last two digits from CMAKE_PROJECT_VERSION_MAJOR because SOVERSION can only contain up to 4 symbols.
string(REGEX MATCH [=[[0-9][0-9]$]=] MAJOR_SUFFIX ${CMAKE_PROJECT_VERSION_MAJOR})
set_target_properties(${TARGET_NAME} PROPERTIES
VERSION ${CMAKE_PROJECT_VERSION}
SOVERSION ${MAJOR_SUFFIX}${CMAKE_PROJECT_VERSION_MINOR}${CMAKE_PROJECT_VERSION_PATCH}
LIBRARY_OUTPUT_DIRECTORY "$<1:${CMAKE_BINARY_DIR}/openvino_genai/>"
RUNTIME_OUTPUT_DIRECTORY "$<1:${CMAKE_BINARY_DIR}/openvino_genai/>"
)

find_package(Python3 REQUIRED COMPONENTS Interpreter Development)
install(TARGETS ${TARGET_NAME}
LIBRARY DESTINATION python/openvino_genai/ COMPONENT pygenai_${Python_VERSION_MAJOR}_${Python_VERSION_MINOR}
RUNTIME DESTINATION python/openvino_genai/ COMPONENT pygenai_${Python_VERSION_MAJOR}_${Python_VERSION_MINOR})

# - Windows: `<openvino_dir>\runtime\bin\intel64\Release\`
# - MacOS_x86: `<openvino_dir>/runtime/lib/intel64/Release`
# - MacOS_arm64: `<openvino_dir>/runtime/lib/arm64/Release/`
# - Linux_x86: `<openvino_dir>/runtime/lib/intel64/`
# - Linux_arm64: `<openvino_dir>/runtime/lib/aarch64/`
string(TOLOWER "${CMAKE_SYSTEM_PROCESSOR}" ARCH_DIR)
if(CMAKE_HOST_SYSTEM_PROCESSOR MATCHES "amd64.*|x86_64.*|AMD64.*")
set(ARCH_DIR intel64)
elseif(CMAKE_HOST_SYSTEM_PROCESSOR MATCHES "^(arm64.*|aarch64.*|AARCH64.*|ARM64.*)")
if(APPLE)
set(ARCH_DIR "arm64")
else()
set(ARCH_DIR "aarch64")
endif()
elseif(ARCH_DIR STREQUAL "x86_64" OR ARCH_DIR STREQUAL "amd64" # Windows detects Intel's 64-bit CPU as AMD64
OR CMAKE_OSX_ARCHITECTURES STREQUAL "x86_64")
set(ARCH_DIR intel64)
endif()
if(MSVC OR APPLE)
set(ARCH_DIR ${ARCH_DIR}/${CMAKE_BUILD_TYPE})
endif()
install(TARGETS ${TARGET_NAME} EXPORT OpenVINOGenAITargets
LIBRARY DESTINATION runtime/lib/${ARCH_DIR} COMPONENT core_genai
NAMELINK_COMPONENT core_genai_dev
ARCHIVE DESTINATION runtime/lib/${ARCH_DIR} COMPONENT core_genai_dev
RUNTIME DESTINATION runtime/bin/${ARCH_DIR} COMPONENT core_genai
INCLUDES DESTINATION runtime/include)
install(DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}/include/ DESTINATION runtime/include COMPONENT core_genai_dev)
install(EXPORT OpenVINOGenAITargets FILE OpenVINOGenAITargets.cmake NAMESPACE openvino:: DESTINATION runtime/cmake)
include(CMakePackageConfigHelpers)
configure_package_config_file(OpenVINOGenAIConfig.cmake.in "${CMAKE_BINARY_DIR}/OpenVINOGenAIConfig.cmake" INSTALL_DESTINATION runtime/cmake)
install(FILES "${CMAKE_BINARY_DIR}/OpenVINOGenAIConfig.cmake" "${CMAKE_BINARY_DIR}/OpenVINOGenAIConfigVersion.cmake" DESTINATION runtime/cmake COMPONENT core_genai_dev)
write_basic_package_version_file("${CMAKE_BINARY_DIR}/OpenVINOGenAIConfigVersion.cmake" VERSION ${CMAKE_PROJECT_VERSION} COMPATIBILITY AnyNewerVersion)
export(EXPORT OpenVINOGenAITargets FILE "${CMAKE_BINARY_DIR}/OpenVINOGenAITargets.cmake" NAMESPACE openvino::)
10 changes: 10 additions & 0 deletions src/cpp/OpenVINOGenAIConfig.cmake.in
@@ -0,0 +1,10 @@
@PACKAGE_INIT@

include(CMakeFindDependencyMacro)
find_dependency(OpenVINO COMPONENTS Runtime)

if(NOT TARGET genai)
include("${CMAKE_CURRENT_LIST_DIR}/OpenVINOGenAITargets.cmake")
endif()

check_required_components(openvino_genai)
107 changes: 107 additions & 0 deletions src/cpp/include/openvino/genai/generation_config.hpp
@@ -0,0 +1,107 @@
// Copyright (C) 2023-2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0

#pragma once

#include <limits>
#include <variant>
#include <string>

#include "openvino/runtime/compiled_model.hpp"
#include "openvino/runtime/infer_request.hpp"
#include "openvino/genai/tokenizer.hpp"

namespace ov {
namespace genai {

/**
* @brief controls the stopping condition for grouped beam search. The following values are possible:
* "early" stops as soon as there are `num_beams` complete candidates.
"heuristic" stops when is it unlikely to find better candidates.
"never" stops when there cannot be better candidates.
*/
enum class StopCriteria { early, heuristic, never };

/**
* @brief Structure to keep generation config parameters. For a selected method of decoding, only parameters from that group
* and generic parameters are used. For example, if do_sample is set to true, then only generic parameters and random sampling parameters will
* be used while greedy and beam search parameters will not affect decoding at all.
*
* Generic parameters:
* @param max_length the maximum length the generated tokens can have. Corresponds to the length of the input prompt +
* `max_new_tokens`. Its effect is overridden by `max_new_tokens`, if also set.
* @param max_new_tokens the maximum number of tokens to generate, excluding the number of tokens in the prompt. max_new_tokens has priority over max_length.
* @param ignore_eos if set to true, then generation will not stop even if <eos> token is met.
* @param pad_token_id token_id of <pad> (padding)
* @param bos_token_id token_id of <bos> (beginning of sentence)
* @param eos_token_id token_id of <eos> (end of sentence)
* @param bos_token <bos> token string representation
* @param eos_token <eos> token string representation
*
* Beam search specific parameters:
* @param num_beams number of beams for beam search. 1 disables beam search.
* @param num_beam_groups number of groups to divide `num_beams` into in order to ensure diversity among different groups of beams.
* @param diversity_penalty this value is subtracted from a beam's score if it generates the same token as any beam from another group at a
* particular time. See https://arxiv.org/pdf/1909.05858.
* @param length_penalty exponential penalty to the length that is used with beam-based generation. It is applied as an exponent to
* the sequence length, which in turn is used to divide the score of the sequence. Since the score is the log
* likelihood of the sequence (i.e. negative), `length_penalty` > 0.0 promotes longer sequences, while
* `length_penalty` < 0.0 encourages shorter sequences.
* @param num_return_sequences the number of sequences to return for grouped beam search decoding.
* @param no_repeat_ngram_size if set to int > 0, all ngrams of that size can only occur once.
* @param stop_criteria controls the stopping condition for grouped beam search. It accepts the following values:
* "early", where the generation stops as soon as there are `num_beams` complete candidates; "heuristic", where an
* heuristic is applied and the generation stops when is it very unlikely to find better candidates;
* "never", where the beam search procedure only stops when there cannot be better candidates (canonical beam search algorithm).
*
* Random sampling parameters:
* @param temperature the value used to modulate token probabilities for random sampling.
* @param top_p - if set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
* @param top_k the number of highest probability vocabulary tokens to keep for top-k-filtering.
* @param do_sample whether or not to use multinomial random sampling.
* @param repetition_penalty the parameter for repetition penalty. 1.0 means no penalty.
*/
class OPENVINO_GENAI_EXPORTS GenerationConfig {
public:
GenerationConfig() = default;
explicit GenerationConfig(std::string json_path);

// Generic
size_t max_new_tokens = SIZE_MAX;
size_t max_length = SIZE_MAX;
bool ignore_eos = false;

// Beam search specific
size_t num_beam_groups = 1;
size_t num_beams = 1;
float diversity_penalty = 1.0f;
float length_penalty = 1.0f;
size_t num_return_sequences = 1;
size_t no_repeat_ngram_size = std::numeric_limits<size_t>::max();
StopCriteria stop_criteria = StopCriteria::heuristic;

// Multinomial
float temperature = 0.0f;
float top_p = 1.0f;
int top_k = -1;
bool do_sample = false;
float repetition_penalty = 1.0f;

// special tokens
int64_t pad_token_id = 0;
int64_t bos_token_id = 1;
int64_t eos_token_id = 2;

// used for chat scenario
std::string bos_token = "<s>";
std::string eos_token = "</s>";

size_t get_max_new_tokens(size_t prompt_length = 0) const;
bool is_greedy_decoding() const;
bool is_beam_search() const;
bool is_multimomial() const;
static GenerationConfig anymap_to_generation_config(const ov::AnyMap& config_map = {});
};

} // namespace genai
} // namespace ov
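As a cross-reference to `src/README.md` above, the sketch below illustrates how the two parameter groups documented in this header could be driven from Python. The beam-search keyword arguments (`max_new_tokens`, `num_groups`, `group_size`, `diversity_penalty`) are the ones used in the README; passing the random-sampling fields (`do_sample`, `temperature`, `top_p`, `top_k`, `repetition_penalty`) as keyword arguments in the same way is an assumption at this stage of the PR.

```python
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("TinyLlama-1.1B-Chat-v1.0", "CPU")
prompt = "The Sun is yellow because"

# Beam-search group: kwargs exactly as used in src/README.md.
beam_result = pipe.generate(prompt, max_new_tokens=30,
                            num_groups=3, group_size=5, diversity_penalty=1.5)

# Random-sampling group: per the docstring above, once do_sample is true only the
# generic and random-sampling parameters matter; beam-search parameters are ignored.
# (Exposing these fields as kwargs is assumed to mirror the beam-search ones.)
sample_result = pipe.generate(prompt, max_new_tokens=30, do_sample=True,
                              temperature=0.8, top_p=0.9, top_k=50,
                              repetition_penalty=1.1)

print(beam_result)
print(sample_result)
```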
217 changes: 217 additions & 0 deletions src/cpp/include/openvino/genai/llm_pipeline.hpp
@@ -0,0 +1,217 @@
// Copyright (C) 2023-2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0

#pragma once

#include <optional>
#include <variant>

#include "openvino/core/any.hpp"
#include "openvino/genai/generation_config.hpp"
#include "openvino/genai/tokenizer.hpp"
#include "openvino/genai/streamer_base.hpp"

namespace ov {
namespace genai {

using StreamerVariant = std::variant<std::function<void (std::string)>, std::shared_ptr<StreamerBase>>;
using OptionalGenerationConfig = std::optional<GenerationConfig>;
using OptionalStreamerVariant = std::optional<StreamerVariant>;

/**
* @brief Structure to store resulting batched tokens and scores for each batch sequence
*
* @param tokens sequence of resulting tokens
* @param scores scores for each sequence
*/
class EncodedResults {
public:
std::vector<std::vector<int64_t>> tokens;
std::vector<float> scores;
};

/**
* @brief Structure to store resulting batched text outputs and scores for each batch
*
* @param texts vector of resulting sequences
* @param scores scores for each sequence
*/
class DecodedResults {
public:
std::vector<std::string> texts;
std::vector<float> scores;

// @brief Convert DecodedResults to a vector of strings.
// @return A std::vector<std::string> containing the texts from the DecodedResults object.
operator std::vector<std::string>() const {
return texts;
}

// @brief Overloads operator<< to enhance output the contents of DecodedResults.
// @return A reference to the output stream with the concatenated texts.
friend std::ostream& operator<<(std::ostream& os, const DecodedResults& dr) {
for (size_t i = 0; i < dr.texts.size(); ++i) {
os << dr.texts[i];
if (i != dr.texts.size() - 1) {
os << std::endl;
}
}
return os;
}
};

/**
* @brief This class is used for generation with LLMs.
*/
class OPENVINO_GENAI_EXPORTS LLMPipeline {
public:
/**
* @brief Constructs an LLMPipeline from xml/bin files, tokenizers and configuration in the same dir.
*
* @param model_path Path to the dir with model xml/bin files, tokenizers and generation_configs.json
* @param device optional device
* @param plugin_config optional plugin_config
* @param ov_tokenizers_path optional path to an extension to add. Empty adds openvino_tokenizers from the openvino_genai library folder.
*/
LLMPipeline(const std::string& path, const std::string& device="CPU",
const ov::AnyMap& plugin_config={},
const std::string& ov_tokenizers_path="");

/**
* @brief Constructs an LLMPipeline when ov::genai::Tokenizer is initialized manually using files from different dirs.
*
* @param model_path Path to the dir with model, tokenizer .xml/.bin files, and generation_configs.json
* @param tokenizer manually initialized ov::Tokenizer
* @param device optional device
* @param plugin_config optional plugin_config
*/
LLMPipeline(
const std::string& model_path,
const ov::genai::Tokenizer& tokenizer,
const std::string& device="CPU",
const ov::AnyMap& plugin_config = {}
);

~LLMPipeline();

/**
* @brief High level generate for the input with a single prompt which encodes inputs and returns decoded output
*
* @param text input prompt
* @param generation_config optional GenerationConfig
* @param streamer optional streamer
* @return std::string decoded resulting text
*/
std::string generate(std::string text, OptionalGenerationConfig generation_config=std::nullopt, OptionalStreamerVariant streamer=std::nullopt);

template <typename... Properties>
util::EnableIfAllStringAny<std::string, Properties...> generate(
std::string text,
Properties&&... properties) {
return generate(text, AnyMap{std::forward<Properties>(properties)...});
}
std::string generate(std::string text, const ov::AnyMap& config);

template <typename... Properties>
util::EnableIfAllStringAny<EncodedResults, Properties...> generate(
ov::Tensor input_ids,
Properties&&... properties) {
return generate(input_ids, AnyMap{std::forward<Properties>(properties)...});
}
EncodedResults generate(ov::Tensor input_ids, const ov::AnyMap& config);

/**
* @brief High level generate for batched prompts which encodes inputs and returns decoded outputs.
* Streamer cannot be used for multibatch inputs.
*
* @param text input prompt
* @param generation_config optional GenerationConfig
* @return DecodedResults a structure with resulting texts & scores
*/
DecodedResults generate(const std::vector<std::string>& texts, OptionalGenerationConfig generation_config);

/**
* @brief Low level generate to be called with already encoded input_ids tokens.
* Streamer cannot be used for multibatch inputs.
*
* @param input_ids encoded input prompt tokens
* @param attention_mask optional attention_mask
* @param generation_config optional GenerationConfig
* @param streamer optional streamer
* @return EncodedResults a structure with resulting tokens and scores
* @throws Exception if the streamer is set for input_ids with multiple batches
*/
EncodedResults generate(ov::Tensor input_ids,
std::optional<ov::Tensor> attention_mask,
OptionalGenerationConfig generation_config=std::nullopt,
OptionalStreamerVariant streamer=std::nullopt);

template <typename InputsType, typename... Properties>
util::EnableIfAllStringAny<std::string, Properties...> operator()(
InputsType text,
Properties&&... properties) {
return generate(text, AnyMap{std::forward<Properties>(properties)...});
}

DecodedResults operator()(const std::vector<std::string>& text, OptionalGenerationConfig generation_config=std::nullopt) {
return generate(text, generation_config);
}

std::string operator()(
std::string text,
OptionalGenerationConfig generation_config=std::nullopt,
OptionalStreamerVariant streamer=std::nullopt
) {
return generate(text, generation_config, streamer);
}

ov::genai::Tokenizer get_tokenizer();
GenerationConfig get_generation_config() const;
void set_generation_config(const GenerationConfig& generation_config);

void start_chat();
void finish_chat();
void reset_state();
std::string apply_chat_template(std::string prompt, std::string role = "user") const;
private:
class LLMPipelineImpl;
std::unique_ptr<LLMPipelineImpl> m_pimpl;
};

/*
* utils that allow to use generate and operator() in the following way:
* pipe.generate(input_ids, ov::max_new_tokens(200), ov::temperature(1.0f),...)
* pipe(text, ov::max_new_tokens(200), ov::temperature(1.0f),...)
*/
static constexpr ov::Property<size_t> max_new_tokens{"max_new_tokens"};
static constexpr ov::Property<size_t> max_length{"max_length"};
static constexpr ov::Property<bool> ignore_eos{"ignore_eos"};

static constexpr ov::Property<size_t> num_beam_groups{"num_beam_groups"};
static constexpr ov::Property<size_t> num_beams{"num_beams"};
static constexpr ov::Property<float> diversity_penalty{"diversity_penalty"};
static constexpr ov::Property<float> length_penalty{"length_penalty"};
static constexpr ov::Property<size_t> num_return_sequences{"num_return_sequences"};
static constexpr ov::Property<size_t> no_repeat_ngram_size{"no_repeat_ngram_size"};
static constexpr ov::Property<StopCriteria> stop_criteria{"stop_criteria"};

static constexpr ov::Property<float> temperature{"temperature"};
static constexpr ov::Property<float> top_p{"top_p"};
static constexpr ov::Property<int> top_k{"top_k"};
static constexpr ov::Property<bool> do_sample{"do_sample"};
static constexpr ov::Property<float> repetition_penalty{"repetition_penalty"};


static constexpr ov::Property<int64_t> pad_token_id{"pad_token_id"};
static constexpr ov::Property<int64_t> bos_token_id{"bos_token_id"};
static constexpr ov::Property<int64_t> eos_token_id{"eos_token_id"};

static constexpr ov::Property<std::string> bos_token{"bos_token"};
static constexpr ov::Property<std::string> eos_token{"eos_token"};

// only a lambda streamer can be set via the ov::streamer(...) syntactic sugar,
// because std::variant<StreamerBase, std::function<>> cannot be stored in an AnyMap
static constexpr ov::Property<std::function<void (std::string)>> streamer{"streamer"};

} // namespace genai
} // namespace ov
30 changes: 30 additions & 0 deletions src/cpp/include/openvino/genai/streamer_base.hpp
@@ -0,0 +1,30 @@
// Copyright (C) 2023-2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0

#pragma once

#include "openvino/genai/tokenizer.hpp"

namespace ov {
namespace genai {

/**
* @brief Base class for streamers. To use it, inherit from this class and implement the put() and end() methods.
*
* @param m_tokenizer tokenizer
*/
class StreamerBase {
public:
Tokenizer m_tokenizer;
explicit StreamerBase(Tokenizer tokenizer): m_tokenizer(tokenizer) {}
StreamerBase() = default;

/// @brief put is called every time a new token is decoded
virtual void put(int64_t token) = 0;

/// @brief end is called at the end of generation. It can be used to flush cache if your own streamer has one
virtual void end() = 0;
};

} // namespace genai
} // namespace ov
83 changes: 83 additions & 0 deletions src/cpp/include/openvino/genai/tokenizer.hpp
@@ -0,0 +1,83 @@
// Copyright (C) 2023-2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0

#pragma once

#include <string>
#include <vector>
#include <initializer_list>
#include <openvino/runtime/tensor.hpp>
#include "openvino/genai/visibility.hpp"

namespace ov {
namespace genai {

/**
* @brief class is used to encode prompts and decode resulting tokens
*/
class OPENVINO_GENAI_EXPORTS Tokenizer {
public:
/**
* @brief ov::genai::Tokenizer constructor.
* @param tokenizers_path directory that contains openvino_tokenizer.xml and openvino_detokenizer.xml
* @param device device name. Currently only 'CPU' is supported
* @param ov_tokenizers_path path to the openvino_tokenizers extension library passed to Core::add_extension
*/
Tokenizer(const std::string& tokenizers_path, const std::string& device="CPU", const std::string& ov_tokenizers_path="");

/**
* @brief encode a single prompt
* @return pair of [input_ids, attention_mask]
*/
std::pair<ov::Tensor, ov::Tensor> encode(const std::string prompt);

/**
* @brief encode batch of prompts. Left padding will be applied by default
* @param prompts vector storing batch of prompts
* @return pair of [input_ids, attention_mask]
*/
std::pair<ov::Tensor, ov::Tensor> encode(std::vector<std::string>& prompts);
std::pair<ov::Tensor, ov::Tensor> encode(std::vector<std::string>&& prompts);
std::pair<ov::Tensor, ov::Tensor> encode(std::initializer_list<std::string>& prompts);

/**
* @brief decode sequence of tokens
* @param tokens vector storing tokens
* @return sequence string
*/
std::string decode(std::vector<int64_t> tokens);

/**
* @brief decode tokens.
* @param tokens ov::Tensor with tokens with shape [batch_size, seq_len]
* @return vector of std::string, with size = batch_size
*/
std::vector<std::string> decode(ov::Tensor tokens);

/**
* @brief batched decoding of tokens.
* @param tokens vector of vectors with tokens, tokens.size() is equal to batch_size
* @return vector of std::string, with size equal to batch_size
*/
std::vector<std::string> decode(std::vector<std::vector<int64_t>> tokens);

// information about <bos>, <eos> tokens should be public,
// they are used at least in StreamerBase descendants
int64_t get_bos_token_id() const;
int64_t get_eos_token_id() const;
int64_t get_pad_token_id() const;

// Write access is also needed to set these tokens when they cannot be read from the XML rt_info.
// In that case the values can be read from config.json in LLMPipeline.
void set_bos_token_id(int64_t);
void set_eos_token_id(int64_t);
void set_pad_token_id(int64_t);

Tokenizer() = default;
~Tokenizer();
private:
class TokenizerImpl;
std::shared_ptr<TokenizerImpl> m_pimpl;
};

} // namespace genai
} // namespace ov
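
A hedged encode/decode round-trip example for this class; the directory, device, and extension paths are placeholders:

#include <iostream>
#include "openvino/genai/tokenizer.hpp"

int main() {
    // The directory must contain openvino_tokenizer.xml and openvino_detokenizer.xml;
    // the third argument points to the openvino_tokenizers extension library.
    ov::genai::Tokenizer tokenizer("tokenizer_dir", "CPU", "path/to/libopenvino_tokenizers.so");

    auto [input_ids, attention_mask] = tokenizer.encode("Why is the Sun yellow?");
    std::vector<std::string> texts = tokenizer.decode(input_ids);  // batched decode of the [1, seq_len] tensor
    std::cout << texts.at(0) << '\n';
    return 0;
}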
10 changes: 10 additions & 0 deletions src/cpp/include/openvino/genai/visibility.hpp
@@ -0,0 +1,10 @@
// Copyright (C) 2023-2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0

#include "openvino/core/visibility.hpp"

#ifdef genai_EXPORTS
# define OPENVINO_GENAI_EXPORTS OPENVINO_CORE_EXPORTS
#else
# define OPENVINO_GENAI_EXPORTS OPENVINO_CORE_IMPORTS
#endif // genai_EXPORTS
108 changes: 108 additions & 0 deletions src/cpp/src/generation_config.cpp
@@ -0,0 +1,108 @@
// Copyright (C) 2023-2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0

#include <fstream>
#include <limits>

#include <nlohmann/json.hpp>
#include <openvino/runtime/core.hpp>
#include "openvino/genai/generation_config.hpp"
#include "utils.hpp"


namespace ov {
namespace genai {

GenerationConfig::GenerationConfig(std::string json_path) {
using ov::genai::utils::read_json_param;

std::ifstream f(json_path);
OPENVINO_ASSERT(f.is_open(), "Failed to open '" + json_path + "' with generation config");

nlohmann::json data = nlohmann::json::parse(f);

read_json_param(data, "max_new_tokens", max_new_tokens);
read_json_param(data, "max_length", max_length);
// note that ignore_eos is not present in HF GenerationConfig
read_json_param(data, "num_beam_groups", num_beam_groups);
read_json_param(data, "num_beams", num_beams);
read_json_param(data, "diversity_penalty", diversity_penalty);
read_json_param(data, "length_penalty", length_penalty);
read_json_param(data, "num_return_sequences", num_return_sequences);
read_json_param(data, "no_repeat_ngram_size", no_repeat_ngram_size);
read_json_param(data, "temperature", temperature);
read_json_param(data, "top_p", top_p);
read_json_param(data, "top_k", top_k);
read_json_param(data, "do_sample", do_sample);
read_json_param(data, "repetition_penalty", repetition_penalty);
read_json_param(data, "pad_token_id", pad_token_id);
read_json_param(data, "bos_token_id", bos_token_id);
read_json_param(data, "eos_token_id", eos_token_id);
read_json_param(data, "bos_token", bos_token);
read_json_param(data, "eos_token", eos_token);

if (data.contains("early_stopping")) {
auto field_type = data["early_stopping"].type();
if (field_type == nlohmann::json::value_t::string && data["early_stopping"] == "never") {
stop_criteria = StopCriteria::never;
} else if (field_type == nlohmann::json::value_t::boolean && data["early_stopping"] == true) {
stop_criteria = StopCriteria::early;
} else if (field_type == nlohmann::json::value_t::boolean && data["early_stopping"] == false) {
stop_criteria = StopCriteria::heuristic;
}
}


}

GenerationConfig GenerationConfig::anymap_to_generation_config(const ov::AnyMap& config_map) {
using ov::genai::utils::read_anymap_param;

GenerationConfig config;
read_anymap_param(config_map, "max_new_tokens", config.max_new_tokens);
read_anymap_param(config_map, "max_length", config.max_length);
read_anymap_param(config_map, "ignore_eos", config.ignore_eos);
read_anymap_param(config_map, "num_beam_groups", config.num_beam_groups);
read_anymap_param(config_map, "num_beams", config.num_beams);
read_anymap_param(config_map, "diversity_penalty", config.diversity_penalty);
read_anymap_param(config_map, "length_penalty", config.length_penalty);
read_anymap_param(config_map, "num_return_sequences", config.num_return_sequences);
read_anymap_param(config_map, "no_repeat_ngram_size", config.no_repeat_ngram_size);
read_anymap_param(config_map, "stop_criteria", config.stop_criteria);
read_anymap_param(config_map, "temperature", config.temperature);
read_anymap_param(config_map, "top_p", config.top_p);
read_anymap_param(config_map, "top_k", config.top_k);
read_anymap_param(config_map, "do_sample", config.do_sample);
read_anymap_param(config_map, "repetition_penalty", config.repetition_penalty);
read_anymap_param(config_map, "pad_token_id", config.pad_token_id);
read_anymap_param(config_map, "bos_token_id", config.bos_token_id);
read_anymap_param(config_map, "eos_token_id", config.eos_token_id);
read_anymap_param(config_map, "bos_token", config.bos_token);
read_anymap_param(config_map, "eos_token", config.eos_token);

return config;
}

size_t GenerationConfig::get_max_new_tokens(size_t prompt_length) const {
// max_new_tokens has priority over max_length; max_length is used only if max_new_tokens was not specified
if (max_new_tokens != SIZE_MAX) {
return max_new_tokens;
} else {
return max_length - prompt_length;
}
}

bool GenerationConfig::is_greedy_decoding() const {
return !do_sample && !is_beam_search();
}

bool GenerationConfig::is_beam_search() const {
return num_beams > 1;
}

bool GenerationConfig::is_multimomial() const {
return do_sample;
}

} // namespace genai
} // namespace ov
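
A small sketch of how max_new_tokens and max_length interact through get_max_new_tokens(); the public fields mirror the pybind11 bindings in this PR, and the SIZE_MAX "unset" default for max_new_tokens is an assumption taken from the check above:

#include <cstddef>
#include "openvino/genai/generation_config.hpp"

size_t remaining_budget_example(size_t prompt_length) {
    ov::genai::GenerationConfig config;
    config.max_length = 128;
    // max_new_tokens is left at its "unset" default, so max_length minus the
    // prompt length is returned instead of max_new_tokens itself.
    return config.get_max_new_tokens(prompt_length);  // 128 - prompt_length under these assumptions
}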
130 changes: 130 additions & 0 deletions src/cpp/src/greedy_decoding.cpp
@@ -0,0 +1,130 @@
// Copyright (C) 2023-2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0

#include "openvino/genai/llm_pipeline.hpp"
#include "utils.hpp"

namespace ov {
namespace genai {

EncodedResults greedy_decoding(
ov::InferRequest& m_model_runner,
ov::Tensor input_ids,
ov::Tensor attention_mask,
const ov::genai::GenerationConfig generation_config,
const std::shared_ptr<StreamerBase> streamer,
const bool is_chat_conversation
) {

ov::Shape prompts_shape = input_ids.get_shape();
size_t batch_size = prompts_shape[0];
size_t prompt_len = prompts_shape[1];

auto kv_cache_len = m_model_runner.query_state()[0].get_state().get_shape()[2];

// todo: make this work even if position_ids are not specified
auto position_ids = ov::Tensor{ov::element::i64, input_ids.get_shape()};
utils::initialize_position_ids(position_ids, attention_mask, kv_cache_len);

EncodedResults results;
results.scores.resize(batch_size);
results.tokens.resize(batch_size);
std::fill(results.scores.begin(), results.scores.end(), 0);

if (is_chat_conversation && kv_cache_len > 0) {
auto attentions_mask_history = m_model_runner.get_tensor("attention_mask");

size_t new_prompt_len = attention_mask.get_shape()[1];
size_t context_len = attentions_mask_history.get_shape()[1];
ov::Tensor new_attention_mask = ov::Tensor{ov::element::i64, {1, context_len + new_prompt_len}};

        for (size_t i = 0; i < context_len; ++i) {
            new_attention_mask.data<int64_t>()[i] = attentions_mask_history.data<int64_t>()[i];
        }
        for (size_t i = context_len; i < context_len + new_prompt_len; ++i) {
            // absolute position in the combined mask, relative position in the new prompt's mask
            new_attention_mask.data<int64_t>()[i] = attention_mask.data<int64_t>()[i - context_len];
        }
m_model_runner.set_tensor("attention_mask", new_attention_mask);
} else {
m_model_runner.set_tensor("attention_mask", attention_mask);
}

auto atten_shape = attention_mask.get_shape();
auto pos_shape = position_ids.get_shape();
auto input_ids_shape = input_ids.get_shape();

m_model_runner.set_tensor("input_ids", input_ids);
m_model_runner.set_tensor("position_ids", position_ids);

m_model_runner.get_tensor("beam_idx").set_shape({batch_size});
auto beam_data = m_model_runner.get_tensor("beam_idx").data<int32_t>();
std::iota(beam_data, beam_data + batch_size, 0);

size_t max_tokens = generation_config.get_max_new_tokens(prompt_len);

m_model_runner.infer();
auto logits = m_model_runner.get_tensor("logits");
ov::Shape logits_shape = logits.get_shape();
size_t seq_len = logits_shape[1], vocab_size = logits_shape[2];
m_model_runner.get_tensor("input_ids").set_shape({batch_size, 1});

std::vector<int64_t> token_iter_results(batch_size); // results of a single infer request
std::vector<int> eos_met(batch_size, 0); // use int because can not use std::all_of with vector<bool>
for (size_t batch = 0; batch < batch_size; ++batch) {
auto res = utils::softmax(logits, batch);
auto out_token = res.first;
results.tokens[batch].emplace_back(res.first);
results.scores[batch] += res.second;

token_iter_results[batch] = out_token;
eos_met[batch] = (out_token == generation_config.eos_token_id);
m_model_runner.get_tensor("input_ids").data<int64_t>()[batch] = out_token;
}
if (streamer)
streamer->put(token_iter_results[0]);

bool all_are_eos = std::all_of(eos_met.begin(), eos_met.end(), [](int elem) { return elem == 1; });
if (!generation_config.ignore_eos && all_are_eos)
return results;

for (size_t i = 0; i < max_tokens - 1; ++i) {
utils::update_position_ids(m_model_runner.get_tensor("position_ids"), m_model_runner.get_tensor("attention_mask"));
m_model_runner.set_tensor("attention_mask", utils::extend_attention(m_model_runner.get_tensor("attention_mask")));

// todo: consider replacing with start_async and run callback right after that
m_model_runner.infer();
auto logits = m_model_runner.get_tensor("logits");
ov::Shape logits_shape = logits.get_shape();
size_t seq_len = logits_shape[1], vocab_size = logits_shape[2];

std::vector<int64_t> token_iter_results(batch_size); // results of a single infer request
std::vector<int> eos_met(batch_size, 0); // use int because can not use std::all_of with vector<bool>
for (size_t batch = 0; batch < batch_size; ++batch) {

auto res = ov::genai::utils::softmax(logits, batch);
auto out_token = res.first;
results.tokens[batch].emplace_back(res.first);
results.scores[batch] += res.second;

token_iter_results[batch] = out_token;
eos_met[batch] = (out_token == generation_config.eos_token_id);

m_model_runner.get_tensor("input_ids").data<int64_t>()[batch] = out_token;
}
if (streamer)
streamer->put(token_iter_results[0]);

// stop generation when EOS is met in all batches
bool all_are_eos = std::all_of(eos_met.begin(), eos_met.end(), [](int elem) { return elem == 1; });
if (!generation_config.ignore_eos && all_are_eos)
break;
}
if (streamer)
streamer->end();
return results;
}

} //namespace genai
} //namespace ov
@@ -2,6 +2,10 @@
// SPDX-License-Identifier: Apache-2.0

#include <openvino/runtime/tensor.hpp>
#include "openvino/genai/llm_pipeline.hpp"
#include "utils.hpp"

namespace {

// Modified Knuth–Morris–Pratt algorithm which returns the tokens that follow every needle occurrence in the haystack
std::vector<int64_t> kmp_search(const std::vector<int64_t>& haystack, const std::vector<int64_t>& needle) {
@@ -80,16 +84,14 @@ bool greater(const Beam& left, const Beam& right) {
return left.score > right.score;
}

enum class StopCriteria { early, heuristic, never };

struct Parameters {
std::vector<std::vector<int64_t>> prompts;
int64_t eos_token;
int64_t eos_token_id;
size_t n_groups = 3;
size_t group_size = 5;
float diversity_penalty = 1.0;
size_t max_new_tokens = 20;
StopCriteria stop_criteria = StopCriteria::heuristic;
ov::genai::StopCriteria stop_criteria = ov::genai::StopCriteria::heuristic;
float length_penalty = 1.0;
size_t no_repeat_ngram_size = std::numeric_limits<size_t>::max();

@@ -107,7 +109,7 @@ struct Group {
beam.score /= std::pow(float(beam.tokens.size()), parameters.length_penalty);

// HF implementation counts eos_token for length penalty calculation
if (beam.tokens.back() == parameters.eos_token) {
if (beam.tokens.back() == parameters.eos_token_id) {
beam.tokens.pop_back();
}

@@ -126,15 +128,15 @@ struct Group {
float best_sum_logprobs = ongoing.front().score;
float worst_score = min_heap.front().score;
switch (parameters.stop_criteria) {
case StopCriteria::early:
case ov::genai::StopCriteria::early:
done = true;
return;
case StopCriteria::heuristic: {
case ov::genai::StopCriteria::heuristic: {
float highest_attainable_score = best_sum_logprobs / std::pow(float(cur_len), parameters.length_penalty);
done = worst_score >= highest_attainable_score;
return;
}
case StopCriteria::never: {
case ov::genai::StopCriteria::never: {
size_t length = parameters.length_penalty > 0.0 ? parameters.max_new_tokens : cur_len;
float highest_attainable_score = best_sum_logprobs / std::pow(float(length), parameters.length_penalty);
done = worst_score >= highest_attainable_score;
@@ -267,7 +269,7 @@ struct GroupBeamSearcher {
std::partial_sort(candidates.begin(), to_sort, candidates.end(), greater);
group->ongoing.clear();
for (size_t cand_idx = 0; cand_idx < candidates.size(); ++cand_idx) {
if (parameters.eos_token == candidates.at(cand_idx).tokens.back()) {
if (parameters.eos_token_id == candidates.at(cand_idx).tokens.back()) {
// If beam_token does not belong to top num_beams tokens, it should not be added
if (cand_idx >= parameters.group_size) {
continue;
@@ -313,3 +315,126 @@ std::vector<std::vector<std::vector<Beam>>> finalize(GroupBeamSearcher&& group_b

return finalized;
}

void initialize_inputs(const ov::Tensor& input_ids, const ov::Tensor& attention_mask, ov::InferRequest& request) {
request.set_tensor("input_ids", input_ids);
request.set_tensor("attention_mask", attention_mask);

ov::Shape input_shape = input_ids.get_shape();

ov::Tensor position_ids = request.get_tensor("position_ids");
position_ids.set_shape(input_shape);
ov::genai::utils::initialize_position_ids(position_ids, attention_mask);

ov::Tensor beam_idx = request.get_tensor("beam_idx");
beam_idx.set_shape({input_shape.at(0)});
std::fill_n(beam_idx.data<int32_t>(), input_shape.at(0), 0);
}


void update_attention_mask_with_beams(ov::Tensor&& attention_mask, std::vector<int32_t> next_beams) {
ov::Tensor original_mask{ov::element::i64, attention_mask.get_shape()};
ov::Shape original_shape = original_mask.get_shape();
attention_mask.copy_to(original_mask);

ov::Shape new_shape{next_beams.size(), original_mask.get_shape().at(1) + 1};
attention_mask.set_shape(new_shape);

for (size_t beam_id = 0; beam_id < next_beams.size(); beam_id++) {
const size_t original_prompt_offset = next_beams.at(beam_id) * original_shape.at(1);
const size_t result_prompt_offset = beam_id * new_shape.at(1);

int64_t* dest = attention_mask.data<int64_t>() + result_prompt_offset;
const int64_t* src = original_mask.data<int64_t>() + original_prompt_offset;

std::memcpy(dest, src, original_shape.at(1) * sizeof(int64_t));
attention_mask.data<int64_t>()[result_prompt_offset + new_shape.at(1) - 1] = 1;
}
}

void update_position_ids(ov::Tensor&& position_ids, const ov::Tensor&& attention_mask) {
const size_t batch_size = attention_mask.get_shape().at(0);
const size_t sequence_length = attention_mask.get_shape().at(1);
position_ids.set_shape({batch_size, 1});

for (size_t batch = 0; batch < batch_size; batch++) {
int64_t* mask_start = attention_mask.data<int64_t>() + batch * sequence_length;
position_ids.data<int64_t>()[batch] = std::accumulate(mask_start, mask_start + sequence_length - 1, 0);
}
}

} // namespace


namespace ov {
namespace genai {

EncodedResults beam_search(ov::InferRequest& lm, ov::Tensor input_ids, ov::Tensor attention_mask, GenerationConfig config) {
OPENVINO_ASSERT(config.num_beams % config.num_beam_groups == 0, "number of beams should be divisible by number of groups");

// Initialize beam search
const int64_t* prompt_data = input_ids.data<const int64_t>();
std::vector<std::vector<int64_t>> prompts;
prompts.reserve(input_ids.get_shape().at(0));
for (size_t batch = 0; batch < input_ids.get_shape().at(0); batch++) {
size_t sequence_length = input_ids.get_shape().at(1);
size_t batch_offset = batch * sequence_length;
const int64_t* prompt_start = prompt_data + batch_offset;
prompts.push_back(std::vector<int64_t>{prompt_start, prompt_start + sequence_length});
}

initialize_inputs(input_ids, attention_mask, lm);

Parameters parameters{std::move(prompts)};
parameters.max_new_tokens = config.max_new_tokens;
parameters.eos_token_id = config.eos_token_id;
parameters.n_groups = config.num_beam_groups;
parameters.group_size = config.num_beams / config.num_beam_groups;
parameters.diversity_penalty = config.diversity_penalty;
parameters.length_penalty = config.length_penalty;
parameters.stop_criteria = config.stop_criteria;
parameters.no_repeat_ngram_size = config.no_repeat_ngram_size;
GroupBeamSearcher group_beam_searcher{parameters};

std::vector<int64_t> next_tokens;
std::vector<int32_t> next_beams;

for (size_t length_count = 0; length_count < parameters.max_new_tokens; ++length_count) {
lm.infer();

std::tie(next_tokens, next_beams) = group_beam_searcher.select_next_tokens(lm.get_tensor("logits"));
if (next_tokens.empty()) {
break;
}
size_t batch_size = next_tokens.size();
// Set pointers
lm.set_tensor("input_ids", ov::Tensor{ov::element::i64, {batch_size, 1}, next_tokens.data()});
lm.set_tensor("beam_idx", ov::Tensor{ov::element::i32, {batch_size}, next_beams.data()});
// Set auxiliary inputs
update_attention_mask_with_beams(lm.get_tensor("attention_mask"), next_beams);
update_position_ids(lm.get_tensor("position_ids"), lm.get_tensor("attention_mask"));
}

std::vector<Beam> beams;
for (const std::vector<std::vector<Beam>>& prompt_group : finalize(std::move(group_beam_searcher))) {
for (const std::vector<Beam>& group : prompt_group) {
for (const Beam& beam : group) {
beams.emplace_back(beam);
}
}
}

// return sorted scores
auto compare_scores = [](Beam left, Beam right) { return (left.score > right.score); };
std::sort(beams.begin(), beams.end(), compare_scores);

ov::genai::EncodedResults results;
for (auto beam = beams.begin(); beam != beams.begin() + config.num_return_sequences; ++beam) {
results.scores.emplace_back(beam->score);
results.tokens.emplace_back(beam->tokens);
}
return results;
}

} // namespace genai
} // namespace ov
408 changes: 408 additions & 0 deletions src/cpp/src/llm_pipeline.cpp

Large diffs are not rendered by default.

75 changes: 75 additions & 0 deletions src/cpp/src/text_callback_streamer.cpp
@@ -0,0 +1,75 @@
#include "text_callback_streamer.hpp"

namespace ov {
namespace genai {

TextCallbackStreamer::TextCallbackStreamer(const Tokenizer& tokenizer, std::function<void (std::string)> callback, bool print_eos_token) {
m_tokenizer = tokenizer;
m_print_eos_token = print_eos_token;
on_decoded_text_callback = callback;
m_enabled = true;
}

TextCallbackStreamer::TextCallbackStreamer(const Tokenizer& tokenizer, bool print_eos_token) {
m_tokenizer = tokenizer;
m_print_eos_token = print_eos_token;
}

void TextCallbackStreamer::put(int64_t token) {
std::stringstream res;
// do nothing if the <eos> token is met and print_eos_token == false
if (!m_print_eos_token && token == m_tokenizer.get_eos_token_id())
return;

m_tokens_cache.push_back(token);
std::string text = m_tokenizer.decode(m_tokens_cache);
if (!text.empty() && '\n' == text.back()) {
// Flush the cache after the new line symbol
res << std::string_view{text.data() + print_len, text.size() - print_len};
m_tokens_cache.clear();
print_len = 0;
on_finalized_text(res.str());
return;
}
    if (text.size() >= 3 && text.compare(text.size() - 3, 3, "\xEF\xBF\xBD") == 0) {
        // Don't print incomplete text: the trailing bytes are the UTF-8 replacement character U+FFFD
on_finalized_text(res.str());
return;
}
res << std::string_view{text.data() + print_len, text.size() - print_len} << std::flush;
print_len = text.size();
on_finalized_text(res.str());
return;
}

void TextCallbackStreamer::end() {
std::stringstream res;
std::string text = m_tokenizer.decode(m_tokens_cache);
res << std::string_view{text.data() + print_len, text.size() - print_len} << std::flush;
m_tokens_cache.clear();
print_len = 0;
on_finalized_text(res.str());
}

void TextCallbackStreamer::set_tokenizer(Tokenizer tokenizer) {
this->m_tokenizer = tokenizer;
}

void TextCallbackStreamer::set_callback(std::function<void (std::string)> callback) {
on_decoded_text_callback = callback;
m_enabled = true;
}

void TextCallbackStreamer::set_callback() {
on_decoded_text_callback = [](std::string words){};
m_enabled = false;
}

void TextCallbackStreamer::on_finalized_text(const std::string& subword) {
if (m_enabled) {
on_decoded_text_callback(subword);
}
}

} // namespace genai
} // namespace ov
37 changes: 37 additions & 0 deletions src/cpp/src/text_callback_streamer.hpp
@@ -0,0 +1,37 @@
// Copyright (C) 2023-2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
#pragma once

#include "openvino/genai/streamer_base.hpp"
#include "openvino/genai/tokenizer.hpp"

namespace ov {
namespace genai {

class TextCallbackStreamer: public StreamerBase {
public:
void put(int64_t token) override;
void end() override;

TextCallbackStreamer(const Tokenizer& tokenizer, std::function<void (std::string)> callback, bool print_eos_token = false);
TextCallbackStreamer(const Tokenizer& tokenizer, bool print_eos_token = false);
TextCallbackStreamer() = default;
~TextCallbackStreamer() = default;

void set_tokenizer(Tokenizer tokenizer);
void set_callback(std::function<void (std::string)> callback);
void set_callback();

std::function<void (std::string)> on_decoded_text_callback = [](std::string words){};
bool m_enabled = false;
int64_t m_eos_token;
private:
bool m_print_eos_token = false;
Tokenizer m_tokenizer;
std::vector<int64_t> m_tokens_cache;
size_t print_len = 0;
void on_finalized_text(const std::string& subword);
};

} // namespace genai
} // namespace ov
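
A hedged sketch of wiring this streamer to a lambda; the callback body and the token id are illustrative:

#include <iostream>
#include "text_callback_streamer.hpp"

void stream_to_stdout(const ov::genai::Tokenizer& tokenizer) {
    ov::genai::TextCallbackStreamer streamer(tokenizer, [](std::string subword) {
        std::cout << subword << std::flush;  // receives only finalized text pieces
    });
    streamer.put(42);   // 42 is an arbitrary token id; tokens are cached and decoded through the tokenizer
    streamer.end();     // flushes whatever remains in the cache
}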
201 changes: 201 additions & 0 deletions src/cpp/src/tokenizer.cpp
@@ -0,0 +1,201 @@
// Copyright (C) 2023-2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0

#include <openvino/openvino.hpp>
#include "openvino/genai/tokenizer.hpp"
#include "utils.hpp"

namespace {

// todo: remove when openvino-tokenizers supports left padding
std::pair<ov::Tensor, ov::Tensor> pad_left(ov::Tensor&& input_ids, ov::Tensor&& attention_mask, int64_t pad_token) {
const size_t batch_size = input_ids.get_shape()[0];
const size_t sequence_length = input_ids.get_shape()[1];
int64_t* inputs_data = input_ids.data<int64_t>();
int64_t* attention_mask_data = attention_mask.data<int64_t>();

for (size_t batch = 0; batch < batch_size; batch++) {
const size_t batch_offset = batch * sequence_length;

// last token in the sequence is not a PAD_TOKEN, skipping
if (inputs_data[batch_offset + sequence_length - 1] != pad_token)
continue;

size_t pad_tokens_number = 0;
for (int i = sequence_length - 1; i >= 0; i--) {
const size_t token_offset = batch_offset + i;

if (inputs_data[token_offset] == pad_token)
continue;

if (pad_tokens_number == 0)
pad_tokens_number = sequence_length - i - 1;

std::swap(inputs_data[token_offset], inputs_data[token_offset + pad_tokens_number]);
std::swap(attention_mask_data[token_offset], attention_mask_data[token_offset + pad_tokens_number]);
}
}

return {input_ids, attention_mask};
}

}

namespace ov {
namespace genai {

class Tokenizer::TokenizerImpl {
public:
ov::InferRequest m_tokenize_request;
ov::InferRequest m_detokenizer_request;
int64_t m_pad_token_id = 0;
int64_t m_bos_token_id = 1;
int64_t m_eos_token_id = 2;

TokenizerImpl() = default;
TokenizerImpl(std::string tokenizers_path, const std::string device, const std::string& ov_tokenizers_path) {
ov::Core core;

if (ov::genai::utils::is_xml(tokenizers_path))
OPENVINO_THROW("tokenizers_path should be a path to a dir not a xml file");

core.add_extension(ov_tokenizers_path);
std::shared_ptr<ov::Model> tokenizer_model, detokenizer_model;
try {
tokenizer_model = core.read_model(tokenizers_path + "/openvino_tokenizer.xml");
detokenizer_model = core.read_model(tokenizers_path + "/openvino_detokenizer.xml");
} catch (...) {
OPENVINO_THROW("Cannot compile tokenizer and/or detokenizer. Please check that "
"openvino_tokenizer.xml and openvino_detokenizer.xml exist in \"" + tokenizers_path + "\"");
}
m_tokenize_request = core.compile_model(tokenizer_model, device).create_infer_request();
m_detokenizer_request = core.compile_model(detokenizer_model, device).create_infer_request();

auto rt_info = tokenizer_model->get_rt_info();
if (rt_info.count("eos_token_id") > 0)
m_eos_token_id = rt_info["eos_token_id"].as<int64_t>();
if (rt_info.count("bos_token_id") > 0)
m_bos_token_id = rt_info["bos_token_id"].as<int64_t>();
if (rt_info.count("pad_token_id") > 0)
m_pad_token_id = rt_info["pad_token_id"].as<int64_t>();
}

std::pair<ov::Tensor, ov::Tensor> encode(std::string prompt) {
size_t batch_size = 1;
m_tokenize_request.set_input_tensor(ov::Tensor{ov::element::string, {batch_size}, &prompt});
m_tokenize_request.infer();
return {m_tokenize_request.get_tensor("input_ids"), m_tokenize_request.get_tensor("attention_mask")};
}

std::pair<ov::Tensor, ov::Tensor> encode(std::vector<std::string>& prompts) {
m_tokenize_request.set_input_tensor(ov::Tensor{ov::element::string, {prompts.size()}, prompts.data()});
auto size_ = m_tokenize_request.get_input_tensor().get_shape();
m_tokenize_request.infer();
pad_left(m_tokenize_request.get_tensor("input_ids"), m_tokenize_request.get_tensor("attention_mask"), m_pad_token_id);

// todo: fix mask filled with '2' instead of '0'
// https://github.com/openvinotoolkit/openvino_tokenizers/pull/90 should've fixed this
ov::Tensor attention_mask = m_tokenize_request.get_tensor("attention_mask");
int64_t* attention_mask_data = attention_mask.data<int64_t>();
std::replace(attention_mask_data, attention_mask_data + attention_mask.get_size(), 2, 0);

return {m_tokenize_request.get_tensor("input_ids"), m_tokenize_request.get_tensor("attention_mask")};
}

std::string decode(std::vector<int64_t> tokens) {
size_t batch_size = 1;
m_detokenizer_request.set_input_tensor(ov::Tensor{ov::element::i64, {batch_size, tokens.size()}, tokens.data()});
m_detokenizer_request.infer();
return m_detokenizer_request.get_output_tensor().data<std::string>()[0];
}

std::vector<std::string> decode(ov::Tensor tokens) {
m_detokenizer_request.set_input_tensor(tokens);
auto shape = tokens.get_shape();
auto data = tokens.data<int64_t>();
m_detokenizer_request.infer();
auto res = m_detokenizer_request.get_output_tensor();

std::vector<std::string> strings;
for (int i = 0; i < res.get_shape()[0]; ++i) {
strings.emplace_back(res.data<std::string>()[i]);
}
return strings;
}

std::vector<std::string> decode(std::vector<std::vector<int64_t>> lines) {
// todo: implement calling detokenizer in a single batch
std::vector<std::string> results;
for (auto& line: lines){
ov::Tensor tokens = ov::Tensor{ov::element::i64, {1, line.size()}, line.data()};
m_detokenizer_request.set_input_tensor(tokens);
m_detokenizer_request.infer();
auto res = m_detokenizer_request.get_output_tensor();
auto res_str = res.data<std::string>()[0];
results.emplace_back(res_str);
}

return results;
}
};

Tokenizer::Tokenizer(const std::string& tokenizers_path, const std::string& device, const std::string& ov_tokenizers_path) {
m_pimpl = std::make_shared<TokenizerImpl>(tokenizers_path, device, ov_tokenizers_path);
}

std::pair<ov::Tensor, ov::Tensor> Tokenizer::encode(const std::string prompt) {
return m_pimpl->encode(std::move(prompt));
}

std::pair<ov::Tensor, ov::Tensor> Tokenizer::encode(std::vector<std::string>& prompts) {
return m_pimpl->encode(prompts);
}

std::pair<ov::Tensor, ov::Tensor> Tokenizer::encode(std::vector<std::string>&& prompts) {
return m_pimpl->encode(prompts);
}

std::pair<ov::Tensor, ov::Tensor> Tokenizer::encode(std::initializer_list<std::string>& text) {
return encode(std::vector<std::string>(text.begin(), text.end()));
}

std::string Tokenizer::decode(std::vector<int64_t> tokens) {
return m_pimpl->decode(tokens);
}

std::vector<std::string> Tokenizer::decode(ov::Tensor tokens) {
return m_pimpl->decode(tokens);
}

std::vector<std::string> Tokenizer::decode(std::vector<std::vector<int64_t>> lines) {
return m_pimpl->decode(lines);
}

int64_t Tokenizer::get_bos_token_id() const {
return m_pimpl->m_bos_token_id;
}

int64_t Tokenizer::get_eos_token_id() const {
return m_pimpl->m_eos_token_id;
}

int64_t Tokenizer::get_pad_token_id() const {
return m_pimpl->m_pad_token_id;
}

void Tokenizer::set_pad_token_id(int64_t pad_token_id) {
m_pimpl->m_pad_token_id = pad_token_id;
}

void Tokenizer::set_bos_token_id(int64_t bos_token_id) {
m_pimpl->m_bos_token_id = bos_token_id;
}

void Tokenizer::set_eos_token_id(int64_t eos_token_id) {
m_pimpl->m_eos_token_id = eos_token_id;
}

Tokenizer::~Tokenizer() = default;

} // namespace genai
} // namespace ov
141 changes: 141 additions & 0 deletions src/cpp/src/utils.cpp
@@ -0,0 +1,141 @@
// Copyright (C) 2023-2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0

#include "utils.hpp"

namespace ov {
namespace genai {
namespace utils {

Tensor init_attention_mask(Tensor& position_ids) {
auto shape = position_ids.get_shape();
auto attention_mask = ov::Tensor{position_ids.get_element_type(), shape};
std::fill_n(attention_mask.data<int64_t>(), shape[0] * shape[1], 1);
return attention_mask;
}

void print_tensor(const ov::Tensor& tensor) {
std::vector<int64_t> res;

auto t_shape = tensor.get_shape();
std::cout << "[";
for (size_t i = 0; i < t_shape[0]; ++i) {
std::cout << "|";
for (size_t j = 0; j < t_shape[1]; ++j) {
if (tensor.get_element_type() == ov::element::i64) {
res.emplace_back(tensor.data<int64_t>()[t_shape[1] * i + j]);
std::cout << tensor.data<int64_t>()[t_shape[1] * i + j] << " ";
}
}
std::cout << "|";
}
std::cout << "]" << std::endl;
}

bool is_xml(const std::string& path) { return path.size() >= 4 && path.compare(path.size() - 4, 4, ".xml") == 0; }

std::pair<int64_t, float> softmax(const ov::Tensor& logits, const size_t batch_idx) {
if (logits.get_shape()[0] <= batch_idx) {
OPENVINO_THROW("logits batch size doesn't match the number of beams");
}

size_t vocab_size = logits.get_shape().back();
size_t batch_offset = batch_idx * logits.get_shape()[1] * vocab_size;
size_t sequence_offset = (logits.get_shape()[1] - 1) * vocab_size;
const float* logits_data = logits.data<const float>() + batch_offset + sequence_offset;

int64_t out_token = std::max_element(logits_data, logits_data + vocab_size) - logits_data;
float max_logit = logits_data[out_token];

float log_sum = std::log(
std::accumulate(logits_data, logits_data + vocab_size, 0.0f, [max_logit](float accumulated, float to_add) {
return accumulated + std::exp(to_add - max_logit);
}));
return {out_token, log_sum};
}

void initialize_position_ids(ov::Tensor& position_ids, const ov::Tensor& attention_mask, int64_t start_pos) {
const size_t batch_size = attention_mask.get_shape()[0];
const size_t seq_length = attention_mask.get_shape()[1];

const int64_t* attention_mask_data = attention_mask.data<int64_t>();
int64_t* position_ids_data = position_ids.data<int64_t>();

for (size_t batch = 0; batch < batch_size; batch++) {
size_t sum = start_pos;
for (size_t i = 0; i < seq_length; i++) {
const size_t element_offset = batch * seq_length + i;
position_ids_data[element_offset] = sum;
if (attention_mask_data[element_offset] == 1) {
sum += 1;
}
}
}
}

void initialize_beam_inputs(const ov::Tensor& input_ids, const ov::Tensor& attention_mask, ov::InferRequest& request) {
request.set_tensor("input_ids", input_ids);
request.set_tensor("attention_mask", attention_mask);

ov::Shape input_shape = input_ids.get_shape();

ov::Tensor position_ids = request.get_tensor("position_ids");
position_ids.set_shape(input_shape);
initialize_position_ids(position_ids, attention_mask);

ov::Tensor beam_idx = request.get_tensor("beam_idx");
beam_idx.set_shape({input_shape.at(0)});
std::fill_n(beam_idx.data<int32_t>(), input_shape.at(0), 0);
}


void set_attention_mask(ov::Tensor&& attention_mask, std::vector<int32_t> next_beams) {
ov::Tensor original_mask{ov::element::i64, attention_mask.get_shape()};
ov::Shape original_shape = original_mask.get_shape();
attention_mask.copy_to(original_mask);

ov::Shape new_shape{next_beams.size(), original_mask.get_shape().at(1) + 1};
attention_mask.set_shape(new_shape);

for (size_t beam_id = 0; beam_id < next_beams.size(); beam_id++) {
const size_t original_prompt_offset = next_beams.at(beam_id) * original_shape.at(1);
const size_t result_prompt_offset = beam_id * new_shape.at(1);

int64_t* dest = attention_mask.data<int64_t>() + result_prompt_offset;
const int64_t* src = original_mask.data<int64_t>() + original_prompt_offset;

std::memcpy(dest, src, original_shape.at(1) * sizeof(int64_t));
attention_mask.data<int64_t>()[result_prompt_offset + new_shape.at(1) - 1] = 1;
}
}

void update_position_ids(ov::Tensor&& position_ids, const ov::Tensor&& attention_mask) {
const size_t batch_size = attention_mask.get_shape().at(0);
const size_t atten_length = attention_mask.get_shape().at(1);
position_ids.set_shape({batch_size, 1});

for (size_t batch = 0; batch < batch_size; batch++) {
int64_t* start = attention_mask.data<int64_t>() + batch * atten_length;
// todo: be careful with start + atten_length, probably need to replace with start + atten_length -1
position_ids.data<int64_t>()[batch] = std::accumulate(start, start + atten_length, 0);
}
}

ov::Tensor extend_attention(ov::Tensor attention_mask) {
auto shape = attention_mask.get_shape();
auto batch_size = shape[0];
auto seq_len = shape[1];

ov::Tensor new_atten_mask = ov::Tensor{attention_mask.get_element_type(), {batch_size, seq_len + 1}};
auto old_data = attention_mask.data<int64_t>();
auto new_data = new_atten_mask.data<int64_t>();
for (size_t batch = 0; batch < batch_size; ++batch) {
std::memcpy(new_data + batch * (seq_len + 1), old_data + batch * seq_len, seq_len * sizeof(int64_t));
new_data[batch * (seq_len + 1) + seq_len] = 1;
}
return new_atten_mask;
}

} // namespace utils
} // namespace genai
} // namespace ov
65 changes: 65 additions & 0 deletions src/cpp/src/utils.hpp
@@ -0,0 +1,65 @@
// Copyright (C) 2023-2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0

#pragma once

#include <openvino/openvino.hpp>
#include <nlohmann/json.hpp>

namespace ov {
namespace genai {
namespace utils {

Tensor init_attention_mask(Tensor& position_ids);

void print_tensor(const ov::Tensor& tensor);

std::pair<int64_t, float> softmax(const ov::Tensor& logits, const size_t batch_idx);

void initialize_position_ids(ov::Tensor& position_ids, const ov::Tensor& attention_mask, int64_t start_pos = 0);

ov::Tensor extend_attention(ov::Tensor attention_mask);

void update_position_ids(ov::Tensor&& position_ids, const ov::Tensor&& attention_mask);

bool is_xml(const std::string& path);

template <typename>
struct json_type_traits {};

template <>
struct json_type_traits<int> { static constexpr auto json_value_t = nlohmann::json::value_t::number_integer; };

template <>
struct json_type_traits<int64_t> { static constexpr auto json_value_t = nlohmann::json::value_t::number_integer; };

template <>
struct json_type_traits<size_t> { static constexpr auto json_value_t = nlohmann::json::value_t::number_unsigned; };

template <>
struct json_type_traits<float> { static constexpr auto json_value_t = nlohmann::json::value_t::number_float; };

template <>
struct json_type_traits<std::string> { static constexpr auto json_value_t = nlohmann::json::value_t::string; };

template <>
struct json_type_traits<bool> { static constexpr auto json_value_t = nlohmann::json::value_t::boolean; };

template <typename T>
void read_json_param(const nlohmann::json& data, const std::string& name, T& param) {
if (data.contains(name) && data[name].type() == json_type_traits<T>::json_value_t) {
param = data[name];
}
}

template <typename T>
void read_anymap_param(const ov::AnyMap& config_map, const std::string& name, T& param) {
if (config_map.count(name)) {
param = config_map.at(name).as<T>();
}
}

} // namespace utils
} // namespace genai
} // namespace ov
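
A brief sketch of how callers in this PR are expected to use these helpers; the option names and values are illustrative:

#include <nlohmann/json.hpp>
#include "utils.hpp"

void read_options_example(const ov::AnyMap& options) {
    size_t max_new_tokens = 20;  // keeps its default when the key is absent
    ov::genai::utils::read_anymap_param(options, "max_new_tokens", max_new_tokens);

    nlohmann::json data = nlohmann::json::parse(R"({"do_sample": true, "temperature": 0.7})");
    bool do_sample = false;
    float temperature = 1.0f;
    ov::genai::utils::read_json_param(data, "do_sample", do_sample);      // boolean matches json_type_traits<bool>
    ov::genai::utils::read_json_param(data, "temperature", temperature);  // 0.7 parses as number_float and matches json_type_traits<float>
}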

49 changes: 49 additions & 0 deletions src/python/CMakeLists.txt
@@ -0,0 +1,49 @@
# Copyright (C) 2018-2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
#

include(FetchContent)
FetchContent_Declare(
pybind11
URL https://github.com/pybind/pybind11/archive/refs/tags/v2.12.0.tar.gz
URL_HASH SHA256=bf8f242abd1abcd375d516a7067490fb71abd79519a282d22b6e4d19282185a7
)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
FetchContent_GetProperties(pybind11)
if(NOT pybind11_POPULATED)
FetchContent_Populate(pybind11)
add_subdirectory(${pybind11_SOURCE_DIR} ${pybind11_BINARY_DIR})
endif()

pybind11_add_module(py_generate_pipeline py_generate_pipeline.cpp)
target_link_libraries(py_generate_pipeline PRIVATE openvino::genai)
set_target_properties(py_generate_pipeline PROPERTIES
LIBRARY_OUTPUT_DIRECTORY "$<1:${CMAKE_BINARY_DIR}/openvino_genai/>"
)
file(COPY "${CMAKE_CURRENT_SOURCE_DIR}/openvino_genai/__init__.py" DESTINATION "${CMAKE_BINARY_DIR}/openvino_genai/")
write_file("${CMAKE_BINARY_DIR}/openvino_genai/__version__.py" "__version__ = \"${CMAKE_PROJECT_VERSION}\"")

# setting RPATH / LC_RPATH depending on platform
if(LINUX)
# to find libgenai.so in the same folder
set(rpaths "$ORIGIN")
elseif(APPLE)
# to find libgenai.dylib in the same folder
set(rpaths "@loader_path")
if(DEFINED SKBUILD)
# in case we build pip package, we need to refer to libopenvino.dylib from 'openvino' package
list(APPEND rpaths "@loader_path/../openvino/libs")
endif()
endif()

if(rpaths)
set_target_properties(py_generate_pipeline PROPERTIES INSTALL_RPATH "${rpaths}")
endif()

find_package(Python3 REQUIRED COMPONENTS Interpreter Development)
install(FILES "${CMAKE_BINARY_DIR}/openvino_genai/__init__.py" "${CMAKE_BINARY_DIR}/openvino_genai/__version__.py" DESTINATION python/openvino_genai/ COMPONENT pygenai_${Python_VERSION_MAJOR}_${Python_VERSION_MINOR})
install(TARGETS genai py_generate_pipeline LIBRARY DESTINATION python/openvino_genai/ COMPONENT pygenai_${Python_VERSION_MAJOR}_${Python_VERSION_MINOR})

# wheel_genai component is used for wheel generation in pyproject.toml.
# Exclude wheel_genai from normal packaging process because there's pygenai_X_Y component for that.
install(TARGETS genai py_generate_pipeline LIBRARY DESTINATION . COMPONENT wheel_genai RUNTIME DESTINATION . COMPONENT wheel_genai EXCLUDE_FROM_ALL)
14 changes: 14 additions & 0 deletions src/python/openvino_genai/__init__.py
@@ -0,0 +1,14 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import openvino # add_dll_directory for openvino lib
import os
from .__version__ import __version__


if hasattr(os, "add_dll_directory"):
os.add_dll_directory(os.path.dirname(__file__))

from .py_generate_pipeline import LLMPipeline, Tokenizer, GenerationConfig, DecodedResults, EncodedResults

__all__ = ['LLMPipeline', 'Tokenizer', 'GenerationConfig', 'DecodedResults', 'EncodedResults']
2 changes: 2 additions & 0 deletions src/python/openvino_genai/__version__.py
@@ -0,0 +1,2 @@
# Will be overwritten by pyproject.toml or cmake.
__version__ = "0.0.0.0"
225 changes: 225 additions & 0 deletions src/python/py_generate_pipeline.cpp
@@ -0,0 +1,225 @@
// Copyright (C) 2023-2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0

#include <filesystem>
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
#include <pybind11/functional.h>
#include "openvino/genai/llm_pipeline.hpp"

#ifdef _WIN32
# include <windows.h>
# define MAX_ABS_PATH _MAX_PATH
# define get_absolute_path(result, path) _fullpath(result, path.c_str(), MAX_ABS_PATH)
#else
# include <dlfcn.h>
# include <limits.h>
# define MAX_ABS_PATH PATH_MAX
# define get_absolute_path(result, path) realpath(path.c_str(), result)
namespace {
std::string get_absolute_file_path(const std::string& path) {
std::string absolutePath;
absolutePath.resize(MAX_ABS_PATH);
std::ignore = get_absolute_path(&absolutePath[0], path);
if (!absolutePath.empty()) {
// on Linux if file does not exist or no access, function will return NULL, but
// `absolutePath` will contain resolved path
absolutePath.resize(absolutePath.find('\0'));
return std::string(absolutePath);
}
std::stringstream ss;
ss << "Can't get absolute file path for [" << path << "], err = " << strerror(errno);
throw std::runtime_error(ss.str());
}
}
#endif

namespace py = pybind11;
using ov::genai::LLMPipeline;
using ov::genai::Tokenizer;
using ov::genai::GenerationConfig;
using ov::genai::EncodedResults;
using ov::genai::DecodedResults;
using ov::genai::StopCriteria;
using ov::genai::StreamerBase;

namespace {
void str_to_stop_criteria(GenerationConfig& config, const std::string& stop_criteria_str){
if (stop_criteria_str == "early") config.stop_criteria = StopCriteria::early;
else if (stop_criteria_str == "never") config.stop_criteria = StopCriteria::never;
else if (stop_criteria_str == "heuristic") config.stop_criteria = StopCriteria::heuristic;
else OPENVINO_THROW(stop_criteria_str + " is incorrect value of stop_criteria. "
"Allowed values are: \"early\", \"never\", \"heuristic\". ");
}

std::string stop_criteria_to_str(const GenerationConfig& config) {
switch (config.stop_criteria) {
case StopCriteria::early: return "early";
case StopCriteria::heuristic: return "heuristic";
case StopCriteria::never: return "never";
default: throw std::runtime_error("Incorrect stop_criteria");
}
}

void update_config_from_kwargs(GenerationConfig& config, const py::kwargs& kwargs) {
if (kwargs.contains("max_new_tokens")) config.max_new_tokens = kwargs["max_new_tokens"].cast<size_t>();
if (kwargs.contains("max_length")) config.max_length = kwargs["max_length"].cast<size_t>();
if (kwargs.contains("ignore_eos")) config.ignore_eos = kwargs["ignore_eos"].cast<bool>();
if (kwargs.contains("num_beam_groups")) config.num_beam_groups = kwargs["num_beam_groups"].cast<size_t>();
if (kwargs.contains("num_beams")) config.num_beams = kwargs["num_beams"].cast<size_t>();
if (kwargs.contains("diversity_penalty")) config.diversity_penalty = kwargs["diversity_penalty"].cast<float>();
if (kwargs.contains("length_penalty")) config.length_penalty = kwargs["length_penalty"].cast<float>();
if (kwargs.contains("num_return_sequences")) config.num_return_sequences = kwargs["num_return_sequences"].cast<size_t>();
if (kwargs.contains("no_repeat_ngram_size")) config.no_repeat_ngram_size = kwargs["no_repeat_ngram_size"].cast<size_t>();
if (kwargs.contains("stop_criteria")) str_to_stop_criteria(config, kwargs["stop_criteria"].cast<std::string>());
if (kwargs.contains("temperature")) config.temperature = kwargs["temperature"].cast<float>();
if (kwargs.contains("top_p")) config.top_p = kwargs["top_p"].cast<float>();
if (kwargs.contains("top_k")) config.top_k = kwargs["top_k"].cast<size_t>();
if (kwargs.contains("do_sample")) config.do_sample = kwargs["do_sample"].cast<bool>();
if (kwargs.contains("repetition_penalty")) config.repetition_penalty = kwargs["repetition_penalty"].cast<float>();
if (kwargs.contains("pad_token_id")) config.pad_token_id = kwargs["pad_token_id"].cast<int64_t>();
if (kwargs.contains("bos_token_id")) config.bos_token_id = kwargs["bos_token_id"].cast<int64_t>();
if (kwargs.contains("eos_token_id")) config.eos_token_id = kwargs["eos_token_id"].cast<int64_t>();
if (kwargs.contains("eos_token")) config.eos_token = kwargs["eos_token"].cast<std::string>();
if (kwargs.contains("bos_token")) config.bos_token = kwargs["bos_token"].cast<std::string>();
}

// operator() and generate methods are identical, operator() is just an alias for generate
std::string call_with_kwargs(LLMPipeline& pipeline, const std::string& text, const py::kwargs& kwargs) {
// Create a new GenerationConfig instance and initialize from kwargs
GenerationConfig config = pipeline.get_generation_config();
update_config_from_kwargs(config, kwargs);
return pipeline(text, config);
}

std::string call_with_config(LLMPipeline& pipe, const std::string& text, const GenerationConfig& config) {
std::shared_ptr<StreamerBase> streamer;
return pipe(text, config);
}

std::filesystem::path with_openvino_tokenizers(const std::filesystem::path& path) {
#ifdef _WIN32
constexpr char tokenizers[] = "openvino_tokenizers.dll";
#elif __linux__
constexpr char tokenizers[] = "libopenvino_tokenizers.so";
#elif __APPLE__
constexpr char tokenizers[] = "libopenvino_tokenizers.dylib";
#endif
return path.parent_path() / tokenizers;
}

std::string get_ov_genai_bindings_path() {
#ifdef _WIN32
CHAR genai_library_path[MAX_PATH];
HMODULE hm = NULL;
if (!GetModuleHandleExA(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS | GET_MODULE_HANDLE_EX_FLAG_UNCHANGED_REFCOUNT,
reinterpret_cast<LPSTR>(get_ov_genai_bindings_path),
&hm)) {
std::stringstream ss;
ss << "GetModuleHandle returned " << GetLastError();
throw std::runtime_error(ss.str());
}
GetModuleFileNameA(hm, (LPSTR)genai_library_path, sizeof(genai_library_path));
return std::string(genai_library_path);
#elif defined(__APPLE__) || defined(__linux__) || defined(__EMSCRIPTEN__)
Dl_info info;
dladdr(reinterpret_cast<void*>(get_ov_genai_bindings_path), &info);
return get_absolute_file_path(info.dli_fname).c_str();
#else
# error "Unsupported OS"
#endif // _WIN32
}

std::string ov_tokenizers_module_path() {
// Try a path relative to build artifacts folder first.
std::filesystem::path from_library = with_openvino_tokenizers(get_ov_genai_bindings_path());
if (std::filesystem::exists(from_library)) {
return from_library.string();
}
return py::str(py::module_::import("openvino_tokenizers").attr("_ext_path"));
}
}

PYBIND11_MODULE(py_generate_pipeline, m) {
m.doc() = "Pybind11 binding for LLM Pipeline";

py::class_<LLMPipeline>(m, "LLMPipeline")
.def(py::init<const std::string, const Tokenizer&, const std::string, const ov::AnyMap&>(),
py::arg("model_path"), py::arg("tokenizer"), py::arg("device") = "CPU",
py::arg("plugin_config") = ov::AnyMap{})
.def(py::init<std::string&, std::string, const ov::AnyMap&, const std::string>(),
py::arg("path"), py::arg("device") = "CPU", py::arg("plugin_config") = ov::AnyMap{}, py::arg("ov_tokenizers_path") = ov_tokenizers_module_path())
.def("__call__", py::overload_cast<LLMPipeline&, const std::string&, const py::kwargs&>(&call_with_kwargs))
.def("__call__", py::overload_cast<LLMPipeline&, const std::string&, const GenerationConfig&>(&call_with_config))
.def("generate", py::overload_cast<LLMPipeline&, const std::string&, const py::kwargs&>(&call_with_kwargs))
.def("generate", py::overload_cast<LLMPipeline&, const std::string&, const GenerationConfig&>(&call_with_config))

// todo: if input_ids is a ov::Tensor/numpy tensor
// todo: implement calling generate/operator() with StreamerBase or lambda streamer
// signature to be implemented:
// EncodedResults generate(ov::Tensor input_ids,
// std::optional<ov::Tensor> attention_mask,
// OptionalGenerationConfig generation_config=nullopt,
// OptionalStreamerVariant streamer=nullopt);


.def("get_tokenizer", &LLMPipeline::get_tokenizer)
.def("start_chat", &LLMPipeline::start_chat)
.def("finish_chat", &LLMPipeline::finish_chat)
.def("reset_state", &LLMPipeline::reset_state)
.def("get_generation_config", &LLMPipeline::get_generation_config, py::return_value_policy::copy)
.def("set_generation_config", &LLMPipeline::set_generation_config)
.def("apply_chat_template", &LLMPipeline::apply_chat_template);

// Binding for Tokenizer
py::class_<Tokenizer>(m, "Tokenizer")
.def(py::init<>())
.def(py::init<std::string&, const std::string&, const std::string&>(),
py::arg("tokenizers_path"),
py::arg("device") = "CPU",
py::arg("ov_tokenizers_path") = py::str(ov_tokenizers_module_path()))

// todo: implement encode/decode when for numpy inputs and outputs
.def("encode", py::overload_cast<const std::string>(&Tokenizer::encode), "Encode a single prompt")
// TODO: common.h(1106...) template argument deduction/substitution failed:
// .def("encode", py::overload_cast<std::vector<std::string>&>(&Tokenizer::encode), "Encode multiple prompts")
.def("decode", py::overload_cast<std::vector<int64_t>>(&Tokenizer::decode), "Decode a list of tokens")
.def("decode", py::overload_cast<ov::Tensor>(&Tokenizer::decode), "Decode a tensor of tokens")
.def("decode", py::overload_cast<std::vector<std::vector<int64_t>>>(&Tokenizer::decode), "Decode multiple lines of tokens");

// Binding for GenerationConfig
py::class_<GenerationConfig>(m, "GenerationConfig")
.def(py::init<>())
.def(py::init<std::string>())
.def_readwrite("max_new_tokens", &GenerationConfig::max_new_tokens)
.def_readwrite("max_length", &GenerationConfig::max_length)
.def_readwrite("ignore_eos", &GenerationConfig::ignore_eos)
.def_readwrite("num_beam_groups", &GenerationConfig::num_beam_groups)
.def_readwrite("num_beams", &GenerationConfig::num_beams)
.def_readwrite("diversity_penalty", &GenerationConfig::diversity_penalty)
.def_readwrite("length_penalty", &GenerationConfig::length_penalty)
.def_readwrite("num_return_sequences", &GenerationConfig::num_return_sequences)
.def_readwrite("no_repeat_ngram_size", &GenerationConfig::no_repeat_ngram_size)
.def_property("stop_criteria", &stop_criteria_to_str, &str_to_stop_criteria)
.def_readwrite("temperature", &GenerationConfig::temperature)
.def_readwrite("top_p", &GenerationConfig::top_p)
.def_readwrite("top_k", &GenerationConfig::top_k)
.def_readwrite("do_sample", &GenerationConfig::do_sample)
.def_readwrite("repetition_penalty", &GenerationConfig::repetition_penalty)
.def_readwrite("pad_token_id", &GenerationConfig::pad_token_id)
.def_readwrite("bos_token_id", &GenerationConfig::bos_token_id)
.def_readwrite("eos_token_id", &GenerationConfig::eos_token_id)
.def_readwrite("eos_token", &GenerationConfig::eos_token)
.def_readwrite("bos_token", &GenerationConfig::bos_token);

py::class_<DecodedResults>(m, "DecodedResults")
.def(py::init<>())
.def_readwrite("texts", &DecodedResults::texts)
.def_readwrite("scores", &DecodedResults::scores);

py::class_<EncodedResults>(m, "EncodedResults")
.def(py::init<>())
.def_readwrite("tokens", &EncodedResults::tokens)
.def_readwrite("scores", &EncodedResults::scores);

}
24 changes: 24 additions & 0 deletions tests/python_tests/list_test_models.py
@@ -0,0 +1,24 @@
def models_list():
model_ids = [
("TinyLlama/TinyLlama-1.1B-Chat-v1.0", "TinyLlama-1.1B-Chat-v1.0"),
# ("google/gemma-2b-it", "gemma-2b-it"),
# ("google/gemma-7b-it", "gemma-7b-it"),
# ("meta-llama/Llama-2-7b-chat-hf", "Llama-2-7b-chat-hf"),
# ("meta-llama/Llama-2-13b-chat-hf", "Llama-2-13b-chat-hf"),
# ("openlm-research/open_llama_3b", "open_llama_3b"),
# ("openlm-research/open_llama_7b", "open_llama_7b"),
# ("databricks/dolly-v2-3b", "dolly-v2-3b"),
# ("databricks/dolly-v2-12b", "dolly-v2-12b"),
# ("mistralai/Mistral-7B-v0.1", "Mistral-7B-v0.1"),
# ("ikala/redpajama-3b-chat", "redpajama-3b-chat"),
# ("microsoft/phi-1_5", "phi-1_5/"),
# ("Qwen/Qwen1.5-7B-Chat", "Qwen1.5-7B-Chat"),
]
import os
prefix = os.getenv('GENAI_MODELS_PATH_PREFIX', '')
return [(model_id, os.path.join(prefix, model_path)) for model_id, model_path in model_ids]


if __name__ == "__main__":
for model_id, model_path in models_list():
print(model_id, model_path)
4 changes: 4 additions & 0 deletions tests/python_tests/requirements.txt
@@ -0,0 +1,4 @@
pytest
transformers
torch
optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel.git@fb1b35bef23242d65b2fb057c4a7ac78a7cfd4c3
116 changes: 116 additions & 0 deletions tests/python_tests/test_generate_api.py
@@ -0,0 +1,116 @@
# Copyright (C) 2023-2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import pytest
from list_test_models import models_list


@pytest.fixture(scope="module", params=models_list())
def model_fixture(request):
model_id, path = request.param
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
return model_id, path, tokenizer, model

def run_hf_ov_genai_comparison(model_fixture, generation_config, prompt):
import openvino_genai as ov_genai
model_id, path, tokenizer, model = model_fixture

generation_config_hf = generation_config.copy()
# in OpenVINO GenAI this parameter is called stop_criteria,
# while in HF it's called early_stopping.
# HF values True, False and "never" correspond to OV GenAI values "early", "heuristic" and "never"
if generation_config_hf.get('stop_criteria'):
generation_config_hf['early_stopping'] = stop_criteria_map()[generation_config_hf.pop('stop_criteria')]

encoded_prompt = tokenizer.encode(prompt, return_tensors='pt', add_special_tokens=True)
hf_encoded_output = model.generate(encoded_prompt, **generation_config_hf)
hf_output = tokenizer.decode(hf_encoded_output[0, encoded_prompt.shape[1]:])

device = 'CPU'
# pipe = ov_genai.LLMPipeline(path, device)

pipe = ov_genai.LLMPipeline(path, device)

ov_output = pipe.generate(prompt, **generation_config)

if hf_output != ov_output:
print(f'hf_output: {hf_output}')
print(f'ov_output: {ov_output}')

assert hf_output == ov_output


def stop_criteria_map():
return {"never": "never", "early": True, "heuristic": False}

test_cases = [
(dict(max_new_tokens=20, do_sample=False), 'table is made of'), # generation_config, prompt
(dict(num_beam_groups=3, num_beams=15, num_return_sequences=15, max_new_tokens=20, diversity_penalty=1.0), 'table is made of'),
# (dict(num_beam_groups=3, num_beams=15, num_return_sequences=15, max_new_tokens=20, diversity_penalty=1.0), 'Alan Turing was a'),
# (dict(num_beam_groups=3, num_beams=15, num_return_sequences=15, max_new_tokens=30, diversity_penalty=1.0), 'Alan Turing was a'),
# (dict(num_beam_groups=2, num_beams=8, num_return_sequences=8, max_new_tokens=20, diversity_penalty=1.0), 'table is made of'),
# (dict(num_beam_groups=2, num_beams=8, num_return_sequences=8, max_new_tokens=20, diversity_penalty=1.0), 'The Sun is yellow because'),
# (dict(num_beam_groups=2, num_beams=8, num_return_sequences=8, max_new_tokens=20, diversity_penalty=1.5), 'The Sun is yellow because'),
]
@pytest.mark.parametrize("generation_config,prompt", test_cases)
def test_greedy_decoding(model_fixture, generation_config, prompt):
run_hf_ov_genai_comparison(model_fixture, generation_config, prompt)


prompts = ['The Sun is yellow because', 'Alan Turing was a', 'table is made of']
@pytest.mark.parametrize("num_beam_groups", [2, 3, 8])
@pytest.mark.parametrize("group_size", [5, 3, 10])
@pytest.mark.parametrize("max_new_tokens", [20, 15])
@pytest.mark.parametrize("diversity_penalty", [1.0, 1.5])
@pytest.mark.parametrize("prompt", prompts)
@pytest.mark.skip # temporarily
def test_beam_search_decoding(model_fixture, num_beam_groups, group_size,
max_new_tokens, diversity_penalty, prompt):
generation_config = dict(
num_beam_groups=num_beam_groups,
num_beams=num_beam_groups * group_size,
diversity_penalty=diversity_penalty,
num_return_sequences=num_beam_groups * group_size,
max_new_tokens=max_new_tokens,
)
run_hf_ov_genai_comparison(model_fixture, generation_config, prompt)


@pytest.mark.parametrize("stop_criteria", ["never", "early", "heuristic"])
@pytest.mark.parametrize("prompt", prompts)
@pytest.mark.parametrize("max_new_tokens", [20, 40, 300])
@pytest.mark.skip # temporarily
def test_stop_criteria(model_fixture, stop_criteria, prompt, max_new_tokens):
# todo: early stop_criteria fails for long sequences
if (stop_criteria == 'early' and max_new_tokens >= 300):
pytest.skip()
generation_config = dict(
num_beam_groups=2,
num_beams=2 * 3,
diversity_penalty=1.0,
num_return_sequences=2 * 3,
max_new_tokens=max_new_tokens,
stop_criteria=stop_criteria,
)
run_hf_ov_genai_comparison(model_fixture, generation_config, prompt)


# test long sequences
@pytest.mark.parametrize("num_beam_groups", [2])
@pytest.mark.parametrize("group_size", [5])
@pytest.mark.parametrize("max_new_tokens", [800, 2000])
@pytest.mark.parametrize("diversity_penalty", [1.0])
@pytest.mark.parametrize("prompt", prompts)
@pytest.mark.skip # will be enabled in nightly runs since these tests are computationally expensive
def test_beam_search_long_sentences(model_fixture, num_beam_groups, group_size,
max_new_tokens, diversity_penalty, prompt):
generation_config = dict(
num_beam_groups=num_beam_groups,
num_beams=num_beam_groups * group_size,
diversity_penalty=1.0,
num_return_sequences=num_beam_groups * group_size,
max_new_tokens=max_new_tokens,
)
run_hf_ov_genai_comparison(model_fixture, generation_config, prompt)
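
For reference, the Python API exercised above can be driven standalone roughly as follows; this is a minimal sketch, assuming openvino_genai is importable and model_path points to an already converted OpenVINO model directory:

import openvino_genai as ov_genai

model_path = 'path/to/exported_openvino_model'  # placeholder; assumed to be prepared beforehand
pipe = ov_genai.LLMPipeline(model_path, 'CPU')

# Greedy decoding, mirroring the first test case above.
print(pipe.generate('table is made of', max_new_tokens=20, do_sample=False))

# Grouped beam search, mirroring the parametrized beam search tests.
beam_config = dict(num_beam_groups=2, num_beams=10, num_return_sequences=10, diversity_penalty=1.0, max_new_tokens=20)
print(pipe.generate('The Sun is yellow because', **beam_config))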
35 changes: 25 additions & 10 deletions text_generation/causal_lm/cpp/CMakeLists.txt
@@ -4,25 +4,29 @@
cmake_minimum_required(VERSION 3.15)
project(causal_lm)

add_subdirectory(../../../thirdparty/openvino_tokenizers/ "${CMAKE_CURRENT_BINARY_DIR}/openvino_tokenizers/")
if(TARGET openvino_tokenizers)
set(OPENVINO_TOKENIZERS_PATH $<TARGET_FILE:openvino_tokenizers>)
else()
set(OPENVINO_TOKENIZERS_PATH ${CMAKE_CURRENT_SOURCE_DIR}/../../bin/openvino_tokenizers.dll) # TODO: remove this fallback once generate() can locate openvino_tokenizers on its own
endif()

find_package(OpenVINOGenAI REQUIRED PATHS
"${CMAKE_BINARY_DIR}" # Reuse the package from the build.
${OpenVINO_DIR} # GenAI may be installed alongside OpenVINO.
)

add_executable(greedy_causal_lm greedy_causal_lm.cpp)
target_compile_definitions(greedy_causal_lm PRIVATE OPENVINO_TOKENIZERS_PATH=\"$<TARGET_FILE:openvino_tokenizers>\")
find_package(OpenVINO REQUIRED COMPONENTS Runtime)
target_link_libraries(greedy_causal_lm PRIVATE openvino::runtime)
target_link_libraries(greedy_causal_lm PRIVATE openvino::genai)
set_target_properties(greedy_causal_lm PROPERTIES CXX_STANDARD 17)
set_target_properties(greedy_causal_lm PROPERTIES CXX_STANDARD_REQUIRED ON)

add_executable(beam_search_causal_lm beam_search_causal_lm.cpp)
target_compile_definitions(beam_search_causal_lm PRIVATE OPENVINO_TOKENIZERS_PATH=\"$<TARGET_FILE:openvino_tokenizers>\")
target_include_directories(beam_search_causal_lm PRIVATE ./)
find_package(OpenVINO REQUIRED COMPONENTS Runtime)
target_link_libraries(beam_search_causal_lm PRIVATE openvino::runtime)
target_link_libraries(beam_search_causal_lm PRIVATE openvino::genai)
set_target_properties(beam_search_causal_lm PROPERTIES CXX_STANDARD 17)
set_target_properties(beam_search_causal_lm PROPERTIES CXX_STANDARD_REQUIRED ON)

add_executable(speculative_decoding_lm speculative_decoding_lm.cpp)
target_compile_definitions(speculative_decoding_lm PRIVATE OPENVINO_TOKENIZERS_PATH=\"$<TARGET_FILE:openvino_tokenizers>\")
target_compile_definitions(speculative_decoding_lm PRIVATE OPENVINO_TOKENIZERS_PATH="${OPENVINO_TOKENIZERS_PATH}")
target_include_directories(speculative_decoding_lm PRIVATE ./)
find_package(OpenVINO REQUIRED COMPONENTS Runtime)
target_link_libraries(speculative_decoding_lm PRIVATE openvino::runtime)
@@ -32,11 +36,22 @@ find_package(TBB REQUIRED COMPONENTS tbb)
target_link_libraries(speculative_decoding_lm PRIVATE TBB::tbb)

add_executable(prompt_lookup_decoding_lm prompt_lookup_decoding_lm.cpp)
target_compile_definitions(prompt_lookup_decoding_lm PRIVATE OPENVINO_TOKENIZERS_PATH=\"$<TARGET_FILE:openvino_tokenizers>\")
target_compile_definitions(prompt_lookup_decoding_lm PRIVATE OPENVINO_TOKENIZERS_PATH="${OPENVINO_TOKENIZERS_PATH}")
target_include_directories(prompt_lookup_decoding_lm PRIVATE ./)
find_package(OpenVINO REQUIRED COMPONENTS Runtime)
target_link_libraries(prompt_lookup_decoding_lm PRIVATE openvino::runtime)
set_target_properties(prompt_lookup_decoding_lm PROPERTIES CXX_STANDARD 17)
set_target_properties(prompt_lookup_decoding_lm PROPERTIES CXX_STANDARD_REQUIRED ON)
find_package(TBB REQUIRED COMPONENTS tbb)
target_link_libraries(prompt_lookup_decoding_lm PRIVATE TBB::tbb)

add_executable(chat_sample chat_sample.cpp)
target_link_libraries(chat_sample PRIVATE openvino::genai)
target_include_directories(chat_sample PRIVATE "${CMAKE_CURRENT_SOURCE_DIR}")
set_target_properties(chat_sample PROPERTIES CXX_STANDARD 17)
set_target_properties(chat_sample PROPERTIES CXX_STANDARD_REQUIRED ON)

install(TARGETS greedy_causal_lm beam_search_causal_lm speculative_decoding_lm prompt_lookup_decoding_lm chat_sample
RUNTIME DESTINATION samples_bin/
COMPONENT samples_bin
EXCLUDE_FROM_ALL)
240 changes: 22 additions & 218 deletions text_generation/causal_lm/cpp/beam_search_causal_lm.cpp
@@ -1,232 +1,36 @@
// Copyright (C) 2023-2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0

#include <group_beam_searcher.hpp>
#include <openvino/openvino.hpp>
#include <openvino/genai/llm_pipeline.hpp>

namespace {

enum SPECIAL_TOKEN { PAD_TOKEN = 2 };

std::string detokenize(ov::InferRequest& detokenizer, const std::vector<int64_t>& tokens) {
constexpr size_t BATCH_SIZE = 1;
ov::Tensor inp = detokenizer.get_input_tensor();
inp.set_shape({BATCH_SIZE, tokens.size()});
for (size_t idx = 0; idx < tokens.size(); ++idx) {
inp.data<int64_t>()[idx] = tokens.at(idx);
}
detokenizer.infer();
return detokenizer.get_output_tensor().data<std::string>()[0];
}

std::pair<ov::Tensor, ov::Tensor> pad_left(ov::Tensor&& input_ids, ov::Tensor&& attention_mask) {
const size_t batch_size = input_ids.get_shape().at(0);
const size_t sequence_length = input_ids.get_shape().at(1);
int64_t* inputs_data = input_ids.data<int64_t>();
int64_t* attention_mask_data = attention_mask.data<int64_t>();

for (size_t batch = 0; batch < batch_size; batch++) {
const size_t batch_offset = batch * sequence_length;

// last token in the sequence is not a PAD_TOKEN, skipping
if (inputs_data[batch_offset + sequence_length - 1] != SPECIAL_TOKEN::PAD_TOKEN) {
continue;
}

size_t pad_tokens_number = 0;
for (int i = sequence_length - 1; i >= 0; i--) {
const size_t token_offset = batch_offset + i;

if (inputs_data[token_offset] == SPECIAL_TOKEN::PAD_TOKEN) {
continue;
}

if (pad_tokens_number == 0) {
pad_tokens_number = sequence_length - i - 1;
}

std::swap(inputs_data[token_offset], inputs_data[token_offset + pad_tokens_number]);
std::swap(attention_mask_data[token_offset], attention_mask_data[token_offset + pad_tokens_number]);
}
}

return {input_ids, attention_mask};
enum SPECIAL_TOKEN { PAD_TOKEN = 2 };
}

std::pair<ov::Tensor, ov::Tensor> tokenize(ov::InferRequest& tokenizer, std::vector<std::string> prompts) {
tokenizer.set_input_tensor(ov::Tensor{ov::element::string, {prompts.size()}, prompts.data()});

tokenizer.infer();

pad_left(tokenizer.get_tensor("input_ids"), tokenizer.get_tensor("attention_mask"));

// fix mask filled with '2' instead of '0'
ov::Tensor attention_mask = tokenizer.get_tensor("attention_mask");
int64_t* attention_mask_data = attention_mask.data<int64_t>();
std::replace(attention_mask_data, attention_mask_data + attention_mask.get_size(), 2, 0);

return {tokenizer.get_tensor("input_ids"), tokenizer.get_tensor("attention_mask")};
}

void initialize_position_ids(ov::Tensor& position_ids, const ov::Tensor& attention_mask) {
const size_t batch_size = attention_mask.get_shape().at(0);
const size_t sequence_length = attention_mask.get_shape().at(1);

const int64_t* attention_mask_data = attention_mask.data<int64_t>();
int64_t* position_ids_data = position_ids.data<int64_t>();

for (size_t batch = 0; batch < batch_size; batch++) {
const size_t batch_offset = batch * sequence_length;
size_t sum = 0;

for (size_t i = 0; i < sequence_length; i++) {
const size_t element_offset = batch_offset + i;
position_ids_data[element_offset] = sum;
if (attention_mask_data[element_offset] == 1) {
sum += 1;
}
}
}
}

void initialize_inputs(const ov::Tensor& input_ids, const ov::Tensor& attention_mask, ov::InferRequest& request) {
request.set_tensor("input_ids", input_ids);
request.set_tensor("attention_mask", attention_mask);

ov::Shape input_shape = input_ids.get_shape();

ov::Tensor position_ids = request.get_tensor("position_ids");
position_ids.set_shape(input_shape);
initialize_position_ids(position_ids, attention_mask);

ov::Tensor beam_idx = request.get_tensor("beam_idx");
beam_idx.set_shape({input_shape.at(0)});
std::fill_n(beam_idx.data<int32_t>(), input_shape.at(0), 0);
}

void set_attention_mask(ov::Tensor&& attention_mask, std::vector<int32_t> next_beams) {
ov::Tensor original_mask{ov::element::i64, attention_mask.get_shape()};
ov::Shape original_shape = original_mask.get_shape();
attention_mask.copy_to(original_mask);

ov::Shape new_shape{next_beams.size(), original_mask.get_shape().at(1) + 1};
attention_mask.set_shape(new_shape);

for (size_t beam_id = 0; beam_id < next_beams.size(); beam_id++) {
const size_t original_prompt_offset = next_beams.at(beam_id) * original_shape.at(1);
const size_t result_prompt_offset = beam_id * new_shape.at(1);

int64_t* dest = attention_mask.data<int64_t>() + result_prompt_offset;
const int64_t* src = original_mask.data<int64_t>() + original_prompt_offset;

std::memcpy(dest, src, original_shape.at(1) * sizeof(int64_t));
attention_mask.data<int64_t>()[result_prompt_offset + new_shape.at(1) - 1] = 1;
}
}

void set_position_ids(ov::Tensor&& position_ids, const ov::Tensor&& attention_mask) {
const size_t batch_size = attention_mask.get_shape().at(0);
const size_t sequence_length = attention_mask.get_shape().at(1);
position_ids.set_shape({batch_size, 1});

for (size_t batch = 0; batch < batch_size; batch++) {
int64_t* mask_start = attention_mask.data<int64_t>() + batch * sequence_length;
position_ids.data<int64_t>()[batch] = std::accumulate(mask_start, mask_start + sequence_length - 1, 0);
}
}

std::vector<std::string> prompts_arguments_to_vector(int argc, char* argv[]) {
std::vector<std::string> prompts;
prompts.reserve(argc - 2);
for (size_t i = 2; i < argc; i++) {
prompts.push_back(std::string{argv[i]});
}
return prompts;
}

} // namespace

int main(int argc, char* argv[]) try {
if (argc < 3) {
throw std::runtime_error(std::string{"Usage: "} + argv[0] + " <MODEL_DIR> '<PROMPT 1>' ['<PROMPT 2>' ...]");
}

// Compile models
ov::Core core;
core.add_extension(OPENVINO_TOKENIZERS_PATH); // OPENVINO_TOKENIZERS_PATH is defined in CMakeLists.txt
// Read the tokenizer model information from the file to later get the runtime information
auto tokenizer_model = core.read_model(std::string{argv[1]} + "/openvino_tokenizer.xml");
// tokenizer and detokenizer work on CPU only
ov::InferRequest tokenizer = core.compile_model(tokenizer_model, "CPU").create_infer_request();
ov::InferRequest detokenizer =
core.compile_model(std::string{argv[1]} + "/openvino_detokenizer.xml", "CPU").create_infer_request();
// The model can be compiled for GPU as well
ov::InferRequest lm =
core.compile_model(std::string{argv[1]} + "/openvino_model.xml", "CPU").create_infer_request();

auto [input_ids, attention_mask] = tokenize(tokenizer, prompts_arguments_to_vector(argc, argv));

// Initialize beam search
const int64_t* prompt_data = input_ids.data<const int64_t>();
std::vector<std::vector<int64_t>> prompts;
prompts.reserve(input_ids.get_shape().at(0));
for (size_t batch = 0; batch < input_ids.get_shape().at(0); batch++) {
size_t sequence_length = input_ids.get_shape().at(1);
size_t batch_offset = batch * sequence_length;
const int64_t* prompt_start = prompt_data + batch_offset;
prompts.push_back(std::vector<int64_t>{prompt_start, prompt_start + sequence_length});
}

// Get the runtime info from the tokenizer model that we read earlier
auto rt_info = tokenizer_model->get_rt_info(); // Get the runtime info for the model
int64_t SPECIAL_EOS_TOKEN;

if (rt_info.count("eos_token_id") > 0) { // check if the runtime information has a valid EOS token ID
SPECIAL_EOS_TOKEN = rt_info["eos_token_id"].as<int64_t>();

} else {
throw std::runtime_error("EOS token ID not found in model's runtime information.");
}

Parameters parameters{std::move(prompts), SPECIAL_EOS_TOKEN};
GroupBeamSearcher group_beam_searcher{parameters};

initialize_inputs(input_ids, attention_mask, lm);

std::vector<int64_t> next_tokens;
std::vector<int32_t> next_beams;

for (size_t length_count = 0; length_count < parameters.max_new_tokens; ++length_count) {
lm.infer();

std::tie(next_tokens, next_beams) = group_beam_searcher.select_next_tokens(lm.get_tensor("logits"));
if (next_tokens.empty()) {
break;
}
size_t batch_size = next_tokens.size();
// Set pointers
lm.set_tensor("input_ids", ov::Tensor{ov::element::i64, {batch_size, 1}, next_tokens.data()});
lm.set_tensor("beam_idx", ov::Tensor{ov::element::i32, {batch_size}, next_beams.data()});
// Set auxiliary inputs
set_attention_mask(lm.get_tensor("attention_mask"), next_beams);
set_position_ids(lm.get_tensor("position_ids"), lm.get_tensor("attention_mask"));
}

for (const std::vector<std::vector<Beam>>& prompt_group : finalize(std::move(group_beam_searcher))) {
std::cout << "Prompt:\n";
for (const std::vector<Beam> group : prompt_group) {
std::cout << "Group:\n";
for (const Beam& beam : group) {
std::cout << beam.score << ": " << detokenize(detokenizer, beam.tokens) << '\n';
}
}
}
// Model is stateful which means that context (kv-cache) which belongs to a particular
// text sequence is accumulated inside the model during the generation loop above.
// This context should be reset before processing the next text sequence.
// While it is not required to reset context in this sample as only one batch of sequences is processed,
// it is called for education purposes:
lm.reset_state();
auto prompts = std::vector<std::string>(argv + 2, argv + argc);

std::string model_path = argv[1];
std::string device = "CPU"; // GPU can be used as well

ov::genai::LLMPipeline pipe(model_path, device);
ov::genai::GenerationConfig config = pipe.get_generation_config();
config.max_new_tokens = 20;
config.num_beam_groups = 3;
config.num_beams = 15;
config.num_return_sequences = config.num_beams * prompts.size();

// workaround until pad_token_id is written into the IR
pipe.get_tokenizer().set_pad_token_id(PAD_TOKEN);

auto beams = pipe.generate(prompts, config);
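// beams.scores and beams.texts each hold one entry per returned sequence; num_return_sequences above is sized to cover all prompts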
for (int i = 0; i < beams.scores.size(); i++)
std::cout << beams.scores[i] << ": " << beams.texts[i] << '\n';

return 0;
} catch (const std::exception& error) {
std::cerr << error.what() << '\n';
return EXIT_FAILURE;
50 changes: 50 additions & 0 deletions text_generation/causal_lm/cpp/chat_sample.cpp
@@ -0,0 +1,50 @@
// Copyright (C) 2023-2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0

#include <openvino/openvino.hpp>
#include "openvino/genai/llm_pipeline.hpp"

using namespace std;

std::vector<string> questions = {
"1+1=",
"what was the previous answer?",
"Why is the sky blue?",
"4+10=",
"What is Intel OpenVINO?",
"Can you briefly summarize what I asked you about during this session?",
};

int main(int argc, char* argv[]) try {
std::string prompt;
std::string accumulated_str = "";

std::string model_path = argv[1];
ov::genai::LLMPipeline pipe(model_path, "CPU");

ov::genai::GenerationConfig config = pipe.get_generation_config();
config.max_new_tokens = 10000;
auto streamer = [](std::string word) { std::cout << word << std::flush; };

pipe.start_chat();
for (size_t i = 0; i < questions.size(); i++) {
// std::getline(std::cin, prompt);
prompt = questions[i];

std::cout << "question:\n";
cout << prompt << endl;

// auto answer_str = pipe(prompt, config, streamer);
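// the commented call above passes the GenerationConfig and streamer directly; the active call below sets the same options via ov::genai property arguments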
auto answer_str = pipe.generate(prompt, ov::genai::max_new_tokens(10000), ov::genai::streamer(streamer));
accumulated_str += answer_str;

cout << "\n----------\n";
}
pipe.finish_chat();
} catch (const std::exception& error) {
std::cerr << error.what() << '\n';
return EXIT_FAILURE;
} catch (...) {
std::cerr << "Non-exception object thrown\n";
return EXIT_FAILURE;
}
138 changes: 18 additions & 120 deletions text_generation/causal_lm/cpp/greedy_causal_lm.cpp
@@ -1,129 +1,27 @@
// Copyright (C) 2023-2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0

#include <openvino/openvino.hpp>

namespace {
std::pair<ov::Tensor, ov::Tensor> tokenize(ov::InferRequest& tokenizer, std::string&& prompt) {
constexpr size_t BATCH_SIZE = 1;
tokenizer.set_input_tensor(ov::Tensor{ov::element::string, {BATCH_SIZE}, &prompt});
tokenizer.infer();
return {tokenizer.get_tensor("input_ids"), tokenizer.get_tensor("attention_mask")};
}

std::string detokenize(ov::InferRequest& detokenizer, std::vector<int64_t>& tokens) {
constexpr size_t BATCH_SIZE = 1;
detokenizer.set_input_tensor(ov::Tensor{ov::element::i64, {BATCH_SIZE, tokens.size()}, tokens.data()});
detokenizer.infer();
return detokenizer.get_output_tensor().data<std::string>()[0];
}

// The following reasons require TextStreamer to keep a cache of previous tokens:
// detokenizer removes starting ' '. For example detokenize(tokenize(" a")) == "a",
// but detokenize(tokenize("prefix a")) == "prefix a"
// 1 printable token may consist of 2 token ids: detokenize(incomplete_token_idx) == "�"
struct TextStreamer {
ov::InferRequest detokenizer;
std::vector<int64_t> token_cache;
size_t print_len = 0;

void put(int64_t token) {
token_cache.push_back(token);
std::string text = detokenize(detokenizer, token_cache);
if (!text.empty() && '\n' == text.back()) {
// Flush the cache after the new line symbol
std::cout << std::string_view{text.data() + print_len, text.size() - print_len};
token_cache.clear();
print_len = 0;
return;
}
if (text.size() >= 3 && text.compare(text.size() - 3, 3, "�") == 0) {
// Don't print incomplete text
return;
}
std::cout << std::string_view{text.data() + print_len, text.size() - print_len} << std::flush;
print_len = text.size();
}

void end() {
std::string text = detokenize(detokenizer, token_cache);
std::cout << std::string_view{text.data() + print_len, text.size() - print_len} << '\n';
token_cache.clear();
print_len = 0;
}
};
}
#include "openvino/genai/llm_pipeline.hpp"

int main(int argc, char* argv[]) try {
if (argc != 3) {
throw std::runtime_error(std::string{"Usage: "} + argv[0] + " <MODEL_DIR> '<PROMPT>'");
}
// Compile models
ov::Core core;
core.add_extension(OPENVINO_TOKENIZERS_PATH); // OPENVINO_TOKENIZERS_PATH is defined in CMakeLists.txt
//Read the tokenizer model information from the file to later get the runtime information
auto tokenizer_model = core.read_model(std::string{argv[1]} + "/openvino_tokenizer.xml");
// tokenizer and detokenizer work on CPU only
ov::InferRequest tokenizer = core.compile_model(
tokenizer_model, "CPU").create_infer_request();
auto [input_ids, attention_mask] = tokenize(tokenizer, argv[2]);
ov::InferRequest detokenizer = core.compile_model(
std::string{argv[1]} + "/openvino_detokenizer.xml", "CPU").create_infer_request();
// The model can be compiled for GPU as well
ov::InferRequest lm = core.compile_model(
std::string{argv[1]} + "/openvino_model.xml", "CPU").create_infer_request();
auto seq_len = input_ids.get_size();

// Initialize inputs
lm.set_tensor("input_ids", input_ids);
lm.set_tensor("attention_mask", attention_mask);
ov::Tensor position_ids = lm.get_tensor("position_ids");
position_ids.set_shape(input_ids.get_shape());
std::iota(position_ids.data<int64_t>(), position_ids.data<int64_t>() + seq_len, 0);
constexpr size_t BATCH_SIZE = 1;
// Input values are persistent between inference calls.
// That allows to set values, which aren't going to change, only once
lm.get_tensor("beam_idx").set_shape({BATCH_SIZE});
lm.get_tensor("beam_idx").data<int32_t>()[0] = 0;
lm.infer();
size_t vocab_size = lm.get_tensor("logits").get_shape().back();
float* logits = lm.get_tensor("logits").data<float>() + (seq_len - 1) * vocab_size;
int64_t out_token = std::max_element(logits, logits + vocab_size) - logits;

lm.get_tensor("input_ids").set_shape({BATCH_SIZE, 1});
position_ids.set_shape({BATCH_SIZE, 1});
TextStreamer text_streamer{std::move(detokenizer)};

// Get the runtime info from the tokenizer model that we read earlier
auto rt_info = tokenizer_model->get_rt_info(); //Get the runtime info for the model
int64_t SPECIAL_EOS_TOKEN;
if (3 > argc || argc > 4)
throw std::runtime_error(std::string{"Usage: "} + argv[0] + " <MODEL_DIR> \"<PROMPT>\" <DEVICE>");

if (rt_info.count("eos_token_id") > 0) { //check if the runtime information has a valid EOS token ID
SPECIAL_EOS_TOKEN = rt_info["eos_token_id"].as<int64_t>();
} else {
throw std::runtime_error("EOS token ID not found in model's runtime information.");
}

int max_sequence_length = 100;
while (out_token != SPECIAL_EOS_TOKEN && seq_len < max_sequence_length) {
++seq_len;
lm.get_tensor("input_ids").data<int64_t>()[0] = out_token;
lm.get_tensor("attention_mask").set_shape({BATCH_SIZE, seq_len});
std::fill_n(lm.get_tensor("attention_mask").data<int64_t>(), seq_len, 1);
position_ids.data<int64_t>()[0] = int64_t(seq_len - 1);
lm.start_async();
text_streamer.put(out_token);
lm.wait();
logits = lm.get_tensor("logits").data<float>();
out_token = std::max_element(logits, logits + vocab_size) - logits;
}
text_streamer.end();
// Model is stateful which means that context (kv-cache) which belongs to a particular
// text sequence is accumulated inside the model during the generation loop above.
// This context should be reset before processing the next text sequence.
// While it is not required to reset context in this sample as only one sequence is processed,
// it is called for education purposes:
lm.reset_state();
std::string model_path = argv[1];
std::string prompt = argv[2];

// GPU can be used as well
std::string device = "CPU";
if (argc > 3) device = argv[3];

ov::genai::LLMPipeline pipe(model_path, device);
ov::genai::GenerationConfig config = pipe.get_generation_config();
config.max_new_tokens = 100;
config.do_sample = false;
auto streamer = [](std::string subword){std::cout << subword << std::flush;};

// since a streamer is set, results are printed as each new token is generated
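// the returned text is ignored here; chat_sample.cpp shows how to capture the generated text from generate() instead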
pipe.generate(prompt, config, streamer);
} catch (const std::exception& error) {
std::cerr << error.what() << '\n';
return EXIT_FAILURE;
417 changes: 417 additions & 0 deletions third-party-programs.txt

Large diffs are not rendered by default.