
Switch NPU LLM execution to ov::genai::StatefulLLMPipeline #1677

Open

wants to merge 16 commits into master

Conversation

TolyaTalamanov (Collaborator):

No description provided.

@github-actions bot added labels: category: LLM, no-match-files (Feb 5, 2025)
@TolyaTalamanov TolyaTalamanov marked this pull request as draft February 5, 2025 15:30
@ilya-lavrenov ilya-lavrenov added this to the 2025.1 milestone Feb 5, 2025
@TolyaTalamanov TolyaTalamanov marked this pull request as ready for review February 6, 2025 12:04
@github-actions bot added label: category: cmake / build (Feb 6, 2025)
@TolyaTalamanov TolyaTalamanov changed the title Switch NPU LLM execution to ov::genai::StatefulLLMPipeline Switch NPU LLM execution to ov::genai::StatefulLLMPipeline [DON'T MERGE!] Feb 6, 2025
@github-actions bot added label: category: visual language (Feb 17, 2025)
TolyaTalamanov (Collaborator, Author):
Depends on: #1748

@TolyaTalamanov TolyaTalamanov changed the title Switch NPU LLM execution to ov::genai::StatefulLLMPipeline [DON'T MERGE!] Switch NPU LLM execution to ov::genai::StatefulLLMPipeline Feb 19, 2025
@github-actions bot added labels: category: text to image, category: GHA, category: cmake / build, category: Python API, category: samples, category: GenAI C++ API; removed labels: category: visual language, category: cmake / build (Feb 19, 2025)
@github-actions bot removed labels: category: text to image, category: GHA, category: cmake / build (Feb 19, 2025)
// NB: OPTIMIZE_SIZE is only possible when model_path is defined!
// Otherwise, switch to OPTIMIZE_SPEED
if (cache_mode.has_value() && *cache_mode == CacheMode::OPTIMIZE_SIZE && !model_path.empty()) {
compiled = ov::genai::utils::singleton_core().compile_model(model_path, "NPU", properties);
Contributor:
This way we will read the model via model_path twice, won't we?

TolyaTalamanov (Collaborator, Author):
I believe your point is the same as @ilya-lavrenov's. @smirnov-alexey, please explain :)

Contributor:
@AsyaPronina Actually, we read the model once in the OPTIMIZE_SPEED case, and we don't explicitly read it at all by default.

AsyaPronina (Contributor) left a comment:
LGTM, thanks! Minor comments left.

kv_desc.min_response_len = pop_int_and_cast(properties, "MIN_RESPONSE_LEN").value_or(128u);
update_npu_config(properties, model, kv_pos, kv_desc);
auto cache_mode = get_option<CacheMode>(config, "CACHE_MODE");
// NB: OPTIMIZE_SIZE is only possible when model_path is defined!
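For context, MIN_RESPONSE_LEN above is one of the NPU pipeline knobs popped from the caller's properties (defaulting to 128). A hypothetical caller-side sketch, with a placeholder model path and values:

#include "openvino/genai/llm_pipeline.hpp"

int main() {
    // Properties consumed by the snippet above; MAX_PROMPT_LEN is the
    // companion knob bounding the prompt side.
    ov::AnyMap properties = {
        {"MAX_PROMPT_LEN", 1024u},
        {"MIN_RESPONSE_LEN", 128u},
    };
    ov::genai::LLMPipeline pipe("models/llama", "NPU", properties);
}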
Contributor:
@smirnov-alexey will this limitation be fixed in the near future?

It would be great to avoid a custom path for NPU (passing an extra models_path) through a chain of calls.

TolyaTalamanov (Collaborator, Author):
@ilya-lavrenov As I understand it, this is how it works on the OpenVINO side:

https://docs.openvino.ai/2025/api/c_cpp_api/classov_1_1_core.html

This can be more efficient than using the Core::read_model + Core::compile_model(model_in_memory_object) flow, especially for cases when caching is enabled and a cached model is available.
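For illustration, a minimal sketch of the two flows being contrasted (hypothetical file names; caching enabled via CACHE_DIR):

#include "openvino/openvino.hpp"

int main() {
    ov::Core core;
    ov::AnyMap props = {ov::cache_dir("./ov_cache")};

    // Flow 1: compile straight from the file. With caching enabled, the
    // runtime can import a previously cached blob without re-reading the IR.
    auto compiled_from_path = core.compile_model("model.xml", "NPU", props);

    // Flow 2: read the model into memory first, then compile it. This is the
    // Core::read_model + Core::compile_model(model) flow the docs describe
    // as less efficient when a cached model is available.
    std::shared_ptr<ov::Model> model = core.read_model("model.xml");
    auto compiled_from_model = core.compile_model(model, "NPU", props);
}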

ilya-lavrenov (Contributor), Feb 20, 2025:

I suppose those are different things; you would need to check CACHE_DIR instead of CACHE_MODE, then?

Definitely, compile_model with a path is preferable, but the CACHE_MODE optimization (is it about the weightless cache? I suppose it should work even with CACHE_MODE = OPTIMIZE_SPEED for NPU) should also work with an ov::Model, IMO.
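In other words, something like this hypothetical sketch, reusing get_option and the surrounding variables from the snippet above:

// Gate the path-based compile on whether caching is requested at all,
// rather than on the cache mode.
auto cache_dir = get_option<std::string>(config, ov::cache_dir.name());
if (cache_dir.has_value() && !model_path.empty()) {
    compiled = ov::genai::utils::singleton_core().compile_model(model_path, "NPU", properties);
} else {
    compiled = ov::genai::utils::singleton_core().compile_model(model, "NPU", properties);
}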

Contributor:
BTW, the model has already been read via read_model in the "common" code, and some transformations are applied.
So it looks strange to ignore that model and read your own one inside the plugin.

TolyaTalamanov (Collaborator, Author):
@ilya-lavrenov You're right, it's an artifact of copy-paste, will fix, thanks!

smirnov-alexey (Contributor), Feb 20, 2025:
@ilya-lavrenov I believe read_model() + compile_model(model) doesn't work. I previously discussed it with the GPU plugin team, and I think the only way to import a model via CACHE_DIR is to use compile_model(model_path). It has something to do with https://github.com/openvinotoolkit/openvino/pull/27162/files
Please correct me if I'm wrong, but I couldn't achieve weightless caching with CACHE_DIR and an in-memory model. If I recall correctly, WEIGHTS_PATH is not set in this case.
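A minimal sketch of the reported behavior (assumed semantics; OPTIMIZE_SIZE is the weightless cache mode):

ov::Core core;
ov::AnyMap props = {
    ov::cache_dir("./ov_cache"),
    ov::cache_mode(ov::CacheMode::OPTIMIZE_SIZE),  // weightless blob
};

// Works: the runtime knows the weights file, so the weightless blob can be
// re-imported with weights restored from disk on subsequent runs.
auto from_path = core.compile_model("model.xml", "NPU", props);

// Reportedly fails to restore weights: for an in-memory ov::Model the
// weights file is unknown (WEIGHTS_PATH is not set), so the weightless
// blob cannot be imported.
auto model = core.read_model("model.xml");
auto from_model = core.compile_model(model, "NPU", props);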

ilya-lavrenov (Contributor), Feb 21, 2025:
I think we can enable the weightless blob for already-read ov::Models.
See openvinotoolkit/openvino#29101; could you please try that PR?

Contributor:
I'll try

Contributor:
Unfortunately, we get increased memory consumption.

Contributor:
Yet another attempt: openvinotoolkit/openvino#29107 😄

static_config = { **default_config, 'STATIC_PIPELINE': 'STATEFUL' }

# Test both static and generic pipelines
pipeline_configs = [default_config, static_config]
TolyaTalamanov (Collaborator, Author):
Perhaps it will be renamed to NPU tests.

'NPUW_ONLINE_PIPELINE': 'NONE'
} | get_default_llm_properties()

static_config = { **default_config, 'STATIC_PIPELINE': 'STATEFUL' }
Contributor:
Do we still need 'STATIC_PIPELINE': 'STATEFUL'?

auto cache_mode = get_option<CacheMode>(config, ov::cache_mode.name());
// NB: OPTIMIZE_SIZE is only possible when model_path is defined!
// Otherwise, switch to OPTIMIZE_SPEED
if (cache_mode.has_value() && *cache_mode == CacheMode::OPTIMIZE_SIZE && !model_path.empty()) {
Contributor:
@TolyaTalamanov the previous behavior was: in the default case we just use model_path (weightless), and the user would need to explicitly specify OPTIMIZE_SPEED. Please return to this behavior; there is no need for the user to specify OPTIMIZE_SIZE here, it's our default option.
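That is, roughly (a hypothetical sketch mirroring the snippet above):

auto cache_mode = get_option<CacheMode>(config, ov::cache_mode.name());
const bool optimize_speed =
    cache_mode.has_value() && *cache_mode == CacheMode::OPTIMIZE_SPEED;
// Default: weightless, path-based compile. Only an explicit OPTIMIZE_SPEED
// (or a missing model_path) falls back to the in-memory model.
if (!optimize_speed && !model_path.empty()) {
    compiled = ov::genai::utils::singleton_core().compile_model(model_path, "NPU", properties);
} else {
    compiled = ov::genai::utils::singleton_core().compile_model(model, "NPU", properties);
}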

TolyaTalamanov (Collaborator, Author):
Sorry, changed it now, could you check?

@@ -139,8 +139,8 @@ ov::genai::LLMPipeline::LLMPipeline(
         m_pimpl = std::make_unique<ContinuousBatchingAdapter>(models_path, tokenizer, scheduler_config, device, device_properties);
     }

-    if (m_pimpl == nullptr && device == "NPU") {
+    if (m_pimpl == nullptr && device == "NPU" && properties.count("STATIC_PIPELINE")) {
         m_pimpl = static_llm::LLMPipelineFactory::create(models_path, tokenizer, device, properties);
Contributor:
Following my previous review: can we just get rid of STATIC_PIPELINE, since our only option is STATEFUL?

TolyaTalamanov (Collaborator, Author):
Most likely that won't happen, but what if we get another static pipeline implementation? I plan to remove it next week anyway, preferably once switching to the generic pipeline doesn't bring any perf regressions.
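For reference, with this change the static path becomes opt-in; a hypothetical usage sketch (the model path is a placeholder):

#include "openvino/genai/llm_pipeline.hpp"

int main() {
    // Default: the generic stateful pipeline is used on NPU.
    ov::genai::LLMPipeline generic("models/llama", "NPU");

    // Opt-in: keep the previous static pipeline implementation.
    ov::genai::LLMPipeline static_pipe(
        "models/llama", "NPU",
        ov::AnyMap{{"STATIC_PIPELINE", "STATEFUL"}});
}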

smirnov-alexey (Contributor) left a comment:

My only concern here is the OPTIMIZE_SPEED and OPTIMIZE_SIZE behavior.

smirnov-alexey (Contributor):
@TolyaTalamanov please wait to merge until I test Ilya's changes. Likely we can remove the cache_mode logic.

smirnov-alexey (Contributor), Feb 21, 2025:

Tested; it doesn't work out of the box. I suggest we implement it separately.
