Enhance ext_proc filter to support MXN streaming #34942

Merged 78 commits on Nov 8, 2024

Contributor

@yanjunxiang-google yanjunxiang-google commented Jun 26, 2024

This PR addresses issue #32090. One use case is compression performed by the external processor.

It lets the ext_proc server buffer M request body chunks from Envoy, process them, and then send N chunks back to Envoy in STREAMED mode. It also lets the server buffer the entire message (headers, body, and trailers) before sending back any response.

ext_proc MxN streaming works this way:

  1. Enable MxN streaming by setting the body mode to FULL_DUPLEX_STREAMED in the ext_proc filter config.
  2. Set the trailer mode to SEND in the ext_proc filter config.

With this config, Envoy sends body chunks to the ext_proc server as they arrive. The server can buffer all or part of the body (M chunks) and then stream the mutated body, split into N chunks if needed, back to Envoy.
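
To make the M-to-N idea concrete, here is a minimal sketch of what a server-side re-chunker might look like. It is not Envoy or ext_proc API code: `OutChunk`, `Rechunker`, `onChunk`, `flush_threshold`, and `max_out` are hypothetical names, and the "mutation" is a stand-in that upper-cases bytes. It only illustrates buffering several inbound chunks and emitting a different number of outbound chunks, with `end_of_stream` set on the last one.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical outbound chunk; not an Envoy or ext_proc type.
struct OutChunk {
  std::string data;
  bool end_of_stream = false;
};

// Buffers inbound body chunks until `flush_threshold` bytes are held (or the
// stream ends), applies a placeholder mutation, and re-splits the result into
// pieces of at most `max_out` bytes.
class Rechunker {
public:
  Rechunker(std::size_t flush_threshold, std::size_t max_out)
      : flush_threshold_(flush_threshold), max_out_(max_out) {
    assert(max_out_ > 0);
  }

  // Returns zero or more outbound chunks for one inbound chunk.
  std::vector<OutChunk> onChunk(const std::string& data, bool end_of_stream) {
    buffer_ += data;
    std::vector<OutChunk> out;
    if (!end_of_stream && buffer_.size() < flush_threshold_) {
      return out; // Keep buffering; nothing is sent back yet.
    }
    // Placeholder mutation: upper-case the buffered bytes.
    for (char& c : buffer_) {
      c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    // Split the mutated buffer into outbound chunks.
    for (std::size_t pos = 0; pos < buffer_.size(); pos += max_out_) {
      OutChunk chunk;
      chunk.data = buffer_.substr(pos, max_out_);
      out.push_back(chunk);
    }
    if (end_of_stream) {
      if (out.empty()) {
        out.push_back(OutChunk{}); // An empty body still needs a final chunk.
      }
      out.back().end_of_stream = true;
    }
    buffer_.clear();
    return out;
  }

private:
  std::size_t flush_threshold_;
  std::size_t max_out_;
  std::string buffer_;
};
```

With `Rechunker r(6, 4)`, feeding "abc" returns nothing, feeding "defg" flushes the seven buffered bytes as two chunks, and a final chunk with `end_of_stream` set flushes whatever remains. A real server would replace the upper-casing step with its own processing, such as compression.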

Signed-off-by: Yanjun Xiang <[email protected]>
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @mattklein123
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #34942 was opened by yanjunxiang-google.

see: more, trace.

@yanjunxiang-google yanjunxiang-google changed the title from "Streamed more chunks" to "Enhance ext_proc filter to support MxN streaming" on Jul 10, 2024
Signed-off-by: Yanjun Xiang <[email protected]>
@yanjunxiang-google yanjunxiang-google changed the title from "Enhance ext_proc filter to support MxN streaming" to "Enhance ext_proc filter to support M:N streaming" on Jul 10, 2024
Signed-off-by: Yanjun Xiang <[email protected]>
@yanjunxiang-google yanjunxiang-google marked this pull request as ready for review July 11, 2024 02:07
@yanjunxiang-google
Contributor Author

/assign @gbrail @htuch @jmarantz @tyxia @yanavlasov

@gbrail cannot be assigned to this issue.

🐱

Caused by: a #34942 (comment) was created by @yanjunxiang-google.

see: more, trace.

Signed-off-by: Yanjun Xiang <[email protected]>
@KBaichoo
Contributor

/assign @tyxia

As codeowner for first pass.

Signed-off-by: Yanjun Xiang <[email protected]>
Signed-off-by: Yanjun Xiang <[email protected]>
@yanjunxiang-google
Contributor Author

Kind Ping!

Contributor

@jmarantz jmarantz left a comment

flushing comments.

Contributor

@jmarantz jmarantz left a comment

I haven't read the tests yet. Are we hitting all the corner cases?

To answer this I'd look at a coverage map generated by CI, which you can do with a few (non-obvious) clicks.

if (!message_timer_) {
message_timer_ = filter_callbacks_->dispatcher().createTimer(cb);

if (bodyMode() != ProcessingMode::FULL_DUPLEX_STREAMED) {
Contributor

please comment how the next processing step occurs in full duplex mode.

Contributor

I think that comment reflects what the code does (which is already clear from reading the code), but not the plan for how the next step will happen.

source/extensions/filters/http/ext_proc/processor_state.cc (outdated, resolved)
return ProcessorState::CallbackState::TrailersCallback;
}
}

return ProcessorState::CallbackState::Idle;
}

Contributor

handleHeadersResponse is too big to comprehend. It will be hard to know whether the change you made has the desired effect and no undesired ones.

WDYT of breaking this one up also?

Contributor Author

Unit tests and integration tests were added specifically for the code change in this function.
Agreed, though: for historical reasons, there is quite a bit of technical debt in the ext_proc filter state machine code. I added a TODO here. Let's take care of this technical debt in separate follow-up PRs.

Contributor

By "technical debt", do you mean "lack of test coverage"?

I think it would be better to get the test coverage really solid before adding a lot of complexity. This stuff is really complicated to read, and it would help a lot if we at least had confidence that all the code is covered by tests.

Contributor Author

@yanjunxiang-google yanjunxiang-google Nov 8, 2024

Test coverage should be good. These are the test cases we added for this new body mode:

Integration tests:

  1. Server buffers the headers and the whole body before sending a response.
  2. Server buffers the headers, the whole body, and the trailers before sending a response.
  3. Server buffers the headers and a certain amount of body, then sends a body response without waiting for the end of the body. Meanwhile new body data keeps coming in, and the server continues this buffer-process-respond cycle for a while. Eventually the trailers come in, and the server sends the last body chunk response and the trailer response.

Unit tests:

  1. Client sends headers and body. Server sends the header response as soon as it receives the header request, i.e., without waiting for the body.
  2. Client sends headers and trailers, no body. Server sends the header response after receiving the trailers.
  3. For one HTTP stream, the server does MxN processing for some chunks, then 1x1 processing (i.e., one response sent immediately for each request) for some chunks, then MxN again.
  4. A couple of test cases for a misbehaving server.
  5. A couple of test cases for Envoy misconfiguration.

These tests try to cover different scenarios: client requests may or may not have a body and may or may not have trailers; the server's header response may or may not wait for the body or the trailers; the server may or may not buffer; and so on.

Contributor

Are we hitting all the lines in the coverage report?

Contributor Author

As for test coverage of the ext_proc code as a whole, I recall it meets the Envoy criteria (>96.3%). The ext_proc fuzzer coverage is also >70%.

Contributor

That's good, but I'd still like you to look at the line coverage and see if we are hitting all your new code.

Contributor Author

source/extensions/filters/http/ext_proc/processor_state.cc (outdated, resolved)
@@ -39,6 +39,7 @@ envoy_extension_cc_test(
}),
extension_names = ["envoy.filters.http.ext_proc"],
rbe_pool = "2core",
shard_count = 8,
Contributor

I really want to know why this test is so slow that you need to use multiple cores and 8 shards. I think I asked this before but didn't see the answer.

Contributor Author

I responded to this here: #34942 (comment)

Contributor

Followed up on that thread but copying here:

I don't have objection to calling it large if it's large.

I just am surprised it takes this long and feel like we must be having some sleeps or something more complex than a unit test usually is, beyond having a number of test cases.

Contributor Author

I quickly checked the trace log timestamps. It looks like the tests themselves are very fast, i.e., <5ms in a local setup. It is the test initialization that consumes most of the time (>100ms in the same local setup). This is the case for existing tests as well, such as the very basic SimplestPost test:

TEST_F(HttpFilterTest, SimplestPost) {

Signed-off-by: Yanjun Xiang <[email protected]>
Signed-off-by: Yanjun Xiang <[email protected]>
Contributor

@adisuissa adisuissa left a comment

/lgtm api

@repokitteh-read-only repokitteh-read-only bot removed the api label Nov 8, 2024
Member

@tyxia tyxia left a comment

A few more comments regarding how to handle server response/behavior. Thanks for your patience.

if (!message_timer_) {
message_timer_ = filter_callbacks_->dispatcher().createTimer(cb);

// Skip starting timer For FULL_DUPLEX_STREAMED body mode.
Member

@tyxia tyxia Nov 8, 2024

Without a timer, how are we going to handle the timeout case where the side stream server doesn't respond?

Contributor Author

If the side stream server does not respond, the router filter idle timeout will kick in and destroy the ext_proc filter. This is the same as when a backend server does not respond to a client request.

ENVOY_LOG(debug, "Applying body response to chunk of data. Size = {}", chunk->length);
MutationUtils::applyBodyMutations(common_response.body_mutation(), chunk_data);
}
bool should_continue = chunk->end_stream;
Member

What if the side stream server doesn't (or just forgets to) respond with end_stream = true? It looks like this is not handled and the stream will be stuck there.

Contributor Author

Yes, if the side stream server does not respond, or never sends end_of_stream set to true, the ext_proc filter will keep waiting for the response, and eventually the router filter timeout should kick in and destroy the ext_proc filter. This should be the same as when a backend server misbehaves.

Contributor Author

I created issue #37065 to track adding an integration test for the case where the server fails to send a response in time and the router filter times out.

Member

@tyxia tyxia Nov 8, 2024

This theory doesn't sound solid to me, and it seems to make all existing individual filters' error handling pointless.

Besides, tightly coupling side stream errors with router/upstream errors will lead to poor observability and customer experience, as they are two different errors.

I think we should improve the error handling of this design (maybe ext_proc can have its own timeout).

Contributor Author

Oh, I thought we agreed on no ext_proc timer for FULL_DUPLEX_STREAMED mode, letting the generic Envoy router idle timer (default 15s) take care of the case where the server is not responding.
This lets the server buffer more data, possibly all the way until end_of_stream is received. Adding an ext_proc-specific timer would limit this capability.
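
The trade-off debated in this thread can be sketched as follows. This is an illustrative model only, not Envoy code: `WaitState`, `checkWait`, and all parameter names are hypothetical, and how the outer (router-level) timer actually measures elapsed time is abstracted into a single `outer_elapsed` value. It contrasts a per-message ext_proc timer with relying only on an outer timeout.

```cpp
#include <cassert>
#include <chrono>
#include <optional>

using Ms = std::chrono::milliseconds;

enum class WaitState { Waiting, ExtProcTimeout, OuterTimeout };

// Hypothetical model, not Envoy code.
// since_last_request: time since the filter last sent a message to the server.
// outer_elapsed: elapsed time as seen by the outer (router-level) timer.
// per_message_timeout: std::nullopt models "no ext_proc timer", as proposed
// for FULL_DUPLEX_STREAMED mode.
inline WaitState checkWait(Ms since_last_request, Ms outer_elapsed,
                           std::optional<Ms> per_message_timeout,
                           Ms outer_timeout) {
  assert(outer_timeout.count() > 0);
  if (per_message_timeout.has_value() &&
      since_last_request >= *per_message_timeout) {
    // Failure is attributed to the side stream server.
    return WaitState::ExtProcTimeout;
  }
  if (outer_elapsed >= outer_timeout) {
    // Failure surfaces as a generic outer timeout instead.
    return WaitState::OuterTimeout;
  }
  return WaitState::Waiting;
}
```

For example, with an illustrative 200 ms per-message timeout, a server that buffers for 300 ms before responding is reported as an ext_proc timeout. With `std::nullopt`, the same pause is tolerated and the filter keeps waiting until the outer timeout (15 s in the thread above) is reached. That is the capability the author wants to keep, and also the observability concern the reviewer raises.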

Signed-off-by: Yanjun Xiang <[email protected]>
@yanjunxiang-google
Contributor Author

@adisuissa I did an upstream merge. It needs your API approval again. Thanks!

Contributor

@adisuissa adisuissa left a comment

/lgtm api

@repokitteh-read-only repokitteh-read-only bot removed the api label Nov 8, 2024
@tyxia
Member

tyxia commented Nov 8, 2024

LGTM, as a WIP/good start.

I think those open comments above need to be addressed (and some more tests/loadtest) to complete this feature.

@yanavlasov yanavlasov merged commit 72a2067 into envoyproxy:main Nov 8, 2024
21 checks passed
@yanjunxiang-google yanjunxiang-google deleted the streamed_more_chunks branch November 11, 2024 14:50
yanavlasov pushed a commit that referenced this pull request Dec 23, 2024
Adding an MxN integration test to test timeout mechanism works

This is to address a left-over comments of original MxN commit:
#34942

Fix: #37065

Signed-off-by: Yanjun Xiang <[email protected]>
jmarantz pushed a commit that referenced this pull request Jan 2, 2025
…nse (#37773)

Adding ext_proc MxN test with ext_proc server send out-of-order
response.
This is to address a left-over comments of the initial MxN PR:
#34942.


Fix: #37064

---------

Signed-off-by: Yanjun Xiang <[email protected]>