
Inference streaming support #1750

Conversation

@RobertSamoilescu (Contributor) commented on May 9, 2024

This PR adds streaming support to MLServer by allowing the user to implement a predict_stream method in the runtime, which takes an async generator of requests as input and outputs an async generator of responses.

from typing import AsyncIterator

from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse


class MyModel(MLModel):

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        pass

    async def predict_stream(
        self, payloads: AsyncIterator[InferenceRequest]
    ) -> AsyncIterator[InferenceResponse]:
        pass

While the input and output types for predict remain the same, predict_stream can handle a stream of inputs and a stream of outputs. This design choice is quite general and can cover many input-output scenarios:

  • unary input - unary output (handled by predict)
  • unary input - stream output (handled by predict_stream)
  • stream input - unary output (handled by predict_stream)
  • stream input - stream output (handled by predict_stream)

Although streamed input is not really applicable to REST and is currently not supported there, it is quite natural for gRPC. If a user would like to use streamed inputs, they will have to use gRPC.
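
To make the interface above concrete, here is a rough sketch of what a streaming runtime could look like. It is not part of this PR: the TextStreamModel name, the word-splitting "tokenization" and the "output" tensor name are all made up for illustration.

from typing import AsyncIterator

from mlserver import MLModel
from mlserver.codecs import StringCodec
from mlserver.types import InferenceRequest, InferenceResponse


class TextStreamModel(MLModel):
    # Hypothetical runtime: consume the first request from the input stream
    # and yield one response per generated "token".

    async def predict_stream(
        self, payloads: AsyncIterator[InferenceRequest]
    ) -> AsyncIterator[InferenceResponse]:
        payload = await payloads.__anext__()
        prompt = StringCodec.decode_input(payload.inputs[0])[0]

        # A real model would call an LLM here; this sketch just echoes the
        # prompt word by word.
        for token in prompt.split():
            yield InferenceResponse(
                model_name=self.name,
                outputs=[StringCodec.encode_output("output", [token])],
            )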

Exposed endpoints

We expose the following endpoints (plus their versioned counterparts) to the user:

  • /v2/models/{model_name}/infer
  • /v2/models/{model_name}/infer_stream
  • /v2/models/{model_name}/generate
  • /v2/models/{model_name}/generate_stream

The first two are general-purpose endpoints, while the latter two are LLM-specific (see the Open Inference Protocol here). Note that the infer and generate endpoints will point to the predict implementation, while infer_stream and generate_stream will point to the predict_stream implementation defined above.

Client calls

REST non-streaming

import os
import requests
from mlserver import types
from mlserver.codecs import StringCodec

TESTDATA_PATH = "../tests/testdata/"
payload_path = os.path.join(TESTDATA_PATH, "generate-request.json")
inference_request = types.InferenceRequest.parse_file(payload_path)


api_url = "http://localhost:8080/v2/models/text-model/generate"
response = requests.post(api_url, json=inference_request.dict())
response = types.InferenceResponse.parse_raw(response.text)
print(StringCodec.decode_output(response.outputs[0]))

REST streaming

import os
import httpx
from httpx_sse import connect_sse
from mlserver import types
from mlserver.codecs import StringCodec

TESTDATA_PATH = "../tests/testdata/"
payload_path = os.path.join(TESTDATA_PATH, "generate-request.json")
inference_request = types.InferenceRequest.parse_file(payload_path)

with httpx.Client() as client:
    with connect_sse(client, "POST", "http://localhost:8080/v2/models/text-model/generate_stream", json=inference_request.dict()) as event_source:
        for sse in event_source.iter_sse():
            response = types.InferenceResponse.parse_raw(sse.data)
            print(StringCodec.decode_output(response.outputs[0]))

gRPC non-streaming

import os
import grpc
import mlserver.grpc.converters as converters
import mlserver.grpc.dataplane_pb2_grpc as dataplane
import mlserver.types as types
from mlserver.codecs import StringCodec
from mlserver.grpc.converters import ModelInferResponseConverter

TESTDATA_PATH = "../tests/testdata/"
payload_path = os.path.join(TESTDATA_PATH, "generate-request.json")
inference_request = types.InferenceRequest.parse_file(payload_path)

# need to convert from string to bytes for grpc
inference_request.inputs[0] = StringCodec.encode_input("prompt", inference_request.inputs[0].data.__root__)
inference_request_g = converters.ModelInferRequestConverter.from_types(
    inference_request, model_name="text-model", model_version=None
)
grpc_channel = grpc.insecure_channel("localhost:8081")
grpc_stub = dataplane.GRPCInferenceServiceStub(grpc_channel)
response = grpc_stub.ModelInfer(inference_request_g)

response = ModelInferResponseConverter.to_types(response)
print(StringCodec.decode_output(response.outputs[0]))

gRPC streaming

import os
import asyncio
import grpc
import mlserver.grpc.converters as converters
import mlserver.grpc.dataplane_pb2_grpc as dataplane
import mlserver.types as types
from mlserver.codecs import StringCodec
from mlserver.grpc.converters import ModelInferResponseConverter


TESTDATA_PATH = "../tests/testdata/"
payload_path = os.path.join(TESTDATA_PATH, "generate-request.json")
inference_request = types.InferenceRequest.parse_file(payload_path)

# need to convert from string to bytes for grpc
inference_request.inputs[0] = StringCodec.encode_input("prompt", inference_request.inputs[0].data.__root__)
inference_request_g = converters.ModelInferRequestConverter.from_types(
    inference_request, model_name="text-model", model_version=None
)

async def get_inference_request_stream(inference_request):
    yield inference_request

# grpc.aio client code needs to run inside an asyncio event loop
async def main():
    async with grpc.aio.insecure_channel("localhost:8081") as grpc_channel:
        grpc_stub = dataplane.GRPCInferenceServiceStub(grpc_channel)
        inference_request_stream = get_inference_request_stream(inference_request_g)

        async for response in grpc_stub.ModelStreamInfer(inference_request_stream):
            response = ModelInferResponseConverter.to_types(response)
            print(StringCodec.decode_output(response.outputs[0]))


asyncio.run(main())
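
For the stream-input cases listed earlier, the same ModelStreamInfer call works unchanged; only the request generator needs to yield more than one request. A hypothetical variant of the generator above:

# Hypothetical: yield several requests to exercise the stream-input path.
async def get_inference_request_stream(inference_requests):
    for inference_request in inference_requests:
        yield inference_request
        await asyncio.sleep(0.1)  # simulate requests arriving over time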

Limitations

  • GZipMiddleware must be disabled, since it is not compatible with Starlette streaming ("gzip_enabled": false; see the combined settings sketch after this list)
  • gRPC metrics endpoints must be disabled, pending further investigation in a follow-up PR ("metrics_endpoint": null)
  • Parallel workers are not supported ("parallel_workers": 0)
  • Error handling for REST is not supported. This is because the error raised is of type asyncio.exceptions.CancelledError, which inherits from BaseException, while the Starlette middleware for error handling only checks for Exception.
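
For reference, a minimal settings.json combining the flags quoted above might look roughly like this (only streaming-related fields shown; a sketch, not part of the PR):

{
  "gzip_enabled": false,
  "metrics_endpoint": null,
  "parallel_workers": 0
}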

@RobertSamoilescu RobertSamoilescu force-pushed the feature/inference-streaming-poc branch from a519933 to 63af613 Compare May 15, 2024 13:00
@sakoush (Member) left a comment


In general it looks great, I left some comments though and I will look at tests next.

@@ -0,0 +1,45 @@
import asyncio
Member

should we add an example infer.py that uses this model? it is part of the PR description but probably better to have it here as well. Happy for it to be part of a follow up docs and examples PR.

mlserver/batching/hooks.py (outdated; resolved)
payload: AsyncIterator[InferenceRequest],
) -> AsyncIterator[InferenceResponse]:
model = _get_model(f)
logger.warning(
Member

is this going to be logged on mlserver for every request? I think this might pollute the logs? I guess if the user doesn't set adaptive batching then this code path will not be hit anyway?

Contributor Author

Moved it outside.

mlserver/grpc/dataplane_pb2.pyi (outdated; resolved)
break

payload = self._prepare_payload(payload, model)
payloads_decorated = self._payloads_decorator(payload, payloads, model)
Member

is this really a decorator logic?

Contributor Author

I renamed it.

payload = self._prepare_payload(payload, model)
payloads_decorated = self._payloads_decorator(payload, payloads, model)

async for prediction in model.predict_stream(payloads_decorated):
Member

what happens if one element in the stream fails? do we still keep going or should we break?

mlserver/model.py (resolved)
"""
async for inference_response in infer_stream:
# TODO: How should we send headers back?
# response_headers = extract_headers(inference_response)
Member

what are the kind of headers we usually send back in the response of infer?

Contributor Author

See link here

proto/dataplane.proto (outdated; resolved)
@sakoush (Member) left a comment


comments on testing. I think we should add cases for:

  • errors on infer_stream
  • input streaming

tests/fixtures.py (resolved)
@pytest.mark.parametrize(
"sum_model_settings", [lazy_fixture("text_stream_model_settings")]
)
@pytest.mark.parametrize("sum_model", [lazy_fixture("text_stream_model")])
Member

what is sum_model?

Contributor Author

It is a fixture (see here). Also the definition is here.

expected = pb.InferTensorContents(int64_contents=[6])

assert len(prediction.outputs) == 1
assert prediction.outputs[0].contents == expected


@pytest.mark.parametrize("settings", [lazy_fixture("settings_stream")])
@pytest.mark.parametrize(
Member

what is sum_model_settings?

Contributor Author

It is a fixture (see here)

tests/rest/test_endpoints.py (outdated; resolved)
tests/rest/test_endpoints.py (outdated; resolved)
tests/rest/test_endpoints.py (resolved)
@pytest.mark.parametrize(
"model_name,model_version", [("text-model", "v1.2.3"), ("text-model", None)]
)
async def test_generate(
Member

can generate test be an extra parametrized item in infer test?

Contributor Author

It is a bit tricky to parameterise the model loading through lazy_fixtures due to recursive dependency involving fixture. I will leave it like this since I don't want to refactor the tests.

@@ -147,15 +207,19 @@ async def test_infer_headers(
)


async def test_infer_error(rest_client, inference_request):
async def test_infer_error(
Member

should we test errors for the stream case as well?

yield generate_request


async def test_predict_stream_fallback(
Member

as explained earlier I am not sure if we should fallback to predict or raise not implemented.

@RobertSamoilescu RobertSamoilescu force-pushed the feature/inference-streaming-poc branch from 63af613 to 499e693 Compare May 20, 2024 13:53
@sakoush (Member) left a comment


lgtm - great work! This should be followed by a docs PR to describe streaming and the current limitations more explicitly.

@RobertSamoilescu RobertSamoilescu force-pushed the feature/inference-streaming-poc branch from 195bcde to 0ec59c7 Compare May 22, 2024 09:02
@RobertSamoilescu RobertSamoilescu merged commit 54cd47e into SeldonIO:master May 22, 2024
25 checks passed