Error handling options. #353

bmmptlgc · 2024-07-23T10:55:38Z

bmmptlgc
Jul 23, 2024

Hi, I created a streaming app that reads messages from a multi-schema topic (Avro schemas are used) and sends each message type to its own topic, filtering out message types that we don't want to stream. I forked the Streamiz repo, created my app there and added a test project to investigate how to use dependency injection, and then reused the tests to explore error handling:
Here's the repo: https://github.com/bmmptlgc/kafka-streams-dotnet/blob/schema-registry-di/test/Sample.Kafka.Supplier.DI.UnitTests/BaseContext.cs

My app will have multiple StreamTasks (one per multi-schema topic, e.g. orders and products), all running the same topology. The topology is simple: a .Filter to exclude message types we don't care about and a .To to determine the topic name to produce to. https://github.com/bmmptlgc/kafka-streams-dotnet/blob/d70c1017254761236a418764c19b2c6365bcf8c1/sample-kafka-supplier-di/TopicSplitterService.cs#L52
Not a lot could go wrong, right? But what if...

I'm not sure how likely it is for an exception to be thrown in the callback functions I wrote for .Filter and .To. They should be safe because they depend on the messages having a particular header, which our system guarantees they do. So I put a throw statement in each callback and verified that I can handle the exception with streamConfig.InnerExceptionHandler. It appears that in this handler I have to decide whether to CONTINUE or FAIL.

1. There seems to be no RETRY capability here, is that correct?
2. If I wanted to retry a failed message just by configuring the topology, is there a way to do that?
3. If I return FAIL in the handler, all streams are shut down. For example, if I FAIL a message for the orders stream, the stream stops processing as expected, but the products stream also shuts down. Is there a way to avoid this?

I understand that question 2 would eventually require handling the error after exhausting a limited number of retries. At that point I will probably need to persist the message in a store, reprocess it later on and let the stream continue. It would still be nice to have a retry capability before throwing the message into some error store.

Another error that could happen is if a stream task publishes a message to a single-message topic and the message breaks the Avro schema (i.e., it is not compatible with the previous schema). In my tests I am setting up the producer mock to throw an exception: https://github.com/bmmptlgc/kafka-streams-dotnet/blob/d70c1017254761236a418764c19b2c6365bcf8c1/test/Sample.Kafka.Supplier.DI.UnitTests/KafkaTopicSplitter/When_consuming_a_message_with_a_type_configured_to_publish_to_a_single_schema_topic.cs#L56
I was expecting this to be handled by streamConfig.ProductionExceptionHandler: https://github.com/bmmptlgc/kafka-streams-dotnet/blob/d70c1017254761236a418764c19b2c6365bcf8c1/test/Sample.Kafka.Supplier.DI.UnitTests/BaseContext.cs#L41
But it is also handled by streamConfig.InnerExceptionHandler.

When is streamConfig.ProductionExceptionHandler used?

Thanks in advance,
Bruno


LGouellec · 2024-07-24T23:23:29Z

LGouellec
Jul 24, 2024
Maintainer

Hey @bmmptlgc ,

> There seems to be no RETRY capability here, is that correct?

Correct

> If I wanted to retry a failed message just by configuring the topology, is there a way to do that?

No. Today the topology can only fail or continue (that is, skip the current message). But you can implement your own pattern with a try/catch in your lambda and push the invalid record to a DLQ topic. This DLQ topic would then be processed by a dedicated Kafka Streams application or a dedicated consumer that implements your retry logic, with a specific number of retries and/or interval, etc.
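
A minimal sketch of the try/catch pattern described above, assuming a hypothetical helper named `SafeLambda.Guard` (it is not part of Streamiz). It wraps a predicate so that an exception is handed to a dead-letter callback instead of bubbling up to the InnerExceptionHandler. How the callback actually produces to the DLQ topic (for example with a separate producer) is left out and is an assumption of this sketch.

```csharp
using System;

// Hypothetical helper for illustration only; not a Streamiz API.
public static class SafeLambda
{
    // Wraps a predicate: on exception, the record is passed to deadLetter
    // and excluded from the main flow by returning false.
    public static Func<TKey, TValue, bool> Guard<TKey, TValue>(
        Func<TKey, TValue, bool> predicate,
        Action<TKey, TValue, Exception> deadLetter)
    {
        return (key, value) =>
        {
            try
            {
                return predicate(key, value);
            }
            catch (Exception e)
            {
                // In a real application this callback would publish the
                // record (plus error details) to the DLQ topic.
                deadLetter(key, value, e);
                return false;
            }
        };
    }
}
```

The guarded delegate can then be invoked from inside the lambda passed to .Filter, so a bad record is skipped in the topology and retried later by whatever consumes the DLQ topic.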

> If I return FAIL in the handler, all streams are shut down. For example, if I FAIL a message for the orders stream, the stream stops processing as expected, but the products stream also shuts down. Is there a way to avoid this?

If you return FAIL, the current stream thread will shut down cleanly and die. If you have another stream thread in your instance, it will continue to work unless it processes your invalid record. In that case, it will probably fail as well.

> When is streamConfig.ProductionExceptionHandler used?

The ProductionExceptionHandler is used when the produce process fails (RecordTooLargeException, broker authentication errors, etc.). Streamiz uses the byte array serializer to flush data to Kafka, so your AvroSerDes is called before Producer.Send(..) and its errors fall under the InnerExceptionHandler. Kafka Streams (Java) has the same mechanism, by the way.

Hope it helps.

1 reply

bmmptlgc Aug 13, 2024
Author

> Hey @bmmptlgc,
>
>> There seems to be no RETRY capability here, is that correct?
>
> Correct
>
>> If I wanted to retry a failed message just by configuring the topology, is there a way to do that?
>
> No. Today the topology can only fail or continue (that is, skip the current message). But you can implement your own pattern with a try/catch in your lambda and push the invalid record to a DLQ topic. This DLQ topic would then be processed by a dedicated Kafka Streams application or a dedicated consumer that implements your retry logic, with a specific number of retries and/or interval, etc.

The code inside my lambdas is very simple. It could only fail if the messages didn't have the expected header, which our platform guarantees they will. I'm more concerned about a Kafka outage or network glitch that causes the .To method to fail when sending to the destination topic. I can't handle such an error in a try/catch. So I'll have to handle the thrown exception with streamConfig.InnerExceptionHandler and FAIL, unless the thrown exception is a ProduceException, in which case I will handle it with ProductionExceptionHandler, where I can retry.

>> If I return FAIL in the handler, all streams are shut down. For example, if I FAIL a message for the orders stream, the stream stops processing as expected, but the products stream also shuts down. Is there a way to avoid this?
>
> If you return FAIL, the current stream thread will shut down cleanly and die. If you have another stream thread in your instance, it will continue to work unless it processes your invalid record. In that case, it will probably fail as well.

In my example, I have 2 streams as shown by the image below, one for orders and another for products. You can see I have many messages in both topics. I'm not sure how Streamiz subscribes to the topics. I know there is only one Kafka consumer, but I expected each builder.Stream to only handle messages for the topic name that is passed to them. It so happens the first message is a products message and I throw an exception. I'm expecting only the products stream to fail, but both streams fail and you can see in the logs below "All stream threads have died. The instance will be in error state and should be closed". Each breakpoint inside my lambda is only hit once. Nothing else is processed after the exception is thrown. Maybe there is something wrong with how I set up my streams.

>> When is streamConfig.ProductionExceptionHandler used?
>
> The ProductionExceptionHandler is used when the produce process fails (RecordTooLargeException, broker authentication errors, etc.). Streamiz uses the byte array serializer to flush data to Kafka, so your AvroSerDes is called before Producer.Send(..) and its errors fall under the InnerExceptionHandler. Kafka Streams (Java) has the same mechanism, by the way.

I was able to simulate ProductionExceptionHandler in my tests by having my producer mock throw ProduceException. Is there a recommended way to return RETRY by inspecting properties of report.Error (since Error.IsRetriable is internal 😢)?

Regards,
Bruno

LGouellec · 2024-08-13T23:57:42Z

LGouellec
Aug 13, 2024
Maintainer

Hi @bmmptlgc,

The stream created via builder.Stream is just DSL syntax that abstracts the consumption of a specific topic and chains all your downstream operations (Join, Filter, Map, etc.), but in the end all your streams run on a single thread.

You can parallelise your processing via config.NumStreamThreads to leverage the consumer group protocol, but in the end, if one message fails (orders) and streamConfig.InnerExceptionHandler returns FAIL, the current thread will die (consumer, producer, state stores, etc.).

It will not close only the specific "stream" of orders but all the processors of your topology. Kafka Streams (Java) implements the same behavior.


Best regards,

4 replies

bmmptlgc Aug 19, 2024
Author

I'm guessing then that the right approach would be to create a KafkaStream per topic that I need to split and just reuse the topology, like:

foreach (var topicConfig in _topicSplitterOptions.Topics)
{
    var topology = BuildTopology(topicConfig);

    var stream = new KafkaStream(topology, _streamConfig);

    await stream.StartAsync(stoppingToken);
}

I have tested this and it seems to meet the requirement. Are there any drawbacks to starting multiple KafkaStream instances in the same app?

LGouellec Aug 19, 2024
Maintainer

@bmmptlgc
No, except that you will have two threads instead of one.
Make sure each KafkaStream instance has its own StateDir and ApplicationId in its configuration.

bmmptlgc Aug 20, 2024
Author

> @bmmptlgc No, except that you will have two threads instead of one. Make sure each KafkaStream instance has its own StateDir and ApplicationId in its configuration.

Regarding ApplicationId, I'm actually setting a different value per KafkaStream:

_streamConfig.ApplicationId = $"kafka-topic-splitter-{topicConfig.SourceTopic}";
                
var stream = new KafkaStream(topology, _streamConfig);

If I understand it correctly, StateDir is associated with the use of stores. My application is simple enough that we are not using any stores, so maybe I don't need to worry about StateDir?

LGouellec Aug 20, 2024
Maintainer

@bmmptlgc
Yes, StateDir is only used if your application is stateful.

bmmptlgc · 2024-08-21T17:45:47Z

bmmptlgc
Aug 21, 2024
Author

>> When is streamConfig.ProductionExceptionHandler used?
>
> The ProductionExceptionHandler is used when the produce process fails (RecordTooLargeException, broker authentication errors, etc.). Streamiz uses the byte array serializer to flush data to Kafka, so your AvroSerDes is called before Producer.Send(..) and its errors fall under the InnerExceptionHandler. Kafka Streams (Java) has the same mechanism, by the way.

I was able to simulate ProductionExceptionHandler in my tests by having my producer mock throw ProduceException. I don't see that the delivery report passed to this handler contains the original exception (ProduceException), and I am trying to determine when I should retry or fail. Is there a recommended way to return RETRY by inspecting properties of report.Error (since Error.IsRetriable is internal 😢)?

2 replies

LGouellec Aug 21, 2024
Maintainer

@bmmptlgc
You can check out how I analyse the ProduceException:

streamiz/core/Kafka/Internal/RecordCollector.cs

Line 254 in dbea20d

private void HandleError(DeliveryReport<byte[], byte[]> report)

bmmptlgc Sep 2, 2024
Author

If I understand it correctly, you first determine if the error is recoverable and only call the ProductionExceptionHandler if it is not recoverable. So inside my ProductionExceptionHandler I can assume the error is not recoverable and FAIL. Have you ever encountered a scenario where you returned Retry in a ProductionExceptionHandler?

LGouellec · 2024-09-02T22:20:13Z

LGouellec
Sep 2, 2024
Maintainer

Hey @bmmptlgc ,

Some errors are neither retriable nor manageable, for instance if you don't have the authorisation to publish to the topic. In that case it doesn't make sense to CONTINUE or RETRY, because the next time you will get the same exception.

So this is the workflow to manage a ProductionException:

1. Determine if the error is fatal. If yes, FAIL.
2. Determine if the error is recoverable. There are 3 possible errors: TransactionCoordinatorFenced, UnknownProducerId or OutOfOrderSequenceNumber. In this case the tasks are migrated to another instance, so I have a special trick internally to manage this error.
3. As the last option, call the ProductionExceptionHandler:
-- if the developer returns FAIL => throw the exception on top of the StreamThread
-- if the developer returns RETRY => put the original record into a retry queue
-- if the developer returns CONTINUE => log and skip the message
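
The workflow above can be sketched as a small decision function. This is an illustration under stated assumptions, not the Streamiz implementation: the enums `HandlerResponse` and `Outcome` and the class `ProduceErrorWorkflow` are hypothetical stand-ins, and the fatal/recoverable flags are passed in rather than derived from a real delivery report.

```csharp
using System;

// Hypothetical stand-ins for illustration; these are not Streamiz types.
public enum HandlerResponse { FAIL, RETRY, CONTINUE }

public enum Outcome
{
    ThrowOnStreamThread,   // the stream thread dies
    HandleTaskMigration,   // recoverable error, handled internally
    EnqueueForRetry,       // original record goes to a retry queue
    LogAndSkip             // record is dropped
}

public static class ProduceErrorWorkflow
{
    public static Outcome Decide(
        bool isFatal, bool isRecoverable, Func<HandlerResponse> handler)
    {
        // Step 1: fatal errors always fail, the handler is not consulted.
        if (isFatal) return Outcome.ThrowOnStreamThread;

        // Step 2: recoverable errors are handled internally.
        if (isRecoverable) return Outcome.HandleTaskMigration;

        // Step 3: only now is the developer's handler asked for a decision.
        switch (handler())
        {
            case HandlerResponse.FAIL: return Outcome.ThrowOnStreamThread;
            case HandlerResponse.RETRY: return Outcome.EnqueueForRetry;
            case HandlerResponse.CONTINUE: return Outcome.LogAndSkip;
            default: throw new ArgumentOutOfRangeException(nameof(handler));
        }
    }
}
```

Under this reading, the handler only ever sees errors that are neither fatal nor one of the three recoverable ones.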

Hope it's clear

3 replies

bmmptlgc Sep 3, 2024
Author

It is mostly clear :)

My question is about -- if the developer return RETRY => put the original record, into a retry queue.

If we already determined the error is not recoverable (not one of the 3 errors mentioned above) why would the developer return Retry? Wouldn't it just error again over and over in a loop?

LGouellec Sep 3, 2024
Maintainer

RETRY could make sense if the payload is too large (in the meantime, ops can increase the maximum record size at the topic level).
Kafka has an internal retry mechanism. If the number of retries is exceeded, for instance because there are not enough replicas, the delivery report will contain the error and the developer can choose to retry and resend the message.
And finally, if the Kafka cluster or the network is not stable, the developer can retry instead of raising an exception on top of the StreamThread.
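
To avoid retrying the same record forever, a handler could keep a retry budget per record. The sketch below is hypothetical: `RetryBudget`, `RetryDecision` and the `looksTransient` flag are names made up for this example, and it assumes the handler is called again each time a retried record fails. How to classify an error as transient (record too large, not enough replicas, unstable network) is left to the caller.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the handler's return value; not a Streamiz type.
public enum RetryDecision { FAIL, RETRY }

// Tracks how many times each record has been retried and gives up after
// maxAttempts, so a persistent error cannot loop indefinitely.
public sealed class RetryBudget
{
    private readonly int maxAttempts;
    private readonly Dictionary<string, int> attempts = new Dictionary<string, int>();

    public RetryBudget(int maxAttempts)
    {
        this.maxAttempts = maxAttempts;
    }

    public RetryDecision Decide(string recordId, bool looksTransient)
    {
        // Errors that will not go away on their own fail immediately.
        if (!looksTransient) return RetryDecision.FAIL;

        lock (attempts)
        {
            attempts.TryGetValue(recordId, out var seen);
            if (seen >= maxAttempts)
            {
                attempts.Remove(recordId); // budget exhausted, stop tracking
                return RetryDecision.FAIL;
            }
            attempts[recordId] = seen + 1;
            return RetryDecision.RETRY;
        }
    }
}
```

A record identifier could be built from topic, partition and offset or from the message key; which one is appropriate depends on the application.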

bmmptlgc Sep 5, 2024
Author

Thanks for the clarification.

bmmptlgc · 2024-09-05T17:02:47Z

bmmptlgc
Sep 5, 2024
Author

Another topic related to error handling is Observability.

I was able to set it up on my topic splitter, and the "Streamiz" metrics are being exported by my company's own observability implementation. All I had to do, aside from configuring OpenTelemetry like we do for any other application and adding the "Streamiz" meter to the builder, was to set up the metrics reporter this way:

MetricsReporter = new OpenTelemetryMetricsExporter().ExposeMetrics

Then, when the metrics are exported (by default every 30 seconds) I get metrics exported like this one, for example:

Export stream_thread_metrics_task_closed_rate, The average per-second number of closed tasks, Meter: Streamiz
(2024-09-05T16:45:46.5108832Z, 2024-09-05T16:46:24.3738027Z] thread_id: kafka-topic-splitter-products-7d6f3d53-2696-450d-a242-73befd8c8c9c-stream-thread-0 DoubleGauge
Value: 0.0170520428347316

However, if my stream fails and shuts down, any metrics that were added to the sensors collection after the last export and before the shutdown are not exported.

Should the Streamiz code call MetricUtils.ExportMetrics(...) on shutdown?

1 reply

LGouellec Sep 5, 2024
Maintainer

Probably yes, you're right. ExportMetrics is called regularly, but if your stream fails between two exports, the last metric values are not exported.
That could be a great enhancement. Feel free to create a specific GH issue for it.

Answer selected by LGouellec

bmmptlgc · 2024-09-05T17:17:16Z

bmmptlgc
Sep 5, 2024
Author

Still on the topic of observability, I couldn't see any sensors recording failed messages. One of my requirements is to record metrics about failures. The only place I can do this, that I know of, is in the InnerExceptionHandler, but all this handler receives is the exception. I would like to record a metric with Topic, Partition, Message Type and so on... Is this possible?

3 replies

LGouellec Sep 5, 2024
Maintainer

We have a dropped-record sensor which tracks the number of records dropped at the end of the production:

streamiz/core/Kafka/Internal/RecordCollector.cs

Line 312 in dbea20d

droppedRecordsSensor.Record();

But this metric only increases if the record fails and you choose to CONTINUE; if you FAIL or RETRY, this counter doesn't increase.

Btw, I don't think the metrics should contain topic, partition, etc., because a metrics sensor is just a string key and a double value.

bmmptlgc Sep 6, 2024
Author

We have a standard of adding Topic, Partition, Message Type as metric tags.

In my implementation I am always returning FAIL, so I will not get the droppedRecordsSensor metrics. Because of that I am recording my own metrics when errors occur and making the best I can of the context data the exception handlers receive:

_streamConfig.InnerExceptionHandler = exception =>
{
    Logger.Error(exception, "Failed to process a message");
    
    MetricsRecorder
        .Counter(MetricsRegistry.KafkaTopicSplitterTopologyException)
        .Increment(1);
    
    return ExceptionHandlerResponse.FAIL;
};
_streamConfig.DeserializationExceptionHandler = (_, result, exception) =>
{
    result.Message.Headers.TryGetLast(KnownHeaders.MessageType, out var typeName);
    
    Logger.Error(exception,
        "Failed to deserialize a message of type {MessageType} from topic {Topic}",
        typeName ?? "N/A",
        result.Topic);
    
    MetricTags metricTags = new(
        (Constants.MetricTags.Partition, result.Partition.Value),
        (Constants.MetricTags.Topic, result.Topic),
        (Constants.MetricTags.MessageType, typeName ?? "N/A"));
    
    MetricsRecorder
        .Counter(MetricsRegistry.KafkaTopicSplitterDeserialization)
        .Increment(1, metricTags);
    
    return ExceptionHandlerResponse.FAIL;
};
_streamConfig.ProductionExceptionHandler = report => 
{
    report.Message.Headers.TryGetLast(KnownHeaders.MessageType, out var typeName);
    
    Logger.Error("Failed to produce a message of type {MessageType} from topic {Topic}",
        typeName ?? "N/A",
        report.Topic);
    
    MetricTags metricTags = new(
        (Constants.MetricTags.Partition, report.Partition.Value),
        (Constants.MetricTags.Topic, report.Topic),
        (Constants.MetricTags.MessageType, typeName ?? "N/A"));
    
    MetricsRecorder
        .Counter(MetricsRegistry.KafkaTopicSplitterProducer)
        .Increment(1, metricTags);
    
    return ProductionExceptionHandlerResponse.FAIL;
};

It is only the InnerExceptionHandler that doesn't receive the data I am looking for, but we can live with that. We can look at the logs after we start getting Prometheus alerts for these metrics.

Thanks for all the help so far.

LGouellec Sep 6, 2024
Maintainer

The _streamConfig.InnerExceptionHandler doesn't receive that data because the exception can come from a CommitException, BeginTransaction, or something else not related to a processed message.

No worries !
