Better processor ideas #1141
Replies: 5 comments 5 replies
-
Any thoughts on how to make the ergonomics of language processors like JavaScript better? Right now, having to embed the code within the configuration feels less than ideal.
-
Thanks for the list, Lovro! It looks really great. 👍 I'd suggest thinking about two more:
-
Tagging this request from @lyuboxa #1172 as relevant to Better Processors work.
-
For wasm I use Wazero and NATS KV. Wazero is the runner. NATS KV stores the wasm binaries, and the NATS pub/sub system ensures they are where they need to be across any cluster of nodes. NATS looks to repo releases to get the wasm, so it's sort of doing CDC off any GitHub release. A global registry of what to look for can be in git or S3.
Also, the core foundational storage of Conduit uses BadgerDB, PostgreSQL or in-memory. I started working on a NATS JetStream based storage system because the nature of the data is KV, and the NATS KV system gives you streaming KV and easy clustering of that storage, as well as fault tolerance with no need for any load balancers etc.
I really like the things being brought up in this discussion. If I can help I would try to free up some time.
-
Shipped as part of
-
This discussion can serve as a starting point to explore and ultimately choose the specific improvements we want to apply to processors.
Current state
Let's be honest, Conduit processors are currently pretty basic. We would like to make them more powerful to give users more freedom when manipulating in-flight data.
Initially, processors were inspired by single message transforms (SMTs) provided by Kafka Connect. We tried to mimic their behavior as well as the processor names to make Conduit more approachable for new users with prior Kafka Connect experience. A couple of issues arose because of this. The main issue was that Conduit works with a single OpenCDC record that combines the key and payload, whereas Kafka Connect has separate payloads for both. The OpenCDC record additionally provides fields like metadata, position, operation, and two fields for the payload (before and after). Conduit also differentiates between raw data and structured data. All of this meant we tried to jam a square peg into a round hole and ended up with sub-par processors.
Potential improvements
Ability to process any field
In Conduit a single record is always represented as an OpenCDC record, meaning that it has the following structure:
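Roughly, that structure looks like the sketch below, based on the fields described in this discussion; the exact type definitions in the Conduit codebase may differ in detail:

```go
// Sketch of the OpenCDC record shape, based on the fields described in this
// discussion; the real definitions in the Conduit codebase may differ.
type Record struct {
	Position  []byte            // where the record came from in the source
	Operation string            // e.g. create, update, delete, snapshot
	Metadata  map[string]string // arbitrary key/value metadata
	Key       Data              // record key (raw or structured)
	Payload   Change            // payload before and after the change
}

type Change struct {
	Before Data // payload before the change (e.g. for updates)
	After  Data // payload after the change
}

// Data is either raw bytes or structured data.
type Data interface{}
```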
Currently, built-in processors can only manipulate the fields Record.Key and Record.Payload.After. We should change processors so they can be configured to work on any of these fields, maybe even multiple fields at the same time (e.g. copy payload data into metadata).
Better type names
Processor types are currently very hard to read. They are all in lowercase letters without spaces and normally end with "key" or "payload" depending on which field they manipulate (e.g. timestampconverterpayload). We should make processor types easier to read and group together processors that do the same thing but work on different fields. The field could be chosen by a config option (see ability to process any field).
Logging
Processors do not have access to a logger, so we had to resort to workarounds in processors that need it (example). When building a processor we should give it access to a logger.
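As a rough illustration of what that could look like (the constructor name and signature below are made up for the example, not an existing Conduit API):

```go
import "github.com/rs/zerolog"

// Hypothetical constructor: Conduit hands the processor a ready-made logger,
// scoped to the pipeline it runs in, instead of the processor having to work
// around the missing logger itself.
type MaskFieldProcessor struct {
	logger zerolog.Logger
	field  string
}

func NewMaskFieldProcessor(logger zerolog.Logger, config map[string]string) (*MaskFieldProcessor, error) {
	logger.Debug().Str("field", config["field"]).Msg("building maskfield processor")
	return &MaskFieldProcessor{logger: logger, field: config["field"]}, nil
}
```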
Discovering processor types
The UI has poor support for processors because they need to be hard coded. Currently, there is no way to retrieve the available processor types and their specifications.
We should expose an endpoint that returns all processor types and their specifications, similar to how we already expose the list of plugins and their specifications.
For example, something like this:
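(A rough sketch, modeled after the existing connector plugin specifications; the type and field names are illustrative, not an existing API.)

```go
// Illustrative response item for a "list processor types" endpoint.
type ProcessorSpecification struct {
	Name        string               // processor type, e.g. "maskfield"
	Summary     string               // short one-line description
	Description string               // longer description shown in the UI
	Version     string
	Parameters  map[string]Parameter // configuration parameters the processor accepts
}

// Illustrative parameter description, enough for the UI to render a form and
// validate user input.
type Parameter struct {
	Description string
	Default     string
	Required    bool
}
```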
OpenCDC unwrap processor
The unwrap processor currently only supports the formats debezium and kafka-connect. We should also provide support to unwrap an OpenCDC record. This will be particularly useful when reprocessing records coming from the DLQ, since those records contain an OpenCDC record in the payload.
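To sketch the idea, assuming for illustration that the nested record is serialized as JSON in the raw payload (reusing the Record sketch from above; the helper below is not an existing processor):

```go
import (
	"encoding/json"
	"fmt"
)

// Illustrative only: replace a record whose payload carries a serialized
// OpenCDC record (as DLQ records do) with that inner record.
func unwrapOpenCDC(rec Record) (Record, error) {
	raw, ok := rec.Payload.After.([]byte)
	if !ok {
		return Record{}, fmt.Errorf("expected raw payload data")
	}

	var inner Record
	if err := json.Unmarshal(raw, &inner); err != nil {
		return Record{}, fmt.Errorf("payload does not contain an OpenCDC record: %w", err)
	}

	// The inner record continues down the pipeline in place of the outer one.
	return inner, nil
}
```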
Processor lifecycle
A processor is currently just a simple function that gets called for every record. There is no specific method called when initializing or closing a processor. To support more complex processors that need to open/close resources at the start/end, we should introduce more methods. This is a prerequisite if we want to have pluggable processors that need to be cleaned up when the pipeline stops running.
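One possible shape, with names that are purely illustrative:

```go
import "context"

// Hypothetical lifecycle-aware processor interface. Today only the equivalent
// of Process exists; Open and Teardown would let a processor acquire and
// release resources when the pipeline starts and stops.
type Processor interface {
	// Open is called once before the first record is processed.
	Open(ctx context.Context) error
	// Process is called for every record flowing through the pipeline.
	Process(ctx context.Context, rec Record) (Record, error)
	// Teardown is called once after processing stops, even on failure.
	Teardown(ctx context.Context) error
}
```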
Permanent storage
A processor can only store data in memory while it's running; it has no access to permanent storage. If we provided a way to store data permanently, processors would get the ability to do stateful processing (e.g. aggregations) across pipeline restarts.
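For example, processors could be handed a small key-value store whose contents survive restarts. The interface below is made up for illustration, with the data living in whatever database Conduit is already configured to use (BadgerDB, PostgreSQL or in-memory):

```go
import (
	"context"
	"strconv"
)

// Hypothetical store handed to a processor by Conduit; keys would be
// namespaced to the processor so different processors cannot clash.
type Store interface {
	Get(ctx context.Context, key string) ([]byte, error)
	Set(ctx context.Context, key string, value []byte) error
}

// Example: a counter that keeps aggregating across pipeline restarts.
func incrementCount(ctx context.Context, store Store, key string) error {
	raw, _ := store.Get(ctx, key) // in this sketch a missing key yields an empty value
	count, _ := strconv.Atoi(string(raw))
	return store.Set(ctx, key, []byte(strconv.Itoa(count+1)))
}
```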
Splitting and combining records
Current processors are simple functions that take a record and return a record. They can modify the record as they see fit or drop it entirely, but they can't split it into multiple records or combine multiple records into one record.
Giving processors the ability to split records would allow us to denormalize data, while combining records would allow us to aggregate it. For this to work across restarts we would also need to give processors access to permanent storage.
Note that we will need to consider how we would give the processor access to multiple records at once. If the processor relies on being called by the node (as it does now), then it can't combine records based on a time window, as there will not necessarily be a record at the point the time window gets closed.
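In terms of the processing signature, the change could look roughly like this (illustrative only):

```go
import "context"

// Today's shape: one record in, one record out (or an error / a drop).
type ProcessFunc func(ctx context.Context, rec Record) (Record, error)

// A possible future shape: zero, one or many records out, which covers
// dropping, splitting (denormalizing) and, combined with permanent storage,
// aggregating records across calls.
type MultiProcessFunc func(ctx context.Context, rec Record) ([]Record, error)
```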
Pluggable processors
Pluggable processors or bring-your-own-processor (BYOP) is the ability to provide a custom processor implementation, similarly to how we currently allow the user to provide standalone connector implementations. We already provide a similar feature using the JavaScript processor, although it has many drawbacks like poor performance and the fact that users are forced to write JavaScript. We should think about alternatives that would improve the experience of writing your own processor like go-plugin or WebAssembly.
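As a taste of the WebAssembly route, here is a minimal sketch of loading and invoking a user-built module with wazero; the file name, exported function name and the record-passing convention are all assumptions for the example, not an existing Conduit API:

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/tetratelabs/wazero"
	"github.com/tetratelabs/wazero/imports/wasi_snapshot_preview1"
)

func main() {
	ctx := context.Background()

	// The user-built processor, e.g. compiled with TinyGo or Rust.
	wasmBytes, err := os.ReadFile("processor.wasm")
	if err != nil {
		log.Fatal(err)
	}

	r := wazero.NewRuntime(ctx)
	defer r.Close(ctx)

	// Needed for modules compiled with WASI support (e.g. TinyGo).
	wasi_snapshot_preview1.MustInstantiate(ctx, r)

	mod, err := r.Instantiate(ctx, wasmBytes)
	if err != nil {
		log.Fatal(err)
	}

	// A real integration would serialize the OpenCDC record, copy it into the
	// module's linear memory and pass a pointer/length pair; here we only call
	// a parameterless exported function to show the mechanics.
	if _, err := mod.ExportedFunction("process").Call(ctx); err != nil {
		log.Fatal(err)
	}
}
```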