Skip to content

Commit

Permalink
[improve][pip] PIP-352: Event time based topic compactor (apache#22710)
Browse files Browse the repository at this point in the history
  • Loading branch information
marekczajkowski committed Jul 31, 2024
1 parent c249530 commit 9d0292e
Showing 1 changed file with 68 additions and 0 deletions.
68 changes: 68 additions & 0 deletions pip/pip-352.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,68 @@
# PIP-352: Event time based topic compactor

# Background knowledge

Pulsar Topic Compaction provides a key-based data retention mechanism that allows you only to keep the most recent message associated with that key to reduce storage space and improve system efficiency.

Another Pulsar's internal use case, the Topic Compaction of the new load balancer, changed the strategy of compaction. It only keeps the first value of the key. For more detail, see [PIP-215](https://github.com/apache/pulsar/issues/18099).

There is also plugable topic compaction service present. For more details, see [PIP-278](https://github.com/apache/pulsar/pull/20624)

More topic compaction details can be found in [Pulsar Topic Compaction](https://pulsar.apache.org/docs/en/concepts-topic-compaction/).

# Motivation

Currently, there are two types of compactors
available: `TwoPhaseCompactor` and `StrategicTwoPhaseCompactor`. The latter
is specifically utilized for internal load balancing purposes and is not
employed for regular compaction of Pulsar topics. On the other hand, the
former can be configured via `CompactionServiceFactory` in the
`broker.conf`.

I believe it could be advantageous to introduce another type of topic
compactor that operates based on event time. Such a compactor would have
the capability to maintain desired messages within the topic while
preserving the order expected by external applications. Although
applications may send messages with the current event time, variations in
network conditions or redeliveries could result in messages being stored in
the Pulsar topic in a different order than intended. Implementing event
time-based checks could mitigate this inconvenience.

# Goals
* No impact on current topic compation behavior
* Preserve the order of messages during compaction regardless of network latencies

## In Scope
* Abstract TwoPhaseCompactor

* Migrate the current implementation to a new abstraction

* Introduce new compactor based on event time

* Makes existing tests compatible with new implementations.


# High Level Design

In order to change the way topic is compacted we need to create `EventTimeCompactionServiceFactory`. This service provides a new
compactor `EventTimeOrderCompactor` which has a logic similar to existing `TwoPhaseCompactor` with a slightly change in algorithm responsible for
deciding which message is outdated.

New compaction service factory can be enabled via `compactionServiceFactoryClassName`

# Detailed Design

## Design & Implementation Details

* Abstract `TwoPhaseCompactor` and move current logic to new `PublishingOrderCompactor`

* Implement `EventTimeCompactionServiceFactory` and `EventTimeOrderCompactor`

* Create `MessageCompactionData` as a holder for compaction related data

Example implementation can be found [here](https://github.com/apache/pulsar/pull/22517/files)

# Links

* Mailing List discussion thread: https://lists.apache.org/thread/nc8r3tm9xv03vl30zrmfhd19q2k308y2
* Mailing List voting thread: https://lists.apache.org/thread/pp6c0qqw51yjw9szsnl2jbgjsqrx7wkn

0 comments on commit 9d0292e

Please sign in to comment.