Another overload of distinct #1729
Closed
julianhowarth started this conversation in Ideas
Replies: 2 comments 3 replies
-
Sounds reasonable, so +1. @jponge is on PTO this week, but will be able to assist you next week.
3 replies
-
Implemented in #1731
0 replies
-
We are using Mutiny with a streaming process reading data from AWS DynamoDB, doing some transformations and then writing to S3. Doing it this way means the memory footprint is low as we never need to hold the entire dataset in memory.
We recently discovered we were getting duplicates in the dataset and so turned to `Multi`'s `select().distinct(comparator)` operator in order to remove them. This caused the memory usage to balloon, as now the entire dataset has to be held in memory for duplicate checking. However, our duplicate check is trivial, based on an integer id only, so there is no need for the entire dataset to be in memory, just a `Set<Integer>`.
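To make the memory issue concrete, here is a minimal sketch of that kind of pipeline, where the `Item` record and the in-memory source are hypothetical stand-ins for the real DynamoDB records rather than code from the original post:

```java
import io.smallrye.mutiny.Multi;

public class ComparatorDistinctSketch {

    // Hypothetical stand-in for the real DynamoDB record type.
    record Item(int id, String payload) {}

    public static void main(String[] args) {
        // In the real pipeline this would be a streaming DynamoDB read, not an in-memory list.
        Multi<Item> items = Multi.createFrom().items(
                new Item(1, "a"), new Item(2, "b"), new Item(1, "a-duplicate"));

        // distinct(comparator) removes the duplicates, but it has to retain every item
        // seen so far in order to compare new items against them, so memory grows
        // with the size of the dataset.
        items.select()
                .distinct((left, right) -> Integer.compare(left.id(), right.id()))
                .subscribe().with(item -> System.out.println(item.id() + " " + item.payload()));
    }
}
```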
We've been able to solve this very easily by using `plug` with a custom `AbstractMultiOperator` which uses an extractor instead of a `Comparator` to extract the id from each of our objects and store that in a `Set` rather than the objects themselves. Doing this has returned the memory footprint to its previous low value.
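As a simplified stand-in for the same idea (a stateful `select().where(...)` predicate rather than a full `AbstractMultiOperator` plugged in with `plug`), key-based de-duplication might look roughly like this, keeping only a `Set<Integer>` of ids:

```java
import io.smallrye.mutiny.Multi;

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToIntFunction;

public class DistinctByIdSketch {

    // Simplified stand-in for the plug()-based custom operator described above:
    // only the integer ids are retained in a Set, never the items themselves,
    // so the memory footprint grows with the number of distinct ids only.
    static <T> Multi<T> distinctById(Multi<T> upstream, ToIntFunction<? super T> idExtractor) {
        Set<Integer> seenIds = ConcurrentHashMap.newKeySet();
        // Set.add returns false for an id that was already seen, so duplicates are dropped.
        return upstream.select().where(item -> seenIds.add(idExtractor.applyAsInt(item)));
    }
}
```

With the `Item` type from the earlier sketch, usage would be `distinctById(items, Item::id)`. Note that the `Set` is created when the pipeline is assembled, so this sketch only suits a `Multi` that is subscribed to once.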
Would it be useful to have this directly in Mutiny? That is, to add a new method to `MultiSelect` where a `keyExtractor` is responsible for extracting a key which is used to de-duplicate the values (a sketch of one possible shape follows below). This could be implemented as a separate `AbstractMultiOperator`, something like `MultiDistinctByKeyOp<T, K>`.
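One plausible shape for such an overload, with the method placement and generics as assumptions rather than the final Mutiny API, might be:

```java
import io.smallrye.mutiny.Multi;

import java.util.function.Function;

// Hypothetical shape of the proposed addition to MultiSelect; the exact
// signature is an assumption, not the one from the original proposal.
interface ProposedDistinctOverload<T> {

    // Emits an item only if no previously seen item produced an equal key.
    // Only the extracted keys need to be retained, not the items themselves.
    Multi<T> distinct(Function<? super T, ?> keyExtractor);
}
```

Call sites would then read something like `items.select().distinct(Item::id)`, mirroring the existing comparator overload while retaining only the extracted keys.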