Another overload of distinct #1729
Closed
julianhowarth started this conversation in Ideas
Replies: 2 comments 3 replies
-
Sounds reasonable, so +1. @jponge is on PTO this week, but will be able to assist you next week.
3 replies
-
Implemented in #1731
0 replies
-
We are using Mutiny with a streaming process reading data from AWS DynamoDB, doing some transformations and then writing to S3. Doing it this way means the memory footprint is low as we never need to hold the entire dataset in memory.
We recently discovered we were getting duplicates in the dataset and so turned to `Multi`'s `select().distinct(comparator)` operator in order to remove them. This caused the memory usage to balloon, as now the entire dataset has to be held in memory for duplicate checking. However, our duplicate check is trivial, based on an integer id only, so there is no need for the entire dataset to be in memory, just a `Set<Integer>`.
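To make the memory issue concrete, here is a minimal sketch of that kind of pipeline, where the `Item` record and the in-memory source are hypothetical stand-ins for the real DynamoDB records rather than code from the original post:

```java
import io.smallrye.mutiny.Multi;

public class ComparatorDistinctSketch {

    // Hypothetical stand-in for the real DynamoDB record type.
    record Item(int id, String payload) {}

    public static void main(String[] args) {
        // In the real pipeline this would be a streaming DynamoDB read, not an in-memory list.
        Multi<Item> items = Multi.createFrom().items(
                new Item(1, "a"), new Item(2, "b"), new Item(1, "a-duplicate"));

        // distinct(comparator) removes the duplicates, but it has to retain every item
        // seen so far in order to compare new items against them, so memory grows
        // with the size of the dataset.
        items.select()
                .distinct((left, right) -> Integer.compare(left.id(), right.id()))
                .subscribe().with(item -> System.out.println(item.id() + " " + item.payload()));
    }
}
```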
We've been able to solve this very easily by using `plug` with a custom `AbstractMultiOperator` which uses an extractor instead of a `Comparator` to extract the id from each of our objects and store that in a `Set` rather than the objects themselves. Doing this has returned the memory footprint to its previous low value.
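As a simplified stand-in for the same idea (a stateful `select().where(...)` predicate rather than a full `AbstractMultiOperator` plugged in with `plug`), key-based de-duplication might look roughly like this, keeping only a `Set<Integer>` of ids:

```java
import io.smallrye.mutiny.Multi;

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToIntFunction;

public class DistinctByIdSketch {

    // Simplified stand-in for the plug()-based custom operator described above:
    // only the integer ids are retained in a Set, never the items themselves,
    // so the memory footprint grows with the number of distinct ids only.
    static <T> Multi<T> distinctById(Multi<T> upstream, ToIntFunction<? super T> idExtractor) {
        Set<Integer> seenIds = ConcurrentHashMap.newKeySet();
        // Set.add returns false for an id that was already seen, so duplicates are dropped.
        return upstream.select().where(item -> seenIds.add(idExtractor.applyAsInt(item)));
    }
}
```

With the `Item` type from the earlier sketch, usage would be `distinctById(items, Item::id)`. Note that the `Set` is created when the pipeline is assembled, so this sketch only suits a `Multi` that is subscribed to once.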
Would it be useful to have this directly in Mutiny? That is, to add a new method to `MultiSelect` where a `keyExtractor` is responsible for extracting a key which is used to de-duplicate the values (a sketch of one possible shape follows below). This could be implemented as a separate `AbstractMultiOperator`, something like `MultiDistinctByKeyOp<T, K>`.
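One plausible shape for such an overload, with the method placement and generics as assumptions rather than the final Mutiny API, might be:

```java
import io.smallrye.mutiny.Multi;

import java.util.function.Function;

// Hypothetical shape of the proposed addition to MultiSelect; the exact
// signature is an assumption, not the one from the original proposal.
interface ProposedDistinctOverload<T> {

    // Emits an item only if no previously seen item produced an equal key.
    // Only the extracted keys need to be retained, not the items themselves.
    Multi<T> distinct(Function<? super T, ?> keyExtractor);
}
```

Call sites would then read something like `items.select().distinct(Item::id)`, mirroring the existing comparator overload while retaining only the extracted keys.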