
Capped Hashmap #25

Merged: 16 commits into sigma0-dev:main, Jan 30, 2024
Conversation

ppoliani (Contributor)

Implementation of a HashMap collection with a capped size, based on the ideas discussed in #2.

@ppoliani ppoliani changed the title [WIP] Capped Hashmap Capped Hashmap Jan 28, 2024
@ppoliani ppoliani changed the title Capped Hashmap [WIP] Capped Hashmap Jan 28, 2024
@ppoliani ppoliani changed the title [WIP] Capped Hashmap Capped Hashmap Jan 28, 2024
@mimoo mimoo left a comment

Nice! Thanks for the PR. I've added a few comments; IIUC, the capacity check is not working as intended.

I was also wondering about two different approaches that could have worked. One is to use IndexMap as a dependency instead of HashMap, since it remembers insertion order; I'm not sure how it performs, but it might just be easier to rely on that dependency (a sketch of the idea is below).

Another approach could have been to just wipe out the entire hashmap once we reach X entries. It's a bit of a nuclear approach, but maybe that would have been fine as well :P
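A minimal sketch of the IndexMap idea, assuming the indexmap crate; the CappedIndexMap type and its cap-handling details are illustrative, not part of this PR:

```rust
use indexmap::IndexMap;

/// Hypothetical capped map built on IndexMap.
struct CappedIndexMap<K, V> {
    inner: IndexMap<K, V>,
    capacity: usize,
}

impl<K: std::hash::Hash + Eq, V> CappedIndexMap<K, V> {
    fn insert(&mut self, k: K, v: V) -> Option<K> {
        self.inner.insert(k, v);
        if self.inner.len() > self.capacity {
            // IndexMap preserves insertion order, so index 0 is the oldest
            // entry; shift_remove_index keeps the remaining order intact.
            return self.inner.shift_remove_index(0).map(|(key, _)| key);
        }
        None
    }
}
```

This avoids maintaining a separate VecDeque at the cost of an extra dependency.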

src/committee/node.rs (resolved)
src/committee/node.rs (outdated, resolved)
src/lib.rs (outdated, resolved)
```rust
self.last_items.push_front(k);
}

if self.last_items.len() > self.capacity - 1 {
```
mimoo (Contributor)

I think we should have logs when we reach a quarter of the capacity, half of the capacity, 90% of the capacity, or something like that, so that we know something is happening and unfinished signatures are piling up (something like the sketch below).
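A minimal sketch of such threshold logging, assuming the log crate is available; the exact thresholds and the warn_on_fill helper are illustrative:

```rust
/// Hypothetical helper: warn each time the map lands on a notable fill level.
fn warn_on_fill(len: usize, capacity: usize) {
    // Comparing for equality (rather than >=) means each warning fires once
    // per fill-up instead of on every subsequent insert.
    if len == capacity / 4 || len == capacity / 2 || len == capacity * 9 / 10 {
        log::warn!(
            "capped hashmap at {len}/{capacity} entries; unfinished signatures may be piling up"
        );
    }
}
```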

mimoo (Contributor)

Also, imagine capacity is set to 5: this means the hashmap and vecdeque both have capacity 4. So when the insertion above makes inner and last_items full, the check here evaluates 4 > 4 = false and does nothing, which means the next insert will resize both the hashmap and the vecdeque. I think you made a mistake in the initialization: you should keep the capacity, but have the two structures allocate capacity + 1 instead (see the sketch below). Maybe a test checking the array capacity would help make sure the logic works :)
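A minimal sketch of the suggested fix; the field names follow the snippets in this thread, but the full type and eviction details are assumptions:

```rust
use std::collections::{HashMap, VecDeque};

struct CappedHashMap<K, V> {
    inner: HashMap<K, V>,
    last_items: VecDeque<K>,
    capacity: usize,
}

impl<K: std::hash::Hash + Eq + Copy, V> CappedHashMap<K, V> {
    fn new(capacity: usize) -> Self {
        Self {
            // One extra slot so that going one entry over the cap (just
            // before eviction) never forces either structure to resize.
            inner: HashMap::with_capacity(capacity + 1),
            last_items: VecDeque::with_capacity(capacity + 1),
            capacity,
        }
    }

    /// Inserts, evicting (and returning) the oldest key once the cap is hit.
    /// (Sketch only: re-inserting an existing key is not handled here.)
    fn insert(&mut self, k: K, v: V) -> Option<K> {
        self.inner.insert(k, v);
        self.last_items.push_front(k);
        if self.last_items.len() > self.capacity {
            let oldest = self.last_items.pop_back()?;
            self.inner.remove(&oldest);
            return Some(oldest);
        }
        None
    }
}
```

The test mimoo suggests could then assert that inner.capacity() stays unchanged after inserting well past the cap.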

@ppoliani ppoliani (Contributor, Author) commented Jan 29, 2024

> you should keep the capacity, but have the two structures have capacity + 1 instead

Oh shit, yes that's what I intended to do in the first place. Good point!

ppoliani (Contributor, Author)

I've read about the capacity and how it changes. There is a good explanation here.

I also tested it and it looks like it can double even though the length remains the same.

> it has to reserve "slack space", the amount of which depends on the implementation of the hashmap (and specifically its collision resolution algorithm).

I guess the more keys we insert, the higher the chance of a collision (even though some keys are removed), so it looks like it follows a conservative approach by doubling the capacity.

I believe this is neither a big deal nor a huge performance implication, given that we're not going to be storing thousands or millions of entries in the hash table. And besides, a capacity change does not necessarily mean entries are moved to a new memory location. A sketch of the experiment is below.
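A minimal sketch of that experiment; the numbers are illustrative, and whether capacity actually grows depends on how the standard hashmap handles deleted slots:

```rust
use std::collections::HashMap;

fn main() {
    let mut m: HashMap<u64, u64> = HashMap::with_capacity(8);
    let start = m.capacity();
    for i in 0..10_000u64 {
        m.insert(i, i);
        // Evict an older key so the length stays small and roughly constant.
        if i >= 4 {
            m.remove(&(i - 4));
        }
    }
    println!(
        "len = {}, capacity went from {} to {}",
        m.len(),
        start,
        m.capacity()
    );
}
```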


```rust
/// Inserts a new item into the collection. Returns Some(key), where key is
/// the key that was removed when we reached the max capacity; otherwise
/// returns None.
pub fn insert(&mut self, k: K, v: V) -> Option<K> {
```
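Usage implied by that doc comment, continuing the sketched type from earlier in this thread (capacity 2 is chosen purely for brevity):

```rust
let mut m = CappedHashMap::new(2);
assert_eq!(m.insert(1, "a"), None);
assert_eq!(m.insert(2, "b"), None);
// Exceeding the cap evicts the oldest key and returns it.
assert_eq!(m.insert(3, "c"), Some(1));
```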
mimoo (Contributor)

I don't think we should return the item that was removed, as this might be surprising behavior for something that is supposed to closely mimic a hashmap. If we want this API, it should be named something else (insert_and_get_removed_item or something).

ppoliani (Contributor, Author)

> I don't think we should return the item that was removed

This is actually quite handy, and it can help us do things like this: #2 (comment).

It's similar to what the core HashMap does with its insert fn: it returns the value of the replaced item, and they don't call it something like insert_and_get_replaced_item.

I believe we can call this add_entry to avoid any confusion with the HashMap::insert fn.

mimoo (Contributor)

I'm not saying this is not a useful function, just that it should have a different name so that it does not have surprising behavior, since HashMap::insert does not behave like this. HashMap::insert returns the element that was replaced at the place of insertion, which is different! A quick illustration is below.
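A quick sketch of the distinction, using only standard library behavior (the keys and values are arbitrary):

```rust
use std::collections::HashMap;

fn main() {
    let mut m = HashMap::new();
    assert_eq!(m.insert("a", 1), None);
    // HashMap::insert returns the previous *value* stored under the same key...
    assert_eq!(m.insert("a", 2), Some(1));
    // ...whereas the capped map's insert returns the *key* of a different,
    // evicted entry, which is why a distinct name is less surprising.
}
```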

src/capped_hashmap.rs (outdated, resolved)
```rust
.iter()
.filter(|key| *key != k)
.map(|key| *key)
.collect::<VecDeque<_>>();
```
mimoo (Contributor)

It kinda sucks that we have to go through the whole thing here :D I agree that we should assume that the key to remove is one of the last ones appended (and that the oldest stuff is probably just stale at this point). Maybe a LinkedList is better, as removing something at any index is easier (so a find followed by a remove)? In any case I think this would be better:

```rust
self.last_items
    .iter()
    .position(|key| key == &k)
    .and_then(|pos| self.last_items.remove(pos));
```
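How that one-liner could slot into a remove method (a sketch; the surrounding signature is an assumption based on the other snippets in this thread):

```rust
pub fn remove(&mut self, k: &K) -> Option<V> {
    // Scan from the most recent end (keys are pushed to the front), drop the
    // matching key from the recency queue, then remove the entry itself.
    if let Some(pos) = self.last_items.iter().position(|key| key == k) {
        self.last_items.remove(pos);
    }
    self.inner.remove(k)
}
```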

ppoliani (Contributor, Author)

yes this looks much better 👍

@ppoliani ppoliani (Contributor, Author) commented Jan 29, 2024

A LinkedList isn't faster on removals:

> This operation should compute in O(n) time.

Also, this is what the official docs suggest:

> NOTE: It is almost always better to use Vec or VecDeque because array-based containers are generally faster, more memory efficient, and make better use of CPU cache.

mimoo (Contributor)

ah yeah you'd need some sort of linkedlist + hashmap to know which two nodes to update to remove a node :D

src/capped_hashmap.rs (outdated, resolved)
@mimoo mimoo (Contributor) commented Jan 30, 2024

Nice! Thanks for addressing the nits :)

"cwd": "${workspaceFolder}"
},
]
}
mimoo (Contributor)
you probably didn't mean to push this file :o

@mimoo mimoo merged commit 899ec01 into sigma0-dev:main Jan 30, 2024
1 check passed
@ppoliani ppoliani deleted the feat/capped_hashmap branch January 30, 2024 08:28