
ReadStream with a very large value results in excessive memory use when cache is enabled #66

Open
floren opened this issue Aug 18, 2021 · 2 comments


@floren
Contributor

floren commented Aug 18, 2021

If the cache is enabled, readWithRLock always reads the file using a siphon.

The siphon code copies every byte it reads into a bytes.Buffer. When the full file has been read, that bytes.Buffer is used to update the cache.

However, if the underlying file is, say, a gigabyte in size, the siphon ends up with a bytes.Buffer holding that entire gigabyte. Unless you've set your cache size to over a gigabyte, that buffer is thrown away as soon as the ReadStream is done.

The main reason we use ReadStream is so we can deal with very large items without having to stick the entire thing in memory at once. Having discovered this, we'll probably disable the cache, but there are cases where people may wish to have a cache enabled without blowing up their memory!

floren added a commit to floren/diskv that referenced this issue Aug 18, 2021
The siphon will now stop writing to its internal buffer once the size of the buffer
exceeds the maximum cache size. Because we write until we *exceed* the max cache size,
we're safe to attempt the cache update even if the buffer only contains partial data:
it's still over the limit and will be rejected.
@peterbourgon
Owner

Do you use diskv in situations where there are values with such disparate sizes, and you hope to cache the smaller ones but not cache the larger ones?

@floren
Contributor Author

floren commented Aug 19, 2021

I do use diskv in situations where some values are multiple gigabytes, and some are multiple megabytes.

We noticed this behavior when trying to figure out why sending items from one node's diskv to another took up so much memory. Once we figured out what was going on, we disabled the cache, but thought we'd report the behavior and offer a fix. I believe that if you configure a 100MB cache and then read a 500MB item via ReadStream, you'd be surprised to learn that diskv makes a complete in-memory copy of the item before immediately and unconditionally throwing it away.

If you're not particularly worried about that corner case, feel free to close this issue and the related PR.
