We're experiencing large write performance degradation over time as a result of sustained IO/throughput happening in BoltDB. But first, a little about our use case:
We're writing a sequential log of data, wherein the majority of our keys are 64 byte integers that primarily increase. One bucket serves to hold a variable payload size (change stream from Postgres).
We clean up old log data that has passed some expiration. In all the stats below, we're using a 48 hour retention period and clean up hourly. Deletes are done in batches of 1000 using manual transaction management.
We currently have 3 buckets:
#1 Bucket(64 byte variable key) -> Bucket(64 byte integer key, ascending insertion order generally) -> N bytes (1KB-128KB; think database change log of old row/new row)
#2 Bucket(64 byte integer key, ascending insertion order generally) -> 64 byte (max) random value
All use 50% (default) fill.
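For illustration, the write path looks roughly like this. This is a simplified sketch, not our actual code: the bucket names are made up, and I'm treating the integer keys as 8-byte big-endian uint64s so they sort in insertion order.

```go
package changelog

import (
	"encoding/binary"

	"github.com/boltdb/bolt"
)

// writeChange appends one change-log entry. Names and encodings here are
// illustrative only: "changes" plays the role of Bucket #1 (variable key ->
// nested bucket of integer keys -> payload) and "index" the role of Bucket #2.
func writeChange(db *bolt.DB, stream []byte, seq uint64, payload, meta []byte) error {
	key := make([]byte, 8)
	binary.BigEndian.PutUint64(key, seq) // big-endian so keys sort numerically

	return db.Update(func(tx *bolt.Tx) error {
		// Bucket #1: variable key -> sub-bucket keyed by ascending integers.
		top, err := tx.CreateBucketIfNotExists([]byte("changes"))
		if err != nil {
			return err
		}
		sub, err := top.CreateBucketIfNotExists(stream)
		if err != nil {
			return err
		}
		if err := sub.Put(key, payload); err != nil { // 1KB-128KB values
			return err
		}

		// Bucket #2: integer key -> small (<= 64 byte) value.
		idx, err := tx.CreateBucketIfNotExists([]byte("index"))
		if err != nil {
			return err
		}
		return idx.Put(key, meta)
	})
}
```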
After running the server for 4-5 days, we ended up with a 32GB database that has pegged the i3.xl we're running it on at ~3-4K IOPS with ~375-400 MB/s throughput. These levels of IO are sustained once they begin; as expected, restarting the server has no effect. During that time, we would have run the log cleaner around 48 times (once per hour). The cleaner removes data by age in the following way:
From Bucket #1, it only cleans up data in the sub buckets in ascending order. The top level variable keys are maintained and never touched.
From Bucket #2, keys are removed in ascending order.
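The cleaner follows roughly this pattern. Again, this is a simplified sketch rather than the production code: the cutoff key and the assumption that "older" means "lower key" are illustrative.

```go
package changelog

import (
	"bytes"

	"github.com/boltdb/bolt"
)

// expireBucket deletes keys below cutoff from one bucket, in ascending order,
// 1000 keys per manually managed transaction. path names the bucket (with any
// parent buckets), e.g. ["changes", "<stream>"] for a Bucket #1 sub-bucket or
// ["index"] for Bucket #2. Names and the cutoff encoding are illustrative.
func expireBucket(db *bolt.DB, path [][]byte, cutoff []byte) error {
	for {
		tx, err := db.Begin(true) // manual transaction management
		if err != nil {
			return err
		}

		b := tx.Bucket(path[0])
		for _, name := range path[1:] {
			if b == nil {
				break
			}
			b = b.Bucket(name)
		}
		if b == nil {
			tx.Rollback()
			return nil
		}

		// Collect up to 1000 expired keys with a cursor, then delete them.
		var batch [][]byte
		c := b.Cursor()
		for k, _ := c.First(); k != nil && bytes.Compare(k, cutoff) < 0; k, _ = c.Next() {
			batch = append(batch, append([]byte(nil), k...)) // copy: k is invalidated once the bucket is modified
			if len(batch) == 1000 {
				break
			}
		}
		for _, k := range batch {
			if err := b.Delete(k); err != nil {
				tx.Rollback()
				return err
			}
		}

		if err := tx.Commit(); err != nil {
			return err
		}
		if len(batch) < 1000 {
			return nil // nothing older than the cutoff remains
		}
	}
}
```

In practice something like this runs once an hour against each sub-bucket of Bucket #1 and against Bucket #2, with the cutoff derived from the 48 hour retention window.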
Here's a profile of the time it took to complete the clean up batches, with their duration and the number of rows removed (from the default bucket):
I ran the bolt tool against the database at 32GB:
Then we compacted the database, which took it to ~17GB:
Booting the server against the 17GB database, the IOPS/throughput returned to normal for around 12 hours. At that point, everything returned to the same level of degradation as the 32GB database:
DB.Stats(), with cumulative and a 10s diff: https://gist.github.com/chuckg/bf847607adaef03b7de19b6c0ff1d9b0#file-db-stats-18gb-10-second-diff-included
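For context, a 10-second diff like the one in the gist can be produced by subtracting successive DB.Stats() snapshots with Stats.Sub(). A rough sketch; the fields logged here are illustrative, not exactly what's in the gist:

```go
package changelog

import (
	"log"
	"time"

	"github.com/boltdb/bolt"
)

// logStats prints cumulative Stats plus a 10-second diff. Field selection and
// output format are illustrative, not the exact collector behind the gist.
func logStats(db *bolt.DB) {
	prev := db.Stats()
	for range time.Tick(10 * time.Second) {
		cur := db.Stats()
		diff := cur.Sub(&prev)

		log.Printf("cumulative: free-pages=%d pending-pages=%d tx-writes=%d write-time=%s",
			cur.FreePageN, cur.PendingPageN, cur.TxStats.Write, cur.TxStats.WriteTime)
		log.Printf("10s diff: spill=%d spill-time=%s rebalance=%d rebalance-time=%s writes=%d write-time=%s",
			diff.TxStats.Spill, diff.TxStats.SpillTime,
			diff.TxStats.Rebalance, diff.TxStats.RebalanceTime,
			diff.TxStats.Write, diff.TxStats.WriteTime)

		prev = cur
	}
}
```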
Finally, here's cpu/alloc from pprof on the ~18GB server while it was pegging the IO:
cpu profile: https://rawgit.com/chuckg/bf847607adaef03b7de19b6c0ff1d9b0/raw/b2f56542a36799d836ea2ab710fa362454d642ca/18GB.pprof.sample.cpu.svg
alloc_objects: https://rawgit.com/chuckg/bf847607adaef03b7de19b6c0ff1d9b0/raw/b2f56542a36799d836ea2ab710fa362454d642ca/18GB.pprof.alloc_objects.alloc_space.svg
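For completeness, the profiles came from pprof against the live process; the stock net/http/pprof handler is one common way to expose them. The port and setup below are illustrative:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
	// Serve profiling endpoints on a side port (port is illustrative), then e.g.
	//   go tool pprof http://localhost:6060/debug/pprof/profile            (CPU)
	//   go tool pprof -alloc_objects http://localhost:6060/debug/pprof/heap
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```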
The culprit appears to be runtime.memmove and runtime.memclr, which I'm guessing are spending most of their time reshuffling our data. Any thoughts on how to address the issue?