Reduce disk space for vochain statedb #309
Oh, I see now the state stores the whole set of bytes of every cast vote https://github.com/vocdoni/vocdoni-node/blob/master/vochain/state.go#L576 In the previous version of statedb we only saved the hash of the vote, since the vote bytes can be recovered from the Tendermint blockstore. Is there a reason for this change @ed255?

EDIT: Wrong comment, the current code stores only the hash 😅
The current code stores:

```go
sdbVote := models.StateDBVote{
	VoteHash:  ethereum.HashRaw(voteBytes),
	ProcessId: vote.ProcessId,
	Nullifier: vote.Nullifier,
}
```

This is three hashes of 32 bytes each, so 96 bytes per vote. Taking into account the number of votes, the total raw store data for the votes should be around 438 MiB, which is far from the 9 GiB currently stored by statedb.
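As a back-of-the-envelope check, the 96 bytes/vote figure can be inverted to recover the approximate vote count the 438 MiB estimate implies (this count is derived here for illustration, not taken from the issue):

```go
package main

import "fmt"

// votesFromTotal inverts the raw-storage estimate: given a total size in
// MiB and the bytes stored per vote, it returns the implied vote count.
func votesFromTotal(totalMiB, perVote int) int {
	return totalMiB * 1024 * 1024 / perVote
}

func main() {
	const hashSize = 32          // VoteHash, ProcessId and Nullifier are 32 bytes each
	const perVote = 3 * hashSize // 96 bytes of raw value data per vote
	fmt.Println(perVote, votesFromTotal(438, perVote)) // 96 bytes/vote, ~4.78M votes
}
```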
Here are two sources of storage overhead that I can think of.

### Storage of nodes in outdated paths in Arbo

For every Tree.Add / Tree.Set, some of the nodes in the merkle tree are updated. For each node update, Arbo creates a new key-value entry in the DB and doesn't delete the entry corresponding to the updated node. This allows traversing the tree from old roots, but it also means that outdated nodes (nodes that can't be reached from the current tree) are stored in the DB. For example, if we have a balanced tree with 1024 leaves (that's 10 levels) and we update each leaf, then every node has been updated and we are now using more than double the space we used before. Here are some rough numbers for this example:
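The growth can be sketched with a quick count (a simplification: it assumes every path-node write creates a distinct DB entry, nothing is deduplicated, and it takes the 10-nodes-per-update figure from this example as given):

```go
package main

import "fmt"

// storedEntries counts key-value entries after updating every leaf once,
// assuming each path-node write creates a new entry and nothing is deleted.
func storedEntries(levels, nodesPerUpdate int) (live, total int) {
	leaves := 1 << levels
	live = 2*leaves - 1 // nodes in a balanced binary tree with 2^levels leaves
	total = live + leaves*nodesPerUpdate
	return live, total
}

func main() {
	// 1024 leaves (10 levels); each leaf update rewrites the 10 nodes on its path.
	live, total := storedEntries(10, 10)
	fmt.Printf("%d reachable nodes, %d stored entries (%.1fx)\n",
		live, total, float64(total)/float64(live))
}
```

"More than double" is the lower bound: once every node has been updated at least once, all of the original entries are outdated, and with one entry per write the overhead is several times larger.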
Notice that a leaf update in this tree updates 10 nodes (the leaf path), so we outdate 10 nodes each time.

### Solutions

A. Arbo could delete outdated nodes on each operation. I think this one is quite easy to achieve; nevertheless, a requirement would be that the StateDB uses a single Tx for a block update, and currently we use more than one Tx to avoid the Tx growing too much. BadgerDB has a preconfigured limit on the number of writes in a Tx, whereas Pebble can use as much memory as is available.

Edit1: I have thought about a solution for the fact that the StateDB does multiple commits internally within a single block. Similarly to the db.Batch type we have, we could have a special type that commits internally when the Tx becomes too big, but stores the deletes in a list that is only applied on db.Batch.Commit(). This way we make sure we don't delete nodes before ending a block, which would be problematic if a crash happened at that point.

B. Figure out a way to only keep nodes accessible from "pinned" snapshots. This is what filesystems that support snapshots do; I don't know how that would work internally. With this we could, for example, pin the last 5 blocks in the StateDB and delete all the nodes that can't be reached from those 5 points. This deletion could happen after a block is ended.

C. On node reboot, export and import all the arbo trees. This would apply to the censusManager and the StateDB. By exporting and importing the leaves we discard all the outdated nodes, although this operation takes longer the bigger the trees are, and it requires a node restart.

### Considerations

Currently the StateDB doesn't store information about the subtree layout. Some of the solutions would require having this layout. This could be done by making sure that after opening a subTree, the subTree config is stored in the NoState KVDB of the mainTree, or in the metadata section of the underlying DB.
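The Edit1 idea above can be sketched roughly as follows (hypothetical types over an in-memory map; the real db.Batch interface in vocdoni-node will differ):

```go
package main

import "fmt"

type kv map[string][]byte

// deferredDeleteBatch flushes writes internally once the pending set grows
// too large, but records deletes in a list that is applied only on the
// final Commit, so outdated nodes are never removed before the block ends.
type deferredDeleteBatch struct {
	db       kv
	pending  kv       // writes not yet flushed
	deletes  []string // deletes deferred until Commit
	maxBatch int
}

func (b *deferredDeleteBatch) Set(k string, v []byte) {
	b.pending[k] = v
	if len(b.pending) >= b.maxBatch {
		b.flush() // internal commit of writes only
	}
}

func (b *deferredDeleteBatch) Delete(k string) {
	b.deletes = append(b.deletes, k) // recorded, not applied yet
}

func (b *deferredDeleteBatch) flush() {
	for k, v := range b.pending {
		b.db[k] = v
	}
	b.pending = kv{}
}

// Commit flushes the remaining writes and only then applies the deletes.
func (b *deferredDeleteBatch) Commit() {
	b.flush()
	for _, k := range b.deletes {
		delete(b.db, k)
	}
	b.deletes = nil
}

func main() {
	b := &deferredDeleteBatch{db: kv{}, pending: kv{}, maxBatch: 2}
	b.Set("a", []byte{1})
	b.Delete("a") // outdated node: must survive until the block is sealed
	b.Set("b", []byte{2}) // triggers an internal flush; "a" is still in the DB
	b.Set("c", []byte{3})
	b.Commit()
	_, ok := b.db["a"]
	fmt.Println(len(b.db), ok) // "a" is gone only after Commit
}
```

A crash before Commit then leaves all nodes in place (only the deferred deletes are lost), which is the safe failure mode the comment describes.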
This would allow programmatically operating on all the subTrees without needing to specify each subTree. There would be a challenge though: a subTree configuration requires 2 functions, and we can't serialize functions into a DB :S

### Key-value DB uses long prefixes in StateDB

The subTrees in the StateDB are stored in a single key-value DB using prefixes. The deeper the subtree, the longer the key prefix will be. For example, the VoteTree is under the ProcessTree under the MainTree, so every key-value pair stored by the VoteTree is prefixed with the ProcessID (which is 32 bytes). If the key-value DB stores the raw key for each entry (instead of storing the key as part of a radix tree, for example), this means a space overhead for each KV entry, which can add up. This needs to be investigated to figure out whether it's a noticeable overhead (for example, write a test where an Arbo tree is filled once with a raw WriteTx and once with a PrefixedWriteTx, then compare the disk usage of both).
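An upper bound on the prefix overhead is easy to estimate, since with raw key storage every entry repeats the full prefix (the entry count below is hypothetical, chosen only to show the order of magnitude):

```go
package main

import "fmt"

// prefixOverhead estimates the extra bytes used when every key under a
// subtree repeats its full prefix, instead of sharing it (as a radix-tree
// key layout would).
func prefixOverhead(entries, prefixLen int) int {
	return entries * prefixLen
}

func main() {
	// Hypothetical numbers: 5M entries under a VoteTree, each key carrying
	// the 32-byte ProcessID prefix.
	overhead := prefixOverhead(5_000_000, 32)
	fmt.Printf("%.1f MiB of prefix overhead\n", float64(overhead)/(1<<20))
}
```

So per-entry prefixes alone would account for on the order of a hundred MiB at these sizes, real but not enough to explain the 9 GiB on its own.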
Update: talked with @ed255 offline, and we agreed to go with option A. Added an issue in arbo to implement this: vocdoni/arbo#25
Finally implemented as part of this PR #1080
For a vochain network with the following details:
The disk space used is 13 GiB, distributed as follows:
While the whole list of transactions (stored by Tendermint) is 3.1 GiB, the vochain state uses 9 GiB. This seems too much, since the state should store only the hash of each vote package, not the full package. Why is this happening, and how can we reduce the amount of disk space required by statedb?