Possible issues #25
Hey @marvin-j97, good morning!! Thank you for this extensive write-up. I've seen your project. Good work, and keep it up.

> The tombstone is a fixed value, so one could break the implementation by writing a "$tombstone" value manually. Also this makes tombstones unnecessarily large.

> I don't understand the "granular page locking" stuff - you don't need locking on SSTables... they are immutable. I don't understand the idea of the "pager" and "pages" at all... you don't seem to use block-based SSTables, that's why I can't find index blocks, but that probably makes compression pretty ineffective compared to RocksDB. Are pages cached, and if so, what's the cache size? Is it set correctly for RocksDB as well?

On read operations many threads can access the Get method. Yes, the memtable is locked, but SSTables are locked at the page level on reads, for granular reads. There is no page caching.

> The Cuckoo filter seems to return null if it returns false... but Cuckoo filters are still a probabilistic data structure, or are you making sure it has an FPR of 0.0? Otherwise it may return a wrong result. How much memory does it use, considering you have to index every key?

I've tried to make it as memory-efficient as I could whilst keeping it usable as a page index for key-value pairs with prefixes. I am not making sure it has an FPR of 0; the FPR with this cuckoo filter is determined, sorta, by the load factor. The idea behind it was to provide a lower FPR than the previous Bloom filter implementation and to offer an index solution for key-value pages.

> I don't see any notion of "levels" in the tree. K4 seems to support range queries, but there seems to be a single level, so you are doing size-tiered compaction. How many SSTables are usually kept around? Looking at the compaction, there's a timer which I don't think makes sense. When you have burst traffic of writes you will end up with more SSTables and have to wait, possibly 1 hour, for the compaction to fix up your read performance. Compactions should operate on the current state of SSTables in the LSM-tree.

The memtable is level 0, whilst every other SSTable is another level in K4. There is always 1 SSTable if compaction occurs very frequently.

> The flushes and compactions don't seem to fsync their data before appending the new SSTables.

They do; the underlying pager syncs data every set interval, escalating every minute and on pager closure.

> The conclusion in the wiki is obviously written by an LLM, not to mention there are wrong statements (in point 4) like K4 having atomic transactions (as if RocksDB doesn't do that) and being able to recover SSTables from the WAL... which makes no sense? You recover memtables from a WAL (as you seem to do as well).

Huh? No. Firstly, this isn't a clone of RocksDB. Everything is designed from my own ideas, referring to a log-structured merge-tree. Yes, you can batch operations in a transaction and commit them atomically to the memtable. And if you have NO SSTables and NO data, you can recover from your WAL file to regenerate your K4 instance as it was.

> Also I think having to call RecoverFromWAL manually is a recipe for disaster. The readme says "If you have a populated WAL file in the data directory but no data files aka sstables", but I don't understand that. What's the connection between having no SSTables and recovering a WAL? You always recover a WAL to restore the memtable(s), no matter if you have SSTables or not.

How can you not understand that? If you have a WAL file that is populated, even from another system, you can recover your K4 instance as it was before.

> Benchmarks use HDD, which probably hides inefficiencies in the read path; also RocksDB is typically run on SSDs.

I did, look at the latest post. I benched on a Google Cloud N2 with a 250 GB SSD, and K4 performed amazingly whereas RocksDB did not at similar configurations. I had to stop the test for RocksDB as 10 million key-values was making it poop the bed.

I have to also add: there is nothing wrong with trying to be innovative and creative with the design choices here. If I copied RocksDB, what would make K4 special?
I starred your project. Keep it up. Storage engines are no small feat.

https://github.com/fjall-rs/fjall is good stuff. I saw this out there as well and gave it some love too.

If you have any more questions, do let me know. I am getting a coffee now and will work on the tombstone patch. Cheers!
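Since the tombstone patch comes up above: one way to avoid a fixed sentinel value is a per-entry flag byte, so user data can never collide with the tombstone marker. A minimal sketch of that idea (my own encoding, not K4's actual on-disk format):

```go
package main

import "fmt"

// Entry encoding with an explicit tombstone flag instead of a magic
// "$tombstone" value: 1 flag byte, then the raw value bytes. A user value
// that happens to equal "$tombstone" is then just a normal value, and
// tombstones shrink to a single byte. (Sketch; not K4's actual format.)

const (
	flagValue     byte = 0
	flagTombstone byte = 1
)

func encodeEntry(value []byte, tombstone bool) []byte {
	if tombstone {
		return []byte{flagTombstone} // a delete needs no payload
	}
	return append([]byte{flagValue}, value...)
}

func decodeEntry(buf []byte) (value []byte, tombstone bool) {
	if buf[0] == flagTombstone {
		return nil, true
	}
	return buf[1:], false
}

func main() {
	// A user writing the literal string "$tombstone" is unambiguous now.
	v, dead := decodeEntry(encodeEntry([]byte("$tombstone"), false))
	fmt.Println(string(v), dead)

	_, dead = decodeEntry(encodeEntry(nil, true))
	fmt.Println(dead)
}
```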
I just tried it in a new Go project:

```go
package main

import (
	"fmt"
	"log"

	"github.com/guycipher/k4/v2"
)

const (
	DB_PATH = "testdb"
)

func main() {
	numOps := 10000000

	db, err := k4.Open(DB_PATH, (1024*1024)*128, 3600, false, false)
	if err != nil {
		log.Fatalf("Error opening K4 database: %v", err)
	}
	defer db.Close()

	key := make([]byte, 20)
	value := make([]byte, 20)

	// Benchmark Put
	for i := 0; i < numOps; i++ {
		key = []byte(fmt.Sprintf("key%d", i))
		value = []byte(fmt.Sprintf("value%d", i))
		if err := db.Put(key, value, nil); err != nil {
			log.Fatalf("Error putting key: %v", err)
		}
	}
}
```

That is my code. It ended up writing a 40 GB WAL and three SSTables of 18 GB, 18 GB and 7 GB respectively, taking about 100 s. That's roughly a space amplification of 200. It also used more than 6 GB of RAM.
My system is an i9-11900K with a Samsung 990 EVO.
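For what it's worth, the ~200x figure above follows directly from the reported sizes (the ~400 MB logical size is 10 M pairs at roughly 40 bytes of key plus value each):

```go
package main

import "fmt"

func main() {
	// Sizes reported in the run above (GB), plus the ~400 MB of logical
	// data (10 M key-value pairs at ~40 bytes each).
	const (
		walGB      = 40.0
		sstablesGB = 18.0 + 18.0 + 7.0
		logicalGB  = 0.4
	)
	onDisk := walGB + sstablesGB
	amplification := onDisk / logicalGB
	fmt.Printf("on disk: %.0f GB, logical: %.1f GB, space amplification: ~%.0fx\n",
		onDisk, logicalGB, amplification)
}
```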
Thank you! Excellent run on great hardware. Something I've been seeing more frequently is that SSTables come out huge compared to the configured threshold. In your example, with a flush threshold of 128 MB, there should have been around 340 SSTables. That's a bug, most definitely. The 6 GB of memory usage, is that through the write operations? That's not too bad for the amount of key-values, in my mind. It could be optimized for less usage, I'd imagine. The WAL is sorta a given, it will be large. Each page is 1024*4 bytes and could be smaller for sure.
It's about 4 GB; after 10 seconds the writes are "finished" and it goes into the 90 seconds of ???, climbing to about 6 GB. Without having checked, I would expect RocksDB to use maybe 100-200 MB here.
I have written just 400 MB of data. That's why every database rotates & truncates the WAL at some point (normally when doing memtable flushes). It's not feasible to keep a WAL that large.
On close we flush the memtable and wait for operations to finish; that includes synchronization and background operations. In the case stated it shouldn't take that long. It's taking that long because of how large the SSTables are; if they followed the threshold, closure would be faster. I am working on this bug currently. 18 GB SSTables are way off threshold. The size is calculated in the skip list as new entries are added. Hm, interesting. I could truncate the WAL after each flush; this could work. Good idea!
Honestly @marvin-j97, I'm very appreciative, eh. Very. Thank you for taking the time to do this. I'll be sure to give you a shoutout next release with these patches.
For the WAL file: because I want to assure you can back up your entire instance from a WAL, it's hard to implement truncation. Yes, there "could" be SSTables, but what if there are not? I am thinking about this. Currently, the best bet to bring down the WAL file size is making the page size for the entire system 1024 bytes. This is fine, as encoded values under that size will keep the files small, but yeah, I need to bench it, as the pager will have to create and link overflow pages and gather them on reads as well.
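The overflow-page mechanism mentioned above (a record larger than the page size spilling into linked pages that are gathered on read) could look roughly like this. The layout is invented for illustration and is not K4's actual pager:

```go
package main

import "fmt"

const pageSize = 1024 // a smaller page keeps small records compact

// page holds a fixed-size chunk of a record plus the index of the next
// overflow page (-1 if none). (Illustrative; not K4's pager layout.)
type page struct {
	data []byte
	next int
}

// writeRecord splits data across pages, linking overflow pages together,
// and returns the index of the first page.
func writeRecord(pages *[]page, data []byte) int {
	first, prev := -1, -1
	for off := 0; off < len(data); off += pageSize {
		end := off + pageSize
		if end > len(data) {
			end = len(data)
		}
		*pages = append(*pages, page{data: data[off:end], next: -1})
		idx := len(*pages) - 1
		if prev >= 0 {
			(*pages)[prev].next = idx
		} else {
			first = idx
		}
		prev = idx
	}
	return first
}

// readRecord follows the overflow chain to reassemble the record.
func readRecord(pages []page, idx int) []byte {
	var out []byte
	for idx >= 0 {
		out = append(out, pages[idx].data...)
		idx = pages[idx].next
	}
	return out
}

func main() {
	var pages []page
	rec := make([]byte, 2500) // spans 3 pages of 1024 bytes
	for i := range rec {
		rec[i] = byte(i % 251)
	}
	id := writeRecord(&pages, rec)
	got := readRecord(pages, id)
	fmt.Println(len(pages), len(got))
}
```

The trade-off is exactly the one noted above: small pages waste less space on small records, but large records cost extra page hops on both write and read.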
One thing I noticed is that Go's gob package, which encodes structures like operation and keyvalue, is not optimized for space; this causes these structures to grow significantly in size after encoding. To counter this, yes, Protobuf, MessagePack, or something similar could be used, or a custom solution.
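To illustrate the gob overhead: encoding/gob writes self-describing type metadata into the stream, so small records balloon compared to a handwritten length-prefixed layout. A quick comparison using my own KeyValue struct (not K4's actual types):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/gob"
	"fmt"
)

type KeyValue struct {
	Key   []byte
	Value []byte
}

// gobEncode encodes the record with encoding/gob, which prepends
// self-describing type metadata to the stream.
func gobEncode(kv KeyValue) []byte {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(kv); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

// compactEncode uses a simple length-prefixed layout:
// [4-byte key len][key][4-byte value len][value].
func compactEncode(kv KeyValue) []byte {
	buf := make([]byte, 0, 8+len(kv.Key)+len(kv.Value))
	buf = binary.LittleEndian.AppendUint32(buf, uint32(len(kv.Key)))
	buf = append(buf, kv.Key...)
	buf = binary.LittleEndian.AppendUint32(buf, uint32(len(kv.Value)))
	buf = append(buf, kv.Value...)
	return buf
}

func main() {
	kv := KeyValue{Key: []byte("key1"), Value: []byte("value1")}
	fmt.Println("gob bytes:", len(gobEncode(kv)))
	fmt.Println("compact bytes:", len(compactEncode(kv)))
}
```

Gob amortizes its type descriptor over a long-lived stream, but when each record is encoded into a fresh encoder (as a pager storing independent entries tends to do), that overhead is paid per record.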
In 2.1.7 I tested 1 million key-value pairs with your code. I got back a memtable of 95.3 MB. After the flush to disk, the file was 1 GB. This is definitely not the pager; I am seeing these encoded structures span multiple pages.
That's what the SSTables are for. To rebuild an LSM-tree you send over the SSTables and the current WAL fragment(s). |
Understood. So essentially on every flush off the queue (in K4 terms) you'd truncate the WAL, as those entries are now safe in an SSTable. So, when you reopen an instance, you replay the WAL to populate the memtable, and the WAL essentially cleans itself up once it truncates and the data transitions to an SSTable? I've got to wrap my head around that. Very interesting stuff. I'm obviously always worried about the actual SSTables. I understand the reasoning here; in K4 we are storing double the data, pretty much, which is sorta bloat.
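The reopen-and-replay flow discussed above, in miniature: each WAL record is re-applied in order to an empty memtable, so the latest write or delete for each key wins. The record format here is invented for illustration, not K4's:

```go
package main

import "fmt"

// walRecord is a minimal WAL entry: a key, a value, and whether the
// operation was a delete. (Invented for illustration; not K4's format.)
type walRecord struct {
	key, value string
	tombstone  bool
}

// replayWAL rebuilds the memtable by applying records in order,
// so the latest write or delete for each key wins.
func replayWAL(records []walRecord) map[string]string {
	memtable := make(map[string]string)
	for _, r := range records {
		if r.tombstone {
			delete(memtable, r.key)
		} else {
			memtable[r.key] = r.value
		}
	}
	return memtable
}

func main() {
	wal := []walRecord{
		{key: "a", value: "1"},
		{key: "b", value: "2"},
		{key: "a", value: "3"},    // overwrite of "a"
		{key: "b", tombstone: true}, // delete of "b"
	}
	mem := replayWAL(wal)
	fmt.Println(mem["a"], len(mem))
}
```

With truncation on flush, replay only ever covers the records since the last flush, which is exactly why the WAL stays small.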
I found this on Reddit but I decided to post here because my comment got too long.
Disclaimer: I'm making https://github.com/fjall-rs/lsm-tree, so I have some experience with LSM-trees.
I don't really know Go well, so some of this may be wrong, but reading through a bit of the code there are some things that seem questionable to me:
- The Cuckoo filter seems to return null if it returns false... but Cuckoo filters are still a probabilistic data structure, or are you making sure it has an FPR of 0.0? Otherwise it may return a wrong result.
- How much memory does the Cuckoo filter use, considering you have to index every key?
- Having to call RecoverFromWAL manually is a recipe for disaster. The readme says "If you have a populated WAL file in the data directory but no data files aka sstables", but I don't understand that. What's the connection between having no SSTables and recovering a WAL? You always recover a WAL to restore the memtable(s), no matter if you have SSTables or not.