[WIP] Delete duplicated key/value pairs recursively #19

lixiaoy1 · 2017-07-12T08:20:08Z

This is to implement the idea: http://pad.ceph.com/p/rocksdb-wal-improvement
Add a new flush style called kFlushStyleDedup which users can config by setting
flush_style=kFlushStyleDedup and write_buffer_number_to_flush. Flush
is triggered while number of non-flushed memtables is >= min_write_buffer_number_to_merge.
When flushing memtables to L0, it merges write_buffer_number_to_flush memtables
and deduping against other non-flushed memtables.

With bluestore fio plugin to do 4k random write test, I compared the two stats
db.get.micros and db.write.micros from rocksdb every ten minutes. db.get.micros
is the average of get operations, and db.write.micros is the average time of put/delete
functions. From the result, rocksdb read performance can be improved up to 38%,
and write performance can be improved to 15%. More detailed info please see the excel:
https://drive.google.com/drive/folders/0B6jqFc7e2yxVdUQ2aEpCR3ItbG8

And to complement, the io perfermance of bluestore fio doesn't improve much. From time of
txc states, the time spends a lot in kv_state_commiting_lat (up to 80%), but submit_transaction and
submit_transaction_sync takes little time compared.

Signed-off-by: Xiaoyan Li [email protected]

tchaikov · 2017-07-12T08:29:36Z

will this change go to upstream?

lixiaoy1 · 2017-07-12T08:36:08Z

As this idea is from Sage http://pad.ceph.com/p/rocksdb-wal-improvement, at first hope Sage or you can review the changes.

lixiaoy1 · 2017-07-12T08:37:39Z

@liewegas Please help to review.

liewegas · 2017-07-12T11:47:31Z

db/column_family.h

@@ -236,6 +236,9 @@ class ColumnFamilyData {
  MemTable* mem() { return mem_; }
  Version* current() { return current_; }
  Version* dummy_versions() { return dummy_versions_; }
+  void set_stop() { stop = true; }
+  bool is_stop() { return stop; }


why is the stop flag needed?

It is used to decide whether choosing all immutable tables to flush when MemTableList::PickMemtablesToFlush. In the new flush style, if the db process is going to shutdown, it needs to flush all existed immutable memtables. https://github.com/ceph/rocksdb/pull/19/files#diff-f8dc8f7b1ea2cb77b0eb9b156bc434f2R133

liewegas · 2017-07-12T11:47:51Z

db/builder.cc

  assert((column_family_id ==
          TablePropertiesCollectorFactory::Context::kUnknownColumnFamily) ==
         column_family_name.empty());
  // Reports the IOStats for flush for every following bytes.
  const size_t kReportFlushIOStatsEvery = 1048576;
  Status s;
+  uint64_t num_duplicated = 0, num_total=0;


style: spaces around =

Will update it later.

liewegas · 2017-07-12T11:47:58Z

db/builder.cc

@@ -138,17 +140,52 @@ Status BuildTable(
        &snapshots, earliest_write_conflict_snapshot, env,
        true /* internal key corruption is not ok */, range_del_agg.get());
    c_iter.SeekToFirst();
+
+    ParsedInternalKey c_ikey, comp_ikey;
+    if(compare_iter != nullptr)


style: space between if and (

Will update it later.

liewegas · 2017-07-12T11:48:14Z

db/builder.cc

+            break;
+          }
+          else if(result > 0)
+            compare_iter->Next();


style: braces around single statement blocks

Will update it later.

liewegas · 2017-07-12T11:56:47Z

Yay!

This looks reasonable to me, but I have pretty limited rocksdb experience so it really needs to be looked at by those folks. I'd fix the style issues before sending it there since they'll undoubtable complain about it. (Also, the automated github pull request checks include a style check.)

I don't understand why the new set_stop/is_stop checks are needed; what makes this change need it?

A description of the approach and new flag somewhere is probably needed. It looks like you are always flushing a single memtable at a time and always deduping against all others. We might need a tunable there? (In contrast, the current behavior is to flush all memtables? That doesn't seem quite right as there are a few tunables controlling the how many to flush at a time... sorry, I'm not very familiar with this code.)

The expected savings is probably most evident in the overall write-amp and amount of data going into level0. Did you happen to look at this metrics? @markhpc might know which values to look for.

Thanks!

lixiaoy1 · 2017-07-13T06:06:48Z

The current behavior is in following scenarios it begins to flush:

When the number of non-flushed immutable memtables is same or bigger than min_write_buffer_number_to_merge.
https://github.com/ceph/rocksdb/blob/master/db/memtable_list.cc#L273
https://github.com/ceph/rocksdb/blob/master/db/db_impl_compaction_flush.cc#L1094
Other scenario, like shutdown etc. It needs to flush all memtables.

In this branch, it changes scenario 1. You are right. In this patch, it flushes a single memtable at a time and always deduping against other min_write_buffer_number_to_merge-1 memtables. I agree that we can make it more flexible: use min_write_buffer_number_to_merge to indicate how many memtables to flush, and add another options to specify how many other memtables to compare/dedup. And sum of these two numbers should be <= max_write_buffer_number.

The branch decreases the data flushed to L0. When min_write_buffer_number_to_merge is set 2, the following are summary of L0 sst files every 10 minutes from my tests:
Log in this branch:
file number | file size(MB)
8 | 25.86
7 | 291.6
5 | 282.9
5 | 311.68
4 | 260.32
4 | 264.43
4 | 266.39
4 | 267.35
0 | 0
4 | 263.79

Master branch:
file number | file size(MB)
4 | 179
2 | 301.83
2 | 346.32
2 | 346.32
0 | 0
2 | 362.69
2 | 365.34
2 | 365.88
2 | 364.02
2 | 361.99
2 | 359.07

markhpc · 2017-07-13T15:41:01Z

Awesome work Li! I'm really excited to take a look at this.

liuchang0812 · 2017-07-14T14:00:26Z

awesome work. actually the minor-compaction which called kFlushStyleDedup is similar with L0 sstable's compaction. we could set options.level0_file_num_compaction_trigger = 1; to get similar result.

lixiaoy1 · 2017-07-14T14:06:02Z

level0_file_num_compaction_trigger=1 means to trigger L0->L1 compaction when number of l0 sst files is 1. I think their mechanism are different. Meanwhile if you set level0_file_num_compaction_trigger as 1, more duplicated data will put into l2.

lixiaoy1 · 2017-07-14T14:07:53Z

@liewegas @markhpc and other reviewers, I will be out of office next week(15th - 23th July), please give your comments. And I will respond after I come back.

liuchang0812 · 2017-07-14T14:10:40Z

@lixiaoy1 oh....sorry, level0_file_num_compaction_trigger=2

lixiaoy1 · 2017-07-14T14:19:57Z

@liuchang0812 I think you mean min_write_buffer_number_to_merge. When setting min_write_buffer_number_to_merge as 2, the test result on this branch is better. You may have a look at
https://drive.google.com/drive/folders/0B6jqFc7e2yxVdUQ2aEpCR3ItbG8
If you looked into whole io path, they are almost same.

lixiaoy1 · 2017-08-08T03:49:17Z

@markhpc Could you please have a look?

This is to implement the idea: http://pad.ceph.com/p/rocksdb-wal-improvement Add a new flush style called kFlushStyleDedup which users can config by setting flush_style=kFlushStyleDedup and write_buffer_number_to_flush. Flush is triggered while number of non-flushed memtables is >= min_write_buffer_number_to_merge. When flushing memtables to L0, it merges write_buffer_number_to_flush memtables and deduping against other non-flushed memtables. With bluestore fio plugin to do 4k random write test, I compared the two stats db.get.micros and db.write.micros from rocksdb every ten minutes. db.get.micros is the average of get operations, and db.write.micros is the average time of put/delete functions. From the result, rocksdb read performance can be improved up to 38%, and write performance can be improved to 15%. More detailed info please see the excel: https://drive.google.com/drive/folders/0B6jqFc7e2yxVdUQ2aEpCR3ItbG8 And to complement, the io perfermance of bluestore fio doesn't improve much. From time of txc states, the time spends a lot in kv_state_commiting_lat (up to 80%), but submit_transaction and submit_transaction_sync takes little time compared. Signed-off-by: Xiaoyan Li [email protected]

lixiaoy1 · 2017-09-04T02:43:57Z

This pr is based on Rocksdb@e15382c09c87a65eaeca9bda233bab503f1e5772, I submitted a new PR to facebook/Rocksdb facebook#2829, which is based on Rocksdb master branch.

markhpc · 2017-09-22T17:27:22Z

Hi Li,

Sorry, I haven't gotten a chance to test this yet. Have you gotten any feedback from the rocksdb folks?

lixiaoy1 · 2017-09-25T00:46:18Z

@markhpc currently Andrew Kryczka from Facebook asked some questions. What I am doing now is that I get the kv operation sequences out from BlueStore when running fio+librbd, and use the kv operation seqs to feed into Rocksdb, and check how Rocksdb is improved. lixiaoy1/ceph@e6a4e8d

markhpc · 2017-10-11T14:41:53Z

Hi Li,

I'm still swamped, but I wanted to mention that we're going to hopefully look at this soon (either me or someone else). Any update from the rocksdb folks?

Mark

lixiaoy1 · 2017-10-12T00:21:05Z

Hi Mark, I have some discussion about the PR with Facebook people. It is still under review. And another point is that Facebook people raised the following idea which I think also can achieve same objective: Repeatedly compact the oldest two immutable memtables into one larger immutable memtable. We'll flush the compacted memtable into an L0 file only once it exceeds some size. Anyway I think both of above solution can benefit BlueStore. Best wishes Lisa From: Mark Nelson [mailto:[email protected]] Sent: Wednesday, October 11, 2017 10:42 PM To: ceph/rocksdb <[email protected]> Cc: Li, Xiaoyan <[email protected]>; Mention <[email protected]> Subject: Re: [ceph/rocksdb] [WIP] Delete duplicated key/value pairs recursively (#19) Hi Li, I'm still swamped, but I wanted to mention that we're going to hopefully look at this soon (either me or someone else). Any update from the rocksdb folks? Mark — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub<#19 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AM1sAUICifAcsMCK7NiNPvawQemttHZHks5srNO4gaJpZM4OVSCp>.

liewegas reviewed Jul 12, 2017

View reviewed changes

lixiaoy1 force-pushed the dedup branch from ee3a97c to 8d72fa9 Compare July 14, 2017 13:42

lixiaoy1 force-pushed the dedup branch from 8d72fa9 to 1297ff8 Compare July 14, 2017 14:02

lixiaoy1 force-pushed the dedup branch from 1297ff8 to 60971cd Compare July 14, 2017 14:14

lixiaoy1 force-pushed the dedup branch 2 times, most recently from dad3d2a to bf29157 Compare August 22, 2017 01:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Delete duplicated key/value pairs recursively #19

[WIP] Delete duplicated key/value pairs recursively #19

lixiaoy1 commented Jul 12, 2017 •

edited

Loading

tchaikov commented Jul 12, 2017

lixiaoy1 commented Jul 12, 2017

lixiaoy1 commented Jul 12, 2017

liewegas Jul 12, 2017

lixiaoy1 Jul 13, 2017

liewegas Jul 12, 2017

lixiaoy1 Jul 13, 2017

liewegas Jul 12, 2017

lixiaoy1 Jul 13, 2017

liewegas Jul 12, 2017

lixiaoy1 Jul 13, 2017

liewegas commented Jul 12, 2017

lixiaoy1 commented Jul 13, 2017

markhpc commented Jul 13, 2017

liuchang0812 commented Jul 14, 2017

lixiaoy1 commented Jul 14, 2017

lixiaoy1 commented Jul 14, 2017

liuchang0812 commented Jul 14, 2017

lixiaoy1 commented Jul 14, 2017

lixiaoy1 commented Aug 8, 2017

lixiaoy1 commented Sep 4, 2017

markhpc commented Sep 22, 2017

lixiaoy1 commented Sep 25, 2017

markhpc commented Oct 11, 2017

lixiaoy1 commented Oct 12, 2017 via email

[WIP] Delete duplicated key/value pairs recursively #19

Are you sure you want to change the base?

[WIP] Delete duplicated key/value pairs recursively #19

Conversation

lixiaoy1 commented Jul 12, 2017 • edited Loading

tchaikov commented Jul 12, 2017

lixiaoy1 commented Jul 12, 2017

lixiaoy1 commented Jul 12, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

liewegas commented Jul 12, 2017

lixiaoy1 commented Jul 13, 2017

markhpc commented Jul 13, 2017

liuchang0812 commented Jul 14, 2017

lixiaoy1 commented Jul 14, 2017

lixiaoy1 commented Jul 14, 2017

liuchang0812 commented Jul 14, 2017

lixiaoy1 commented Jul 14, 2017

lixiaoy1 commented Aug 8, 2017

lixiaoy1 commented Sep 4, 2017

markhpc commented Sep 22, 2017

lixiaoy1 commented Sep 25, 2017

markhpc commented Oct 11, 2017

lixiaoy1 commented Oct 12, 2017 via email

lixiaoy1 commented Jul 12, 2017 •

edited

Loading