Improve rebuild logic #5
Comments
IMO this is fairly essential. Under the current scheme, the backend data usage is multiplied by the number of rebuild operations that have been carried out, plus one for the initial write. So in the case of an initial backend configuration with some data stored, replacing a single backend and rebuilding means doubling the data usage in all of the backends that didn't get replaced, since a duplicate of all data is written to them again. |
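To make the arithmetic above concrete, here is a tiny sketch. The function name and parameters are made up for illustration and are not part of the project; it only encodes the "number of rebuilds, plus one for the initial write" rule described in this comment.

```rust
/// Bytes used on a backend that was never replaced, under the scheme
/// described above: one copy from the initial write plus one duplicate
/// per rebuild operation.
/// `shard_bytes` is the space its shards took after the initial write.
/// Hypothetical helper, for illustration only.
fn usage_after_rebuilds(shard_bytes: u64, rebuilds: u64) -> u64 {
    shard_bytes * (rebuilds + 1)
}
```

With a single rebuild, `usage_after_rebuilds(n, 1)` is `2 * n`, which is the doubling mentioned above.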
With this config:
minimal_shards = 4
expected_shards = 6
I got this:
2024-12-09 16:30:40 +07:00: INFO Shard 0 is DIFFERENT
2024-12-09 16:30:40 +07:00: INFO Shard 1 is DIFFERENT
2024-12-09 16:30:40 +07:00: INFO Shard 2 is DIFFERENT
2024-12-09 16:30:40 +07:00: INFO Shard 3 is DIFFERENT
...
.....
.....
2024-12-09 16:30:54 +07:00: INFO Shard 0 is the SAME
2024-12-09 16:30:54 +07:00: INFO Shard 1 is DIFFERENT
2024-12-09 16:30:54 +07:00: INFO Shard 2 is DIFFERENT
2024-12-09 16:30:54 +07:00: INFO Shard 3 is DIFFERENT
.....
2024-12-09 16:30:54 +07:00: INFO Rebuild file from 127.0.0.1:9903,127.0.0.1:9907,127.0.0.1:9906,127.0.0.1:9902,127.0.0.1:9908,127.0.0.1:9909 to 127.0.0.1:9902,127.0.0.1:9904,127.0.0.1:9908,127.0.0.1:9903,127.0.0.1:9905,127.0.0.1:9909
...
....
2024-12-09 16:31:17 +07:00: INFO Shard 0 is the SAME
2024-12-09 16:31:17 +07:00: INFO Shard 1 is the SAME
2024-12-09 16:31:17 +07:00: INFO Shard 2 is the SAME
2024-12-09 16:31:17 +07:00: INFO Shard 3 is the SAME
2024-12-09 16:31:17 +07:00: INFO Shard 4 is the SAME
2024-12-09 16:31:17 +07:00: INFO Shard 5 is the SAME
...
2024-12-09 16:31:17 +07:00: INFO Rebuild file from 127.0.0.1:9908,127.0.0.1:9904,127.0.0.1:9903,127.0.0.1:9905,127.0.0.1:9902,127.0.0.1:9909 to 127.0.0.1:9905,127.0.0.1:9908,127.0.0.1:9909,127.0.0.1:9902,127.0.0.1:9904,127.0.0.1:9903
I'm not sure about the last one though: why are there 6 shards there? What we can see from there:
|
Please forget my previous comment. So, what we need to do is make sure each shard is put back on the same zdb. |
PR #142 has working code for this: it doesn't re-write shards that are not broken or missing. |
Great, I'll take it for a test drive asap. |
I did a build of the dedup branch at Here's
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| backend | reachable | objects | used space | free space | usage percentage |
+=================================================================+===========+=========+============+============+==================+
| [300:2589:76ee:f44e:a3ad:36d0:8f3c:f297]:9900 - 504-59257-meta1 | Yes | 32 | 15960 | 1073741824 | 0 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [300:c282:7f31:4aa4:2730:ddff:a5ab:ba2e]:9900 - 504-59258-meta2 | Yes | 32 | 15960 | 1073741824 | 0 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [302:1eb4:b62c:8ba6:7310:3f1c:54b3:846b]:9900 - 504-59256-meta3 | Yes | 32 | 15960 | 1073741824 | 0 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [300:9c4a:29ac:6e0b:6cf6:881f:f371:b9cd]:9900 - 504-59260-meta5 | Yes | 32 | 15960 | 1073741824 | 0 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| backend | reachable | objects | used space | free space | usage percentage |
+=================================================================+===========+=========+============+============+==================+
| [300:2589:76ee:f44e:a3ad:36d0:8f3c:f297]:9900 - 504-59257-data1 | Yes | 271 | 502878178 | 1073741824 | 46 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [300:c282:7f31:4aa4:2730:ddff:a5ab:ba2e]:9900 - 504-59258-data2 | Yes | 271 | 502878178 | 1073741824 | 46 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [302:1eb4:b62c:8ba6:7310:3f1c:54b3:846b]:9900 - 504-59256-data3 | Yes | 271 | 502878178 | 1073741824 | 46 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [300:9c4a:29ac:6e0b:6cf6:881f:f371:b9cd]:9900 - 504-59260-data5 | Yes | 271 | 502878178 | 1073741824 | 46 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
After replacing one backend and letting the rebuild finish:
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| backend | reachable | objects | used space | free space | usage percentage |
+=================================================================+===========+=========+============+============+==================+
| [300:2589:76ee:f44e:a3ad:36d0:8f3c:f297]:9900 - 504-59257-meta1 | Yes | 36 | 17520 | 1073741824 | 0 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [300:c282:7f31:4aa4:2730:ddff:a5ab:ba2e]:9900 - 504-59258-meta2 | Yes | 36 | 17520 | 1073741824 | 0 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [300:9c4a:29ac:6e0b:6cf6:881f:f371:b9cd]:9900 - 504-59260-meta5 | Yes | 36 | 17520 | 1073741824 | 0 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [300:9fab:2cda:749d:aa8e:43c:aeee:dfef]:9900 - 504-59261-meta7 | Yes | 36 | 17520 | 1073741824 | 0 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| backend | reachable | objects | used space | free space | usage percentage |
+=================================================================+===========+=========+============+============+==================+
| [300:2589:76ee:f44e:a3ad:36d0:8f3c:f297]:9900 - 504-59257-data1 | Yes | 286 | 524878754 | 1073741824 | 48 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [300:c282:7f31:4aa4:2730:ddff:a5ab:ba2e]:9900 - 504-59258-data2 | Yes | 286 | 524878754 | 1073741824 | 48 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [300:9c4a:29ac:6e0b:6cf6:881f:f371:b9cd]:9900 - 504-59260-data5 | Yes | 286 | 524878754 | 1073741824 | 48 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [300:9fab:2cda:749d:aa8e:43c:aeee:dfef]:9900 - 504-59261-data7 | Yes | 271 | 524733279 | 1073741824 | 48 |
+-----------------------------------------------------------------+-----------+---------+------------+------------+------------------+
So this looks way better. There are still a few extra objects that have been stored in the original backends though. I'll attach a log file in case it can help deduce why. |
Interesting. I've checked the logs and found nothing suspicious. I think I'll re-test and add more logs. |
I've checked the logs (even created a Go script for this) and found nothing wrong with it in the This section is where we rebuild the shard this kind of log (split into several lines for readability)
1214:2024-12-14 00:40:20 +00:00: INFO Rebuild file from
[302:1eb4:b62c:8ba6:7310:3f1c:54b3:846b]:9900,[300:c282:7f31:4aa4:2730:ddff:a5ab:ba2e]:9900,[300:9c4a:29ac:6e0b:6cf6:881f:f371:b9cd]:9900,[300:2589:76ee:f44e:a3ad:36d0:8f3c:f297]:9900
to
[300:9fab:2cda:749d:aa8e:43c:aeee:dfef]:9900,[300:c282:7f31:4aa4:2730:ddff:a5ab:ba2e]:9900,[300:9c4a:29ac:6e0b:6cf6:881f:f371:b9cd]:9900,[300:2589:76ee:f44e:a3ad:36d0:8f3c:f297]:9900
See the order of the backends there. |
One thing I notice that differs from my test: you use the same backends for both meta and data, while I used a different one for the backend that I replaced. But I don't see how that relates to this issue. |
Unfortunately I still can't reproduce it. @scottyeager |
I'm testing in the context of qsfs, so the store operations are being triggered automatically. It's possible that the rotate timer in zdb expired during my test and another store was triggered. I'll try again and make sure to rule that out. Just looking at the status output though, I didn't suspect an additional store as the root of the discrepancy. The new backend has 271 objects, just like all of the original backends did before the rebuild operation. If more data was stored, I'd expect all backends to have a larger number of objects. |
I did another test, making absolutely sure that no extra data was added after changing the backends config. The result is a bit different this time. Before rebuild:
+--------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| backend | reachable | objects | used space | free space | usage percentage |
+====================================================================+===========+=========+============+============+==================+
| [304:b61f:4a88:ab78:72ec:9db3:92e3:c093]:9900 - 5545-728559-meta10 | Yes | 7 | 2921 | 1073741824 | 0 |
+--------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [300:24:6b96:f25f:6734:75e4:13:2f59]:9900 - 5545-728556-meta11 | Yes | 7 | 2921 | 1073741824 | 0 |
+--------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [302:1d81:cef8:3049:b337:7aa0:19c7:e8ea]:9900 - 5545-728557-meta13 | Yes | 7 | 2921 | 1073741824 | 0 |
+--------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [302:1297:52fe:57ba:e03d:e9da:161f:777]:9900 - 5545-728558-meta8 | Yes | 7 | 2921 | 1073741824 | 0 |
+--------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
+--------------------------------------------------------------------+-----------+---------+------------+-------------+------------------+
| backend | reachable | objects | used space | free space | usage percentage |
+====================================================================+===========+=========+============+=============+==================+
| [304:b61f:4a88:ab78:72ec:9db3:92e3:c093]:9900 - 5545-728559-data10 | Yes | 31 | 52466633 | 10737418240 | 0 |
+--------------------------------------------------------------------+-----------+---------+------------+-------------+------------------+
| [300:24:6b96:f25f:6734:75e4:13:2f59]:9900 - 5545-728556-data11 | Yes | 31 | 52466633 | 10737418240 | 0 |
+--------------------------------------------------------------------+-----------+---------+------------+-------------+------------------+
| [302:1d81:cef8:3049:b337:7aa0:19c7:e8ea]:9900 - 5545-728557-data13 | Yes | 31 | 52466633 | 10737418240 | 0 |
+--------------------------------------------------------------------+-----------+---------+------------+-------------+------------------+
| [302:1297:52fe:57ba:e03d:e9da:161f:777]:9900 - 5545-728558-data8 | Yes | 31 | 52466633 | 10737418240 | 0 |
+--------------------------------------------------------------------+-----------+---------+------------+-------------+------------------+
And after rebuild:
+--------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| backend | reachable | objects | used space | free space | usage percentage |
+====================================================================+===========+=========+============+============+==================+
| [304:b61f:4a88:ab78:72ec:9db3:92e3:c093]:9900 - 5545-728559-meta10 | Yes | 7 | 2921 | 1073741824 | 0 |
+--------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [300:24:6b96:f25f:6734:75e4:13:2f59]:9900 - 5545-728556-meta11 | Yes | 7 | 2921 | 1073741824 | 0 |
+--------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [300:f37f:4d04:af80:4810:3b53:bd09:a4c1]:9900 - 5545-728562-meta24 | Yes | 7 | 2921 | 1073741824 | 0 |
+--------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [302:1297:52fe:57ba:e03d:e9da:161f:777]:9900 - 5545-728558-meta8 | Yes | 7 | 2921 | 1073741824 | 0 |
+--------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
+--------------------------------------------------------------------+-----------+---------+------------+-------------+------------------+
| backend | reachable | objects | used space | free space | usage percentage |
+====================================================================+===========+=========+============+=============+==================+
| [304:b61f:4a88:ab78:72ec:9db3:92e3:c093]:9900 - 5545-728559-data10 | Yes | 31 | 52466633 | 10737418240 | 0 |
+--------------------------------------------------------------------+-----------+---------+------------+-------------+------------------+
| [300:24:6b96:f25f:6734:75e4:13:2f59]:9900 - 5545-728556-data11 | Yes | 31 | 52466633 | 10737418240 | 0 |
+--------------------------------------------------------------------+-----------+---------+------------+-------------+------------------+
| [300:f37f:4d04:af80:4810:3b53:bd09:a4c1]:9900 - 5545-728562-data24 | Yes | 31 | 52466633 | 10737418240 | 0 |
+--------------------------------------------------------------------+-----------+---------+------------+-------------+------------------+
| [302:1297:52fe:57ba:e03d:e9da:161f:777]:9900 - 5545-728558-data8 | Yes | 47 | 85973382 | 10737418240 | 0 |
+--------------------------------------------------------------------+-----------+---------+------------+-------------+------------------+
So this time the metadata objects are constant, but there are still extra objects stored in one of the data backends that wasn't replaced. Logs for this run are below. |
OK, I found the issue. The new logic is working as expected, but there is something else: a backend that was supposed to be healthy failed during the check, but was up again while the data was being restored. The dead backend is 302:1d81:cef8:3049:b337:7aa0:19c7:e8ea, so we expect all errors to come from this backend only, not from the others. This is the normal log, when everything goes as expected:
2024-12-18 02:23:48 +00:00: WARN could not download shard 3: error during storage: ZDB at [302:1d81:cef8:3049:b337:7aa0:19c7:e8ea]:9900 5545-728557-data13, error operation READ caused by Namespace: not found
......
2024-12-18 02:23:49 +00:00: INFO Rebuild file from
[302:1297:52fe:57ba:e03d:e9da:161f:777]:9900,[304:b61f:4a88:ab78:72ec:9db3:92e3:c093]:9900,[300:24:6b96:f25f:6734:75e4:13:2f59]:9900,[302:1d81:cef8:3049:b337:7aa0:19c7:e8ea]:9900 to
[302:1297:52fe:57ba:e03d:e9da:161f:777]:9900,[304:b61f:4a88:ab78:72ec:9db3:92e3:c093]:9900,[300:24:6b96:f25f:6734:75e4:13:2f59]:9900,[300:f37f:4d04:af80:4810:3b53:bd09:a4c1]:9900
These are the troubling logs:
2024-12-18 02:26:17 +00:00: WARN could not download shard 0: error during storage: ZDB at [302:1297:52fe:57ba:e03d:e9da:161f:777]:9900 5545-728558-data8, error operation READ caused by timeout
2024-12-18 02:26:18 +00:00: WARN could not download shard 3: error during storage: ZDB at [302:1d81:cef8:3049:b337:7aa0:19c7:e8ea]:9900 5545-728557-data13, error operation READ caused by Namespace: not found
......
2024-12-18 02:28:03 +00:00: INFO Rebuild file from
[302:1297:52fe:57ba:e03d:e9da:161f:777]:9900,[300:24:6b96:f25f:6734:75e4:13:2f59]:9900,[304:b61f:4a88:ab78:72ec:9db3:92e3:c093]:9900,[302:1d81:cef8:3049:b337:7aa0:19c7:e8ea]:9900 to
[300:f37f:4d04:af80:4810:3b53:bd09:a4c1]:9900,[300:24:6b96:f25f:6734:75e4:13:2f59]:9900,[304:b61f:4a88:ab78:72ec:9db3:92e3:c093]:9900,[302:1297:52fe:57ba:e03d:e9da:161f:777]:9900
There are some things I can think of right now:
|
I'm not sure how 0-db in sequential mode behaves in this situation, whether the number of objects will increase or not. |
Indeed, this second test was done with some connectivity trouble to the backends, which I didn't notice in the first test. |
I'm thinking of more heuristics for this:
Let me know if you have more ideas. |
I agree that there needs to be something smarter for deciding if a backend is dead or not, instead of the binary dead/alive we have now. I'd suggest treating connection failures as something like "if we can't reach the backend for X amount of time, consider it to be down" rather than immediately assuming it is down (and of course, like you said, if the namespace is not found it's immediately down). Whether or not the backend is in the config does not matter for the heuristic, since we are considering the read operations here. The config is only relevant for store operations, so the system knows which dbs it can use. It is perfectly fine to remove a full namespace/db from the config but keep the data there for years. |
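The heuristic suggested above could be sketched roughly like this. It is only an illustration under assumptions: the type names, the three-state `Health` enum, and the way the grace period is tracked are invented here and are not taken from the zstor code. A timeout only marks the backend as suspect until the grace period (the "X amount of time") has elapsed, while a missing namespace marks it down at once.

```rust
use std::time::{Duration, Instant};

/// Hypothetical error kinds, mirroring the two failures seen in the logs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReadError {
    Timeout,
    NamespaceNotFound,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Health {
    Up,
    Suspect, // unreachable, but not for long enough to give up on it
    Down,
}

/// Per-backend tracker (illustrative, not the project's actual type).
struct BackendHealth {
    grace: Duration,
    first_failure: Option<Instant>,
    namespace_gone: bool,
}

impl BackendHealth {
    fn new(grace: Duration) -> Self {
        Self { grace, first_failure: None, namespace_gone: false }
    }

    /// A successful read clears any pending connectivity failure.
    fn record_success(&mut self) {
        self.first_failure = None;
    }

    fn record_error(&mut self, err: ReadError, now: Instant) {
        match err {
            // A missing namespace is treated as immediately down.
            ReadError::NamespaceNotFound => self.namespace_gone = true,
            // Remember only the start of the current run of failures.
            ReadError::Timeout => {
                self.first_failure.get_or_insert(now);
            }
        }
    }

    fn status(&self, now: Instant) -> Health {
        if self.namespace_gone {
            return Health::Down;
        }
        match self.first_failure {
            None => Health::Up,
            Some(t) if now.duration_since(t) >= self.grace => Health::Down,
            Some(_) => Health::Suspect,
        }
    }
}
```

With something like this, the rebuild would only move a shard away from a backend whose status is `Down`, and would leave `Suspect` backends alone until the grace period runs out.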
I'm not 100% sure on this, but my own brief test suggests that attempting to So I think this is a good solution, if we can try to put the shard back on the same zdb at the same key in this case where retrieving the shard failed. |
Checked the code and discussed with Lee. For now, I'll try to improve it by retrying the |
^ implemented on #151 |
Hi @scottyeager. When testing again, please look for this log line: debug!("timeout error on attempt {}, retrying", attempt + 1); If it does not appear, it means the new handler was not triggered. |
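For readers following along, a retry-on-timeout handler of the kind mentioned above might look roughly like the sketch below. This is not the code from #151: the function name, the `ReadError` type, and the `max_attempts` parameter are assumptions for illustration, and `eprintln!` stands in for the `debug!` macro so the example needs only the standard library. Only the log message itself is taken from the comment above.

```rust
/// Hypothetical error kinds, mirroring the two failures seen in the logs.
#[derive(Debug, PartialEq, Eq)]
enum ReadError {
    Timeout,
    NamespaceNotFound,
}

/// Calls `read` until it succeeds, fails with a non-timeout error, or
/// `max_attempts` reads have timed out. At least one read is always made.
fn read_with_retry<T>(
    max_attempts: u32,
    mut read: impl FnMut() -> Result<T, ReadError>,
) -> Result<T, ReadError> {
    let mut attempt = 0;
    loop {
        match read() {
            // Only timeouts are retried, and only while attempts remain.
            Err(ReadError::Timeout) if attempt + 1 < max_attempts => {
                eprintln!("timeout error on attempt {}, retrying", attempt + 1);
                attempt += 1;
            }
            // Success, a non-retryable error, or the final timeout.
            other => return other,
        }
    }
}
```

A "Namespace: not found" style error is returned right away, since retrying cannot help there; a timeout is reported only after the last attempt.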
Thanks for the updates @iwanbk. I will check about this when I come back to testing QSFS next week. |
The current rebuild logic is fairly simple: retrieve the data, re-encode it, and send it to the new backends. However, we can check whether any of the new backends is also used in the old metadata. If it is, we can assign it the same shard, which eliminates the write to that backend and saves some space.
Need to check whether the encoding is deterministic for this, especially for parity shards.
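The idea above can be sketched as a small planning step. This is only an illustration: the function and parameter names are invented, backends are plain strings, and it assumes that a healthy old backend is still usable and that re-encoding yields the same shards (the open question noted above). Shards on healthy backends stay where they are, and only the remaining shard indexes get a new backend and a write.

```rust
use std::collections::HashSet;

/// Plans a rebuild that keeps shards in place where possible.
/// `old[i]` is the backend that held shard `i`, `healthy` lists the old
/// backends whose shard is still readable, and `candidates` are the
/// backends in the new config.
/// Returns the new placement plus the shard indexes that must be
/// written, or `None` if there are not enough spare backends.
/// Hypothetical helper, for illustration only.
fn plan_rebuild<'a>(
    old: &[&'a str],
    healthy: &[&str],
    candidates: &[&'a str],
) -> Option<(Vec<&'a str>, Vec<usize>)> {
    let healthy: HashSet<&str> = healthy.iter().copied().collect();
    // Old backends whose shard is intact keep that shard.
    let kept: HashSet<&str> = old
        .iter()
        .copied()
        .filter(|b| healthy.contains(*b))
        .collect();
    // New-config backends that do not already hold a shard of this object.
    let mut spare = candidates.iter().copied().filter(|c| !kept.contains(*c));

    let mut placement = Vec::with_capacity(old.len());
    let mut to_write = Vec::new();
    for (i, &backend) in old.iter().enumerate() {
        if kept.contains(backend) {
            placement.push(backend); // same shard, same backend, no write
        } else {
            placement.push(spare.next()?); // shard moves to a spare backend
            to_write.push(i);
        }
    }
    Some((placement, to_write))
}
```

For a 4-shard object where only the third backend is gone, this plan writes exactly one shard, instead of re-sending all four as the current logic does.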