fix/TestEvacuateShard test #2868

Merged: 2 commits into master from fix/evacuate-test, Jun 13, 2024
Conversation

carpawell (Member):

No description provided.

@carpawell self-assigned this on Jun 11, 2024

codecov bot commented Jun 11, 2024

Codecov Report

Attention: Patch coverage is 44.44444% with 5 lines in your changes missing coverage. Please review.

Project coverage is 23.64%. Comparing base (e5da072) to head (9680cd5).
Report is 4 commits behind head on master.

Files                                     Patch %   Lines
pkg/local_object_storage/engine/put.go    37.50%    5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2868      +/-   ##
==========================================
+ Coverage   23.56%   23.64%   +0.07%     
==========================================
  Files         773      770       -3     
  Lines       44672    44569     -103     
==========================================
+ Hits        10529    10537       +8     
+ Misses      33290    33178     -112     
- Partials      853      854       +1     

☔ View full report in Codecov by Sentry.

@carpawell force-pushed the fix/evacuate-test branch 6 times, most recently from 7cfd17f to 7040f49 (June 12, 2024 17:11)
@carpawell changed the title from "debug evacuate test" to "fix/TestEvacuateShard test" (Jun 12, 2024)
@carpawell marked this pull request as ready for review (June 12, 2024 17:12)
carpawell (Member, Author) commented Jun 12, 2024

Disgusting debugging.

On the other hand, this (WithNonblocking) does not look good to me and could be changed. It may be the reason for the objects-on-wrong-shards effects, and it may also lead to an error when all shards are busy (I would prefer to wait for them and fail on a deadline). @roman-khimov, @cthulhu-rider
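
For illustration, a minimal sketch of the difference between non-blocking and blocking submission with the ants/v2 pool mentioned in this thread; the 1-sized pools and the sleeping tasks are arbitrary and not taken from the engine code.

package main

import (
    "fmt"
    "time"

    "github.com/panjf2000/ants/v2"
)

func main() {
    // Non-blocking pool of size 1: while the first task still occupies the
    // only worker, the second Submit fails immediately with ants.ErrPoolOverload.
    nb, _ := ants.NewPool(1, ants.WithNonblocking(true))
    defer nb.Release()

    _ = nb.Submit(func() { time.Sleep(100 * time.Millisecond) })
    if err := nb.Submit(func() {}); err != nil {
        fmt.Println("non-blocking pool rejected the task:", err)
    }

    // Blocking pool of size 1 (the default mode): the second Submit waits for
    // the worker to become free instead of returning an error.
    bl, _ := ants.NewPool(1)
    defer bl.Release()

    _ = bl.Submit(func() { time.Sleep(100 * time.Millisecond) })
    if err := bl.Submit(func() {}); err == nil {
        fmt.Println("blocking pool waited for a free worker")
    }
}

With a blocking pool, the "all shards are busy" case would make the caller wait (bounded, for example, by a deadline on the caller's side) instead of silently redirecting the object to another shard.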

@carpawell force-pushed the fix/evacuate-test branch 2 times, most recently from 161489b to 48452c9 (June 12, 2024 17:21)
@@ -163,6 +172,7 @@ func (e *StorageEngine) putToShard(sh hashedShard, ind int, pool util.WorkerPool
 
 		putSuccess = true
 	}); err != nil {
+		e.log.Warn("object put: pool task submitting", zap.Error(err))
Contributor:

How does this affect the system, and/or what is the admin's expected reaction to this message?

carpawell (Member, Author):

It means the load on this shard is currently bigger than it can handle. Also, we have a system that decides where to put an object, but at this point (this line) that decision is not going to be honored: the object goes to another shard (we had, and may still have, bugs related to such cases). In fact, it has always bothered me that object put was somewhat random: you could never tell from the logs whether the placement was fine or a "bad" shard was taken as a best effort.

carpawell (Member, Author):

This PR (and the issue) is a real example, by the way. If I had had this log, I would not have had to run this test so many times trying to understand what was happening; it would have taken a minute of looking at the logs.

@@ -164,6 +164,8 @@ mainLoop:
 			}
 			continue loop
 		}
+
+		e.log.Debug("could not put to shard, trying another", zap.String("shard", shards[j].ID().String()))
Contributor:

If the last shard fails, there will be no more trying.

carpawell (Member, Author):

Not a problem to me; it is the classic iteration: "try another shard; the shard list is over; the cycle is finished".

carpawell (Member, Author):

What exactly would you expect here?

@@ -28,7 +28,7 @@ func newEngineEvacuate(t *testing.T, shardNum int, objPerShard int) (*StorageEng
 
 	e := New(
 		WithLogger(zaptest.NewLogger(t)),
-		WithShardPoolSize(1))
+		WithShardPoolSize(uint32(objPerShard)))
Contributor:

In general, this is a doubtful fix to me, possibly hiding a buggy implementation or test.

carpawell (Member, Author), replying to "doubtful fix to me":

Why? This is literally the problem; I proved it locally, and it required 100k+ test runs. ants.Pool has background workers and some logic that blocks execution (it counts the number of in-progress workers); if a worker is not freed before the next iteration happens, the current shard logic just tries another shard and duplicates objects.
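
A simplified, hypothetical reproduction of that race (not the StorageEngine code): putToSomeShard, the two 1-sized pools, and the sleeping "write" are invented for the example; the point is that a rejected Submit makes the caller fall through to the next shard.

package main

import (
    "fmt"
    "time"

    "github.com/panjf2000/ants/v2"
)

// putToSomeShard mimics the fallback described above: if the preferred
// shard's pool rejects the task (busy, non-blocking mode), the next shard
// is tried, so the object can land on an unexpected shard.
func putToSomeShard(shardPools []*ants.Pool) (shardIdx int, err error) {
    for i, p := range shardPools {
        if err = p.Submit(func() { time.Sleep(10 * time.Millisecond) }); err == nil {
            return i, nil
        }
        // ants.ErrPoolOverload with WithNonblocking(true): silently try the next shard.
    }
    return -1, err
}

func main() {
    pools := make([]*ants.Pool, 2)
    for i := range pools {
        pools[i], _ = ants.NewPool(1, ants.WithNonblocking(true))
    }

    // Two quick puts aimed at shard 0: the second one is likely rejected by
    // the busy 1-sized pool and ends up on shard 1 instead.
    for obj := 0; obj < 2; obj++ {
        idx, err := putToSomeShard(pools)
        fmt.Printf("object %d -> shard %d (err: %v)\n", obj, idx, err)
    }

    for _, p := range pools {
        p.Release()
    }
}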

carpawell (Member, Author):

In other words, a 1-sized pool cannot guarantee that an object will be put when you want to put 2+ objects. On a few-core machine it is more critical; on my notebook the failure rate is about 0.003%. But an evacuation test should not suffer because of the put problem, IMO: it should test the evacuation logic, and it cannot, because objects are duplicated and go to unexpected shards.
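
A hypothetical sketch of the sizing argument (not the test code): rejectedSubmits is invented for the example. With a non-blocking pool, a burst of N quick submits is only guaranteed to be accepted when the pool has at least N workers, which is why the test pool is sized to objPerShard.

package main

import (
    "fmt"
    "time"

    "github.com/panjf2000/ants/v2"
)

// rejectedSubmits sends `objects` short tasks to a non-blocking pool of the
// given size and counts how many are rejected because every worker is busy.
func rejectedSubmits(poolSize, objects int) int {
    p, _ := ants.NewPool(poolSize, ants.WithNonblocking(true))
    defer p.Release()

    rejected := 0
    for i := 0; i < objects; i++ {
        if err := p.Submit(func() { time.Sleep(time.Millisecond) }); err != nil {
            rejected++ // in the engine, a rejection here means "try another shard"
        }
    }
    return rejected
}

func main() {
    fmt.Println("pool size 1:", rejectedSubmits(1, 3), "of 3 rejected") // usually > 0
    fmt.Println("pool size 3:", rejectedSubmits(3, 3), "of 3 rejected") // always 0
}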

Member:

We're testing evacuation here, so shard behavior details are not really relevant (a subject for another bug).

Masked error-less shard skipping makes debugging awful; relates to #2860.

Signed-off-by: Pavel Karpy <[email protected]>
It looks like on a slow (few-core?) machine the put operation can fail because
the internal pool has not been freed from the previous iteration (another shard
is then tried, and an additional "fake" relocation is detected in the test).
Closes #2860.

Signed-off-by: Pavel Karpy <[email protected]>
@roman-khimov merged commit 9c049a2 into master on Jun 13, 2024
21 of 22 checks passed
@roman-khimov deleted the fix/evacuate-test branch (June 13, 2024 13:02)