fix/TestEvacuateShard test #2868

Conversation
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #2868      +/-   ##
==========================================
+ Coverage   23.56%   23.64%   +0.07%
==========================================
  Files         773      770       -3
  Lines       44672    44569     -103
==========================================
+ Hits        10529    10537       +8
+ Misses      33290    33178     -112
- Partials      853      854       +1

☔ View full report in Codecov by Sentry.
Force-pushed from 7cfd17f to 7040f49
Disgusting debugging. On the other hand, this (
Force-pushed from 161489b to 48452c9
@@ -163,6 +172,7 @@ func (e *StorageEngine) putToShard(sh hashedShard, ind int, pool util.WorkerPool

		putSuccess = true
	}); err != nil {
		e.log.Warn("object put: pool task submitting", zap.Error(err))
How does this affect the system, and/or what is the admin's reaction to this message supposed to be?
The load on this shard is currently bigger than it can handle. Also, we have a system that decides where to put an object, but at this point (this line) it does not work: the object goes to another shard (we had, and may still have, bugs related to such cases). In fact, it has always bothered me that object put was somewhat random: you could never tell from the logs whether the placement was fine or a "bad" shard was taken as a best effort.
This PR (and the issue) is a real example, BTW. If I had had this log, I wouldn't have had to run this test so many times trying to understand what was happening; it would have taken a minute of looking at the logs.
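For context, a minimal sketch of the pattern discussed in this thread, with hypothetical names (workerPool, putter, tryPut) rather than the actual neofs-node types: the put task is submitted to the shard's worker pool, and when the pool rejects it the code now warns instead of staying silent, so the fallback to another shard becomes visible in logs.

```go
package engine

import "go.uber.org/zap"

// workerPool is a hypothetical minimal interface standing in for the
// engine's per-shard pool (assumed to be ants-backed in this sketch).
type workerPool interface {
	Submit(task func()) error
}

type putter struct {
	log  *zap.Logger
	pool workerPool
}

// tryPut reports whether this shard's pool accepted and executed the put.
func (p *putter) tryPut(put func() error) bool {
	var ok bool
	done := make(chan struct{})

	if err := p.pool.Submit(func() {
		defer close(done)
		ok = put() == nil // corresponds to "putSuccess = true" in the hunk
	}); err != nil {
		// Previously this branch was silent and the caller just moved on to
		// another shard; the reviewed change adds a warning here.
		p.log.Warn("object put: pool task submitting", zap.Error(err))
		close(done)
	}

	<-done
	return ok
}
```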
@@ -164,6 +164,8 @@ mainLoop:
			}
			continue loop
		}

		e.log.Debug("could not put to shard, trying another", zap.String("shard", shards[j].ID().String()))
If the last shard fails, there won't be another try.
Not a problem to me; it's a classic iteration: "try another shard; the shard list is over; the cycle is finished".
what exactly would you expect here?
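To make the iteration concrete, here is a hedged sketch of the pattern (illustrative names, not the actual engine code): each shard that refuses the object gets a debug line, and only after the whole list, including the last shard, has been tried does the caller get an error.

```go
package engine

import (
	"errors"

	"go.uber.org/zap"
)

// shard is a hypothetical minimal view of a shard for this sketch.
type shard interface {
	ID() string
	Put(obj []byte) error
}

// putToAnyShard walks the sorted shard list; if the last shard fails too,
// the loop simply ends and the final error is returned.
func putToAnyShard(log *zap.Logger, shards []shard, obj []byte) error {
	for _, sh := range shards {
		if err := sh.Put(obj); err == nil {
			return nil
		}
		log.Debug("could not put to shard, trying another",
			zap.String("shard", sh.ID()))
	}
	return errors.New("could not put object to any shard")
}
```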
@@ -28,7 +28,7 @@ func newEngineEvacuate(t *testing.T, shardNum int, objPerShard int) (*StorageEng

 	e := New(
 		WithLogger(zaptest.NewLogger(t)),
-		WithShardPoolSize(1))
+		WithShardPoolSize(uint32(objPerShard)))
In general, this is a doubtful fix to me; it possibly hides a buggy implementation or test.
"doubtful fix to me"

Why? This is literally the problem; I proved it locally, and it required 100k+ test runs. ants.Pool has background workers and some logic that blocks execution (it counts the number of in-progress workers); if a worker is not freed before the next iteration, the current shard logic just tries another shard and duplicates objects.
In other words, a 1-sized pool cannot guarantee that an object will be put when you want to put 2+ objects. On a few-core machine it is more critical; on my notebook the failure rate is about 0.003%. But an evacuation test should not suffer because of this put problem, IMO: it should test evacuation logic, and it cannot, because objects get duplicated and go to unexpected shards.
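The described window can be reproduced outside the engine with ants directly; a hedged sketch assuming the per-shard pool is a non-blocking ants pool of size 1 (github.com/panjf2000/ants/v2): a second Submit issued while the single worker is still busy is rejected with ErrPoolOverload, and that is exactly when the engine falls back to another shard.

```go
package main

import (
	"fmt"
	"time"

	"github.com/panjf2000/ants/v2"
)

func main() {
	// Size-1, non-blocking pool, as the per-shard pool is assumed to be
	// configured for this sketch.
	pool, err := ants.NewPool(1, ants.WithNonblocking(true))
	if err != nil {
		panic(err)
	}
	defer pool.Release()

	busy := make(chan struct{})

	// The first task occupies the only worker.
	_ = pool.Submit(func() { <-busy })

	// A second task submitted before the worker is released is rejected;
	// in the engine this is the moment another shard gets tried.
	if err := pool.Submit(func() {}); err != nil {
		fmt.Println("second submit:", err) // ants.ErrPoolOverload
	}

	close(busy)
	time.Sleep(10 * time.Millisecond) // let the first task finish
}
```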
We're testing evacuation here, so shard behavior details are not really relevant (a subject for another bug).
Masked, error-less shard skipping makes debugging awful; relates to #2860. Signed-off-by: Pavel Karpy <[email protected]>
It looks like on a slow (few-core?) machine the put operation can fail because the internal pool has not been freed since the previous iteration (another shard is then tried, and an additional "fake" relocation is detected in the test). Closes #2860. Signed-off-by: Pavel Karpy <[email protected]>
Force-pushed from 48452c9 to 9680cd5
No description provided.