fix/TestEvacuateShard test #2868

Merged: 2 commits into master from fix/evacuate-test, Jun 13, 2024
Conversation

carpawell (Member):

No description provided.

@carpawell self-assigned this on Jun 11, 2024

codecov bot commented Jun 11, 2024

Codecov Report

Attention: Patch coverage is 44.44444% with 5 lines in your changes missing coverage. Please review.

Project coverage is 23.64%. Comparing base (e5da072) to head (9680cd5).
Report is 4 commits behind head on master.

Files                                     Patch %   Lines
pkg/local_object_storage/engine/put.go    37.50%    5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2868      +/-   ##
==========================================
+ Coverage   23.56%   23.64%   +0.07%     
==========================================
  Files         773      770       -3     
  Lines       44672    44569     -103     
==========================================
+ Hits        10529    10537       +8     
+ Misses      33290    33178     -112     
- Partials      853      854       +1     

☔ View full report in Codecov by Sentry.

@carpawell force-pushed the fix/evacuate-test branch 6 times, most recently from 7cfd17f to 7040f49 (June 12, 2024 17:11)
@carpawell changed the title from "debug evacuate test" to "fix/TestEvacuateShard test" (Jun 12, 2024)
@carpawell marked this pull request as ready for review (June 12, 2024 17:12)
carpawell (Member, Author) commented Jun 12, 2024

Disgusting debugging.

On the other hand, this (WithNonblocking) does not look good to me and could be changed. It may be the reason for the objects-on-wrong-shards effects, and it may also lead to an error when all shards are busy (I would prefer to wait for them and fail on a deadline). @roman-khimov, @cthulhu-rider
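
For illustration, a minimal sketch of the difference between non-blocking and blocking submission with the ants/v2 pool mentioned in this thread; the 1-sized pools and the sleeping tasks are arbitrary and not taken from the engine code.

package main

import (
    "fmt"
    "time"

    "github.com/panjf2000/ants/v2"
)

func main() {
    // Non-blocking pool of size 1: while the first task still occupies the
    // only worker, the second Submit fails immediately with ants.ErrPoolOverload.
    nb, _ := ants.NewPool(1, ants.WithNonblocking(true))
    defer nb.Release()

    _ = nb.Submit(func() { time.Sleep(100 * time.Millisecond) })
    if err := nb.Submit(func() {}); err != nil {
        fmt.Println("non-blocking pool rejected the task:", err)
    }

    // Blocking pool of size 1 (the default mode): the second Submit waits for
    // the worker to become free instead of returning an error.
    bl, _ := ants.NewPool(1)
    defer bl.Release()

    _ = bl.Submit(func() { time.Sleep(100 * time.Millisecond) })
    if err := bl.Submit(func() {}); err == nil {
        fmt.Println("blocking pool waited for a free worker")
    }
}

With a blocking pool, the "all shards are busy" case would make the caller wait (bounded, for example, by a deadline on the caller's side) instead of silently redirecting the object to another shard.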

@carpawell force-pushed the fix/evacuate-test branch 2 times, most recently from 161489b to 48452c9 (June 12, 2024 17:21)
@@ -163,6 +172,7 @@ func (e *StorageEngine) putToShard(sh hashedShard, ind int, pool util.WorkerPool
 
 		putSuccess = true
 	}); err != nil {
+		e.log.Warn("object put: pool task submitting", zap.Error(err))
Contributor:

How does this affect the system, and/or what is the admin's expected reaction to this message?

carpawell (Member, Author):

It means the load on this shard is currently bigger than it can handle. Also, we have a system that decides where to put an object, but at this point (this line) that decision is not going to be honored: the object goes to another shard (we had, and may still have, bugs related to such cases). In fact, it has always bothered me that object put was somewhat random: you could never tell from the logs whether the placement was fine or a "bad" shard was taken as a best effort.

carpawell (Member, Author):

This PR (and the issue) is a real example, by the way. If I had had this log, I would not have had to run this test so many times trying to understand what was happening; it would have taken a minute of looking at the logs.

@@ -164,6 +164,8 @@ mainLoop:
 			}
 			continue loop
 		}
+
+		e.log.Debug("could not put to shard, trying another", zap.String("shard", shards[j].ID().String()))
Contributor:

If the last shard fails, there will be no more trying.

carpawell (Member, Author):

Not a problem to me; it is the classic iteration: "try another shard; the shard list is over; the cycle is finished".

carpawell (Member, Author):

What exactly would you expect here?

@@ -28,7 +28,7 @@ func newEngineEvacuate(t *testing.T, shardNum int, objPerShard int) (*StorageEng
 
 	e := New(
 		WithLogger(zaptest.NewLogger(t)),
-		WithShardPoolSize(1))
+		WithShardPoolSize(uint32(objPerShard)))
Contributor:

In general, this is a doubtful fix to me, possibly hiding a buggy implementation or test.

carpawell (Member, Author), replying to "doubtful fix to me":

Why? This is literally the problem; I proved it locally, and it required 100k+ test runs. ants.Pool has background workers and some logic that blocks execution (it counts the number of in-progress workers); if a worker is not freed before the next iteration happens, the current shard logic just tries another shard and duplicates objects.
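
A simplified, hypothetical reproduction of that race (not the StorageEngine code): putToSomeShard, the two 1-sized pools, and the sleeping "write" are invented for the example; the point is that a rejected Submit makes the caller fall through to the next shard.

package main

import (
    "fmt"
    "time"

    "github.com/panjf2000/ants/v2"
)

// putToSomeShard mimics the fallback described above: if the preferred
// shard's pool rejects the task (busy, non-blocking mode), the next shard
// is tried, so the object can land on an unexpected shard.
func putToSomeShard(shardPools []*ants.Pool) (shardIdx int, err error) {
    for i, p := range shardPools {
        if err = p.Submit(func() { time.Sleep(10 * time.Millisecond) }); err == nil {
            return i, nil
        }
        // ants.ErrPoolOverload with WithNonblocking(true): silently try the next shard.
    }
    return -1, err
}

func main() {
    pools := make([]*ants.Pool, 2)
    for i := range pools {
        pools[i], _ = ants.NewPool(1, ants.WithNonblocking(true))
    }

    // Two quick puts aimed at shard 0: the second one is likely rejected by
    // the busy 1-sized pool and ends up on shard 1 instead.
    for obj := 0; obj < 2; obj++ {
        idx, err := putToSomeShard(pools)
        fmt.Printf("object %d -> shard %d (err: %v)\n", obj, idx, err)
    }

    for _, p := range pools {
        p.Release()
    }
}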

carpawell (Member, Author):

In other words, a 1-sized pool cannot guarantee that an object will be put when you want to put 2+ objects. On a few-core machine it is more critical; on my notebook the failure rate is about 0.003%. But an evacuation test should not suffer because of the put problem, IMO: it should test the evacuation logic, and it cannot, because objects are duplicated and go to unexpected shards.
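
A hypothetical sketch of the sizing argument (not the test code): rejectedSubmits is invented for the example. With a non-blocking pool, a burst of N quick submits is only guaranteed to be accepted when the pool has at least N workers, which is why the test pool is sized to objPerShard.

package main

import (
    "fmt"
    "time"

    "github.com/panjf2000/ants/v2"
)

// rejectedSubmits sends `objects` short tasks to a non-blocking pool of the
// given size and counts how many are rejected because every worker is busy.
func rejectedSubmits(poolSize, objects int) int {
    p, _ := ants.NewPool(poolSize, ants.WithNonblocking(true))
    defer p.Release()

    rejected := 0
    for i := 0; i < objects; i++ {
        if err := p.Submit(func() { time.Sleep(time.Millisecond) }); err != nil {
            rejected++ // in the engine, a rejection here means "try another shard"
        }
    }
    return rejected
}

func main() {
    fmt.Println("pool size 1:", rejectedSubmits(1, 3), "of 3 rejected") // usually > 0
    fmt.Println("pool size 3:", rejectedSubmits(3, 3), "of 3 rejected") // always 0
}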

Member:

We're testing evacuation here, so shard behavior details are not really relevant (a subject for another bug).

Masked error-less shard skipping makes debugging awful; relates to #2860.

Signed-off-by: Pavel Karpy <[email protected]>
It looks like on a slow (few-core?) machine the put operation can fail because
the internal pool has not been freed from the previous iteration (another shard
is then tried, and an additional "fake" relocation is detected in the test).
Closes #2860.

Signed-off-by: Pavel Karpy <[email protected]>
@roman-khimov merged commit 9c049a2 into master on Jun 13, 2024
21 of 22 checks passed
@roman-khimov deleted the fix/evacuate-test branch (June 13, 2024 13:02)