Testing zero-copy bugs fixes (not for merging) #1156

Open
wants to merge 31 commits into
base: master


@szetszwo szetszwo commented Sep 28, 2024

This fix will be split into multiple JIRAs: RATIS-2164, RATIS-2151, RATIS-2173

The following are the bugs found so far:

  1. LeakDetector: asserted allLeaks is non-empty but printed "allLeaks.size = 0"
    • Another bug: Tracks are added to the set before calling retain. Without calling retain at all, it is not a leak.
  2. SimpleTracing and AdvancedTracing: the methods should be synchronized.
    • Minor presentation problem: AdvancedTracing should have a single track list instead of retainsTraces and releaseTraces.
  3. GrpcClientProtocolService.UnorderedRequestStreamObserver.processClientRequest(..) should use try-finally.
  4. GrpcLogAppender.appendLog(..) calls release() incorrectly in the exception path.
  5. LogAppenderDefault.sendAppendEntriesWithRetries(..) calls release() incorrectly in the exception path.
  6. LogSegment cache can release an entry multiple times.
  7. LogSegment.loadCache(..) should call retain() for cache hit.
  8. SegmentedRaftLog.retainLog(..): between getting the entry and calling retain(), the entry can be released. The "fail to retain" exception, if there is one, can be ignored since it is the same as a cache miss. See RATIS-2159 (TestRaftWithSimulatedRpc could "fail to retain", #1153).
  9. SegmentedRaftLog.retainEntryWithData(..) should call release() when an exception is thrown.
  10. Test bug: the log entries stored in SimpleStateMachine4Testing can be released.
  11. LogSegment: New entries can be added after EntryCache is closed.
  12. MemoryRaftLog has problems similar to those in SegmentedRaftLog.
  13. SegmentedRaftLogWorker should clean up unfinished tasks in the queue after it stops running.
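
Several items above (3, 4, 5, 8 and 9) come down to the same rule: every successful retain must be paired with exactly one release, including when an exception is thrown, and a failed retain can be handled like a cache miss. The following is only a minimal sketch of that rule. CountedEntry and EntryProcessor are hypothetical names invented for illustration; they are not the Ratis reference-counting classes and do not show the actual fix in this PR.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical reference-counted entry; NOT the Ratis reference-counting API. */
final class CountedEntry {
  // The creator (for example, a cache) holds the initial reference.
  private final AtomicInteger refCount = new AtomicInteger(1);
  private final String data;

  CountedEntry(String data) {
    this.data = data;
  }

  /** Takes a reference; returns false if the entry has already been fully released. */
  boolean tryRetain() {
    for (;;) {
      final int n = refCount.get();
      if (n == 0) {
        return false; // lost the race with the final release
      }
      if (refCount.compareAndSet(n, n + 1)) {
        return true;
      }
    }
  }

  /** Drops one reference; returns true if this call dropped the last one. */
  boolean release() {
    for (;;) {
      final int n = refCount.get();
      if (n == 0) {
        throw new IllegalStateException("released more times than retained");
      }
      if (refCount.compareAndSet(n, n - 1)) {
        return n == 1;
      }
    }
  }

  int refCount() {
    return refCount.get();
  }

  String data() {
    if (refCount.get() == 0) {
      throw new IllegalStateException("entry already released");
    }
    return data;
  }
}

/** Hypothetical caller showing the retain / try-finally / release pairing. */
final class EntryProcessor {
  private EntryProcessor() {}

  /** Returns the processed data, or null if the entry could not be retained. */
  static String process(CountedEntry entry, boolean simulateFailure) {
    if (!entry.tryRetain()) {
      return null; // the caller treats this like a cache miss
    }
    try {
      if (simulateFailure) {
        throw new IllegalStateException("simulated processing failure");
      }
      return "processed:" + entry.data();
    } finally {
      entry.release(); // exactly one release per successful retain, even on exception
    }
  }
}
```

In this sketch the reference count returns to its previous value whether process(..) returns normally or throws, so the owner of the initial reference remains the only one who can perform the final release.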

@@ -55,13 +59,56 @@
 */
public class LeakDetector {
  private static final Logger LOG = LoggerFactory.getLogger(LeakDetector.class);

  private static class LeakTrackerSet {
    private final Set<LeakTracker> set = Collections.newSetFromMap(new ConcurrentHashMap<>());
Contributor

All methods are synchronized, do we still need to use ConcurrentHashMap?

Contributor Author

You are right! We don't need ConcurrentHashMap anymore.
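
The point of this exchange can be sketched as follows: when every access to a set goes through a synchronized method of the owning object, a plain HashSet is sufficient, because the object's monitor already provides mutual exclusion and visibility. SynchronizedTrackerSet below is a hypothetical stand-in written for illustration, not the LeakTrackerSet from this PR.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a tracker set; every access is a synchronized method. */
final class SynchronizedTrackerSet<T> {
  // A plain HashSet is enough: it is only touched while holding this object's monitor.
  private final Set<T> set = new HashSet<>();

  synchronized boolean add(T tracker) {
    return set.add(tracker);
  }

  synchronized boolean remove(T tracker) {
    return set.remove(tracker);
  }

  synchronized int size() {
    return set.size();
  }

  /** Returns a copy, so the caller checks and prints the same consistent view. */
  synchronized Set<T> snapshot() {
    return new HashSet<>(set);
  }
}
```

Taking a snapshot under the same monitor also avoids the kind of inconsistency described in item 1 of the bug list, where an emptiness check and the printed size come from two different reads of a set that changes in between.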

@@ -46,6 +47,8 @@ public class DataBlockingQueue<E> extends DataQueue<E> {
  private final Condition notFull = lock.newCondition();
  private final Condition notEmpty = lock.newCondition();

  private boolean closed = false;

Contributor

Why not use atomic?

Contributor Author

There is already a lock.

Contributor

I see. :)
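
As a rough illustration of why a plain boolean suffices here: if the closed flag is only read and written while the existing lock is held, the lock already gives atomicity and visibility, so an AtomicBoolean adds nothing. ClosableQueue below is a hypothetical, simplified class, not the DataBlockingQueue from this PR; its offer/poll/close methods are assumptions made for the sketch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical simplified queue whose closed flag is guarded by the existing lock. */
final class ClosableQueue<E> {
  private final ReentrantLock lock = new ReentrantLock();
  private final Queue<E> queue = new ArrayDeque<>();
  // Plain field: it is only read or written while holding the lock.
  private boolean closed = false;

  /** Adds an element; returns false if the queue has been closed. */
  boolean offer(E element) {
    lock.lock();
    try {
      if (closed) {
        return false; // nothing new can be added after close
      }
      return queue.offer(element);
    } finally {
      lock.unlock();
    }
  }

  /** Removes and returns the head, or null if the queue is empty. */
  E poll() {
    lock.lock();
    try {
      return queue.poll();
    } finally {
      lock.unlock();
    }
  }

  /** Closes the queue and hands back the remaining elements so the caller can clean them up. */
  List<E> close() {
    lock.lock();
    try {
      closed = true;
      final List<E> remaining = new ArrayList<>(queue);
      queue.clear();
      return remaining;
    } finally {
      lock.unlock();
    }
  }
}
```

Because the closed check and the insertion happen under the same lock, no element can slip in after close() returns, and the elements handed back by close() are the complete set of unfinished items.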


szetszwo commented Oct 7, 2024

Finally, this is able to pass all the tests (with a few retries). Note that there are probably other zero-copy bugs; I will fix them separately.


szetszwo commented Oct 7, 2024

This can pass all the tests (with a few retries). Since this change is quite big (56kB) and non-trivial, I will split it into a few JIRAs:

  1. The current JIRA RATIS-2164 for fixing LeakDetector. (I will leave this PR as-is and submit another PR.)
  2. RATIS-2151 for fixing gRPC.
  3. RATIS-2159 for fixing other non-gRPC cases.

I will see whether (2) and (3) need to be split further.

BTW, we should move the enabling of LeakDetector from MiniRaftClusterWithGrpc to MiniRaftCluster, so that it can detect more failures.
