Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Diagnostics Logs: Fixes ContactReplicas to only include the successful replica contacted per physical partition. #4156

Conversation

philipthomas-MSFT
Copy link
Contributor

Pull Request Template

Description

Describe the bug
Fixing a bug for ContactedReplicas. It should only display the one replica per physical partition that was contacted successfully, not all replicas contacted for the partition.

To Reproduce
Steps to reproduce the behavior. If you can include code snippets or links to repositories containing a repro of the issue that can helps us in detecting the scenario it would speed up the resolution.

Print any diagnostic log string where multiple replicas on the same physical partition failed and one was finally successful.

Expected behavior

"ContactedReplicas": [
	{
		"Count": 1,
		"Uri": "rntbd://cdb-ms-prod-westus-fd4.documents.azure.com:14382/apps/9dc0394e-d25f-4c98-baa5-72f1c700bf3e/services/060067c7-a4e9-4465-a412-25cb0104cb58/partitions/2cda760c-f81f-4094-85d0-7bcfb2acc4e6/replicas/132608933859499990p/"
	}
]

Actual behavior

"ContactedReplicas": [
	{
		"Count": 1,
		"Uri": "rntbd://cdb-ms-prod-westus-fd4.documents.azure.com:14382/apps/9dc0394e-d25f-4c98-baa5-72f1c700bf3e/services/060067c7-a4e9-4465-a412-25cb0104cb58/partitions/2cda760c-f81f-4094-85d0-7bcfb2acc4e6/replicas/132608933859499990p/"
	},
	{
		"Count": 1,
		"Uri": "rntbd://cdb-ms-prod-westus-fd4.documents.azure.com:14382/apps/9dc0394e-d25f-4c98-baa5-72f1c700bf3e/services/060067c7-a4e9-4465-a412-25cb0104cb58/partitions/2cda760c-f81f-4094-85d0-7bcfb2acc4e6/replicas/132608933859470000s/"
	},
	{
		"Count": 1,
		"Uri": "rntbd://cdb-ms-prod-westus-fd4.documents.azure.com:14382/apps/9dc0394e-d25f-4c98-baa5-72f1c700bf3e/services/060067c7-a4e9-4465-a412-25cb0104cb58/partitions/2cda760c-f81f-4094-85d0-7bcfb2acc4e6/replicas/132608933859471110s/"
	},
	{
		"Count": 1,
		"Uri": "rntbd://cdb-ms-prod-westus-fd4.documents.azure.com:14382/apps/9dc0394e-d25f-4c98-baa5-72f1c700bf3e/services/060067c7-a4e9-4465-a412-25cb0104cb58/partitions/2cda760c-f81f-4094-85d0-7bcfb2acc4e6/replicas/132608933859472220s/"
	}
]

Environment summary
SDK Version: All
OS Version (e.g. Windows, Linux, MacOSX)

Additional context
Add any other context about the problem here (for example, complete stack traces or logs).

Solution

  • Identify any tests that contradict the expected behavior. If none exists, create them. Tests should pass.
  • Identify the system under test and make the necessary changes.
  • Go back to tests, and the tests should fail.
  • Fix tests, and tests should pass.

Impact of change

Breaking contracts: Tempting as it may be, I do not wish to break the current contract which is keeping ContactedReplicas as part of an array with a count of 1, even though we are always forcing only the last contacted replica in the array to be revealed. Always open to discuss if we want to break this, but for now, the impact of change is at its minimum.

Duplicate contact regions: The method and tests to verify the removing of duplicate contact replicas from the array is unnecessary since we are only revealing the last contacted region.

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

Closing issues

To automatically close an issue: closes #IssueNumber

@kirankumarkolli
Copy link
Member

It should only display the one replica per physical partition that was contacted successfully, not all replicas contacted for the partition

ContactedReplicas() semantically needs to include all replicas contacted part of serving request (ex: retries, quorum etc...).
In what scenarios non-contacted replicas are showing-up in ContactedReplicas()?

@philipthomas-MSFT
Copy link
Contributor Author

semantically

ContactedReplicas() semantically needs to include all replicas contacted part of serving request (ex: retries, quorum etc...).
In what scenarios non-contacted replicas are showing-up in ContactedReplicas()?

Thanks for your feedback. So, for clarification, are we saying that the bug is not that ContactReplicas should only show the last contacted replica only (and I guess that makes sense), but somehow there are replicas in the contacted replica array that has not actually been contacted that should not obviously be there? How did we come to this conclusion at the inception of this bug because if that is the case, not sure how to repo this? Maybe we should just close this for now.

@@ -237,6 +237,11 @@ public void Visit(ClientSideRequestStatisticsTraceDatum clientSideRequestStatist
this.jsonWriter.WriteObjectEnd();
}

private static List<TransportAddressUri> GetFinalContactedReplica(ClientSideRequestStatisticsTraceDatum clientSideRequestStatisticsTraceDatum)
{
return new List<TransportAddressUri> { clientSideRequestStatisticsTraceDatum.ContactedReplicas.LastOrDefault() };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A request might have contacted multiple replicas, why are we taking only 1?

@ealsur
Copy link
Member

ealsur commented Oct 30, 2023

Thanks for your feedback. So, for clarification, are we saying that the bug is not that ContactReplicas should only show the last contacted replica only (and I guess that makes sense), but somehow there are replicas in the contacted replica array that has not actually been contacted that should not obviously be there? How did we come to this conclusion at the inception of this bug because if that is the case, not sure how to repo this? Maybe we should just close this for now.

A request can go to 1 replica or for some reason (Strong Consistency for example checking quorum, or receiving a 404/1002 and having to retry) might reach to another replica.

The problem is not that the list has >1 but rather that the ContactedReplicas list contains items that were not visited. In order to understand where the problem is, we need to first check which are the places that are updating ContactedReplicas (probably in the direct package) to understand if there is some point where the data is being added incorrectly.

One way to tackle this is first understand all the steps/places involved in generating and collecting this data to understand which of them is wrong?

@kirankumarkolli
Copy link
Member

Closing inactive PRs older than 3-months, please re-active it needed.

auto-merge was automatically disabled September 13, 2024 18:57

Pull request was closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge Enables automation to merge PRs
Projects
Status: Done
3 participants