Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HPCC4J-611 Add OpenTelemetry tracing to dfsclient #716

Merged
merged 1 commit into from
Jul 3, 2024

Conversation

jpmcmu
Copy link
Contributor

@jpmcmu jpmcmu commented Jun 14, 2024

  • Added tracing support to read pathways
  • Added tracing support to write pathways

Signed-off-by: James McMullan [email protected]

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change is a breaking change (fix or feature that will cause existing behavior to change).

Checklist:

  • I have created a corresponding JIRA ticket for this submission
  • My code follows the code style of this project.
    • I have applied the Eclipse code-format template provided.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the HPCC Systems CONTRIBUTORS document (https://github.com/hpcc-systems/HPCC-Platform/wiki/Guide-for-contributors).
  • The change has been fully tested:
    • This change does not cause any existing JUnits to fail.
    • I have include JUnit coverage to test this change
    • I have performed system test and covered possible regressions and side effects.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Premature optimization
    • This change fixes the problem, not just the symptom

Testing:

Copy link

Jira Issue: https://hpccsystems.atlassian.net/browse/HPCC4J-611

Jirabot Action Result:
Workflow Transition: Merge Pending
Updated PR

Copy link
Member

@rpastrana rpastrana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpmcmu provided some high level feedback, let's discuss offline

@@ -294,6 +311,7 @@ private boolean retryRead()
}
catch (Exception e)
{
this.readSpan.end();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This situation seems important enough to report as an exception on the span, and if we can determine the readSpan has failed at this point, we should set the span status to fail.

@@ -344,6 +391,13 @@ public RowServiceInputStream(DataPartition dp, FieldDef rd, FieldDef pRd, int co

this.dataPart = dp;

this.readSpan = readSpan;
if (readSpan != null)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should also test if readspan is valid

this.readSpan = readSpan;
if (readSpan != null)
{
this.readSpanTraceID = "00-" + readSpan.getSpanContext().getTraceId()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is actually the trace context and is propagated via a "traceparent" header, let's refactor this.readSpanTraceID

{
this.prefetchException = e;

if (readSpan != null)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and is valid


if (readSpan != null)
{
Attributes attributes = Attributes.of( ServerAttributes.SERVER_ADDRESS, getIP(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are the server address, etc attributes only added in exception handlers?

@@ -1986,7 +2128,8 @@ private void makeFetchObject(StringBuilder sb)

private String makeGetVersionRequest()
{
final String versionMsg = RFCCodes.RFCStreamReadCmd + "{ \"command\" : \"version\", \"handle\": \"-1\", \"format\": \"binary\" }";
final String trace = readSpan != null ? "\"_trace/traceparent\" : \"" + readSpanTraceID + "\",\n" : "";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to send the traceparent is the span isn't valid

return GlobalOpenTelemetry.get().getTracer("DFSClient");
}

public static Span createRootSpan(String name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure DFUClient should ever create a root span, since this is a client library we should inherit the trace from the caller

@jpmcmu jpmcmu changed the title WIP: HPCC4J-611 Add OpenTelemetry tracing to dfsclient HPCC4J-611 Add OpenTelemetry tracing to dfsclient Jul 3, 2024
HpccFileException wrappedException = new HpccFileException("Failed on initial remote read read trans", e);
if (writeSpan != null)
{
Attributes attributes = Attributes.of(ServerAttributes.SERVER_ADDRESS, this.rowServiceIP,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed this seems to be repeated in a few places, why not add a function in utils to handle the attributes annotation in a single place. Also, why aren't we annotating the span as soon as it's created?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to cleanup how I am handling adding events / exceptions in the next PR, so I was holding off on cleaning up how attributes get added for those events. However, those attributes should have been added to the span when it was originally created. I will do that as part of this PR.

return versionMsg;
}

private void makeInitialWriteRequest() throws Exception
{
String jsonRecordDef = RecordDefinitionTranslator.toJsonRecord(this.recordDef).toString();

final String trace = writeSpan != null ? "\"_trace/traceparent\" : \"" + writeSpanTraceID + "\",\n" : "";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might not want to report traceparents unless we know it's a valid traceparent

Copy link
Contributor Author

@jpmcmu jpmcmu Jul 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I added a check during construction, such that writeSpan will only be set if its context is valid.

Copy link
Member

@rpastrana rpastrana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpmcmu looks fine, I did leave a couple of comments we need to resolve.

I don't want to delay this pr anymore than needed, but we have to confirm the traceparent is being propagated in the dafilserv requests (perhaps simple junit test cases?)

Also, since you're creating new trace/spans, we need to exercise that logic as well.
Can you capture log output showing the toplevel trace/span, and at least one of the child spans and confirm the child inherits the trace id from the parent span?

@jpmcmu
Copy link
Contributor Author

jpmcmu commented Jul 3, 2024

@rpastrana Still looking into unit tests to capture the trace IDs, but here is example log output for a read:
[otel.javaagent 2024-07-03 15:28:52:230 -0400] [Thread-2] INFO io.opentelemetry.exporter.logging.LoggingSpanExporter - 'HPCCRemoteFileReader.RowService/Read_benchmark::integer::20kb_1' : ac3aaf1874cbf5569e33d679558bc44d e9b52d6c621e6f38 CLIENT [tracer: DFSClient:] AttributesMap{data={server.address=192.168.4.41, server.port=7601, thread.id=18, read.size=-1000, thread.name=Thread-2}, capacity=128, totalAddedValues=5}

- Added tracing support to read pathways
- Added tracing support to write pathways

Signed-off-by: James McMullan [email protected]
Copy link
Member

@rpastrana rpastrana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpmcmu approved

@rpastrana rpastrana merged commit cbc653f into hpcc-systems:candidate-9.6.x Jul 3, 2024
3 of 5 checks passed
@@ -287,16 +374,19 @@ private static class RowServiceResponse

private String makeGetVersionRequest()
{
final String versionMsg = RFCCodes.RFCStreamReadCmd + "{ \"command\" : \"version\", \"handle\": \"-1\", \"format\": \"binary\" }";
final String trace = traceContextHeader != null ? "\"_trace/traceparent\" : \"" + traceContextHeader + "\",\n" : "";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakesmith FYI all HPCC4J requests into DAFILESrv will contain a new header with OTel trace context. We've confirmed on our end that the header does not seem to break the server side, but we'd like to get your thoughts. Also, we're opening a jira requesting that dafilesrv at minimum consume this info and propagate it, and optimally add helpful instrumentation based on jtrace.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpmcmu - is a backslash in an JSON identifier valid? Regardless it looks a little odd, why not just "traceparent" or if want a separate node, generate '"_trace" : { "traceparent": traceContextHeader }' ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakesmith it's supposed to be

"_trace": { "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-00" } ,

as you point out. Once fixed, will that layout work for your side?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I would make it :
"trace": { "parent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-00" }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The leading underscore comes from the roxie protocol, I vote we keep it for symetry unless it breaks some sort of convention

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right ok, and "traceparent" is also part of that protocol?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. So bar the fix to the format, it seems fine.

@@ -513,8 +662,11 @@ public void write(byte[] b) throws IOException
*/
public void write(byte[] b, int off, int len) throws IOException
{
String request = "{ \"format\" : \"binary\", \"handle\" : \"" + this.handle + "\""
+ (useOldProtocol ? "" : ", \"command\" : \"continue\"") + " }";
final String trace = traceContextHeader != null ? "\"_trace/traceparent\" : \"" + traceContextHeader + "\",\n" : "";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any value in sending this every request ?
Will it always match the one sent in makeInitialWriteRequest ? If so, dafilesrv can maintain what it saw and reuse, rather than re-read etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the most impactful value in reporting it is validation of expected parent trace. The server side can choose to drop it and ignore, how concerned are we about the potentially superfluous header in that case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the expectation is that it should match the initial request value? If it does not, should it ignore it and warn, or use the new value. Is it likely to change, or is it impossible?

In general, the dafilesrv protocol does not expect clients to resend info. that has been encompassed in the initial request, e.g. the "node" type and format etc. - the "handle" on all subsequent requests represents the reference to established stream info.

If this is a field value that might change, then I'm not particularly concerned with the extra expense of the comm traffic, assuming the client is dealing with reasonable sized chunks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it comes down to how much overhead we're incurring vs how much it could help us track a problematic activity...
The overhead seems minimal, but it also doesn't seem particularly helpful.

If the client/server ever get out of synch (I don't know how likely this scenario is), could the traceparent help us track the problem (ie, the out of synch request would be tracked under a known trace)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not particularly fussed either way, but it is out of character with other info that is sent once.
And, no, I don't think it will be useful if the client gets out of sync. If that happens, the handle (which is the key reference), will be out of sync, the server (dafilesrv) will reject it (challenge response) and the client should reinitiate the connection, along with sending "_trace" info.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants