-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sensitive Data Redaction #255
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: svrnm <[email protected]>
An additional method for setting the sensitivity information for a span attribute might look like the following: | ||
|
||
``` | ||
span.setAttribute("url.query", url.toString(), <SENSITIVITY_DETAILS>); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably more of a spec concern and we'll have more time to design and polish API, but I'd like to entertain the idea of
- Some
SensitivityConfig
instrumentation can request fromTracer
SensitivityConfig
can be configured by the SDK/distro/etc and can have different implementations for different regulationsSensitivityConfig
has convenience methods to redact common things like URLs. E.g.String SensitivityConfig.redactQueryParams(Uri uri)
- It may recognize common attributes which are known to be problematic (e.g.
client.address
on the server spans is potentially PII) and could be applied implicitly without instrumentation code
E.g.
Span clientSpan = tracer.startSpan("GET", CLIENT);
String redactedUri = tracer.getSensitivityConfig().redactQueryParams(uri);
clientSpan.setAttribute("url.full", redactedUri);
or
Span serverSpan = tracer.startSpan("GET /foo", SERVER);
// the `SensitivityConfig` associated with the tracer knows that `client.address`
// may be sensitive and would apply whatever it's configured to do
// allow, drop, anonymize, etc)
serverSpan.setAttribute("client.address", request.getClientIp());
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a suggestion as an alternative to what I suggested or an addition? It looks like an addition (and I think it's great!), but I wanted to verify first before discussing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's both :)
I'd like to be able to add redaction/sanitization without requiring instrumentation to change their code or worry about it.
I.e. the alternative proposal is to build a reasonable story within existing API surface first.
But the proposal you made in the OTEL will still be necessary - we'll need to redact/sanitize custom attributes or to optimize and fine-tune instrumentation code for those who want it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Throwing in one more suggestion - we can auto-generate semantic convention code along with the redaction.
Today we generate constants like:
class UrlAttributes {
public static final String URL_FULL = "url.full";
}
we could instead generate a method
class UrlAttributes {
public static final String URL_FULL = "url.full";
public static void recordUrlFull(Span span, Uri uri) {
String redacted = getSensitivityConfig().redactQueryParams(uri);
span.setAttribute(URL_FULL, redacted);
}
}
We still need to have a central config and a way to pass it around and make it accessible on the API level, but we can hide a lot of boilerplate and be able to partially automate things across languages.
/cc @lquerel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to be able to add redaction/sanitization without requiring instrumentation to change their code or worry about it.
Yes, that's an important feature! With me suggestions for how to write that down in the semantic conventions this should be feasible. Let me add some wording to make this clear.
Looking at the solutions I would prefer the first one over the updated code generation for the semantic conventions, i.e. setAttribute
should do the redaction internally, because having a method per attribute seems excessive to me (and people will forget to use it, vs not using the constants comes with less penalties) and people can configure their own redaction through configuration (let's say the use an application they do not have code access and this application emits "enduser.id" without redaction, they can update their sensitivity config and make sure that it is redacted)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Library authors do not have enough context to make reduction decisions. The best they can do is tag attributes with some semantic annotation saying "this is sensitive" (e.g. in my company we have a taxonomy of "purpose policy" annotations for data elements). If I deploy this library as a user, I should be able to decide if I actually want "sensitive" to be redacted or not - I may be debugging locally and need to see all data, or maybe in production I have different pathways where the data flows, some require reduction and others require raw data.
And to add these annotations it's better to go with the schemas approach, without changing the API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And to add these annotations it's better to go with the schemas approach, without changing the API.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are talking about the same thing, @yurishkuro: Indeed a library author can not make reduction decision, this is the responsibility of the application operator, but the library author needs capabilities to express that some data may be sensitive and provide redaction options similar to what we want to do in the semantic conventions. Here is an example: A library author writes a method that calls a payment provider HTTP endpoint, which has a unique query parameter to carry the token (let's say acmeToken
). That library author now wants to say "when I call this endpoint, and a HTTP span is generated, make sure that url.query
is redacted in a way that acmeToken
is treated like other potentially sensitive attributes (see #971). Ideally they can do something similar to what is proposed for the semantic conventions, e.g. a way to write down what to do when DEFAULT
, STRICT
or STRICTER
(or any other profile is selected). Then, the end user of that library and of opentelemetry can make the decision of redaction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@svrnm yes, this is a pretty complex use case, but I would still err on the side of a centralized solution over federating these responsibilities to the call sites. It's conceivable that the semantic convention / schema for url.query
could be extended with conditional clauses like "if I am in this instrumentation scope or the URL contains this domain name, these are the additional rules / annotations for the query elements". And I think this approach is going to be necessary in many cases anyway, because the specific HTTP instrumentation may not have any clue about this business use case, e.g. it could be a generic HTTP client interceptor that generates client spans and records URL, and only user code would have the theoretical capability of understanding the sensitivity of the query params.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's conceivable that the semantic convention / schema for url.query could be extended with conditional clauses like "if I am in this instrumentation scope or the URL contains this domain name, these are the additional rules / annotations for the query elements".
That's what is part of this proposal, I need to optimize the wording for that, but eventually the configuration for redaction needs to be that fine-grained.
but I would still err on the side of a centralized solution over federating these responsibilities to the call sites.
Yes, centralized is the core of it, but library/application authors still need a way to provide a hint that a redaction needs to be applied. Rethinking my example, we actually have the following situation:
A library author is likely using a HTTP client library as dependency themselves to interact with a 3rd partyAPI, so in some pseudo code they might have something like the following:
function authenticateWithMyService($token) {
// they want to let OpenTelemetry know that $token is sensitive, and when
// it is injected into the `url.query` attribute certain redaction may be required.
use3rdPartyHttpClient("https://myservice//?myAcmeToken=$token")
}
How would they implement that? They need some ways to annotate that. Also those annotations need to be composable somehow.
Signed-off-by: svrnm <[email protected]>
|
||
#### API | ||
|
||
Every API that sets an attribute consisting of a key and a value, needs to be enhanced by an additional functionality that allows to add details about the potential sensitivity of this data and a hooks how it may be redacted. This can be an additional set of parameters for an existing method or a method that can be called after the attribute has been set, if adding parameters to a method signature would lead to a breaking change. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel metrics should take an exception here due to the nature of cardinality and performance. Do we know if there were credential/privacy leaks in metrics systems?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would need to research on that, but I think performance will be an issue for all signals, I added that to the trade-off section already. So I wonder if we need to do some experiments eventually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Metrics API is designed to be much faster than other APIs (e.g. it's not uncommon to have metrics APIs that are 20x faster than tracing APIs), so the perf impact would be bigger if measured by the % drop of calls/second.
Cardinality could also be an issue, for example, if the intention is to get the "unique count of users via email address", the result will be 1
if email address got redacted to "[email protected]".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarification, that's a signification performance difference indeed.
On cardinality I think there are 2 options to address that:
- if the redaction happens late (before exporting data) it is still possible to create that metric properly
- there are options to redact emails and still keep their uniqueness, e.g. by hashing the emails (or parts of it), note that hashing is a suboptimal solution since the hashed words are easy to guess & break (use a dictionary with common first and lastnames, etc.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still need to add this into the document.
The annotations and APIs as outlined above will allow SDK users to provide their sensitive requirements as configuration (here: environment variable, but can be encoded in future config files as well), e.g. | ||
|
||
``` | ||
env OTEL_SENSITIVE_DATA_PROFILE=”STRICT” ./myInstrumentedApp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this apply to the SDK globally or can be applied on specific pipelines/processors/exporters? e.g. if the goal is to send audit logs (which have EUII such as email address and IP address) to destination A without performing any redaction, and send normal logs to destination B with redaction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The suggestion here was globally, but I see the point you are making for having it split out by exporter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still need to add this into the document.
|
||
## Open questions | ||
|
||
- **Question 1**: Should sensitivity details for an attribute in the semantic conventions be excluded from the stability guarantees? This means, updating them for a **stable** attribute is not a breaking change. The idea behind excluding them from the stability guarantees is that the requirements are subject of change due to changes in technology (see [#971](https://github.com/open-telemetry/semantic-conventions/pull/971), the list of query string values will evolve over time) or changes in regulatory requirements or both. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like there should still be some stability definition.
I wouldn't want to see something that was redacted suddenly not be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point, maybe "adding redactions" may be outside, but "removing redactions" may require additional guard rails. It is probably unlikely that redactions will be removed, the only case I can think of is that something is deprecated for a very long time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will leave the question open, I can add some words on adding/removing redaction
An additional method for setting the sensitivity information for a span attribute might look like the following: | ||
|
||
``` | ||
span.setAttribute("url.query", url.toString(), <SENSITIVITY_DETAILS>); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest to start with the most generic solution that can address all leaks, even if it's less efficient. Anything that relies on instrumentation callsite changes is fundamentally not capable of full coverage - someone will always forget some sensitivity parameter or a call to sanitizer. Whereas having a config-driven sanitizer will be able to see all data and apply rules reliably.
Co-authored-by: Reiley Yang <[email protected]> Co-authored-by: Yuri Shkuro <[email protected]>
env OTEL_SENSITIVE_DATA_PROFILE=”STRICT” ./myInstrumentedApp | ||
``` | ||
|
||
This call will make sure that redactions applied follow the `STRICT` profile. If not set the `DEFAULT` will be used. Additionally there are 2 levels that can not be used in sensitivity details: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the not on purpose?
It seems like at least ALWAYS
could be useful, for example as a temporary mitigation when a leakage is discovered, and before it can be properly fixed.
This call will make sure that redactions applied follow the `STRICT` profile. If not set the `DEFAULT` will be used. Additionally there are 2 levels that can not be used in sensitivity details: | |
This call will make sure that redactions applied follow the `STRICT` profile. If not set the `DEFAULT` will be used. Additionally there are 2 levels that can be used in sensitivity details: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what I wanted to express here is that ALWAYS and NEVER can not be reconfigured through <SENSITIVITY_DETAILS>
as outlined above in the document. Because if allowed it would break expectations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that statement was a bit confusing to me as well, maybe it can be rephrased to express what you said above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still need to add this into the document.
|
||
#### API | ||
|
||
Every API that sets an attribute consisting of a key and a value, needs to be enhanced by an additional functionality that allows to add details about the potential sensitivity of this data and a hooks how it may be redacted. This can be an additional set of parameters for an existing method or a method that can be called after the attribute has been set, if adding parameters to a method signature would lead to a breaking change. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we considering Baggage as a signal here? And would we consider in-band data (context propagation) as well as out-of-band data?
I think open-telemetry/opentelemetry-specification#1633 may benefit from any spec changes related to being able to specify sensitivity of Baggage attributes to be propagated across system boundaries. I don't think this OTEP would solve the context propagation boundary issue (Propagators would still need to define what they consider a system boundary) but I wonder, providing that Baggage is considered here, if the SensitivityConfig
should contain something related to defining if a Baggage attribute is safe to propagate or not. I'm not convinced either way myself just now, but thought I'd raise it as we're saying "Every API".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a very interesting thought. Since baggage may carry sensitive data it's worth looking into it (we might later to move into "future considerations"). @lmolkova example with client.address
in #255 (comment) already called out that it might be relevant to add contextual information to the redaction, the baggage propagation is another example of that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still need to add this into the document.
|
||
By adding the proposed features, OpenTelemetry will provide its end-users the tooling needed to make sure that sensitive data is treated according to their requirements. | ||
|
||
This will make sure that these end-users can use OpenTelemetry within their secure and legal requirements and that OpenTelemetry (and implementing libraries) are able to avoid vulnerabilities. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯 Nice summary @svrnm!
I suggest that we turn these into a list of principles and call it out clearly in this OTEP.
Here are some examples I could think of:
- OpenTelemetry MUST allow the end-user to meet with their security/privacy/compliance requirements regarding the data being collected.
- OpenTelemetry MUST not provide redaction offerings that lead to bigger security issues such as https://en.wikipedia.org/wiki/ReDoS (e.g. the redaction logic is poorly implemented, so a hacker could forge certain input to DDoS the redaction engine itself).
- OpenTelemetry SHOULD allow the telemetry data to apply different redaction logic per telemetry pipeline/exporter in a single process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About 2: while we cannot guarantee all SDKs will provide a bug-free implementation of the redaction logic, we could offer a logic to be followed by all SDKs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @reyang, that's a good proposal, will incorporate that (+ the suggestion from @jpkrohling which I agree to, I'll try to find some wording to express that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added those at the top of the document. I used "MUST avoid" vs "MUST not provide" since as @jpkrohling said this is impossible to guarantee, but we can put mechanisms in place to avoid them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this goes in the right direction. You did list a "NEVER" option, effectively disabling redaction, but I think it would be more prominently mentioned that and end-user should be able to easily enable/disable this globally.
|
||
By adding the proposed features, OpenTelemetry will provide its end-users the tooling needed to make sure that sensitive data is treated according to their requirements. | ||
|
||
This will make sure that these end-users can use OpenTelemetry within their secure and legal requirements and that OpenTelemetry (and implementing libraries) are able to avoid vulnerabilities. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About 2: while we cannot guarantee all SDKs will provide a bug-free implementation of the redaction logic, we could offer a logic to be followed by all SDKs.
An additional method for setting the sensitivity information for a span attribute might look like the following: | ||
|
||
``` | ||
span.setAttribute("url.query", url.toString(), <SENSITIVITY_DETAILS>); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is already implicit, but to me, it's important to let users disable the redaction (ie, having a sensitivity details config that is permissive). This would allow them to not pay a performance penalty at runtime, and apply the redaction logic at the collector if this model satisfies their requirements.
| `client.address`| | Rationale: some reasons why dropping octets may be required<br>Type: `maybe_pii`<br>`DEFAULT`: `NONE`<br>`STRICT`: `'s/([0-9]+\.[0-9]+\.[0-9]+\.)[0-9]+/\10/'`<br>`STRICTER`: `'s/([0-9]+\.[0-9]+\.)[0-9]+\.[0-9]+/\10.0/'` | | ||
| `enduser.creditCardNumber`**[1]** | | Rationale: ...<br>Type: `always_pii`<br>DEFAULT: `EXTRACT_IIN`<br>`STRICT`: `DROP`| | ||
|
||
**[1]**: _This is a made-up example for demonstration purpose, it’s not part of the current semantic conventions. It gives a more nuanced example, e.g. that extracting the IIN might be an option over dropping the number completely. It also demonstrated the value of “type”, which can enable Data lineage use cases_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/IIN/Issuer Identification Number (IIN)/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
applied
### OTEPS | ||
|
||
- [OTEP 100 - Sensitive Data Handling](https://github.com/open-telemetry/oteps/pull/100) | ||
- [OTEP 187 - Data Classification for resources and attributes](https://github.com/open-telemetry/oteps/pull/187) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'll eventually need to have something similar to this as well: semconv attributes would need a marker, about the common sensitivity of the attribute's value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to review those 2 PRs in detail and incorporate whatever we want to include here as well. I added a little bit of wording on classification (see my credit card example)
- `NEVER`: No redaction is applied, SDKs may choose to log a warning that this is a risky choice | ||
- `ALWAYS`: All values with sensitivity details will be dropped | ||
|
||
Additionally SDK end users can provide advanced configuration (through code, configuration file, probably not environment variable) to add specific needs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the direction this proposal is going toward, however, I'd like to challenge the notion that redaction is treated as an instrumentation issue, as opposed to a configuration issue. I wonder if it would be possible to treat redaction as a configuration issue, like we do sampling.
Reading through this document and arriving at this part, I was thinking about a solution that relies solely on SDK components (without any API changes needed), by providing SDK redactors and configuring them via "redaction profiles" (which consists of mappings from attribute names to redaction rules, similar to the example below). SDKs could provide some default redaction profiles (for example for redacting strict, stricter, or never), which are based on definitions in semantic conventions.
The main challenges of this approach would be considering multiple versions of semantic conventions, furthermore it wouldn't be that straightforward for authors of instrumentation libraries to add redaction rules.
However, it would give the benefit to have redaction rules defined at a central place. Given that the examples above (different regular expressions for different redaction levels), it might be error prone to implement those consistently across all usages of an attribute in different instrumentation libraries in different languages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it would be possible to treat redaction as a configuration issue, like we do sampling.
For 90% of the cases, I would say the answer to this is yes, but especially library authors need ways to express additional redaction requirements very locally, e.g. if they call a 3rd party service via a HTTP GET and they know that in this one specific case they need to apply additional redaction. Without a functionality for that they can either over-configure (e.g. apply it globally but potentially have other calls redacted without the need for that) or do their own logic on top of what we provide them. Both things are OK and we might decide against having the API for that (in a first version, or forever), but then we need to indiciate that somehow in the specification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the updated version, I moved towards leading with the configuration centric approach and mentioned API capabilities towards the end
|-----------|--------------------------|---------------------| | ||
| `url.query`| | Rationale: Some verbatim wording why this is the way it is below<br>Type: `mixed`<br>`DEFAULT`: `REDACT_INSECURE_URL_PARAM_VALUES`<br>STRICT: `REDACT_ALL_URL_VALUES`<br>`STRICTER`: `DROP` | | ||
| `client.address`| | Rationale: some reasons why dropping octets may be required<br>Type: `maybe_pii`<br>`DEFAULT`: `NONE`<br>`STRICT`: `'s/([0-9]+\.[0-9]+\.[0-9]+\.)[0-9]+/\10/'`<br>`STRICTER`: `'s/([0-9]+\.[0-9]+\.)[0-9]+\.[0-9]+/\10.0/'` | | ||
| `enduser.creditCardNumber`**[1]** | | Rationale: ...<br>Type: `always_pii`<br>DEFAULT: `EXTRACT_IIN`<br>`STRICT`: `DROP`| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a reasonable default (or strict mode?) for PII data may be anonymization. If we defined an algorithm, we'd be able to do it consistently across implementations and preserve some level of observability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What algorithm would you have in mind for anonymization? Often hashing is used for that, but depending on the underlying data it's pseudo-anonymization (there is a paper on that from the EU, I need to dig up), e.g. email addresses can be guessed easily by tuning your dictionaries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good question, I don't have an algorithm in mind. My intention was to have some form of AMONYMIZE mode listed, but details can be out of scope of this otep.
If you believe there is no common and reliable algorithm, then I'm taking my comment back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf is the document I was referring to, they give a really good overview of different approaches to Anonymization (e.g. stripping octets from IPs would fall into "Generalization" or "Aggregation") and also discuss Pseudonymization (e.g. hashing).
I don't know if this is doable, my point is that if we provide ANONYMIZE as an algorithm it should do exactly that (choosing from the approaches listed in that document), and if we want to use hashing etc an appropriate name should be used (PSEUDONYMIZE?)
It's worth to explore this, but before doing that we need to have the tooling for that in place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The anonymization method often depends on the underlying data type. The byte-stripping technique you mentioned is applicable to IPs but not necessarily to all data types. This raises the question of what we can support/propose based on the metadata available or not for an attribute (e.g., semconv or schema), for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am thinking about that as well right now, I try to come up with a suggestion in the updated proposal.
Co-authored-by: Joao Grassi <[email protected]>
Signed-off-by: svrnm <[email protected]>
I reworked the document to address (most) of the feedback, please take another look, thanks! |
Signed-off-by: svrnm <[email protected]>
Signed-off-by: svrnm <[email protected]>
Or, if the method can not be extended, an additional method can be called: | ||
|
||
``` | ||
span.setAttribute("url.query", OpenTelemetry.redact(url.toString(), <SENSITIVITY_DETAILS>); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
span.setAttribute("url.query", OpenTelemetry.redact(url.toString(), <SENSITIVITY_DETAILS>); | |
span.setAttribute("url.query", OpenTelemetry.redact(url.toString(), <SENSITIVITY_DETAILS>)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this example the caller is the one doing redaction, but from the OTel API side it doesn't know it already happened and the SDK would do the redaction again.
In the best case it's redundant, but it's also possible that the different redaction layers are inconsistent.
I believe we should allow instrumentations to do the redaction because they may have more context on the data, but then we should avoid doing it again.
The suggestion here is to
- keep
span.setAttribute(string name, any value)
for the cases when SDK does redaction. - allow to express that data is already safe and sanitized with a different API. E.g.
span.setAttribute(string name, any value, <SAFE_DO_NOT_CHANGE>)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to add parameters to setAttribute, I just wanted to explore how to do it if we want to avoid that.
In my previous proposal I had something like
span.setAttribute(name, value)
span.redactAttribute(name, <SENSITIVITY_DETAILS>)
I removed and replaced that because I was not sure anymore why I did it, your comment reminded me of that! My idea was that the end user can give redaction hints, but I also like your suggestion of signaling that the application/library took care of redaction themselves (which is related to #255 (comment) and #255 (comment))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder, if we were asked to design an instrumentation API from the start with support for redaction functionality from the start, whether we woiuld try to create a first-class object out of the attribute key.
Instead of writing "url.query"
in the code, shouldn't we have attribute Key declarations with support for optional details? Then in the code you would refer to conventions::url::QueryKey
or similar, instead of being expected to supply and process the sensitivity details at each call site.
As you said above
OpenTelemetry MUST avoid redaction offerings that lead to bigger security issues
I worry that the API details here are critically important. If the SDK has to do complex processing of optional argument structures at the call site, where performance is critical, complex code will be the result. I would prefer to see us embrace code-generation using the semantic conventions database and https://github.com/open-telemetry/weaver, so that attribute sensitivity and redaction requirements are compiled in, using dedicated APIs with clear, simple processing rules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the feedback so far major consensus is that the redaction mainly should happen in the background, so even if you provide url.query
as a string the SDK knows what to do with that (e.g. by leveraging code generated by the weaver).
The big discussion point is what to do if someone wants to add additional redaction requirements. We can not expect that what we provide via the semconv + code-generation is covering all the requirements that users have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The generator or the generated code could be aware of an external configuration/schema containing the definition of sensitive data. For the instrumented application that will be totally transparent.
|
||
- **Question 1**: Should sensitivity details for an attribute in the semantic | ||
conventions be excluded from the stability guarantees? This means, updating | ||
them for a **stable** attribute is not a breaking change. The idea behind |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The stability guarantees cover the part of telemetry that goes over the wire (i.e. the telemetry consumer side), which, AFAIK is not the case for the sensitivity details.
The part sensitivity details affect is the attribute value, but we don't (at least yet) provide any guarantees there.
So I think we can remove this open question.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, only guarantee we should consider to give is that redactions are never "downgraded" (see #255 (comment)), such that endusers can trust the redaction to work as expected when they update opentelemetry.
Or, if the method can not be extended, an additional method can be called: | ||
|
||
``` | ||
span.setAttribute("url.query", OpenTelemetry.redact(url.toString(), <SENSITIVITY_DETAILS>); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder, if we were asked to design an instrumentation API from the start with support for redaction functionality from the start, whether we woiuld try to create a first-class object out of the attribute key.
Instead of writing "url.query"
in the code, shouldn't we have attribute Key declarations with support for optional details? Then in the code you would refer to conventions::url::QueryKey
or similar, instead of being expected to supply and process the sensitivity details at each call site.
As you said above
OpenTelemetry MUST avoid redaction offerings that lead to bigger security issues
I worry that the API details here are critically important. If the SDK has to do complex processing of optional argument structures at the call site, where performance is critical, complex code will be the result. I would prefer to see us embrace code-generation using the semantic conventions database and https://github.com/open-telemetry/weaver, so that attribute sensitivity and redaction requirements are compiled in, using dedicated APIs with clear, simple processing rules.
|
||
### Redactor | ||
|
||
To accomplish redaction the SDK needs a component (similar to a sampler) that inspects attributes when they are set and applies the required redactions: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the trace SDK-- yes, this is functionality that is similar to samplers. Sampling SIG has been working on requirements for a V2 sampler API, and I would include redaction functionality there. See open-telemetry/opentelemetry-specification#4012.
For the metrics SDK -- there will be questions about whether redaction happens before or after aggregation. The present SDK specification doesn't support a processor that happens before aggregation; the filtering that is possible through metrics Views doesn't exactly support redaction, since modifications to the attribute set need to be considered before aggregation (or else re-aggregation has to be performed following redaction). There has been a requested feature, which goes by the name MeasurementProcessor
in past discussions, that would be the proper place to (a) redact / drop attributes, (b) extend attributes from context.
In the logs SDK, this is form of processing is the subject of an open discussion, see open-telemetry/opentelemetry-specification#4010.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I think I wrote that "similar to a sampler" without giving it a lot of thought beyond "this needs to be a component that can be evolved independently (like samplers) and is also independent of the signal.
I'll review your feedback and try to incorporate it into the document
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall want to give a HUGE THANK YOU for making a dent in this problem and throwing a proposal out there.
I have a lot of comments. I wish I had more time to give viable solutions to my comments, but I do not. (I also lack the time to be brief, so this is long winded).
Overall -
This has the bones of what I think we want to build.
- A configurable set of redaction rules
- A mechanism to annotate attributes with meta-infromation about sensitivity/security
- A simple user knob of "ON/OFF/my-custom-thing" for important use cases.
The details need a bit more fleshing out in particular:
- The YAML model (and OTEL API) for annotating attributes with baseline sensitivity/redaction controls. You may need MULTIPLE redaction methods for each level of redaction you support, e.g.
- Stricter lines / definitions (less possible bikeshed) on Sensitivity levels.
- More details on the interaction with the SDK, in particular is it the same interface for all three signals or different?
- More granular model (Span name redaction, e.g.)
Overall a great start. Let me know if there's anything I can do to help push areas of this forward or flesh out any of your ideas.
|
||
```yaml | ||
sensitiveData: | ||
- attribute: url.query |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you plan to have these be only global on attributes, or fine grained -> e.g. this specific Span an this specific Attribute?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just an example, I think we need more than attributes, see below I even suggested to use OTTL (might be an overkill thinking about the performance impact)
chooses to run a collector nearby the application that will take care of | ||
redaction out of process. If this profile is selected the SDK must make sure | ||
that (almost) no overhead remains from the redaction methods. | ||
- `DEFAULT`: A set of rules that especially make sure that no security-sensitive |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I see a lot of difference between DEFAULT/STRICT/STRICTER. At a minimum, I see too many "vague" notions where we could wind up bikeshedding what belong in one vs. another.
Can we be more principled in terms of defining:
- What each profile means.
- A set of annotations / meta-semantics for attributes?
E.g.
NONE - no redaction
NO_USER - redact any information that is input by a user of the system or could be used to identify a user of the system (email, name, IP, etc.)
NO_LOCATION - redact any location information (geo lat/lon, address, etc.)
Basically - I don't think the current set is going to be something we can specify. This needs more rigor and strict lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, I like that, this is a great idea!
```yaml | ||
sensitiveData: | ||
- attribute: url.query | ||
redaction: REDACT_ALL_URL_VALUES |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to only specify "redaction" that can be implemented in all languages? Will this be a new specification of allowable redactions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe? I think at some point we need to do some prototyping to figure out what redaction is allowed to do and what not. Performance is the main reason but "this needs to be implemented in "any" language" is another complex requirement.
- attribute: com.example.customer.name | ||
redaction: DROP | ||
- attribute: com.example.customer.email | ||
redaction: 's/^[^@]*@/REDACTED@/' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a regex rule. The reality here is there's a lot of nuanced-differences between regex across languages and libraries. You'll want to pick a specific implementation to reference or avoid exposing it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, regular expressions might not be the right tool to express the redaction requirements. Having differences in languages is the one thing, but also the ReDOS example comes to mind.
method, or a sed-like expression that uses a regular expression to apply redaction. | ||
|
||
Another possibility is to re-use a language like | ||
[OTTL](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/ottl/README.md). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OTTL needs better formal specificity to be language-neutral before this is viable.
I'd actually like to see that formal specification happen, but I also don't think its current state is a good candidate for this, as it relegates all actual runtime behavior to functions (Today it even relgates the data model via implementation-specific meaning for .
).
TL;DR; This is akin to suggesting we have a set of hardcoded redaction functions.
|
||
## Internal details | ||
|
||
### Redactor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you call out how this works across signals?
Right now we have no direct ties between Span/Metric/Trace via the SDK. They actually only communicate with each other via Context
API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The honest answer is "I don't know", @jmacd made some suggestions on this, I need to take a look into this
relevant in this proposal are “security” and “privacy”. They may overlap but we | ||
distinguish them as follows: | ||
|
||
- security-relevant sensitive data: any information that when exposed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing this no, would love to go back to your definitions above and have sensitivity/redaction be:
ALL_KNOWN
, KNOWN_SECURITY
, KNOWN_PRIVACY
, NONE
I pick these words because we can offer NO GUARANTEES on redaction given our open API.
- The full name of a user in a social network is less sensitive than the full | ||
name of a user in a medical research database | ||
|
||
Depending on the sensitivity data an end-user of an observability system may |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we need to be defining the level of sensitivity for users. For example, a system able to be Privacy compliant could likely handle wipeout rules and protection for all of these. OpenTelemetry should not enforce level of sensitivity, but allow our users to interact with their own definitions using our system.
I'd cut this section or call it out as something for our users to define, not us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already a kind of "appendix" section, I can make it clearer from the document structure.
[Data Minimization](https://en.wikipedia.org/wiki/Data_minimization), to avoid | ||
"unnecessary risks for the data subject". | ||
|
||
**Note 1**: it is not (and can not be) the responsibility of the OpenTelemetry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be the first thing stated in this section. We provide tooling to allow users to be compliant, but they are ultimately responsible.
|
||
By adding an extra layer of processing every time an attribute value gets set, | ||
might have an impact on the performance. There might be ways to reduce that | ||
overhead, e.g. by only redacting values which are finalized and ready to exported |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has security implications. E.g. in Java we you wouldn't want sensitive passwords interned in Strings because they can be exfiltrated.
I like what you're suggesting, but I'd go the other way. You could memoize the redaction function when generating the attribute value in some fashion, or use codegen/weaver to optimise these use cases (if we get the yaml model sorted).
Happy to get this conversation started, a big thank you from me to everyone who provided their feedback so far.
I appreciate your feedback, a lot of things I can take and try to provide a solution for.
💯
I'll take another look into all feedback and try to evolve the document with it.
Thanks! |
e.g. drop all query attribute values, drop PII-related data, etc. | ||
- `DROP` or `ALWAYS`: All values with (potentially) sensitivity details will be dropped | ||
|
||
Those profiles will be defined through the semantic conventions (see below). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though it may seem obvious to some, I find it interesting to clarify the relationship and the functioning mode between these profiles and the configuration files presented in the following section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am reworking that
|
||
| Attribute | ... existing columns ... | sensitivity details | | ||
|-----------|--------------------------|---------------------| | ||
| `url.query`| | Rationale: Some verbatim wording why this is the way it is below<br>Type: `mixed`<br>`DEFAULT`: `REDACT_INSECURE_URL_PARAM_VALUES`<br>STRICT: `REDACT_ALL_URL_VALUES`<br>`STRICTER`: `DROP` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I would expand this section dedicated to semconvs by including the concept of a telemetry schema, which could eventually override what is defined in a semconv but specifically for a service or application and for a given deployment. In a way, the previously presented configuration file could be reorganized or seen as a subset of this telemetry schema for a specific context.
|-----------|--------------------------|---------------------| | ||
| `url.query`| | Rationale: Some verbatim wording why this is the way it is below<br>Type: `mixed`<br>`DEFAULT`: `REDACT_INSECURE_URL_PARAM_VALUES`<br>STRICT: `REDACT_ALL_URL_VALUES`<br>`STRICTER`: `DROP` | | ||
| `client.address`| | Rationale: some reasons why dropping octets may be required<br>Type: `maybe_pii`<br>`DEFAULT`: `NONE`<br>`STRICT`: `'s/([0-9]+\.[0-9]+\.[0-9]+\.)[0-9]+/\10/'`<br>`STRICTER`: `'s/([0-9]+\.[0-9]+\.)[0-9]+\.[0-9]+/\10.0/'` | | ||
| `enduser.creditCardNumber`**[1]** | | Rationale: ...<br>Type: `always_pii`<br>DEFAULT: `EXTRACT_IIN`<br>`STRICT`: `DROP`| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The anonymization method often depends on the underlying data type. The byte-stripping technique you mentioned is applicable to IPs but not necessarily to all data types. This raises the question of what we can support/propose based on the metadata available or not for an attribute (e.g., semconv or schema), for example.
|
||
**[1]**: _This is a made-up example for demonstration purpose, it’s not part of the current semantic conventions. It gives a more nuanced example, e.g. that extracting the Issuer Identification Number (IIN) might be an option over dropping the number completely. It also demonstrated the value of “type”, which can enable Data lineage use cases_ | ||
|
||
The `DROP` keyword means that the value is replaced with `REDACTED` (or set to null, …). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I am misinterpreting the proposal, but at the semcons level, I don't think we should have something defining the action to take on an attribute or a body. Instead, we should have a property characterizing the potential sensitivity of the data carried by this attribute. In my view, the decision on what to do with this type of sensitive data (e.g., partial or complete combination of name, data type, sensitivity level) should be determined by the profile or the local configuration/schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, I plan to change this. I consider to suggest that we keep some kind of "suggested default redaction" if that is feasible, but beyond that it's something that needs to be decided locally.
As an additional building block the OpenTelemetry API should provide capabilities | ||
to library (and application) authors to apply in-place redaction | ||
|
||
One option is to add additional attributes to methods like `setAttribute`: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The decision about what is sensitive and what to do with sensitive data in terms of privacy is likely to be made by people different from those who implement the application or its instrumentation. Moreover, this decision can evolve over time. Therefore, I think it is more interesting to maintain this separation and have such decisions at the schema, profile, or configuration file level, so there is no need to rebuild the application itself. This approach also ensures that there are no omissions or misuses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Developers need to be able to provide an annotation(!) that data is "potentially sensitive". I try to express this better, that it's not their job to make decisions on that but that they need to be part of highlighting that.
Or, if the method can not be extended, an additional method can be called: | ||
|
||
``` | ||
span.setAttribute("url.query", OpenTelemetry.redact(url.toString(), <SENSITIVITY_DETAILS>); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The generator or the generated code could be aware of an external configuration/schema containing the definition of sensitive data. For the instrumented application that will be totally transparent.
The redactor will also have all the methods to apply predefined redactions | ||
(`REDACT_ALL_URL_VALUES`, `REDACT_INSECURE_URL_PARAM_VALUES`, `DROP`, etc.). | ||
If a method is not implemented (either by the SDK or by the end-user choosing | ||
one that does not exist), it will default to apply `DROP` to avoid leakage of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one that does not exist), it will default to apply `DROP` to avoid leakage of | |
one that does not exist), it will default to apply `DROP` to avoid leakage of data. |
[weakens the overall security of the device/system](https://en.wikipedia.org/wiki/Vulnerability_(computing)). | ||
- privacy-relevant sensitive data: any information that when exposed | ||
[can adversely affect the privacy or welfare of an individual](https://en.wikipedia.org/wiki/Information_sensitivity). | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The identification of sensitive data could also be applied at a semantic level. For example, as a company, I might want to apply a redaction rule to ALL URLs, another rule to ALL IP addresses, and so on. One possible approach would be to introduce a new type of group in the semconv that would allow the definition of semantic types, which could be attached to existing attributes in addition to the "physical" data type that we already have. The redaction rules would then apply to the semantic types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes 🙌 , this was one of the things I wanted to include in my updated proposal, based on your and @jsuereth's suggestion
**Note 1**: it is not (and can not be) the responsibility of the OpenTelemetry | ||
project to provide compliance with any of those regulations, this is a | ||
responsibility of the OTel end-user. OTel can only facilitate parts of those | ||
requirements. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having support for lineage associated with this type of metadata could help identify where a redaction operation originates from. We have started to support the concept of lineage in Weaver, and I think it could be used in this specific case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes 🙌 , I mentioned lineage somewhere in the doc already because it seems to be highly related here.
|
||
Ideally a combination is used. | ||
|
||
### Alternative 2: Backend |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another complementary option could be to support this type of sensitive data management at the code generation level with a tool like Weaver.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my understanding from the discussions so far is that if we integrate the sensitive data details in the semconv a tool like weaver could be part of the solution, not an alternative?
Thank you all for the great feedback. I was not sure what a good practice is, but I reset the PR to |
open-telemetry/semantic-conventions#971 (comment)
|
Related to open-telemetry/semantic-conventions#128 |
This OTEP proposes changes that will enable sensitive data redaction. This is a follow up to open-telemetry/semantic-conventions#971 and open-telemetry/semantic-conventions#961 and the SemConv spec meeting (2024-04-29)