feat: Apply schema validation to all topics in local development #2013
Conversation
Alternative to #1981 -- apply schema validation to all outgoing messages when Relay runs in development mode. This means that messages are validated as part of integration tests. The additional dependencies are compiled unconditionally, but unused in release builds. The only way to conditionally compile them is to introduce yet another feature flag, but that seems burdensome to use since we use `--all-features` everywhere right now. We had similar issues introducing optional SSL support in Kafka a long time ago, IIRC.
I like this approach! It has zero runtime overhead in production but is more automatic than relying on manually crafted tests.
Pressing "Request changes" because of some open questions.
```diff
@@ -36,6 +39,10 @@ pub enum ClientError {
     #[error("failed to serialize json message")]
     InvalidJson(#[source] serde_json::Error),

+    /// Failed to run schema validation on message.
```
Can this be conditional?
```diff
-/// Failed to run schema validation on message.
+#[cfg(debug_assertions)]
+/// Failed to run schema validation on message.
```
yes, it has to be, fixed
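A minimal sketch of what gating the error variant behind `debug_assertions` looks like. The variants are simplified (no `thiserror`, no payloads), and the `validation_enabled` helper is illustrative, not part of the PR:

```rust
// Sketch, not the PR's exact code: the gated variant only exists in debug
// builds, so release builds compile neither it nor the code constructing it.
#[derive(Debug)]
pub enum ClientError {
    /// Failed to serialize json message.
    InvalidJson,
    /// Failed to run schema validation on message.
    #[cfg(debug_assertions)]
    SchemaValidation,
}

/// Whether schema validation is compiled into this build at all.
pub fn validation_enabled() -> bool {
    cfg!(debug_assertions)
}

fn main() {
    // Prints `true` in a debug build and `false` in a release build.
    println!("schema validation enabled: {}", validation_enabled());
    let err = ClientError::InvalidJson;
    println!("{err:?}");
}
```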
relay-kafka/src/producer/schemas.rs (outdated)

```rust
    _ => return Err(SchemaError::InvalidLogicalTopic),
};

let schema = match sentry_kafka_schemas::get_schema(logical_topic_name, None) {
```
I know this is only for debug builds, but should we keep a cache of schemas by topic name so it isn't loaded from disk for every message? Maybe even put that cache in the `sentry-kafka-schemas` library?
Maybe sometime, but for now we don't want to be opinionated about how people use this library, so we settled for the simplest interface that comes with the least amount of dependencies (aka, return the schema as a string). This goes for the Python library as well: it requires you to bring your own schema validation library right now, and to figure out yourself where it makes sense to store the schema. As such it doesn't have many dependencies. This is a good thing if you're only using the generated types in your service.

I also don't want to build any cache right now because I foresee a (potential, not certain) future where we want/need people to iterate on schemas and Relay in parallel (i.e. temporarily pointing Relay to a local checkout of the schemas repo), and it would be good if changes to the schema files were picked up immediately in local development without recompiling.

I'm also not convinced that this schema validation library is the best one to use. I detest the fact that the schema and the payload have to be deserialized into an intermediate `serde_json::Value`, but it seems that that is common practice for all schema validation libraries.
> it would be good if changes to the schema files were picked up immediately in local development without recompiling.

I agree, I was thinking more of an in-memory cache that picks up the changes when you restart Relay (or re-run a test). I see your argument against putting such a cache in the library, but I still think it makes sense to put it somewhere (e.g. the `KafkaClient`). I'm happy to contribute this to the PR. Or do you see a use case for actual hot reload, where you keep a local Relay instance running and edit the schema while sending in events?
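The kind of in-memory cache discussed here can be sketched with the standard library alone. This is a hypothetical shape, with a stub standing in for `sentry_kafka_schemas::get_schema`; the real crate loads bundled schema files:

```rust
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

// Stand-in for sentry_kafka_schemas::get_schema; in the real crate this
// returns the schema definition for the logical topic.
fn load_schema_from_disk(logical_topic: &str) -> String {
    format!("{{\"title\": \"{logical_topic}\"}}")
}

// Process-wide cache keyed by logical topic name. Entries live until the
// process exits, so edited schema files are still picked up when you
// restart Relay or re-run a test, matching the workflow discussed above.
fn cached_schema(logical_topic: &str) -> String {
    static CACHE: OnceLock<Mutex<HashMap<String, String>>> = OnceLock::new();
    let cache = CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    let mut guard = cache.lock().unwrap();
    guard
        .entry(logical_topic.to_owned())
        .or_insert_with(|| load_schema_from_disk(logical_topic))
        .clone()
}

fn main() {
    let first = cached_schema("events");
    let second = cached_schema("events"); // served from the cache
    assert_eq!(first, second);
    println!("{first}");
}
```

This deliberately trades hot reload for speed: a true hot-reload workflow would re-read the file every time, which is what the uncached version already does.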
relay-kafka/src/producer/schemas.rs (outdated)

```rust
    _ => return Err(SchemaError::InvalidLogicalTopic),
};

let schema = match sentry_kafka_schemas::get_schema(logical_topic_name, None) {
```
nit: Should the library have a `get_latest_schema` helper to make it more explicit what the `None` does? Or an explicit enum with a variant `SchemaVersion::Latest` instead of an `Option`?
I'm happy to add a `get_latest_schema` helper (or rename to have `get_schema` and `get_schema_version`) -- how strongly do you feel about this happening?
> I'm happy to add a get_latest_schema helper

Only if you think it is a good idea.

> how strongly do you feel about this happening?

Not strongly at all, I just noticed I had to jump into the library code to see what the `None` was for.
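The enum alternative floated above could look like the following. This is a hypothetical API sketch, not the library's actual interface; `get_schema` here is a stub that only demonstrates the call-site ergonomics:

```rust
// Hypothetical sketch: an explicit enum instead of Option makes the
// "latest" case self-documenting at the call site.
pub enum SchemaVersion {
    Latest,
    Exact(u16),
}

// Stand-in for sentry_kafka_schemas::get_schema; the body is illustrative.
pub fn get_schema(logical_topic: &str, version: SchemaVersion) -> String {
    match version {
        SchemaVersion::Latest => format!("{logical_topic}@latest"),
        SchemaVersion::Exact(v) => format!("{logical_topic}@v{v}"),
    }
}

fn main() {
    // Compare to `get_schema("events", None)`: the intent is now explicit.
    println!("{}", get_schema("events", SchemaVersion::Latest));
    println!("{}", get_schema("events", SchemaVersion::Exact(2)));
}
```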
```diff
@@ -18,6 +18,8 @@ rmp-serde = { version = "1.1.1", optional = true }
 serde = { version = "1.0.114", features = ["derive"] }
 serde_json = { version = "1.0.55", optional = true }
 thiserror = "1.0.38"
+sentry-kafka-schemas = "0.0.25"
+jsonschema = "0.17.0"
```
I wonder if it would make sense to hide the `jsonschema` dependency in `sentry-kafka-schemas`. That is, if we removed this dependency here, and `sentry-kafka-schemas` exposed a `validate(topic, message)` function.

That way, the library client (Relay) would not have to deal with compiling schemas, and the way that `sentry-kafka-schemas` validates would be completely encapsulated (i.e. you could change the library or even the format it uses for schema definition and validation).
See my reply to the local cache comment. For now I would like to avoid that. There are still many unsolved questions around what the workflow in Relay will look like, what performance requirements we have on the entire system, etc.
OK! Definitely not a blocker.

> As such it doesn't have many dependencies. This is a good thing if you're only using the generated types in your service.

Makes sense, though this could be solved with cargo features. I.e. there could be a `codegen` feature and a `validation` feature, and clients use one or the other. My gut feeling is that 90% of clients that want to perform validation will not care about what schema definition language or validation library is used under the hood, hence the idea to hide it inside the library.
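The feature split suggested here might look like the following Cargo manifest fragment. The feature names and dependency wiring are illustrative, not the library's actual manifest:

```toml
# Hypothetical sketch of splitting the library into cargo features so that
# clients wanting only generated types avoid the validation dependencies.
[features]
default = []
codegen = []                       # only the generated message types
validation = ["dep:jsonschema"]    # compiled schemas plus a validate() helper

[dependencies]
jsonschema = { version = "0.17.0", optional = true }
```

A client like Relay would then depend on the crate with `features = ["validation"]`, while consumers of the generated types leave both features off.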
i'm reconsidering my response now. one of the reasons i was hesitant to change this was because python currently also works the same (it just exposes the raw schema, doesn't prescribe a schema validation library)
but at the same time it would be good to have a vetted recommended library that everybody ought to use (for both python and rust)
@jjbayer after deliberating back and forth I believe we should defer some of the API decisions until we have more feedback from other teams. For that purpose I have summarized the discussion about the API here: getsentry/sentry-kafka-schemas#105 Let me know if I missed anything, for now I would like to merge this PR as-is.
Add a caching layer to #2013 such that the schema does not have to be parsed from disk for every message.

Co-authored-by: Markus Unterwaditzer <[email protected]>
I believe a recent change in sentry-kafka-schemas has made it so that `InvalidSchema` is returned for every call to `get_schema`. Need to investigate, but this PR is broken right now (it won't validate anything at all).
I added
This validation does not run in sentry's local development because (I think) it uses release builds there. In order to make that happen, we could add a "validate schemas" flag to config.yml, off by default.
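Such a flag might look like the following fragment. The key name and nesting are illustrative assumptions, not an existing option in sentry's config.yml:

```yaml
# Hypothetical opt-in flag: off by default so release-built local
# environments keep their current behavior.
relay:
  validate_schemas: false
```

Flipping it to `true` in a developer's config would enable the same per-message validation that debug builds get automatically.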