add opaque type support to hive type serde #11253
This pull request was exported from Phabricator. Differential Revision: D64358220
Summary: My understanding of opaque types in Velox is that Velox doesn't know their underlying type and treats them as a `shared_ptr<void>`. To serialize data across processes, we need to partially break that assumption, because we need to know how to deserialize the opaque data. One option is to make the underlying type part of the serialized type signature; the other is to store this information with the serialized data itself. I'm adopting the first option here.

We also need to introduce a layer of abstraction over the opaque type index by allowing opaque types to be aliased. We can't use the opaque type index directly because it is assumed not to be stable across processes: if you serialize an opaque type as a string in process A and then deserialize it in process B, there is no guarantee the type ID is the same, even if both processes run the same revision.

With this change, callers are required to register an alias for an opaque type before serializing or deserializing it via `HiveTypeSerializer` and `HiveTypeParser`. I put this registry in `Type.h`, but if we want to keep it specific to `HiveTypeSerializer`/`HiveTypeParser` we could move it elsewhere.

Differential Revision: D64358220
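To make the aliasing idea concrete, here is a minimal standalone sketch of an alias registry. It is not the code from this PR: `aliasToTypeIndex`, `registerAlias`, and `typeIndexForAlias` are hypothetical names, and the sketch uses only the standard library rather than Velox's `registerOpaqueType` and exception macros. It only illustrates why a caller-chosen string, rather than a `std::type_index`, is what gets written into the serialized signature.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <typeindex>
#include <typeinfo>
#include <unordered_map>

// Illustrative payload type; any class could be used.
struct Foo {};

// Hypothetical registry: alias -> std::type_index. The alias is a plain
// string chosen by the caller, so it can be written into a serialized type
// signature and resolved again by another process.
inline std::unordered_map<std::string, std::type_index>& aliasToTypeIndex() {
  static std::unordered_map<std::string, std::type_index> map;
  return map;
}

// Registers `alias` for T. Returns false if the alias is already taken.
template <typename T>
bool registerAlias(const std::string& alias) {
  return aliasToTypeIndex().emplace(alias, std::type_index(typeid(T))).second;
}

// Resolves an alias read from serialized data to the local type_index.
// Throws if the alias was never registered in this process.
inline std::type_index typeIndexForAlias(const std::string& alias) {
  auto& map = aliasToTypeIndex();
  auto it = map.find(alias);
  if (it == map.end()) {
    throw std::invalid_argument(
        "Could not find opaque type '" + alias + "'. Was it registered?");
  }
  return it->second;
}
```

Both the serializing and the deserializing process would call `registerAlias<Foo>("bar")` at startup; only the string `"bar"` crosses the process boundary.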
Really cool! Made a few comments, but could you also split the type and fuzzer stuff since they are quite independent?
Thanks!
velox/type/Type.cpp
Outdated
auto it = getTypeIndexByOpaqueAlias().find(name);
VELOX_CHECK(
    it != getTypeIndexByOpaqueAlias().end(),
    "Could not find type {}. Did you call registerOpaqueType?",
nit: please surround the type name with quotes: '{}'
@@ -2008,6 +2008,22 @@ bool registerCustomType(
    const std::string& name,
    std::unique_ptr<const CustomTypeFactories> factories);

std::unordered_map<std::string, std::type_index>& getTypeIndexByOpaqueAlias();
can you add a bit more documentation here about what this does, when it should be used, and some of the restrictions/considerations you added to the PR summary?
velox/type/fbhive/HiveTypeParser.cpp
Outdated
if (nt.isValidType() && nt.isPrimitiveType()) {

if (!nt.isValidType()) {
VELOX_FAIL(fmt::format(
I know you're only copying it, but you can omit the fmt::format as Velox exception macros can do the formatting automatically for you.
velox/type/fbhive/HiveTypeParser.cpp
Outdated
@@ -118,7 +128,16 @@ Result HiveTypeParser::parseType() {
eatToken(TokenType::RightRoundBracket);
}
return Result{scalarType};
} else if (nt.isValidType()) {
} else if (nt.metadata->tokenString[0] == "opaque") {
Should we have an `nt.isOpaqueType()`?
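For illustration, recognizing the `opaque<alias>` form can be sketched as a small string helper. This is an assumption-laden sketch, not the parser from this PR: `parseOpaqueAlias` is a hypothetical name, and the real `HiveTypeParser` works on tokens rather than on the raw string.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

// Returns the alias inside "opaque<alias>", or std::nullopt if `text` does
// not have that form or the alias is empty. Hypothetical helper.
inline std::optional<std::string> parseOpaqueAlias(std::string_view text) {
  constexpr std::string_view kPrefix = "opaque<";
  // Need the prefix, at least one alias character, and the closing '>'.
  if (text.size() <= kPrefix.size() + 1 ||
      text.substr(0, kPrefix.size()) != kPrefix || text.back() != '>') {
    return std::nullopt;
  }
  text.remove_prefix(kPrefix.size());
  text.remove_suffix(1);
  return std::string(text);
}
```

The returned alias would then be looked up in the alias registry to recover the local type.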
registerOpaqueType<Foo>("bar");
HiveTypeParser parser;
auto t = parser.parse("opaque<bar>");
ASSERT_EQ(t->toString(), "OPAQUE<facebook::velox::type::fbhive::Foo>");
very cool :)
HiveTypeParser parser;
auto t = parser.parse("opaque<bar>");
ASSERT_EQ(t->toString(), "OPAQUE<facebook::velox::type::fbhive::Foo>");
}
should we also test an invalid opaque type deserialization?
is it on purpose that we still don't have deserialization tests?
not sure I understand, what deserialization tests? we have parseOpaque()
// Use a custom name to highlight this is just an alias.
registerOpaqueType<Foo>("bar");

std::shared_ptr<const velox::Type> type = velox::OPAQUE<Foo>();
You can omit the `velox::` namespace prefix throughout. You can also use the `TypePtr` alias or just `auto`.
EXPECT_EQ(result, "opaque<bar>");
}

TEST(HiveTypeSerializer, unsupported) {
can we also test some unregistered type inside the opaque?
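A rough standalone sketch of what such a test would exercise: serializing a registered type yields `opaque<alias>`, while an unregistered type is rejected. The names `typeIndexToAlias`, `registerAlias`, and `serializeOpaque` are hypothetical, and the sketch throws `std::invalid_argument` where the Velox code would presumably use its own exception macros.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <typeindex>
#include <typeinfo>
#include <unordered_map>

struct Foo {};           // registered in the usage below
struct Unregistered {};  // deliberately never registered

// Hypothetical reverse registry: std::type_index -> alias.
inline std::unordered_map<std::type_index, std::string>& typeIndexToAlias() {
  static std::unordered_map<std::type_index, std::string> map;
  return map;
}

// Associates `alias` with T, overwriting any previous alias for T.
template <typename T>
void registerAlias(const std::string& alias) {
  typeIndexToAlias()[std::type_index(typeid(T))] = alias;
}

// Produces the serialized signature for an opaque T, or throws if T has no
// registered alias.
template <typename T>
std::string serializeOpaque() {
  auto& map = typeIndexToAlias();
  auto it = map.find(std::type_index(typeid(T)));
  if (it == map.end()) {
    throw std::invalid_argument("Opaque type has no registered alias");
  }
  return "opaque<" + it->second + ">";
}
```

A test for the unregistered case would call `serializeOpaque<Unregistered>()` and expect the exception, mirroring the `unsupported` test in the hunk above.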
velox/vector/fuzzer/VectorFuzzer.cpp
Outdated
@@ -558,6 +558,8 @@ VectorPtr VectorFuzzer::fuzzFlat(const TypePtr& type, vector_size_t size) {
}

return fuzzRow(std::move(childrenVectors), rowType.names(), size);
} else if (type->isOpaque()) {
Cc: @kagamiori @bikramSingh91 I remember some discussion about adding opaque support in Fuzzer; here it is :)
@kunigami could you just split the fuzzer support stuff to a different PR?
I actually had a PR for adding it to the fuzzer, but I missed some codepaths that this test triggered. I'll move it to #11189
nvm, I just messed up the export of the diff stack.
velox/vector/fuzzer/VectorFuzzer.cpp
Outdated
const TypePtr& type,
vector_size_t size) {
auto vector = BaseVector::create(type, size, pool_);
using TFlat = typename KindToFlatVector<TypeKind::OPAQUE>::type;
nit: maybe `TFlatOpaque`?
Force-pushed from 58dfb67 to e037138.
Force-pushed from e037138 to 709fd5c.
Thanks @kunigami, looks pretty good!
velox/type/Type.h
Outdated
std::string getOpaqueAliasForTypeId(std::type_index typeIndex);

// OpaqueType represents a type that is transparent to the Velox type system.
nit: I think the point is that opaque types are NOT transparent to the type system. :) (they are "opaque" to the type system) :)
velox/type/Type.h
Outdated
// So if we were to serialize an opaque type using its std::type_index, we
// might not be able to deserialize it in another process. To solve this
// problem, we require that both the serializing and deserializing processes
// register the opaque type using registerOpaqueType() with the same alias.
thanks for the comment. We usually use "///" for documentation header comments such as this.
@@ -140,11 +159,6 @@ Result HiveTypeParser::parseType() {
default:
Could we use `toString(nt.typeKind())` instead?
(please rebase on main first) @kunigami
Force-pushed from 709fd5c to 4787813.
Force-pushed from 4787813 to 8d0819c.
rebased
Force-pushed from 8d0819c to 7adb14c.
This pull request has been merged in 5cbba09.
Conbench analyzed the 1 benchmark run on the commit. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
Pull Request resolved: facebookincubator#11253
Reviewed By: pedroerp
Differential Revision: D64358220
fbshipit-source-id: fb702e4366592c2f0e84c8ea11b7a2a9f5176854