Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add opaque type support to hive type serde #11253

Closed
wants to merge 1 commit into from

Conversation

kunigami
Copy link
Contributor

Summary:
My understanding of opaque types in Velox is that Velox doesn't know about the underlying type of it, and treats them as a shared_ptr<void>. For serializing data across processes, we need to somewhat break that assumption, because when we need to know how to deserialize this opaque data.

One option is to have the underlying type as part of the serialized type signature, the other is to store this information with the serialized data itself. I'm adopting the first option here.

We also need to introduce a layer of abstraction for opaque type index, by allowing aliasing opaque types.

The reason we can't use opaque type index is the assumption that they're not stable across processes. So if you serialize a opaque type as string in process A and then deserialize in process B, even if running the same revision there's no guarantee the type ID is the same.

With this change, callers are required to register an alias for opaque types before serializing/deserializing it via HiveTypeSerializer and HiveTypeParser. I put this registry in Type.h but if we want to keep this specific to HiveTypeSerializer/HiveTypeParser we could move it elsewhere.

Differential Revision: D64358220

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 14, 2024
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D64358220

Copy link

netlify bot commented Oct 14, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 7adb14c
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/6722b7129bdd330008f4de8d

kunigami pushed a commit to kunigami/velox that referenced this pull request Oct 15, 2024
Summary:

My understanding of opaque types in Velox is that Velox doesn't know about the underlying type of it, and treats them as a `shared_ptr<void>`. For serializing data across processes, we need to somewhat break that assumption, because when we need to know how to deserialize this opaque data.

One option is to have the underlying type as part of the serialized type signature, the other is to store this information with the serialized data itself. I'm adopting the first option here.

We also need to introduce a layer of abstraction for opaque type index, by allowing aliasing opaque types. 

The reason we can't use opaque type index is the assumption that they're not stable across processes. So if you serialize a opaque type as string in process A and then deserialize in process B, even if running the same revision there's no guarantee the type ID is the same.

With this change, callers are required to register an alias for opaque types before serializing/deserializing it via `HiveTypeSerializer` and `HiveTypeParser`. I put this registry in `Type.h` but if we want to keep this specific to `HiveTypeSerializer/HiveTypeParser` we could move it elsewhere.

Differential Revision: D64358220
Copy link
Contributor

@pedroerp pedroerp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really cool! Made a few comments, but could you also split the type and fuzzer stuff since they are quite independent?

Thanks!

auto it = getTypeIndexByOpaqueAlias().find(name);
VELOX_CHECK(
it != getTypeIndexByOpaqueAlias().end(),
"Could not find type {}. Did you call registerOpaqueType?",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please surround the type name by quotes '{}'

@@ -2008,6 +2008,22 @@ bool registerCustomType(
const std::string& name,
std::unique_ptr<const CustomTypeFactories> factories);

std::unordered_map<std::string, std::type_index>& getTypeIndexByOpaqueAlias();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a bit more documentation here about what this does, when it should be used, and some of the restrictions/considerations you added to the PR summary?

if (nt.isValidType() && nt.isPrimitiveType()) {

if (!nt.isValidType()) {
VELOX_FAIL(fmt::format(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you're only copying it, but you can omit the fmt::format as Velox exception macros can do the formatting automatically for you.

@@ -118,7 +128,16 @@ Result HiveTypeParser::parseType() {
eatToken(TokenType::RightRoundBracket);
}
return Result{scalarType};
} else if (nt.isValidType()) {
} else if (nt.metadata->tokenString[0] == "opaque") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we have a nt.isOpaqueType()?

registerOpaqueType<Foo>("bar");
HiveTypeParser parser;
auto t = parser.parse("opaque<bar>");
ASSERT_EQ(t->toString(), "OPAQUE<facebook::velox::type::fbhive::Foo>");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very cool :)

HiveTypeParser parser;
auto t = parser.parse("opaque<bar>");
ASSERT_EQ(t->toString(), "OPAQUE<facebook::velox::type::fbhive::Foo>");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also test an invalid opaque type deserialization?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it on purpose that we still don't have deserialization tests?

Copy link
Contributor Author

@kunigami kunigami Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure I understand, what deserialization tests? we have parseOpaque()

// Use a custom name to highlight this is just an alias.
registerOpaqueType<Foo>("bar");

std::shared_ptr<const velox::Type> type = velox::OPAQUE<Foo>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can omit the velox:: namespace prefix throughout. You can also use the TypePtr alias or just auto

EXPECT_EQ(result, "opaque<bar>");
}

TEST(HiveTypeSerializer, unsupported) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also test some unregistered type inside the opaque?

@@ -558,6 +558,8 @@ VectorPtr VectorFuzzer::fuzzFlat(const TypePtr& type, vector_size_t size) {
}

return fuzzRow(std::move(childrenVectors), rowType.names(), size);
} else if (type->isOpaque()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cc: @kagamiori @bikramSingh91 I remember some discussion about adding opaque support in Fuzzer; here it is :)

@kunigami could you just split the fuzzer support stuff to a different PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually had a PR for adding it to fuzzer but I missed some codepaths that this test triggered. I'll move it to
#11189

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm, I just messed the export of the diff stack.

const TypePtr& type,
vector_size_t size) {
auto vector = BaseVector::create(type, size, pool_);
using TFlat = typename KindToFlatVector<TypeKind::OPAQUE>::type;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe TFlatOpaque ?

kunigami pushed a commit to kunigami/velox that referenced this pull request Oct 29, 2024
Summary:

My understanding of opaque types in Velox is that Velox doesn't know about the underlying type of it, and treats them as a `shared_ptr<void>`. For serializing data across processes, we need to somewhat break that assumption, because when we need to know how to deserialize this opaque data.

One option is to have the underlying type as part of the serialized type signature, the other is to store this information with the serialized data itself. I'm adopting the first option here.

We also need to introduce a layer of abstraction for opaque type index, by allowing aliasing opaque types. 

The reason we can't use opaque type index is the assumption that they're not stable across processes. So if you serialize a opaque type as string in process A and then deserialize in process B, even if running the same revision there's no guarantee the type ID is the same.

With this change, callers are required to register an alias for opaque types before serializing/deserializing it via `HiveTypeSerializer` and `HiveTypeParser`. I put this registry in `Type.h` but if we want to keep this specific to `HiveTypeSerializer/HiveTypeParser` we could move it elsewhere.

Differential Revision: D64358220
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D64358220

kunigami pushed a commit to kunigami/velox that referenced this pull request Oct 29, 2024
Summary:

My understanding of opaque types in Velox is that Velox doesn't know about the underlying type of it, and treats them as a `shared_ptr<void>`. For serializing data across processes, we need to somewhat break that assumption, because when we need to know how to deserialize this opaque data.

One option is to have the underlying type as part of the serialized type signature, the other is to store this information with the serialized data itself. I'm adopting the first option here.

We also need to introduce a layer of abstraction for opaque type index, by allowing aliasing opaque types. 

The reason we can't use opaque type index is the assumption that they're not stable across processes. So if you serialize a opaque type as string in process A and then deserialize in process B, even if running the same revision there's no guarantee the type ID is the same.

With this change, callers are required to register an alias for opaque types before serializing/deserializing it via `HiveTypeSerializer` and `HiveTypeParser`. I put this registry in `Type.h` but if we want to keep this specific to `HiveTypeSerializer/HiveTypeParser` we could move it elsewhere.

Differential Revision: D64358220
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D64358220

kunigami pushed a commit to kunigami/velox that referenced this pull request Oct 29, 2024
Summary:

My understanding of opaque types in Velox is that Velox doesn't know about the underlying type of it, and treats them as a `shared_ptr<void>`. For serializing data across processes, we need to somewhat break that assumption, because when we need to know how to deserialize this opaque data.

One option is to have the underlying type as part of the serialized type signature, the other is to store this information with the serialized data itself. I'm adopting the first option here.

We also need to introduce a layer of abstraction for opaque type index, by allowing aliasing opaque types. 

The reason we can't use opaque type index is the assumption that they're not stable across processes. So if you serialize a opaque type as string in process A and then deserialize in process B, even if running the same revision there's no guarantee the type ID is the same.

With this change, callers are required to register an alias for opaque types before serializing/deserializing it via `HiveTypeSerializer` and `HiveTypeParser`. I put this registry in `Type.h` but if we want to keep this specific to `HiveTypeSerializer/HiveTypeParser` we could move it elsewhere.

Differential Revision: D64358220
kunigami pushed a commit to kunigami/velox that referenced this pull request Oct 29, 2024
Summary:

My understanding of opaque types in Velox is that Velox doesn't know about the underlying type of it, and treats them as a `shared_ptr<void>`. For serializing data across processes, we need to somewhat break that assumption, because when we need to know how to deserialize this opaque data.

One option is to have the underlying type as part of the serialized type signature, the other is to store this information with the serialized data itself. I'm adopting the first option here.

We also need to introduce a layer of abstraction for opaque type index, by allowing aliasing opaque types. 

The reason we can't use opaque type index is the assumption that they're not stable across processes. So if you serialize a opaque type as string in process A and then deserialize in process B, even if running the same revision there's no guarantee the type ID is the same.

With this change, callers are required to register an alias for opaque types before serializing/deserializing it via `HiveTypeSerializer` and `HiveTypeParser`. I put this registry in `Type.h` but if we want to keep this specific to `HiveTypeSerializer/HiveTypeParser` we could move it elsewhere.

Differential Revision: D64358220
Copy link
Contributor

@pedroerp pedroerp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kunigami , looks pretty good!


std::string getOpaqueAliasForTypeId(std::type_index typeIndex);

// OpaqueType represents a type that is transparent to the Velox type system.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think the point is that opaque types are NOT transparent to the type system. :) (they are "opaque" to the type system) :)

// So if we were to serialize an opaque type using its std::type_index, we
// might not be able to deserialize it in another process. To solve this
// problem, we require that both the serializing and deserializing processes
// register the opaque type using registerOpaqueType() with the same alias.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the comment. We usually use "///" for documentation header comments such as this.

@@ -140,11 +159,6 @@ Result HiveTypeParser::parseType() {
default:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we use toString(nt.typeKind()) instead?

HiveTypeParser parser;
auto t = parser.parse("opaque<bar>");
ASSERT_EQ(t->toString(), "OPAQUE<facebook::velox::type::fbhive::Foo>");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it on purpose that we still don't have deserialization tests?

@pedroerp
Copy link
Contributor

(please rebase on main first) @kunigami

@pedroerp pedroerp added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Oct 30, 2024
kunigami pushed a commit to kunigami/velox that referenced this pull request Oct 30, 2024
Summary:

My understanding of opaque types in Velox is that Velox doesn't know about the underlying type of it, and treats them as a `shared_ptr<void>`. For serializing data across processes, we need to somewhat break that assumption, because when we need to know how to deserialize this opaque data.

One option is to have the underlying type as part of the serialized type signature, the other is to store this information with the serialized data itself. I'm adopting the first option here.

We also need to introduce a layer of abstraction for opaque type index, by allowing aliasing opaque types. 

The reason we can't use opaque type index is the assumption that they're not stable across processes. So if you serialize a opaque type as string in process A and then deserialize in process B, even if running the same revision there's no guarantee the type ID is the same.

With this change, callers are required to register an alias for opaque types before serializing/deserializing it via `HiveTypeSerializer` and `HiveTypeParser`. I put this registry in `Type.h` but if we want to keep this specific to `HiveTypeSerializer/HiveTypeParser` we could move it elsewhere.

Reviewed By: pedroerp

Differential Revision: D64358220
kunigami pushed a commit to kunigami/velox that referenced this pull request Oct 30, 2024
Summary:

My understanding of opaque types in Velox is that Velox doesn't know about the underlying type of it, and treats them as a `shared_ptr<void>`. For serializing data across processes, we need to somewhat break that assumption, because when we need to know how to deserialize this opaque data.

One option is to have the underlying type as part of the serialized type signature, the other is to store this information with the serialized data itself. I'm adopting the first option here.

We also need to introduce a layer of abstraction for opaque type index, by allowing aliasing opaque types. 

The reason we can't use opaque type index is the assumption that they're not stable across processes. So if you serialize a opaque type as string in process A and then deserialize in process B, even if running the same revision there's no guarantee the type ID is the same.

With this change, callers are required to register an alias for opaque types before serializing/deserializing it via `HiveTypeSerializer` and `HiveTypeParser`. I put this registry in `Type.h` but if we want to keep this specific to `HiveTypeSerializer/HiveTypeParser` we could move it elsewhere.

Reviewed By: pedroerp

Differential Revision: D64358220
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D64358220

1 similar comment
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D64358220

@kunigami
Copy link
Contributor Author

rebased

Summary:

My understanding of opaque types in Velox is that Velox doesn't know about the underlying type of it, and treats them as a `shared_ptr<void>`. For serializing data across processes, we need to somewhat break that assumption, because when we need to know how to deserialize this opaque data.

One option is to have the underlying type as part of the serialized type signature, the other is to store this information with the serialized data itself. I'm adopting the first option here.

We also need to introduce a layer of abstraction for opaque type index, by allowing aliasing opaque types. 

The reason we can't use opaque type index is the assumption that they're not stable across processes. So if you serialize a opaque type as string in process A and then deserialize in process B, even if running the same revision there's no guarantee the type ID is the same.

With this change, callers are required to register an alias for opaque types before serializing/deserializing it via `HiveTypeSerializer` and `HiveTypeParser`. I put this registry in `Type.h` but if we want to keep this specific to `HiveTypeSerializer/HiveTypeParser` we could move it elsewhere.

Reviewed By: pedroerp

Differential Revision: D64358220
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D64358220

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 5cbba09.

Copy link

Conbench analyzed the 1 benchmark run on commit 5cbba097.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

athmaja-n pushed a commit to athmaja-n/velox that referenced this pull request Jan 10, 2025
Summary:
Pull Request resolved: facebookincubator#11253

My understanding of opaque types in Velox is that Velox doesn't know about the underlying type of it, and treats them as a `shared_ptr<void>`. For serializing data across processes, we need to somewhat break that assumption, because when we need to know how to deserialize this opaque data.

One option is to have the underlying type as part of the serialized type signature, the other is to store this information with the serialized data itself. I'm adopting the first option here.

We also need to introduce a layer of abstraction for opaque type index, by allowing aliasing opaque types.

The reason we can't use opaque type index is the assumption that they're not stable across processes. So if you serialize a opaque type as string in process A and then deserialize in process B, even if running the same revision there's no guarantee the type ID is the same.

With this change, callers are required to register an alias for opaque types before serializing/deserializing it via `HiveTypeSerializer` and `HiveTypeParser`. I put this registry in `Type.h` but if we want to keep this specific to `HiveTypeSerializer/HiveTypeParser` we could move it elsewhere.

Reviewed By: pedroerp

Differential Revision: D64358220

fbshipit-source-id: fb702e4366592c2f0e84c8ea11b7a2a9f5176854
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported Merged ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants