Prototype: Add initial code for IRv2 decoder. #10

Draft
wants to merge 20 commits into
base: main

Conversation

junhaoliao
Collaborator

@junhaoliao junhaoliao commented Oct 1, 2024

The changes in this PR are only to facilitate discussions. The original IR decoder was modified directly for the ease of prototyping and to show the differences between the two decoders. (Maybe we would end up with only one decoder depending on our discussion results.)

Description

Validation performed

coderabbitai bot commented Oct 1, 2024

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

CMakeLists.txt (outdated, resolved)
Comment on lines 49 to 73
bool is_four_bytes_encoding{true};
if (auto const err{
            clp::ffi::ir_stream::get_encoding_type(*zstd_decompressor, is_four_bytes_encoding)
    };
    clp::ffi::ir_stream::IRErrorCode::IRErrorCode_Success != err)
{
    SPDLOG_CRITICAL("Failed to decode encoding type, err={}", err);
    throw ClpFfiJsException{
            clp::ErrorCode::ErrorCode_MetadataCorrupted,
            __FILENAME__,
            __LINE__,
            "Failed to decode encoding type."
    };
}
if (false == is_four_bytes_encoding) {
    throw ClpFfiJsException{
            clp::ErrorCode::ErrorCode_Unsupported,
            __FILENAME__,
            __LINE__,
            "IR stream uses unsupported encoding."
    };
}

Collaborator Author

@junhaoliao junhaoliao Oct 1, 2024

I had to remove this encoding-type check because it seems to consume bytes from the reader, which causes the clp::ffi::ir_stream::Deserializer::create(*zstd_decompressor) call below to fail with a protocol error. Do we still need this check?

I haven't taken a careful look at the clp-core code. Just out of curiosity, are we sticking with 4-byte or 8-byte encoding in IRv2, or do we support both? I recall there were some discussions about deprecating one in favour of the other, but I don't remember whether we reached a conclusion.

Member

For the v2 deserializer, you don't need this check. As for encoding types, the format supports both. However, the logging library will probably only implement the four-byte encoding interface. The deserializer handles both encodings automatically.
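
The failure mode described in this thread can be reproduced without CLP. The sketch below is only an illustration: it uses std::istringstream as a stand-in for the zstd reader, and `read_tag` and `peek_tag` are hypothetical helpers that do not exist in clp-core. It shows that a check which reads from a shared stream advances the read position, and that saving and restoring the position leaves the stream intact for whatever parses it next.

```cpp
#include <istream>
#include <optional>
#include <sstream>

// Hypothetical helper: reads one tag byte and ADVANCES the stream position.
inline std::optional<char> read_tag(std::istream& in) {
    char tag{};
    if (!in.get(tag)) {
        return std::nullopt;
    }
    return tag;
}

// Hypothetical helper: reads one tag byte, then restores the original
// position so that a later parser still sees the stream from where it was.
inline std::optional<char> peek_tag(std::istream& in) {
    auto const start{in.tellg()};
    auto const tag{read_tag(in)};
    in.clear();  // Clear eof/fail bits so that seekg can succeed.
    in.seekg(start);
    return tag;
}
```

Calling `read_tag` before handing the stream to another parser is the analogue of the removed check; `peek_tag` is one way to keep such a check without disturbing the stream, assuming the underlying reader supports seeking.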

while (true) {
auto result{m_stream_reader_data_context->get_deserializer().deserialize_log_event()};
auto result{m_stream_reader_data_context->get_deserializer().deserialize_to_next_log_event(reader)};
Collaborator Author

Note: pull out the deserializer to reduce line length.


auto const parsed{log_event.get_message().decode_and_unparse()};
if (false == parsed.has_value()) {
auto const json{log_event.serialize_to_json()};
Collaborator Author

Here we retrieve the log event as JSON because I am concerned about the cost of inserting nodes into an Embind object one by one while traversing the log event. Let me know if there are any concerns that I missed.

Collaborator Author

Some benchmarking might be worthwhile before we decide on that, though.
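
A benchmark of the two approaches could start from a small timing helper like the one below. This is a generic sketch, not code from this PR: `mean_ns_per_call` is a hypothetical name, and the callable passed in is a placeholder for whichever conversion path is being measured (serializing to JSON, or inserting nodes into an Embind object).

```cpp
#include <chrono>
#include <cstddef>
#include <functional>

// Runs `fn` `iterations` times and returns the mean wall-clock time per call
// in nanoseconds. Returns 0 when `iterations` is 0.
inline double mean_ns_per_call(std::function<void()> const& fn, std::size_t iterations) {
    if (0 == iterations) {
        return 0.0;
    }
    auto const begin{std::chrono::steady_clock::now()};
    for (std::size_t i{0}; i < iterations; ++i) {
        fn();
    }
    auto const end{std::chrono::steady_clock::now()};
    auto const total_ns{std::chrono::duration<double, std::nano>(end - begin).count()};
    return total_ns / static_cast<double>(iterations);
}
```

The numbers are only meaningful when both paths are measured on the same log events and in the same environment, since the costs in question are dominated by crossing the C++/JS boundary.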

@junhaoliao
Collaborator Author

@LinZhihao-723 / @kirkrodrigues
In addition to the inline questions, could you help me with the questions below?

  1. Shall we plan for two decoder classes or a single one for IRv1 and IRv2?
    • If we're to have two classes
      • Shall we release the classes as separate binaries?
        • Note there could be size overhead in the binaries.
        • That likely means we need to differentiate IRv2 from IRv1 in the Log Viewer. Not sure about how yet.
    • If we wish to maintain a single class
      • Do we plan to have the KeyValuePair as a new encoding type? That way it can be passed as a template variable to classes.
      • How do we differentiate IRv2 from IRv1 though? Do we check the version number from the file? (I got a bit confused by the clp-core code as it seems the "beta" version is referring to IRv2 while IRv1 is the regular version. Could you confirm?)
  2. I think there have been different iterations of this discussion though I'm not sure about the latest conclusion: What're the official names for IRv1 and IRv2?
    • How should we name the class / classes?

@LinZhihao-723
Member

LinZhihao-723 commented Oct 1, 2024

  1. In my opinion, we should maintain two classes for now, until we have full support for IR v2, including (1) backward compatibility with IR v1 and (2) metadata kv-pair support. That's also why the version number for v2 is still marked as "beta". However, given my limited understanding of the new log viewer, I don't know the implications or overheads of the two alternatives you proposed. We can talk offline for more details.
  2. There's no special name for IR v1. For IR v2, I think the more formal name we used in our core PRs is "key-value pair IR format".

@kirkrodrigues
Member

  1. Shall we plan for two decoder classes or a single one for IRv1 and IRv2?

We should plan for one stream reader (a.k.a. decoder / deserializer) class by the time we start using this in production. Between now and then, we can do whatever is simplest.

If we're to have two classes, shall we release the classes as separate binaries?

Do we need to? The way I see it, the two readers are part of the same IR library, so it feels easier to keep them in one binary.

Note there could be size overhead in the binaries.

I guess it's still smaller than the IR streams we end up decoding?

That likely means we need to differentiate IRv2 from IRv1 in the Log Viewer. Not sure about how yet.

Right, this is another reason to go with one binary.

If we wish to maintain a single class, do we plan to have the KeyValuePair as a new encoding type? That way it can be passed as a template variable to classes.

The production release plan is that the KV-pair IR format will support deserializing IR v1 streams and expose them with the same API as KV-pair IR streams. So hopefully we can move away from all the different template specializations.

How do we differentiate IRv2 from IRv1 though? Do we check the version number from the file? (I got a bit confused by the clp-core code as it seems the "beta" version is referring to IRv2 while IRv1 is the regular version. Could you confirm?)

In my mind, the production release will use a different version number to differentiate the two.

I think there have been different iterations of this discussion though I'm not sure about the latest conclusion: What're the official names for IRv1 and IRv2?

As Zhihao said above.

How should we name the class / classes?

IRStreamReader and KVPairIRStreamReader? I think reader is more appropriate than decoder / deserializer since these classes really do both. Hopefully we don't have to keep these names for long.
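
One way to keep both readers in a single binary while callers see one type is a shared base class with a factory. The sketch below is an assumption about how that could look, not the design agreed in this PR: the class names IRStreamReader and KVPairIRStreamReader come from the suggestion above, while the `StreamReader` base, the `format_name` method, and `create_stream_reader` are hypothetical.

```cpp
#include <memory>
#include <string>

// Hypothetical common interface; the method set is illustrative only.
class StreamReader {
public:
    virtual ~StreamReader() = default;
    [[nodiscard]] virtual std::string format_name() const = 0;
};

// Reader for IR v1 streams (body omitted in this sketch).
class IRStreamReader : public StreamReader {
public:
    [[nodiscard]] std::string format_name() const override { return "IR v1"; }
};

// Reader for key-value pair IR streams (body omitted in this sketch).
class KVPairIRStreamReader : public StreamReader {
public:
    [[nodiscard]] std::string format_name() const override { return "key-value pair IR"; }
};

// Hypothetical factory: callers only ever hold a StreamReader. How
// `is_kv_pair_ir` gets determined from the stream is out of scope here.
inline std::unique_ptr<StreamReader> create_stream_reader(bool is_kv_pair_ir) {
    if (is_kv_pair_ir) {
        return std::make_unique<KVPairIRStreamReader>();
    }
    return std::make_unique<IRStreamReader>();
}
```

With this shape, the two concrete classes can later collapse into one without changing the calling code.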

@junhaoliao
Collaborator Author

@kirkrodrigues Thanks for completing the responses. I'll stick with one binary and redesign the interfaces with a single stream reader in mind.

I also had a brief offline discussion with Zhihao yesterday, and we have these items open for discussion (@LinZhihao-723 please correct me if I'm wrong):

  1. Currently we need to select the correct deserializer depending on the IR file version, but there is no existing code in clp-core that can directly differentiate an IRv1 stream from an IRv2 one. We explored the options below:

    1. Assign a different magic number to IRv2 files.
    2. decode_preamble(reader) in C++ code and read the version out, although that would consume bytes from the reader, and directly reusing the reader for deserializer creation would cause protocol failures. We can make a copy before passing it to decode_preamble though.
    3. Since the metadata is stored as a JSON in the preamble, "peeking" into the first few bytes in JS code to decode the version out before calling into C++ code is a possible option.
  2. KV-pair IR format will support deserializing IR v1 streams and expose them with the same API as KV-pair IR streams

    Do we have a timeline for this to be ready?
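
The "make a copy before decoding the preamble" idea from option 2 can be sketched as follows. Everything here is a toy under stated assumptions: `BufferReader` is a hypothetical stand-in for the real reader, and the preamble layout (one length byte followed by the version string) is invented for the example and is NOT the real IR format.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Toy reader: a view over a buffer plus a read position. Copying it is cheap
// and does not copy the underlying bytes.
struct BufferReader {
    std::string_view data;
    std::size_t pos{0};
};

// Reads the version from the toy preamble. The reader is taken BY VALUE, so
// all position changes happen on a copy and the caller's reader is untouched.
inline std::optional<std::string> read_version_from_copy(BufferReader reader) {
    if (reader.pos >= reader.data.size()) {
        return std::nullopt;
    }
    auto const len{static_cast<std::size_t>(static_cast<unsigned char>(reader.data[reader.pos]))};
    ++reader.pos;
    if (reader.data.size() - reader.pos < len) {
        return std::nullopt;  // Truncated preamble.
    }
    return std::string{reader.data.substr(reader.pos, len)};
}
```

After the call, the original reader can be passed to whichever deserializer matches the version. Whether the real zstd reader can be copied this cheaply is a separate question that this sketch does not answer.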

@kirkrodrigues
Member

kirkrodrigues commented Oct 2, 2024

Assign a different magic number to IRv2 files.

We could, but at the same time, magic numbers should be used sparingly imo. If one day, CLP IR streams are more pervasive, we don't want the implementers of file type checking tools to have dozens of different entries for the same conceptual format, lol.

decode_preamble(reader) in C++ code and read the version out, although that would consume bytes from the reader, and directly reusing the reader for deserializer creation would cause protocol failures. We can make a copy before passing it to decode_preamble though.

This seems like the easiest option until we implement backwards compatible deserialization.

Since the metadata is stored as a JSON in the preamble, "peeking" into the first few bytes in JS code to decode the version out before calling into C++ code is a possible option.

This is probably more work than the previous option.

KV-pair IR format will support deserializing IR v1 streams...

Do we have a timeline for this to be ready?

Not yet. Hopefully within this month.

log_num
);
++log_num;
return std::make_unique<KVPairIRStreamReader>(KVPairIRStreamReader::create(std::move(data_array)));
Collaborator Author

This seems like bad practice since we have to recreate the data_buffer in KVPairIRStreamReader's factory function. Shall we make KVPairIRStreamReader's constructor public, or make this function a friend, so that we can

  1. seek(0) in the zstd reader,
  2. create the deserializer from the zstd reader and then the StreamReaderDataContext, and
  3. call KVPairIRStreamReader's constructor here?

As long as we don't expose the constructor via Embind, there should be little concern about misuse of the constructor (at the moment we are not exposing StreamReaderDataContext anyway). What do you think?
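
The friend variant of this proposal might look like the sketch below. It is a toy under stated assumptions: `ToyReader` stands in for the zstd reader, `ToyKVPairReader` for KVPairIRStreamReader, and `create_toy_reader` for the factory; none of these names exist in the codebase, and step 2 (creating the deserializer and the StreamReaderDataContext) is left out.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Toy stand-in for the zstd reader.
class ToyReader {
public:
    explicit ToyReader(std::string data) : m_data{std::move(data)} {}
    void seek(std::size_t pos) { m_pos = pos; }
    void skip(std::size_t n) { m_pos += n; }
    [[nodiscard]] std::size_t pos() const { return m_pos; }
    [[nodiscard]] std::size_t size() const { return m_data.size(); }

private:
    std::string m_data;
    std::size_t m_pos{0};
};

class ToyKVPairReader {
public:
    [[nodiscard]] std::size_t start_pos() const { return m_reader.pos(); }

private:
    // Private: only the friend factory below can construct this class, so the
    // constructor is not part of the public (or Embind-exposed) surface.
    explicit ToyKVPairReader(ToyReader reader) : m_reader{std::move(reader)} {}
    friend std::unique_ptr<ToyKVPairReader> create_toy_reader(ToyReader reader);

    ToyReader m_reader;
};

std::unique_ptr<ToyKVPairReader> create_toy_reader(ToyReader reader) {
    reader.skip(4);  // Pretend a version check consumed some bytes.
    reader.seek(0);  // Step 1: rewind instead of rebuilding the buffer.
    // std::make_unique cannot reach a private constructor, so use `new` here.
    return std::unique_ptr<ToyKVPairReader>{new ToyKVPairReader{std::move(reader)}};
}
```

The reader is moved into the new object rather than recreated, which is the saving the comment above is after.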

# Conflicts:
#	src/clp_ffi_js/ir/StreamReader.cpp
#	src/clp_ffi_js/ir/StreamReader.hpp
…e interfaces of the newly proposed StreamReader.
@junhaoliao
Collaborator Author

The PR is not ready for review yet as I'm still cleaning up the log level filtering code for IRv2.

@junhaoliao junhaoliao requested a review from davemarco October 11, 2024 11:54
@junhaoliao
Collaborator Author

This prototype is for reference only and not meant to be merged.

@junhaoliao junhaoliao changed the title WIP: Add initial code for IRv2 decoder. Prototype: Add initial code for IRv2 decoder. Oct 22, 2024