Prototype: Add initial code for IRv2 decoder. #10

junhaoliao · 2024-10-01T14:17:31Z

The changes in this PR are only meant to facilitate discussion. The original IR decoder was modified directly for ease of prototyping and to show the differences between the two decoders. (We may end up with only one decoder, depending on the outcome of our discussion.)

Description

Validation performed

coderabbitai · 2024-10-01T14:17:42Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

CMakeLists.txt

junhaoliao · 2024-10-01T17:22:56Z

src/clp_ffi_js/ir/StreamReader.cpp

-    bool is_four_bytes_encoding{true};
-    if (auto const err{
-                clp::ffi::ir_stream::get_encoding_type(*zstd_decompressor, is_four_bytes_encoding)
-        };
-        clp::ffi::ir_stream::IRErrorCode::IRErrorCode_Success != err)
-    {
-        SPDLOG_CRITICAL("Failed to decode encoding type, err={}", err);
-        throw ClpFfiJsException{
-                clp::ErrorCode::ErrorCode_MetadataCorrupted,
-                __FILENAME__,
-                __LINE__,
-                "Failed to decode encoding type."
-        };
-    }
-    if (false == is_four_bytes_encoding) {
-        throw ClpFfiJsException{
-                clp::ErrorCode::ErrorCode_Unsupported,
-                __FILENAME__,
-                __LINE__,
-                "IR stream uses unsupported encoding."
-        };
-    }
-


I had to remove this byte-encoding check because it seems to consume bytes from the reader, which causes the subsequent clp::ffi::ir_stream::Deserializer::create(*zstd_decompressor) call to fail with a protocol error. Do we still need this byte-encoding check?

I haven't looked carefully at the clp-core code. Just out of curiosity, are we sticking with 4-byte or 8-byte encoding in IRv2, or do we support both? I recall some discussions about deprecating one in favour of the other, but I don't remember us reaching a conclusion.

For the v2 deserializer, you don't need this check. In terms of encoding types, both are supported by the format. However, the logging library will probably only implement the four-byte encoding interface. The deserializer will automatically handle both encodings.
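To illustrate the failure mode discussed in this thread: a probe that reads header bytes from a shared reader leaves the reader positioned past them, so a parser that expects to start at the first byte then fails. The sketch below is only an illustration under stated assumptions. `BufferReader`, `PositionGuard`, and the two-byte "IR" magic are hypothetical stand-ins and are not part of clp-core. It shows one way a probe could rewind the reader, should such a check ever be needed again.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical in-memory reader standing in for the zstd decompressor.
class BufferReader {
public:
    explicit BufferReader(std::string data) : m_data(std::move(data)) {}

    // Reads `len` bytes into `out` and advances the position.
    // Returns false (without advancing) if fewer than `len` bytes remain.
    auto try_read(std::size_t len, std::string& out) -> bool {
        if (m_data.size() - m_pos < len) {
            return false;
        }
        out = m_data.substr(m_pos, len);
        m_pos += len;
        return true;
    }

    auto get_pos() const -> std::size_t { return m_pos; }

    auto seek(std::size_t pos) -> void { m_pos = pos; }

private:
    std::string m_data;
    std::size_t m_pos{0};
};

// Hypothetical RAII helper: restores the reader's position on scope exit.
class PositionGuard {
public:
    explicit PositionGuard(BufferReader& reader) : m_reader(reader), m_pos(reader.get_pos()) {}

    ~PositionGuard() { m_reader.seek(m_pos); }

    PositionGuard(PositionGuard const&) = delete;
    auto operator=(PositionGuard const&) -> PositionGuard& = delete;

private:
    BufferReader& m_reader;
    std::size_t m_pos;
};

// Checks for the (made-up) two-byte magic. This consumes the bytes it reads.
auto probe_consuming(BufferReader& reader) -> bool {
    std::string magic;
    return reader.try_read(2, magic) && "IR" == magic;
}

// Same check, but the guard rewinds the reader before returning to the caller.
auto probe_restoring(BufferReader& reader) -> bool {
    PositionGuard const guard{reader};
    return probe_consuming(reader);
}

// Stand-in for a deserializer factory that must start at the first byte.
auto parse_stream(BufferReader& reader) -> bool {
    return probe_consuming(reader);
}
```

With the consuming probe, a later `parse_stream` call on the same reader starts two bytes too late and fails. With the restoring probe, the reader is back at its original position and parsing succeeds.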

junhaoliao · 2024-10-01T17:24:05Z

src/clp_ffi_js/ir/StreamReader.cpp

        while (true) {
-            auto result{m_stream_reader_data_context->get_deserializer().deserialize_log_event()};
+            auto result{m_stream_reader_data_context->get_deserializer().deserialize_to_next_log_event(reader)};


Note: pull out the deserializer to reduce line length.

junhaoliao · 2024-10-01T17:27:49Z

src/clp_ffi_js/ir/StreamReader.cpp

-
-        auto const parsed{log_event.get_message().decode_and_unparse()};
-        if (false == parsed.has_value()) {
+        auto const json{log_event.serialize_to_json()};


Here we retrieve the log event as JSON, since I am concerned about the cost of inserting nodes into an Embind object while traversing the log event. Let me know if there are any concerns that I missed.

Some benchmarking might be worthwhile before we decide on that, though.

junhaoliao · 2024-10-01T17:41:23Z

@LinZhihao-723 / @kirkrodrigues
In addition to the inline questions, could you help me with the queries below?

1. Shall we plan for two decoder classes or a single one for IRv1 and IRv2?
   - If we're to have two classes:
     - Shall we release the classes as separate binaries?
       - Note there could be size overhead in the binaries.
       - That likely means we need to differentiate IRv2 from IRv1 in the Log Viewer. I'm not sure how yet.
   - If we wish to maintain a single class:
     - Do we plan to have KeyValuePair as a new encoding type? That way it can be passed as a template parameter to classes.
     - How do we differentiate IRv2 from IRv1 though? Do we check the version number from the file? (I got a bit confused by the clp-core code, as it seems the "beta" version refers to IRv2 while IRv1 is the regular version. Could you confirm?)
2. I think there have been different iterations of this discussion, though I'm not sure about the latest conclusion: what are the official names for IRv1 and IRv2?
   - How should we name the class / classes?

LinZhihao-723 · 2024-10-01T17:54:54Z

In my opinion, we should maintain two classes for now, until we have full support for IR v2, including: (1) backward compatibility with IR v1; (2) metadata kv-pair support. That's also why the version number for v2 is still marked as "beta". However, given my limited understanding of the new log viewer, I don't know the implications or overheads of the two alternatives you proposed. We can talk offline for more details.

There's no special name for IR v1. For IR v2, I think the more formal name we used in our core PRs is "key-value pair IR format".

kirkrodrigues · 2024-10-02T11:46:23Z

> Shall we plan for two decoder classes or a single one for IRv1 and IRv2?

We should plan for one stream reader (a.k.a. decoder / deserializer) class by the time we start using this in production. Until then, we can do whatever is simplest.

> If we're to have two classes, shall we release the classes as separate binaries?

Do we need to? The way I see it, the two readers are part of the same IR library, so it feels easier to keep them in one binary.

> Note there could be size overhead in the binaries.

I guess it's still smaller than the IR streams we end up decoding?

> That likely means we need to differentiate IRv2 from IRv1 in the Log Viewer. Not sure about how yet.

Right, this is another reason to go with one binary.

> If we wish to maintain a single class, do we plan to have the KeyValuePair as a new encoding type? That way it can be passed as a template variable to classes.

The production release plan is that the KV-pair IR format will support deserializing IR v1 streams and expose them with the same API as KV-pair IR streams. So hopefully we can move away from all the different template specializations.

> How do we differentiate IRv2 from IRv1 though? Do we check the version number from the file? (I got a bit confused by the clp-core code as it seems the "beta" version is referring to IRv2 while IRv1 is the regular version. Could you confirm?)

In my mind, the production release will use a different version number to differentiate the two.

> I think there have been different iterations of this discussion though I'm not sure about the latest conclusion: What're the official names for IRv1 and IRv2?

As Zhihao said above.

> How should we name the class / classes?

IRStreamReader and KVPairIRStreamReader? I think "reader" is more appropriate than "decoder" / "deserializer" since these classes really do both. Hopefully we don't have to keep these names for long.

junhaoliao · 2024-10-02T12:12:56Z

@kirkrodrigues Thanks for completing the responses. I'll stick with one binary and redesign the interfaces with a single stream reader in mind.

I also briefly discussed this offline with Zhihao yesterday, and we have these items open for discussion (@LinZhihao-723 please correct me if I'm wrong):

Currently, we need to select the correct deserializer depending on the IR file version, but there is no existing code in clp-core that can directly differentiate an IRv1 stream from an IRv2 one. We explored the options below:
1. Assign a different magic number to IRv2 files.
2. Call decode_preamble(reader) in C++ code and read the version out. However, that would consume bytes from the reader, and directly reusing the reader for deserializer creation would cause protocol failures. We could make a copy before passing it to decode_preamble, though.
3. Since the metadata is stored as JSON in the preamble, "peek" into the first few bytes in JS code to decode the version before calling into C++ code.
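Option 2 could be sketched roughly as follows: read the version from a copy of the reader so that the original stays at the first byte, then pick the reader class based on the result. Everything here is an assumption made for illustration. `BufferReader`, the newline-terminated version line, the "v1" / "v2-beta" strings, and the rule that "beta" means KV-pair IR are all hypothetical, and the real preamble would be decoded with clp-core's decode_preamble rather than the toy `try_read_line`.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical copyable reader over an in-memory buffer.
class BufferReader {
public:
    explicit BufferReader(std::string data) : m_data(std::move(data)) {}

    // Reads up to (and consumes) the next '\n'. Returns false at end of data.
    auto try_read_line(std::string& line) -> bool {
        if (m_pos >= m_data.size()) {
            return false;
        }
        auto const newline_pos = m_data.find('\n', m_pos);
        auto const end = (std::string::npos == newline_pos) ? m_data.size() : newline_pos;
        line = m_data.substr(m_pos, end - m_pos);
        m_pos = (end < m_data.size()) ? end + 1 : end;
        return true;
    }

    auto get_pos() const -> std::size_t { return m_pos; }

private:
    std::string m_data;
    std::size_t m_pos{0};
};

enum class StreamFormat { Ir, KvPairIr };

// Reads the version from a copy, so the caller's reader is left untouched.
// Toy rule (an assumption): a version containing "beta" denotes KV-pair IR.
auto peek_format(BufferReader const& reader) -> StreamFormat {
    BufferReader probe(reader);
    std::string version;
    if (false == probe.try_read_line(version)) {
        throw std::runtime_error{"Missing version."};
    }
    return (std::string::npos != version.find("beta")) ? StreamFormat::KvPairIr
                                                       : StreamFormat::Ir;
}

class StreamReader {
public:
    virtual ~StreamReader() = default;
    virtual auto get_name() const -> std::string = 0;
    virtual auto get_reader_pos() const -> std::size_t = 0;
};

class IRStreamReader : public StreamReader {
public:
    explicit IRStreamReader(BufferReader reader) : m_reader(std::move(reader)) {}
    auto get_name() const -> std::string override { return "IRStreamReader"; }
    auto get_reader_pos() const -> std::size_t override { return m_reader.get_pos(); }

private:
    BufferReader m_reader;
};

class KVPairIRStreamReader : public StreamReader {
public:
    explicit KVPairIRStreamReader(BufferReader reader) : m_reader(std::move(reader)) {}
    auto get_name() const -> std::string override { return "KVPairIRStreamReader"; }
    auto get_reader_pos() const -> std::size_t override { return m_reader.get_pos(); }

private:
    BufferReader m_reader;
};

// Picks the reader class by version; the reader handed over is still at byte 0.
auto create_stream_reader(BufferReader reader) -> std::unique_ptr<StreamReader> {
    if (StreamFormat::KvPairIr == peek_format(reader)) {
        return std::make_unique<KVPairIRStreamReader>(std::move(reader));
    }
    return std::make_unique<IRStreamReader>(std::move(reader));
}
```

Note that copying the toy reader copies its whole buffer. A real reader would presumably need a cheaper copy, or a seek back to the start, to keep this practical for large streams.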

> KV-pair IR format will support deserializing IR v1 streams and expose them with the same API as KV-pair IR streams

Do we have a timeline for this to be ready?

kirkrodrigues · 2024-10-02T12:20:49Z

> Assign a different magic number to IRv2 files.

We could, but at the same time, magic numbers should be used sparingly imo. If one day, CLP IR streams are more pervasive, we don't want the implementers of file type checking tools to have dozens of different entries for the same conceptual format, lol.

> decode_preamble(reader) in C++ code and read the version out, although that would consume bytes from the reader, and directly reusing the reader for deserializer creation would cause protocol failures. We can make a copy before passing it to decode_preamble though.

This seems like the easiest option until we implement backwards compatible deserialization.

> Since the metadata is stored as a JSON in the preamble, "peeking" into the first few bytes in JS code to decode the version out before calling into C++ code is a possible option.

This is probably more work than the previous option.

> KV-pair IR format will support deserializing IR v1 streams...
>
> Do we have a timeline for this to be ready?

Not yet. Hopefully within this month.

junhaoliao · 2024-10-10T14:47:59Z

src/clp_ffi_js/ir/StreamReader.cpp

-                log_num
-        );
-        ++log_num;
+        return std::make_unique<KVPairIRStreamReader>(KVPairIRStreamReader::create(std::move(data_array)));


This seems like bad practice, since we have to recreate the data_buffer in KVPairIRStreamReader's factory function. Shall we make KVPairIRStreamReader's constructor public, or declare a friend, so that we can:

1. seek(0) in the zstd reader,
2. create the deserializer from the zstd reader and then the StreamReaderDataContext, and
3. call KVPairIRStreamReader's constructor here?

As long as we don't expose the constructor via Embind, there should be little concern about misuse of the constructor (at the moment we are not exposing StreamReaderDataContext anyway). What do you think?
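A minimal sketch of the friend variant of this proposal, under stated assumptions: `DataContext` is a hypothetical stand-in for StreamReaderDataContext, and the class bodies are reduced to the bare minimum. The base-class factory builds the context once and moves it into the private constructor, so the derived class never has to recreate the buffer, while outside code still cannot call the constructor.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for StreamReaderDataContext: owns the buffer and a position.
struct DataContext {
    std::string buffer;
    std::size_t pos;
};

class StreamReader {
public:
    virtual ~StreamReader() = default;

    // Factory: builds the context once and hands it to the derived class.
    static auto create(std::string data) -> std::unique_ptr<StreamReader>;

    virtual auto get_buffer_size() const -> std::size_t = 0;
    virtual auto get_pos() const -> std::size_t = 0;
};

class KVPairIRStreamReader : public StreamReader {
public:
    auto get_buffer_size() const -> std::size_t override { return m_context.buffer.size(); }
    auto get_pos() const -> std::size_t override { return m_context.pos; }

private:
    // Only the base-class factory may construct this reader.
    friend class StreamReader;

    explicit KVPairIRStreamReader(DataContext&& context) : m_context(std::move(context)) {}

    DataContext m_context;
};

// The constructor is not reachable from unrelated code.
static_assert(
        false == std::is_constructible<KVPairIRStreamReader, DataContext&&>::value,
        "KVPairIRStreamReader must only be created through StreamReader::create."
);

auto StreamReader::create(std::string data) -> std::unique_ptr<StreamReader> {
    // Equivalent of "seek(0)": the context starts at the beginning of the buffer.
    DataContext context{std::move(data), 0};
    // std::make_unique cannot call a private constructor, so the friend uses `new`.
    return std::unique_ptr<StreamReader>(new KVPairIRStreamReader(std::move(context)));
}
```

The trade-off is that the base class has to know about its derived class, which may be acceptable while there are only two readers.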

junhaoliao · 2024-10-10T20:59:22Z

The PR is not ready for review yet as I'm still cleaning up the log level filtering code for IRv2.

junhaoliao · 2024-10-22T05:00:30Z

This prototype is for reference only and not meant to be merged.

- acd4563 Add initial code for IRv2 decoder.

junhaoliao and others added 8 commits October 8, 2024 19:33

- d883966 Revert changes to enable exception catching linker flags - Update CMakeLists.txt
- 3301fb6 Merge branch 'main' into irv2
- 322b60a Switch clp to OSS' main.
- ab37228 Add code to parse and validate the CLP IR version.
- f2ed47e Templatize StreamReaderDataContext.
- 9062027 Rename IRv2 specialized StreamReader -> KVPairIRStreamReader.
- 0f25962 Create a new StreamReader base class and refactor KVPairIRStreamReader to inherit from it.
- 80f059b Revert KVPairIRStreamReader(StreamReaderDataContext<deserializer_t>&& stream_reader_data_context) back to private.

junhaoliao added 4 commits October 11, 2024 03:21

- 9be08c4 Reformat code.
- 7842d29 Optimize imports.
- 541b84e Merge branch 'main' into irv2 (conflicts: src/clp_ffi_js/ir/StreamReader.cpp, src/clp_ffi_js/ir/StreamReader.hpp)
- 5b49ec0 Rename the original StreamReader -> IrStreamReader and adapt it to the interfaces of the newly proposed StreamReader.

junhaoliao added 2 commits October 11, 2024 19:31

- ff7321b Add log level filtering to KVPairIRStreamReader.
- 98b0b70 Rename IrStreamReader -> IRStreamReader.

junhaoliao requested a review from davemarco October 11, 2024 11:54

junhaoliao changed the title ~~WIP: Add initial code for IRv2 decoder.~~ Prototype: Add initial code for IRv2 decoder. Oct 22, 2024

junhaoliao and others added 5 commits November 1, 2024 05:02

- f080ece Update to latest CLP commit.
- 7c529d8 Upgrade clp submodule commit to 9f6a6ced4da504f6ba3c131efa26fd5b30c6f533
- 6c0ecde Add tree node id parsing for timestampKey and logLevelKey.
- 40908f0 Complete log level filtering.
- e411420 Update yscope-dev-utils version.
