-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(clp s): IRv2 ingestion to archive functionality #630
base: main
Are you sure you want to change the base?
feat(clp s): IRv2 ingestion to archive functionality #630
Conversation
WalkthroughThis pull request introduces multiple new source and header files to the Changes
Possibly related PRs
Suggested reviewers
Tip CodeRabbit's docstrings feature is now available as part of our Early Access Program! Simply use the command Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (2)
components/core/src/clp_s/CMakeLists.txt (1)
Line range hint
11-59
: Consider grouping related components into separate CMake targets.The current structure adds all files to a single source list. Consider organizing related components (IR stream, key-value processing, compression) into separate CMake targets for better modularity and dependency management.
Example restructuring:
# IR Stream component add_library(ir_stream STATIC ../clp/ffi/ir_stream/decoding_methods.cpp ../clp/ffi/ir_stream/encoding_methods.cpp # ... other IR stream files ) # Key-Value processing component add_library(kv_processing STATIC ../clp/ffi/KeyValuePairLogEvent.cpp ../clp/ffi/SchemaTree.cpp # ... other KV files ) # Update main target to use these components target_link_libraries(clp-s PRIVATE ir_stream kv_processing # ... other dependencies )components/core/src/clp_s/clp-s.cpp (1)
118-121
: Refactor to eliminate code duplication in error handlingBoth branches of the if-statement handle parsing errors identically. Consider refactoring to reduce duplication and enhance readability.
Apply this diff to refactor the code:
bool success = false; if (CommandLineArguments::FileType::KeyValueIr == option.input_file_type) { - if (false == parser.parse_from_ir()) { - SPDLOG_ERROR("Encountered error while parsing input"); - return false; - } + success = parser.parse_from_ir(); } else { - if (false == parser.parse()) { - SPDLOG_ERROR("Encountered error while parsing input"); - return false; - } + success = parser.parse(); } + if (false == success) { + SPDLOG_ERROR("Encountered error while parsing input"); + return false; + }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
components/core/src/clp_s/CMakeLists.txt
(2 hunks)components/core/src/clp_s/JsonParser.cpp
(2 hunks)components/core/src/clp_s/JsonParser.hpp
(4 hunks)components/core/src/clp_s/clp-s.cpp
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
components/core/src/clp_s/clp-s.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/JsonParser.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/JsonParser.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
📓 Learnings (2)
components/core/src/clp_s/JsonParser.hpp (1)
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-11-10T16:46:53.300Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
components/core/src/clp_s/JsonParser.cpp (1)
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-11-10T16:46:53.300Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
🔇 Additional comments (8)
components/core/src/clp_s/CMakeLists.txt (3)
11-28
: Well-organized IR stream and key-value processing components.
The organization of the IR stream and key-value processing components follows a clean separation of concerns and maintains consistent naming conventions. The structure aligns well with the PR objectives for IRv2 ingestion functionality.
Line range hint 11-59
: Consider adding corresponding test files.
While the implementation files are well-organized, I notice that corresponding test files aren't included in the CMake configuration. Consider adding test files for:
- IR stream serialization/deserialization
- Key-value pair processing
- Schema handling
- UTF-8 utilities
Would you like me to help create a template for these test files or open a GitHub issue to track this task?
54-55
: Verify ZSTD compression integration with IR stream handling.
The addition of ZSTD compression components alongside IR stream handling suggests a dependency between these features. Please ensure that the IR stream serialization properly integrates with the ZSTD compression.
Also applies to: 11-21
✅ Verification successful
ZSTD compression is properly integrated with IR stream handling
The verification shows that ZSTD compression is already properly integrated with IR stream handling:
LogEventSerializer
uses ZSTD compression for writing IR data throughm_zstd_compressor
- The test suite confirms the integration by using
Decompressor
to read the ZSTD-compressed IR stream - The implementation follows a proper lifecycle (open, write, close) for compression handling
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for ZSTD compression usage in IR stream handling
# Search for ZSTD compression usage in IR stream files
echo "Checking ZSTD integration..."
rg -l "zstd|ZSTD" ../clp/ffi/ir_stream/
Length of output: 170
Script:
#!/bin/bash
# Let's first find the correct paths for IR stream and ZSTD files
echo "Finding IR stream and ZSTD related files..."
fd -t f "ir_stream|IR_Stream|zstd|ZSTD" --full-path
# Then check for any connections between IR stream and ZSTD compression
echo -e "\nChecking for ZSTD usage in IR stream related files..."
rg -l "zstd|ZSTD|Zstd" $(fd -t f "ir_stream|IR_Stream")
# Look for IR stream mentions in ZSTD files
echo -e "\nChecking for IR stream references in ZSTD files..."
rg -l "ir_stream|IR_Stream" $(fd -t f "zstd|ZSTD")
Length of output: 9711
Script:
#!/bin/bash
# Let's examine the actual integration points between IR stream and ZSTD
echo "Checking IR serialization code for compression usage..."
rg -C 3 "Compress|Decompress|zstd|ZSTD" components/core/src/clp/ir/LogEventSerializer.cpp components/core/src/clp/ir/LogEventDeserializer.cpp
echo -e "\nChecking test files for IR and ZSTD integration..."
rg -C 3 "Compress|Decompress|zstd|ZSTD" components/core/tests/test-ir_serializer.cpp
Length of output: 2862
components/core/src/clp_s/JsonParser.hpp (5)
4-4
: Inclusion of necessary headers
The added headers provide the required declarations for the new functionalities implemented in the class.
Also applies to: 8-8, 15-17
74-79
: Addition of parse_from_ir()
method
The introduction of the parse_from_ir()
public method enhances the JsonParser
class by enabling parsing from IR streams, aligning with the PR objectives.
95-107
: Definition of get_archive_node_type()
method
The get_archive_node_type()
method correctly determines the archive node type based on the IR node type and its value, which is essential for accurate data conversion.
108-123
: Implementation of get_archive_node_id()
method
The get_archive_node_id()
method efficiently maps IR node IDs to archive node IDs, utilizing caching to optimize performance.
124-135
: Creation of parse_kv_log_event()
method
The parse_kv_log_event()
method effectively parses key-value log events, facilitating their conversion into the archive format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added an initial set of comments. Still need to go through parse_from_ir
onwards, but should be enough comments to iterate on for now.
The other high level question I have is whether the --timestamp-key
argument is supported for this ingestion path? If not we should throw in the same way we do if the --structurize-arrays
argument is passed for IR ingestion.
[[maybe_unused]] UtcOffset utc_offset_old, | ||
[[maybe_unused]] UtcOffset utc_offset_new | ||
) -> IRErrorCode { | ||
return IRErrorCode::IRErrorCode_Success; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like this doesn't actually handle utc offset changes, so shouldn't this return something like ErrorCode_Unsupported instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar questions for the other functions that return success without doing anything. Mostly wondering what the responsibilities/expected behaviour for these functions is supposed to be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The deserializer expects a handler like the one laid out in components/core/src/clp/ffi/ir_stream/IrUnitHandlerInterface.hpp. In the test case for the deserializer the unused functions returned success like this, so I just followed suit. I'm not sure what the goal of the functions were. I think it exposed flexibility where if we wanted to handle each schema node one at a time, do something specific with each event, etc we could. Really we'd have to consult Zhihao to see if he had a specific thing in mind. All functions have pathways to get called by the deserializer, so we need to offer them. Is there something I'm supposed to used in the archive or parser with a UTC change?
.decode_and_unparse() | ||
.value(); | ||
} | ||
auto validated_escaped_encoded_string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, don't want to escape the string here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this and other utf8 validation cases we decided offline to leave it as is for now unless the escaping changes get merged before this PR is complete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (7)
components/core/src/clp_s/JsonParser.hpp (6)
76-80
: Consider enhancing method documentation.While the documentation is clear, it would be beneficial to add:
- @throw section documenting potential exceptions
- Example of expected IR Stream format
- Description of failure scenarios
97-108
: Fix documentation inconsistency with parameters.The documentation mentions parameters that don't match the method signature:
ir_node_type
→ not in signaturenode_has_value
→ not in signaturenode_value
→ not in signatureUpdate documentation to match the actual parameters:
tree
kv_pair
110-122
: Add missing documentation elements.Please add:
- @return description
- Update parameter list to match signature (parent_node_id is in signature but not documented)
136-140
: Enhance method documentation.Consider adding:
- Description of the parsing process
- @throw section if applicable
- Expected format of the KeyValuePairLogEvent
186-187
: Add documentation for the new member variable.Please add a comment explaining:
- Purpose of the mapping
- Relationship between IR nodes and archive nodes
- Why flat_hash_map was chosen over std::unordered_map
76-80
: Document error handling behaviour.Based on previous implementation patterns, please document that errors from
kv_log_event_result.error()
aside fromstd::errc::no_message_available
andstd::errc::result_out_of_range
are expected behaviour.components/core/src/clp_s/JsonParser.cpp (1)
688-749
: Consider converting recursive implementation to iterative.The current recursive implementation could potentially cause stack overflow with deep trees. Additionally, profilers often have trouble attributing runtime to recursive functions.
Consider converting this to an iterative implementation using a stack in a follow-up PR.
Also, use
auto const
for immutable views:- int curr_node_archive_id{constants::cRootNodeId}; + auto const curr_node_archive_id{constants::cRootNodeId};
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
components/core/src/clp_s/JsonParser.cpp
(2 hunks)components/core/src/clp_s/JsonParser.hpp
(5 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
components/core/src/clp_s/JsonParser.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/JsonParser.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
📓 Learnings (2)
components/core/src/clp_s/JsonParser.hpp (1)
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-11-10T16:46:53.300Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
components/core/src/clp_s/JsonParser.cpp (3)
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-11-10T16:46:53.300Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
🔇 Additional comments (5)
components/core/src/clp_s/JsonParser.hpp (2)
4-19
: LGTM! Includes are well-organized and necessary.
The new includes are properly ordered and support the added IR functionality.
36-36
: LGTM! Using directive is appropriately scoped.
The using directive is specific to the needed type and well-placed after includes.
components/core/src/clp_s/JsonParser.cpp (3)
3-33
: LGTM! Well-organized includes and namespace declarations.
The includes are logically grouped and the using declarations improve code readability.
35-75
: LGTM! Clean implementation of IrUnitHandlerInterface.
The class provides a clear implementation of the interface with appropriate documentation. The empty implementations returning success are consistent with the test cases and expected behaviour.
842-919
: LGTM! Well-structured IR parsing implementation.
The method provides comprehensive error handling and follows a clear flow of operations. The cleanup of resources is properly handled in error cases.
void JsonParser::parse_kv_log_event(KeyValuePairLogEvent const& kv) { | ||
clp::ffi::SchemaTree const& tree = kv.get_schema_tree(); | ||
for (auto const& pair : kv.get_node_id_value_pairs()) { | ||
NodeType const archive_node_type = get_archive_node_type(tree, pair); | ||
auto const node_id = get_archive_node_id(pair.first, archive_node_type, tree); | ||
|
||
switch (archive_node_type) { | ||
case NodeType::Integer: { | ||
auto const i64_value | ||
= pair.second.value().get_immutable_view<clp::ffi::value_int_t>(); | ||
m_current_parsed_message.add_value(node_id, i64_value); | ||
} break; | ||
case NodeType::Float: { | ||
auto const d_value | ||
= pair.second.value().get_immutable_view<clp::ffi::value_float_t>(); | ||
m_current_parsed_message.add_value(node_id, d_value); | ||
} break; | ||
case NodeType::Boolean: { | ||
auto const b_value | ||
= pair.second.value().get_immutable_view<clp::ffi::value_bool_t>(); | ||
m_current_parsed_message.add_value(node_id, b_value); | ||
} break; | ||
case NodeType::VarString: { | ||
auto const validated_escaped_string = clp::ffi::validate_and_escape_utf8_string( | ||
pair.second.value().get_immutable_view<std::string>() | ||
); | ||
std::string str; | ||
if (validated_escaped_string.has_value()) { | ||
str = validated_escaped_string.value(); | ||
} else { | ||
SPDLOG_ERROR( | ||
"String is not utf8 compliant: \"{}\"", | ||
pair.second.value().get_immutable_view<std::string>() | ||
); | ||
throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__); | ||
} | ||
m_current_parsed_message.add_value(node_id, str); | ||
} break; | ||
case NodeType::ClpString: { | ||
std::string encoded_str; | ||
std::string decoded_value; | ||
if (pair.second.value().is<clp::ir::EightByteEncodedTextAst>()) { | ||
decoded_value = pair.second.value() | ||
.get_immutable_view<clp::ir::EightByteEncodedTextAst>() | ||
.decode_and_unparse() | ||
.value(); | ||
|
||
} else { | ||
decoded_value = pair.second.value() | ||
.get_immutable_view<clp::ir::FourByteEncodedTextAst>() | ||
.decode_and_unparse() | ||
.value(); | ||
} | ||
auto const validated_escaped_encoded_string | ||
= clp::ffi::validate_and_escape_utf8_string(decoded_value.c_str()); | ||
if (validated_escaped_encoded_string.has_value()) { | ||
encoded_str = validated_escaped_encoded_string.value(); | ||
} else { | ||
SPDLOG_ERROR("Encoded string is not utf8 compliant: \"{}\"", decoded_value); | ||
throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__); | ||
} | ||
m_current_parsed_message.add_value(node_id, encoded_str); | ||
} break; | ||
case NodeType::UnstructuredArray: { | ||
std::string array_str; | ||
if (pair.second.value().is<clp::ir::EightByteEncodedTextAst>()) { | ||
array_str = pair.second.value() | ||
.get_immutable_view<clp::ir::EightByteEncodedTextAst>() | ||
.decode_and_unparse() | ||
.value(); | ||
} else { | ||
array_str = pair.second.value() | ||
.get_immutable_view<clp::ir::FourByteEncodedTextAst>() | ||
.decode_and_unparse() | ||
.value(); | ||
} | ||
m_current_parsed_message.add_value(node_id, array_str); | ||
break; | ||
} | ||
default: | ||
// Don't need to add value for obj or null | ||
break; | ||
} | ||
m_current_schema.insert_ordered(node_id); | ||
} | ||
|
||
int32_t const current_schema_id = m_archive_writer->add_schema(m_current_schema); | ||
m_current_parsed_message.set_id(current_schema_id); | ||
m_archive_writer->append_message(current_schema_id, m_current_schema, m_current_parsed_message); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Simplify string handling and remove unnecessary UTF-8 validation.
The method contains unnecessary UTF-8 validation and string escaping. These operations should be handled during compression.
Apply these changes to simplify string handling:
- Remove UTF-8 validation in VarString case
- Remove UTF-8 validation in ClpString case
- Use
auto const
for immutable views
case NodeType::VarString: {
- auto const validated_escaped_string = clp::ffi::validate_and_escape_utf8_string(
- pair.second.value().get_immutable_view<std::string>()
- );
- std::string str;
- if (validated_escaped_string.has_value()) {
- str = validated_escaped_string.value();
- } else {
- SPDLOG_ERROR(
- "String is not utf8 compliant: \"{}\"",
- pair.second.value().get_immutable_view<std::string>()
- );
- throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
- }
+ auto const& str = pair.second.value().get_immutable_view<std::string>();
m_current_parsed_message.add_value(node_id, str);
} break;
Committable suggestion skipped: line range outside the PR's diff.
auto JsonParser::add_node_to_archive_and_translations( | ||
uint32_t ir_node_id, | ||
clp::ffi::SchemaTree::Node const& ir_node_to_add, | ||
NodeType archive_node_type, | ||
int32_t parent_node_id | ||
) -> int { | ||
auto validated_escaped_key | ||
= clp::ffi::validate_and_escape_utf8_string(ir_node_to_add.get_key_name()); | ||
std::string node_key; | ||
if (validated_escaped_key.has_value()) { | ||
node_key = validated_escaped_key.value(); | ||
} else { | ||
SPDLOG_ERROR("Key is not UTF-8 compliant: \"{}\"", ir_node_to_add.get_key_name()); | ||
throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__); | ||
} | ||
int const curr_node_archive_id | ||
= m_archive_writer->add_node(parent_node_id, archive_node_type, node_key); | ||
|
||
m_ir_node_to_archive_node_id_mapping.emplace( | ||
std::make_pair(ir_node_id, archive_node_type), | ||
curr_node_archive_id | ||
); | ||
return curr_node_archive_id; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Remove unnecessary UTF-8 validation.
Per previous discussions, UTF-8 validation is unnecessary at this stage and should be handled during compression.
Apply this diff to simplify the function:
auto JsonParser::add_node_to_archive_and_translations(
uint32_t ir_node_id,
clp::ffi::SchemaTree::Node const& ir_node_to_add,
NodeType archive_node_type,
int32_t parent_node_id
) -> int {
- auto validated_escaped_key
- = clp::ffi::validate_and_escape_utf8_string(ir_node_to_add.get_key_name());
- std::string node_key;
- if (validated_escaped_key.has_value()) {
- node_key = validated_escaped_key.value();
- } else {
- SPDLOG_ERROR("Key is not UTF-8 compliant: \"{}\"", ir_node_to_add.get_key_name());
- throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
- }
+ std::string const& node_key = ir_node_to_add.get_key_name();
int const curr_node_archive_id
= m_archive_writer->add_node(parent_node_id, archive_node_type, node_key);
m_ir_node_to_archive_node_id_mapping.emplace(
std::make_pair(ir_node_id, archive_node_type),
curr_node_archive_id
);
return curr_node_archive_id;
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
auto JsonParser::add_node_to_archive_and_translations( | |
uint32_t ir_node_id, | |
clp::ffi::SchemaTree::Node const& ir_node_to_add, | |
NodeType archive_node_type, | |
int32_t parent_node_id | |
) -> int { | |
auto validated_escaped_key | |
= clp::ffi::validate_and_escape_utf8_string(ir_node_to_add.get_key_name()); | |
std::string node_key; | |
if (validated_escaped_key.has_value()) { | |
node_key = validated_escaped_key.value(); | |
} else { | |
SPDLOG_ERROR("Key is not UTF-8 compliant: \"{}\"", ir_node_to_add.get_key_name()); | |
throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__); | |
} | |
int const curr_node_archive_id | |
= m_archive_writer->add_node(parent_node_id, archive_node_type, node_key); | |
m_ir_node_to_archive_node_id_mapping.emplace( | |
std::make_pair(ir_node_id, archive_node_type), | |
curr_node_archive_id | |
); | |
return curr_node_archive_id; | |
} | |
auto JsonParser::add_node_to_archive_and_translations( | |
uint32_t ir_node_id, | |
clp::ffi::SchemaTree::Node const& ir_node_to_add, | |
NodeType archive_node_type, | |
int32_t parent_node_id | |
) -> int { | |
std::string const& node_key = ir_node_to_add.get_key_name(); | |
int const curr_node_archive_id | |
= m_archive_writer->add_node(parent_node_id, archive_node_type, node_key); | |
m_ir_node_to_archive_node_id_mapping.emplace( | |
std::make_pair(ir_node_id, archive_node_type), | |
curr_node_archive_id | |
); | |
return curr_node_archive_id; | |
} |
auto JsonParser::get_archive_node_type( | ||
clp::ffi::SchemaTree const& tree, | ||
std::pair<clp::ffi::SchemaTree::Node::id_t, std::optional<clp::ffi::Value>> const& kv_pair | ||
) -> NodeType { | ||
clp::ffi::SchemaTree::Node const& tree_node = tree.get_node(kv_pair.first); | ||
clp::ffi::SchemaTree::Node::Type const ir_node_type = tree_node.get_type(); | ||
bool const node_has_value = kv_pair.second.has_value(); | ||
std::optional<clp::ffi::Value> node_value{}; | ||
if (node_has_value) { | ||
node_value = kv_pair.second.value(); | ||
} | ||
switch (ir_node_type) { | ||
case clp::ffi::SchemaTree::Node::Type::Int: | ||
return NodeType::Integer; | ||
case clp::ffi::SchemaTree::Node::Type::Float: | ||
return NodeType::Float; | ||
case clp::ffi::SchemaTree::Node::Type::Bool: | ||
return NodeType::Boolean; | ||
case clp::ffi::SchemaTree::Node::Type::UnstructuredArray: | ||
return NodeType::UnstructuredArray; | ||
case clp::ffi::SchemaTree::Node::Type::Str: | ||
if (node_value && node_value->is<std::string>()) { | ||
return NodeType::VarString; | ||
} else { | ||
return NodeType::ClpString; | ||
} | ||
case clp::ffi::SchemaTree::Node::Type::Obj: | ||
if (node_has_value) { | ||
if (node_value->is_null()) { | ||
return NodeType::NullValue; | ||
} | ||
} | ||
return NodeType::Object; | ||
default: | ||
SPDLOG_ERROR("Unknown IR Node Type Detected"); | ||
throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Simplify function structure and improve error handling.
Consider these improvements:
- Simplify the function by directly returning types in the switch cases
- Return
NodeType::Unknown
for unknown IR node types instead of throwing an error
Apply this diff to simplify the function:
auto JsonParser::get_archive_node_type(
clp::ffi::SchemaTree const& tree,
std::pair<clp::ffi::SchemaTree::Node::id_t, std::optional<clp::ffi::Value>> const& kv_pair
) -> NodeType {
clp::ffi::SchemaTree::Node const& tree_node = tree.get_node(kv_pair.first);
clp::ffi::SchemaTree::Node::Type const ir_node_type = tree_node.get_type();
- bool const node_has_value = kv_pair.second.has_value();
- std::optional<clp::ffi::Value> node_value{};
- if (node_has_value) {
- node_value = kv_pair.second.value();
- }
+ auto const& node_value = kv_pair.second;
switch (ir_node_type) {
case clp::ffi::SchemaTree::Node::Type::Int:
return NodeType::Integer;
case clp::ffi::SchemaTree::Node::Type::Float:
return NodeType::Float;
case clp::ffi::SchemaTree::Node::Type::Bool:
return NodeType::Boolean;
case clp::ffi::SchemaTree::Node::Type::UnstructuredArray:
return NodeType::UnstructuredArray;
case clp::ffi::SchemaTree::Node::Type::Str:
if (node_value && node_value->is<std::string>()) {
return NodeType::VarString;
}
return NodeType::ClpString;
case clp::ffi::SchemaTree::Node::Type::Obj:
if (node_value && node_value->is_null()) {
return NodeType::NullValue;
}
return NodeType::Object;
default:
- SPDLOG_ERROR("Unknown IR Node Type Detected");
- throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
+ return NodeType::Unknown;
}
}
Committable suggestion skipped: line range outside the PR's diff.
…rchive-JsonParser-Functionality
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (4)
components/core/src/clp_s/JsonParser.cpp (4)
624-661
: 🛠️ Refactor suggestionSimplify the
get_archive_node_type
function for clarityConsider refactoring the
get_archive_node_type
function to directly return the node type in each case. This simplifies the function, enhances readability, and aligns with best practices.Apply this diff to simplify the function:
auto JsonParser::get_archive_node_type( clp::ffi::SchemaTree const& tree, std::pair<clp::ffi::SchemaTree::Node::id_t, std::optional<clp::ffi::Value>> const& kv_pair ) -> NodeType { clp::ffi::SchemaTree::Node const& tree_node = tree.get_node(kv_pair.first); clp::ffi::SchemaTree::Node::Type const ir_node_type = tree_node.get_type(); auto const& node_value = kv_pair.second; switch (ir_node_type) { case clp::ffi::SchemaTree::Node::Type::Int: return NodeType::Integer; case clp::ffi::SchemaTree::Node::Type::Float: return NodeType::Float; case clp::ffi::SchemaTree::Node::Type::Bool: return NodeType::Boolean; case clp::ffi::SchemaTree::Node::Type::UnstructuredArray: return NodeType::UnstructuredArray; case clp::ffi::SchemaTree::Node::Type::Str: if (node_value && node_value->is<std::string>()) { return NodeType::VarString; } else { return NodeType::ClpString; } case clp::ffi::SchemaTree::Node::Type::Obj: if (node_value && node_value->is_null()) { return NodeType::NullValue; } else { return NodeType::Object; } default: return NodeType::Unknown; } }
663-686
: 🛠️ Refactor suggestionRemove unnecessary UTF-8 validation in
add_node_to_archive_and_translations
UTF-8 validation is unnecessary at this stage and should be handled during compression. Removing this validation simplifies the code and reduces overhead.
Apply this diff to remove the UTF-8 validation:
auto JsonParser::add_node_to_archive_and_translations( uint32_t ir_node_id, clp::ffi::SchemaTree::Node const& ir_node_to_add, NodeType archive_node_type, int32_t parent_node_id ) -> int { - auto validated_escaped_key - = clp::ffi::validate_and_escape_utf8_string(ir_node_to_add.get_key_name()); - std::string node_key; - if (validated_escaped_key.has_value()) { - node_key = validated_escaped_key.value(); - } else { - SPDLOG_ERROR("Key is not UTF-8 compliant: \"{}\"", ir_node_to_add.get_key_name()); - throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__); - } + std::string const& node_key = ir_node_to_add.get_key_name(); int const curr_node_archive_id = m_archive_writer->add_node(parent_node_id, archive_node_type, node_key); m_ir_node_to_archive_node_id_mapping.emplace( std::make_pair(ir_node_id, archive_node_type), curr_node_archive_id ); return curr_node_archive_id; }
774-786
: 🛠️ Refactor suggestionEliminate unnecessary UTF-8 validation in
parse_kv_log_event
forVarString
nodesUTF-8 validation is not required here and can be safely removed. This change reduces redundancy and enhances performance.
Apply this diff:
case NodeType::VarString: { - auto const validated_escaped_string = clp::ffi::validate_and_escape_utf8_string( - pair.second.value().get_immutable_view<std::string>() - ); - std::string str; - if (validated_escaped_string.has_value()) { - str = validated_escaped_string.value(); - } else { - SPDLOG_ERROR( - "String is not utf8 compliant: \"{}\"", - pair.second.value().get_immutable_view<std::string>() - ); - throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__); - } + auto const& str = pair.second.value().get_immutable_view<std::string>(); m_current_parsed_message.add_value(node_id, str); } break;
790-813
: 🛠️ Refactor suggestionRemove unnecessary UTF-8 validation for encoded strings in
parse_kv_log_event
UTF-8 validation of encoded strings is not needed at this point and should be deferred to the compression phase. Simplifying this code improves efficiency.
Apply this diff:
case NodeType::ClpString: { std::string encoded_str; std::string decoded_value; if (pair.second.value().is<clp::ir::EightByteEncodedTextAst>()) { decoded_value = pair.second.value() .get_immutable_view<clp::ir::EightByteEncodedTextAst>() .decode_and_unparse() .value(); } else { decoded_value = pair.second.value() .get_immutable_view<clp::ir::FourByteEncodedTextAst>() .decode_and_unparse() .value(); } - auto const validated_escaped_encoded_string - = clp::ffi::validate_and_escape_utf8_string(decoded_value.c_str()); - if (validated_escaped_encoded_string.has_value()) { - encoded_str = validated_escaped_encoded_string.value(); - } else { - SPDLOG_ERROR("Encoded string is not utf8 compliant: \"{}\"", decoded_value); - throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__); - } + encoded_str = decoded_value; m_current_parsed_message.add_value(node_id, encoded_str); } break;
🧹 Nitpick comments (1)
components/core/src/clp_s/JsonParser.cpp (1)
728-728
: Adhere to coding guidelines by replacingnot
withfalse ==
Per the coding guidelines, prefer
false == <expression>
rather than!<expression>
. Replacewhile (not ir_id_stack.empty())
withwhile (false == ir_id_stack.empty())
for consistency.Apply this diff:
- while (not ir_id_stack.empty()) { + while (false == ir_id_stack.empty()) {
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp_s/JsonParser.cpp
(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
components/core/src/clp_s/JsonParser.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
📓 Learnings (1)
components/core/src/clp_s/JsonParser.cpp (3)
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-11-10T16:46:53.300Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
SPDLOG_ERROR("ERROR: Encountered error while parsing a kv log event"); | ||
zd.close(); | ||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Avoid unnecessary error logging in parse_from_ir
Encountering errors in kv_log_event_result.error()
is anticipated behaviour and does not require additional logging. Removing this error log prevents cluttering the logs with expected messages.
Apply this diff:
} catch (...) {
- SPDLOG_ERROR("ERROR: Encountered error while parsing a kv log event");
zd.close();
return false;
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
SPDLOG_ERROR("ERROR: Encountered error while parsing a kv log event"); | |
zd.close(); | |
return false; | |
} | |
zd.close(); | |
return false; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly smaller changes this time.
Still want to confirm about the --timestamp-key
behaviour I asked about in the last round of review.
Also wondering how we're tracking original size of the IR stream for metadata purposes. I didn't see anything in our code, and I'm guessing that the ir stream itself doesn't provide anything, but we should probably at least use an estimate like number of decompressed bytes consumed by the Decompressor or something.
|
||
auto JsonParser::parse_from_ir() -> bool { | ||
for (auto& file_path : m_file_paths) { | ||
auto const fsize = std::filesystem::file_size(file_path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we do this because the Deserializer treats empty streams as invalid streams? I can see why this makes sense in that case, but it might be simpler to just let it fail on empty streams.
Up to you whether you want to change this or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I could also just continue onto the next file rather than returning. What do you think? I'll leave it as is for now, but would be a quick change to remove or to change to continue instead of return.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove it then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (3)
components/core/src/clp_s/JsonParser.cpp (3)
665-673
: 🛠️ Refactor suggestionRemove unnecessary UTF-8 validation and exception throwing
Based on previous discussions and learnings, the validation and exception throwing for UTF-8 compliance of
ir_node_to_add.get_key_name()
are unnecessary here and should be omitted.Apply this diff to simplify the code:
-auto validated_escaped_key - = clp::ffi::validate_and_escape_utf8_string(ir_node_to_add.get_key_name()); -std::string node_key; -if (validated_escaped_key.has_value()) { - node_key = validated_escaped_key.value(); -} else { - SPDLOG_ERROR("Key is not UTF-8 compliant: \"{}\"", ir_node_to_add.get_key_name()); - throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__); -} +std::string const& node_key = ir_node_to_add.get_key_name();
770-782
: 🛠️ Refactor suggestionEliminate redundant UTF-8 validation and exception handling
UTF-8 validation and exception throwing for string values are unnecessary at this stage and should be removed to simplify the code.
Apply this diff to remove the validation:
-auto const validated_escaped_string = clp::ffi::validate_and_escape_utf8_string( - pair.second.value().get_immutable_view<std::string>() -); -std::string str; -if (validated_escaped_string.has_value()) { - str = validated_escaped_string.value(); -} else { - SPDLOG_ERROR( - "String is not utf8 compliant: \"{}\"", - pair.second.value().get_immutable_view<std::string>() - ); - throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__); -} +auto const& str = pair.second.value().get_immutable_view<std::string>();
801-807
: 🛠️ Refactor suggestionRemove unnecessary validation and exception throwing for encoded strings
The UTF-8 validation and exception handling for encoded strings are unnecessary and can be safely removed.
Apply this diff to simplify the code:
-auto const validated_escaped_encoded_string - = clp::ffi::validate_and_escape_utf8_string(decoded_value.c_str()); -if (validated_escaped_encoded_string.has_value()) { - encoded_str = validated_escaped_encoded_string.value(); -} else { - SPDLOG_ERROR("Encoded string is not utf8 compliant: \"{}\"", decoded_value); - throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__); -} +std::string const& encoded_str = decoded_value;
🧹 Nitpick comments (3)
components/core/src/clp_s/JsonParser.cpp (3)
892-899
: Simplify exception handling inparse_from_ir
The
catch
block rethrows the exception after logging and clean-up, which may be unnecessary. Consider allowing the exception to propagate naturally or handle it differently to reduce code complexity.Consider modifying the code as follows:
try { parse_kv_log_event(*kv_log_event); } catch (std::exception const& e) { - SPDLOG_ERROR("Encountered error while parsing a kv log event - {}", e.what()); - decompressor.try_get_pos(curr_pos); - m_archive_writer->increment_uncompressed_size(curr_pos - last_pos); - decompressor.close(); - return false; + throw; }
650-652
: Simplify condition by relying onnode_value
Since
node_value
is only meaningful when it has a value, you can simplify the condition by checkingnode_value
directly.Modify the condition as follows:
-if (node_has_value && node_value.is_null()) { +if (node_value && node_value->is_null()) {
631-634
: Avoid unnecessary default initialization ofnode_value
Initializing
node_value
when it may not be used can be avoided. Use a pointer or reference to the optional value to simplify the code.Consider changing the code to:
-clp::ffi::Value node_value{}; -if (node_has_value) { - node_value = kv_pair.second.value(); -} +auto const& node_value = kv_pair.second;Then adjust the subsequent code to handle
node_value
appropriately.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
components/core/src/clp_s/CommandLineArguments.cpp
(1 hunks)components/core/src/clp_s/JsonParser.cpp
(2 hunks)components/core/src/clp_s/JsonParser.hpp
(5 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
components/core/src/clp_s/JsonParser.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/CommandLineArguments.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/JsonParser.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
📓 Learnings (1)
components/core/src/clp_s/JsonParser.cpp (2)
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.
🔇 Additional comments (2)
components/core/src/clp_s/CommandLineArguments.cpp (1)
277-284
: LGTM! Input validation enhancement.
The added validation ensures that the --timestamp-key
option is not used with kv-ir
file type, preventing potential issues during processing. The error message is clear and follows the existing error handling pattern.
components/core/src/clp_s/JsonParser.hpp (1)
98-103
: Update function documentation to match parameters
The documentation for get_archive_node_type
does not align with its current parameters. Please update the function comments to accurately reflect the parameters and their descriptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more minor changes. Just confirming whether it is required to support UtcOffsetChange/expected behaviour for handling it. Should be good to go after that.
Description
This pull request is the continuation or PR #618, adding the functionality promised. This adds the corresponding ingestion functionality to compress IRv2 into archive.
parse_from_ir()
clp::ffi::ir_stream::IrUnitHandlerInterface
for use deserializing IRv2 files;IrUnitHandler
KeyValuePairLogEvent
converting the log event into the correct format for archive;parse_kv_log_event()
get_archive_node_type
get_archive_node_id
Validation performed
For our posted JSON Datasets:
Generated the corresponding IRv2 file
Compress the IRv2 file to archive
./clp-s c <archive dir> <ir file or directory w/ ir file> --file-type kv-ir
Extract archive back to JSON
./clp-s x <archive dir> <output JSON dir>
Sort keys and compare to JSON generated by clp-s from compressing and extracting directly from original JSON (spark, elasticsearch, and postgresql)
jq -S -c '.' <output JSON dir>/original > out.json
jq -S -c '.' <output JSON dir>/original | sort > out_sorted.json
diff -u out.json clp-s_out.json
./clp-s c <archive dir> <original JSON>
./clp-s x <archive dir> <output JSON dir>
jq -S -c '.' <output JSON dir>/original > clp-s_out.json
Spot checked match of mongodb and cockroach database
Notes:
Summary by CodeRabbit
New Features
JsonParser
to support parsing of Key Value IR Streams.compress
function for parsing Key Value IR files.Bug Fixes
compress
function.