feat(core-clp): Defer each archive's global metadata updates until it has been completed (resolves #685). #705
Conversation
Walkthrough: This pull request centralizes the handling of global metadata operations in the archive writer.
Actionable comments posted: 0
🧹 Nitpick comments (2)
components/core/src/clp/streaming_archive/writer/Archive.hpp (2)
`282-286`: Expand documentation to note file pointer deallocation. The docstring only mentions global metadata updates. Since the method also deletes the file pointers, consider clarifying this side effect in the comment to prevent surprises for future maintainers.

```diff
 /**
  * Updates metadata in the global metadata database.
+ * This method also deletes all file pointers stored in m_files_written.
  */
```
`324-327`: Prefer using smart pointers or storing only file metadata. Maintaining raw pointers in a vector can be error-prone. Consider using `std::unique_ptr<File>` for safer ownership semantics, or storing just the necessary file metadata, to avoid potential pointer misuse.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
components/core/src/clp/streaming_archive/writer/Archive.cpp (3 hunks)
components/core/src/clp/streaming_archive/writer/Archive.hpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
components/core/src/clp/streaming_archive/writer/Archive.cpp (1)
components/core/src/clp/streaming_archive/writer/Archive.hpp (1)

Pattern `**/*.{cpp,hpp,java,js,jsx,ts,tsx}`: Prefer `false == <expression>` rather than `!<expression>`.
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-focal-static-linked-bins
- GitHub Check: ubuntu-focal-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: build-macos (macos-13, true)
🔇 Additional comments (3)
components/core/src/clp/streaming_archive/writer/Archive.cpp (3)
`237-238`: Verify exception handling and final state if the metadata update fails. If `update_global_metadata()` throws an exception here, the archive is partially closed. Ensure that this case is handled gracefully, so that the system does not remain in an invalid or half-closed state.
`592-593`: Confirm safe usage of raw pointers in `m_files_written`. Adding raw pointers to `m_files_written` implies that other parts of the code must not reference these `File*` objects after they are deleted. Please confirm that no other references remain, to avoid use-after-free behaviour.
`640-651`: Consider robust error handling in `update_global_metadata`. At present, if any database operation fails, the subsequent file deletion and database closure are skipped. A structured error-handling flow (e.g., try/finally) would protect against partial updates and resource leaks.
@@ -593,16 +589,12 @@ void Archive::close_segment_and_persist_file_metadata(

```cpp
for (auto file : files) {
    file->mark_as_in_committed_segment();
    m_files_written.emplace_back(file);
```
just wonder, can we use something like m_files_written.insert(files.begin(), files.end())?
We could, but I figured the loop was already there. Not sure if better to use existing loop, or do what you proposed
Can't say for sure which one is better, so I am fine to keep the emplace.
I do notice that `emplace` constructs the object in place and `insert` copies the object, but in our case I guess `emplace` will also somehow have to copy via the copy constructor.
Anyway, I don't think the performance really matters here.
@LinZhihao-723 do you have any insights?
actually, can we directly just do `m_files_written = files`?
> actually, can we directly just do `m_files_written = files`?

No, this will be executed multiple times per archive. That would overwrite the previous call.
True, you are right. My bad.
```cpp
for (auto* file : m_files_written) {
    delete file;
}
```
Just looking at the function name, one may not expect the function to call `delete` on `m_files_written`.
I feel it would be clearer to put these 3 lines outside of the function, or maybe update the function name to include the "delete" part.
I moved the delete out of the function
@@ -279,6 +279,11 @@ class Archive {

```cpp
     */
    void update_metadata();

    /**
     * Updates metadata in the global metadata database.
```
Ditto comment for deleting `m_files_written`.
You might also need to mention exceptions, if `m_global_metadata_db`'s API throws any.
Removed the delete from the function. None of the functions I am calling have any documented exceptions in the .hpp file.
```cpp
// Data for all files in this collection has been deallocated, and should only
// contain metadata.
```
Without reading the rest of the code outside this PR, the comment is not very easy to follow.
Can you point me to where the data for all files is deallocated?
The function `File::append_to_segment` has the following code in it. I believe the members `m_timestamps`/`m_logtypes`/`m_variables` are memory-intensive.

```cpp
m_is_written_out = true;
m_timestamps.reset(nullptr);
m_logtypes.reset(nullptr);
m_variables.reset(nullptr);
```

This function, `File::append_to_segment`, should be called before the files are added to `m_files_written`. In terms of the trace, `Archive::append_file_contents_to_segment` calls `File::append_to_segment` before calling `Archive::close_segment_and_persist_file_metadata`, and `Archive::close_segment_and_persist_file_metadata` is the function that adds files to `m_files_written`. So all the files added should have their data cleared. Note: to test this, I added an assertion that `m_is_written_out` is true before adding to the list in testing. If you want, I can add it back to the source, but it should never actually trigger.

If you want, I can modify the comment to say:

```cpp
// Files in this collection only hold metadata. Files are added to this collection after
// `file->append_to_segment()` is called, which deallocates memory for timestamp,
// logtype, and variable fields.
```
We could also just remove the comment...
The comment makes sense to me. Maybe we can rename the variable?
Something like `m_file_metadata_for_global_update`.
Answered review questions
Actionable comments posted: 0
🧹 Nitpick comments (1)
components/core/src/clp/streaming_archive/writer/Archive.cpp (1)
`644-652`: Consider using an early-return pattern for better readability. The validation check could be simplified using an early-return pattern, making the code more maintainable.

```diff
 auto Archive::update_global_metadata() -> void {
     m_global_metadata_db->open();
-    if (false == m_local_metadata.has_value()) {
+    if (!m_local_metadata.has_value()) {
         throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
     }
     m_global_metadata_db->add_archive(m_id_as_string, m_local_metadata.value());
     m_global_metadata_db->update_metadata_for_files(m_id_as_string, m_files_written);
     m_global_metadata_db->close();
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp/streaming_archive/writer/Archive.cpp (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
Pattern `**/*.{cpp,hpp,java,js,jsx,ts,tsx}`: Prefer `false == <expression>` rather than `!<expression>`.

Applied to files:
- components/core/src/clp/streaming_archive/writer/Archive.cpp
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-focal-static-linked-bins
- GitHub Check: ubuntu-focal-dynamic-linked-bins
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: build (macos-latest)
- GitHub Check: lint-check (macos-latest)
- GitHub Check: build-macos (macos-13, true)
🔇 Additional comments (3)
components/core/src/clp/streaming_archive/writer/Archive.cpp (3)
`237-238`: LGTM! Good placement of the global metadata update. The update is correctly placed after all operations are complete and before cleaning up resources.

`240-243`: LGTM! Proper cleanup of resources. Files are correctly deleted after metadata is updated, ensuring no data loss.

`596-596`: Consider using `insert()` for better performance. For better performance when adding multiple files, consider using `insert()` instead of multiple `emplace_back()` calls.

```diff
-    m_files_written.emplace_back(file);
+    m_files_written.insert(m_files_written.end(), files.begin(), files.end());
```
Looking at the change, I feel it's acceptable as a quick fix. It could be confusing for the File object to be in a state where it holds no valid data but valid metadata, but with properly commented code I don't think it will be a big deal. Did you discuss this design with Kirk?

Just brainstorming the possible ways to properly do it. One possible way is to factor out all file-level metadata as a separate class/struct, and have each File maintain a pointer to the metadata instance. After appending files to a segment, it can then transfer the metadata instance's ownership to Archive.cpp, but this could require a bunch of work... Another hackier way is to set a flag in File.cpp to indicate whether the data in the file is valid. I don't think this is clean either, but at least it gives the reader a sense that a File object could enter a status where only the metadata is valid.
I did not discuss the design with Kirk.

This could maybe work, but there would be work to refactor a bunch of interfaces which access

We already have the flag
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
components/core/src/clp/streaming_archive/writer/Archive.cpp (3 hunks)
components/core/src/clp/streaming_archive/writer/Archive.hpp (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- components/core/src/clp/streaming_archive/writer/Archive.hpp
🧰 Additional context used
📓 Path-based instructions (1)
Pattern `**/*.{cpp,hpp,java,js,jsx,ts,tsx}`: Prefer `false == <expression>` rather than `!<expression>`.

Applied to files:
- components/core/src/clp/streaming_archive/writer/Archive.cpp
🪛 GitHub Actions: clp-lint
components/core/src/clp/streaming_archive/writer/Archive.cpp
[error] 650-650: code should be clang-formatted
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-focal-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-focal-dynamic-linked-bins
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: build (macos-latest)
🔇 Additional comments (2)
components/core/src/clp/streaming_archive/writer/Archive.cpp (2)
`237-244`: LGTM! Proper sequencing and cleanup. The changes ensure that global metadata is updated before nulling the database pointer, followed by proper cleanup of file metadata.

`594-597`: LGTM! Efficient metadata handling. The file is properly marked as in a committed segment before being added to the list for the global metadata update. Using `emplace_back` is efficient for insertion.
```cpp
auto Archive::update_global_metadata() -> void {
    m_global_metadata_db->open();
    if (false == m_local_metadata.has_value()) {
        throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
    }
    m_global_metadata_db->add_archive(m_id_as_string, m_local_metadata.value());
    m_global_metadata_db->update_metadata_for_files(m_id_as_string, m_file_metadata_for_global_update);
    m_global_metadata_db->close();
}
```
Fix formatting issues and improve error handling.
The pipeline reports formatting issues. Additionally, consider these improvements:
- Use more idiomatic condition check
- Add descriptive error message
Apply this diff to address the issues:
```diff
 auto Archive::update_global_metadata() -> void {
     m_global_metadata_db->open();
-    if (false == m_local_metadata.has_value()) {
-        throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+    if (!m_local_metadata) {
+        throw OperationFailed(
+                ErrorCode_Failure,
+                __FILENAME__,
+                __LINE__,
+                "Local metadata not initialized before updating global metadata"
+        );
     }
     m_global_metadata_db->add_archive(m_id_as_string, m_local_metadata.value());
     m_global_metadata_db->update_metadata_for_files(m_id_as_string,
                                                     m_file_metadata_for_global_update);
     m_global_metadata_db->close();
 }
```
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 GitHub Actions: clp-lint
[error] 650-650: code should be clang-formatted
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp/streaming_archive/writer/Archive.cpp
(3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
Pattern `**/*.{cpp,hpp,java,js,jsx,ts,tsx}`: Prefer `false == <expression>` rather than `!<expression>`.

Applied to files:
- components/core/src/clp/streaming_archive/writer/Archive.cpp
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
- GitHub Check: build (macos-latest)
🔇 Additional comments (3)
components/core/src/clp/streaming_archive/writer/Archive.cpp (3)
237-238
: LGTM!The order of operations is correct: updating global metadata before nulling the database pointer.
240-243
: LGTM!Proper cleanup of file metadata after the global update is complete.
596-596
: LGTM!Correctly storing file metadata for later global update after marking the file as in a committed segment.
```cpp
auto Archive::update_global_metadata() -> void {
    m_global_metadata_db->open();
    if (false == m_local_metadata.has_value()) {
        throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
    }
    m_global_metadata_db->add_archive(m_id_as_string, m_local_metadata.value());
    m_global_metadata_db->update_metadata_for_files(
            m_id_as_string,
            m_file_metadata_for_global_update
    );
    m_global_metadata_db->close();
}
```
🛠️ Refactor suggestion
Improve error handling and condition check.
The implementation is correct, but there are a few improvements that can be made:
- Use idiomatic condition check
- Add descriptive error message
Apply this diff to improve the code:
```diff
 auto Archive::update_global_metadata() -> void {
     m_global_metadata_db->open();
-    if (false == m_local_metadata.has_value()) {
-        throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+    if (!m_local_metadata) {
+        throw OperationFailed(
+                ErrorCode_Failure,
+                __FILENAME__,
+                __LINE__,
+                "Local metadata not initialized before updating global metadata"
+        );
     }
     m_global_metadata_db->add_archive(m_id_as_string, m_local_metadata.value());
     m_global_metadata_db->update_metadata_for_files(
             m_id_as_string,
             m_file_metadata_for_global_update
     );
     m_global_metadata_db->close();
 }
```
Committable suggestion skipped: line range outside the PR's diff.
Actionable comments posted: 0
🧹 Nitpick comments (2)
components/core/src/clp/streaming_archive/writer/Archive.cpp (2)
`596-600`: Enhance comment clarity about memory deallocation. The comment should be more explicit about memory deallocation, to prevent confusion.

```diff
-    // Files in this collection only hold metadata. Emplaced files have called method
-    // `File::append_to_segment()` which deallocates memory for timestamp,
-    // logtype, and variable fields.
+    // Files in this collection only hold metadata as `File::append_to_segment()` has been called,
+    // which deallocates memory for timestamp, logtype, and variable fields to reduce memory usage.
+    // The metadata is preserved for updating the global metadata database.
```
`647-658`: Improve error handling and the condition check. The implementation is correct, but could benefit from these improvements:

- Use an idiomatic condition check
- Add a descriptive error message

```diff
 auto Archive::update_global_metadata() -> void {
     m_global_metadata_db->open();
-    if (false == m_local_metadata.has_value()) {
-        throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+    if (!m_local_metadata) {
+        throw OperationFailed(
+                ErrorCode_Failure,
+                __FILENAME__,
+                __LINE__,
+                "Local metadata not initialized before updating global metadata"
+        );
     }
     m_global_metadata_db->add_archive(m_id_as_string, m_local_metadata.value());
     m_global_metadata_db->update_metadata_for_files(
             m_id_as_string,
             m_file_metadata_for_global_update
     );
     m_global_metadata_db->close();
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
components/core/src/clp/streaming_archive/writer/Archive.cpp (3 hunks)
components/core/src/clp/streaming_archive/writer/Archive.hpp (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- components/core/src/clp/streaming_archive/writer/Archive.hpp
🧰 Additional context used
📓 Path-based instructions (1)
Pattern `**/*.{cpp,hpp,java,js,jsx,ts,tsx}`: Prefer `false == <expression>` rather than `!<expression>`.

Applied to files:
- components/core/src/clp/streaming_archive/writer/Archive.cpp
⏰ Context from checks skipped due to timeout of 90000ms (12)
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-focal-static-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-focal-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: build (macos-latest)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (2)
components/core/src/clp/streaming_archive/writer/Archive.cpp (2)
`237-238`: LGTM! The placement of the global metadata update ensures all file operations are complete before updating the global metadata.

`240-244`: LGTM! The cleanup ensures proper resource management by deleting file pointers after they are no longer needed.
@@ -279,6 +279,11 @@ class Archive {

```cpp
     */
    void update_metadata();

    /**
     * Updates metadata in the global metadata database.
```
```diff
- * Updates metadata in the global metadata database.
+ * Updates the archive's metadata in the global metadata database.
```

And we should probably update `update_metadata`'s docstring to something similar, but replacing "global" with "local".
```cpp
// Files in this collection only hold metadata. Emplaced files have called method
// `File::append_to_segment()` which deallocates memory for timestamp,
// logtype, and variable fields.
m_file_metadata_for_global_update.emplace_back(file);
```
This comment confused me a bit, since this method doesn't seem to have anything to do with managing the lifecycle of the files (and the grammar is a bit odd). I think it would make more sense to add a comment above `m_file_metadata_for_global_update` in Archive.hpp indicating that the collection contains Files that have been written to the local archive and deallocated, but whose metadata has not been persisted globally (rephrase as appropriate).
@@ -647,6 +644,19 @@ void Archive::update_metadata() {

```cpp
    }
}

auto Archive::update_global_metadata() -> void {
```
How about renaming `update_metadata` to `update_local_metadata` so that there's congruency?
@@ -635,8 +634,6 @@ void Archive::update_metadata() {

```diff
     m_metadata_file_writer.seek_from_begin(0);
     m_local_metadata->write_to_file(m_metadata_file_writer);

-    m_global_metadata_db->update_archive_metadata(m_id_as_string, *m_local_metadata);
-
     if (m_print_archive_stats_progress) {
```
This block should probably move into `Archive::close`, since it may mislead the external caller if we indicate that an archive's stats have been updated but that archive doesn't exist in the global metadata database (whereas it did before this change).
Are you talking about the `m_print_archive_stats_progress` block? If so, I don't agree. I don't see the point in printing the progress if we only show it when the archive is finished being compressed.
The progress output is being used by the package to keep track of when new archives have been generated. The package expects that the archive exists in the global metadata database when it receives the progress update. Progress is probably a poor name for this option at this point since its use is not really for humans per se.
Lol, this is sketchy, but I guess we don't have a good alternative at the moment. I looked through this code and it looks like it only uses the last progress update? As such, there is no difference between moving it to close and leaving it as it is now? If we do what you are suggesting, we are essentially breaking the progress update feature in the CLI.
True, it does use only the last update. For the sake of safety, can you add a comment indicating that the stats are printed even though the archive doesn't exist in the global metadata database?
Okay. The ordering now is slightly different. Previously we added to the global DB before printing; now we print, then add to the global DB. I assume this will be okay.
If you want to be safer, I could potentially add a duplicate stat printing when the archive is closed (maybe with a new field like `complete`). Lmk.
Unrelated, but the old method was potentially unsafe in another way: the archive wasn't actually closed before it was uploaded to S3 (e.g., `write_dir_snapshot()` may not have been called if there was only one archive).
> If you want to be safer, I could potentially add a duplicate stat printing when the archive is closed (maybe with a new field like `complete`).

Good idea. Sgtm.
Actionable comments posted: 0
🧹 Nitpick comments (2)
components/core/src/clp/streaming_archive/writer/Archive.cpp (2)
`596-599`: Improve comment clarity and conciseness. The comment provides important context but could be more concise.

```diff
-    // Files in this collection only hold metadata. Emplaced files have called method
-    // `File::append_to_segment()` which deallocates memory for timestamp,
-    // logtype, and variable fields.
+    // Note: Files here only contain metadata as their data fields were deallocated by
+    // File::append_to_segment()
```
`647-658`: Improve condition check and error message. The implementation is correct, but could be improved:

- Use idiomatic condition check
- Add descriptive error message

```diff
 auto Archive::update_global_metadata() -> void {
     m_global_metadata_db->open();
-    if (false == m_local_metadata.has_value()) {
-        throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+    if (!m_local_metadata) {
+        throw OperationFailed(
+                ErrorCode_Failure,
+                __FILENAME__,
+                __LINE__,
+                "Local metadata not initialized before updating global metadata"
+        );
     }
     m_global_metadata_db->add_archive(m_id_as_string, m_local_metadata.value());
     m_global_metadata_db->update_metadata_for_files(
             m_id_as_string,
             m_file_metadata_for_global_update
     );
     m_global_metadata_db->close();
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- components/core/src/clp/streaming_archive/writer/Archive.cpp (4 hunks)
- components/core/src/clp/streaming_archive/writer/Archive.hpp (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- components/core/src/clp/streaming_archive/writer/Archive.hpp
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,ts,tsx}`: Prefer `false == <expression>` rather than `!<expression>`.

Files: components/core/src/clp/streaming_archive/writer/Archive.cpp
⏰ Context from checks skipped due to timeout of 90000ms (12)
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-focal-static-linked-bins
- GitHub Check: ubuntu-focal-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: lint-check (macos-latest)
- GitHub Check: build (macos-latest)
🔇 Additional comments (3)
components/core/src/clp/streaming_archive/writer/Archive.cpp (3)
`237-244`: LGTM! The changes properly handle global metadata updates and cleanup. The code correctly:
- Updates global metadata before cleanup
- Properly deallocates file metadata to prevent memory leaks
`603-603`: LGTM! Method name change improves clarity. The rename from `update_metadata` to `update_local_metadata` better reflects the method's scope.
`630-645`: LGTM! Method rename improves code clarity. The method name now accurately reflects its responsibility of updating only local metadata.
Actionable comments posted: 0
🧹 Nitpick comments (3)
components/core/src/clp/streaming_archive/writer/Archive.cpp (3)
`596-597`: Add clarifying comment about file metadata state. Add a comment to clarify that files in this collection only hold metadata, as the file data is deallocated after being written to the segment.

```diff
+    // Files in this collection only hold metadata as their data has been deallocated
+    // after being written to the segment
     m_file_metadata_for_global_update.emplace_back(file);
```
`627-643`: Add comment about stats printing timing. The implementation is correct, but add a comment to clarify why stats are printed before the global metadata update.

```diff
-    // Note that stats are printed before archive is added to global metadata database.
+    // Stats are printed before the global metadata update to maintain compatibility with
+    // the package's progress tracking (see job_orchestration/executor/compress/compression_task.py)
     if (m_print_archive_stats_progress) {
```
`645-656`: Improve error handling and database connection management. Consider these improvements:

- Add descriptive error message
- Use RAII pattern for database connection

```diff
 auto Archive::update_global_metadata() -> void {
-    m_global_metadata_db->open();
-    if (false == m_local_metadata.has_value()) {
-        throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+    class ScopedDBConnection {
+    public:
+        explicit ScopedDBConnection(GlobalMetadataDB* db) : m_db(db) { m_db->open(); }
+        ~ScopedDBConnection() { m_db->close(); }
+    private:
+        GlobalMetadataDB* m_db;
+    };
+
+    if (false == m_local_metadata.has_value()) {
+        throw OperationFailed(
+                ErrorCode_Failure,
+                __FILENAME__,
+                __LINE__,
+                "Local metadata not initialized before updating global metadata"
+        );
     }
+
+    ScopedDBConnection db_connection(m_global_metadata_db);
     m_global_metadata_db->add_archive(m_id_as_string, m_local_metadata.value());
     m_global_metadata_db->update_metadata_for_files(
             m_id_as_string,
             m_file_metadata_for_global_update
     );
-    m_global_metadata_db->close();
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- components/core/src/clp/streaming_archive/writer/Archive.cpp (4 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,ts,tsx}`: Prefer `false == <expression>` rather than `!<expression>`.

Files: components/core/src/clp/streaming_archive/writer/Archive.cpp
⏰ Context from checks skipped due to timeout of 90000ms (12)
- GitHub Check: ubuntu-focal-static-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-focal-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
- GitHub Check: build-macos (macos-13, true)
🔇 Additional comments (3)
components/core/src/clp/streaming_archive/writer/Archive.cpp (3)
`237-238`: LGTM! Proper sequencing of global metadata updates. The placement ensures all local metadata is written before updating global metadata, which aligns with the PR objective.
`240-244`: LGTM! Proper cleanup of file metadata. The cleanup prevents memory leaks and is correctly placed after the global metadata update.
`600-601`: LGTM! Clear method naming. The rename to `update_local_metadata` clearly indicates the scope of the metadata update.
@kirkrodrigues requesting review
Actionable comments posted: 0
🧹 Nitpick comments (3)
components/core/src/clp/streaming_archive/writer/Archive.hpp (1)

`330-333`: Consider renaming the member variable for better clarity. Based on previous review discussions, consider renaming `m_file_metadata_for_global_update` to better reflect its purpose.

```diff
-    // Collection of `File`s which have been written to archive and deallocated, but whose
-    // metadata has not yet been persisted globally.
-    std::vector<File*> m_file_metadata_for_global_update;
+    // Collection of `File`s which have been written to archive and deallocated, but whose
+    // metadata has not yet been persisted globally. Files in this collection only hold
+    // metadata as their data has been deallocated by `file->append_to_segment()`.
+    std::vector<File*> m_deallocated_files_pending_global_metadata_update;
```

components/core/src/clp/streaming_archive/writer/Archive.cpp (2)
`631-637`: Consider adding a timestamp to the progress output. The JSON output includes essential archive statistics. Consider adding a timestamp field to help track when each progress update was generated.

```diff
 auto Archive::print_archive_stats_progress() -> void {
     nlohmann::json json_msg;
     json_msg["id"] = m_id_as_string;
+    json_msg["timestamp"] = std::time(nullptr);
     json_msg["uncompressed_size"] = m_local_metadata->get_uncompressed_size_bytes();
     json_msg["size"] = m_local_metadata->get_compressed_size_bytes();
     std::cout << json_msg.dump(-1, ' ', true, nlohmann::json::error_handler_t::ignore)
               << std::endl;
 }
```
`647-658`: Improve error handling with a descriptive message. The error handling follows coding guidelines but could be more informative.

```diff
 auto Archive::update_global_metadata() -> void {
     m_global_metadata_db->open();
     if (false == m_local_metadata.has_value()) {
-        throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+        throw OperationFailed(
+                ErrorCode_Failure,
+                __FILENAME__,
+                __LINE__,
+                "Local metadata not initialized before updating global metadata"
+        );
     }
     m_global_metadata_db->add_archive(m_id_as_string, m_local_metadata.value());
     m_global_metadata_db->update_metadata_for_files(
             m_id_as_string,
             m_file_metadata_for_global_update
     );
     m_global_metadata_db->close();
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- components/core/src/clp/streaming_archive/writer/Archive.cpp (3 hunks)
- components/core/src/clp/streaming_archive/writer/Archive.hpp (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,ts,tsx}`: Prefer `false == <expression>` rather than `!<expression>`.

Files: components/core/src/clp/streaming_archive/writer/Archive.hpp, components/core/src/clp/streaming_archive/writer/Archive.cpp
⏰ Context from checks skipped due to timeout of 90000ms (12)
- GitHub Check: ubuntu-focal-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-focal-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: lint-check (macos-latest)
- GitHub Check: build-macos (macos-13, true)
🔇 Additional comments (7)
components/core/src/clp/streaming_archive/writer/Archive.hpp (4)
`248-249`: LGTM! Docstring clarifies database scope. The updated docstring explicitly states that metadata is written to the local database, improving clarity.
`256-257`: LGTM! Docstring accurately reflects function behaviour. The updated docstring clarifies that metadata is persisted to the local database and mentions cleanup of remaining data.
`278-281`: LGTM! Method extraction improves modularity. The extraction of progress statistics printing into a dedicated method follows the Single Responsibility Principle.
`284-286`: LGTM! Clear separation of local and global metadata updates. The split into `update_local_metadata` and `update_global_metadata` methods clearly separates concerns and aligns with the PR's objective to postpone global metadata updates.

Also applies to: 288-291
components/core/src/clp/streaming_archive/writer/Archive.cpp (3)
`237-248`: LGTM! The changes improve the cleanup and metadata update sequence. The changes correctly implement the postponed global metadata update, add proper cleanup of file metadata, and maintain progress reporting functionality.
`600-604`: LGTM! The changes correctly implement metadata update deferral. The changes appropriately store file metadata for later global updates and clarify the local nature of the metadata update.
`639-645`: LGTM! The rename improves code clarity. The method name now clearly indicates its focus on local metadata updates.
For the PR title, how about:
feat(core-clp): Defer each archive's global metadata updates until it has been completed (resolves #685).
Description
See #685 for the high-level PR goals.
re: implementation
Removed all writes to the global metadata DB during serialization.
Instead, file metadata is maintained in a new list (previously it was deleted when a segment was closed), and all file metadata is written to the global metadata DB when the archive is finished.
I explored other options, but in the end, this seemed to be the cleanest.
Validation performed
Summary by CodeRabbit
New Features
Refactor