
feat(core-clp): Defer each archive's global metadata updates until it has been completed (resolves #685). #705

Merged: 23 commits, Feb 14, 2025
29 changes: 16 additions & 13 deletions components/core/src/clp/streaming_archive/writer/Archive.cpp
@@ -157,10 +157,6 @@ void Archive::open(UserConfig const& user_config) {

m_global_metadata_db = user_config.global_metadata_db;

m_global_metadata_db->open();
m_global_metadata_db->add_archive(m_id_as_string, *m_local_metadata);
m_global_metadata_db->close();

m_file = nullptr;

// Open log-type dictionary
@@ -238,6 +234,8 @@ void Archive::close() {

m_metadata_file_writer.close();

update_global_metadata();

m_global_metadata_db = nullptr;

m_metadata_db.close();
@@ -557,8 +555,6 @@ void Archive::persist_file_metadata(vector<File*> const& files) {

m_metadata_db.update_files(files);

m_global_metadata_db->update_metadata_for_files(m_id_as_string, files);

// Mark files' metadata as clean
for (auto file : files) {
file->mark_metadata_as_clean();
@@ -593,16 +589,12 @@ void Archive::close_segment_and_persist_file_metadata(

for (auto file : files) {
file->mark_as_in_committed_segment();
m_files_written.emplace_back(file);
Contributor:

Just wondering: can we use something like m_files_written.insert(files.begin(), files.end())?

Contributor Author:

We could, but the loop was already there. I'm not sure whether it's better to use the existing loop or to do what you proposed.

Contributor:

Can't say for sure which one is better, so I'm fine keeping the emplace.
I do notice that emplace constructs the object in place while insert copies it, but in our case I guess emplace will also somehow have to copy via the copy constructor.
Anyway, I don't think the performance really matters here.

@LinZhihao-723 do you have any insights?

Contributor:

Actually, can we just do m_files_written = files directly?

Contributor Author:

> actually, can we directly just do m_files_written = files?

No; this is executed multiple times per archive, so assignment would overwrite the files recorded by previous calls.

Contributor:

True, you are right. My bad.

}

m_global_metadata_db->open();
persist_file_metadata(files);
update_metadata();
m_global_metadata_db->close();

for (auto file : files) {
delete file;
}
files.clear();
}

@@ -635,8 +627,6 @@ void Archive::update_metadata() {
m_metadata_file_writer.seek_from_begin(0);
m_local_metadata->write_to_file(m_metadata_file_writer);

m_global_metadata_db->update_archive_metadata(m_id_as_string, *m_local_metadata);

if (m_print_archive_stats_progress) {
Member:

This block should probably move into Archive::close since it may mislead the external caller if we indicate that an archive's stats have been updated, but that archive doesn't exist in the global metadata database (whereas it did before this change).

Contributor Author:

Are you talking about the m_print_archive_stats_progress block? If so, I don't agree; I don't see the point in printing progress if we only show it once the archive has finished being compressed.

Member:

The progress output is being used by the package to keep track of when new archives have been generated. The package expects that the archive exists in the global metadata database when it receives the progress update. Progress is probably a poor name for this option at this point since its use is not really for humans per se.

Contributor Author:

Lol, this is sketchy, but I guess we don't have a good alternative at the moment. I looked through this code, and it looks like it only uses the last progress update. If so, there's no difference between moving it to close and leaving it as it is now. That said, if we do what you're suggesting, we're essentially breaking the progress-update feature in the CLI.

Member:

True, it does use only the last update. For the sake of safety, can you add a comment indicating that the stats are printed even though the archive doesn't exist in the global metadata database?

Contributor Author:

Okay. The ordering now is slightly different: previously we added to the global DB before printing; now we print, then add to the global DB. I assume this will be okay.

Contributor Author (@davemarco, Feb 14, 2025):

If you want to be safer, I could potentially add duplicate stat printing when the archive is closed (maybe with a new field like complete). Lmk.

Contributor Author (@davemarco, Feb 14, 2025):

Unrelated, but the old method was potentially unsafe in another way: the archive wasn't actually closed before it was uploaded to S3. For example, write_dir_snapshot() may not have been called if there was only one archive.

Member:

> If you want to be more safe. I could potential add a duplicate stat printing when the archive is closed (maybe with a new field like complete) lmk

Good idea. Sgtm.

Contributor Author:

On second thought, I tested this and it looked ugly.
[Screenshot: 2025-02-14 at 4:21:58 PM]
I went back to your original suggestion. I also noticed there is another progress feature, so I think it's fine to cripple this one a bit.

nlohmann::json json_msg;
json_msg["id"] = m_id_as_string;
@@ -647,6 +637,19 @@ void Archive::update_metadata() {
}
}

auto Archive::update_global_metadata() -> void {
Member:

How about renaming update_metadata to update_local_metadata so that there's congruency?

m_global_metadata_db->open();
if (false == m_local_metadata.has_value()) {
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
}
m_global_metadata_db->add_archive(m_id_as_string, m_local_metadata.value());
m_global_metadata_db->update_metadata_for_files(m_id_as_string, m_files_written);
for (auto* file : m_files_written) {
delete file;
}
Contributor:

Just looking at the function name, one may not expect it to call delete on m_files_written.

I feel it would be clearer to put these three lines outside the function, or to update the function name to include the "delete" part.

Contributor Author:

I moved the delete out of the function.

m_global_metadata_db->close();
}

// Explicitly declare template specializations so that we can define the template methods in this
// file
template void Archive::write_log_event_ir<eight_byte_encoded_variable_t>(
components/core/src/clp/streaming_archive/writer/Archive.hpp
@@ -279,6 +279,11 @@ class Archive {
*/
void update_metadata();

/**
* Updates metadata in the global metadata database.
Contributor:

Ditto the comment about deleting m_files_written.

You might also need to mention exceptions if m_global_metadata_db's API throws any.

Contributor Author:

Removed the delete from the function. None of the functions I'm calling have documented exceptions in the .hpp file.

Member:

Suggested change:
-     * Updates metadata in the global metadata database.
+     * Updates the archive's metadata in the global metadata database.

And we should probably update update_metadata's docstring to something similar but replacing "global" with "local".

*/
auto update_global_metadata() -> void;

// Variables
boost::uuids::uuid m_id;
std::string m_id_as_string;
@@ -316,6 +321,10 @@ class Archive {
std::vector<File*> m_files_with_timestamps_in_segment;
std::vector<File*> m_files_without_timestamps_in_segment;

// Data for all files in this collection has been deallocated, and should only
// contain metadata.
Contributor:

Without reading the rest of the code outside this PR, the comment is not very easy to follow. Can you point me to where the data for all files is deallocated?

Contributor Author (@davemarco, Feb 6, 2025):

The function File::append_to_segment contains the following code. I believe the members m_timestamps/m_logtypes/m_variables are memory intensive.

    m_is_written_out = true;
    m_timestamps.reset(nullptr);
    m_logtypes.reset(nullptr);
    m_variables.reset(nullptr);

File::append_to_segment should be called before files are added to m_files_written. In terms of the trace: Archive::append_file_contents_to_segment calls File::append_to_segment before calling Archive::close_segment_and_persist_file_metadata, which is the function that adds files to m_files_written. So all the files added should have had their data cleared. Note: to test this, I added an assertion that m_is_written_out is true before adding to the list. If you want, I can add it back to the source, but it should never actually trigger.

If you want, I can modify the comment to say:

Files in this collection only hold metadata. Files are added to this collection after 
`file->append_to_segment()` is called, which deallocates memory for timestamp,
logtype, and variable fields. 

We could also just remove the comment...

Contributor:

The comment makes sense to me. Maybe we can rename the variable?

Something like m_file_metadata_for_global_update.

std::vector<File*> m_files_written;

size_t m_target_segment_uncompressed_size;
Segment m_segment_for_files_with_timestamps;
ArrayBackedPosIntSet<logtype_dictionary_id_t>