feat(clp-s): Add support for ingesting logs from S3. #639

Open
wants to merge 15 commits into
base: main

Conversation

gibber9809
Contributor

@gibber9809 gibber9809 commented Dec 16, 2024

Description

This PR adds support for ingesting logs from S3 and substantially refactors the command line interface to simplify specifying resources that exist at some network location. A follow-up PR will build on the utilities added by this PR to implement search on archives located on S3, as well as general search on single-file archives.

The major changes are as follows

  • Added InputConfig.{hpp,cpp}, which introduces new Path and NetworkAuth structs that can be used in combination with new utilities in ReaderUtils to open readers for resources which may exist on the network or local filesystem
  • Modified JsonFileIterator and JsonParser to transparently support ingestion from the network or filesystem via new Path and NetworkAuth structs
  • Modified ArchiveReader to accept Path and NetworkAuth, and to return an error when the requested archive is a single-file archive (SFA) or exists on the network.
  • Modified the command line arguments to allow specifying the full path to an archive (the --archive-id option still exists, but is not required). Archive id is also no longer passed around explicitly, but rather determined based on the path to an archive.
  • Made CommandLineArguments fully determine the paths to all relevant files/archives based on the command line options to allow simplifying clp-s.cpp
  • Removed uses of boost::filesystem
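To make the "archive id determined from the path" change concrete, here is a minimal sketch. The names Path, InputSource, and get_archive_id_from_path appear elsewhere in this PR, but the struct layout and the function body below are assumptions for illustration, not the PR's code; only plain filesystem-style paths are handled.

```cpp
#include <string>
#include <string_view>

namespace sketch {
// Assumed shape of the new types; the real definitions live in InputConfig.hpp.
enum class InputSource { Filesystem, Network };

struct Path {
    InputSource source;
    std::string path;
};

// Hypothetical body: uses the last non-empty path component as the archive id.
// Returns false when the path has no usable component (e.g. "" or "///").
bool get_archive_id_from_path(Path const& archive_path, std::string& archive_id) {
    std::string_view p{archive_path.path};
    // Ignore any trailing slashes so "/archives/abc123/" behaves like "/archives/abc123"
    auto const last = p.find_last_not_of('/');
    if (std::string_view::npos == last) {
        return false;
    }
    p = p.substr(0, last + 1);
    auto const slash = p.find_last_of('/');
    if (std::string_view::npos != slash) {
        p = p.substr(slash + 1);
    }
    archive_id = std::string(p);
    return true;
}
}  // namespace sketch
```

With this approach, passing the full path to an archive is enough, and a separate --archive-id value is only needed when the path points at a directory of archives.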

Validation performed

  • Validated that logs from S3 can be ingested successfully
  • Validated that search and extraction correctly return an error when trying to interact with S3 or any single-file archive
  • Validated that compression, decompression, and search function as normal for multi-file archives

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced support for network authentication and input path handling in command-line arguments.
    • Added new functionalities for managing input configurations and creating readers for file and network sources.
    • Enhanced command-line argument parsing with new options for authentication types and input paths.
  • Bug Fixes

    • Enhanced error handling for archive operations and input path validations.
  • Documentation

    • Updated method signatures and documentation to reflect changes in input path and authentication handling.
  • Refactor

    • Transitioned from Boost Filesystem to C++17's standard filesystem library for improved maintainability.
    • Streamlined the handling of archive paths and JSON parsing logic for clarity and efficiency.

Contributor

coderabbitai bot commented Dec 16, 2024

Walkthrough

This pull request introduces significant changes to the CLP (Compressed Log Processing) project, focusing on enhancing input path handling, network authentication, and filesystem operations. The modifications primarily involve transitioning from Boost Filesystem to C++17 standard filesystem, introducing new input configuration management, and updating various components to support more flexible file and archive handling. The changes include adding new classes like InputConfig, modifying reader utilities, and updating command-line argument processing to support network and filesystem sources.

Changes

File | Change Summary
components/core/CMakeLists.txt | Added Boost::url library linking, updated source files
components/core/src/clp_s/CMakeLists.txt | Added multiple new source files, updated library dependencies
components/core/src/clp_s/InputConfig.{cpp,hpp} | New files introducing input configuration management, path source determination, and authentication options
components/core/src/clp_s/CommandLineArguments.{cpp,hpp} | Updated to handle new input paths and network authentication
components/core/src/clp_s/ArchiveReader.{cpp,hpp} | Modified open method to use new path and authentication handling
components/core/src/clp_s/ReaderUtils.{cpp,hpp} | Added methods for creating readers with network and file support
components/core/src/clp_s/Utils.{cpp,hpp} | Transitioned from Boost to standard filesystem, added new utility methods
components/core/src/clp_s/JsonConstructor.{cpp,hpp} | Updated to utilize new path and authentication handling
components/core/src/clp_s/JsonParser.{cpp,hpp} | Modified to support new input paths and network authentication
components/core/src/clp_s/JsonFileIterator.{cpp,hpp} | Changed to use ReaderInterface for file reading
components/core/src/clp_s/ZstdDecompressor.cpp | Transitioned to standard filesystem, updated error handling
components/core/src/clp_s/clp-s.cpp | Updated command line argument handling for archives
components/core/src/clp_s/search/kql/CMakeLists.txt | Updated library dependencies
components/core/tests/test-clp_s-end_to_end.cpp | Adjusted tests to reflect new input path handling

Suggested Reviewers

  • haiqi96
  • kirkrodrigues

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8fa5aa2 and fe84cd4.

📒 Files selected for processing (1)
  • components/core/src/clp_s/search/kql/CMakeLists.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/core/src/clp_s/search/kql/CMakeLists.txt

@gibber9809 gibber9809 requested a review from wraymo December 16, 2024 22:27
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 9

🔭 Outside diff range comments (1)
components/core/src/clp_s/JsonFileIterator.hpp (1)

Line range hint 17-26: Update constructor documentation to reflect parameter changes.

The Doxygen comments for the constructor still refer to file_name, but the parameter has been changed to reader. Please update the documentation to match the new parameter.

🧹 Nitpick comments (22)
components/core/src/clp_s/ZstdDecompressor.cpp (2)

6-8: Memory mapping still requires Boost

The transition to std::filesystem is a positive step, but the code still relies on boost::iostreams for memory mapping. Note that std::filesystem does not provide memory mapping, so removing this Boost dependency would require a platform API such as mmap rather than a standard-library replacement.


206-212: Align condition with coding guidelines

According to the coding guidelines, prefer false == <expression> over !<expression>. Consider updating the error check:

-    if (error_code) {
+    if (false == !error_code) {
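For context on what such a check tests, the sketch below uses the non-throwing std::filesystem::file_size overload. The helper name try_get_file_size is hypothetical and not taken from ZstdDecompressor.cpp. A std::error_code converts to true when it holds an error, so a positive check like this one contains no negation for the guideline to replace.

```cpp
#include <cstdint>
#include <filesystem>
#include <system_error>

// Hypothetical helper for illustration only.
// Returns true and sets `size` on success; returns false if the size cannot be read.
bool try_get_file_size(std::filesystem::path const& path, std::uintmax_t& size) {
    std::error_code error_code;
    size = std::filesystem::file_size(path, error_code);
    // error_code is truthy only when the call above failed
    if (error_code) {
        return false;
    }
    return true;
}
```

The `false == <expression>` form becomes relevant for the opposite case, for example `if (false == stream.is_open())` instead of `if (!stream.is_open())`.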
components/core/src/clp_s/ReaderUtils.hpp (1)

73-74: Add documentation for the new method.

The method name suggests it might fail, but there's no documentation explaining:

  • The purpose of the method
  • Expected failure cases
  • Return value semantics (when nullptr might be returned)
  • Parameter requirements

Add documentation like this:

+    /**
+     * Attempts to create a reader for the given path with optional network authentication
+     * @param path The path to the resource (local or network)
+     * @param network_auth Network authentication options if required
+     * @return A reader interface if successful, nullptr otherwise
+     * @throws OperationFailed if the resource exists but cannot be read
+     */
     static std::shared_ptr<clp::ReaderInterface>
     try_create_reader(Path const& path, NetworkAuthOption const& network_auth);
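The "try_" convention documented above can be sketched as follows. This is an assumption-laden illustration: it returns a std::istream instead of clp::ReaderInterface, omits the NetworkAuthOption parameter, and treats network sources as unsupported, none of which reflects the PR's actual implementation.

```cpp
#include <fstream>
#include <memory>
#include <string>

namespace sketch {
// Assumed shapes; the real types are declared in InputConfig.hpp.
enum class InputSource { Filesystem, Network };

struct Path {
    InputSource source;
    std::string path;
};

// Returns an open reader on success, or nullptr when the resource cannot be opened.
// Callers check for nullptr instead of catching an exception.
std::shared_ptr<std::istream> try_create_reader(Path const& path) {
    if (InputSource::Filesystem != path.source) {
        // Network readers are out of scope for this sketch
        return nullptr;
    }
    auto stream = std::make_shared<std::ifstream>(path.path, std::ios::binary);
    if (false == stream->is_open()) {
        return nullptr;
    }
    return stream;
}
}  // namespace sketch
```

Returning nullptr keeps the failure path cheap for callers such as JsonParser, which only need to know whether a reader is available before continuing.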
components/core/tests/test-clp_s-end_to_end.cpp (1)

103-112: Enhance error handling for archive path operations.

While the implementation correctly uses the new Path struct, consider adding error handling for scenarios where:

  1. The archive path is inaccessible
  2. The archive path permissions are insufficient

Consider this improvement:

 for (auto const& entry : std::filesystem::directory_iterator(cTestEndToEndArchiveDirectory)) {
     if (false == entry.is_directory()) {
         // Skip non-directories
         continue;
     }

+    try {
         constructor_option.archive_path = clp_s::Path{
                 .source{clp_s::InputSource::Filesystem},
                 .path{entry.path().string()}
         };
         clp_s::JsonConstructor constructor{constructor_option};
         constructor.store();
+    } catch (const std::filesystem::filesystem_error& e) {
+        FAIL("Failed to process archive: " << e.what());
+    }
 }
components/core/src/clp_s/Utils.hpp (1)

48-61: Consider additional URI edge cases.

The method documentation provides good examples, but consider handling these edge cases:

  • Empty URIs
  • URIs with query parameters but no path
  • URIs with fragments
  • URIs with encoded characters

Consider adding validation for these cases and updating the documentation with examples:

 /**
  * Gets the last component of a uri.
  *
  * For example:
  * https://www.something.org/abc-xyz -> abc-xyz
  * https://www.something.org/aaa/bbb/abc-xyz?something=something -> abc-xyz
+ * https://www.something.org -> "" (empty URI path)
+ * https://www.something.org?param=value -> "" (no path component)
+ * https://www.something.org/abc%20xyz -> "abc xyz" (encoded characters)
+ * https://www.something.org/abc#section -> "abc" (with fragment)
  *
  * @param uri
  * @param name Returned component name
  * @return true on success, false otherwise
  */
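The edge cases listed above can be explored with a small standard-library sketch. The function name last_uri_component_sketch is hypothetical, the PR's own implementation may rely on Boost::url instead, and percent-decoding is deliberately left out, so "abc%20xyz" would be returned unchanged here.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: extracts the last non-empty path component of a URI.
// Returns false when the URI has no scheme or no path component.
bool last_uri_component_sketch(std::string_view uri, std::string& name) {
    // Drop the query string and fragment, if any
    uri = uri.substr(0, uri.find_first_of("?#"));
    // Skip the scheme and authority ("https://www.something.org")
    auto const scheme_end = uri.find("://");
    if (std::string_view::npos == scheme_end) {
        return false;
    }
    auto const path_begin = uri.find('/', scheme_end + 3);
    if (std::string_view::npos == path_begin) {
        return false;
    }
    std::string_view path = uri.substr(path_begin);
    // Ignore trailing slashes; a path made only of slashes has no component
    auto const last = path.find_last_not_of('/');
    if (std::string_view::npos == last) {
        return false;
    }
    path = path.substr(0, last + 1);
    // `path` starts with '/', so a slash is always found here
    name = std::string(path.substr(path.find_last_of('/') + 1));
    return true;
}
```

In this sketch a URI without a path reports failure rather than returning an empty name, which is one possible answer to the "empty URI path" case raised above.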
components/core/src/clp_s/CMakeLists.txt (2)

221-222: Verify version constraints for new dependencies

The addition of Boost::regex, Boost::url, CURL, and OpenSSL::Crypto libraries requires careful version management:

  • Consider adding minimum version requirements for these dependencies
  • Document any specific version requirements in the project's README

Consider adding version checks in CMake:

find_package(Boost 1.81 REQUIRED COMPONENTS regex url)
find_package(CURL 7.88 REQUIRED)
find_package(OpenSSL 3.0 REQUIRED)

Also applies to: 228-228


82-83: Consider separating network configuration

The new InputConfig component appears to handle both local and network paths. Consider:

  • Separating network authentication configuration into a dedicated component
  • Using the factory pattern for reader creation based on path type
components/core/src/clp_s/InputConfig.hpp (2)

16-16: Typographical error in comment: "definining" should be "defining"

There is a typo in the comment on line 16. The word "definining" should be corrected to "defining".


49-49: Incomplete parameter documentation in function comments

The @param annotations lack descriptions for the parameters. Please provide brief explanations for each parameter to improve code readability and maintainability.

Also applies to: 55-55, 63-63, 72-72, 80-80, 89-89, 98-98

components/core/src/clp_s/ArchiveReader.hpp (1)

34-35: Update parameter documentation.

The parameter documentation should be updated to describe the new parameters:

  • archive_path: Path to the archive
  • network_auth: Network authentication options
components/core/src/clp_s/clp-s.cpp (1)

355-361: Enhance error message specificity.

While the error handling is good, consider making the error message more specific by including the archive path alongside the failure reason:

-                SPDLOG_ERROR("Failed to open archive - {}", e.what());
+                SPDLOG_ERROR("Failed to open archive at {} - {}", archive_path.path, e.what());
components/core/src/clp_s/JsonParser.cpp (1)

427-429: Add specific error message for reader creation failure.

While the error handling exists, it would be helpful to log the specific reason for reader creation failure:

         auto reader{ReaderUtils::try_create_reader(path, m_network_auth)};
         if (nullptr == reader) {
+            SPDLOG_ERROR("Failed to create reader for path {}", path.path);
             m_archive_writer->close();
             return false;
         }
components/core/src/clp_s/InputConfig.cpp (3)

11-17: Consider logging exceptions in get_source_for_path for better debugging

In the function get_source_for_path, exceptions are caught but not logged. Logging the exception details can assist in diagnosing issues when the function defaults to InputSource::Network due to an exception.

Apply the following diff to add exception logging:

 try {
     return std::filesystem::exists(path) ? InputSource::Filesystem : InputSource::Network;
 } catch (std::exception const& e) {
+    SPDLOG_ERROR("Exception occurred while checking path existence: {}", e.what());
     return InputSource::Network;
 }
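An alternative to catching and logging the exception is the non-throwing overload of std::filesystem::exists, which reports failures through a std::error_code. The sketch below assumes the same Filesystem/Network fallback as the snippet above; the enum definition and function body are illustrative, not the PR's code.

```cpp
#include <filesystem>
#include <string>
#include <system_error>

namespace sketch {
// Assumed shape of the enum declared in InputConfig.hpp.
enum class InputSource { Filesystem, Network };

// Classifies a path as local if it exists on the filesystem, and as a network
// resource otherwise, without relying on exceptions.
InputSource get_source_for_path(std::string const& path) {
    std::error_code error_code;
    bool const exists = std::filesystem::exists(path, error_code);
    if (error_code) {
        // A log statement with error_code.message() could be added here
        return InputSource::Network;
    }
    return exists ? InputSource::Filesystem : InputSource::Network;
}
}  // namespace sketch
```

This keeps the decision and the diagnostic in one place, since error_code.message() is available at the point where the fallback is chosen.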

76-86: Remove unreachable code after the switch statement in get_archive_id_from_path

The return true; statement at line 85 is unreachable because all paths in the switch statement end with a return statement. Removing this line will clean up the code and prevent any confusion.

Apply the following diff:

     default:
         return false;
     }
-    return true;
 }

27-47: Refactor duplicated logic into a shared utility function

The functions get_input_files_for_path and get_input_archives_for_path share similar logic for handling input paths and populating vectors. Consider refactoring the common code into a shared helper function to reduce duplication and enhance maintainability.

Also applies to: 54-74

components/core/src/clp_s/ReaderUtils.cpp (2)

163-180: Log exceptions when URL signing fails

In try_sign_url, exceptions are caught but not logged. Logging exceptions can provide valuable information during debugging.

Apply this diff to add error logging:

     } catch (std::exception const& e) {
+        SPDLOG_ERROR("Exception during URL signing: {}", e.what());
         return false;
     }
     return true;
 }

206-215: Add logging for unhandled InputSource types

When path.source is neither Filesystem nor Network, the function returns nullptr without any logging. For better debugging and maintenance, consider logging a warning when an unrecognized InputSource is encountered.

Apply this diff:

     } else {
+        SPDLOG_WARN("Unhandled InputSource type: {}", static_cast<int>(path.source));
         return nullptr;
     }
 }
components/core/src/clp_s/JsonFileIterator.hpp (1)

6-7: Remove unnecessary include of FileReader.hpp if it's no longer used.

Since FileReader is no longer utilized in this class, consider removing the include directive for FileReader.hpp to clean up the code.

components/core/src/clp_s/Utils.cpp (2)

68-100: Consider refactoring constant comparisons for maintainability

In directory_is_multi_file_archive, the multiple comparisons of formatted_name against constants can be simplified for better readability and maintainability. Consider using a std::unordered_set or std::set to store the constants and check for membership.

Here is a possible refactor:

// Define the set of archive filenames
static const std::unordered_set<std::string> archive_files = {
    constants::cArchiveTimestampDictFile,
    constants::cArchiveSchemaTreeFile,
    constants::cArchiveSchemaMapFile,
    constants::cArchiveVarDictFile,
    constants::cArchiveLogDictFile,
    constants::cArchiveArrayDictFile,
    constants::cArchiveTableMetadataFile,
    constants::cArchiveTablesFile
};

//...

if (archive_files.count(formatted_name) > 0) {
    continue;
} else {
    // ... Remaining code
}

91-97: Avoid using exceptions for control flow when parsing integers

Using exceptions for control flow can be inefficient. Instead of using std::stoi inside a try-catch block, consider checking if file_name is numeric before attempting to convert it.

Here is a possible refactor:

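// Check that file_name is non-empty and consists only of decimal digits.
// The lambda takes unsigned char because std::isdigit is undefined for negative values.
auto const is_digit = [](unsigned char c) { return 0 != std::isdigit(c); };
if (false == file_name.empty() && std::all_of(file_name.begin(), file_name.end(), is_digit)) {
    // It's a number, continue processing
    continue;
} else {
    return false;
}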
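Another exception-free option is std::from_chars from C++17, which both validates and converts in one step. The helper name try_parse_int is hypothetical. Unlike a digits-only check, std::from_chars accepts a leading minus sign and rejects values that overflow int.

```cpp
#include <charconv>
#include <string_view>
#include <system_error>

// Hypothetical helper: parses `text` as a base-10 int without throwing.
// Returns true only if the entire string was consumed and the value fits in an int.
bool try_parse_int(std::string_view text, int& value) {
    auto const* begin = text.data();
    auto const* end = begin + text.size();
    auto const [ptr, ec] = std::from_chars(begin, end, value);
    return std::errc{} == ec && ptr == end;
}
```

Checking that the returned pointer reached the end of the input is what rejects names such as "12a", which std::stoi would silently accept as 12.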
components/core/src/clp_s/CommandLineArguments.cpp (2)

426-451: Refactor duplicated code for handling archive_path and archive_id

The logic for processing archive_path and archive_id is duplicated in multiple commands (Extract and Search). Refactoring this code into a shared function can improve maintainability and reduce code duplication.

Also applies to: 694-719


301-306: Refactor authentication handling into a common utility function

The authentication type handling logic is repeated across different command parsing sections (Compress, Extract, and Search). Consider abstracting this into a shared utility function to reduce redundancy and enhance code maintainability.

Also applies to: 447-451, 715-719

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 880a741 and a64bd6b.

📒 Files selected for processing (23)
  • components/core/CMakeLists.txt (2 hunks)
  • components/core/src/clp_s/ArchiveReader.cpp (1 hunks)
  • components/core/src/clp_s/ArchiveReader.hpp (2 hunks)
  • components/core/src/clp_s/ArchiveWriter.hpp (0 hunks)
  • components/core/src/clp_s/CMakeLists.txt (3 hunks)
  • components/core/src/clp_s/CommandLineArguments.cpp (15 hunks)
  • components/core/src/clp_s/CommandLineArguments.hpp (3 hunks)
  • components/core/src/clp_s/InputConfig.cpp (1 hunks)
  • components/core/src/clp_s/InputConfig.hpp (1 hunks)
  • components/core/src/clp_s/JsonConstructor.cpp (3 hunks)
  • components/core/src/clp_s/JsonConstructor.hpp (2 hunks)
  • components/core/src/clp_s/JsonFileIterator.cpp (2 hunks)
  • components/core/src/clp_s/JsonFileIterator.hpp (3 hunks)
  • components/core/src/clp_s/JsonParser.cpp (7 hunks)
  • components/core/src/clp_s/JsonParser.hpp (5 hunks)
  • components/core/src/clp_s/ReaderUtils.cpp (2 hunks)
  • components/core/src/clp_s/ReaderUtils.hpp (2 hunks)
  • components/core/src/clp_s/Utils.cpp (2 hunks)
  • components/core/src/clp_s/Utils.hpp (2 hunks)
  • components/core/src/clp_s/ZstdDecompressor.cpp (2 hunks)
  • components/core/src/clp_s/clp-s.cpp (7 hunks)
  • components/core/src/clp_s/search/kql/CMakeLists.txt (1 hunks)
  • components/core/tests/test-clp_s-end_to_end.cpp (3 hunks)
💤 Files with no reviewable changes (1)
  • components/core/src/clp_s/ArchiveWriter.hpp
🧰 Additional context used
📓 Path-based instructions (19)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

This instruction applies once to each of the following files:

  • components/core/tests/test-clp_s-end_to_end.cpp
  • components/core/src/clp_s/ArchiveReader.hpp
  • components/core/src/clp_s/ZstdDecompressor.cpp
  • components/core/src/clp_s/ArchiveReader.cpp
  • components/core/src/clp_s/JsonParser.cpp
  • components/core/src/clp_s/JsonConstructor.cpp
  • components/core/src/clp_s/JsonConstructor.hpp
  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/JsonFileIterator.hpp
  • components/core/src/clp_s/Utils.cpp
  • components/core/src/clp_s/InputConfig.hpp
  • components/core/src/clp_s/ReaderUtils.hpp
  • components/core/src/clp_s/JsonFileIterator.cpp
  • components/core/src/clp_s/CommandLineArguments.hpp
  • components/core/src/clp_s/clp-s.cpp
  • components/core/src/clp_s/ReaderUtils.cpp
  • components/core/src/clp_s/CommandLineArguments.cpp
  • components/core/src/clp_s/Utils.hpp
  • components/core/src/clp_s/InputConfig.cpp

🔇 Additional comments (33)
components/core/src/clp_s/ReaderUtils.hpp (2)

4-8: LGTM! Include directives are well-organized.

The new includes are properly ordered (standard library before project-specific headers) and necessary for the new functionality.


Line range hint 1-99: Verify the impact of removing get_archives.

The removal of get_archives method might affect existing code that relied on listing archives from a directory.

Let's verify the usage and potential impact:

✅ Verification successful

The removal of get_archives method appears to be safe

The verification shows that:

  • There are no direct references to ReaderUtils::get_archives in the codebase
  • The functionality for getting archives is properly handled through CommandLineArguments::get_archives_dir() and GlobalSQLiteMetadataDB
  • No test files reference this specific method

The removal aligns with the codebase's architecture where archive handling is managed through dedicated components rather than utility classes.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any remaining references to the removed get_archives method

# Search for direct method calls
echo "Searching for direct method calls to get_archives..."
rg "get_archives\(" --type cpp

# Search for potential comments or TODOs referencing the method
echo "Searching for references in comments..."
rg "get_archives" --type cpp

# Search for tests that might need updating
echo "Searching for test files that might need updating..."
fd "test.*\.(cpp|hpp)$" | xargs rg "get_archives"

Length of output: 3638


Script:

#!/bin/bash
# Let's check for any references to ReaderUtils::get_archives specifically
echo "Searching for ReaderUtils::get_archives usage..."
rg "ReaderUtils::get_archives" --type cpp

# Let's also check if there are any references in the test files specifically for ReaderUtils
echo "Searching for ReaderUtils tests..."
fd "test.*\.(cpp|hpp)$" | xargs rg "ReaderUtils"

# Let's check the git history for the specific removal
echo "Checking git history for the removal..."
git log -p --no-merges -S "get_archives" -- "components/core/src/clp_s/ReaderUtils.hpp"

Length of output: 365

components/core/tests/test-clp_s-end_to_end.cpp (2)

12-12: LGTM!

The new include is properly placed and necessary for the updated path handling functionality.


74-76: Consider adding test cases for network sources.

While the implementation correctly uses the new Path struct for filesystem sources, we should also verify the S3 ingestion functionality mentioned in the PR objectives.

Let's check if there are any existing S3-related test cases:

Consider adding test cases that:

  1. Verify S3 path handling
  2. Test error scenarios with invalid S3 credentials
  3. Validate behaviour with non-existent S3 buckets
components/core/src/clp_s/Utils.hpp (3)

8-9: LGTM: Required includes added and properly ordered.

The new includes for <string_view> and <vector> are necessary for the updated method signatures.


20-21: LGTM: Method name is now more descriptive.

The rename from find_all_files to find_all_files_in_directory better describes the method's purpose.


34-45: LGTM: Well-documented utility method with clear examples.

The get_last_non_empty_path_component method is thoroughly documented with examples that clarify its behaviour.

components/core/src/clp_s/CommandLineArguments.hpp (3)

15-15: LGTM: Include statement properly added

The addition of InputConfig.hpp is necessary for the new Path and NetworkAuthOption types.


55-57: LGTM: Well-designed getter methods

The getter methods are properly implemented with:

  • Const correctness
  • Efficient return by const reference
  • Clear naming that reflects their purpose

181-182: LGTM: Member variables properly structured

The member variables are well-organized and properly encapsulated. Let's verify the migration from file paths to input paths across the codebase.

✅ Verification successful

Migration from m_file_paths to m_input_paths verified

The codebase shows consistent usage of the new Path type and proper initialization of NetworkAuthOption:

  • The old m_file_paths member has been completely removed with no remaining references
  • m_input_paths is consistently used with the Path type across multiple components:
    • CommandLineArguments.hpp
    • JsonParser.hpp/cpp
    • InputConfig.hpp/cpp
  • NetworkAuthOption is properly initialized with default construction where used
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify consistent usage of Path instead of std::string for paths
# and proper initialization of NetworkAuthOption across the codebase

# Check for any remaining uses of m_file_paths
rg "m_file_paths" --type cpp --type hpp

# Check for consistent use of Path type with m_input_paths
ast-grep --pattern 'std::vector<std::string> $path_var'

# Check for proper initialization of NetworkAuthOption
ast-grep --pattern 'NetworkAuthOption $var;'

Length of output: 172


Script:

#!/bin/bash
# Let's try with correct file extensions and more specific patterns

# Check for any remaining uses of m_file_paths
rg "m_file_paths" -t cpp -t h -t hpp

# Check for path-related vector declarations
rg "std::vector<.*Path.*>" -t cpp -t h -t hpp

# Check NetworkAuthOption usage
rg "NetworkAuthOption" -t cpp -t h -t hpp

# Look for input paths usage
rg "m_input_paths" -t cpp -t h -t hpp

Length of output: 276


Script:

#!/bin/bash
# Search using just the basic extensions and include context

# Check for any remaining uses of m_file_paths
rg "m_file_paths" -t cpp -t h --context 2

# Check for path-related vector declarations
rg "std::vector<.*Path.*>" -t cpp -t h --context 2

# Check NetworkAuthOption usage
rg "NetworkAuthOption" -t cpp -t h --context 2

# Look for input paths usage
rg "m_input_paths" -t cpp -t h --context 2

# Also try finding all header files to understand available extensions
fd -e h -e hpp -e hxx

Length of output: 36707

components/core/src/clp_s/search/kql/CMakeLists.txt (1)

28-28: LGTM! Removal of Boost::filesystem aligns with C++17 transition.

The change correctly removes the Boost::filesystem dependency since the target already uses C++20 which includes std::filesystem.

components/core/CMakeLists.txt (1)

260-261: LGTM! Addition of InputConfig files.

The new InputConfig files align with the PR's objective to support network resource handling.

components/core/src/clp_s/CMakeLists.txt (1)

22-23: Review thread safety of file operations

The addition of FileReader, NetworkReader, and related utilities suggests concurrent operations. Ensure:

  • Thread-safe file and network operations
  • Proper resource cleanup in error cases

Also applies to: 29-30, 37-38, 43-43, 46-47, 49-49

✅ Verification successful

Thread safety and resource management are properly implemented

The NetworkReader implementation demonstrates robust thread safety and resource management through:

  • Proper synchronization using std::mutex, std::condition_variable, and std::atomic variables
  • RAII-based resource management with clear cleanup in the destructor
  • Thread-safe buffer pool management with synchronized acquire/release operations
  • Proper error handling with try-catch blocks and state management
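
For readers unfamiliar with the pattern, the synchronized acquire/release behaviour described in the list above can be sketched roughly as follows. This is a minimal illustration and not the actual NetworkReader code: the class name `BufferPool` and the methods `acquire`, `release`, and `num_free` are hypothetical.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical fixed-size pool of byte buffers; NOT the clp NetworkReader implementation.
class BufferPool {
public:
    BufferPool(std::size_t num_buffers, std::size_t buffer_size) {
        for (std::size_t i = 0; i < num_buffers; ++i) {
            m_free.emplace_back(buffer_size);
        }
    }

    // Blocks until a buffer is free, then hands ownership to the caller.
    std::vector<char> acquire() {
        std::unique_lock<std::mutex> lock{m_mutex};
        m_cv.wait(lock, [this] { return false == m_free.empty(); });
        auto buffer = std::move(m_free.back());
        m_free.pop_back();
        return buffer;
    }

    // Returns a buffer to the pool and wakes one waiting thread, if any.
    void release(std::vector<char> buffer) {
        {
            std::lock_guard<std::mutex> lock{m_mutex};
            m_free.push_back(std::move(buffer));
        }
        m_cv.notify_one();
    }

    // Number of buffers currently available.
    std::size_t num_free() {
        std::lock_guard<std::mutex> lock{m_mutex};
        return m_free.size();
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::vector<std::vector<char>> m_free;
};
```

The notification happens after the lock is dropped so that a woken thread does not immediately block on the mutex again.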

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for thread safety mechanisms and resource cleanup
ast-grep --pattern 'class $_Reader {
  $$$
  std::mutex $_
  $$$
}'

Length of output: 69


Script:

#!/bin/bash
# Search for FileReader implementation and mutex usage
rg -U "class .*Reader" -A 10

# Check for resource cleanup patterns
rg -U "~.*Reader|close\(\)|cleanup\(\)" -A 5

# Look for thread-safety related includes
rg "#include.*mutex|thread|atomic"

# Search for RAII patterns and resource management
rg "std::(unique_ptr|shared_ptr|lock_guard|scoped_lock)"

Length of output: 68699


Script:

#!/bin/bash
# Search for NetworkReader implementation to check thread safety
rg -U "class NetworkReader" -A 30

# Check for mutex and thread synchronization in NetworkReader
rg -U "std::mutex|std::thread|std::condition_variable" components/core/src/clp/NetworkReader

# Look for resource cleanup in NetworkReader destructor and error handling
rg -U "~NetworkReader|NetworkReader::cleanup" -A 10

Length of output: 4189


Script:

#!/bin/bash
# Search for synchronization primitives in NetworkReader implementation
rg -U "std::mutex|std::thread|std::condition_variable|std::atomic" components/core/src/clp/NetworkReader.hpp

# Look for buffer management and thread safety in NetworkReader
rg -U "class Buffer|BufferPool|synchronized|thread_safe" components/core/src/clp/NetworkReader.hpp

# Check for error handling and resource cleanup patterns
rg -U "try|catch|RAII|cleanup|release" components/core/src/clp/NetworkReader.cpp

Length of output: 1193

components/core/src/clp_s/ArchiveReader.cpp (1)

19-30: Proper input validation in the open method

The input validation checks in the open method enhance robustness by ensuring that the archive path is valid, the source is the filesystem, and that the path is a directory. This is a good practice to prevent potential errors during runtime.
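
The three checks named above (non-empty path, filesystem source, directory) could look roughly like the sketch below. The `InputSource` enum, the field names of `Path`, and the function `is_openable_archive_dir` are assumptions made for illustration; the PR's real definitions in InputConfig.hpp may differ.

```cpp
#include <filesystem>
#include <string>
#include <system_error>

// Hypothetical stand-ins for the PR's input types; the real field names may differ.
enum class InputSource { Filesystem, Network };

struct Path {
    InputSource source;
    std::string path;
};

// Returns true only for a non-empty local path that refers to an existing directory.
bool is_openable_archive_dir(Path const& archive_path) {
    if (archive_path.path.empty()) {
        return false;
    }
    if (InputSource::Filesystem != archive_path.source) {
        return false;  // Network archives are rejected in this sketch.
    }
    // The error_code overload reports failures by returning false instead of throwing.
    std::error_code ec;
    return std::filesystem::is_directory(archive_path.path, ec);
}
```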

components/core/src/clp_s/JsonConstructor.hpp (1)

30-35: Consistent update of JsonConstructorOption with new input structures

The modifications to JsonConstructorOption, replacing archives_dir and archive_id with archive_path and network_auth, are consistent with the updated input handling mechanism. This improves clarity and maintainability.

components/core/src/clp_s/JsonParser.hpp (1)

Line range hint 35-48: Enhancement of input handling and network authentication

The replacement of file_paths with input_paths of type std::vector<Path> and the addition of network_auth enhance the flexibility and scalability of the JsonParser. The changes are consistently applied in both the option struct and class members, aligning with the new input configuration approach.

Also applies to: 115-116

components/core/src/clp_s/ArchiveReader.hpp (2)

11-11: LGTM! Include changes align with network resource handling.

The addition of InputConfig.hpp and removal of boost/filesystem.hpp reflects the transition to the new input configuration management system.


37-37: LGTM! Method signature change supports network resources.

The updated signature properly supports both local and network-based archive access.

components/core/src/clp_s/clp-s.cpp (3)

15-15: LGTM! Proper curl initialization for network operations.

The addition of CurlGlobalInstance ensures proper initialization of libcurl for network operations.

Also applies to: 287-287


310-310: LGTM! Extract command properly handles network archives.

The changes correctly integrate network authentication and path handling for archive extraction.

Also applies to: 318-319


363-368: LGTM! Proper function call formatting.

The search_archive function call maintains good readability with proper parameter alignment.

components/core/src/clp_s/JsonParser.cpp (2)

23-24: LGTM! Proper member initialization.

The new member variables are correctly initialized in the constructor's initialization list.


439-439: LGTM! Consistent path reporting in error messages.

Error messages consistently use path.path for reporting file locations.

Also applies to: 473-473, 498-498, 532-532, 542-542

components/core/src/clp_s/JsonConstructor.cpp (2)

38-38: Function call updated with network authentication—Looks good!

The addition of m_option.network_auth to the m_archive_reader->open call correctly integrates network authentication into the archive opening process.


115-115: Duplicate concern regarding sanitization of get_archive_id()

As previously mentioned, when using m_archive_reader->get_archive_id()—here within the BSON document—it is important to validate and sanitize the archive ID to ensure data integrity and security.
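
One possible shape for such a check is sketched below. The allowed character set is an assumed policy chosen purely for illustration (the PR does not specify one), and `is_plausible_archive_id` is a hypothetical helper, not an existing function in clp_s.

```cpp
#include <algorithm>
#include <cctype>
#include <string_view>

// Hypothetical policy: accept only non-empty IDs made of ASCII letters, digits, '-' and '_'.
bool is_plausible_archive_id(std::string_view id) {
    if (id.empty()) {
        return false;
    }
    return std::all_of(id.begin(), id.end(), [](unsigned char c) {
        return 0 != std::isalnum(c) || '-' == c || '_' == c;
    });
}
```

A caller could reject the archive before the ID is written into the BSON document when this returns false.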

components/core/src/clp_s/ReaderUtils.cpp (2)

3-9: Includes are appropriate and necessary

The added include directives are correct and required for the new functionality.


154-161: Proper exception handling in try_create_file_reader

The function correctly handles exceptions and logs errors when opening files. Good job.
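
As a rough illustration of the "catch, log, return null" shape praised here, the sketch below uses `std::ifstream` in place of clp's reader classes. The function name `try_open_file` is hypothetical and the logging goes to `std::cerr` rather than the project's logger.

```cpp
#include <exception>
#include <fstream>
#include <iostream>
#include <memory>
#include <string>

// Hypothetical analogue of try_create_file_reader built on std::ifstream.
std::unique_ptr<std::ifstream> try_open_file(std::string const& path) {
    auto stream = std::make_unique<std::ifstream>();
    // Ask the stream to throw on open failure so the error path is explicit.
    stream->exceptions(std::ios::failbit | std::ios::badbit);
    try {
        stream->open(path, std::ios::binary);
    } catch (std::exception const& e) {
        std::cerr << "Failed to open file " << path << ": " << e.what() << '\n';
        return nullptr;
    }
    // After a successful open, report later errors through stream state instead.
    stream->exceptions(std::ios::goodbit);
    return stream;
}
```

Callers then only need a null check rather than their own try/catch block.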

components/core/src/clp_s/JsonFileIterator.cpp (1)

10-17: Constructor implementation is correct.

The constructor now correctly initializes m_reader with the provided clp::ReaderInterface& reader.

components/core/src/clp_s/Utils.cpp (5)

23-23: No issues found

The check for whether path is a directory is correctly implemented using std::filesystem::is_directory.


29-29: No issues found

Correctly checks if the directory is empty using std::filesystem::is_empty.


42-43: Efficient handling of empty directories

The code correctly disables recursion into empty directories, improving traversal efficiency.


126-148: No issues found

The function get_last_non_empty_path_component correctly retrieves the last non-empty component of a path.
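
To make the intended behaviour concrete, here is a small sketch of one way to obtain the last non-empty component with std::filesystem. It is written independently of the PR, so the name `last_non_empty_component` and the exact edge-case behaviour are assumptions; the function in Utils.cpp may differ.

```cpp
#include <filesystem>
#include <string>

// Hypothetical re-implementation: walks the path's components and keeps the last
// non-empty one, so a trailing separator does not yield an empty result.
std::string last_non_empty_component(std::string const& raw_path) {
    std::filesystem::path const path{raw_path};
    std::string last;
    for (auto const& component : path) {
        if (false == component.empty()) {
            last = component.string();
        }
    }
    return last;
}
```

This matters for archive paths because the archive ID is now derived from the path, and a path written with a trailing slash should resolve to the same ID as one without.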


150-161: No issues found

The function get_last_uri_component properly parses the URI and retrieves the last component using boost::urls.
