Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(clp-package): Add support for deleting archives that are exclusively within a time range. #594

Merged
merged 20 commits into from
Nov 27, 2024

Conversation

haiqi96
Copy link
Contributor

@haiqi96 haiqi96 commented Nov 15, 2024

Description

Our current package implementation doesn't support deleting compressed logs. As a result, after packages are used for a period of time, the size of the logs could be huge and cause storage issues, for example, #585. It would be desirable to support something similar to retension period where logs can be deleted based on their timestamp.

There are two possible implementations to prune compressed logs based on their timestamp.
Given a timestamp range, begin_ts and end_ts.

  1. Open each archive, only delete the messages where begin_ts < msg.ts < end_ts. This requires a non-trivial implementation as it requires to modify the archive data.
  2. Simply remove any archives whose timestamps strictly fall within a specified range (i.e. begin_ts < archive.begin_ts < archive_end_ts < end_ts). Compared to method 1, this is only an approximation because for case like archive.begin_ts < begin_ts < archive_end_ts < end_ts. this method will skip the entire archive.

In this PR, we use the implementation for simplicity. We introduce a new package script, which is similar to the existing decompress.py. The script doesn't create any new job for scheduler, but directly operate on archives and databases. The native script takes user parameters begin_ts and end_ts and performs the following:

  1. look up the database and find all archives with archive.begin_ts < begin_ts < archive_end_ts < end_ts. returns the IDs.
  2. Find all files in the database with the IDs returned in step1. remove the archives and those files from the database.
  3. Delete archives from the local storage.

Validation performed

Manually compressed different archives with different begin & end ts. Ran the script with different ts parameters and verified that archives are deleted correctly.

Summary by CodeRabbit

  • New Features

    • Introduced a command-line utility for deleting outdated archives.
    • Added a new shell script for executing the archive deletion functionality.
  • Changes

    • Expanded job types with the addition of DEL_ARCHIVE.
    • Updated method signature for generating container names to accept strings instead of enumeration values.
    • Modified various scripts to ensure container names are generated as strings.
    • Improved error handling and logging clarity across multiple scripts.
    • Enhanced command-line interface usability with new argument options.

These enhancements improve the flexibility and usability of the archive management features.

Copy link
Contributor

coderabbitai bot commented Nov 15, 2024

Walkthrough

The changes in this pull request introduce a new enumeration value DEL_ARCHIVE to the JobType class and modify the method signature of generate_container_name to accept a string instead of a JobType parameter. Additionally, a new command-line utility for deleting outdated archives is added in del_archives.py, alongside a corresponding shell script. Other scripts, such as compress.py, decompress.py, and search.py, are updated to reflect the new string argument for generate_container_name. The overall structure and functionality of the existing code remain intact.

Changes

File Change Summary
components/clp-package-utils/clp_package_utils/general.py Added enumeration value DEL_ARCHIVE to JobType; changed generate_container_name to accept a str instead of JobType.
components/clp-package-utils/clp_package_utils/scripts/del_archives.py Introduced a command-line utility for deleting outdated archives, including logging and error handling. Added def main(argv).
components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py Implemented archive deletion logic with logging and error handling. Added def main(argv) and _delete_archives(...).
components/clp-package-utils/clp_package_utils/scripts/compress.py Changed generate_container_name argument from JobType.COMPRESSION to str(JobType.COMPRESSION).
components/clp-package-utils/clp_package_utils/scripts/decompress.py Updated generate_container_name argument from JobType.FILE_EXTRACTION to str(JobType.FILE_EXTRACTION).
components/clp-package-utils/clp_package_utils/scripts/search.py Altered generate_container_name argument from JobType.SEARCH to str(JobType.SEARCH).
components/package-template/src/sbin/admin-tools/del-archives.sh Added new shell script del-archives.sh to execute the del_archives Python module.
components/clp-package-utils/clp_package_utils/__init__.py Introduced logging setup for the module with a console handler and formatter.
components/clp-package-utils/clp_package_utils/scripts/native/compress.py Removed logger configuration; improved error handling for job processing.
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py Removed logging setup; core functionality remains unchanged.
components/clp-package-utils/clp_package_utils/scripts/native/search.py Removed logging setup; streamlined logic in job creation and monitoring functions.
components/clp-package-utils/clp_package_utils/scripts/start_clp.py Simplified logging setup; restructured main function for better command-line argument handling.
components/clp-package-utils/clp_package_utils/scripts/stop_clp.py Simplified logging setup; main function remains unchanged.

Possibly related PRs

Suggested reviewers

  • kirkrodrigues

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 1548ae3 and 73b656b.

📒 Files selected for processing (10)
  • components/clp-package-utils/clp_package_utils/scripts/compress.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/decompress.py (3 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/del_archives.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py (0 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (0 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/search.py (0 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/search.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/stop_clp.py (1 hunks)
💤 Files with no reviewable changes (3)
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py
  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
  • components/clp-package-utils/clp_package_utils/scripts/native/search.py
🚧 Files skipped from review as they are similar to previous changes (7)
  • components/clp-package-utils/clp_package_utils/scripts/compress.py
  • components/clp-package-utils/clp_package_utils/scripts/decompress.py
  • components/clp-package-utils/clp_package_utils/scripts/del_archives.py
  • components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py
  • components/clp-package-utils/clp_package_utils/scripts/search.py
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py
  • components/clp-package-utils/clp_package_utils/scripts/stop_clp.py

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@haiqi96
Copy link
Contributor Author

haiqi96 commented Nov 15, 2024

@coderabbitai review

Copy link
Contributor

coderabbitai bot commented Nov 15, 2024

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Outside diff range and nitpick comments (7)
components/package-template/src/sbin/del-archives.sh (1)

1-9: Add script documentation and usage information

The script would benefit from a header comment explaining its purpose and usage.

Add this documentation at the start of the script:

 #!/usr/bin/env bash
 
+# del-archives.sh - Delete archives based on timestamp
+#
+# This script executes the del_archives Python module which handles the deletion
+# of outdated archives based on specified time ranges and configurations.
+#
+# Usage: ./del-archives.sh [options]
+# For detailed options, run: ./del-archives.sh --help
+
 script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
components/clp-package-utils/clp_package_utils/scripts/del_archives.py (2)

67-73: Add debug logging for timestamp values.

Consider adding debug logging to show the actual timestamp values being used. This will help with troubleshooting issues.

Add logging before validation:

 begin_ts = parsed_args.begin_ts
 end_ts = parsed_args.end_ts
+logger.debug(f"Using time range: begin_ts={begin_ts}, end_ts={end_ts}")
 if end_ts < 0 or begin_ts < 0:

100-101: Add error handling for cleanup operation.

The cleanup step could fail silently. Consider adding error handling to ensure cleanup issues are logged.

Add error handling for cleanup:

-# Remove generated files
-generated_config_path_on_host.unlink()
+# Remove generated files
+try:
+    generated_config_path_on_host.unlink()
+except Exception as e:
+    logger.warning(f"Failed to clean up generated config file: {e}")
components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py (4)

20-20: Consider using __name__ instead of __file__ for the logger name.

Using __file__ may include the full path to the script, which can vary between environments and potentially expose sensitive directory structures. It's a common practice to use __name__, which provides a consistent and more secure logger name.

Apply this diff to update the logger initialization:

-logger = logging.getLogger(__file__)
+logger = logging.getLogger(__name__)

29-29: Revise the comment for clarity and grammar.

The comment # Consider deduplicate this could be rephrased for better clarity and grammatical correctness.

Consider updating it as follows:

-# Consider deduplicate this
+# Consider deduplicating this function

110-110: Handle exceptions during directory deletion to ensure robustness.

Deleting directories may fail due to permissions or other issues. It's advisable to handle exceptions to prevent the script from crashing and to log any errors encountered during deletion.

Wrap the deletion in a try-except block:

try:
    shutil.rmtree(archive_path)
except Exception as e:
    logger.error(f"Failed to delete archive {archive}: {e}")

131-137: Correct and simplify the help texts for begin_ts and end_ts arguments.

The help texts contain unnecessary string concatenations, and the end_ts help text incorrectly refers to "lower-bound" instead of "upper-bound".

Apply this diff to correct and simplify the help texts:

 args_parser.add_argument(
     "begin_ts",
     type=int,
-    help="Time range filter lower-bound (inclusive) as milliseconds" " from the UNIX epoch.",
+    help="Time range filter lower-bound (inclusive) as milliseconds from the UNIX epoch.",
 )
 args_parser.add_argument(
     "end_ts",
     type=int,
-    help="Time range filter lower-bound (inclusive) as milliseconds" " from the UNIX epoch.",
+    help="Time range filter upper-bound (inclusive) as milliseconds from the UNIX epoch.",
 )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between ac4f1c1 and dc9366b.

📒 Files selected for processing (4)
  • components/clp-package-utils/clp_package_utils/general.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/del_archives.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py (1 hunks)
  • components/package-template/src/sbin/del-archives.sh (1 hunks)
🔇 Additional comments (5)
components/package-template/src/sbin/del-archives.sh (1)

8-8: Verify Python module and its dependencies

Let's ensure the Python module and its dependencies are properly set up in the repository.

✅ Verification successful

Python module and dependencies are properly configured

The verification confirms that:

  • The Python module exists at components/clp-package-utils/clp_package_utils/scripts/del_archives.py
  • The module correctly imports all required dependencies
  • The module is properly referenced in the shell script, as it uses the native implementation at clp_package_utils.scripts.native.del_archives
  • The implementation follows a container-based execution pattern with appropriate configuration handling
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify Python module existence and its dependencies

# Check if the module exists
echo "Checking for del_archives module..."
fd "del_archives.py$" --type f

# Check module imports and dependencies
echo -e "\nChecking module dependencies..."
rg -l "^import|^from.*import" --type py | grep -i "del_archives"

Length of output: 555


Script:

#!/bin/bash
# Let's check the content of both Python files to understand their relationship and dependencies

echo "Content of components/clp-package-utils/clp_package_utils/scripts/del_archives.py:"
cat components/clp-package-utils/clp_package_utils/scripts/del_archives.py

echo -e "\nContent of components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:"
cat components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py

# Check if the module is imported or used elsewhere
echo -e "\nChecking usage of del_archives module across the codebase:"
rg "del_archives" -l

Length of output: 8785

components/clp-package-utils/clp_package_utils/scripts/del_archives.py (3)

1-28: LGTM! Well-structured imports and logging setup.

The imports are appropriate, and the logging configuration follows best practices with proper formatting and handlers.


106-107: LGTM! Proper main guard implementation.

The main entry point follows Python best practices with proper exit code handling.


89-89: Verify the native del_archives module exists.

This script depends on the native module "clp_package_utils.scripts.native.del_archives". Let's verify its existence and implementation.

✅ Verification successful

Native del_archives module is properly implemented and integrated

The native module exists at the correct location with a complete implementation that:

  • Accepts required command-line arguments (config file path, begin timestamp, end timestamp)
  • Handles archive deletion from both database and storage
  • Provides proper logging and error handling
  • Maintains consistent interface with the calling script
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the existence and implementation of the native del_archives module

# Check if the module exists
fd -t f "del_archives.py" components/clp-package-utils/clp_package_utils/scripts/native/

# If it exists, check its interface
rg -A 5 "def main" components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py

Length of output: 508


Script:

#!/bin/bash
# Let's verify the complete implementation and interface
cat components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py

Length of output: 4826

components/clp-package-utils/clp_package_utils/general.py (1)

54-54: LGTM! The new job type follows the existing pattern.

The addition of DEL_ARCHIVE to the JobType enum is consistent with the PR's objective of supporting archive deletion based on timestamp.

Let's verify the usage of this new job type:

✅ Verification successful

The new DEL_ARCHIVE job type is properly integrated and used

The verification confirms that:

  • The enum is correctly added to JobType class alongside other job types
  • It's properly used in del_archives.py with generate_container_name() function, following the same pattern as other job types
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Search for references to the new DEL_ARCHIVE job type
# Expected: References in the new del_archives.py scripts mentioned in the AI summary

# Search for direct string references
rg -l "DEL_ARCHIVE"

# Search for container name generation with the new job type
ast-grep --pattern 'generate_container_name($x)' -l

Length of output: 308


Script:

#!/bin/bash
# Let's check how DEL_ARCHIVE is used in both files
rg "DEL_ARCHIVE" -A 3 -B 3

# Check container name generation with proper language
ast-grep --pattern 'generate_container_name($x)' --lang python

Length of output: 1445

@haiqi96 haiqi96 marked this pull request as ready for review November 19, 2024 01:56
@haiqi96 haiqi96 changed the title feat(clp-package): Add support for deleting archives based on timestamp (Draft) feat(clp-package): Add support for deleting archives based on timestamp Nov 19, 2024
@haiqi96 haiqi96 changed the title feat(clp-package): Add support for deleting archives based on timestamp feat(clp-package): Add support for deleting archives based on timestamp. Nov 19, 2024
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a brand new script, let's use our latest conventions and best practices.

# Note, the error message doesn't output the value of archives_dir because it is a mounted
# path. It could be confusing for user because the path will not exist in their file system.
if not archives_dir.exists():
logger.error(f"archive directory doesn't exist. abort deletion")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.error(f"archive directory doesn't exist. abort deletion")
# TODO: Figure out a way to log `archives_dir` relative to the user's file system rather
# than the container's file system.
logger.error(f"archive directory doesn't exist. abort deletion")

@kirkrodrigues
Copy link
Member

  • I anticipate we'll need more of these kind of admin-only scripts, so how about moving them into scripts/admin_tools?
  • I think we can actually configure logging inside scripts/admin_tools/__init__.py to avoid repeating the logging configure preamable that we have in every script, right?
  • In a future PR, can we extend this script so that it can delete an archive by ID as well as list archives? I think that will help for an issue like How to remove clp file via management console? #597.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (2)
components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py (2)

29-76: Refactor main function to improve separation of concerns

The main function currently handles both argument parsing and configuration loading. Consider separating these responsibilities for better maintainability.

Consider refactoring like this:

def _load_and_validate_config(config_path: pathlib.Path, default_config_path: pathlib.Path, clp_home: pathlib.Path):
    try:
        clp_config = load_config_file(config_path, default_config_path, clp_home)
        clp_config.validate_logs_dir()
        return clp_config
    except Exception as e:
        logger.exception(f"Failed to load config: {str(e)}")
        raise

def main(argv):
    clp_home = get_clp_home()
    default_config_file_path = clp_home / CLP_DEFAULT_CONFIG_FILE_RELATIVE_PATH

    args_parser = argparse.ArgumentParser(
        description="Deletes archives that fall within the specified time range."
    )
    # ... argument parsing code ...
    parsed_args = args_parser.parse_args(argv[1:])

    try:
        config_file_path = pathlib.Path(parsed_args.config)
        clp_config = _load_and_validate_config(
            config_file_path,
            default_config_file_path,
            clp_home
        )
    except Exception:
        return -1

    if not clp_config.archive_output.directory.exists():
        logger.error("`archive_output.directory` doesn't exist.")
        return -1

    return _delete_archives(
        clp_config.archive_output.directory,
        clp_config.database,
        parsed_args.begin_ts,
        parsed_args.end_ts,
    )

78-92: Enhance function documentation

The docstring could be improved by adding type hints and more detailed parameter descriptions.

Consider updating the docstring:

def _delete_archives(
    archives_dir: pathlib.Path,
    database_config: Database,
    begin_ts: int,
    end_ts: int,
) -> int:
    """
    Deletes all archives where `begin_ts <= archive.begin_timestamp` and
    `archive.end_timestamp <= end_ts` from both the metadata database and disk.

    Args:
        archives_dir (pathlib.Path): Directory containing the archives
        database_config (Database): Database configuration object
        begin_ts (int): Lower bound timestamp in milliseconds (inclusive)
        end_ts (int): Upper bound timestamp in milliseconds (inclusive)

    Returns:
        int: 0 on success, -1 on failure

    Note:
        This operation is not atomic. If it fails after deleting from the database
        but before completing disk cleanup, manual intervention may be required.
    """
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 478d8ca and a16e8bc.

📒 Files selected for processing (3)
  • components/clp-package-utils/clp_package_utils/scripts/del_archives.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py (1 hunks)
  • components/package-template/src/sbin/admin_tools/del-archives.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/clp-package-utils/clp_package_utils/scripts/del_archives.py
🧰 Additional context used
📓 Learnings (1)
components/package-template/src/sbin/admin_tools/del-archives.sh (1)
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/package-template/src/sbin/del-archives.sh:6-9
Timestamp: 2024-11-15T16:28:08.644Z
Learning: For wrapper scripts in the `components/package-template/src/sbin/` directory, keep them simple and avoid adding additional validation code.
🔇 Additional comments (3)
components/package-template/src/sbin/admin_tools/del-archives.sh (2)

1-5: LGTM! Clean and effective directory setup.

The script correctly determines its location and the package root using robust path resolution.


6-9: Verify Python module path exists in the package.

The implementation looks good and follows the principle of keeping wrapper scripts simple. Let's verify the Python module path.

✅ Verification successful

Python module path and package structure are correctly implemented

The verification confirms that:

  • The Python module del_archives.py exists and is properly structured
  • The package hierarchy is complete with __init__.py files in all required directories
  • The module imports and uses the expected package structure (clp_package_utils.scripts.native.del_archives)
  • The package is located in the correct component (components/clp-package-utils)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the existence of the del_archives module and its package structure

# Check if the module exists in the expected location
fd --type f "del_archives.py" --exec echo "Module found: {}"

# Verify the package structure
fd --type f "__init__.py" --parent-dir "clp_package_utils" --exec echo "Package structure found: {}"

Length of output: 609


Script:

#!/bin/bash
# Let's verify the package structure more accurately
# First, check the content of the Python module
echo "=== Checking del_archives.py content ==="
cat ./components/clp-package-utils/clp_package_utils/scripts/del_archives.py

# Then verify the package structure by checking for __init__.py files
echo -e "\n=== Checking package structure ==="
fd "__init__.py" ./components/clp-package-utils/clp_package_utils

Length of output: 3867

components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py (1)

145-146: LGTM!

The main guard implementation follows Python best practices.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Outside diff range and nitpick comments (3)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)

Line range hint 1-150: Consider reorganizing script location as suggested in PR comments.

As per kirkrodrigues's feedback, consider moving this script to an appropriate directory based on its role (admin vs. user-facing) when implementing the suggested scripts/admin_tools reorganization.

components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py (2)

56-58: Validate config file path before loading

Add explicit path validation before attempting to load the config file.

Apply this diff:

     config_file_path = Path(parsed_args.config)
+    if not config_file_path.is_file():
+        logger.error(f"Config file {config_file_path} does not exist or is not a file.")
+        return -1
     try:
         clp_config = load_config_file(config_file_path, default_config_file_path, clp_home)

84-92: Enhance function documentation

The docstring should include type hints and better parameter descriptions.

Apply this diff:

     """
     Deletes all archives where `begin_ts <= archive.begin_timestamp` and
     `archive.end_timestamp <= end_ts` from both the metadata database and disk.
-    :param archives_dir:
-    :param database_config:
-    :param begin_ts:
-    :param end_ts:
+    :param archives_dir: Path to the directory containing archives
+    :param database_config: Database configuration object
+    :param begin_ts: Start timestamp in milliseconds since epoch (inclusive)
+    :param end_ts: End timestamp in milliseconds since epoch (inclusive)
     :return: 0 on success, -1 otherwise.
+    :raises: Exception if database operations fail
     """
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 816892a and 4e73e4d.

📒 Files selected for processing (7)
  • components/clp-package-utils/clp_package_utils/general.py (2 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/compress.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/decompress.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/del_archives.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/search.py (1 hunks)
  • components/package-template/src/sbin/admin-tools/del-archives.sh (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • components/package-template/src/sbin/admin-tools/del-archives.sh
🚧 Files skipped from review as they are similar to previous changes (2)
  • components/clp-package-utils/clp_package_utils/general.py
  • components/clp-package-utils/clp_package_utils/scripts/del_archives.py
🔇 Additional comments (1)
components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py (1)

145-146: LGTM!

The entry point is correctly implemented using the standard Python idiom.

Comment on lines 103 to 109
db_cursor.execute(
f"""
DELETE FROM `{table_prefix}archives` WHERE
begin_timestamp >= {begin_ts} AND end_timestamp <= {end_ts}
RETURNING id
"""
)
Copy link
Contributor

@coderabbitai coderabbitai bot Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Resolve timestamp handling in SQL query

The current implementation has an unresolved issue with large integer timestamps. Consider alternative approaches for handling large integers.

Here are potential solutions:

  1. Use string representation:
-                f"""
-                DELETE FROM `{table_prefix}archives` WHERE
-                begin_timestamp >= {begin_ts} AND end_timestamp <= {end_ts}
-                RETURNING id
-                """
+                f"DELETE FROM `{table_prefix}archives` WHERE begin_timestamp >= %s AND end_timestamp <= %s RETURNING id",
+                (str(begin_ts), str(end_ts))
  1. Use DECIMAL type:
from decimal import Decimal

db_cursor.execute(
    f"DELETE FROM `{table_prefix}archives` WHERE begin_timestamp >= %s AND end_timestamp <= %s RETURNING id",
    (Decimal(begin_ts), Decimal(end_ts))
)

Please test these solutions to determine which one resolves the integer overflow issue in your environment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, managed to resolve it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@haiqi96, great to hear that you resolved the issue!


If you found this review helpful, would you consider giving us a shout-out on X?
Thank you for using CodeRabbit!

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (3)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (3)

Line range hint 1-1000: Consider adding type hints consistently across all functions

While type hints have been added to add_num_workers_argument and main, other functions like container_exists, update_settings_object, etc., still lack type hints. Consider adding type hints consistently across all functions to improve code maintainability and enable better static type checking.


Line range hint 1-1000: Consider improving error handling and configuration management

Several areas could be enhanced:

  1. Error handling in subprocess calls could be more robust by capturing and logging specific error messages
  2. Container startup timeouts are hardcoded (e.g., 180s for DB, 60s for queue) - consider making these configurable
  3. Configuration validation is scattered across multiple functions - consider centralizing validation logic

Example of improved error handling:

-    subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
+    try:
+        result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True)
+    except subprocess.CalledProcessError as e:
+        logger.error(f"Failed to start container: {e.stderr.decode()}")
+        raise

Line range hint 1-1000: Address potential security concerns

Several security considerations should be addressed:

  1. Temporary files containing credentials should be securely deleted (e.g., using shred or similar secure deletion)
  2. Error messages should be sanitized to prevent potential exposure of sensitive information
  3. Consider implementing credential rotation mechanisms for long-running containers

Example of secure file deletion:

+def secure_delete(file_path: pathlib.Path):
+    """Securely delete a file containing sensitive information."""
+    subprocess.run(['shred', '-u', '-z', str(file_path)], check=True)

-    db_config_file_path.unlink()
+    secure_delete(db_config_file_path)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 9449c52 and 1127973.

📒 Files selected for processing (11)
  • components/clp-package-utils/clp_package_utils/__init__.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/compress.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/decompress.py (3 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/del_archives.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py (0 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (0 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/search.py (0 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/search.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/stop_clp.py (1 hunks)
💤 Files with no reviewable changes (3)
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py
  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py
  • components/clp-package-utils/clp_package_utils/scripts/native/search.py
✅ Files skipped from review due to trivial changes (1)
  • components/clp-package-utils/clp_package_utils/init.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • components/clp-package-utils/clp_package_utils/scripts/compress.py
  • components/clp-package-utils/clp_package_utils/scripts/decompress.py
  • components/clp-package-utils/clp_package_utils/scripts/del_archives.py
  • components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py
  • components/clp-package-utils/clp_package_utils/scripts/search.py
🔇 Additional comments (1)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)

63-63: LGTM: Logger initialization follows best practices

Using __file__ as the logger name provides better traceability and consistency across modules.

@@ -18,15 +18,8 @@
validate_and_load_db_credentials_file,
)

# Setup logging
# Create logger
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Create logger

Don't think this comment is necessary. Can delete it in all the refactored files.

Copy link
Member

@kirkrodrigues kirkrodrigues left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the PR title, how about:

feat(clp-package): Add support for deleting archives that are exclusively within a time range.

@haiqi96 haiqi96 changed the title feat(clp-package): Add support for deleting archives based on timestamp. feat(clp-package): Add support for deleting archives that are exclusively within a time range. Nov 27, 2024
@haiqi96 haiqi96 merged commit b8e22da into y-scope:main Nov 27, 2024
7 checks passed
jackluo923 pushed a commit to jackluo923/clp that referenced this pull request Dec 4, 2024
@haiqi96 haiqi96 deleted the del_archive_script branch December 6, 2024 20:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants