fix: Deployment timeouts #278

Merged: adityachoudhari26 merged 1 commit into main from deployment-timeout on Jan 8, 2025

Conversation

Contributor

@adityachoudhari26 adityachoudhari26 commented Jan 7, 2025

Summary by CodeRabbit

Release Notes

  • New Features

    • Added deployment timeout configuration
    • Introduced a new job to check and mark jobs that exceed their timeout
  • Improvements

    • Enhanced deployment form with timeout input and validation
    • Added timeout management for tracking job progress
  • Database Changes

    • Added a new "timeout" column to the deployment table
    • Updated database migration tracking

The release introduces a timeout mechanism for deployments, allowing users to set maximum durations for jobs and automatically mark them as failed if they exceed the specified time.

Contributor

coderabbitai bot commented Jan 7, 2025

Walkthrough

This pull request introduces a new timeout feature for deployments across multiple components. A new "timeout-checker" job is added to periodically check and mark jobs that have exceeded their configured timeout duration. The changes span the database schema, job configuration, web interface, and a new background job to automatically handle job timeouts. The implementation allows setting optional timeout values for deployments and automatically fails jobs that run longer than their specified duration.
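The selection rule described in the walkthrough can be sketched in memory as follows. This is only an illustration under stated assumptions: the `JobRow` shape, the status strings, and `findTimedOutJobs` are hypothetical names, and the actual job runs a database query rather than filtering an array.

```typescript
// Hypothetical in-memory model of the rows the checker inspects.
type JobStatus = "in_progress" | "completed" | "failure";

interface JobRow {
  id: string;
  status: JobStatus;
  createdAt: Date;
  // Deployment timeout in seconds; null means no timeout is configured.
  timeoutSeconds: number | null;
}

// Returns the ids of in-progress jobs whose age strictly exceeds their timeout.
function findTimedOutJobs(jobs: JobRow[], now: Date): string[] {
  return jobs
    .filter((job) => {
      const timeout = job.timeoutSeconds;
      if (timeout === null) return false; // no timeout configured
      if (job.status !== "in_progress") return false; // only running jobs can time out
      const ageMs = now.getTime() - job.createdAt.getTime();
      return ageMs > timeout * 1000;
    })
    .map((job) => job.id);
}
```

Jobs without a configured timeout are never selected, which matches the "optional timeout" behaviour described above.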

Changes

| File | Change Summary |
| --- | --- |
| apps/jobs/src/index.ts | Added "timeout-checker" job configuration with a schedule of every minute |
| apps/jobs/src/timeout-checker/index.ts | New async run function to check and update jobs exceeding timeout |
| apps/webservice/src/.../EditDeploymentSection.tsx | Added timeout input field with validation and formatting |
| packages/db/drizzle/0051_brown_gambit.sql | Added timeout column to deployment table |
| packages/db/drizzle/meta/_journal.json | Updated migration journal |
| packages/db/src/schema/deployment.ts | Added timeout property to deployment schema and table |

Sequence Diagram

sequenceDiagram
    participant Job as Background Job
    participant DB as Database
    participant Checker as Timeout Checker

    loop Every Minute
        Checker->>DB: Query in-progress jobs
        DB-->>Checker: Return jobs with exceeded timeout
        Checker->>DB: Update job status to Failure
    end

Possibly related PRs

Suggested Reviewers

  • jsbroks

Poem

🐰 Timeout Tracker, swift and bright,
Watching jobs with rabbit's might,
When tasks run long, beyond their prime,
We mark them failed, just in time!
A coding bunny's vigilant rhyme 🕰️


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (4)
apps/jobs/src/index.ts (1)

25-25: Consider reducing the frequency of the timeout checker.

Running the timeout checker every minute might be unnecessarily frequent and could impact database performance. Consider a less frequent schedule (e.g., every 5 minutes) since timeouts are typically in minutes or hours.

-  "timeout-checker": { run: timeoutChecker, schedule: "* * * * *" },
+  "timeout-checker": { run: timeoutChecker, schedule: "*/5 * * * *" },
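The trade-off behind the schedule can be made concrete with a rough sketch. It assumes the checker ticks at a fixed interval and, like the query in this PR, only flags a job once its age strictly exceeds the timeout. `firstDetectionSeconds` and its parameters are hypothetical names used for illustration only.

```typescript
// All times are in seconds, measured from the moment the job was created.
// Assumption: checker ticks happen at firstTickOffset + k * interval for k = 0, 1, 2, ...
function firstDetectionSeconds(
  timeoutSeconds: number,
  intervalSeconds: number,
  firstTickOffsetSeconds: number,
): number {
  // Smallest k such that the tick time is strictly greater than the timeout.
  const k = Math.max(
    0,
    Math.floor((timeoutSeconds - firstTickOffsetSeconds) / intervalSeconds) + 1,
  );
  return firstTickOffsetSeconds + k * intervalSeconds;
}

// A 90-second timeout checked every minute versus every five minutes.
const everyMinute = firstDetectionSeconds(90, 60, 0);
const everyFiveMinutes = firstDetectionSeconds(90, 300, 0);
console.log(everyMinute - 90, everyFiveMinutes - 90); // overrun in seconds for each schedule
```

Under these assumptions a job can run past its timeout by up to one full interval before it is marked as failed, so a longer interval trades database load for detection latency.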
apps/webservice/src/app/[workspaceSlug]/(app)/systems/[systemSlug]/deployments/[deploymentSlug]/EditDeploymentSection.tsx (3)

46-63: Improve timeout validation and user guidance.

The timeout validation is good but could be more user-friendly.

 const timeoutSchema = z
   .string()
   .optional()
   .refine((val) => {
     if (val == null || val === "") return true;
     try {
       ms(val);
       return true;
     } catch {
       return false;
     }
-  }, "Invalid timeout, must be a valid duration string")
+  }, "Invalid timeout format. Examples: 1h, 30m, 1h30m")
   .refine((val) => {
     if (val == null || val === "") return true;
     const timeout = ms(val);
-    if (timeout < 1000) return false;
+    if (timeout < 1000 || timeout > 86400000) return false;
     return true;
-  }, "Timeout must be at least 1 second");
+  }, "Timeout must be between 1 second and 24 hours");
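As a rough, self-contained illustration of the two checks suggested above (format, then range), here is a sketch that does not use the `ms` helper from the PR. How that helper treats compound strings such as 1h30m or malformed input is not shown in this thread, so `parseDurationMs` and `validateTimeout` below are hypothetical stand-ins with a deliberately small grammar (whole numbers followed by s, m, h, or d).

```typescript
// Milliseconds per supported unit (assumed unit set for this sketch).
const UNIT_MS: Record<string, number> = {
  s: 1000,
  m: 60 * 1000,
  h: 60 * 60 * 1000,
  d: 24 * 60 * 60 * 1000,
};

// Parses strings such as "30m" or "1h30m" into milliseconds.
// Returns null instead of throwing when the input is malformed.
function parseDurationMs(input: string): number | null {
  const trimmed = input.trim();
  if (!/^(\d+[smhd])+$/.test(trimmed)) return null;
  const part = /(\d+)([smhd])/g;
  let total = 0;
  let match: RegExpExecArray | null;
  while ((match = part.exec(trimmed)) !== null) {
    total += Number(match[1]) * UNIT_MS[match[2]];
  }
  return total;
}

// Returns an error message, or null when the value is acceptable.
// Empty input is accepted because the timeout is optional.
function validateTimeout(input: string | undefined): string | null {
  if (input == null || input === "") return null;
  const timeoutMs = parseDurationMs(input);
  if (timeoutMs === null) return "Invalid timeout format. Examples: 1h, 30m, 1h30m";
  if (timeoutMs < 1000 || timeoutMs > 86400000) {
    return "Timeout must be between 1 second and 24 hours";
  }
  return null;
}
```

Returning null for malformed input, rather than relying on an exception, keeps the format check and the range check from silently passing a value that could not be parsed.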

236-261: Improve the timeout input field UI.

The timeout input field needs UI improvements for better usability.

           <FormField
             control={form.control}
             name="timeout"
             render={({ field }) => (
               <FormItem>
                 <FormLabel className="flex items-center gap-2">
                   Timeout
                   <TooltipProvider>
                     <Tooltip>
                       <TooltipTrigger>
                         <IconInfoCircle className="h-3 w-3 text-muted-foreground" />
                       </TooltipTrigger>
                       <TooltipContent className="p-2 text-xs text-muted-foreground">
-                        If a job for this deployment takes longer than the
-                        timeout, it will be marked as failed.
+                        Specify how long a job can run before being marked as failed.
+                        Examples: 1h, 30m, 1h30m (max 24h)
                       </TooltipContent>
                     </Tooltip>
                   </TooltipProvider>
                 </FormLabel>
                 <FormControl>
-                  <Input {...field} className="w-16" />
+                  <Input {...field} className="w-32" placeholder="e.g., 1h30m" />
                 </FormControl>
                 <FormMessage />
               </FormItem>
             )}
           />

107-111: Improve timeout conversion clarity.

The conversion between milliseconds and seconds could be more explicit.

-    const timeout =
-      data.timeout != null && data.timeout !== ""
-        ? ms(data.timeout) / 1000
-        : null;
+    // Convert duration string to seconds for database storage
+    const timeoutInSeconds =
+      data.timeout != null && data.timeout !== ""
+        ? Math.floor(ms(data.timeout) / 1000)
+        : null;
-    const updates = { ...data, resourceFilter: filter, timeout };
+    const updates = { ...data, resourceFilter: filter, timeout: timeoutInSeconds };
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 141f048 and b513bc6.

📒 Files selected for processing (6)
  • apps/jobs/src/index.ts (2 hunks)
  • apps/jobs/src/timeout-checker/index.ts (1 hunks)
  • apps/webservice/src/app/[workspaceSlug]/(app)/systems/[systemSlug]/deployments/[deploymentSlug]/EditDeploymentSection.tsx (6 hunks)
  • packages/db/drizzle/0051_brown_gambit.sql (1 hunks)
  • packages/db/drizzle/meta/_journal.json (1 hunks)
  • packages/db/src/schema/deployment.ts (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • packages/db/drizzle/0051_brown_gambit.sql
🧰 Additional context used
📓 Path-based instructions (4)
The same instruction (pattern **/*.{ts,tsx}) applied to each of these four files:

  • apps/jobs/src/timeout-checker/index.ts
  • packages/db/src/schema/deployment.ts
  • apps/jobs/src/index.ts
  • apps/webservice/src/app/[workspaceSlug]/(app)/systems/[systemSlug]/deployments/[deploymentSlug]/EditDeploymentSection.tsx

Note on Error Handling:
Avoid strict enforcement of try/catch blocks. Code may use early returns, Promise chains (.then().catch()), or other patterns for error handling. These are acceptable as long as they maintain clarity and predictability.

🔇 Additional comments (2)
apps/jobs/src/timeout-checker/index.ts (1)

24-27: Verify the timeout calculation logic.

The SQL interval calculation assumes the timeout is in seconds. Ensure this matches the timeout values stored in the database.

✅ Verification successful

Timeout calculation is correctly implemented using seconds

The timeout value is stored as a non-negative integer in the database and is explicitly used with interval '1 second' in the SQL query, confirming that the timeout calculation is consistent with the schema definition.
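To make the unit assumption explicit, the cutoff computed by `now() - timeout * interval '1 second'` can be reproduced in application code. This is a sketch only: `timeoutCutoff` and `isPastTimeout` are hypothetical helpers, and the stored value is assumed to be whole seconds as the verification above concludes.

```typescript
// Mirrors the SQL cutoff, assuming `timeoutSeconds` is the integer stored in the timeout column.
function timeoutCutoff(now: Date, timeoutSeconds: number): Date {
  return new Date(now.getTime() - timeoutSeconds * 1000);
}

// True when the job was created strictly before the cutoff, like `lt(...)` in the query.
function isPastTimeout(createdAt: Date, now: Date, timeoutSeconds: number): boolean {
  return createdAt.getTime() < timeoutCutoff(now, timeoutSeconds).getTime();
}

// A timeout of 1h30m is stored as 5400 seconds.
const checkedAt = new Date("2025-01-07T12:00:00Z");
console.log(timeoutCutoff(checkedAt, 5400).toISOString()); // 2025-01-07T10:30:00.000Z
```

If the column held milliseconds instead, the same expression would push the cutoff back a thousand times further, which is why the form converts to seconds before saving.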

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check the units used for timeout values in the codebase
rg -A 2 "timeout.*=|timeout.*:" --type ts --type tsx

Length of output: 82


Script:

#!/bin/bash
# Search for timeout-related code and configurations
rg -l "timeout" --type ts

# Search for schema definitions related to timeout
fd schema -e ts -x cat {}

# Check documentation
fd -e md -x grep -l "timeout" {} || true

Length of output: 42835


Script:

#!/bin/bash
# Check the deployment schema definition
cat packages/db/src/schema/deployment.ts

# Check the timeout checker implementation
cat apps/jobs/src/timeout-checker/index.ts

Length of output: 4620

packages/db/drizzle/meta/_journal.json (1)

361-367: LGTM! Migration entry follows the established pattern.

The new migration entry maintains consistency with existing entries and follows the correct sequence.

Let's verify the corresponding migration file exists and contains the timeout-related changes:

✅ Verification successful

✓ Migration and schema changes are correctly implemented

The timeout feature has been properly integrated:

  • Migration adds nullable integer column
  • Schema definition matches migration exactly
  • Includes proper validation for non-negative values
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the migration file exists and contains timeout-related changes

# Check if the migration file exists
ls packages/db/drizzle/0051_brown_gambit.sql

# Check the migration content for timeout-related changes
cat packages/db/drizzle/0051_brown_gambit.sql | grep -i "timeout"

# Verify the deployment schema changes
fd --type f "deployment.ts" | xargs rg -l "timeout"

Length of output: 309


Script:

#!/bin/bash
# Check the schema definition
cat packages/db/src/schema/deployment.ts | grep -A 5 -B 5 "timeout"

Length of output: 780

Comment on lines +7 to +34
export const run = async () =>
  db
    .select({ id: SCHEMA.job.id })
    .from(SCHEMA.deployment)
    .innerJoin(
      SCHEMA.release,
      eq(SCHEMA.release.deploymentId, SCHEMA.deployment.id),
    )
    .innerJoin(
      SCHEMA.releaseJobTrigger,
      eq(SCHEMA.releaseJobTrigger.releaseId, SCHEMA.release.id),
    )
    .innerJoin(SCHEMA.job, eq(SCHEMA.releaseJobTrigger.jobId, SCHEMA.job.id))
    .where(
      and(
        isNotNull(SCHEMA.deployment.timeout),
        eq(SCHEMA.job.status, JobStatus.InProgress),
        lt(
          SCHEMA.job.createdAt,
          sql`now() - ${SCHEMA.deployment.timeout} * interval '1 second'`,
        ),
      ),
    )
    .then(async (jobs) => {
      await Promise.all(
        jobs.map((job) => updateJob(job.id, { status: JobStatus.Failure })),
      );
    });
Contributor


🛠️ Refactor suggestion

Add error handling and improve observability.

The timeout checker implementation needs better error handling and logging for production reliability.

Consider these improvements:

-export const run = async () =>
-  db
+export const run = async () => {
+  // Assumes a `logger` instance is imported elsewhere in this module.
+  try {
+    const timedOutJobs = await db
     .select({ id: SCHEMA.job.id })
     .from(SCHEMA.deployment)
     .innerJoin(
       SCHEMA.release,
       eq(SCHEMA.release.deploymentId, SCHEMA.deployment.id),
     )
     .innerJoin(
       SCHEMA.releaseJobTrigger,
       eq(SCHEMA.releaseJobTrigger.releaseId, SCHEMA.release.id),
     )
     .innerJoin(SCHEMA.job, eq(SCHEMA.releaseJobTrigger.jobId, SCHEMA.job.id))
     .where(
       and(
         isNotNull(SCHEMA.deployment.timeout),
         eq(SCHEMA.job.status, JobStatus.InProgress),
         lt(
           SCHEMA.job.createdAt,
           sql`now() - ${SCHEMA.deployment.timeout} * interval '1 second'`,
         ),
       ),
-    )
-    .then(async (jobs) => {
-      await Promise.all(
-        jobs.map((job) => updateJob(job.id, { status: JobStatus.Failure })),
-      );
-    });
+    );
+
+    if (timedOutJobs.length > 0) {
+      logger.info(`Found ${timedOutJobs.length} timed-out jobs`);
+      // Process in batches of 50 to avoid overwhelming the system
+      for (let i = 0; i < timedOutJobs.length; i += 50) {
+        const batch = timedOutJobs.slice(i, i + 50);
+        await Promise.all(
+          batch.map(async (job) => {
+            try {
+              await updateJob(job.id, { status: JobStatus.Failure });
+              logger.info(`Marked job ${job.id} as failed due to timeout`);
+            } catch (error) {
+              logger.error(`Failed to update job ${job.id}:`, error);
+            }
+          }),
+        );
+      }
+    }
+  } catch (error) {
+    logger.error("Error in timeout checker:", error);
+    throw error;
+  }
+};
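The batching idea in this suggestion can be isolated into a small generic helper. The following is a minimal sketch rather than code from the PR: `processInBatches` is a hypothetical name, and it collects per-item failures instead of logging them so the caller decides how to report them.

```typescript
interface BatchResult<T> {
  succeeded: T[];
  failed: { item: T; error: unknown }[];
}

// Runs `worker` over `items` in fixed-size batches. Items within a batch run
// concurrently; one item's failure does not stop the rest of the batch.
async function processInBatches<T>(
  items: T[],
  batchSize: number,
  worker: (item: T) => Promise<void>,
): Promise<BatchResult<T>> {
  if (batchSize < 1) throw new Error("batchSize must be at least 1");
  const result: BatchResult<T> = { succeeded: [], failed: [] };
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    await Promise.all(
      batch.map(async (item) => {
        try {
          await worker(item);
          result.succeeded.push(item);
        } catch (error) {
          result.failed.push({ item, error });
        }
      }),
    );
  }
  return result;
}
```

With a helper like this, the checker could pass the list of timed-out job ids and a worker that calls `updateJob`, then log the `failed` entries once at the end.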

Comment on lines +48 to +54
  timeout: z
    .number()
    .nullable()
    .default(null)
    .refine((val) => val == null || val >= 0, {
      message: "Timeout must be a non-negative number.",
    }),
Contributor

@coderabbitai coderabbitai bot Jan 7, 2025


🛠️ Refactor suggestion

Add an upper limit for timeout values.

While the schema correctly validates for non-negative values, it should also have an upper limit to prevent unreasonably long timeouts.

   timeout: z
     .number()
     .nullable()
     .default(null)
-    .refine((val) => val == null || val >= 0, {
-      message: "Timeout must be a non-negative number.",
+    .refine((val) => val == null || (val >= 0 && val <= 86400), {
+      message: "Timeout must be between 0 and 86400 seconds (24 hours).",
     }),
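The bounded predicate proposed here can be exercised on its own, without the schema library. This is a sketch only: `isValidTimeoutSeconds` and `MAX_TIMEOUT_SECONDS` are hypothetical names, and the 24-hour ceiling is the reviewer's proposal rather than a limit the PR is confirmed to enforce.

```typescript
// Upper bound proposed in the review: 24 hours expressed in seconds.
const MAX_TIMEOUT_SECONDS = 86400;

// Mirrors the suggested refine predicate: null means "no timeout" and is allowed.
function isValidTimeoutSeconds(val: number | null): boolean {
  return val === null || (val >= 0 && val <= MAX_TIMEOUT_SECONDS);
}

console.log(isValidTimeoutSeconds(5400)); // a 1h30m timeout, within bounds
console.log(isValidTimeoutSeconds(MAX_TIMEOUT_SECONDS + 1)); // one second past the proposed limit
```

Keeping the limit in a named constant would let the form validation and the schema share one value if a limit is adopted.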

Contributor Author


why?

Contributor


@adityachoudhari26 Having an upper limit for deployment timeouts is important for several reasons:

  1. Resource Management: Without an upper limit, deployments could potentially run indefinitely, consuming system resources and potentially affecting other operations.
  2. User Experience: It prevents accidental input of extremely large values that could lock resources for unreasonable periods.
  3. Operational Predictability: A bounded timeout range helps in better capacity planning and system monitoring.

I see that you already have a minimum limit of 1 second in the UI. The suggested 24-hour upper limit is a common practice as it's usually sufficient for most deployment scenarios while preventing indefinite runs. However, if you have specific use cases that require longer timeouts, we can adjust this limit accordingly. What's the longest deployment duration you typically encounter?

🤔

🔗 Analysis chain

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for timeout-related configurations
rg -i "timeout.*=|timeout.*:" --type ts --type tsx -A 2
# Search for duration-related constants
rg -i "const.*duration|const.*timeout" --type ts --type tsx

Length of output: 175


Script:

#!/bin/bash
# Search for timeout-related configurations in TypeScript files
rg -t typescript -i "timeout.*=|timeout.*:" -A 2
# Search for duration-related constants
rg -t typescript -i "const.*duration|const.*timeout"

Length of output: 175


Script:

#!/bin/bash
# Search for timeout-related configurations
rg -i "timeout.*=|timeout.*:" --type-add 'ts:*.{ts,tsx}' -t ts -A 2
# Search for any timeout or duration constants
rg -i "const.*(?:timeout|duration)" --type-add 'ts:*.{ts,tsx}' -t ts

Length of output: 12947

@adityachoudhari26 adityachoudhari26 merged commit 935cb7e into main Jan 8, 2025
10 checks passed
@adityachoudhari26 adityachoudhari26 deleted the deployment-timeout branch January 8, 2025 00:01