fix: Deployment timeouts #278
Walkthrough
This pull request introduces a new timeout feature for deployments across multiple components. A new "timeout-checker" job is added to periodically check and mark jobs that have exceeded their configured timeout duration. The changes span the database schema, job configuration, web interface, and a new background job to automatically handle job timeouts. The implementation allows setting optional timeout values for deployments and automatically fails jobs that run longer than their specified duration.
Sequence Diagram
sequenceDiagram
participant Job as Background Job
participant DB as Database
participant Checker as Timeout Checker
loop Every Minute
Checker->>DB: Query in-progress jobs
DB-->>Checker: Return jobs with exceeded timeout
Checker->>DB: Update job status to Failure
end
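Because the checker polls on a schedule rather than reacting the moment a deadline passes, a job can run somewhat past its timeout before it is marked as failed. The sketch below estimates that upper bound, ignoring query and update time. `worstCaseDetectionSeconds` is a hypothetical helper for illustration and is not part of this PR.

```typescript
// Hypothetical helper, not part of the PR.
// Returns the latest point (in seconds after job creation) at which a poller
// running every `intervalSeconds` will have seen the job as timed out.
function worstCaseDetectionSeconds(
  timeoutSeconds: number,
  intervalSeconds: number,
): number {
  if (timeoutSeconds < 0 || intervalSeconds <= 0) {
    throw new RangeError("timeout must be >= 0 and interval must be > 0");
  }
  // A job may reach its deadline right at a poll (the comparison is strict),
  // so it can wait up to one full interval for the next poll.
  return timeoutSeconds + intervalSeconds;
}

// A 10-minute timeout polled every minute versus every five minutes.
console.log(worstCaseDetectionSeconds(600, 60)); // 660
console.log(worstCaseDetectionSeconds(600, 300)); // 900
```

The polling interval therefore trades database load against how far a job may overrun its timeout.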
Actionable comments posted: 2
🧹 Nitpick comments (4)
apps/jobs/src/index.ts (1)
25-25: Consider reducing the frequency of the timeout checker.
Running the timeout checker every minute might be unnecessarily frequent and could impact database performance. Consider a less frequent schedule (e.g., every 5 minutes) since timeouts are typically in minutes or hours.
- "timeout-checker": { run: timeoutChecker, schedule: "* * * * *" },
+ "timeout-checker": { run: timeoutChecker, schedule: "*/5 * * * *" },
apps/webservice/src/app/[workspaceSlug]/(app)/systems/[systemSlug]/deployments/[deploymentSlug]/EditDeploymentSection.tsx (3)
46-63: Improve timeout validation and user guidance.
The timeout validation is good but could be more user-friendly.
  const timeoutSchema = z
    .string()
    .optional()
    .refine((val) => {
      if (val == null || val === "") return true;
      try {
        ms(val);
        return true;
      } catch {
        return false;
      }
-   }, "Invalid timeout, must be a valid duration string")
+   }, "Invalid timeout format. Examples: 1h, 30m, 1h30m")
    .refine((val) => {
      if (val == null || val === "") return true;
      const timeout = ms(val);
-     if (timeout < 1000) return false;
+     if (timeout < 1000 || timeout > 86400000) return false;
      return true;
-   }, "Timeout must be at least 1 second");
+   }, "Timeout must be between 1 second and 24 hours");
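One caveat on the suggested examples: as far as I know, the `ms` package parses a single number with a single unit and returns `undefined` rather than throwing for strings it cannot parse, so a compound value such as `1h30m` may not validate as the message implies. That is worth checking against the installed version. A minimal sketch of a parser that does accept compound strings, with the 1 second to 24 hour bounds from the suggestion, might look like this. `parseDuration` and `isValidTimeout` are hypothetical names, not part of this PR.

```typescript
// Hypothetical helpers, not part of the PR.
// Milliseconds per supported unit suffix.
const UNIT_MS: Record<string, number> = {
  s: 1000,
  m: 60000,
  h: 3600000,
};

// Parses "45s", "30m", "1h" or compound forms such as "1h30m" into
// milliseconds. Returns null when the string does not match that shape.
function parseDuration(input: string): number | null {
  const text = input.trim().toLowerCase();
  // The whole string must consist of one or more <integer><unit> pairs.
  if (!/^(\d+[smh])+$/.test(text)) return null;
  const pair = /(\d+)([smh])/g;
  let total = 0;
  let match: RegExpExecArray | null;
  while ((match = pair.exec(text)) !== null) {
    const [, digits = "0", unit = "s"] = match;
    total += Number(digits) * (UNIT_MS[unit] ?? 0);
  }
  return total;
}

const MIN_TIMEOUT_MS = 1000; // 1 second
const MAX_TIMEOUT_MS = 86400000; // 24 hours

// True when the string parses and falls inside the allowed range.
function isValidTimeout(input: string): boolean {
  const value = parseDuration(input);
  return value !== null && value >= MIN_TIMEOUT_MS && value <= MAX_TIMEOUT_MS;
}
```

With this shape, the two `.refine` steps could share one parse result instead of calling the parser twice.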
236-261: Improve the timeout input field UI.
The timeout input field needs UI improvements for better usability.
  <FormField
    control={form.control}
    name="timeout"
    render={({ field }) => (
      <FormItem>
        <FormLabel className="flex items-center gap-2">
          Timeout
          <TooltipProvider>
            <Tooltip>
              <TooltipTrigger>
                <IconInfoCircle className="h-3 w-3 text-muted-foreground" />
              </TooltipTrigger>
              <TooltipContent className="p-2 text-xs text-muted-foreground">
-               If a job for this deployment takes longer than the
-               timeout, it will be marked as failed.
+               Specify how long a job can run before being marked as failed.
+               Examples: 1h, 30m, 1h30m (max 24h)
              </TooltipContent>
            </Tooltip>
          </TooltipProvider>
        </FormLabel>
        <FormControl>
-         <Input {...field} className="w-16" />
+         <Input {...field} className="w-32" placeholder="e.g., 1h30m" />
        </FormControl>
        <FormMessage />
      </FormItem>
    )}
  />
107-111: Improve timeout conversion clarity.
The conversion between milliseconds and seconds could be more explicit.
- const timeout =
-   data.timeout != null && data.timeout !== ""
-     ? ms(data.timeout) / 1000
-     : null;
+ // Convert duration string to seconds for database storage
+ const timeoutInSeconds =
+   data.timeout != null && data.timeout !== ""
+     ? Math.floor(ms(data.timeout) / 1000)
+     : null;
- const updates = { ...data, resourceFilter: filter, timeout };
+ const updates = { ...data, resourceFilter: filter, timeout: timeoutInSeconds };
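The reason for `Math.floor` here is that the column stores whole seconds, while a duration such as 1500 ms divides to 1.5. A small sketch of the conversion in isolation, assuming the duration has already been parsed to milliseconds. `toTimeoutSeconds` is a hypothetical name, not part of this PR.

```typescript
// Hypothetical helper, not part of the PR.
// Converts a duration in milliseconds to whole seconds for an integer column.
// A null input stays null, meaning the deployment has no timeout.
function toTimeoutSeconds(durationMs: number | null): number | null {
  if (durationMs === null) return null;
  if (!isFinite(durationMs) || durationMs < 0) {
    throw new RangeError("duration must be a finite, non-negative number");
  }
  // Round down so a fractional result never reaches the integer column.
  return Math.floor(durationMs / 1000);
}

console.log(toTimeoutSeconds(1500)); // 1
console.log(toTimeoutSeconds(90 * 60 * 1000)); // 5400
```

Note that flooring maps anything under one second to 0, which is one reason the form enforces a minimum of 1 second before conversion.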
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- apps/jobs/src/index.ts (2 hunks)
- apps/jobs/src/timeout-checker/index.ts (1 hunks)
- apps/webservice/src/app/[workspaceSlug]/(app)/systems/[systemSlug]/deployments/[deploymentSlug]/EditDeploymentSection.tsx (6 hunks)
- packages/db/drizzle/0051_brown_gambit.sql (1 hunks)
- packages/db/drizzle/meta/_journal.json (1 hunks)
- packages/db/src/schema/deployment.ts (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- packages/db/drizzle/0051_brown_gambit.sql
🧰 Additional context used
📓 Path-based instructions (4)
The same instruction applies to each of these files:
- apps/jobs/src/timeout-checker/index.ts (1)
- packages/db/src/schema/deployment.ts (1)
- apps/jobs/src/index.ts (1)
- apps/webservice/src/app/[workspaceSlug]/(app)/systems/[systemSlug]/deployments/[deploymentSlug]/EditDeploymentSection.tsx (1)
Pattern **/*.{ts,tsx}: Note on Error Handling:
Avoid strict enforcement of try/catch blocks. Code may use early returns, Promise chains (.then().catch()), or other patterns for error handling. These are acceptable as long as they maintain clarity and predictability.
🔇 Additional comments (2)
apps/jobs/src/timeout-checker/index.ts (1)
24-27: Verify the timeout calculation logic.
The SQL interval calculation assumes the timeout is in seconds. Ensure this matches the timeout values stored in the database.
✅ Verification successful
Timeout calculation is correctly implemented using seconds
The timeout value is stored as a non-negative integer in the database and is explicitly used with interval '1 second' in the SQL query, confirming that the timeout calculation is consistent with the schema definition.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check the units used for timeout values in the codebase
rg -A 2 "timeout.*=|timeout.*:" --type ts --type tsx
Length of output: 82
Script:
#!/bin/bash
# Search for timeout-related code and configurations
rg -l "timeout" --type ts
# Search for schema definitions related to timeout
fd schema -e ts -x cat {}
# Check documentation
fd -e md -x grep -l "timeout" {} || true
Length of output: 42835
Script:
#!/bin/bash
# Check the deployment schema definition
cat packages/db/src/schema/deployment.ts
# Check the timeout checker implementation
cat apps/jobs/src/timeout-checker/index.ts
Length of output: 4620
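To make the unit check concrete, the SQL predicate `createdAt < now() - timeout * interval '1 second'` can be restated in plain TypeScript. The sketch below is an in-memory equivalent for reasoning and unit testing only; `JobRow` and `isTimedOut` are hypothetical names and the real check runs in the database.

```typescript
// Hypothetical in-memory equivalent of the SQL predicate; not part of the PR.
interface JobRow {
  createdAt: Date;
  // Seconds, as stored on the deployment; null means no timeout is configured.
  timeoutSeconds: number | null;
}

function isTimedOut(job: JobRow, now: Date): boolean {
  if (job.timeoutSeconds === null) return false;
  // Seconds are converted to milliseconds to match Date arithmetic.
  const deadlineMs = job.createdAt.getTime() + job.timeoutSeconds * 1000;
  // Strict comparison, mirroring `createdAt < now() - timeout`.
  return deadlineMs < now.getTime();
}

const createdAt = new Date("2025-01-01T00:00:00Z");
const job: JobRow = { createdAt, timeoutSeconds: 60 };
console.log(isTimedOut(job, new Date("2025-01-01T00:01:00Z"))); // false
console.log(isTimedOut(job, new Date("2025-01-01T00:01:01Z"))); // true
```

As in the query, the clock starts at the job's creation time, so any time spent before the job reaches the in-progress state counts toward the timeout.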
packages/db/drizzle/meta/_journal.json (1)
361-367: LGTM! Migration entry follows the established pattern.
The new migration entry maintains consistency with existing entries and follows the correct sequence.
Let's verify the corresponding migration file exists and contains the timeout-related changes:
✅ Verification successful
✓ Migration and schema changes are correctly implemented
The timeout feature has been properly integrated:
- Migration adds nullable integer column
- Schema definition matches migration exactly
- Includes proper validation for non-negative values
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the migration file exists and contains timeout-related changes
# Check if the migration file exists
ls packages/db/drizzle/0051_brown_gambit.sql
# Check the migration content for timeout-related changes
cat packages/db/drizzle/0051_brown_gambit.sql | grep -i "timeout"
# Verify the deployment schema changes
fd --type f "deployment.ts" | xargs rg -l "timeout"
Length of output: 309
Script:
#!/bin/bash
# Check the schema definition
cat packages/db/src/schema/deployment.ts | grep -A 5 -B 5 "timeout"
Length of output: 780
export const run = async () =>
  db
    .select({ id: SCHEMA.job.id })
    .from(SCHEMA.deployment)
    .innerJoin(
      SCHEMA.release,
      eq(SCHEMA.release.deploymentId, SCHEMA.deployment.id),
    )
    .innerJoin(
      SCHEMA.releaseJobTrigger,
      eq(SCHEMA.releaseJobTrigger.releaseId, SCHEMA.release.id),
    )
    .innerJoin(SCHEMA.job, eq(SCHEMA.releaseJobTrigger.jobId, SCHEMA.job.id))
    .where(
      and(
        isNotNull(SCHEMA.deployment.timeout),
        eq(SCHEMA.job.status, JobStatus.InProgress),
        lt(
          SCHEMA.job.createdAt,
          sql`now() - ${SCHEMA.deployment.timeout} * interval '1 second'`,
        ),
      ),
    )
    .then(async (jobs) => {
      await Promise.all(
        jobs.map((job) => updateJob(job.id, { status: JobStatus.Failure })),
      );
    });
🛠️ Refactor suggestion
Add error handling and improve observability.
The timeout checker implementation needs better error handling and logging for production reliability.
Consider these improvements:
- export const run = async () =>
-   db
+ // Assumes a `logger` instance is already imported in this module.
+ export const run = async () => {
+   try {
+     const timedOutJobs = await db
        .select({ id: SCHEMA.job.id })
        .from(SCHEMA.deployment)
        .innerJoin(
          SCHEMA.release,
          eq(SCHEMA.release.deploymentId, SCHEMA.deployment.id),
        )
        .innerJoin(
          SCHEMA.releaseJobTrigger,
          eq(SCHEMA.releaseJobTrigger.releaseId, SCHEMA.release.id),
        )
        .innerJoin(SCHEMA.job, eq(SCHEMA.releaseJobTrigger.jobId, SCHEMA.job.id))
        .where(
          and(
            isNotNull(SCHEMA.deployment.timeout),
            eq(SCHEMA.job.status, JobStatus.InProgress),
            lt(
              SCHEMA.job.createdAt,
              sql`now() - ${SCHEMA.deployment.timeout} * interval '1 second'`,
            ),
          ),
-       )
-       .then(async (jobs) => {
-         await Promise.all(
-           jobs.map((job) => updateJob(job.id, { status: JobStatus.Failure })),
-         );
-       });
+       );
+
+     if (timedOutJobs.length > 0) {
+       logger.info(`Found ${timedOutJobs.length} timed-out jobs`);
+       // Process in batches of 50 to avoid overwhelming the system
+       for (let i = 0; i < timedOutJobs.length; i += 50) {
+         const batch = timedOutJobs.slice(i, i + 50);
+         await Promise.all(
+           batch.map(async (job) => {
+             try {
+               await updateJob(job.id, { status: JobStatus.Failure });
+               logger.info(`Marked job ${job.id} as failed due to timeout`);
+             } catch (error) {
+               logger.error(`Failed to update job ${job.id}:`, error);
+             }
+           }),
+         );
+       }
+     }
+   } catch (error) {
+     logger.error("Error in timeout checker:", error);
+     throw error;
+   }
+ };
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// Assumes a `logger` instance is already imported in this module.
export const run = async () => {
  try {
    const timedOutJobs = await db
      .select({ id: SCHEMA.job.id })
      .from(SCHEMA.deployment)
      .innerJoin(
        SCHEMA.release,
        eq(SCHEMA.release.deploymentId, SCHEMA.deployment.id),
      )
      .innerJoin(
        SCHEMA.releaseJobTrigger,
        eq(SCHEMA.releaseJobTrigger.releaseId, SCHEMA.release.id),
      )
      .innerJoin(SCHEMA.job, eq(SCHEMA.releaseJobTrigger.jobId, SCHEMA.job.id))
      .where(
        and(
          isNotNull(SCHEMA.deployment.timeout),
          eq(SCHEMA.job.status, JobStatus.InProgress),
          lt(
            SCHEMA.job.createdAt,
            sql`now() - ${SCHEMA.deployment.timeout} * interval '1 second'`,
          ),
        ),
      );
    if (timedOutJobs.length > 0) {
      logger.info(`Found ${timedOutJobs.length} timed-out jobs`);
      // Process in batches of 50 to avoid overwhelming the system
      for (let i = 0; i < timedOutJobs.length; i += 50) {
        const batch = timedOutJobs.slice(i, i + 50);
        await Promise.all(
          batch.map(async (job) => {
            try {
              await updateJob(job.id, { status: JobStatus.Failure });
              logger.info(`Marked job ${job.id} as failed due to timeout`);
            } catch (error) {
              logger.error(`Failed to update job ${job.id}:`, error);
            }
          }),
        );
      }
    }
  } catch (error) {
    // Log and rethrow so the scheduler sees the run as failed.
    logger.error("Error in timeout checker:", error);
    throw error;
  }
};
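The batching idea in the suggestion can be separated from the database code and tested on its own. The following is a minimal sketch under that assumption; `processInBatches` is a hypothetical helper, not part of this PR, and the `task` callback stands in for a call such as `updateJob`.

```typescript
// Hypothetical helper, not part of the PR.
// Runs `task` over `items` in fixed-size batches. Items inside a batch run
// concurrently; a failing item is recorded instead of aborting the batch.
async function processInBatches<T>(
  items: readonly T[],
  batchSize: number,
  task: (item: T) => Promise<void>,
): Promise<{ succeeded: T[]; failed: { item: T; error: unknown }[] }> {
  if (batchSize < 1 || Math.floor(batchSize) !== batchSize) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const succeeded: T[] = [];
  const failed: { item: T; error: unknown }[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // Wait for the whole batch before starting the next one.
    await Promise.all(
      batch.map(async (item) => {
        try {
          await task(item);
          succeeded.push(item);
        } catch (error) {
          failed.push({ item, error });
        }
      }),
    );
  }
  return { succeeded, failed };
}
```

Returning the failed items lets the caller decide whether to log them, retry them on the next run, or fail the whole run, rather than fixing that policy inside the loop.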
timeout: z
  .number()
  .nullable()
  .default(null)
  .refine((val) => val == null || val >= 0, {
    message: "Timeout must be a non-negative number.",
  }),
🛠️ Refactor suggestion
Add an upper limit for timeout values.
While the schema correctly validates for non-negative values, it should also have an upper limit to prevent unreasonably long timeouts.
timeout: z
.number()
.nullable()
.default(null)
- .refine((val) => val == null || val >= 0, {
- message: "Timeout must be a non-negative number.",
+ .refine((val) => val == null || (val >= 0 && val <= 86400), {
+ message: "Timeout must be between 0 and 86400 seconds (24 hours).",
}),
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
timeout: z
  .number()
  .nullable()
  .default(null)
  .refine((val) => val == null || (val >= 0 && val <= 86400), {
    message: "Timeout must be between 0 and 86400 seconds (24 hours).",
  }),
why?
@adityachoudhari26 Having an upper limit for deployment timeouts is important for several reasons:
- Resource Management: Without an upper limit, deployments could potentially run indefinitely, consuming system resources and potentially affecting other operations.
- User Experience: It prevents accidental input of extremely large values that could lock resources for unreasonable periods.
- Operational Predictability: A bounded timeout range helps in better capacity planning and system monitoring.
I see that you already have a minimum limit of 1 second in the UI. The suggested 24-hour upper limit is a common practice as it's usually sufficient for most deployment scenarios while preventing indefinite runs. However, if you have specific use cases that require longer timeouts, we can adjust this limit accordingly. What's the longest deployment duration you typically encounter?
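If the right ceiling depends on the deployment, the limit could be a parameter rather than a hard-coded 86400. A rough sketch, written without Zod so it stands alone; `validateTimeoutSeconds` and `DEFAULT_MAX_TIMEOUT_SECONDS` are hypothetical names, and the default simply mirrors the 24-hour value suggested above.

```typescript
// Hypothetical validator, not part of the PR.
// The default mirrors the 24-hour cap suggested in the review.
const DEFAULT_MAX_TIMEOUT_SECONDS = 86400;

// Returns an error message, or null when the value is acceptable.
function validateTimeoutSeconds(
  value: number | null,
  maxSeconds: number = DEFAULT_MAX_TIMEOUT_SECONDS,
): string | null {
  if (value === null) return null; // no timeout configured
  if (Math.floor(value) !== value) {
    return "Timeout must be a whole number of seconds.";
  }
  if (value < 0 || value > maxSeconds) {
    return `Timeout must be between 0 and ${maxSeconds} seconds.`;
  }
  return null;
}

console.log(validateTimeoutSeconds(3600)); // null
console.log(validateTimeoutSeconds(86401, 172800)); // null
```

The same check could be expressed inside the existing `.refine` call; the point is only that the ceiling is passed in rather than fixed.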
🤔
🔗 Analysis chain
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for timeout-related configurations
rg -i "timeout.*=|timeout.*:" --type ts --type tsx -A 2
# Search for duration-related constants
rg -i "const.*duration|const.*timeout" --type ts --type tsx
Length of output: 175
Script:
#!/bin/bash
# Search for timeout-related configurations in TypeScript files
rg -t typescript -i "timeout.*=|timeout.*:" -A 2
# Search for duration-related constants
rg -t typescript -i "const.*duration|const.*timeout"
Length of output: 175
Script:
#!/bin/bash
# Search for timeout-related configurations
rg -i "timeout.*=|timeout.*:" --type-add 'ts:*.{ts,tsx}' -t ts -A 2
# Search for any timeout or duration constants
rg -i "const.*(?:timeout|duration)" --type-add 'ts:*.{ts,tsx}' -t ts
Length of output: 12947
Summary by CodeRabbit
Release Notes
The release introduces a timeout mechanism for deployments, allowing users to set maximum durations for jobs and automatically mark them as failed if they exceed the specified time.