
Helm Deployment - Web Pod Crashing Due to Failed Migration #47

Open
WendesonSilva-neuro opened this issue Dec 10, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@WendesonSilva-neuro

While deploying Langfuse using the official Helm guide: Self-Hosting with Kubernetes Helm, the langfuse-web pod enters a CrashLoopBackOff state. Logs from the pod indicate a migration failure during startup:

273 migrations found in prisma/migrations

Error: P3009

migrate found failed migrations in the target database, new migrations will not be applied. Read more about how to resolve migration issues in a production database: https://pris.ly/d/migrate-resolve
The `20230918180320_add_indices` migration started at 2024-12-09 20:48:44.914801 UTC failed

Steps to Reproduce:

  1. Follow the steps in the Langfuse Helm installation guide.
  2. Deploy using Helm.
  3. The langfuse-web pod fails to start and enters CrashLoopBackOff.

Expected Behavior:
The deployment should succeed, and all pods should be running without errors.

Actual Behavior:
The langfuse-web pod fails during startup due to a failed Prisma migration (20230918180320_add_indices).

Additional Context:

  • What is the recommended way to resolve this migration issue?

  • Let me know if additional logs or debugging information would help.

Thank you!


dosubot bot commented Dec 10, 2024

Hey @WendesonSilva-neuro! I'm here to help you with the issue you're facing. Feel free to ask me anything about bugs, questions, or becoming a contributor.

To resolve the migration issue causing the langfuse-web pod to crash, you can follow these steps:

  1. Check the Migration Logs: Review the detailed logs of the migration process to identify the specific reason for the failure. This can provide insights into what went wrong during the 20230918180320_add_indices migration.

  2. Prisma Migration Resolution Guide: Refer to the Prisma documentation on resolving migration issues, particularly the section on handling failed migrations in a production database. The link provided in the error message (https://pris.ly/d/migrate-resolve) is a good starting point.

  3. Manual Intervention: If the migration failure is due to a specific database state or conflict, you might need to manually adjust the database schema to match the expected state before re-running the migration. This could involve adding missing indices or correcting schema discrepancies.

  4. Reapply Migrations: Once the underlying issue is resolved, you can attempt to reapply the migrations. This can be done by running the Prisma migration commands manually within the pod or using a Kubernetes job.

  5. Database Backup: Before making any changes, ensure you have a backup of your database to prevent data loss in case of further issues.

  6. Consult the Helm Chart Configuration: Verify that your Helm chart configuration, especially the database connection settings, is correct. Ensure that the environment variables and values in values.yaml are properly set up for your database setup [1][2].

By following these steps, you should be able to address the migration failure and get the langfuse-web pod running successfully. If the issue persists, further investigation into the specific migration script and database state may be necessary.
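Steps 2–4 above can be sketched as a rough command sequence. This is a hedged sketch, not the official procedure: the deployment name, the failed migration name, and the availability of the Prisma CLI inside the langfuse-web container are assumptions you should verify against your own cluster.

```shell
# Check migration state from inside the web pod (assumes the Prisma CLI
# is available in the container image).
kubectl exec -it deploy/langfuse-web -- npx prisma migrate status

# Mark the failed migration as rolled back so Prisma will re-apply it,
# then restart the pod, which re-runs migrations on startup.
kubectl exec -it deploy/langfuse-web -- \
  npx prisma migrate resolve --rolled-back 20230918180320_add_indices
kubectl rollout restart deploy/langfuse-web
```

`prisma migrate resolve --rolled-back <name>` is the mechanism the P3009 error message (https://pris.ly/d/migrate-resolve) points to; take a database backup first, as step 5 advises.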


@dosubot dosubot bot added the bug Something isn't working label Dec 10, 2024
@Steffen911
Contributor

Hey @WendesonSilva-neuro ,
Is this a fresh setup or did you upgrade an existing stack?

This issue may happen if Kubernetes kills the langfuse-web container before it completes its migrations. With the startupProbe and livenessProbe defaults, it should have about 45s to perform the migrations before being killed. Did you change those settings in any way?

If it's a fresh stack, you may try reinstalling the chart with a longer startupProbe duration. Alternatively, to make the current installation work, you can connect to the Postgres database and remove the last entry in the _prisma_migrations table, i.e. the one that recorded the failure. When the web container restarts, it should attempt the relevant migrations again.
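A minimal sketch of that manual fix, assuming psql access to the database backing Langfuse (the connection string is a placeholder; the table and column names come from Prisma's standard `_prisma_migrations` bookkeeping table):

```shell
psql "postgresql://postgres:<password>@<host>:5432/langfuse_dev" <<'SQL'
-- Failed migrations have finished_at = NULL; the logs column records the error.
SELECT migration_name, started_at, finished_at
FROM _prisma_migrations
WHERE finished_at IS NULL;

-- Remove the failed entry so the web container retries it on next start.
DELETE FROM _prisma_migrations
WHERE migration_name = '20230918180320_add_indices'
  AND finished_at IS NULL;
SQL
```

Back up the database before deleting rows; the DELETE is only safe when the migration genuinely left no partial state behind (for index-only migrations like this one, that is usually the case).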

Let me know if that helps or if you encounter any other problems.

@WendesonSilva-neuro
Author

Hi @Steffen911,

Thank you for your response!

I updated the Helm repository and applied the following values.yaml configuration, but unfortunately, the issue persists:

replicaCount: 1

image:
  repository: langfuse/langfuse
  pullPolicy: Always
  tag: 3

langfuse:
  nodeEnv: production

  web:
    livenessProbe:
      initialDelaySeconds: 120
      periodSeconds: 15
      timeoutSeconds: 5
      successThreshold: 1
      failureThreshold: 5

  worker:
    livenessProbe:
      initialDelaySeconds: 60
      periodSeconds: 15
      timeoutSeconds: 5
      successThreshold: 1
      failureThreshold: 5

  additionalEnv:
    ...

serviceAccount:
  create: true
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::-
  name: "dev-langfuse"

postgresql:
  host: -.us-east-1.rds.amazonaws.com
  auth:
    username: postgres
    password: -
    database: langfuse_dev
  deploy: false
  directUrl: -

clickhouse:
  deploy: true
  shards: 1
  replicaCount: 3
  resourcesPreset: large
  auth:
    username: default
    password: -

I also extended the livenessProbe settings for both langfuse-web and langfuse-worker pods to allow more time for startup. Despite these adjustments, the langfuse-web pod continues to enter a CrashLoopBackOff state with the same migration error:

Error: P3009

migrate found failed migrations in the target database, new migrations will not be applied. Read more about how to resolve migration issues in a production database: https://pris.ly/d/migrate-resolve
The `20240528214728_add_cursor_index_17` migration started at 2024-12-11 20:31:58.853881 UTC failed

I am using an external PostgreSQL database hosted on RDS.

Could you confirm if any additional steps are required to resolve this issue? Should I follow the manual migration rollback process you suggested, or is there another recommended approach given this setup?

Thanks again for your assistance!
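Note that the values.yaml above extends only the livenessProbe, while the earlier suggestion concerned the startupProbe, which is what buys the container time to finish migrations before liveness checks begin. A hedged sketch of the corresponding override follows; the exact key names and defaults may differ between chart versions, so check the chart's own values.yaml before applying:

```yaml
langfuse:
  web:
    startupProbe:
      initialDelaySeconds: 30
      periodSeconds: 10
      failureThreshold: 30   # ~30 * 10s = up to 5 minutes for migrations
```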

@Steffen911
Contributor

Hey @WendesonSilva-neuro ,
If the langfuse_dev database contains no relevant data, could you reset it and try again afterwards? Or did you already reset the state, and it runs into the issue again?

Can you share the contents of the _prisma_migrations table with me? The timestamps in there should indicate whether we're talking about a timing issue where more time might help.
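For reference, the table contents asked for here can be dumped with a query along these lines (placeholder connection string; column names are from Prisma's `_prisma_migrations` table):

```shell
psql "postgresql://postgres:<password>@<host>:5432/langfuse_dev" -c \
  "SELECT migration_name, started_at, finished_at, applied_steps_count
   FROM _prisma_migrations
   ORDER BY started_at DESC
   LIMIT 20;"
```

The gap between started_at and finished_at on successful rows shows how long each migration took, which indicates whether a longer startupProbe would help.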

@arhamhamood1306

arhamhamood1306 commented Dec 13, 2024

Hi @Steffen911,
I'm also facing a similar kind of issue. I used the exact same Helm setup, and when I deploy it gives the following error:

Prisma schema loaded from packages/shared/prisma/schema.prisma
Datasource "db": PostgreSQL database "postgres_langfuse", schema "public" at "langfuse-postgresql"

273 migrations found in prisma/migrations

No pending migrations to apply.
error: Dirty database version 1. Fix and force version.
Applying clickhouse migrations failed. This is mostly caused by the database being unavailable.
Exiting...

I also tried creating a database on Azure, but the error is the same. This is a fresh installation, and I have already tried increasing the probe timeouts. Screenshots attached.

[Screenshot: Web Pod Logs]

[Screenshot: All Pods]

@Steffen911
Contributor

@arhamhamood1306 Can you delete all tables in the clickhouse cluster and restart the web container?

@arhamhamood1306

@Steffen911 Can you please share the command to delete tables in the ClickHouse cluster? FYI, I have tried deleting the ClickHouse pods, but the issue is still there.

@Steffen911
Contributor

@arhamhamood1306 One option is to delete the PVCs associated with ClickHouse. Alternatively, connect with the ClickHouse CLI and run `DROP TABLE observations ON CLUSTER default`, and do the same for the traces, scores, and schema_migrations tables.
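The CLI route described above might look like this. The pod name is an assumption based on the chart's default ClickHouse StatefulSet naming; the table list is the one given above, and the operation is destructive (all ClickHouse event data is dropped):

```shell
# Drop the Langfuse tables so the ClickHouse migrations re-run from a
# clean state, then restart the web container to trigger them.
kubectl exec -it langfuse-clickhouse-shard0-0 -- clickhouse-client \
  --user default --password "$CLICKHOUSE_PASSWORD" --multiquery "
    DROP TABLE IF EXISTS observations ON CLUSTER default;
    DROP TABLE IF EXISTS traces ON CLUSTER default;
    DROP TABLE IF EXISTS scores ON CLUSTER default;
    DROP TABLE IF EXISTS schema_migrations ON CLUSTER default;
  "
kubectl rollout restart deploy/langfuse-web
```

Dropping schema_migrations is what clears the "Dirty database version 1" flag, since that table is where the migration tool records its state.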

@arhamhamood1306

@Steffen911 It seems to work now after deleting the PVCs. Thanks.
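For anyone landing here, the PVC route that worked in this thread would look roughly as follows. The StatefulSet name, replica count, and label selector are assumptions; list the PVCs first and adjust to what your release actually created. This wipes all ClickHouse data.

```shell
# Scale ClickHouse down, delete its persistent volume claims, scale back up.
kubectl scale statefulset langfuse-clickhouse-shard0 --replicas=0
kubectl get pvc -l app.kubernetes.io/name=clickhouse   # inspect before deleting
kubectl delete pvc -l app.kubernetes.io/name=clickhouse
kubectl scale statefulset langfuse-clickhouse-shard0 --replicas=3
```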
