Debug failed RDS connections #558

m0ar · 2024-10-09T12:27:38Z

We sometimes see failed connections to RDS in the app and need to figure out what's causing them:

m0ar · 2024-10-09T13:43:42Z

Some semi-qualified initial observations:

looks like Aurora Serverless RDS should allow for ~190 connections per ACU, so we should have a base headroom of about 400 connections (source)
Prisma defaults to a pool size of num_physical_cpus * 2 + 1 (source)
checked os.cpus() on a random desci-server pod, returns 4 logical cores. This could mean Prisma defaults to a pool size of 9 (4 * 2 + 1). Potentially overkill as we have a resource limit of 1 CPU on the pod, but I'm not sure if this limits us to 1 core/2 threads.
across all envs, we have 24 instances of desci-server => 24 * 9 = 216 open connections just for the main backend service
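
For reference, the arithmetic above as a rough sketch. The function names are made up for illustration, and it assumes Prisma sees the same core count that os.cpus() reports (logical cores), which may not match what Prisma actually detects as physical CPUs inside the pod. The 400 figure is the approximate headroom estimated above, not a measured limit.

```typescript
import * as os from "node:os";

// Prisma's documented default pool size: num_physical_cpus * 2 + 1.
function defaultPoolSize(cpuCount: number): number {
  return cpuCount * 2 + 1;
}

// Worst case: every instance holds a full pool of open connections.
function totalConnections(instances: number, poolSize: number): number {
  return instances * poolSize;
}

// Connections left before hitting the (estimated) database limit.
function remainingHeadroom(maxConnections: number, used: number): number {
  return maxConnections - used;
}

// Numbers from the observations above: 4 cores per pod, 24 instances, ~400 limit.
const poolSize = defaultPoolSize(4);
const used = totalConnections(24, poolSize);
console.log(`pool size per instance: ${poolSize}`);
console.log(`open connections across envs: ${used}`);
console.log(`estimated headroom left: ${remainingHeadroom(400, used)}`);

// What the same formula gives for the machine this runs on (logical cores).
console.log(`local estimate: ${defaultPoolSize(os.cpus().length)}`);
```

Note this only counts desci-server; any other service connecting to the same cluster eats into the remaining headroom.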

We should:

check the rds console for actual stats on connections
investigate potential errors on the rds side
see if we can adjust max_connections to fit our idle pool size
see if we can lower the pool size on the desci-server nodes if the autodetect doesn't work like it should
most importantly, implement connection retries where missing
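
A minimal sketch of the last two points, not a finished implementation. withConnectionLimit and withRetry are hypothetical helpers, not existing code in desci-server. The first relies on Prisma reading a connection_limit parameter from the database URL to cap the pool size; the second retries any failing async operation with exponential backoff. As written it retries on every error, so a real version should probably only retry on connection-level failures.

```typescript
// Hypothetical helper: cap Prisma's pool by setting `connection_limit`
// on the connection string before handing it to the client.
function withConnectionLimit(databaseUrl: string, limit: number): string {
  const url = new URL(databaseUrl);
  url.searchParams.set("connection_limit", String(limit));
  return url.toString();
}

// Hypothetical helper: run `operation`, retrying on failure with
// exponential backoff (baseDelayMs, 2x, 4x, ...). Rethrows the last
// error once maxAttempts is exhausted.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        const delayMs = baseDelayMs * 2 ** (attempt - 1);
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

Usage would look something like `await withRetry(() => prisma.user.findMany())`, with `prisma` being whatever client instance the service already has. With a limit of 3 per instance, 24 instances would hold at most 72 connections instead of 216.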
