🐛 TLS/SSL Errors in jobs carried out by Create-A-Derived-Table #3038
Labels: bug, data-platform-apps-and-tools, stale
Describe the bug.
As we experienced on the Cloud Platform runners, a DBT job running in a GitHub runner will, under certain (still unclear) circumstances, experience a temporary loss of service with AWS. When this occurs, the job fails with the error:
ssl.SSLZeroReturnError: TLS/SSL connection has been closed (EOF) (_ssl.c:1129)
This issue does not appear to depend on when the job runs (the same job run at different times experiences the same error), but it does appear to depend on the workload: if a job experiences these errors, it does so fairly consistently, with few runs entirely unaffected. No way has yet been found to replicate the errors on a developer machine.
This issue prevents tables associated with that job from deploying successfully. Currently, the impact is limited to NOMIS Curated, but we need to better understand why this issue is occurring so we can advise on mitigation/prevention.
To Reproduce
In the Deploy DBT Models step, affected tables are marked as ERROR. Any tables that depend on them will be marked as SKIP.
Expected Behaviour
Additional context
We would recommend running the pod during a period when the VPC flow logs can easily be audited, and potentially exec-ing into the pod while it is experiencing this error to better diagnose what is happening. It might also be worth monitoring the pods via Kibana or similar to see whether there is anything obviously anomalous in the resource usage.
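To line failures up against the VPC flow logs, it may help to pull the time of each occurrence out of a job log first. The following is only a rough sketch: it assumes each log line carries an `HH:MM:SS` timestamp somewhere before the error text, and the function name and sample lines are made up for illustration. Only the error signature comes from this issue.

```python
import re
from typing import Iterable, List, Tuple

# Error class name taken from the error quoted in this issue.
TLS_EOF_SIGNATURE = "ssl.SSLZeroReturnError"

# Assumption: log lines contain an HH:MM:SS timestamp.
TIME_PATTERN = re.compile(r"(\d{2}:\d{2}:\d{2})")


def find_tls_eof_errors(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, timestamp) for each line containing the TLS EOF error.

    The timestamp is an empty string when the line has no HH:MM:SS token.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if TLS_EOF_SIGNATURE in line:
            match = TIME_PATTERN.search(line)
            hits.append((number, match.group(1) if match else ""))
    return hits


# Hypothetical log excerpt, invented purely to exercise the function.
sample_log = [
    "10:15:01 job started",
    "10:15:07 ssl.SSLZeroReturnError: TLS/SSL connection has been closed (EOF) (_ssl.c:1129)",
    "10:15:08 job finished with errors",
]

print(find_tls_eof_errors(sample_log))
```

The resulting timestamps could then be used to narrow the window of flow log records to inspect for each failure.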