Handle FreeSWITCH Gateway timeouts #763

Merged: 20 commits, Sep 24, 2024

@dwilkie (Collaborator) commented Sep 19, 2024

Problem Description

FreeSWITCH sometimes replies with a 200 OK to probe requests from the OpenSIPS load balancer, but when a request is load balanced to it, it times out: it fails to send a 200 OK back to the gateway within the 30-second time limit.

Since the load balancer still thinks the gateway is up, it continues to send requests to it. Even if we manually mark it as down using the lb_disable_dst() function (see: https://opensips.org/html/docs/modules/3.4.x/load_balancer.html#idp5699408), the probing mechanism brings it back up again.

Solution

To recover from this issue, I manually restarted the FreeSWITCH task. To automate this, we want to mark the task as unhealthy once we get enough timeouts.
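As a rough illustration of "enough timeouts", here is a minimal Python sketch of a per-gateway sliding-window counter. It is not the implementation in this PR: the class name, the threshold of 3, and the 300-second window are all hypothetical values chosen for the example.

```python
from collections import deque
import time


class TimeoutTracker:
    """Counts gateway timeouts per IP inside a sliding time window.

    Hypothetical helper: threshold and window are illustrative defaults.
    """

    def __init__(self, threshold=3, window_seconds=300.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = {}  # gateway IP -> deque of timeout timestamps

    def record_timeout(self, ip, now=None):
        """Record one timeout; return True if the gateway should be marked unhealthy."""
        now = time.monotonic() if now is None else now
        events = self._events.setdefault(ip, deque())
        events.append(now)
        # Drop timeouts that have fallen out of the window.
        while events and now - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.threshold
```

Using a window rather than a lifetime counter means an occasional isolated timeout does not eventually trigger a restart.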

Todo

Enable watchdog?

https://developer.signalwire.com/freeswitch/FreeSWITCH-Explained/Configuration/Sofia-SIP-Stack/

  • Reproduce the problem locally using an automated test
  • Log the IP address of the problem FreeSWITCH instance in OpenSIPS along with an error message.
  • Use JSON formatting for logging in OpenSIPS
  • Apply the same set of changes to the client gateway
  • Configure a log group subscription filter (see: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/SubscriptionFilters.html#LambdaFunctionExample) that matches a certain log level (e.g. Error).
  • Configure the subscription filter to invoke the Services Lambda function
  • Parse the message to obtain the IP of the FreeSWITCH task.
  • aws ecs update-service --force-new-deployment --service my-service --cluster cluster-name
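The parsing steps above could look roughly like the following Python sketch. CloudWatch Logs delivers subscription events to Lambda as base64-encoded, gzip-compressed JSON under event["awslogs"]["data"], with the log lines in a logEvents array. The message shape ("408-lb-response-error-<ip>" inside a JSON log line) is an assumption taken from the test event in this PR, the function names are hypothetical, and the actual Services Lambda function may be written differently. The restart step is left as a comment.

```python
import base64
import gzip
import json
import re

# Assumed message shape, based on the test event in this PR:
# "408-lb-response-error-<IPv4 address>"
ERROR_PATTERN = re.compile(r"^408-lb-response-error-(\d{1,3}(?:\.\d{1,3}){3})$")


def decode_subscription_event(event):
    """Decode the base64 + gzip payload that CloudWatch Logs sends to Lambda."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def extract_gateway_ips(payload):
    """Return the distinct FreeSWITCH IPs named in matching log events."""
    ips = []
    for log_event in payload.get("logEvents", []):
        try:
            entry = json.loads(log_event["message"])
        except (KeyError, TypeError, ValueError):
            continue  # Skip log lines that are not JSON.
        if not isinstance(entry, dict):
            continue
        match = ERROR_PATTERN.match(str(entry.get("message", "")))
        if match and match.group(1) not in ips:
            ips.append(match.group(1))
    return ips


def handler(event, context=None):
    """Hypothetical Lambda entry point: returns the IPs of the problem tasks."""
    ips = extract_gateway_ips(decode_subscription_event(event))
    # The restart itself (e.g. the `aws ecs update-service` call above)
    # would go here; it is omitted from this sketch.
    return ips
```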

Testing CloudWatch

aws-vault exec somleng/administrator -- aws logs put-log-events --log-group-name public-gateway-staging --log-stream-name testing --log-events "[{\"timestamp\":$(gdate +%s%3N),\"message\":\"{\\\"time\\\": \\\"Sep 22 07:24:26\\\", \\\"pid\\\": 82, \\\"level\\\": \\\"ALERT\\\", \\\"message\\\": \\\"408-lb-response-error-172.18.0.5\\\"}\"}]"
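The nested escaping in the command above is easy to get wrong. As an alternative, the --log-events value can be generated with a small Python sketch like the one below. The function name is hypothetical, and the pid of 82 and the message format are simply copied from the command above.

```python
import json
import time


def build_log_events(ip, level="ALERT", now=None):
    """Build the JSON string passed to `aws logs put-log-events --log-events`."""
    now = time.time() if now is None else now
    inner = {
        "time": time.strftime("%b %d %H:%M:%S", time.gmtime(now)),
        "pid": 82,  # Arbitrary value, copied from the example command.
        "level": level,
        "message": f"408-lb-response-error-{ip}",
    }
    # CloudWatch expects the timestamp in milliseconds since the epoch;
    # the inner log line is itself a JSON string.
    return json.dumps([{"timestamp": int(now * 1000), "message": json.dumps(inner)}])


if __name__ == "__main__":
    print(build_log_events("172.18.0.5"))
```

The printed string can then be passed as the --log-events argument in place of the hand-escaped one.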

Update

FreeSWITCH times out when the SwitchApp times out, which happens when the SwitchApp requests TwiML and does not get a response in time. This specific case was fixed here: somleng/open-ews#1592

@dwilkie marked this pull request as draft September 19, 2024 15:01

codecov bot commented Sep 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.54%. Comparing base (ce9646f) to head (235e8c5).
Report is 31 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop     #763   +/-   ##
========================================
  Coverage    98.54%   98.54%           
========================================
  Files          166      166           
  Lines         2891     2892    +1     
========================================
+ Hits          2849     2850    +1     
  Misses          42       42           


@dwilkie marked this pull request as ready for review September 24, 2024 16:39
@dwilkie merged commit 5a25a3d into develop Sep 24, 2024
32 of 33 checks passed
@dwilkie deleted the handle_freeswitch_gw_timeouts branch September 24, 2024 16:54