Restart caclmgrd whenever catch exception in child thread or in main thread #194

Open
wants to merge 1 commit into
base: master

ZhaohuiS
Contributor

  • Description
    If an exception occurs in a child thread of caclmgrd, the whole caclmgrd service gets stuck.
    It cannot recover until the caclmgrd service is restarted manually.
    With this change, when caclmgrd detects an exception in a child thread or in the main thread, it kills its own process so that systemd can restart the service.

  • Microsoft work item
    27122359

  • Test evidence

Dec 12 11:06:35.984388 bjw2-can-4600c-2 INFO caclmgrd[1692587]: Translating ACL rules for control plane ACL 'EXTERNAL_CLIENT_ACL' (service: 'EXTERNAL_CLIENT')
Dec 12 11:06:35.984530 bjw2-can-4600c-2 ERR caclmgrd[1692587]: Exception occured at Thread-2 thread due to KeyError('L4_DST_PORT')
Dec 12 11:06:35.985585 bjw2-can-4600c-2 ERR caclmgrd[1692587]: Traceback (most recent call last):
Dec 12 11:06:35.985677 bjw2-can-4600c-2 ERR caclmgrd[1692587]:   File "/usr/local/bin/caclmgrd", line 890, in check_and_update_control_plane_acls
Dec 12 11:06:35.985759 bjw2-can-4600c-2 ERR caclmgrd[1692587]:     self.update_control_plane_acls(namespace, new_config_db_connector)
Dec 12 11:06:35.985870 bjw2-can-4600c-2 ERR caclmgrd[1692587]:   File "/usr/local/bin/caclmgrd", line 827, in update_control_plane_acls
Dec 12 11:06:35.985982 bjw2-can-4600c-2 ERR caclmgrd[1692587]:     iptables_cmds, service_to_source_ip_map  = self.get_acl_rules_and_translate_to_iptables_commands(namespace, config_db_connector)
Dec 12 11:06:35.986100 bjw2-can-4600c-2 ERR caclmgrd[1692587]:   File "/usr/local/bin/caclmgrd", line 715, in get_acl_rules_and_translate_to_iptables_commands
Dec 12 11:06:35.986201 bjw2-can-4600c-2 ERR caclmgrd[1692587]:     dst_ports = [rule_props["L4_DST_PORT"]]
Dec 12 11:06:35.986304 bjw2-can-4600c-2 ERR caclmgrd[1692587]: KeyError: 'L4_DST_PORT'
Dec 12 11:06:35.986405 bjw2-can-4600c-2 ERR caclmgrd[1692587]: Exiting thread Thread-2, put it into exception_queue <queue.Queue object at 0x7fe3e44aec10>
Dec 12 11:09:30.917205 bjw2-can-4600c-2 INFO caclmgrd[1692587]: Checking for exceptions in the queue ...
Dec 12 11:09:30.917414 bjw2-can-4600c-2 ERR caclmgrd[1692587]: Exception in namespace '': KeyError('L4_DST_PORT')
Dec 12 11:09:30.917498 bjw2-can-4600c-2 ERR caclmgrd[1692587]: Traceback (most recent call last):
Dec 12 11:09:30.917561 bjw2-can-4600c-2 ERR caclmgrd[1692587]:   File "/usr/local/bin/caclmgrd", line 890, in check_and_update_control_plane_acls
Dec 12 11:09:30.917615 bjw2-can-4600c-2 ERR caclmgrd[1692587]:     self.update_control_plane_acls(namespace, new_config_db_connector)
Dec 12 11:09:30.917665 bjw2-can-4600c-2 ERR caclmgrd[1692587]:   File "/usr/local/bin/caclmgrd", line 827, in update_control_plane_acls
Dec 12 11:09:30.917715 bjw2-can-4600c-2 ERR caclmgrd[1692587]:     iptables_cmds, service_to_source_ip_map  = self.get_acl_rules_and_translate_to_iptables_commands(namespace, config_db_connector)
Dec 12 11:09:30.917773 bjw2-can-4600c-2 ERR caclmgrd[1692587]:   File "/usr/local/bin/caclmgrd", line 715, in get_acl_rules_and_translate_to_iptables_commands
Dec 12 11:09:30.917828 bjw2-can-4600c-2 ERR caclmgrd[1692587]:     dst_ports = [rule_props["L4_DST_PORT"]]
Dec 12 11:09:30.917882 bjw2-can-4600c-2 ERR caclmgrd[1692587]: KeyError: 'L4_DST_PORT'
Dec 12 11:09:30.917944 bjw2-can-4600c-2 ERR caclmgrd[1692587]: Detect exception in Child thread Thread-2 , generating SIGKILL for main thread
Dec 12 11:09:30.919860 bjw2-can-4600c-2 WARNING systemd[1]: caclmgrd.service: Main process exited, code=killed, status=9/KILL
Dec 12 11:09:30.919949 bjw2-can-4600c-2 WARNING systemd[1]: caclmgrd.service: Failed with result 'signal'.
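
The flow visible in the log above (a child thread puts its exception into `exception_queue`, the main thread later finds it and sends SIGKILL to its own process) can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names `guarded`, `check_exception_queue`, and `translate_rule` are hypothetical, and the `kill` parameter exists only so the sketch can be exercised without terminating the interpreter.

```python
import os
import queue
import signal
import threading
import traceback


def guarded(target, exception_queue):
    """Wrap a thread target so any exception is reported to the main thread."""
    def wrapper(*args, **kwargs):
        try:
            target(*args, **kwargs)
        except Exception as exc:
            # Hand the thread name, the exception, and the traceback text
            # to whoever polls the queue (the main thread in this sketch).
            exception_queue.put(
                (threading.current_thread().name, exc, traceback.format_exc())
            )
    return wrapper


def check_exception_queue(exception_queue, kill=os.kill):
    """Return False if no exception is queued; otherwise signal this process."""
    try:
        thread_name, exc, tb_text = exception_queue.get_nowait()
    except queue.Empty:
        return False
    print(f"Exception in {thread_name}: {exc!r}\n{tb_text}")
    # SIGKILL cannot be caught or ignored, so the process exits immediately
    # and the service manager reports status=9/KILL.
    kill(os.getpid(), signal.SIGKILL)
    return True


def translate_rule(rule_props):
    # Hypothetical stand-in for the failing lookup shown in the traceback.
    return [rule_props["L4_DST_PORT"]]
```

In a daemon, the main loop would call `check_exception_queue(exception_queue)` periodically with the default `kill`. Whether systemd then starts the service again depends on the unit's `Restart=` setting, which is not shown in this PR.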

@mssonicbld

/azp run


Azure Pipelines successfully started running 1 pipeline(s).
