False success after running the alba proxy-test a second time #351

Open
matthiasdeblock opened this issue Jul 7, 2017 · 8 comments

@matthiasdeblock
Contributor

False success after running the alba proxy-test a second time. This gave us a RECOVERY on a broken alba proxy. This needs to be 'INFO' instead of 'SUCCESS', as we do not want to get a RECOVERY on these.

@jeroenmaelbrancke and @jtorreke: Do you guys agree on this?

root@NY1SRV0011:~# ovs healthcheck alba proxy-test
[INFO] Storagerouter Id: 334JtHQZVZ4TJORV                                                                                                                                                                                                                                      
[INFO] Environment Os: Ubuntu 16.04 xenial                                                                                                                                                                                                                                     
[INFO] Hostname: NY1SRV0011                                                                                                                                                                                                                                                    
[INFO] Cluster Id: zUorEHTo5DmajQNd                                                                                                                                                                                                                                            
[INFO] Storagerouter Type: EXTRA                                                                                                                                                                                                                                               
[INFO] Starting OpenvStorage Healthcheck version 3.3.5-1                                                                                                                                                                                                                       
[INFO] ======================                                                                                                                                                                                                                                                  
[INFO] Checking the ALBA proxies.                                                                                                                                                                                                                                              
[INFO] Checking ALBA proxy albaproxy_data-ny1-04_1.                                                                                                                                                                                                                            
[SKIPPED] Preset global_no_encrypt is not in use and will not be checked                                                                                                                                                                                                       
[FAILED] Create namespace has failed with Command 'proxy-create-namespace' failed with 'Proxy exception: Proxy_protocol.Protocol.Error.Unknown'. on namespace ovs-healthcheck-ns-local_encrypt-334JtHQZVZ4TJORV_abf075c1-50da-4731-98d1-645bfa0e7129 with proxy albaproxy_data-ny1-04_1 with preset local_encrypt                                                                                                                                                                                                                                             
[SKIPPED] Preset local_ss_encrypt is not in use and will not be checked                                                                                                                                                                                                        
[INFO] Checking ALBA proxy albaproxy_data-g-02_1.                                                                                                                                                                                                                              
[INFO] Deleting namespace ovs-healthcheck-ns-global_encrypt-334JtHQZVZ4TJORV_a805b32b-38f2-4bbb-aa23-c72104386827.
[SUCCESS] Namespace ovs-healthcheck-ns-global_encrypt-334JtHQZVZ4TJORV_a805b32b-38f2-4bbb-aa23-c72104386827 successfully removed.
[EXCEPTION] Unhandled exception caught when executing check_if_proxies_work. Got Creation namespace has timed out after 30.6417989731s
[INFO] Recap of Health Check module alba test proxy-test!
[INFO] ======================
[INFO] SUCCESS=1 FAILED=1 SKIPPED=2 WARNING=0 EXCEPTION=1

root@NY1SRV0011:~# ovs healthcheck alba proxy-test
[INFO] Storagerouter Id: 334JtHQZVZ4TJORV
[INFO] Environment Os: Ubuntu 16.04 xenial
[INFO] Hostname: NY1SRV0011
[INFO] Cluster Id: zUorEHTo5DmajQNd
[INFO] Storagerouter Type: EXTRA
[INFO] Starting OpenvStorage Healthcheck version 3.3.5-1
[INFO] ======================
[SUCCESS] Test check_if_proxies_work is already being executed on this node.
[INFO] Recap of Health Check module alba test proxy-test!
[INFO] ======================
[INFO] SUCCESS=1 FAILED=0 SKIPPED=0 WARNING=0 EXCEPTION=0
@JeffreyDevloo
Contributor

JeffreyDevloo commented Jul 7, 2017

I thought the idea was that checkMK needed some form of input when calling the commands. As both SKIPPED and INFO are ignored, SUCCESS was the most valid one.
EDIT: origin ticket: #236
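
To make the trade-off concrete, here is a minimal sketch of the guard under discussion. This is not the actual Healthcheck code: the result_handler object and its method names are assumptions. The point is that the level reported for "already being executed" sits in one place and could be switched between SUCCESS, INFO, SKIPPED and WARNING.

# Hypothetical sketch only; result_handler and its success/info/skip/warning methods are assumed.
ALREADY_RUNNING_LEVEL = 'SKIPPED'  # candidates discussed in this issue: SUCCESS, INFO, SKIPPED, WARNING

def report_already_running(result_handler, test_name):
    """Report that a cluster-wide test is already being executed elsewhere."""
    message = 'Test {0} is already being executed on this node.'.format(test_name)
    reporters = {'SUCCESS': result_handler.success,
                 'INFO': result_handler.info,
                 'SKIPPED': result_handler.skip,
                 'WARNING': result_handler.warning}
    reporters[ALREADY_RUNNING_LEVEL](message)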

@jeroenmaelbrancke
Contributor

Indeed, Checkmk ignores SKIPPED, and INFO levels are not shown in the JSON output. Alternatively, you can use WARNING if you want to see it in Checkmk without being woken up for it.

@JeffreyDevloo
Contributor

@matthiasdeblock
Should we change it to warning?

@matthiasdeblock
Contributor Author

As long as it isn't INFO... We do not want to receive a SUCCESS for something that has already been tested on another node. That would give us false positives.

@wimpers

wimpers commented Nov 13, 2017

Please change to [SKIPPED]

@wimpers wimpers added this to the J milestone Nov 13, 2017
@JeffreyDevloo
Contributor

@wimpers SKIPPED is filtered out by operations.
WARNING seems to be the way to go, as it is something they catch, but we should also think about users such as GIG (they check on warning).

@JeffreyDevloo
Contributor

It is not feasible to make it a WARNING, as it would have the same result as the current situation (going from WARNING -> critical would trigger an alarm).
There are however some solutions around this problem:

Option 1: Integrate the healthcheck in the framework and work with caching and scheduling of checks

  • The framework would offload the checks to the workers
  • Healthcheck would return cached values (if none, execute, cache and return; see the caching sketch below)
  • Make the scheduling configurable for anyone
  • Add an option to bypass the cache and retrieve the latest state

PROS:

  • Managed by the framework team and able to keep it up to date with all changes
  • Most stable solution; caching can be bypassed by executing the check directly, so debugging would still work

CONS:

  • Not fully backwards compatible (might need some configuration work)
  • Most time-consuming (integrating into the framework would result in all helper functions being wiped, as they are wrappers around the framework)
  • The CLI exposure should go through the framework too

My estimate for this approach is 1.5 months (with the loss of Kvan, some knowledge will be lost).
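
A minimal caching sketch for this option, under assumed interfaces: the check function, the key/value store and the TTL are placeholders, not framework code. It returns a cached result while it is fresh, otherwise executes the check, caches the outcome and returns it; a refresh flag keeps direct execution for debugging possible.

import json
import time

CACHE_TTL = 300  # seconds; assumption, roughly aligned with a 5m checkmk scheduling group

def run_cached_check(check_name, check_fn, cache_store, refresh=False):
    # cache_store is any dict-like key/value store; in option 1 the framework would own it.
    key = 'healthcheck/cache/{0}'.format(check_name)
    if not refresh:
        raw = cache_store.get(key)
        if raw is not None:
            entry = json.loads(raw)
            if time.time() - entry['timestamp'] < CACHE_TTL:
                return entry['result']  # fresh cached value, no re-execution
    result = check_fn()  # run the actual check
    cache_store[key] = json.dumps({'timestamp': time.time(), 'result': result})
    return result

With a scheduled task populating the cache, a checkmk-triggered call would normally hit the cached branch; passing refresh=True corresponds to executing the check directly for debugging.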

Option 2: Healthcheck keeping its own results

  • The healthcheck would store its own results inside Arakoon
  • Other cluster checks would wait for the stored output of the function in order to SKIP (a sketch of such a decorator follows below)

PROS

  • Short term solution; revolves around the cluster_check decorator
  • Cached results could be a base for option 1 (unlikely though)

CONS:

  • Timing can be a problem (an issue that is currently present too; checkmk checks are grouped by scheduling time, 5m, 10m, ...)
  • Scheduling might be needed to populate the cache (avoids the con above)
  • Users who use the python code instead of the shell will be held up by these cluster checks (the shell interface starts a new process each time)

Without its own scheduling, I estimate it to be 2 weeks of work.
With the scheduling, I'd say go for option 1.
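
A sketch of the decorator side of this option, with all names assumed: cluster_check, the Arakoon-backed store with get/set, and the result_handler.skip method are placeholders rather than the real code. The first node stores its result under a key in Arakoon; a node that finds a result it does not own reports SKIPPED instead of re-running the check and claiming SUCCESS.

import functools
import json

def cluster_check(store, node_id):
    # store: any key/value view on Arakoon (get/set assumed); node_id: identifier of this node.
    def decorator(check_fn):
        @functools.wraps(check_fn)
        def wrapper(result_handler, *args, **kwargs):
            key = 'healthcheck/results/{0}'.format(check_fn.__name__)
            stored = store.get(key)
            if stored is not None:
                entry = json.loads(stored)
                if entry['owner'] != node_id:
                    # Another node already produced this result; do not report SUCCESS for it.
                    result_handler.skip('Test {0} was already executed by node {1}.'.format(
                        check_fn.__name__, entry['owner']))
                    return entry['result']
            result = check_fn(result_handler, *args, **kwargs)
            store.set(key, json.dumps({'owner': node_id, 'result': result}))
            return result
        return wrapper
    return decorator

Waiting for an in-progress run and expiring stale entries are deliberately left out here; those are exactly the timing and scheduling concerns listed under CONS.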

Option 3: change it to skipped

  • The healthcheck does not care about its consumers and just marks it as skipped

PROS

  • Fast implementation (5 minutes)

CONS

  • No backwards compatibility
  • Internal monitoring relies on CheckMK which does not support this route
  • Still stuck with timing issues as before

@wimpers wimpers modified the milestones: J, K Jan 16, 2018
@wimpers wimpers modified the milestones: K, M Mar 6, 2018
@wimpers wimpers modified the milestones: M, Roadmap Apr 13, 2018
@wimpers

wimpers commented May 17, 2018

If we pick this up and store results in Arakoon, we might want to do something with max_hours_zero_disk_safety, which was introduced in 6deb314#diff-a00aa6255fb5cf4f75fc130a527a174a but never used.
