False success after running the alba proxy-test a second time #351

Open
matthiasdeblock opened this issue Jul 7, 2017 · 8 comments

@matthiasdeblock
Contributor

False success after running the alba proxy-test a second time. This gave us a RECOVERY on a broken alba proxy. This needs to be 'INFO' instead of 'SUCCESS', as we do not want to get a RECOVERY on these.

@jeroenmaelbrancke and @jtorreke: Do you guys agree on this?

root@NY1SRV0011:~# ovs healthcheck alba proxy-test
[INFO] Storagerouter Id: 334JtHQZVZ4TJORV                                                                                                                                                                                                                                      
[INFO] Environment Os: Ubuntu 16.04 xenial                                                                                                                                                                                                                                     
[INFO] Hostname: NY1SRV0011                                                                                                                                                                                                                                                    
[INFO] Cluster Id: zUorEHTo5DmajQNd                                                                                                                                                                                                                                            
[INFO] Storagerouter Type: EXTRA                                                                                                                                                                                                                                               
[INFO] Starting OpenvStorage Healthcheck version 3.3.5-1                                                                                                                                                                                                                       
[INFO] ======================                                                                                                                                                                                                                                                  
[INFO] Checking the ALBA proxies.                                                                                                                                                                                                                                              
[INFO] Checking ALBA proxy albaproxy_data-ny1-04_1.                                                                                                                                                                                                                            
[SKIPPED] Preset global_no_encrypt is not in use and will not be checked                                                                                                                                                                                                       
[FAILED] Create namespace has failed with Command 'proxy-create-namespace' failed with 'Proxy exception: Proxy_protocol.Protocol.Error.Unknown'. on namespace ovs-healthcheck-ns-local_encrypt-334JtHQZVZ4TJORV_abf075c1-50da-4731-98d1-645bfa0e7129 with proxy albaproxy_data-ny1-04_1 with preset local_encrypt                                                                                                                                                                                                                                             
[SKIPPED] Preset local_ss_encrypt is not in use and will not be checked                                                                                                                                                                                                        
[INFO] Checking ALBA proxy albaproxy_data-g-02_1.                                                                                                                                                                                                                              
[INFO] Deleting namespace ovs-healthcheck-ns-global_encrypt-334JtHQZVZ4TJORV_a805b32b-38f2-4bbb-aa23-c72104386827.
[SUCCESS] Namespace ovs-healthcheck-ns-global_encrypt-334JtHQZVZ4TJORV_a805b32b-38f2-4bbb-aa23-c72104386827 successfully removed.
[EXCEPTION] Unhandled exception caught when executing check_if_proxies_work. Got Creation namespace has timed out after 30.6417989731s
[INFO] Recap of Health Check module alba test proxy-test!
[INFO] ======================
[INFO] SUCCESS=1 FAILED=1 SKIPPED=2 WARNING=0 EXCEPTION=1

root@NY1SRV0011:~# ovs healthcheck alba proxy-test
[INFO] Storagerouter Id: 334JtHQZVZ4TJORV
[INFO] Environment Os: Ubuntu 16.04 xenial
[INFO] Hostname: NY1SRV0011
[INFO] Cluster Id: zUorEHTo5DmajQNd
[INFO] Storagerouter Type: EXTRA
[INFO] Starting OpenvStorage Healthcheck version 3.3.5-1
[INFO] ======================
[SUCCESS] Test check_if_proxies_work is already being executed on this node.
[INFO] Recap of Health Check module alba test proxy-test!
[INFO] ======================
[INFO] SUCCESS=1 FAILED=0 SKIPPED=0 WARNING=0 EXCEPTION=0
@JeffreyDevloo
Contributor

JeffreyDevloo commented Jul 7, 2017

I thought the idea was that checkMK needed some form of input when calling the commands. As both SKIPPED and INFO are ignored, SUCCESS was the most valid one.
EDIT: origin ticket: #236
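
To make the trade-off concrete, here is a minimal sketch of the guard under discussion. This is not the actual Healthcheck code: the result_handler object and its method names are assumptions. The point is that the level reported for "already being executed" sits in one place and could be switched between SUCCESS, INFO, SKIPPED and WARNING.

# Hypothetical sketch only; result_handler and its success/info/skip/warning methods are assumed.
ALREADY_RUNNING_LEVEL = 'SKIPPED'  # candidates discussed in this issue: SUCCESS, INFO, SKIPPED, WARNING

def report_already_running(result_handler, test_name):
    """Report that a cluster-wide test is already being executed elsewhere."""
    message = 'Test {0} is already being executed on this node.'.format(test_name)
    reporters = {'SUCCESS': result_handler.success,
                 'INFO': result_handler.info,
                 'SKIPPED': result_handler.skip,
                 'WARNING': result_handler.warning}
    reporters[ALREADY_RUNNING_LEVEL](message)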

@jeroenmaelbrancke
Contributor

Indeed, Checkmk ignores SKIPPED, and INFO levels are not shown in the JSON output. Alternatively, you can use WARNING if you want to see it in Checkmk without being woken up for it.

@JeffreyDevloo
Contributor

@matthiasdeblock
Should we change it to warning?

@matthiasdeblock
Contributor Author

As long as it isn't INFO... We do not want to receive a SUCCESS for something that has already been tested on another node. That would give us false positives.

@wimpers

wimpers commented Nov 13, 2017

Please change to [SKIPPED]

@wimpers wimpers added this to the J milestone Nov 13, 2017
@JeffreyDevloo
Contributor

@wimpers SKIPPED is filtered out by operations.
WARNING seems to be the way to go, as it is something they catch, but we should also think about users such as GIG (they check on warning).

@JeffreyDevloo
Contributor

It is not feasible to make it a WARNING, as it would have the same result as the current situation (going from WARNING -> critical would trigger an alarm).
There are however some solutions around this problem:

Option 1: Integrate the healthcheck in the framework and work with caching and scheduling of checks

  • The framework would offload the checks to the workers
  • Healthcheck would return cached values (if none, execute, cache and return; see the caching sketch below)
  • Make the scheduling configurable for anyone
  • Add an option to bypass the cache and retrieve the latest state

PROS:

  • Managed by the framework team and able to keep it up to date with all changes
  • Most stable solution; caching can be bypassed by executing the check directly, so debugging would still work

CONS:

  • Not fully backwards compatible (might need some configuration work)
  • Most time-consuming (integrating into the framework would result in all helper functions being wiped, as they are wrappers around the framework)
  • The CLI exposure should go through the framework too

My estimate for this approach is 1.5 months (with the loss of Kvan, some knowledge will be lost).
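
A minimal caching sketch for this option, under assumed interfaces: the check function, the key/value store and the TTL are placeholders, not framework code. It returns a cached result while it is fresh, otherwise executes the check, caches the outcome and returns it; a refresh flag keeps direct execution for debugging possible.

import json
import time

CACHE_TTL = 300  # seconds; assumption, roughly aligned with a 5m checkmk scheduling group

def run_cached_check(check_name, check_fn, cache_store, refresh=False):
    # cache_store is any dict-like key/value store; in option 1 the framework would own it.
    key = 'healthcheck/cache/{0}'.format(check_name)
    if not refresh:
        raw = cache_store.get(key)
        if raw is not None:
            entry = json.loads(raw)
            if time.time() - entry['timestamp'] < CACHE_TTL:
                return entry['result']  # fresh cached value, no re-execution
    result = check_fn()  # run the actual check
    cache_store[key] = json.dumps({'timestamp': time.time(), 'result': result})
    return result

With a scheduled task populating the cache, a checkmk-triggered call would normally hit the cached branch; passing refresh=True corresponds to executing the check directly for debugging.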

Option 2: Healthcheck keeping its own results

  • The healthcheck would store its own results inside Arakoon
  • Other cluster checks would wait for the stored output of the function in order to SKIP (a sketch of such a decorator follows below)

PROS

  • Short term solution; revolves around the cluster_check decorator
  • Cached results could be a base for option 1 (unlikely though)

CONS:

  • Timing can be a problem (an issue that is currently present too; checkmk checks are grouped by scheduling time, 5m, 10m, ...)
  • Scheduling might be needed to populate the cache (avoids the con above)
  • Users who use the python code instead of the shell will be held up by these cluster checks (the shell interface starts a new process each time)

Without its own scheduling, I estimate it to be 2 weeks of work.
With the scheduling, I'd say go for option 1.
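
A sketch of the decorator side of this option, with all names assumed: cluster_check, the Arakoon-backed store with get/set, and the result_handler.skip method are placeholders rather than the real code. The first node stores its result under a key in Arakoon; a node that finds a result it does not own reports SKIPPED instead of re-running the check and claiming SUCCESS.

import functools
import json

def cluster_check(store, node_id):
    # store: any key/value view on Arakoon (get/set assumed); node_id: identifier of this node.
    def decorator(check_fn):
        @functools.wraps(check_fn)
        def wrapper(result_handler, *args, **kwargs):
            key = 'healthcheck/results/{0}'.format(check_fn.__name__)
            stored = store.get(key)
            if stored is not None:
                entry = json.loads(stored)
                if entry['owner'] != node_id:
                    # Another node already produced this result; do not report SUCCESS for it.
                    result_handler.skip('Test {0} was already executed by node {1}.'.format(
                        check_fn.__name__, entry['owner']))
                    return entry['result']
            result = check_fn(result_handler, *args, **kwargs)
            store.set(key, json.dumps({'owner': node_id, 'result': result}))
            return result
        return wrapper
    return decorator

Waiting for an in-progress run and expiring stale entries are deliberately left out here; those are exactly the timing and scheduling concerns listed under CONS.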

Option 3: change it to skipped

  • The healthcheck does not care about its consumers and just marks it as skipped

PROS

  • Fast implementation (5 minutes)

CONS

  • No backwards compatibility
  • Internal monitoring relies on CheckMK which does not support this route
  • Still stuck with timing issues as before

@wimpers wimpers modified the milestones: J, K Jan 16, 2018
@wimpers wimpers modified the milestones: K, M Mar 6, 2018
@wimpers wimpers modified the milestones: M, Roadmap Apr 13, 2018
@wimpers

wimpers commented May 17, 2018

If we pick this up and store results in Arakoon, we might want to do something with max_hours_zero_disk_safety, which was introduced in 6deb314#diff-a00aa6255fb5cf4f75fc130a527a174a but never used.
