
Release 4.6.0 - Alpha 1 - Workload benchmarks metrics #18874

Closed
3 tasks done
fdalmaup opened this issue Sep 7, 2023 · 5 comments
Comments


fdalmaup commented Sep 7, 2023

This issue tracks running all workload benchmarks for the current release candidate, reporting the results, and opening new issues for any errors encountered.

Workload benchmarks metrics information

- Main release candidate issue: #18858
- Version: 4.6.0
- Release candidate #: Alpha 1
- Tag: v4.6.0-alpha1
- Previous Workload benchmarks metrics issue: #18413

Test configuration

All tests will be run and workload performance metrics will be obtained for the following clustered environment configurations:

| # Agents | # Worker nodes |
|---|---|
| 50000 | 25 |

Test report procedure

All individual test checks must be marked as:

- **Pass**: The test ran successfully.
- **Xfail**: The test was expected to fail, and it did. The expected failure must be properly justified and reported in an issue.
- **Skip**: The test was not run. The skip must be properly justified and reported in an issue.
- **Fail**: The test failed. A new issue must be opened to evaluate and address the problem.

All test results must have one of the following statuses:

- 🟢 All checks passed.
- 🔴 There is at least one failed check.
- 🟡 There is at least one expected fail or skipped test, and no failures.
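The status rollup described above can be sketched as a small helper. This is an illustrative sketch, not part of the actual release pipeline; the function and value names are assumptions:

```python
def overall_status(checks):
    """Roll up individual check results ('pass', 'xfail', 'skip', 'fail')
    into the issue-level status emoji described above."""
    if any(c == "fail" for c in checks):
        return "🔴"  # at least one failed check
    if any(c in ("xfail", "skip") for c in checks):
        return "🟡"  # expected fails or skips, but no real failures
    return "🟢"  # all checks passed
```

For example, a run with only passes and one xfail rolls up to 🟡, matching the conclusion status of this issue.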

Any failing test must be properly addressed with a new issue, detailing the error and the possible cause. It must be included in the Fixes section of the current release candidate main issue.

Any expected fail or skipped test must have an issue justifying the reason. All auditors must validate the justification for an expected fail or skipped test.

An extended report of the test results must be attached as a zip or txt. This report can be used by the auditors to dig deeper into any possible failures and details.

Conclusions 🟡

All tests have been executed and the results can be found here and here.

The following already-reported defects were found:

API Performance 🟡

Cluster 🟡

The cluster tests were run manually since some changes need to be introduced in order to be run in the pipeline (wazuh/wazuh-qa#4298 and wazuh/wazuh-qa#4478).

Reliability

Two failures were found in these tests, both already reported:

Performance

For a detailed conclusion and report on the cluster performance metrics please refer to #18874 (comment).

Auditors validation

The definition of done for this issue is the validation of the conclusions and the test results by all auditors.

All checks below must be accepted in order to close this issue.


fdalmaup commented Sep 7, 2023

Issue update

An error in one of the final steps of the pipeline prevented the artifacts from being uploaded; it should be solved for Alpha 2. After modifying the pipeline for this issue, the following results were obtained:
artifacts.zip

API performance results


| Endpoint test name | Status | Issues Ref. |
|---|---|---|
| GET /cluster/local/info | 🟢 | |
| GET /cluster/nodes | 🟢 | |
| GET /cluster/healthcheck | 🟢 | |
| GET /cluster/status | 🟢 | |
| GET /cluster/local/config | 🟢 | |
| GET /cluster/api/config | 🟢 | |
| GET /cluster/configuration/validation | 🟢 | |
| GET /manager/status | 🟢 | |
| GET /manager/info | 🟢 | |
| GET /manager/configuration | 🟢 | |
| GET /manager/logs | 🔴 | Fixed in #17946. The test will be marked as XFAIL until the version in which the fix was introduced is tested (wazuh/wazuh-qa#4508). |
| GET /manager/api/config | 🟢 | |
| GET /manager/configuration/validation | 🟢 | |
| GET /mitre/groups | 🟢 | |
| GET /mitre/metadata | 🟢 | |
| GET /mitre/mitigations | 🟢 | |
| GET /mitre/references | 🟢 | |
| GET /mitre/software | 🟢 | |
| GET /mitre/tactics | 🟢 | |
| GET /mitre/techniques | 🟢 | |
| GET /overview/agents | 🟢 | |
| GET /tasks/status | 🟢 | |
| PUT /active-response | 🟡 | wazuh/wazuh-qa#1266 |
| PUT /rootcheck | 🟢 | |
| PUT /syscheck | 🟢 | |
| POST /groups | 🟢 | |
| POST /security/users | 🟢 | |
| POST /security/roles | 🟢 | |
| POST /security/policies | 🟢 | |
| POST /security/rules | 🟢 | |
| PUT /agents/group | 🟡 | Endpoint is not being tested (wazuh/wazuh-qa#3665). This related issue was opened in the past: #13872 |
| PUT /agents/group/new_test_group/restart | 🟢 | |
| PUT /agents/restart | 🟢 | |
| POST /agents | 🟢 | |
| POST /agents/insert | 🟢 | |
| POST /agents/insert/quick | 🟢 | |
| GET /agents | 🟢 | |
| GET /agents/no_group | 🟢 | |
| GET /agents/outdated | 🟢 | |
| GET /decoders | 🟢 | |
| GET /groups | 🟢 | |
| GET /lists | 🟢 | |
| GET /rules | 🟢 | |
| GET /security/users | 🟢 | |
| GET /security/roles | 🟢 | |
| GET /security/policies | 🟢 | |
| GET /security/rules | 🟢 | |
| DELETE /groups | 🟢 | |
| DELETE /agents | 🟢 | |
| DELETE /security/users | 🟢 | |
| DELETE /security/roles | 🟢 | |
| DELETE /security/policies | 🟢 | |
| DELETE /security/rules | 🟢 | |
| PUT /manager/restart | 🟢 | |
| PUT /cluster/restart | 🟢 | |


fdalmaup commented Sep 8, 2023

Cluster

No errors were found in the cluster.log file of the master, although some are present in the workers.
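The check described here mirrors what `test_cluster_error_logs` does: scan each node's `cluster.log` for lines containing "error", minus a white list of known benign patterns. A simplified sketch (the white-list contents and input shape are assumptions):

```python
import re

def find_error_lines(log_text, white_list=()):
    """Return the 'error' lines of a cluster.log, excluding white-listed
    patterns. Simplified version of the check in test_cluster_error_logs."""
    lines = re.findall(r"^.*?error.*?$", log_text,
                       flags=re.MULTILINE | re.IGNORECASE)
    return [line for line in lines
            if not any(pattern in line for pattern in white_list)]
```

Running this per node and collecting non-empty results yields the list of nodes with errors reported by the test.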

Performance

The Performance tests (performance/test_cluster/test_cluster_performance/test_cluster_performance.py) fail as reported here. Nevertheless, the Cluster tasks duration stats were manually obtained:

Cluster tasks duration
{
    "setup_phase": {
        "agent-info_sync": {
            "time_spent(s)": {
                "workers": {
                    "mean":("worker_8", 0.23868750000000002),
                    "max":("worker_13",1.397)
                },
                "master": {
                    "mean":("master", 0.10698632162661738),
                    "max":("master", 1.208)
                }
            }
        },
        "integrity_check": {
            "time_spent(s)": {
                "workers": {
                    "mean":("worker_8", 0.13771428571428568),
                    "max":("worker_25", 1.363)
                },
                "master": {
                    "mean":("master", 0.007136493795736558),
                    "max":("master", 0.145)
                }
            }
        },
        "integrity_sync": {
            "time_spent(s)": {
                "workers": {
                    "mean":("worker_12", 0.04213461538461538),
                    "max":("worker_23", 0.102)
                },
                "master": {
                    "mean":("master", 0.2697212355212356),
                    "max":("master", 1.528)
                }
            }
        }
    },
    "stable_phase": {
        "agent-info_sync": {
            "time_spent(s)": {
                "workers": {
                    "mean":("worker_4", 0.013296296296296297),
                    "max":("worker_4", 0.31)
                },
                "master": {
                    "mean":("master", 0.12942857142857142),
                    "max":("master", 0.153)
                }
            }
        },
        "integrity_check": {
            "time_spent(s)": {
                "workers": {
                    "mean":("worker_6", 0.010344827586206898),
                    "max":("worker_23", 0.034)
                },
                "master": {
                    "mean":("master", 0.003953488372093024),
                    "max":("master", 0.023)
                }
            }
        }
    }
}
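Each mean/max entry above is paired with the node it came from. A minimal sketch of how such a summary could be derived from per-node task durations; the input shape and selection rule (worst node by mean, node holding the global max) are assumptions about how the stats were produced:

```python
def summarize(durations):
    """Summarize per-node task durations as (node, value) pairs.

    durations: {node_name: [seconds, ...]}
    Returns the node with the highest mean and the node holding the
    global maximum, mirroring the shape of the report above.
    """
    means = {node: sum(vals) / len(vals) for node, vals in durations.items()}
    mean_node = max(means, key=means.get)
    max_node = max(durations, key=lambda node: max(durations[node]))
    return {
        "mean": (mean_node, means[mean_node]),
        "max": (max_node, max(durations[max_node])),
    }
```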

Reliability

Since the changes needed to run the tests with Python 3.7 have not been introduced yet (wazuh/wazuh-qa#4478), the tests were run locally:

Reliability tests results
====================================================================== test session starts ======================================================================
platform linux -- Python 3.10.13, pytest-7.1.2, pluggy-1.3.0
rootdir: /home/fdalmau/git/wazuh-qa/tests, configfile: pytest.ini
plugins: metadata-3.0.0, html-3.1.1, testinfra-5.0.0
collected 6 items                                                                                                                                               

test_cluster_connection/test_cluster_connection.py F                                                                                                      [ 16%]
test_cluster_error_logs/test_cluster_error_logs.py F                                                                                                      [ 33%]
test_cluster_master_logs_order/test_cluster_master_logs_order.py .                                                                                        [ 50%]
test_cluster_sync/test_cluster_sync.py .                                                                                                                  [ 66%]
test_cluster_task_order/test_cluster_task_order.py .                                                                                                      [ 83%]
test_cluster_worker_logs_order/test_cluster_worker_logs_order.py .                                                                                        [100%]

=========================================================================== FAILURES ============================================================================
____________________________________________________________________ test_cluster_connection ____________________________________________________________________

artifacts_path = 'artifacts'

    def test_cluster_connection(artifacts_path):
        """Verify that no worker disconnects from the master once they are connected.
    
        For each worker, this test looks for the first successful connection message
        in its logs. Then it looks for any failed connection attempts after the successful
        connection found above.
    
        Args:
            artifacts_path (str): Path where folders with cluster information can be found.
        """
        if not artifacts_path:
            pytest.fail("Parameter '--artifacts_path=<path>' is required.")
    
        cluster_log_files = glob(join(artifacts_path, 'worker_*', 'logs', 'cluster.log'))
        if len(cluster_log_files) == 0:
            pytest.fail(f'No files found inside {artifacts_path}.')
    
        for log_file in cluster_log_files:
            with open(log_file) as f:
                s = mmap(f.fileno(), 0, access=ACCESS_READ)
                # Search first successful connection message.
                conn = re.search(rb'^.*Successfully connected to master.*$', s, flags=re.MULTILINE)
                if not conn:
                    pytest.fail(f'Could not find "Sucessfully connected to master" message in the '
                                f'{node_name.search(log_file)[1]}')
    
                # Search if there are any connection attempts after the message found above.
                if re.search(rb'^.*Could not connect to master. Trying.*$|^.*Sucessfully connected to master.*$',
                             s[conn.end():], flags=re.MULTILINE):
                    disconnected_nodes.append(node_name.search(log_file)[1])
    
        if disconnected_nodes:
>           pytest.fail(f'The following nodes disconnected from master at any point:\n- ' + '\n- '.join(disconnected_nodes))
E           Failed: The following nodes disconnected from master at any point:
E           - worker_1
E           - worker_10
E           - worker_7
E           - worker_19
E           - worker_21
E           - worker_13
E           - worker_11
E           - worker_25
E           - worker_6
E           - worker_2
E           - worker_18
E           - worker_9
E           - worker_23
E           - worker_14
E           - worker_15
E           - worker_4
E           - worker_3
E           - worker_24
E           - worker_22
E           - worker_12
E           - worker_20
E           - worker_17
E           - worker_5
E           - worker_16
E           - worker_8

test_cluster_connection/test_cluster_connection.py:47: Failed
____________________________________________________________________ test_cluster_error_logs ____________________________________________________________________

artifacts_path = 'artifacts'

    def test_cluster_error_logs(artifacts_path):
        """Look for any error messages in the logs of the cluster nodes.
    
        Any error message that is not included in the "white_list" will cause the test to fail.
        Errors found are attached to an html report if the "--html=report.html" parameter is specified.
    
        Args:
            artifacts_path (str): Path where folders with cluster information can be found.
        """
        if not artifacts_path:
            pytest.fail('Parameter "--artifacts_path=<path>" is required.')
    
        cluster_log_files = glob(join(artifacts_path, '*', 'logs', 'cluster.log'))
        if len(cluster_log_files) == 0:
            pytest.fail(f'No files found inside {artifacts_path}.')
    
        for log_file in cluster_log_files:
            with open(log_file) as f:
                s = mmap(f.fileno(), 0, access=ACCESS_READ)
                error_lines = re.findall(rb'(^.*?error.*?$)', s, flags=re.MULTILINE | re.IGNORECASE)
                if error_lines:
                    error_lines = [error for error in error_lines if not error_in_white_list(error)]
                    if error_lines:
                        nodes_with_errors.update({node_name.search(log_file)[1]: error_lines})
    
>       assert not nodes_with_errors, 'Errors were found in the "cluster.log" file of ' \
                                      'these nodes: \n- ' + '\n- '.join(nodes_with_errors)
E       AssertionError: Errors were found in the "cluster.log" file of these nodes: 
E         - worker_1
E         - worker_10
E         - worker_7
E         - worker_19
E         - worker_21
E         - worker_13
E         - worker_11
E         - worker_25
E         - worker_6
E         - worker_2
E         - worker_18
E         - worker_9
E         - worker_23
E         - worker_14
E         - worker_15
E         - worker_4
E         - worker_3
E         - worker_24
E         - worker_22
E         - worker_12
E         - worker_20
E         - worker_17
E         - worker_5
E         - worker_16
E         - worker_8
E       assert not {'worker_1': [b'2023/09/07 20:05:18 ERROR: [Worker CLUSTER-Workload_benchmarks_metrics_B291_manager_1] [Main] Error se...nager_12] [Main] Error sending sendsync response to local client: Error 3020 - Timeout sending request: ok', ...], ...}

test_cluster_error_logs/test_cluster_error_logs.py:57: AssertionError
==================================================================== short test summary info ====================================================================
FAILED test_cluster_connection/test_cluster_connection.py::test_cluster_connection - Failed: The following nodes disconnected from master at any point:
FAILED test_cluster_error_logs/test_cluster_error_logs.py::test_cluster_error_logs - AssertionError: Errors were found in the "cluster.log" file of these nodes: 
============================================================ 2 failed, 4 passed in 168.47s (0:02:48) ============================================================
  • test_cluster_connection/test_cluster_connection.py
    The failure of this test was already reported here.

  • test_cluster_error_logs/test_cluster_error_logs.py
    The failure is due to the following error type in the majority of the workers (already reported here):

2023/09/07 20:05:21 ERROR: [Worker CLUSTER-Workload_benchmarks_metrics_B291_manager_16] [Main] Error sending sendsync response to local client: Error 3020 - Timeout sending request: ok

davidjiglesias (Member) commented:

I see some failed tests (reliability), but in the conclusions above you mention there are no errors.

Selutario (Contributor) commented:

Errors in test_cluster_connection/test_cluster_connection.py are expected since the cluster is restarted as part of the API performance test:

2023/09/07 20:29:19 INFO: wazuh 172.31.72.150 "PUT /cluster/restart" with parameters {} and body {} done in 0.104s: 200
2023/09/07 20:29:25 INFO: Checking RBAC database integrity...
2023/09/07 20:29:25 INFO: /var/ossec/api/configuration/security/rbac.db file was detected

We'll need to review if there is any problem in remoted or in the cluster related to groups sync using sendsync, as you already mentioned.

Everything else looks good to me.
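The expected-disconnect reasoning above can be checked mechanically: a worker disconnection is only expected if it occurs at or after the `PUT /cluster/restart` entry in the API log. A sketch comparing the leading timestamps, based on the log format quoted above (paths and helper names are illustrative):

```python
from datetime import datetime

LOG_TS_FORMAT = "%Y/%m/%d %H:%M:%S"

def log_timestamp(line):
    """Parse the leading 'YYYY/MM/DD HH:MM:SS' timestamp of a log line."""
    return datetime.strptime(line[:19], LOG_TS_FORMAT)

def is_expected_disconnect(restart_line, disconnect_line):
    """A worker disconnection is expected if it occurs at or after the
    cluster restart triggered by the API performance test."""
    return log_timestamp(disconnect_line) >= log_timestamp(restart_line)
```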

davidjiglesias (Member) commented:

LGTM!
