
Release 4.6.0 - Alpha 1 - Workload benchmarks metrics #18874

Closed
3 tasks done
fdalmaup opened this issue Sep 7, 2023 · 5 comments
Comments


fdalmaup commented Sep 7, 2023

This issue tracks running all workload benchmarks for the current release candidate, reporting the results, and opening new issues for any errors encountered.

Workload benchmarks metrics information

- Main release candidate issue: #18858
- Version: 4.6.0
- Release candidate #: Alpha 1
- Tag: v4.6.0-alpha1
- Previous Workload benchmarks metrics issue: #18413

Test configuration

All tests will be run and workload performance metrics will be obtained for the following clustered environment configurations:

| # Agents | # Worker nodes |
|---|---|
| 50000 | 25 |

Test report procedure

All individual test checks must be marked as:

- **Pass**: The test ran successfully.
- **Xfail**: The test was expected to fail, and it did. The expected failure must be properly justified and reported in an issue.
- **Skip**: The test was not run. The skip must be properly justified and reported in an issue.
- **Fail**: The test failed. A new issue must be opened to evaluate and address the problem.

All test results must have one of the following statuses:

- 🟢 All checks passed.
- 🔴 There is at least one failed check.
- 🟡 There is at least one expected fail or skipped test, and no failures.
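The status rollup described above can be sketched as a small helper. This is an illustrative sketch, not part of the actual release pipeline; the function and value names are assumptions:

```python
def overall_status(checks):
    """Roll up individual check results ('pass', 'xfail', 'skip', 'fail')
    into the issue-level status emoji described above."""
    if any(c == "fail" for c in checks):
        return "🔴"  # at least one failed check
    if any(c in ("xfail", "skip") for c in checks):
        return "🟡"  # expected fails or skips, but no real failures
    return "🟢"  # all checks passed
```

For example, a run with only passes and one xfail rolls up to 🟡, matching the conclusion status of this issue.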

Any failing test must be properly addressed with a new issue, detailing the error and the possible cause. It must be included in the Fixes section of the current release candidate main issue.

Any expected fail or skipped test must have an issue justifying the reason. All auditors must validate the justification for an expected fail or skipped test.

An extended report of the test results must be attached as a zip or txt. This report can be used by the auditors to dig deeper into any possible failures and details.

Conclusions 🟡

All tests have been executed and the results can be found here and here.

The following already-reported defects were found:

API Performance 🟡

Cluster 🟡

The cluster tests were run manually since some changes need to be introduced in order to be run in the pipeline (wazuh/wazuh-qa#4298 and wazuh/wazuh-qa#4478).

Reliability

Two failures were found in these tests, both already reported:

Performance

For a detailed conclusion and report on the cluster performance metrics please refer to #18874 (comment).

Auditors validation

The definition of done for this issue is the validation of the conclusions and the test results by all auditors.

All checks below must be accepted in order to close this issue.


fdalmaup commented Sep 7, 2023

Issue update

An error in one of the final steps of the pipeline prevented the artifacts from being uploaded; it should be solved for Alpha 2. After modifying the pipeline for this issue, the following results were obtained:
artifacts.zip

API performance results


| Endpoint test name | Status | Issues Ref. |
|---|---|---|
| GET /cluster/local/info | 🟢 | |
| GET /cluster/nodes | 🟢 | |
| GET /cluster/healthcheck | 🟢 | |
| GET /cluster/status | 🟢 | |
| GET /cluster/local/config | 🟢 | |
| GET /cluster/api/config | 🟢 | |
| GET /cluster/configuration/validation | 🟢 | |
| GET /manager/status | 🟢 | |
| GET /manager/info | 🟢 | |
| GET /manager/configuration | 🟢 | |
| GET /manager/logs | 🔴 | Fixed in #17946. The test will be marked as XFAIL until the version in which the fix was introduced is tested (wazuh/wazuh-qa#4508). |
| GET /manager/api/config | 🟢 | |
| GET /manager/configuration/validation | 🟢 | |
| GET /mitre/groups | 🟢 | |
| GET /mitre/metadata | 🟢 | |
| GET /mitre/mitigations | 🟢 | |
| GET /mitre/references | 🟢 | |
| GET /mitre/software | 🟢 | |
| GET /mitre/tactics | 🟢 | |
| GET /mitre/techniques | 🟢 | |
| GET /overview/agents | 🟢 | |
| GET /tasks/status | 🟢 | |
| PUT /active-response | 🟡 | wazuh/wazuh-qa#1266 |
| PUT /rootcheck | 🟢 | |
| PUT /syscheck | 🟢 | |
| POST /groups | 🟢 | |
| POST /security/users | 🟢 | |
| POST /security/roles | 🟢 | |
| POST /security/policies | 🟢 | |
| POST /security/rules | 🟢 | |
| PUT /agents/group | 🟡 | Endpoint is not being tested (wazuh/wazuh-qa#3665). This related issue was opened in the past: #13872 |
| PUT /agents/group/new_test_group/restart | 🟢 | |
| PUT /agents/restart | 🟢 | |
| POST /agents | 🟢 | |
| POST /agents/insert | 🟢 | |
| POST /agents/insert/quick | 🟢 | |
| GET /agents | 🟢 | |
| GET /agents/no_group | 🟢 | |
| GET /agents/outdated | 🟢 | |
| GET /decoders | 🟢 | |
| GET /groups | 🟢 | |
| GET /lists | 🟢 | |
| GET /rules | 🟢 | |
| GET /security/users | 🟢 | |
| GET /security/roles | 🟢 | |
| GET /security/policies | 🟢 | |
| GET /security/rules | 🟢 | |
| DELETE /groups | 🟢 | |
| DELETE /agents | 🟢 | |
| DELETE /security/users | 🟢 | |
| DELETE /security/roles | 🟢 | |
| DELETE /security/policies | 🟢 | |
| DELETE /security/rules | 🟢 | |
| PUT /manager/restart | 🟢 | |
| PUT /cluster/restart | 🟢 | |


fdalmaup commented Sep 8, 2023

Cluster

No errors were found in the cluster.log file of the master, although some are present in the workers.
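The check described here mirrors what `test_cluster_error_logs` does: scan each node's `cluster.log` for lines containing "error", minus a white list of known benign patterns. A simplified sketch (the white-list contents and input shape are assumptions):

```python
import re

def find_error_lines(log_text, white_list=()):
    """Return the 'error' lines of a cluster.log, excluding white-listed
    patterns. Simplified version of the check in test_cluster_error_logs."""
    lines = re.findall(r"^.*?error.*?$", log_text,
                       flags=re.MULTILINE | re.IGNORECASE)
    return [line for line in lines
            if not any(pattern in line for pattern in white_list)]
```

Running this per node and collecting non-empty results yields the list of nodes with errors reported by the test.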

Performance

The Performance tests (performance/test_cluster/test_cluster_performance/test_cluster_performance.py) fail as reported here. Nevertheless, the Cluster tasks duration stats were manually obtained:

Cluster tasks duration
{
    "setup_phase": {
        "agent-info_sync": {
            "time_spent(s)": {
                "workers": {
                    "mean":("worker_8", 0.23868750000000002),
                    "max":("worker_13",1.397)
                },
                "master": {
                    "mean":("master", 0.10698632162661738),
                    "max":("master", 1.208)
                }
            }
        },
        "integrity_check": {
            "time_spent(s)": {
                "workers": {
                    "mean":("worker_8", 0.13771428571428568),
                    "max":("worker_25", 1.363)
                },
                "master": {
                    "mean":("master", 0.007136493795736558),
                    "max":("master", 0.145)
                }
            }
        },
        "integrity_sync": {
            "time_spent(s)": {
                "workers": {
                    "mean":("worker_12", 0.04213461538461538),
                    "max":("worker_23", 0.102)
                },
                "master": {
                    "mean":("master", 0.2697212355212356),
                    "max":("master", 1.528)
                }
            }
        }
    },
    "stable_phase": {
        "agent-info_sync": {
            "time_spent(s)": {
                "workers": {
                    "mean":("worker_4", 0.013296296296296297),
                    "max":("worker_4", 0.31)
                },
                "master": {
                    "mean":("master", 0.12942857142857142),
                    "max":("master", 0.153)
                }
            }
        },
        "integrity_check": {
            "time_spent(s)": {
                "workers": {
                    "mean":("worker_6", 0.010344827586206898),
                    "max":("worker_23", 0.034)
                },
                "master": {
                    "mean":("master", 0.003953488372093024),
                    "max":("master", 0.023)
                }
            }
        }
    }
}
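Each mean/max entry above is paired with the node it came from. A minimal sketch of how such a summary could be derived from per-node task durations; the input shape and selection rule (worst node by mean, node holding the global max) are assumptions about how the stats were produced:

```python
def summarize(durations):
    """Summarize per-node task durations as (node, value) pairs.

    durations: {node_name: [seconds, ...]}
    Returns the node with the highest mean and the node holding the
    global maximum, mirroring the shape of the report above.
    """
    means = {node: sum(vals) / len(vals) for node, vals in durations.items()}
    mean_node = max(means, key=means.get)
    max_node = max(durations, key=lambda node: max(durations[node]))
    return {
        "mean": (mean_node, means[mean_node]),
        "max": (max_node, max(durations[max_node])),
    }
```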

Reliability

Since the changes needed to run the tests with Python 3.7 have not been introduced yet (wazuh/wazuh-qa#4478), the tests were run locally:

Reliability tests results
====================================================================== test session starts ======================================================================
platform linux -- Python 3.10.13, pytest-7.1.2, pluggy-1.3.0
rootdir: /home/fdalmau/git/wazuh-qa/tests, configfile: pytest.ini
plugins: metadata-3.0.0, html-3.1.1, testinfra-5.0.0
collected 6 items                                                                                                                                               

test_cluster_connection/test_cluster_connection.py F                                                                                                      [ 16%]
test_cluster_error_logs/test_cluster_error_logs.py F                                                                                                      [ 33%]
test_cluster_master_logs_order/test_cluster_master_logs_order.py .                                                                                        [ 50%]
test_cluster_sync/test_cluster_sync.py .                                                                                                                  [ 66%]
test_cluster_task_order/test_cluster_task_order.py .                                                                                                      [ 83%]
test_cluster_worker_logs_order/test_cluster_worker_logs_order.py .                                                                                        [100%]

=========================================================================== FAILURES ============================================================================
____________________________________________________________________ test_cluster_connection ____________________________________________________________________

artifacts_path = 'artifacts'

    def test_cluster_connection(artifacts_path):
        """Verify that no worker disconnects from the master once they are connected.
    
        For each worker, this test looks for the first successful connection message
        in its logs. Then it looks for any failed connection attempts after the successful
        connection found above.
    
        Args:
            artifacts_path (str): Path where folders with cluster information can be found.
        """
        if not artifacts_path:
            pytest.fail("Parameter '--artifacts_path=<path>' is required.")
    
        cluster_log_files = glob(join(artifacts_path, 'worker_*', 'logs', 'cluster.log'))
        if len(cluster_log_files) == 0:
            pytest.fail(f'No files found inside {artifacts_path}.')
    
        for log_file in cluster_log_files:
            with open(log_file) as f:
                s = mmap(f.fileno(), 0, access=ACCESS_READ)
                # Search first successful connection message.
                conn = re.search(rb'^.*Successfully connected to master.*$', s, flags=re.MULTILINE)
                if not conn:
                    pytest.fail(f'Could not find "Sucessfully connected to master" message in the '
                                f'{node_name.search(log_file)[1]}')
    
                # Search if there are any connection attempts after the message found above.
                if re.search(rb'^.*Could not connect to master. Trying.*$|^.*Sucessfully connected to master.*$',
                             s[conn.end():], flags=re.MULTILINE):
                    disconnected_nodes.append(node_name.search(log_file)[1])
    
        if disconnected_nodes:
>           pytest.fail(f'The following nodes disconnected from master at any point:\n- ' + '\n- '.join(disconnected_nodes))
E           Failed: The following nodes disconnected from master at any point:
E           - worker_1
E           - worker_10
E           - worker_7
E           - worker_19
E           - worker_21
E           - worker_13
E           - worker_11
E           - worker_25
E           - worker_6
E           - worker_2
E           - worker_18
E           - worker_9
E           - worker_23
E           - worker_14
E           - worker_15
E           - worker_4
E           - worker_3
E           - worker_24
E           - worker_22
E           - worker_12
E           - worker_20
E           - worker_17
E           - worker_5
E           - worker_16
E           - worker_8

test_cluster_connection/test_cluster_connection.py:47: Failed
____________________________________________________________________ test_cluster_error_logs ____________________________________________________________________

artifacts_path = 'artifacts'

    def test_cluster_error_logs(artifacts_path):
        """Look for any error messages in the logs of the cluster nodes.
    
        Any error message that is not included in the "white_list" will cause the test to fail.
        Errors found are attached to an html report if the "--html=report.html" parameter is specified.
    
        Args:
            artifacts_path (str): Path where folders with cluster information can be found.
        """
        if not artifacts_path:
            pytest.fail('Parameter "--artifacts_path=<path>" is required.')
    
        cluster_log_files = glob(join(artifacts_path, '*', 'logs', 'cluster.log'))
        if len(cluster_log_files) == 0:
            pytest.fail(f'No files found inside {artifacts_path}.')
    
        for log_file in cluster_log_files:
            with open(log_file) as f:
                s = mmap(f.fileno(), 0, access=ACCESS_READ)
                error_lines = re.findall(rb'(^.*?error.*?$)', s, flags=re.MULTILINE | re.IGNORECASE)
                if error_lines:
                    error_lines = [error for error in error_lines if not error_in_white_list(error)]
                    if error_lines:
                        nodes_with_errors.update({node_name.search(log_file)[1]: error_lines})
    
>       assert not nodes_with_errors, 'Errors were found in the "cluster.log" file of ' \
                                      'these nodes: \n- ' + '\n- '.join(nodes_with_errors)
E       AssertionError: Errors were found in the "cluster.log" file of these nodes: 
E         - worker_1
E         - worker_10
E         - worker_7
E         - worker_19
E         - worker_21
E         - worker_13
E         - worker_11
E         - worker_25
E         - worker_6
E         - worker_2
E         - worker_18
E         - worker_9
E         - worker_23
E         - worker_14
E         - worker_15
E         - worker_4
E         - worker_3
E         - worker_24
E         - worker_22
E         - worker_12
E         - worker_20
E         - worker_17
E         - worker_5
E         - worker_16
E         - worker_8
E       assert not {'worker_1': [b'2023/09/07 20:05:18 ERROR: [Worker CLUSTER-Workload_benchmarks_metrics_B291_manager_1] [Main] Error se...nager_12] [Main] Error sending sendsync response to local client: Error 3020 - Timeout sending request: ok', ...], ...}

test_cluster_error_logs/test_cluster_error_logs.py:57: AssertionError
==================================================================== short test summary info ====================================================================
FAILED test_cluster_connection/test_cluster_connection.py::test_cluster_connection - Failed: The following nodes disconnected from master at any point:
FAILED test_cluster_error_logs/test_cluster_error_logs.py::test_cluster_error_logs - AssertionError: Errors were found in the "cluster.log" file of these nodes: 
============================================================ 2 failed, 4 passed in 168.47s (0:02:48) ============================================================
  • test_cluster_connection/test_cluster_connection.py
    The failure of this test was already reported here.

  • test_cluster_error_logs/test_cluster_error_logs.py
    The failure is due to the following error type in the majority of the workers (already reported here):

2023/09/07 20:05:21 ERROR: [Worker CLUSTER-Workload_benchmarks_metrics_B291_manager_16] [Main] Error sending sendsync response to local client: Error 3020 - Timeout sending request: ok

davidjiglesias (Member) commented:

I see some failed tests (reliability), but in the conclusions above you mention there are no errors.

Selutario (Contributor) commented:

Errors in test_cluster_connection/test_cluster_connection.py are expected since the cluster is restarted as part of the API performance test:

2023/09/07 20:29:19 INFO: wazuh 172.31.72.150 "PUT /cluster/restart" with parameters {} and body {} done in 0.104s: 200
2023/09/07 20:29:25 INFO: Checking RBAC database integrity...
2023/09/07 20:29:25 INFO: /var/ossec/api/configuration/security/rbac.db file was detected

We'll need to review if there is any problem in remoted or in the cluster related to groups sync using sendsync, as you already mentioned.

Everything else looks good to me.
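The expected-disconnect reasoning above can be checked mechanically: a worker disconnection is only expected if it occurs at or after the `PUT /cluster/restart` entry in the API log. A sketch comparing the leading timestamps, based on the log format quoted above (paths and helper names are illustrative):

```python
from datetime import datetime

LOG_TS_FORMAT = "%Y/%m/%d %H:%M:%S"

def log_timestamp(line):
    """Parse the leading 'YYYY/MM/DD HH:MM:SS' timestamp of a log line."""
    return datetime.strptime(line[:19], LOG_TS_FORMAT)

def is_expected_disconnect(restart_line, disconnect_line):
    """A worker disconnection is expected if it occurs at or after the
    cluster restart triggered by the API performance test."""
    return log_timestamp(disconnect_line) >= log_timestamp(restart_line)
```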

davidjiglesias (Member) commented:

LGTM!
