Add tool for analyzing and reporting random CDash test failures #600

Open
achauphan opened this issue Jan 17, 2024 · 2 comments

Comments

@achauphan
Collaborator

achauphan commented Jan 17, 2024

Related issues

Description

Random failures can bring down an entire CI iteration on a regular basis and waste resources whenever a retest is requested in order to pass the various checks of a pull request.

Spotting a randomly failing test requires a lot of manual CDash querying and analysis by the developer. In most cases, however, a developer does not have the time to trace, identify, and report the randomly failing test, and will instead request a retest, which wastes resources as described above. This lack of reporting also leads to a bigger issue: the randomly failing test lingers in the code base and continues to affect developers in the future.

Proposed Solution

This issue proposes a new tool (which for now would live inside of TriBITS under tribits/ci_support) that can run automatically to query, scrape, analyze, and report tests that are deemed to be "randomly failing" to an operations team via email or an automated issue creation in the repository.

The definition for a randomly failing test will be a test that intermittently reports as passing or failing without any changes made to the topic or target branch being tested (topic and target tip SHA1 are the same) between CI testing iterations.

Fortunately, much of the work needed to build this tool in Python already exists inside of tribits/ci_support. Notable are the module CreateIssueTrackerFromCDashQuery.py, which is used in the template example example_test_failure_github_issue.py, and the module CDashQueryAnalyzeReport.py, which contains most of the heavy CDash querying functionality. Thus, once those existing modules are in use, the core remaining work will be to implement the algorithm that determines a random failure in a way that is customizable on a project basis.

The goal will be for this tool to be able to look for randomly failing tests for any project that posts its test results to CDash. The specifics of how this tool gathers the version information of the builds in CDash will be unique to each project and will require implementation on a project basis.

Ideally, this tool can be extended to analyze and report randomly failing configures, builds, and tests. However, starting with randomly failing tests should lead to a framework that can be reused for those other cases.

Requirements

  • posts a GitHub issue upon identifying a randomly failing test (TRILFRAME-614 requirement for any post starting with an email first)
  • is able to query CDash results over a period of time
  • all functionality is tested
  • usage is documented
@achauphan achauphan self-assigned this Jan 17, 2024
achauphan added a commit to achauphan/TriBITS that referenced this issue Jan 22, 2024
Added initial set of arguments for the script to take in when run.
achauphan added a commit to achauphan/TriBITS that referenced this issue Jan 24, 2024
…ub#600)

Helper module functions that construct the browser and query URLs to
cdash that can be used to download the data.
achauphan added a commit to achauphan/TriBITS that referenced this issue Jan 24, 2024
…#600)

This initial script implementation takes in several cdash arguments
and filters cdash for an initial set of all failing tests for a
certain number of days. For each test in that set, the script
will then get that test's full testing history. The full testing
history is used to build a set of (target, topic) sha1 pairs associated with failing
testing iterations.

This initial implementation currently lacks the check to see if a
passing test's target,topic sha1s exist in the set of failing sha1s,
which denotes an unstable test.

Monolithic commit as this started from a lot of exploratory coding
that eventually built to this starting implementation.
achauphan added a commit to achauphan/TriBITS that referenced this issue Jan 24, 2024
…#600)

Add checkIfTestUnstable() that takes in a tuple of passing sha1s and a
set of tuples containing nonpassing sha1s. This requires testing.
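The check described in this commit message could look roughly like the sketch below. The source only says that `checkIfTestUnstable()` takes a tuple of passing sha1s and a set of tuples of nonpassing sha1s; the body, the snake_case name, and the sample SHA1 strings here are assumptions for illustration, not the committed implementation.

```python
def check_if_test_unstable(passing_sha1s, nonpassing_sha1_set):
    """Return True if a passing run shares its (target, topic) SHA1 pair
    with at least one nonpassing run (hypothetical sketch)."""
    # A passing run built from exactly the same target and topic commits
    # as a nonpassing run means the result changed without a code change.
    return tuple(passing_sha1s) in nonpassing_sha1_set


# Made-up SHA1 placeholders, for illustration only.
nonpassing = {("target-aaa", "topic-bbb"), ("target-ccc", "topic-ddd")}

same_version = check_if_test_unstable(("target-aaa", "topic-bbb"), nonpassing)
different_version = check_if_test_unstable(("target-aaa", "topic-eee"), nonpassing)
print(same_version, different_version)
```

Using a set of tuples keeps the lookup constant-time no matter how long the test history is.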
achauphan added a commit to achauphan/TriBITS that referenced this issue Jan 25, 2024
Add a set of unit tests for getBuildIdFromTest helper function from
cdash_analyze_and_report_random_failure.py
achauphan added a commit to achauphan/TriBITS that referenced this issue Jan 25, 2024
Moved argument parsing into a function that gets called by main and
changed getBuildIdFromTest to return the last item of the split string rather
than a constant index.
achauphan added a commit to achauphan/TriBITS that referenced this issue Jan 30, 2024
Limited the build name to only 80 characters to shorten the cache file name.
achauphan added a commit to achauphan/TriBITS that referenced this issue Jan 30, 2024
Fix regex pattern to match a string literal rather than
a raw json string output, which was used previously during testing.
achauphan added a commit to achauphan/TriBITS that referenced this issue Jan 30, 2024
…#600)

Moved random failure test files to their own separate folder inside test/ci_support so as
not to be confused with the test files associated with other scripts.
achauphan added a commit to achauphan/TriBITS that referenced this issue Jan 30, 2024
Test layout was copied from another script. Renamed various functions and
function calls to reflect the actual script name that is being tested.
achauphan added a commit to achauphan/TriBITS that referenced this issue Jan 30, 2024
Included summary output of analysis run and found randomly failing
tests.
achauphan added a commit to achauphan/TriBITS that referenced this issue Jan 30, 2024
Initial test case for 1 passing and 1 failing test in a test history
with identical sha1s between the two, signifying a randomly failing test
@bartlettroscoe
Member

CC: @sebrowne

@achauphan, one thing that occurred to me is that this tool will need to allow the use of a build-name modifier that takes in the build name from CDash and provides a name used to determine sequential builds for the Trilinos PR and nightly testing system. For example, all of the Trilinos build names have the prefix PR-<prID>-test- and the suffix -<jenkinsJobID> that must be removed from the build name to get the core build name. For example, the builds:

  • PR-12703-test-rhel7_sems-intel-2021.3-sems-openmpi-4.0.5_release-debug_shared_no-kokkos-arch_no-asan_no-complex_fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-off_no-package-enables-1731
  • PR-12703-test-rhel7_sems-intel-2021.3-sems-openmpi-4.0.5_release-debug_shared_no-kokkos-arch_no-asan_no-complex_fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-off_no-package-enables-1732

are a sequence of the same build, but CDash does not recognize that because the build names are different. To identify a related sequence of builds, you need to at least remove the suffix -<jenkinsJobID> to give:

  • PR-12703-test-rhel7_sems-intel-2021.3-sems-openmpi-4.0.5_release-debug_shared_no-kokkos-arch_no-asan_no-complex_fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-off_no-package-enables
  • PR-12703-test-rhel7_sems-intel-2021.3-sems-openmpi-4.0.5_release-debug_shared_no-kokkos-arch_no-asan_no-complex_fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-off_no-package-enables

Then, if the target and topic branches are the same, and if a test goes from passing to failing, then you can classify this as a random test failure.
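The suffix removal described above can be sketched with a regular expression. This is a minimal illustration, assuming the Jenkins job ID is always a trailing run of digits after the last hyphen and the PR ID is numeric; the function names are hypothetical, and the build name below is shortened from the real examples.

```python
import re

# Assumed shapes: "-<jenkinsJobID>" is a trailing "-<digits>", and the
# prefix is "PR-<digits>-test-". Both patterns are illustrative guesses.
_JOB_ID_SUFFIX = re.compile(r"-\d+$")
_PR_PREFIX = re.compile(r"^PR-\d+-test-")


def strip_job_id_suffix(build_name):
    """Drop the trailing -<jenkinsJobID> so sequential builds share a name."""
    return _JOB_ID_SUFFIX.sub("", build_name)


def core_build_name(build_name):
    """Drop both the PR-<prID>-test- prefix and the -<jenkinsJobID> suffix."""
    return _PR_PREFIX.sub("", strip_job_id_suffix(build_name))


first = "PR-12703-test-rhel7_sems-intel-2021.3_no-package-enables-1731"
second = "PR-12703-test-rhel7_sems-intel-2021.3_no-package-enables-1732"
print(strip_job_id_suffix(first) == strip_job_id_suffix(second))
print(core_build_name(first))
```

One caveat of this pattern: a build name that has no job ID but happens to end in `-<digits>` would lose that ending, so a real project would want a stricter rule.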

You can provide the means for adjusting the build names using the Strategy Design Pattern.

So the two areas of variability for such a tool that will be project-specific (and therefore need to be abstracted out and pulled in as Strategy objects) are:

  1. How to extract the version of the project for the purposes of comparing builds. (In the case of Trilinos with merge commits, you can do that by concatenating the target and topic branch SHA1s scraped from the configure output into a string like <sha1-target>-<sha1-topic>.) Then the Python code just needs to compare the string values for this "version" to determine if the versions are the same (and it does not matter how that "version" was constructed or even what it represents).

  2. How to edit the build names so we can determine sequences of the same build configurations. (In the case of Trilinos, at least remove the suffix -<jenkinsJobID>.)

Those can be two separate strategy objects given to the Python class(es) that are doing the data processing and analysis.
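As a rough sketch of the two strategy objects described above, the analysis class below only sees an opaque version string and an edited build name. All class names, method names, dict keys, and the "Passed"/"Failed" status values are assumptions made for illustration; they are not the TriBITS API.

```python
import re
from collections import defaultdict


class TargetTopicVersionStrategy:
    """Hypothetical: the version is '<sha1-target>-<sha1-topic>'."""

    def get_version(self, result):
        return "{}-{}".format(result["target_sha1"], result["topic_sha1"])


class StripJobIdBuildNameStrategy:
    """Hypothetical: remove an assumed trailing '-<jenkinsJobID>'."""

    def get_core_build_name(self, build_name):
        return re.sub(r"-\d+$", "", build_name)


class RandomFailureAnalyzer:
    """Project-independent analysis; project details live in the strategies."""

    def __init__(self, version_strategy, build_name_strategy):
        self._versions = version_strategy
        self._names = build_name_strategy

    def find_random_failures(self, results):
        # Group statuses by (test, core build name, version string).
        seen = defaultdict(set)
        for r in results:
            key = (r["test_name"],
                   self._names.get_core_build_name(r["build_name"]),
                   self._versions.get_version(r))
            seen[key].add(r["status"])
        # Same test, build, and version with both a pass and a non-pass.
        return sorted(k for k, s in seen.items() if "Passed" in s and len(s) > 1)


results = [
    {"test_name": "t1", "build_name": "PR-1-test-gcc-101", "status": "Passed",
     "target_sha1": "aaa", "topic_sha1": "bbb"},
    {"test_name": "t1", "build_name": "PR-1-test-gcc-102", "status": "Failed",
     "target_sha1": "aaa", "topic_sha1": "bbb"},
    {"test_name": "t2", "build_name": "PR-1-test-gcc-101", "status": "Passed",
     "target_sha1": "aaa", "topic_sha1": "bbb"},
    {"test_name": "t2", "build_name": "PR-1-test-gcc-102", "status": "Failed",
     "target_sha1": "aaa", "topic_sha1": "ccc"},
]
analyzer = RandomFailureAnalyzer(TargetTopicVersionStrategy(),
                                 StripJobIdBuildNameStrategy())
random_failures = analyzer.find_random_failures(results)
print(random_failures)
```

Here `t1` is flagged because it passed and failed at the same version, while `t2` is not because its topic SHA1 changed between runs. Another project could supply different strategies without touching the analyzer.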

achauphan added a commit to achauphan/TriBITS that referenced this issue Jan 30, 2024
Each test in a failing test's history will have its own testname_buildname
directory inside of the build_summary_cache directory. This was done to
better group build summary cache files with their associated test and build names.
achauphan added a commit to achauphan/TriBITS that referenced this issue Jan 30, 2024
Added a not random failure system test case beginning from one failed test
and 5 tests in its history where all tests contain merge commits with
non-matching parents.
achauphan added a commit to achauphan/TriBITS that referenced this issue Jan 30, 2024
Renamed variables and cache files to be shorter in cases where the
expected source or direction is related to CDash or CDash tests.
achauphan added a commit to achauphan/TriBITS that referenced this issue Feb 3, 2024
Use dummy strings that are more easily identifiable for test input files
for parent commit hashes of the respective build summary output.
achauphan added a commit to achauphan/TriBITS that referenced this issue Feb 3, 2024
…TSPub#600)

Normalize the groupName string before usage in a url request rather than
having the user input an already normalized string. This has the added benefit
of being able to use the groupName string without url-normalized characters
for output in upcoming summary lines.
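One way to do this kind of normalization is percent-encoding with the Python standard library, as in the sketch below. Whether the committed code uses `urllib.parse.quote` is not stated in the source; the function name and the sample group name are illustrative assumptions.

```python
from urllib.parse import quote


def normalize_group_name_for_url(group_name):
    """Percent-encode a human-readable group name for use in a URL query
    (hypothetical helper, not the TriBITS function)."""
    # safe="" also encodes "/" so the value cannot break the URL structure.
    return quote(group_name, safe="")


group_name = "Pull Request"  # illustrative value typed by the user
encoded = normalize_group_name_for_url(group_name)
print(encoded)     # used when building the query URL
print(group_name)  # the readable form stays available for summary lines
```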
achauphan added a commit to achauphan/TriBITS that referenced this issue Feb 3, 2024
achauphan added a commit to achauphan/TriBITS that referenced this issue Feb 3, 2024
Removed individual printing of RandomFailureSummary in-line and instead
use a str() function for RandomFailureSummary object
achauphan added a commit to achauphan/TriBITS that referenced this issue Feb 3, 2024
Added functionality to build an html file containing analysis results
and the ability to report the results via email.
achauphan added a commit that referenced this issue Feb 13, 2024
…600)

Added cdash_analyze_and_report_random_failures_UnitTests ctest test using
tribits_add_advanced_test().
achauphan added a commit that referenced this issue Feb 13, 2024
This check was left in from an initial starting script. After adding
cdash_analyze_and_report_random_failures_UnitTests.py as a ctest test,
this check would cause CI ctest runs to fail as TRIBITS_DIR is set
on a project basis during testing.
achauphan added a commit that referenced this issue Feb 13, 2024
At a glance, the names of test cases such as "rft_0_ift_2" are not
understandable without knowing what the acronyms mean. The directories
of the new test case names will continue to use the acronyms as that
better depicts the contents of the test files present.
achauphan added a commit that referenced this issue Feb 13, 2024
The context of the script is a cdash tool so most of the variable
names do not need that additional context in their names.
achauphan added a commit that referenced this issue Feb 13, 2024
#600)

Create driver class CDashAnalyzeReportRandomFailuresDriver inside module file
CDashAnalyzeReportRandomFailures.py that will contain the main general functionality
of the random test failure tool.

The driver class accepts two strategy classes passed from the
example script. These strategy classes ExampleVersionInfoStrategy and
ExampleBuildNameStrategy contain the project specific implementation
that is generically used inside of the driver class.
achauphan added a commit that referenced this issue Feb 13, 2024
#600)

This large commit is copying over the main() function and its associated
helper functions into CDashAnalyzeReportRandomFailures.py inside
the CDashAnalyzeReportRandomFailuresDriver class. This is part
of the effort to refactor cdash_analyze_and_report_random_failures.py
to be more generic.
achauphan added a commit that referenced this issue Feb 13, 2024
…600)

There were mixed uses of 'targetTopic' and 'topicTarget'. This renames
all cases to use the 'targetTopic' approach.
achauphan added a commit that referenced this issue Feb 13, 2024
Moved example_cdash_analyze_and_report_random_failures.py to test/ci_support
achauphan added a commit that referenced this issue Feb 13, 2024
Trilinos specific driver `trilinos_cdash_analyze_and_report_random_failures.py` based on
`example_cdash_analyze_and_report_random_failures.py` that contains the Trilinos specific
implementations of `VersionInfoStrategy` and `ExtractBuildNameStrategy`.
achauphan added a commit that referenced this issue Feb 13, 2024
Example class did not include the 'Example' prefix.
achauphan added a commit that referenced this issue Feb 13, 2024
This is for testing the CDashAnalyzeReportRandomFailures.py
runDriver().
achauphan added a commit that referenced this issue Feb 13, 2024
Adjusted spacing between classes and added newline character at
the end of file.
achauphan added a commit that referenced this issue Feb 13, 2024
This reverts commit dbe94f4.

Reverting this commit as this specific driver implementation shouldn't
exist inside of TriBITS. Rather, it should be added to the Trilinos
repo after snapshotting TriBITS in.
achauphan added a commit that referenced this issue Feb 13, 2024
Deleted the original `cdash_analyze_and_report_random_failures.py` script after
moving its main functionality into a separate class inside `CDashAnalyzeReportRandomFailures.py`.

To run the script, one must start from `example_cdash_analyze_and_report_random_failures.py` located
in `test/ci_support` and supply an implementation of the two strategy objects used by the
`CDashAnalyzeReportRandomFailures.py` driver class.
achauphan added a commit that referenced this issue Feb 13, 2024
Removed unit tests related to the old script, `cdash_analyze_and_report_random_failures.py`.
These tests will be put back as unittests for the module file `CDashAnalyzeReportRandomFailures.py`.

This change will keep `cdash_analyze_and_report_random_failures_UnitTests.py` focused on
the system tests for how the class `CDashAnalyzeReportRandomFailuresDriver` is used.
achauphan added a commit that referenced this issue Feb 13, 2024
Added tests for `CDashAnalyzeReportRandomFailuresDriver` member functions in
`CDashAnalyzeReportRandomFailures.py`.
achauphan added a commit that referenced this issue Feb 13, 2024
The previous filename compression technique was to always trim the buildname to
only the first 80 characters so as to avoid "filename too long" errors.

Cache file or directory names are built in the format of `testName_buildName`.
The above method does not protect against the case where testName may be
very long.

This implementation uses an existing function named `getCompressedFileNameIfTooLong` in the
`CDashQueryAnalyzeReport.py` module file, which will form a hash of the passed-in string
if it is deemed too long.

This will also help mitigate the chances of a filename collision as previously it was
possible for a trimmed buildName to result in the same `testName_buildName` filename
if testName was the same test and had the correct length.
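The idea of hashing a name only when it is too long can be sketched as follows. This is not the `getCompressedFileNameIfTooLong` implementation from `CDashQueryAnalyzeReport.py`; the function name, the 80-character default, and the choice of SHA-1 are assumptions for illustration.

```python
import hashlib


def compress_name_if_too_long(name, max_len=80):
    """Return name unchanged if short enough, else a fixed-length hash of it
    (hypothetical helper)."""
    if len(name) <= max_len:
        return name
    # Hash the whole string so that names differing only after the first
    # max_len characters still map to different file names.
    return hashlib.sha1(name.encode("utf-8")).hexdigest()


shared_prefix = "some_test_name_" + "x" * 80
name_a = shared_prefix + "-config-a"
name_b = shared_prefix + "-config-b"

# Plain trimming collides; hashing the full string does not.
print(name_a[:80] == name_b[:80])
print(compress_name_if_too_long(name_a) == compress_name_if_too_long(name_b))
```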
achauphan added a commit that referenced this issue Feb 13, 2024
Optional usageHelp string that can be passed to `CDashAnalyzeReportRandomFailuresDriver`
and that is output when the main script is given the `--help` argument.
achauphan added a commit that referenced this issue Feb 13, 2024
Used to specify the testing day start time unique to each CDash
project.
achauphan added a commit that referenced this issue Feb 13, 2024
bartlettroscoe added a commit that referenced this issue Feb 14, 2024
bartlettroscoe added a commit to trilinos/Trilinos that referenced this issue Feb 14, 2024
@bartlettroscoe
Member

@achauphan and @sebrowne, the Trilinos PR that brings in TriBITS PR #603 is:

We can work on further refactorings and feature enhancements later.

I can see where this may be useful for some metrics for other projects that submit to CDash, so I will do those refactorings as needed.

achauphan added a commit to achauphan/TriBITS that referenced this issue Feb 15, 2024
Added argument to specify a prefix string for the built html
page title and the email subject. This can help with the
tool's email searchability.
bartlettroscoe added a commit that referenced this issue Feb 19, 2024
…-02-15

CDash Random failure tool patch 2024-02-15 (#600)

2 participants