EPIC: cdash_analyze_and_report.py: Many possible improvements #578

bartlettroscoe opened this issue May 8, 2023 · 3 comments

bartlettroscoe commented May 8, 2023

This EPIC is a long list of possible improvements that I had while actively developing the tool cdash_analyze_and_report.py for an internal project.

These need to be broken out into individual Issues/Stories before they are actually worked.

Things to implement before we call this (an unofficial) version 1.0:

  • Create class AddBuildHistoryToBuildDictFunctor to add build history and links to build dicts (existing or missing builds).

  • Create function createCDashBuildsHtmlTableStr() for lists of build dicts updated using AddBuildHistoryToBuildDictFunctor. (Make this work for the "Builds Missing: bm=???" table and the "Builds with Configure Failures: cf=???" and "Builds with Build Failures: bf=???" tables.)

  • Use createCDashBuildsHtmlTableStr() and AddBuildHistoryToBuildDictFunctor to update the tables "Builds Missing: bm=???", "Builds with Configure Failures: cf=???" and "Builds with Build Failures: bf=???".
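
A rough sketch of the shape these two pieces might take (the dict keys, field names, and the injected downloadBuildHistory callable are assumptions, not the actual TriBITS API):

```python
class AddBuildHistoryToBuildDictFunctor:
  """Adds build-history fields to each build dict (existing or missing builds)."""
  def __init__(self, downloadBuildHistory, date, daysOfHistory):
    # downloadBuildHistory is passed in as a callable so this sketch stays
    # self-contained; in the real tool it would query cdash/api/v1/index.php.
    self.downloadBuildHistory = downloadBuildHistory
    self.date = date
    self.daysOfHistory = daysOfHistory
  def __call__(self, buildDict):
    history = self.downloadBuildHistory(
      buildDict['site'], buildDict['buildname'], self.date, self.daysOfHistory)
    buildDict['build_history'] = history
    buildDict['build_history_browser_url'] = history.get('browser_url', '')
    return buildDict

def createCDashBuildsHtmlTableStr(tableTitle, abbrev, buildDictList, colHeaders):
  # One <tr> per build dict; the real function would also add hyperlinks and
  # the number-of-days-failed column filled in by the functor above.
  rowsHtml = "".join(
    "<tr>" + "".join(
      "<td>{}</td>".format(buildDict.get(col, "")) for col in colHeaders)
    + "</tr>\n"
    for buildDict in buildDictList)
  return "<h3>{}: {}={}</h3>\n<table>\n{}</table>\n".format(
    tableTitle, abbrev, len(buildDictList), rowsHtml)
```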

  • Add a single table of all of the builds with any failures and provide the headers "Builds with Update Failures: uf=???", "Builds with Configure Failures: cf=???", "Builds with Build Failures: bf=???", and "Builds with Test Failures: btf=???". But the number of days failed will be w.r.t. configure and build failures, not test failures. (Provide a note at the bottom of the table: "NOTE: Non-pass build days refers to days that the listed build had configure or build errors and does not include days with test failures. Test failures are shown in the tables below.")

At this point, the summary email will be able to detect and report failures of all kinds, we will be able to see new failures as they occur, and every test with a listed issue tracker will be accounted for and listed in the email in some way. From this point on, we will add new types of checks and other features as we have time in order to make it easier to stay on top of these builds and tests.

Nice things to add:

  • Expand logic to remove duplicate tests that have the same site, buildname, and buildstarttime. Resolve conflicts by taking a passing test first, then a failing test, then a not-run test, and from there it is arbitrary (but pick say the min runtime). This will make the tool robust for days where test results from prior days get delayed and get dumped all at once on the same testing day. See TriBITS GitHub cdash_analyze_and_report.py: Make robust to duplicate tests #303.
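
A minimal sketch of this de-duplication logic, assuming test dicts are plain Python dicts (the key names 'site', 'buildName', 'buildstarttime', 'testname', 'status', and 'time' are assumptions):

```python
def removeDuplicateTestDicts(testDicts):
  # When the same test (same site, buildname, buildstarttime) shows up more
  # than once, keep the "best" copy: Passed first, then Failed, then Not Run,
  # breaking remaining ties by the minimum runtime.
  statusRank = {'Passed': 0, 'Failed': 1, 'Not Run': 2}
  bestByKey = {}
  for testDict in testDicts:
    key = (testDict['site'], testDict['buildName'],
           testDict['buildstarttime'], testDict['testname'])
    rank = (statusRank.get(testDict['status'], 3), testDict.get('time', 0.0))
    if key not in bestByKey or rank < bestByKey[key][0]:
      bestByKey[key] = (rank, testDict)
  return [testDict for (rank, testDict) in bestByKey.values()]
```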

  • Add --help-topic=<category> with categories 'overview', 'inputs', 'outputs', 'expected-builds', 'tests-with-issue-trackers', ..., and 'all'. This will help to explain what the script does in basic terms (and may be all of the documentation that is needed).

  • Create the start of some documentation as a file cdash_analyze_and_report.md in TriBITS/tribits/doc/ci_support/ to document the script's intent and basic setup and usage info.

  • Add tests for common error cases for cdash_analyze_and_report.py:

    • Duplicate entries in expectedBuilds.csv and good error messages (edits of *.csv file)
    • Duplicate entries in testsWithIssueTrackers.csv and good error messages (edits of *.csv file)
    • Duplicate builds returned from cdash/index.php
    • Duplicate tests returned from cdash/queryTests.php
    • Empty required args and error messages (error setting up driver script)
  • Add the option --extra-exclude-test-filters "field1=<field1>&compare1=<compare1>&value1=<value1>&..." and use it in the global query of nonpassing tests, but also use it for the test history link and queries. It would be good to also get the test history without these extra filter fields in order to show that the test was "Missing/Failed" like the current implementation does, and it would be nice to have access to both of these test history queries. The problem is how to do that. Perhaps we could link the queryTests.php query without the extra filter fields to the '0' in the "Consecutive Missing Days" column? Given that such a test is both "missing" and "failed" with '0' consecutive missing days, clicking on that '0' might be a good way to provide access to this information.

  • Add option --show-missing-tests-for-missing-builds=[on|off] (default 'off') that will not filter out missing tests for missing builds. These could be reported in the existing "Tests with issue trackers Missing: twim=???" table or in a new table "Tests with Issue Trackers Missing Builds: twimb=???". This is needed for updating GitHub issues with the Grover tool to show the full status and history of all of the tests for an issue tracker.

  • Add option --ignore-missing-tests-with-issue-trackers=[on|off] (default 'off') that can be used to implement scripts for a subset of expected builds or a subset of CDash SubProjects. Any tests in the "tests-with-issue-trackers" *.csv file that did not match expected builds or did not match the list of non-passing tests would be ignored. This would be used, for example, to implement scripts that analyze the set of builds of interest to SPARC or EMPIRE (or any customer). In that case, you only want to see tests reported that were non-passing and see which had issue trackers and which did not. But we don't care to see tests with issue trackers that might be passing (that is not the job of the tool in that use case).

  • Add optional column fail_regularity with values 'daily', 'frequently', and 'rarely' in the tests-with-issue-trackers CSV file. This will allow for different logic for how test results are displayed when passing. For values 'daily' and 'frequently', always list the tests in the table 'twip'. For 'rarely' (randomly) failing tests, passing tests will not be listed in the table 'twip' but will be listed in the tables 'twim' and 'twif' if they are missing or failing (which should be rare). (Otherwise, the presence of these tests in 'twip' is just spam, and when they do rarely fail, we will see that they already have issue trackers assigned to them.) But add option --show-all-tests-with-issue-trackers=[on|off] (default 'off') that, if 'on', will list 'rarely' failing tests in the table 'twif' (so that we can use that info to communicate with the user). At the same time, add a new column Rand to the test tables with values 'daily', 'frequently', and 'rarely'. (Put a soft line-break in 'frequ-ently' to allow the line to wrap and save horizontal column space.) For tests without issue trackers, that entry will just be empty.

  • Add a new column expected_fail_regex to the "tests-with-issue-trackers" CSV file that will define a regex (could be of the form (<regex1>|<regex2>|...)) for expected failure text. If the test fails and the regex matches the failing test's detailed STDOUT, then the test will be listed in the table "Tests with issue trackers Failed: twif=???" or the table "Tests with issue trackers allowed to fail Failed: twiatff=???" (if allow_to_fail=atf is also set). But if the test output does not match the regex, then the test gets listed in the table "Tests without issue trackers Failed: twoif=???". For example, for the tests MueLu_UnitTests[Blocked]Epetra_MPI_4 in MueLu_UnitTests[Blocked][Epetra|Tpetra]_MPI_4 failing randomly on several ATDM builds trilinos/Trilinos#3897, we need to define the regex to be "FAILED.*Hierarchy_double_int_int_Kokkos_Compat_KokkosOpenMPWrapperNode_Write_UnitTest". If that regex matches the failing test output, then we classify the test as a known failure in the tables 'twif' or 'twiatff' (depending on the value of the 'allow_to_fail' field). But if that regex does not match, then we list the test failure in 'twoif' to grab people's attention. One idea of how to document this regex is to embed it in the test history queryTests.php set of filters. That would only show failures that matched the fail criteria and not other types of failures. So technically, the same test (same 'site', 'buildname', and 'testname') could match regex criteria for multiple GitHub issues and therefore appear more than once in a tests-with-issue-trackers table! (Therefore, uniqueness would be the four fields 'site', 'buildname', 'testname', and 'expected_fail_regex'. This would impact many of the data structures.)
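
A minimal sketch of how this classification might look, assuming the CSV row is available as a dict (the column names and table abbreviations come from the description above; everything else is an assumption):

```python
import re

def classifyFailedTestWithIssueTracker(issueTrackerRow, detailedTestOutput):
  # issueTrackerRow is one row from the tests-with-issue-trackers CSV file.
  expectedFailRegex = issueTrackerRow.get('expected_fail_regex', '')
  allowedToFail = (issueTrackerRow.get('allow_to_fail', 'natf') == 'atf')
  if expectedFailRegex and not re.search(expectedFailRegex, detailedTestOutput):
    # The failure does not match the known-failure signature: surface it.
    return 'twoif'
  if allowedToFail:
    return 'twiatff'
  return 'twif'
```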

  • Add ability to put wild-cards (regexes) in the "tests-with-issue-trackers" CSV file to allow matching any site and/or any build and/or a set of tests. This would mainly be useful for randomly failing tests with fail_regularity values 'daily' and 'frequently'. We would need to decide if we will search for and list out missing and passing tests in the tables 'twim' and 'twip'. That is possible by simply matching the set of builds that did submit test results to CDash that testing day. That would make it easier to stay on top of these randomly failing tests, but it does have the risk that it will mask new failures and make us ignore them. However, as long as the number of randomly failing tests is not too large and one also uses the expected_fail_regex field, this should not be a big risk. Note that in this case, we will need to add a limit to the number of tests for which we get detailed test results and display in the 'twif' table to avoid having this match a large number of tests, bog down the script, and overwhelm the email with a massive table. We could do this by adding the option --limit-rows-failed-tests-with-issue-trackers=<limitRowsFailedTestsWithIssueTracker> (set to a value like 500 or something).

  • Add optional column allow_to_fail with values 'atf' and 'natf' in the tests-with-issue-trackers CSV file. Then add new table "Tests with issue trackers allowed to fail Failed: twiaff=???". Non-passing tests in this category would not trigger a global FAIL and would be listed at the very bottom of the HTML page/email.

  • Add optional column allow_to_timeout with values 'att' and 'natt' in the tests-with-issue-trackers CSV file. Then add new table "Tests with issue trackers allowed to timeout Timed-out: twiatt=???". Failing tests with details "Timeout" in this category would not trigger a global FAIL and would be listed at the very bottom of the HTML page/email. This would be useful for tests that we want to let timeout here and there but not let fail. If one of these tests fails, it will be reported as a regular test failure or as a test without an issue tracker in the table 'twoif' (since the issue tracker should be for a timing-out test, not a failing test, so we want to get people's attention).

  • Update the top "Builds on CDash (num/expected=???/???)" line in the first paragraph to also list the number of total tests not run, failed, and passed as "Builds on CDash (Builds: num/expected=???/???; Tests: Total=???, NotRun=???, Failed=???, Passed=???)". Those are useful stats to put the depth of testing in perspective. Also, update the line "Non-passing tests on CDash (Nopassing=???)" (one can see the breakdown by "NotRun" and "Failed" in the above "Builds on CDash" line).

  • Add option --exclude-builds=[<site0>:]<build0>,[<site1>:]<build1>,... (default empty '') that will allow running an existing driver script but remove matching builds from the expected builds, the set of downloaded builds, and the sets of tests with issue trackers. This would allow, for example, excluding some problematic builds for a given testing day to see what things look like without those builds. This might also be a cheap way to exclude some builds temporarily without having to edit the CDash filters or any of the *.csv files.
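
A minimal sketch of parsing and applying this option (the function names are hypothetical):

```python
def parseExcludeBuildsArg(excludeBuildsArgStr):
  # "--exclude-builds=[<site0>:]<build0>,[<site1>:]<build1>,..."; returns a
  # list of (site, buildname) pairs where site may be None (match any site).
  excludeList = []
  for entry in excludeBuildsArgStr.split(','):
    entry = entry.strip()
    if not entry:
      continue
    if ':' in entry:
      site, buildname = entry.split(':', 1)
    else:
      site, buildname = None, entry
    excludeList.append((site, buildname))
  return excludeList

def buildIsExcluded(buildDict, excludeList):
  return any(
    buildDict['buildname'] == buildname
    and (site is None or buildDict['site'] == site)
    for (site, buildname) in excludeList)
```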

  • Add ability to not have some missing builds result in a global failure. For example, if some build on 'serrano' or 'chama' was missing on the current testing day but was 100% passing yesterday and has no tests-with-issue-trackers entries that match that build, then it would be okay to assume that everything is fine with that missing build and not have it trigger a global failure. These builds could be listed in a new table "Builds Allowed to be Missing: bam=???" at the bottom of the email and not in red. But I think we would only want to allow such builds to be missing for a day or two and not for too long, and we would want to be able to specify which builds would be allowed to be missing and for how long. We could do this by adding a new column allow_num_days_missing to the list-of-expected-builds CSV file. We would set up the builds on 'serrano' and 'chama' with 'allow_num_days_missing=1' or '2' (and the other builds would have empty fields or '0', which means don't allow any missing days).
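
A minimal sketch of the classification logic this new column implies (the function and argument names are hypothetical):

```python
def classifyMissingExpectedBuild(expectedBuildRow, passedAllTestsYesterday,
                                 numConsecutiveMissingDays,
                                 hasMatchingIssueTrackerTests):
  # expectedBuildRow is one row of the list-of-expected-builds CSV file with
  # the proposed 'allow_num_days_missing' column (empty or '0' means "never").
  allowedMissingDays = int(expectedBuildRow.get('allow_num_days_missing') or 0)
  if (passedAllTestsYesterday and not hasMatchingIssueTrackerTests
      and numConsecutiveMissingDays <= allowedMissingDays):
    return 'bam'  # listed in "Builds Allowed to be Missing"; no global FAIL
  return 'bm'     # regular missing expected build; triggers global FAIL
```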

  • Add new table "Build Warnings: bw=???" in the summary emails for builds where warnings are considered errors. This could be done by adding a new column warnings_are_errors in the Expected Builds *.csv file with values 'wae' (warnings are errors) and 'wane' (warnings are not errors). If any of these builds have any warnings, then the build would be considered to be failed and would be listed in the 'bw' table and the global build would be considered to be FAILED. This is needed for SPARC Trilinos Integration builds for the builds that have -Werror like the clang-9.0.1 build and the gnu-7.2.0 build (but have -Werror removed for the SPARC Trilinos Integration build).

  • Experiment with adding hyper-links from the top list of build and test sets in the second paragraph to the lower tables with the detailed results. (Test this in the HTML file and the generated email and see if it works with Gmail and Outlook.)

  • Add logic to eliminate duplicate test dict entries when getting test history. For example, some of the tests are getting duplicated on CDash for some of the ATDM Trilinos builds (see https://gitlab.kitware.com/snl/project-1/issues/77). We see cases where tests have failed for 36 out of the last 30 days getting reported in the columns "Consecutive nopass days" and "Nopass last X Days"! These duplicate tests are already getting pulled out of the initial set of non-passing tests. So that same Python code could be used to eliminate duplicates from the test history list as well (easy to do after sorting by buildstarttime).
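
A minimal sketch of that de-duplication over the test history list (the key name 'buildstarttime' is an assumption):

```python
def removeDuplicatesFromTestHistory(testHistoryList):
  # CDash sometimes reports the same test twice for a build; after sorting by
  # 'buildstarttime', consecutive entries with the same start time are dupes.
  sortedHistory = sorted(testHistoryList, key=lambda td: td['buildstarttime'])
  dedupedHistory = []
  for testDict in sortedHistory:
    if dedupedHistory and \
       dedupedHistory[-1]['buildstarttime'] == testDict['buildstarttime']:
      continue
    dedupedHistory.append(testDict)
  return dedupedHistory
```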

  • Change logic so that builds with configure failures 'cf' are not listed in the set of missing builds 'bm' (even if they have tests missing) and therefore don't list the associated tests with issue trackers in the set 'twim'. (See example for promoted builds from 2019-03-12.)

  • Add option --show-all-passing-builds=[on|off] (default 'off') that will result in producing a table "Builds with everything passing: bep=???" with columns to show pass/fail history of the last 30 days. (This option would be turned on for the script invocation for the "Specialized ATDM Trilinos Cleanup Builds" email so that we could see when a build was ready to promote to the "ATDM" CDash group.)

  • Add check for passing tests in the initial cdash/api/v1/queryTests.php download, which should contain only non-passing (i.e. "Failed" and "Not Run") tests. (This would be useful for catching mistakes people make when setting up the filter argument --cdash-nonpassed-tests-filters.)

  • Add support for catching and reporting update failures in a new table "Update failures: uf=???". (Those builds would currently be listed as missing expected builds 'bm' where the build exists on CDash but there are no build results and you cannot tell why. Also, tests associated with these builds should not be listed in the table 'twim'.)

  • Add option --ignore-passing-tests-with-issue-trackers=[on|off] (default 'off') that will not print the table 'twip'. This could be used for customer-specific scripts like those for SPARC and EMPIRE to reduce the clutter of the generated emails and only focus on test failures. But we may not want to use an option like this; it would be a preference.

  • Assert consistent repo versions (this must be done in order to ensure that all of the builds are testing the same version of the repo, but it may not be critical if we set up the nightly builds of ATDM Trilinos and SPARC to pull from 'nightly' branches correctly):

  • Add option --require-same-tested-repo-version=[on|off] that would require that all of the builds pulled down off of CDash have the same repo version(s) or return a global fail. (A rough sketch of this check is given after this list.)
  • Add option --multi-repo-version-file-name=<multirepoVersionFileName> for the name of a notes file extracted from the cdash-dev-view/api/v1/viewNotes.php?buildid=<buildid> page that, together with the 'update_txt' build dict field, is used to assert that the versions are the same. (This is not needed for Trilinos 'develop' but it will be needed for SPARC, and the file name in that case is called 'SPARCRepoVersion.txt'. If needed for Trilinos, it is called 'TrilinosRepoVersion.txt' and that will need to be checked in case some builds use a different version of SEACAS.)
  • Add option --repo-version-output-file=<repoVersionFile> that, on output, will write the version of the repos if they all matched. If --multi-repo-version-file-name=<multirepoVersionFileName> is given, then this will be that downloaded file. Otherwise, this will be the value (git SHA1) returned from the 'update_txt' field that matches for all of the builds. If the versions did not match, then the file will be written as empty text to denote that.
  • Add printout of total run-time for script (at bottom of the HTML page and the email body). (One can get this with 'time' locally but we want this shown in the email body at the bottom.)
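
A rough sketch of the repo-version consistency check described in the bullets above (the 'update_txt' field name comes from the description; the function names are hypothetical):

```python
def getCommonTestedRepoVersion(buildDicts):
  # Returns (allVersionsMatch, repoVersion) based on the 'update_txt' field
  # of each build dict pulled down from CDash.
  versions = set(buildDict.get('update_txt', '') for buildDict in buildDicts)
  if len(versions) == 1:
    return (True, versions.pop())
  return (False, '')

def writeRepoVersionOutputFile(repoVersionFile, buildDicts):
  # For --repo-version-output-file=<repoVersionFile>: write the matched
  # version, or empty text if the builds did not all agree.
  allVersionsMatch, repoVersion = getCommonTestedRepoVersion(buildDicts)
  with open(repoVersionFile, 'w') as outFile:
    outFile.write(repoVersion if allVersionsMatch else '')
```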

  • Add option --wipe-test-data-cache-dir=[on|off] (with default 'on') that will delete the <cacheDir>/test_history/ dir before getting new results from CDash when --use-cached-cdash-data=on (otherwise we will fill up the disk by running many nights in a row). (This will allow for workflows where one sets --wipe-test-data-cache-dir=off and then any history already pulled down will be reused by multiple scripts where the set of tests may overlap.) Or, consider automatically trimming down test data in <cacheDir>/test_history/ that is older than --remove-test-history-older-than-days=<num_days>? Alternatively, add a simple set of bash shell commands to the automated driver scripts to delete files older than X days.
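
A minimal sketch of the "trim cached test history older than N days" alternative (the paths and option wiring are assumptions):

```python
import os, time

def trimOldTestHistoryCache(cacheDir, removeTestHistoryOlderThanDays):
  # Delete cached test-history files under <cacheDir>/test_history/ that are
  # older than the given number of days instead of wiping the whole directory.
  cutoffTime = time.time() - removeTestHistoryOlderThanDays * 24 * 3600
  testHistoryDir = os.path.join(cacheDir, 'test_history')
  if not os.path.isdir(testHistoryDir):
    return
  for dirPath, _subDirs, fileNames in os.walk(testHistoryDir):
    for fileName in fileNames:
      filePath = os.path.join(dirPath, fileName)
      if os.path.getmtime(filePath) < cutoffTime:
        os.remove(filePath)
```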

Attaching files to the emails:

  • Split Python print() into STDOUT and an internal string buffer to save the output and then attach a copy of that output as a plain text file to the email. That would allow people to see the logic and more details that went into the generation of the summary email. (That would help save needing to add a bunch of extra info to the email body for special cases.)
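
A minimal sketch of capturing the script output while still printing it, using a small tee object (the class and method names are hypothetical):

```python
import io, sys

class TeeStdout:
  # File-like object that echoes everything to the real stdout while also
  # capturing it in a string buffer for attaching to the summary email.
  def __init__(self, realStdout):
    self.realStdout = realStdout
    self.capturedOutput = io.StringIO()
  def write(self, text):
    self.realStdout.write(text)
    self.capturedOutput.write(text)
  def flush(self):
    self.realStdout.flush()
  def getCapturedText(self):
    return self.capturedOutput.getvalue()

# At the top of the driver:
sys.stdout = TeeStdout(sys.stdout)
# ... run the analysis, then attach sys.stdout.getCapturedText() as a plain
# text file to the email before restoring the original sys.stdout.
```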

  • Attach the generated HTML file to the email (see https://stackoverflow.com/questions/3362600/how-to-send-email-attachments). This will allow viewing the output in a browser if that is better for some reason, or archiving that file.
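
A minimal sketch using the standard-library email package (the function name and SMTP details are assumptions):

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def createSummaryEmailWithHtmlAttachment(fromAddr, toAddr, subject,
                                         htmlBodyStr, htmlFilePath):
  msg = MIMEMultipart()
  msg['From'] = fromAddr
  msg['To'] = toAddr
  msg['Subject'] = subject
  msg.attach(MIMEText(htmlBodyStr, 'html'))
  # Attach the generated HTML file so it can be opened in a browser or archived.
  with open(htmlFilePath) as htmlFile:
    attachment = MIMEText(htmlFile.read(), 'html')
  attachment.add_header('Content-Disposition', 'attachment',
                        filename=htmlFilePath.split('/')[-1])
  msg.attach(attachment)
  return msg  # send with, e.g., smtplib.SMTP('localhost').send_message(msg)
```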

  • Attach the file created by the argument --write-failing-tests-without-issue-trackers-to-file=<file> to the email message. That will make it easy to download and then use to create GitHub issues and update the TrilinosATDMStatus/*.csv files without having to run the script again locally.

  • Attach the input *.csv files to the email to show the set of expected builds, the tests with issue trackers, etc.

Adding more color to tables:

  • Color rows red in tables 'bm', 'cf', 'bf', 'twoif' and 'twoinr' where the number of recent consecutive non-pass days is greater than or equal to --color-consecutive-nonpass-days-rows-red=<colorRedNumConsecNonpassDays> (default 2). This shows that a failure is likely not a rare random (system) failure and therefore needs to be triaged.

  • Color rows red in the table 'twoif' for tests where the number of non-passing days in the last X days is greater than or equal to --color-nonpass-tests-last-x-days-rows-red=<colorRowsNonpassTestsLastXDays> (default 3). This would help to spot randomly failing tests that might otherwise go unnoticed because they don't fail on consecutive days and therefore would not be flagged by the option --color-consecutive-failed-days-rows-red=<colorRedNumConsecFailedDays>.

  • Color rows red in the table 'twoif' for tests where the same test is failing in a number of builds greater than or equal to --color-nonpass-tests-across-x-builds-rows-red=<colorRowsNonpassNumBuilds> (default 2) and the total number of failures for all tests listed in 'twoif' is less than the number of table rows to display. This would help to immediately spot when a new failing test is likely not random, since it fails in multiple builds, but would not color rows red when there were catastrophic failures across many builds.

  • Color the build name red for tables of tests where there were any build failures in the build. This would help to connect tests in builds that have build failures with those builds and make that relationship more obvious.

  • Color test rows green in the 'twip' table for non-randomly failing tests (i.e. 'fail_regularity=daily') that have been consecutively passing for more than X days (option --color-rows-green-consecutive-pass-days=<colorGreenConsecPassDays> with default 3). For 'fail_regularity=frequently' tests that have been passing for more than X days (option --color-rows-green-consecutive-pass-days-rand-test=<colorGreenConsecRandPassDaysRandTest> with default 30), also color the rows green in the table 'twip'. This shows that an issue may be updated and closed (if all of the tests for an issue are passing). Should also put in a legend below the table that says "* Green rows have been passing for <X> or more consecutive days for non-randomly failing tests." NOTE: Don't add this until we add the 'fail_regularity' field to the tests-with-issue-trackers CSV file. Otherwise, we will be marking randomly failing tests as green.

  • Likewise, color test rows green for tests in the table "Tests with issue trackers randomly failing: twirfp=???" for frequently randomly failing tests (i.e. 'fail_regularity=frequently') that have been consecutively passing for more than X days (option --color-rows-green-consecutive-pass-days-rand-fail=<colorGreenConsecPassDaysRandFail> with default 30). (Or, add a field 'color_green_when_passing_x_consec_days' to the tests-with-issue-trackers CSV file to allow this to be set on a test-by-test basis, depending on how frequently the tests fail.) This shows when an issue may be updated and closed. Should also put in a legend below the table that says "* Green rows have been passing for <X> or more consecutive days for frequently randomly failing tests." NOTE: Don't add this until we add the 'fail_regularity' field to the tests-with-issue-trackers CSV file. Otherwise, we will be marking randomly failing tests as green.

Misc stuff that may not be needed or all that useful:

  • Change the link on the "Build Name" in build and test tables to be the index.php with filters matching site, buildname, and buildstarttime. (The current link is the history for 30 days which can confuse some people a little.)

  • Add argument --random-system-errors-file=<csv_file> for a list of random system failures with columns ['site', 'buildname', 'system_failure_regex', 'max_num_fails_per_test', 'max_num_total_matching_tests', 'issue_tracker_url', 'issue_tracker']. For 'mutrino' we would have [ 'mutrino', '.+', 'srun: error: Unable to create step for job [0-9]+: Address already in use', '4' ]. For the CUDA 10.1 builds, we would have "Segmentation fault: address not mapped to object at address". These would be listed in a separate table "Tests with Random System Failures: twrsf=???" at the bottom of the email. A failing test would only be listed in the table 'twrsf' if all of the following were true:

    • a) 'site' and 'buildname' matched (perhaps a regex match).
    • b) 'system_failure_regex' matched the detailed test output.
    • c) Test failed less than 'max_num_fails_per_test' times in recent history
      of X days that matches that regex.
    • d) Total number of tests that match this criteria for that build site and
      build is not greater than 'max_num_total_matching_tests' (e.g. 4)
    • e) Test is not included in the list of tests with issue trackers

    (This would eliminate the spam related to the random 'srun' failures on 'mutrino' and not cause a global failure but would still not allow catastrophic failures to get ignored. It could also be used to filter out the CUDA 10.1 fails showing "Segmentation fault: address not mapped to object at address".)

    NOTE: A likely better approach for handling system failures like this is to filter them out of the set of non-passing tests in the first place and the test history using the field --extra-exclude-test-filters.
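
A rough sketch of checking criteria (a)-(e) for one candidate test (the column names follow the CSV description above; the history and per-build counts are assumed to be computed by the caller):

```python
import re

def isRandomSystemFailure(testDict, detailedTestOutput,
                          numMatchingFailsInRecentHistory,
                          numMatchingTestsInBuild, systemErrorRow,
                          testsWithIssueTrackersKeys):
  # systemErrorRow is one row of the proposed --random-system-errors-file CSV.
  return (
    # a) 'site' and 'buildname' match (regex match)
    re.match(systemErrorRow['site'], testDict['site']) is not None
    and re.match(systemErrorRow['buildname'], testDict['buildName']) is not None
    # b) 'system_failure_regex' matches the detailed test output
    and re.search(systemErrorRow['system_failure_regex'],
                  detailedTestOutput) is not None
    # c) the test failed with this signature fewer than 'max_num_fails_per_test'
    #    times in the recent history of X days
    and numMatchingFailsInRecentHistory < int(systemErrorRow['max_num_fails_per_test'])
    # d) the total number of matching tests for this site/build is not too large
    and numMatchingTestsInBuild <= int(systemErrorRow['max_num_total_matching_tests'])
    # e) the test is not already in the list of tests with issue trackers
    and (testDict['site'], testDict['buildName'], testDict['testname'])
        not in testsWithIssueTrackersKeys)
```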

  • Add argument --filter-builds-with-x-total-test-failures=<num_total_failing_tests> that will filter out the tests from the "Tests without issue trackers Failed: twoif=???" table for builds that have more than <num_total_failing_tests> (e.g. 100) failing tests. Such tests typically fail due to system issues on a given node (like on 'waterman' or 'serrano') and we don't want these tests to flood out failing tests from other builds that may have new failing tests. NOTE: The builds with large numbers of failing tests will still be seen in the list of builds "Builds with Test Failures: btf=???".

  • Add argument --cdash-filter-templates-file=<python_dict_file> that contains a python dict that gives the templates for the CDash filters for several different queries that the script has to use. This allows the script to accommodate several different versions of CDash where the filter fields change from version to version. One field would be 'query-test-history-filters-template' with the value "...<site_to_replace>...<buildname_to_replace>...<testname_to_replace>...<date-from>...<date-to>..." where the tokens <site_to_replace>, <buildname_to_replace>, <testname_to_replace>, <date-from>, and <date-to> would be replaced with the actual site-name, build-name, test-name, date-from, and date-to in the CDash query. This would be used to get test history from the page cdash/api/v1/queryTests.php. Another field would be index-build-history-filters-template with the value "...<group_to_replace>...<site_to_replace>...<buildname_to_replace>...<date-from>...<date-to>...". This latter one would be used to get build history from the page cdash/api/v1/index.php. Another field would be index-build-site-build-starttime-filters-template with the value "... <site_to_replace> ... <buildname_to_replace> ... <buildstarttime_to_replace>". This would be used to create browser URL links to a given build on the cdash/index.php page. If the path <python_dict_file> was a relative path, then it would be w.r.t. the directory where the cdash_analyze_and_report.py script was located. (But this is only needed if future versions of CDash change their URL query structure.)
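
A minimal sketch of the token-replacement idea (the dict entry and filter string below are purely illustrative; the real filter field syntax would come from the <python_dict_file> matched to the CDash version in use):

```python
def fillCDashFilterTemplate(filterTemplateStr, tokenReplacements):
  # tokenReplacements maps tokens like '<site_to_replace>' to concrete values.
  filledStr = filterTemplateStr
  for token, value in tokenReplacements.items():
    filledStr = filledStr.replace(token, value)
  return filledStr

# Illustrative template dict (field names and compare codes are made up here):
cdashFilterTemplates = {
  'query-test-history-filters-template':
    'filtercount=3&filtercombine=and'
    '&field1=site&compare1=61&value1=<site_to_replace>'
    '&field2=buildname&compare2=61&value2=<buildname_to_replace>'
    '&field3=testname&compare3=61&value3=<testname_to_replace>',
}

testHistoryFilters = fillCDashFilterTemplate(
  cdashFilterTemplates['query-test-history-filters-template'],
  {'<site_to_replace>': 'mutrino',
   '<buildname_to_replace>': 'Trilinos-atdm-mutrino-intel-opt-openmp-HSW',
   '<testname_to_replace>': 'MueLu_UnitTestsEpetra_MPI_4'})
```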

  • Add table "Tests without issue trackers newly passing: twoinp=???" for tests without issue trackers that are newly passing the current testing day. (That will help spot progress or might just show when a randomly failing test goes from failing to passing. But this would be a terrible option if there were a bunch of failures the previous day, like due to a build failure. Therefore, we may not want to implement this.)

skyreflectedinmirrors commented May 9, 2023

I am interested in summary statistics, e.g., # of passing tests, # of failed / missing tests, and a % pass rate, broken down over time.

Specifically, my use-case tracks tickets on a number of internal and customer-facing JIRA instances, and I would like to (at a glance) be able to share build quality information with folks that I strongly suspect will never actually click through to the dashboard. Ideally, I'd like to have both the information for the current build and some sort of time-average (or plot, but I noted you said "standard Python 3.x" in #577 @bartlettroscoe, which probably precludes matplotlib) of the build quality results over, say, the last 30 days. A far-future goal would be JIRA integration (assuming such a thing can be done programmatically using minimal dependencies) to automatically pull in ticket status / priorities / etc., but I imagine I'd be the only user there for a while.

In addition, I also have tests in our test-suite broken down via subproject / ctest-label into things like "runtime" / "compiler" / "hip_and_omp_interop", etc., which form natural groupings for us internally to see how various components we ship are doing. I would also probably want to add a similar type of reporting as described for the JIRA instances above (build quality, time history).

Update: See new issue for this:

bartlettroscoe commented May 9, 2023

@arghdos, would you consider opening a new issue for the discussion above? I fear it will get lost in such a large epic issue.

@skyreflectedinmirrors

Will do
