From 47f53c7141922e64ca2ed79311c590c9d97529df Mon Sep 17 00:00:00 2001
From: Nicholas Kumia <85196563+nickumia-reisys@users.noreply.github.com>
Date: Thu, 9 Mar 2023 10:17:18 -0500
Subject: [PATCH 1/2] new: initial pass at tabular checks

---
 .github/ISSUE_TEMPLATE/o-and-m.md | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/.github/ISSUE_TEMPLATE/o-and-m.md b/.github/ISSUE_TEMPLATE/o-and-m.md
index 516311cd4..76c6993f8 100644
--- a/.github/ISSUE_TEMPLATE/o-and-m.md
+++ b/.github/ISSUE_TEMPLATE/o-and-m.md
@@ -60,11 +60,18 @@ $ cf env catalog-web | grep solr -C 2 | grep "uri\|solr_follower_individual_urls
 ## Acceptance criteria
 You are responsible for all [O&M responsibilities](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities) this week. We've highlighted a few so they're not forgotten.
 
-- [ ] [Audit log updated](https://docs.google.com/spreadsheets/d/1z6lqmyNxC7s5MiTt9f6vT41IS2DLLJl4HwEqXvvft40/edit) for [AU-6 Log auditing](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#au-6-log-auditing) (**Friday**).
-- [ ] Any [New Relic alerts](https://alerts.newrelic.com/accounts/1601367/incidents) have been addressed or GH issues created.
-- [ ] Weekly [Duplicate check](https://github.com/GSA/data.gov/wiki/Operation-and-Maintenance-Responsibilities#duplicate-check) has been done, and any pertinent issues created.
-- [ ] Weekly [Nessus scan](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#nessus-host-scan-report-from-isso) has been triaged.
-- [ ] Weekly [Snyk scan](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#automated-dependency-updates-ad-hoc-github-prs) is complete.
+| Task | Friday | Monday | Tuesday | Wednesday | Thursday | Friday | Monday | Tuesday | Wednesday | Thursday | Weekly/Monthly |
+|---------------------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
+| Check Deployments | | | | | | | | | | | ➖ |
+| Check Restarts | | | | | | | | | | | ➖ |
+| Check [Snyk Scans](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#automated-dependency-updates-ad-hoc-github-prs) | | | | | | | | | | | ➖ |
+| Check Catalog Auto Tasks | | | | | | | | | | | ➖ |
+| Check Harvesting Emails | | | | | | | | | | | ➖ |
+| [New Relic Alerts](https://alerts.newrelic.com/accounts/1601367/incidents) Triaged | | | | | | | | | | | ➖ |
+| Check Catalog Solr | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | |
+| [Audit Log](https://docs.google.com/spreadsheets/d/1z6lqmyNxC7s5MiTt9f6vT41IS2DLLJl4HwEqXvvft40/edit) [*AU-6*](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#au-6-log-auditing) | | ➖ | ➖ | ➖ | ➖ | | ➖ | ➖ | ➖ | ➖ | ➖ |
+| [Catalog Dupe Check](https://github.com/GSA/data.gov/wiki/Operation-and-Maintenance-Responsibilities#duplicate-check) | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | |
+| [Invicti Scan](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#netsparker-compliance-scan-report-from-isso) | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | |
+
 - [ ] Weekly [resources.data.gov link scan](https://app.circleci.com/pipelines/github/GSA/resources.data.gov?branch=main)
-- [ ] If received, the monthly [Netsparker scan](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#netsparker-compliance-scan-report-from-isso) has been triaged.
 - [ ] Finishing the shift: Log the [number of alerts](https://docs.google.com/spreadsheets/d/1u1hSUAQW6FWzphog122stfB6MB9Wiq0NROT3PeicRoM/edit#gid=939071144)

From 67ade4abc0a4bc0d8d420d722ef25a124e6305d5 Mon Sep 17 00:00:00 2001
From: Nicholas Kumia <85196563+nickumia-reisys@users.noreply.github.com>
Date: Thu, 9 Mar 2023 10:40:57 -0500
Subject: [PATCH 2/2] new: cleanup the rest of the issue template

---
 .github/ISSUE_TEMPLATE/o-and-m.md | 52 +++----------------------------
 1 file changed, 4 insertions(+), 48 deletions(-)

diff --git a/.github/ISSUE_TEMPLATE/o-and-m.md b/.github/ISSUE_TEMPLATE/o-and-m.md
index 76c6993f8..58d42ce3d 100644
--- a/.github/ISSUE_TEMPLATE/o-and-m.md
+++ b/.github/ISSUE_TEMPLATE/o-and-m.md
@@ -7,55 +7,10 @@ assignees: ''
 ---
 
 As part of day-to-day operation of Data.gov, there are many [Operation and Maintenance (O&M) responsibilities](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities). Instead of having the entire team watching notifications and risking some notifications slipping through the cracks, we have created an [O&M Triage role](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#om-triage-rotation). One person on the team is assigned the Triage role which rotates each sprint. _This is not meant to be a 24/7 responsibility, only East Coast business hours. If you are unavailable, please note when you will be unavailable in Slack and ask for someone to take on the role for that time._
-## Routine Tasks
-These repositories will automatically create failure tickets, so no need to check the Actions
- - [Inventory Restart Action](https://github.com/GSA/inventory-app/actions/workflows/restart.yml)
- - [Inventory deploy Action](https://github.com/GSA/inventory-app/actions/workflows/deploy.yml)
- - [Catalog Restart Action](https://github.com/GSA/catalog.data.gov/actions/workflows/restart.yml)
- - [Catalog Deploy Action](https://github.com/GSA/catalog.data.gov/actions/workflows/publish.yml)
- - [Check Stuck Harvest Jobs](https://github.com/GSA/catalog.data.gov/actions/workflows/check-stuck-harvest-jobs.yml)
-
-### Snyk Scans
-For Catalog and Inventory, snyk will create PR's if a dependency needs to be updated.
- - [Inventory Snyk Scan](https://github.com/GSA/inventory-app/actions/workflows/snyk.yml)
- - [Catalog Snyk Scan](https://github.com/GSA/catalog.data.gov/actions/workflows/snyk.yml)
-
-If either of these actions failed and a PR was created, review and approve/triage it as needed
-
-If either of these actions failed and a PR was not created, an unfixable vulnerability was found, check the Snyk UI Console to triage the vulnerability.
-
-## Daily Routine
-
-### GH Actions
-Check Action tabs for each _active_ repositories, as these will not create issues automatically on failure
- - [Catalog DB-Solr-Sync Action](https://github.com/GSA/catalog.data.gov/actions/workflows/db-solr-sync-automated.yml) The actions should finish in minutes. Examine the amount of datasets affected if it takes long to finish.
- - [Tracking Update Action](https://github.com/GSA/catalog.data.gov/actions/workflows/tracking-update.yml) The action should take 1 - 2 hours to finish on prod. Examine the amount of datasets affected or Solr index speed if the time is way off.
-
-### Miscs
-- Verify harvesting jobs are running, go through Error reports to catch unusual errors that need attention [[Wiki doc](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#harvest-job-report-daily-email-report)]
+## Miscs
 - Watch for user email requests
-- Triage DMARC Report from Google (daily) sent to datagovhelp@gsa.gov (only for catalog in prod).
 - Watch in [#datagov-alerts](https://gsa-tts.slack.com/archives/C4RGAM1Q8) and [Vulnerable dependency notifications (daily email reports)](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#vulnerable-dependency-notifications-daily-email-reports) for critical alerts.
 
-## Weekly Routine
-### Solr
-- Verify each Solr Leader/Followers are functional
-
-Use this command to find Solr URLs and credentials in the `prod` space.
-
-```
-$ cf t -s prod
-$ cf env catalog-web | grep solr -C 2 | grep "uri\|solr_follower_individual_urls\|password\|username"
-```
-
-- Verify their Start time is in sync with Solr Memory Alert history at path `/solr/#/`
-- Verify each follower stays with Solr leader at path `/solr/#/ckan/core-overview`
-- Verify each Solr is responsive by running a few queries at `/solr/#/ckan/query`
-- Inspect each Solr's logging for abnormal errors at `/solr/#/~logging`
-
-- Examine the Solr Memory Utilization Graph to catch any abnormal incidences.
-
-- Log in to `tts-jump` AWS account with role `SSBDev@ssb-production`, go to custom [SolrAlarm dashboard](https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#dashboards:name=CatalogProdSolr;start=PT72H) to see the graph for the past 72 hours. There should not be any Solr instance that has MemoryUtilization go above 90% threshold without getting restarted. Each Solr should not restart too often (more than a few times a week)
 
 ## Acceptance criteria
 You are responsible for all [O&M responsibilities](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities) this week. We've highlighted a few so they're not forgotten.
@@ -66,9 +21,10 @@ You are responsible for all [O&M responsibilities](https://github.com/gsa/data.g
 | Check Restarts | | | | | | | | | | | ➖ |
 | Check [Snyk Scans](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#automated-dependency-updates-ad-hoc-github-prs) | | | | | | | | | | | ➖ |
 | Check Catalog Auto Tasks | | | | | | | | | | | ➖ |
-| Check Harvesting Emails | | | | | | | | | | | ➖ |
+| Check [Harvesting Emails](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#harvest-job-report-daily-email-report) | | | | | | | | | | | ➖ |
 | [New Relic Alerts](https://alerts.newrelic.com/accounts/1601367/incidents) Triaged | | | | | | | | | | | ➖ |
-| Check Catalog Solr | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | |
+| Triage DMARC Report from Google | | | | | | | | | | | ➖ |
+| Check [Catalog Solr](https://github.com/GSA/data.gov/wiki/Operation-and-Maintenance-Responsibilities#solr) | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | |
 | [Audit Log](https://docs.google.com/spreadsheets/d/1z6lqmyNxC7s5MiTt9f6vT41IS2DLLJl4HwEqXvvft40/edit) [*AU-6*](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#au-6-log-auditing) | | ➖ | ➖ | ➖ | ➖ | | ➖ | ➖ | ➖ | ➖ | ➖ |
 | [Catalog Dupe Check](https://github.com/GSA/data.gov/wiki/Operation-and-Maintenance-Responsibilities#duplicate-check) | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | |
 | [Invicti Scan](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#netsparker-compliance-scan-report-from-isso) | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | |