Merge pull request #4230 from GSA/om-table-check
Streamline O&M Checks
jbrown-xentity authored Mar 9, 2023
2 parents c6cad60 + 406c746 commit b40b133
Showing 1 changed file with 15 additions and 54 deletions.
69 changes: 15 additions & 54 deletions .github/ISSUE_TEMPLATE/o-and-m.md
@@ -7,67 +7,28 @@ assignees: ''
---
As part of the day-to-day operation of Data.gov, there are many [Operation and Maintenance (O&M) responsibilities](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities). Rather than having the entire team watch notifications and risk some slipping through the cracks, we have created an [O&M Triage role](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#om-triage-rotation). One person on the team is assigned the Triage role, which rotates each sprint. _This is not meant to be a 24/7 responsibility; it covers only East Coast business hours. If you are unavailable, please note when you will be unavailable in Slack and ask for someone to take on the role for that time._

## Routine Tasks
These repositories automatically create failure tickets, so there is no need to check the Actions.
- [Inventory Restart Action](https://github.com/GSA/inventory-app/actions/workflows/restart.yml)
- [Inventory Deploy Action](https://github.com/GSA/inventory-app/actions/workflows/deploy.yml)
- [Catalog Restart Action](https://github.com/GSA/catalog.data.gov/actions/workflows/restart.yml)
- [Catalog Deploy Action](https://github.com/GSA/catalog.data.gov/actions/workflows/publish.yml)

### Snyk Scans
For Catalog and Inventory, Snyk will create PRs if a dependency needs to be updated.
- [Inventory Snyk Scan](https://github.com/GSA/inventory-app/actions/workflows/snyk.yml)
- [Catalog Snyk Scan](https://github.com/GSA/catalog.data.gov/actions/workflows/snyk.yml)

If either of these actions failed and a PR was created, review and approve/triage it as needed.

If either of these actions failed and a PR was not created, an unfixable vulnerability was found. Check the Snyk UI Console to triage the vulnerability.

## Daily Routine

### GH Actions
Check the Actions tab for each _active_ repository, as these will not create issues automatically on failure.
- All Automated Catalog CKAN Tasks have been consolidated into a [single action](https://github.com/GSA/catalog.data.gov/actions/workflows/ckan_auto.yml). A few unique features of these actions:
- An error issue will be created if any of the tasks have a non-zero exit code.
- https://github.com/GSA/catalog.data.gov/issues/848 is an always open informational issue that is updated each night with a link to the newest run. Inspect the link and update the comment on the issue with the number of datasets changed.
- If the DB-Solr-Sync action takes more than 30 mins, it will automatically raise an Error Issue (similar to a non-zero exit code).
- https://github.com/GSA/catalog.data.gov/issues/847 is an always open informational issue that is updated each night with a link to the newest run. Inspect the link and update the comment on the issue with the number of datasets changed.
- If the Tracking-Update action takes more than 210 mins, it will automatically raise an Error Issue (similar to a non-zero exit code).
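
The time limits above can be sketched as a small shell check. This is only an illustration of the rule as stated in the checklist, not the logic the workflow itself uses; the function names `limit_for` and `over_limit` are hypothetical, and the elapsed minutes are assumed to be supplied by whoever reads the run page.

```shell
#!/usr/bin/env bash
# Limits in minutes, taken from the checklist text above.
limit_for() {
  case "$1" in
    DB-Solr-Sync)    echo 30 ;;
    Tracking-Update) echo 210 ;;
    *)               return 1 ;;
  esac
}

# Hypothetical helper: exits 0 when the elapsed minutes exceed the task's
# limit, 1 when they do not, and 2 when the task name is unknown.
over_limit() {
  local task="$1" elapsed="$2" limit
  limit="$(limit_for "$task")" || { echo "unknown task: $task" >&2; return 2; }
  [ "$elapsed" -gt "$limit" ]
}

# Example values only; read the real duration from the linked run.
over_limit DB-Solr-Sync 45 && echo "DB-Solr-Sync: over limit"
over_limit Tracking-Update 180 || echo "Tracking-Update: within limit"
```

A run that is over its limit should already have an Error Issue raised for it, so a mismatch between this check and the issue list is worth a closer look.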

### Miscs
- Verify harvesting jobs are running, and go through the Error reports to catch unusual errors that need attention [[Wiki doc](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#harvest-job-report-daily-email-report)]
## Miscs
- Watch for user email requests
- Triage DMARC Report from Google (daily) sent to [email protected] (only for catalog in prod).
- Watch in [#datagov-alerts](https://gsa-tts.slack.com/archives/C4RGAM1Q8) and [Vulnerable dependency notifications (daily email reports)](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#vulnerable-dependency-notifications-daily-email-reports) for critical alerts.

## Weekly Routine
### Solr
- Verify that the Solr Leader and each Follower are functional

Use these commands to find Solr URLs and credentials in the `prod` space.

```
$ cf t -s prod
$ cf env catalog-web | grep solr -C 2 | grep "uri\|solr_follower_individual_urls\|password\|username"
```

- Verify that each instance's Start time is consistent with the Solr Memory Alert history at path `/solr/#/`
- Verify each follower stays in sync with the Solr leader at path `/solr/#/ckan/core-overview`
- Verify each Solr is responsive by running a few queries at `/solr/#/ckan/query`
- Inspect each Solr's logging for abnormal errors at `/solr/#/~logging`

- Examine the Solr Memory Utilization Graph to catch any abnormal incidents.

- Log in to the `tts-jump` AWS account with role `SSBDev@ssb-production`, then go to the custom [SolrAlarm dashboard](https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#dashboards:name=CatalogProdSolr;start=PT72H) to see the graph for the past 72 hours. No Solr instance should have MemoryUtilization go above the 90% threshold without getting restarted. Each Solr should not restart too often (more than a few times a week).
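
As a convenience for the per-instance checks, the admin pages named in this section can be listed for one Solr base URL with a short shell function. This is a minimal sketch: `solr_admin_pages` is a hypothetical helper, the base URL shown is a placeholder, and the real URLs and credentials still come from the `cf env` output.

```shell
#!/usr/bin/env bash
# Print the Solr admin pages to review for one Solr base URL.
# The four paths come from the checklist; the base URL is supplied by the caller.
solr_admin_pages() {
  local base="${1%/}"   # drop one trailing slash, if present
  local path
  for path in '/solr/#/' '/solr/#/ckan/core-overview' '/solr/#/ckan/query' '/solr/#/~logging'; do
    printf '%s%s\n' "$base" "$path"
  done
}

# Placeholder base URL, for illustration only.
solr_admin_pages "https://solr.example.test/"
```

Running it once per leader and follower URL gives a list of pages to open in the browser, one line per check.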

## Acceptance criteria
You are responsible for all [O&M responsibilities](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities) this week. We've highlighted a few so they're not forgotten.

- [ ] [Audit log updated](https://docs.google.com/spreadsheets/d/1z6lqmyNxC7s5MiTt9f6vT41IS2DLLJl4HwEqXvvft40/edit) for [AU-6 Log auditing](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#au-6-log-auditing) (**Friday**).
- [ ] Any [New Relic alerts](https://alerts.newrelic.com/accounts/1601367/incidents) have been addressed or GH issues created.
- [ ] Weekly [Duplicate check](https://github.com/GSA/data.gov/wiki/Operation-and-Maintenance-Responsibilities#duplicate-check) has been done, and any pertinent issues created.
- [ ] Weekly [Nessus scan](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#nessus-host-scan-report-from-isso) has been triaged.
- [ ] Weekly [Snyk scan](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#automated-dependency-updates-ad-hoc-github-prs) is complete.
| Task | Friday | Monday | Tuesday | Wednesday | Thursday | Friday | Monday | Tuesday | Wednesday | Thursday | Weekly/Monthly |
|---------------------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
| Check Deployments | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> ||
| Check Restarts | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> ||
| Check [Snyk Scans](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#automated-dependency-updates-ad-hoc-github-prs) | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> ||
| Check Catalog Auto Tasks | <ul><li>[ ] DB-Solr Sync</li><li>[ ] Tracking Update</li><li>[ ] Stuck Jobs</li></ul> |<ul><li>[ ] DB-Solr Sync</li><li>[ ] Tracking Update</li><li>[ ] Stuck Jobs</li></ul> | <ul><li>[ ] DB-Solr Sync</li><li>[ ] Tracking Update</li><li>[ ] Stuck Jobs</li></ul> | <ul><li>[ ] DB-Solr Sync</li><li>[ ] Tracking Update</li><li>[ ] Stuck Jobs</li></ul> | <ul><li>[ ] DB-Solr Sync</li><li>[ ] Tracking Update</li><li>[ ] Stuck Jobs</li></ul> | <ul><li>[ ] DB-Solr Sync</li><li>[ ] Tracking Update</li><li>[ ] Stuck Jobs</li></ul> |<ul><li>[ ] DB-Solr Sync</li><li>[ ] Tracking Update</li><li>[ ] Stuck Jobs</li></ul> | <ul><li>[ ] DB-Solr Sync</li><li>[ ] Tracking Update</li><li>[ ] Stuck Jobs</li></ul> | <ul><li>[ ] DB-Solr Sync</li><li>[ ] Tracking Update</li><li>[ ] Stuck Jobs</li></ul> | <ul><li>[ ] DB-Solr Sync</li><li>[ ] Tracking Update</li><li>[ ] Stuck Jobs</li></ul> ||
| Check [Harvesting Emails](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#harvest-job-report-daily-email-report) | <ul><li>[ ] Catalog</li></ul> | <ul><li>[ ] Catalog</li></ul> | <ul><li>[ ] Catalog</li></ul> | <ul><li>[ ] Catalog</li></ul> | <ul><li>[ ] Catalog</li></ul> | <ul><li>[ ] Catalog</li></ul> | <ul><li>[ ] Catalog</li></ul> | <ul><li>[ ] Catalog</li></ul> | <ul><li>[ ] Catalog</li></ul> | <ul><li>[ ] Catalog</li></ul> ||
| [New Relic Alerts](https://alerts.newrelic.com/accounts/1601367/incidents) Triaged | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> ||
| Triage DMARC Report from Google | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> ||
| Check [Catalog Solr](https://github.com/GSA/data.gov/wiki/Operation-and-Maintenance-Responsibilities#solr) ||||||||||| <ul><li>[ ] Week 1</li><li>[ ] Week 2</li></ul> |
| [Audit Log](https://docs.google.com/spreadsheets/d/1z6lqmyNxC7s5MiTt9f6vT41IS2DLLJl4HwEqXvvft40/edit) [*AU-6*](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#au-6-log-auditing) | <ul><li>[ ] </li></ul> ||||| <ul><li>[ ] </li></ul> ||||||
| [Catalog Dupe Check](https://github.com/GSA/data.gov/wiki/Operation-and-Maintenance-Responsibilities#duplicate-check) ||||||||||| <ul><li>[ ] Week 1</li><li>[ ] Week 2</li></ul> |
| [Invicti Scan](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#netsparker-compliance-scan-report-from-isso) ||||||||||| <ul><li>[ ] Week 1</li><li>[ ] Week 2</li></ul> |

- [ ] Weekly [resources.data.gov link scan](https://app.circleci.com/pipelines/github/GSA/resources.data.gov?branch=main)
- [ ] If received, the monthly [Netsparker scan](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#netsparker-compliance-scan-report-from-isso) has been triaged.
- [ ] Finishing the shift: Log the [number of alerts](https://docs.google.com/spreadsheets/d/1u1hSUAQW6FWzphog122stfB6MB9Wiq0NROT3PeicRoM/edit#gid=939071144)
