Streamline O&M Checks #4230

Merged
merged 3 commits into from
Mar 9, 2023
69 changes: 15 additions & 54 deletions .github/ISSUE_TEMPLATE/o-and-m.md
@@ -7,67 +7,28 @@ assignees: ''
---
As part of day-to-day operation of Data.gov, there are many [Operation and Maintenance (O&M) responsibilities](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities). Instead of having the entire team watching notifications and risking some notifications slipping through the cracks, we have created an [O&M Triage role](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#om-triage-rotation). One person on the team is assigned the Triage role which rotates each sprint. _This is not meant to be a 24/7 responsibility, only East Coast business hours. If you are unavailable, please note when you will be unavailable in Slack and ask for someone to take on the role for that time._

## Routine Tasks
These repositories automatically create failure tickets, so there is no need to check the Actions manually:
- [Inventory Restart Action](https://github.com/GSA/inventory-app/actions/workflows/restart.yml)
- [Inventory Deploy Action](https://github.com/GSA/inventory-app/actions/workflows/deploy.yml)
- [Catalog Restart Action](https://github.com/GSA/catalog.data.gov/actions/workflows/restart.yml)
- [Catalog Deploy Action](https://github.com/GSA/catalog.data.gov/actions/workflows/publish.yml)

### Snyk Scans
For Catalog and Inventory, Snyk will create PRs if a dependency needs to be updated.
- [Inventory Snyk Scan](https://github.com/GSA/inventory-app/actions/workflows/snyk.yml)
- [Catalog Snyk Scan](https://github.com/GSA/catalog.data.gov/actions/workflows/snyk.yml)

If either of these actions failed and a PR was created, review and approve/triage it as needed.

If either of these actions failed and a PR was not created, an unfixable vulnerability was found; check the Snyk UI Console to triage the vulnerability.

## Daily Routine

### GH Actions
Check the Actions tab for each _active_ repository, as these will not create issues automatically on failure.
- All Automated Catalog CKAN Tasks have been consolidated into a [single action](https://github.com/GSA/catalog.data.gov/actions/workflows/ckan_auto.yml). A few unique features of these actions:
- An error issue will be created if any of the tasks have a non-zero exit code.
- https://github.com/GSA/catalog.data.gov/issues/848 is an always open informational issue that is updated each night with a link to the newest run. Inspect the link and update the comment on the issue with the number of datasets changed.
- If the DB-Solr-Sync action takes more than 30 mins, it will automatically raise an Error Issue (similar to a non-zero exit code).
- https://github.com/GSA/catalog.data.gov/issues/847 is an always open informational issue that is updated each night with a link to the newest run. Inspect the link and update the comment on the issue with the number of datasets changed.
- If the Tracking-Update action takes more than 210 mins, it will automatically raise an Error Issue (similar to a non-zero exit code).
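
The escalation rule in the list above (a non-zero exit code, or a run that exceeds its time limit) can be sketched as a small shell function. This is only an illustration under stated assumptions: `task_status` and its arguments are hypothetical names, and the real workflow may implement the check differently.

```shell
# Hypothetical helper, not part of the actual workflow.
# Prints "raise-issue" when a nightly task should produce an Error Issue,
# otherwise "ok". Arguments: exit code, elapsed minutes, limit in minutes.
task_status() {
  exit_code=$1
  elapsed_min=$2
  limit_min=$3
  if [ "$exit_code" -ne 0 ] || [ "$elapsed_min" -gt "$limit_min" ]; then
    echo "raise-issue"
  else
    echo "ok"
  fi
}

# Limits taken from the list above: 30 minutes for DB-Solr-Sync,
# 210 minutes for Tracking-Update.
task_status 0 25 30     # DB-Solr-Sync finished within its limit
task_status 0 215 210   # Tracking-Update ran past its limit
task_status 1 5 30      # non-zero exit code
```

A run that lasts exactly as long as the limit is treated as on time here, following the "more than" wording of the rule.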

### Miscellaneous
- Verify that harvesting jobs are running, and go through the error reports to catch unusual errors that need attention [[Wiki doc](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#harvest-job-report-daily-email-report)]
## Miscellaneous
- Watch for user email requests
- Triage DMARC Report from Google (daily) sent to [email protected] (only for catalog in prod).
- Watch [#datagov-alerts](https://gsa-tts.slack.com/archives/C4RGAM1Q8) and the [Vulnerable dependency notifications (daily email reports)](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#vulnerable-dependency-notifications-daily-email-reports) for critical alerts.

## Weekly Routine
### Solr
- Verify that the Solr leader and each follower are functional

Use these commands to find the Solr URLs and credentials in the `prod` space.

```
$ cf t -s prod
$ cf env catalog-web | grep solr -C 2 | grep "uri\|solr_follower_individual_urls\|password\|username"
```

- Verify that each instance's start time is consistent with the Solr Memory Alert history, at path `/solr/#/`
- Verify that each follower stays in sync with the Solr leader, at path `/solr/#/ckan/core-overview`
- Verify each Solr is responsive by running a few queries at `/solr/#/ckan/query`
- Inspect each Solr's logging for abnormal errors at `/solr/#/~logging`

- Examine the Solr Memory Utilization Graph to catch any abnormal incidents.

- Log in to the `tts-jump` AWS account with role `SSBDev@ssb-production` and open the custom [SolrAlarm dashboard](https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#dashboards:name=CatalogProdSolr;start=PT72H) to see the graph for the past 72 hours. No Solr instance should have MemoryUtilization above the 90% threshold without being restarted, and no instance should restart too often (more than a few times a week).
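
The responsiveness check above can also be run from a terminal instead of the admin UI. The following is a minimal sketch, assuming the base URL taken from `cf env` does not already include the `/solr` path and that the core is named `ckan` (as in the UI paths above). `solr_select_url`, `solr_check`, `SOLR_USER`, `SOLR_PASS`, and the example host are hypothetical names introduced for illustration.

```shell
# Build a Solr select URL that matches all documents but returns no rows.
# $1: base URL of one Solr instance (assumed not to include "/solr")
# $2: core name, defaults to "ckan"
solr_select_url() {
  base=${1%/}          # drop one trailing slash, if present
  core=${2:-ckan}
  printf '%s/solr/%s/select?q=*:*&rows=0\n' "$base" "$core"
}

# Query one instance. With --fail, curl exits non-zero on HTTP errors.
# SOLR_USER and SOLR_PASS are assumed to hold the credentials
# found with the `cf env` command shown earlier.
solr_check() {
  curl --fail --silent --show-error --max-time 30 \
    --user "$SOLR_USER:$SOLR_PASS" "$(solr_select_url "$1")"
}

# Example (hypothetical host):
#   solr_check https://solr-follower.example.test
```

A non-zero exit status from `solr_check` for any leader or follower URL suggests that instance needs a closer look in the admin UI.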

## Acceptance criteria
You are responsible for all [O&M responsibilities](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities) this week. We've highlighted a few so they're not forgotten.

- [ ] [Audit log updated](https://docs.google.com/spreadsheets/d/1z6lqmyNxC7s5MiTt9f6vT41IS2DLLJl4HwEqXvvft40/edit) for [AU-6 Log auditing](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#au-6-log-auditing) (**Friday**).
- [ ] Any [New Relic alerts](https://alerts.newrelic.com/accounts/1601367/incidents) have been addressed or GH issues created.
- [ ] Weekly [Duplicate check](https://github.com/GSA/data.gov/wiki/Operation-and-Maintenance-Responsibilities#duplicate-check) has been done, and any pertinent issues created.
- [ ] Weekly [Nessus scan](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#nessus-host-scan-report-from-isso) has been triaged.
- [ ] Weekly [Snyk scan](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#automated-dependency-updates-ad-hoc-github-prs) is complete.
| Task | Friday | Monday | Tuesday | Wednesday | Thursday | Friday | Monday | Tuesday | Wednesday | Thursday | Weekly/Monthly |
|---------------------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
| Check Deployments | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | ➖ |
| Check Restarts | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | ➖ |
| Check [Snyk Scans](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#automated-dependency-updates-ad-hoc-github-prs) | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | <ul><li>[ ] Catalog</li><li>[ ] Inventory</li></ul> | ➖ |
| Check Catalog Auto Tasks | <ul><li>[ ] DB-Solr Sync</li><li>[ ] Tracking Update</li><li>[ ] Stuck Jobs</li></ul> | <ul><li>[ ] DB-Solr Sync</li><li>[ ] Tracking Update</li><li>[ ] Stuck Jobs</li></ul> | <ul><li>[ ] DB-Solr Sync</li><li>[ ] Tracking Update</li><li>[ ] Stuck Jobs</li></ul> | <ul><li>[ ] DB-Solr Sync</li><li>[ ] Tracking Update</li><li>[ ] Stuck Jobs</li></ul> | <ul><li>[ ] DB-Solr Sync</li><li>[ ] Tracking Update</li><li>[ ] Stuck Jobs</li></ul> | <ul><li>[ ] DB-Solr Sync</li><li>[ ] Tracking Update</li><li>[ ] Stuck Jobs</li></ul> | <ul><li>[ ] DB-Solr Sync</li><li>[ ] Tracking Update</li><li>[ ] Stuck Jobs</li></ul> | <ul><li>[ ] DB-Solr Sync</li><li>[ ] Tracking Update</li><li>[ ] Stuck Jobs</li></ul> | <ul><li>[ ] DB-Solr Sync</li><li>[ ] Tracking Update</li><li>[ ] Stuck Jobs</li></ul> | <ul><li>[ ] DB-Solr Sync</li><li>[ ] Tracking Update</li><li>[ ] Stuck Jobs</li></ul> | ➖ |
| Check [Harvesting Emails](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#harvest-job-report-daily-email-report) | <ul><li>[ ] Catalog</li></ul> | <ul><li>[ ] Catalog</li></ul> | <ul><li>[ ] Catalog</li></ul> | <ul><li>[ ] Catalog</li></ul> | <ul><li>[ ] Catalog</li></ul> | <ul><li>[ ] Catalog</li></ul> | <ul><li>[ ] Catalog</li></ul> | <ul><li>[ ] Catalog</li></ul> | <ul><li>[ ] Catalog</li></ul> | <ul><li>[ ] Catalog</li></ul> | ➖ |
| [New Relic Alerts](https://alerts.newrelic.com/accounts/1601367/incidents) Triaged | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | ➖ |
| Triage DMARC Report from Google | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | <ul><li>[ ] </li></ul> | ➖ |
| Check [Catalog Solr](https://github.com/GSA/data.gov/wiki/Operation-and-Maintenance-Responsibilities#solr) | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | <ul><li>[ ] Week 1</li><li>[ ] Week 2</li></ul> |
| [Audit Log](https://docs.google.com/spreadsheets/d/1z6lqmyNxC7s5MiTt9f6vT41IS2DLLJl4HwEqXvvft40/edit) [*AU-6*](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#au-6-log-auditing) | <ul><li>[ ] </li></ul> | ➖ | ➖ | ➖ | ➖ | <ul><li>[ ] </li></ul> | ➖ | ➖ | ➖ | ➖ | ➖ |
| [Catalog Dupe Check](https://github.com/GSA/data.gov/wiki/Operation-and-Maintenance-Responsibilities#duplicate-check) | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | <ul><li>[ ] Week 1</li><li>[ ] Week 2</li></ul> |
| [Invicti Scan](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#netsparker-compliance-scan-report-from-isso) | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | ➖ | <ul><li>[ ] Week 1</li><li>[ ] Week 2</li></ul> |

- [ ] Weekly [resources.data.gov link scan](https://app.circleci.com/pipelines/github/GSA/resources.data.gov?branch=main)
- [ ] If received, the monthly [Netsparker scan](https://github.com/gsa/data.gov/wiki/Operation-and-Maintenance-Responsibilities#netsparker-compliance-scan-report-from-isso) has been triaged.
- [ ] Finishing the shift: Log the [number of alerts](https://docs.google.com/spreadsheets/d/1u1hSUAQW6FWzphog122stfB6MB9Wiq0NROT3PeicRoM/edit#gid=939071144)