O+M 2023-07-07 #4375

Closed
10 tasks
hkdctol opened this issue Jun 30, 2023 · 3 comments
Labels: O&M (Operations and maintenance tasks for the Data.gov platform)

@hkdctol
Contributor

hkdctol commented Jun 30, 2023

Day-to-day operation of Data.gov involves many Operations and Maintenance (O&M) responsibilities. Instead of having the entire team watch notifications and risk some slipping through the cracks, we have created an O&M Triage role. One person on the team is assigned the Triage role, which rotates each sprint. This is not meant to be a 24/7 responsibility; it covers East Coast business hours only. If you will be unavailable, please note when in Slack and ask someone to take on the role for that time.

Check the O&M Rotation Schedule for future planning.

Miscs

Acceptance criteria

You are responsible for all O&M responsibilities this week. We've highlighted a few so they're not forgotten. You can copy each checklist into your daily report.

Daily Checklist

Check Production State/Actions

Note: Catalog Auto Tasks
You will need to update the chart values manually. Click the Action link in each issue, take the values from the monitor task output, and check the runtime.

Weekly Checklist

@hkdctol hkdctol added the O&M Operations and maintenance tasks for the Data.gov platform label Jun 30, 2023
@hkdctol hkdctol moved this to 📟 Sprint Backlog [7] in data.gov team board Jun 30, 2023
@nickumia-reisys nickumia-reisys moved this from 📟 Sprint Backlog [7] to 🏗 In Progress [8] in data.gov team board Jul 3, 2023
@nickumia-reisys
Contributor

Day 1 Summary

  • Deployed CKAN 2.10 to inventory prod
  • Inventory prod Site Alarm broke on NR (haven't been able to figure out why)
  • Worked on inventory flask + werkzeug vulnerabilities
    • A lot of requirement updates are necessary
    • flask-multistatic is not compatible with the non-vulnerable flask + werkzeug versions. I made a temporary repo that implements a fix to make it compatible (but I don't know if we want to do that). CKAN removed flask-multistatic as a dependency on its main branch (the next version of CKAN will have a different fix for that)
    • Got all but 2 tests passing
  • State Json harvest source broken on catalog (will connect with @FuhuXia on where that stands from last week)
  • Surprisingly the end of month harvest jobs did not cause any stuck jobs this time 😮

@nickumia-reisys
Contributor

nickumia-reisys commented Jul 5, 2023

Day 2 Summary

  • There was a glitch in FEMA-R02 harvest source. It failed with HTTPSConnectionPool(host='hazards.fema.gov', port=443): Max retries exceeded with url: /filedownload/metadata/R02/599696-AcquireBaseMap.xml (Caused by ProxyError('Cannot connect to proxy.', timeout('The read operation timed out'))). Re-running the harvest job did not reproduce the error.
  • The Phila data.json harvest source has been failing since the beginning of June. Apparently, www.opendataphilly.org now redirects to opendataphilly.org, and we weren't allowing the root domain.
  • Determined that there is a bug in CKAN 2.10 where old API tokens no longer work. (reference)
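
To illustrate the Phila redirect issue above, here is a minimal sketch of exact-match host allowlisting. It is not the actual Data.gov egress configuration, which this issue does not show. The names `ALLOWED_HOSTS`, `is_host_allowed`, and `blocked_hops`, as well as the `/data.json` path, are hypothetical and used only for illustration. The point is that allowing `www.opendataphilly.org` does not implicitly allow the root domain it redirects to.

```python
from typing import Iterable, List, Set
from urllib.parse import urlsplit

# Hypothetical allowlist: only the www host is permitted, as in the
# situation described above (the real configuration is an assumption).
ALLOWED_HOSTS: Set[str] = {"www.opendataphilly.org"}


def is_host_allowed(url: str, allowed_hosts: Set[str]) -> bool:
    """Return True only if the URL's hostname exactly matches an allowed entry."""
    host = urlsplit(url).hostname  # already lowercased by urllib
    return host is not None and host in allowed_hosts


def blocked_hops(redirect_chain: Iterable[str], allowed_hosts: Set[str]) -> List[str]:
    """Return every URL in a redirect chain whose host is not allowed."""
    return [url for url in redirect_chain if not is_host_allowed(url, allowed_hosts)]


# Hypothetical redirect chain: the www host redirects to the root domain.
chain = [
    "https://www.opendataphilly.org/data.json",
    "https://opendataphilly.org/data.json",
]

# With only the www host allowed, the second hop is blocked.
print(blocked_hops(chain, ALLOWED_HOSTS))

# Adding the root domain lets the whole chain through.
print(blocked_hops(chain, ALLOWED_HOSTS | {"opendataphilly.org"}))
```

Under these assumptions, a harvest that follows the redirect fails on the second hop until the root domain is added alongside the `www` entry.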

@nickumia-reisys
Contributor

nickumia-reisys commented Jul 7, 2023

Day 3 + 4 Summary

  • Catalog experienced an outage for ~30 mins because all 3 solr followers restarted at the same time (reference)
    • Removed Follower 0 from Catalog prod traffic. Going to see if the requests are actually causing the restarts or if it is the replication.
  • Alaska harvest sources are partially failing, might be intermittent issues
  • Investigated an error that's been appearing in the catalog logs (reference). It seems related to beaker, but the beaker secrets seem to be set up properly.

@github-project-automation github-project-automation bot moved this from 🏗 In Progress [8] to ✔ Done in data.gov team board Jul 11, 2023
@hkdctol hkdctol moved this from ✔ Done to 🗄 Closed in data.gov team board Jul 20, 2023