Image data refresh's "update media popularity constants" step hung on last run #1357

Closed
1 task
AetherUnbound opened this issue Nov 23, 2022 · 8 comments
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟥 priority: critical Must be addressed ASAP 💾 tech: postgres Involves PostgreSQL 🐍 tech: python Involves Python

Comments

@AetherUnbound
Collaborator

Description

The refresh_popularity_metrics_and_constants.update_media_popularity_constants_view task for the image data refresh DAG (which typically takes about 10 hours) ended up running for over 9 days. The query was still running on the postgres backend at that time. We took action to pause the DAG and kill the query. If we're able to reproduce this behavior, we need to investigate why it's happening.
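For anyone trying to reproduce this, here is a minimal sketch of how a long-running query like this one could be spotted and cancelled from Python. It assumes a DB-API cursor with psycopg2-style `%s` placeholders connected to the catalog database. `find_long_running`, `cancel_query`, and the 24-hour threshold are hypothetical names and values, not part of the catalog code; `pg_stat_activity` and `pg_cancel_backend` are standard PostgreSQL.

```python
from datetime import timedelta

# Query against PostgreSQL's built-in pg_stat_activity view.
LONG_RUNNING_SQL = """
SELECT pid, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > %s
ORDER BY runtime DESC
"""


def find_long_running(cursor, threshold=timedelta(hours=24)):
    """Return (pid, runtime, query) rows for active queries older than threshold."""
    cursor.execute(LONG_RUNNING_SQL, (threshold,))
    return cursor.fetchall()


def cancel_query(cursor, pid):
    """Ask Postgres to cancel the query running on backend `pid`.

    Returns the boolean reported by pg_cancel_backend.
    """
    cursor.execute("SELECT pg_cancel_backend(%s)", (pid,))
    return cursor.fetchone()[0]
```

With a 10-hour typical runtime, a threshold of around a day would likely have flagged this task well before the 9-day mark, though the right value is a judgment call.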

Additional context

Resolution

  • 🙋 I would be interested in resolving this bug.
@AetherUnbound AetherUnbound added 🐍 tech: python Involves Python 💻 aspect: code Concerns the software code in the repository 💾 tech: postgres Involves PostgreSQL 🛠 goal: fix Bug fix 🟥 priority: critical Must be addressed ASAP labels Nov 23, 2022
@AetherUnbound
Collaborator Author

I've marked this as "critical" because we are unable to run the image data refresh until this issue is resolved. We may be able to enable the DAG and skip the popularity recalculation steps since those only run once a month, but if there's a root cause for this we should work to try and find it.

@AetherUnbound
Collaborator Author

It looks like I failed to update the image_popularity_metrics table after WordPress/openverse-catalog#795, meaning that rawpixel was not present in that table. My first assumption was that the extra time calculating metrics was the result of rawpixel being present in the table, so the fact that nothing had changed in the popularity metrics and this task still hung is quite confusing. I'm inclined to see if we could attempt to run the data refresh on Monday of next week and monitor its progress; I'd like to think this was just a very odd fluke 😕

@zackkrida
Member

@AetherUnbound Starting the data refresh today sounds fine. Could you do that and let folks know in Slack when it's been kicked off?

Two questions as well:

  • I don't believe so, but is there any way the missing rawpixel data could have caused the hangup?
  • Is there any way to test the data refresh that would give us faster results, for example in staging or something? We'll have to wait several days to determine if this was a fluke, right?

@AetherUnbound
Collaborator Author

> I don't believe so, but is there any way the missing rawpixel data could have caused the hangup?

Since we're just stuffing that as a field into the meta_data JSON blob (and the popularity calculation knows nothing about it because it's not in the image_popularity_metrics table), it should not affect runtime any more than some other unrelated field being present there.
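The reasoning above can be modelled in a few lines of Python. This is only an illustration: the real calculation runs in SQL, and `POPULARITY_METRICS`, `provider_a`, `views`, `some_new_field`, and `popularity_metric_value` are hypothetical stand-ins, not names from the catalog.

```python
# Hypothetical stand-in for the image_popularity_metrics table:
# one entry per provider, naming the meta_data key used as its metric.
POPULARITY_METRICS = {"provider_a": "views"}


def popularity_metric_value(provider, meta_data):
    """Return the raw metric for a record, or None if the provider has no row."""
    metric_key = POPULARITY_METRICS.get(provider)
    if metric_key is None:
        # Provider is not in the metrics table, so nothing in meta_data is read.
        return None
    return meta_data.get(metric_key)


# A rawpixel record carrying an extra field is simply skipped.
print(popularity_metric_value("rawpixel", {"some_new_field": 42}))  # prints None
# An unrelated field on a listed provider does not change the result.
print(popularity_metric_value("provider_a", {"views": 10, "some_new_field": 42}))  # prints 10
```

Under this model, a field that no metrics row refers to is never looked up, which is why it should cost no more than any other unrelated key in the blob.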

> Is there any way to test the data refresh that would give us faster results, for example in staging or something? We'll have to wait several days to determine if this was a fluke, right?

We could potentially try running the matview refresh command directly on postgres, but that would require one of us to have a connection to postgres open for the duration of the query which seems less feasible than simply kicking off the DAG.
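If someone did run the refresh by hand, one way to keep it from hanging indefinitely would be to bound it with Postgres's `statement_timeout` setting. A rough sketch, assuming a psycopg2-style DB-API cursor; `run_with_statement_timeout` is a hypothetical helper, and the statement and timeout are left to whoever runs it.

```python
def run_with_statement_timeout(cursor, statement, timeout_ms):
    """Execute `statement`, letting Postgres abort it after `timeout_ms` milliseconds."""
    # set_config(name, value, is_local) is a built-in PostgreSQL function;
    # is_local=False applies the setting for the rest of the session.
    cursor.execute(
        "SELECT set_config('statement_timeout', %s, false)",
        (str(timeout_ms),),
    )
    cursor.execute(statement)
```

This does not remove the need to keep the connection open for the duration of the query; it only caps how long a stuck statement can run before Postgres cancels it.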

Good questions!

@AetherUnbound
Collaborator Author

I've started the image data refresh and will report back here with the results tomorrow.

@AetherUnbound
Collaborator Author

Oddly enough, the data refresh was able to move right past that step after 13 hours!

I'll go ahead and close this issue, as it was likely a fluke.

@AetherUnbound AetherUnbound closed this as not planned Nov 29, 2022
@zackkrida
Member

I wonder if it was related to our increased scraping traffic from the past few weeks. Interesting!

@AetherUnbound
Collaborator Author

I would be really surprised, since this is all happening on the catalog database which should be isolated from user traffic!

@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
@github-project-automation github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Apr 17, 2023
@obulat obulat moved this from 📋 Backlog to ✅ Done in Openverse Backlog Apr 24, 2023

2 participants