Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Django ASGI #2843

Closed
2 tasks done
sarayourfriend opened this issue Aug 17, 2023 · 19 comments
Closed
2 tasks done

Django ASGI #2843

sarayourfriend opened this issue Aug 17, 2023 · 19 comments
Assignees
Labels
💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🧭 project: thread An issue used to track a project and its progress 🧱 stack: api Related to the Django API
Milestone

Comments

@sarayourfriend
Copy link
Collaborator

sarayourfriend commented Aug 17, 2023

Start Date ETA Project Lead Actual Ship Date
2023-07-20 YYYY-MM-DD @sarayourfriend 2023-12-04

Description

See #2566. This project thread is only a stand-in to track the effort put towards the issues recorded in the milestone (linked below).

Documents

Issues

https://github.com/WordPress/openverse/milestone/19

@sarayourfriend sarayourfriend added the 🧭 project: thread An issue used to track a project and its progress label Aug 17, 2023
@sarayourfriend sarayourfriend self-assigned this Aug 17, 2023
@AetherUnbound AetherUnbound added 💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🧱 stack: api Related to the Django API labels Aug 17, 2023
@sarayourfriend
Copy link
Collaborator Author

Work is ongoing to implement issues in the linked milestone. All but the last issue can be worked on concurrently.

@openverse-bot
Copy link
Collaborator

Hi @sarayourfriend, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

@sarayourfriend
Copy link
Collaborator Author

We've progressed just under halfway through the milestone issues. Depending on capacity, I believe we should get to the point by the end of next week where we'll be on the final two issues where we actually implement the asynchronous endpoints and deployment.

The most recent and significant success was when we deployed the API using the baked-in Gunicorn configuration, instead of duplicating configuration in the ECS task definition. This wasn't strictly necessary, but will make it much easier for us to deploy the ASGI version of the service with zero downtime and without needing to do complex switchovers or multiple deployments in order to safely switch.

@dhruvkb dhruvkb added this to the Django ASGI milestone Sep 11, 2023
@openverse-bot
Copy link
Collaborator

Hi @sarayourfriend, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

@sarayourfriend
Copy link
Collaborator Author

Our plan is now to convert the app to ASGI, then apply the changes for sharing aiohttp client and the async image proxy route. Each of these have PRs open, but progress on the ASGI deployment depends on changes to the infrastructure and is also blocked by #3029 which is higher priority (hoping it solves the memory leak).

Right now I'm working on getting an infrastructure PR ready for review that will allow making changes to the task count and resources for staging followed by production, to allow us to progressively deploy the ASGI switch in each environment. Once that is up and reviewed, it will unblock actually doing the deployment of the ASGI switch-over. It might be possible to do that by end of this week, if reviews are quick, but I'm going on holiday next week so if it isn't this week then it'll have to wait until next month, unless someone else takes it over (which I really wouldn't mind help on cc @WordPress/openverse-api). Being part-time makes it difficult to get fast progress on these "one-maintainer" projects, and I'd like to get this wrapped up so I can move on to helping with the content moderation workflow work, once those IPs are ready.

@openverse-bot
Copy link
Collaborator

Hi @sarayourfriend, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

@sarayourfriend
Copy link
Collaborator Author

We've deployed the API using the canary approach, so now we can work on planning the roll out of the ASGI application. I need to rebase the ASGI PR due to dependabot and changes to uvloop in the existing application.

@AetherUnbound
Copy link
Collaborator

We also need to deploy the canary approach to production, we have not done so yet since the canary service would be missing from our dashboards and we've wanted to have as much information available as possible while we investigate recent API response time issues.

@sarayourfriend
Copy link
Collaborator Author

Ah! I missed that in what happened while I was AFK. I'll ask for some more information today during our 1:1 🙏

@sarayourfriend
Copy link
Collaborator Author

@AetherUnbound Do you think it would be a good idea to put a pause on this line of work for a week or two while we continue to work through the ES issues? I'm not personally comfortable making such dramatic changes to the way the application runs while we're still working to debug an obscure issue.

@AetherUnbound
Copy link
Collaborator

I agree, I'll move this one to on-hold until we have a solid handle on that issue.

@openverse-bot
Copy link
Collaborator

Hi @sarayourfriend, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

@sarayourfriend
Copy link
Collaborator Author

The project is no longer on hold. This is the plan of action, as I see it now, in order to finish this out:

  1. Close https://github.com/WordPress/openverse-infrastructure/pull/625 and create a new PR to enable the canary service in production. The existing PR has changes that are no longer necessary as of https://github.com/WordPress/openverse-infrastructure/pull/646, which added the canaries to dashboards, monitors, and also fixed the alias routing issue.
  2. Rebase Run the app as ASGI #3011 and prepare it for merging. Request re-reviews for good measure, especially if there are any significant changes caused by the rebase
  3. Apply https://github.com/WordPress/openverse-infrastructure/pull/616 to the staging API to increase the number of staging API workers and merge the PR
  4. Merge and deploy Run the app as ASGI #3011 to staging, confirm staging API stability
  5. Create a new PR to resize the task resources for the staging API by removing the api_task_resources variable and leaving it to the module default in staging; confirm staging API stability
  6. Increase the number of production API tasks
  7. Cut an API release and deploy the ASGI app to production; confirm API stability
  8. Reduce API task resources by removing the api_task_resources variable and leaving it to the module default in production; confirm API stability

That completes the initial ASGI deployment, but does not finish everything, as we still need to finish making changes to the application to actually benefit from ASGI at all. So after all that deployment is done and confirmed to work as expected:

  1. Rebase and deploy Add aiohttp client sharing #3024 through to production
  2. Rebase and deploy Add ADRF and make the thumbnail view async #3020 through to production
  3. Finally, convert photon.get to async and deploy that through to production

We should expect to cut multiple release through this to ensure we test each step with production load and reduce the number of changes related to this work going out to production with each release.

@sarayourfriend
Copy link
Collaborator Author

I'm going to rebase PRs today to get this ready to start rolling out next week.

@sarayourfriend
Copy link
Collaborator Author

sarayourfriend commented Nov 2, 2023

Edit: I'm updating this comment to a task list for preparing the deployment I've described above and will update it as I get PRs ready to go.

@sarayourfriend
Copy link
Collaborator Author

@stacimc @AetherUnbound and I deployed the production canary today and then Staci and I deployed the ASGI app to staging with resized workers. The deployment went well! We increased the number of tasks, then deployed the ASGI workers to staging, then decreased the task resources to a quarter of their previous amount. The end result is: slightly higher memory and CPU consumption by percent. I haven't looked at the raw numbers to see how they actually stack up, but this is what I expected. Asynchronous runtimes are often memory and CPU hogs. The previous idle memory usage in staging was about 9%. Now it's at about 13%, so 1.44x of the original. CPU was around 1% and is now ~5%, a 5x increase. If take those same ratios and apply them to production: current stable memory in production is around 30%. 1.44 of 30% is around 43%. CPU rises to peaks of 10% during high traffic periods, with spikes to 16% during deployments. 5x of 10% is 50% and of 16% it is 80%. However, we also need to consider the thumbnails service. Its memory usage is lower than the main API service, so no worries there, but the CPU goes to 34% during peak traffic when looking at 1 hour averages over the last 4 weeks. 5x of 34% is 170%, so the thumbnails very well could not fit within the plain 1/4 resize. To be safe, when we resize the production tasks, I'll propose that we resize the main API service tasks to match the staging tasks (.25 vCPU and 1GB memory) but to use .5 vCPU with 1GB memory for the thumbnails. That would put the thumbnails peaking around 85% CPU, which is high, but if it turns out to be stable, is a very nice level of resource consumption (remembering that we pay for everything we've reserved, regardless of whether we use it). Because of the asynchronous nature of the thumbnails endpoint, and because we're prioritising getting that route fully async, I wonder if we'll be able to reduce the number of tasks for the thumbnails service even temporarily. We could also see whether reducing the overall number of API tasks is feasible. Overall the changes should save us money compared to our current deployments, but I don't know how reserving .5 vCPU tasks for the thumbnails service will affect that calculus. I doubt it would put us over our current spend, but I'll check the pricing calculator tomorrow when I'm preparing the PR for the production task resize to confirm.

We plan to deploy the changes to production on Monday.

@sarayourfriend
Copy link
Collaborator Author

We deployed the ASGI app to production. This involved increasing the task count to 20, deploying the ASGI app, then decreasing task resources to 1/4 of the original vCPU and 1/2 of the original memory allocation.

Things appeared okay at first, but we're seeing what looks like a memory leak on the production thumbnails.

#3353 is the issue for tracking that work. Right now we're deploying gc debug logging to production thumbnails to see if we can uncover anything.

@sarayourfriend
Copy link
Collaborator Author

sarayourfriend commented Nov 22, 2023

@AetherUnbound The new PR (#3388) will close the last issue in this project thread. However, I'm unclear whether the project thread should also encompass getting rid of the thumbnails service. What do you think? I don't know exactly what will be involved other than just redirecting thumbnails traffic back to the main API service. Shall I add one final issue to do that, and then we can look to do that sometime in the next two weeks? We could try reducing the number of thumbnail tasks first to make sure the ASGI implementation is able to handle artificially increased production throughput. That would give me more confidence to think that a unified service will now be able to handle every request at once.

But the main question is: should this project close once we've "asgi-fied" the application and confirmed that works, or once we've actually unified the API service again? We could move this to "shipped" after #3388 and then only consider it closed once the service is unified?

@AetherUnbound
Copy link
Collaborator

We could move this to "shipped" after #3388 and then only consider it closed once the service is unified?

I think that's an excellent plan! And yes, if removing the extra thumbnail service really only entails redirecting traffic (and the cleanup of anything else we spun up, like Sentry projects, for the thumbnails), then I think we should try to do that before EOY and before marking this as success 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🧭 project: thread An issue used to track a project and its progress 🧱 stack: api Related to the Django API
Projects
Archived in project
Status: ✅ Success
Development

No branches or pull requests

4 participants