Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Async Django app #2663

Closed
wants to merge 8 commits into from
Closed

Async Django app #2663

wants to merge 8 commits into from

Conversation

sarayourfriend
Copy link
Collaborator

@sarayourfriend sarayourfriend commented Jul 17, 2023

Fixes

Fixes #2566 by @zackkrida

Description

This is still a WIP. I haven't done anything to unit tests at all, so I'm sure many/all of them will fail. Wanted to get this up early though, I'm excited that it was possible at all to get it working!

Adds adrf along with uvicorn and django-async-redis. Also removes the synchronous wsgi-basic-auth which we weren't using in production at all.

I'll leave notes on other specific changes inline.

I decided to take the approach of separating the async routes from the sync routes. We can move routes one-by-one this way without needing to wrap entire routes in sync_to_async. It duplicates a small amount of code this way and exposes how bare-bones adrf is, but I don't think it's a big detriment. Future work to re-implement routes in the async view can happen at leisure. They'll coexist side-by-side without issue for now.

The other big change is to convert the image_proxy modules to async. This was not as straight-forward as I had hoped due to some strange event loop behaviour when using async Redis for the tallies DB in image_proxy.get. If you change that to use django_async_redis.get_redis_connect (and await it), you'll get a strange hanging event loop and subsequent closure issue on every other thumbnail request caused by redis-py. I dug around for a solution but couldn't find anything for now. Luckily, we do not need to solve this right away. For now, using sync_to_async is a safe way to adapt the synchronous redis client for usage in our async views. It might be worth switching to a different approach for redis connection management interactions than django's cache module. Django channels (https://channels.readthedocs.io/en/stable/introduction.html) could be a much better way to manage these connections. It's redis support already has full async and it comes with heaps of other cool features we could use like background workers (for things like letting thumbnail generation run beyond the timeout of a single request and be cached in Photon for subsequent requesters).

If we deploy this (or something based on it), it would be in multiple stages:

  1. Merge this change and continue using the WSGI app with an "async" thumbnails route (still handled by a single worker)
  2. Update the deployment template to use uvicorn and the ASGI app for true async thumbnails route. Other routes automatically get wrapped with sync_to_async and handled in the loop.

We can use gunicorn with uvicorn workers. That way we still have multiple workers but they can each handle asynchronous request lifecycles: https://www.uvicorn.org/deployment/#deployment

Even if we have sync routes handled by those workers, we won't be any worse off for those routes than we are with the current sync handlers. As we convert more routes to async, we'll get bigger improvements.

There might be some way to tell gunicorn to direct thumbnail requests to a subset of workers with sync requests to another. I'll need to dig into it more. Ultimately it shouldn't be an issue though because the sync routes are all very fast for the most part already.

Testing Instructions

just api/up should work after just build web. The dockerfile is configured to run uvicorn locally as a replacement for runserver, which doesn't properly support a full ASGI app. That means that we'll be able to fully test the ASGI app when running the app locally.

Additional testing is needed to confirm that running the app with gunicorn and uvicorn workers works correctly. This is the recommended approach from uvicorn for production deployments.

Reviewers: For now, please focus first on the overall approach rather than implementation details. In particular, splitting the views vs wrapping existing views in sync_to_async etc. If you agree with the overall approach, then feel free to dive into the implementation details. Please do test this locally, even though our test suite is pretty comprehensive.

Checklist

  • My pull request has a descriptive title (not a vague title likeUpdate index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • [N/A] I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@openverse-bot openverse-bot added 🟨 priority: medium Not blocking but should be addressed soon 🤖 aspect: dx Concerns developers' experience with the codebase 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🧱 stack: api Related to the Django API labels Jul 17, 2023
Comment on lines -20 to -21
from wsgi_basic_auth import BasicAuth

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This middleware was never used.

Comment on lines -9 to -13

from gevent import monkey


monkey.patch_all()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't use geventlet anymore, so this patching was not necessary and conflicts with asyncio.

@@ -101,7 +101,6 @@ COPY --chown=opener . /api/
# Collect static assets, these are used by the next stage, `nginx`
RUN env \
SETUP_ES="False" \
STATIC_ROOT="/static" \
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now correctly configured in settings for local and production deployments (which are intercepted by Nginx anyway) so there's no need to repeat the configuration here.

Comment on lines 12 to 13
if _SESSION is None or _SESSION.loop.is_closed():
_SESSION = aiohttp.ClientSession()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking the event loop makes unit testing easier and is extra security for cases where the event loop closed unexpectedly. If the loop associated with the session is closed, we need a new session or it'll try to use the closed event loop.

@@ -32,9 +33,11 @@ def _get_expiry(status, default):
return config(f"LINK_VALIDATION_CACHE_EXPIRY__{status}", default=default, cast=int)


async def _head(url: str, session: aiohttp.ClientSession) -> tuple[str, int]:
async def _head(url: str, **kwargs) -> tuple[str, int]:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes in this module are just to support the reusable aiohttp session. We should have been doing that already anyway.

@@ -5,8 +5,8 @@ DJANGO_SECRET_KEY="ny#b__$$f6ry4wy8oxre97&-68u_0lk3gw(z=d40_dxey3zw0v1"
DJANGO_DEBUG_ENABLED=True

BASE_URL=http://localhost:50280/
ENVIRONMENT=development
ALLOWED_HOSTS=api.openverse.engineering,api-dev.openverse.engineering,host.docker.internal
ENVIRONMENT=local
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use local elsewhere and it's a bit clearer than development, which has been historically muddled with staging.

Comment on lines -7 to -10
BASE_URL=http://localhost:50280/
ENVIRONMENT=development
# List of comma-separated hosts/domain names, e.g., 127.17.0.1,local.app
ALLOWED_HOSTS=localhost,172.17.0.1,host.docker.internal
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to duplicate these between env.docker. These environment files are now exclusively used for local development, so we don't need to do anything to make sure it's easy to handle live vs local deployments of it.

Comment on lines +36 to +43
@pytest.fixture
def photon_url():
return f"{settings.PHOTON_ENDPOINT}subdomain.example.com/path_part1/part2/{uuid4()}.jpg"


@pytest.fixture
def image_url(photon_url):
return photon_url.replace(settings.PHOTON_ENDPOINT, "http://")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prevent duplicated image URLs.

Comment on lines 278 to 285
def setup_http_get_exception(monkeypatch):
def do(exc):
def raise_exc(*args, **kwargs):
raise exc
mock_session = MagicMock(spec=aiohttp.ClientSession)
mock_session.get.side_effect = exc

monkeypatch.setattr("requests.get", raise_exc)
monkeypatch.setattr(
"api.utils.image_proxy.get_aiohttp_session", lambda: mock_session
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't use requests anymore for this, so we need to mock out the aiohttp session instead.

Comment on lines 502 to 514
async def test_photon_get_saves_image_type_to_cache(
pook_off, image_url, redis, headers, expected_cache_val
):
image_url = image_url.replace(".jpg", "")
image = await sync_to_async(ImageFactory.create)(url=image_url)

pook.on()
mock_head = pook.head(image_url).reply(200)
if headers:
mock_head = mock_head.headers(headers)

with mock.patch("requests.head") as mock_head, pytest.raises(UnsupportedMediaType):
mock_head.return_value.headers = headers
photon_get(image_url, image.identifier)
with pytest.raises(UnsupportedMediaType):
await photon_get(image_url, image.identifier)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using pook instead of mocking aiohttp is a more reliable way to prevent the head request from going through.

@sarayourfriend sarayourfriend marked this pull request as ready for review July 21, 2023 03:53
@sarayourfriend sarayourfriend requested a review from a team as a code owner July 21, 2023 03:54
Now that we use pytest-asyncio, it is easier to test the dead links in an async context and not need to worry about weird issues with the aiohttp client session being used outside of an async context. The places where the function was being called before (only one, in search) can now call `sync_check_dead_links` instead without any change to the overall behaviour of the app. It just makes the tests easier to configure now that we actually have real async contexts going on in some places.
@github-actions github-actions bot added 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧱 stack: ingestion server Related to the ingestion/data refresh server labels Jul 24, 2023
@sarayourfriend
Copy link
Collaborator Author

Something is happening in my implementation that causes the event loop associated with the aiohttp ClientSession to close after each thumbnail request. I can't figure out what is causing this. It can be mitigated by checking if the loop on the session is closed before reusing it (and creating a new one if needed), but the .loop property on the session is deprecated, so this raises a warning, and in future versions of aiohttp the property will be removed entirely.

I've tried to figure out what it is I did that is causing this behaviour. There's virtually nothing available online that describes the issue. It is only reproducible when using the production runtime configuration (gunicorn with uvicorn workers). When running directly under uvicorn, there's no issue at all.

Using httpx instead of aiohttp doesn't solve the error. I get PoolTimeouts instead of event loop closure errors. This means it isn't something to do with aiohttp, but the implementation of the thumbnail route. I suspect it might have something to do with sync_to_async wrappers around Redis, but I just don't know.

If I switch the implementation back to using django-async-redis, I only get event loop closure errors on every other request (as opposed to being able to serve a single request followed by only getting event loop closure errors on every subsequent request).

Anyway, I'm not able to continue working on this due to AFK. I just wanted to share where I got with this project and the feasibility.

I'm of the opinion that it is absolutely feasible and that we just need to identify the underlying issue that is causing the event loop closures. All routes other than the thumbnail continues to work perfectly fine, which leads me to believe it really is something caused by the implementation of the image proxy and small asyncio usage details that I'm not aware of. The requirement of needing to share the HTTP library session is a pain, but the implementation for it does work, theoretically!

Maybe it's worth just doing everything in FastAPI, maybe even re-writing our entire app in FastAPI. Who knows. I think the Django approach would be the best one, but getting stuck on this event loop closed issue has left me confounded. Chances are that the same implementation would have all the same problems if it was running on FastAPI instead of Django... I really don't know what the effective difference would be there!

Anyway, if someone else wants to try to debug the issue, just override the command in the docker-compose for web with gunicorn --reload and the gunicorn configuration will pick up the rest. It is only reproducible when running multiple workers, so maybe there's something thread-unsafe being done in the code that is causing this? 🤷

It is worth reverting the httpx addition back to the aiohttp version. Use the same implementation of `get_aiohttp_session` but actually use `aiohttp.ClientSession` instead. Pass `session.close` instead of `aclose` to the application shutdown callback. Everything else is basically the same. You will see the issues I have noted in the PR comment of the event loop closing if you replace the `web` service command in docker-compose with `gunicorn --reload`. The rest of the relevant configuration will get picked up from the gunicorn configuration file.
@sarayourfriend
Copy link
Collaborator Author

Also, just a note, I don't think the issue is caused by splitting the views, because all other views work perfectly fine. It's just the thumbnails route that has issues. I really think it's down to the implementation of the thumbnails route, something funky going on there with the event loops.

@sarayourfriend
Copy link
Collaborator Author

Okay, the core maintainers have agreed that feasibility has been demonstrated by this PR, so I'm going to close both the PR and the issue. I'll ping on the issue to get clarification on the precise next steps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖 aspect: dx Concerns developers' experience with the codebase 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: api Related to the Django API 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧱 stack: ingestion server Related to the ingestion/data refresh server
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Evaluate feasibility of Django ASGI and ADRF conversion
2 participants