🐛 Scalability issue with multiple simultaneous DIDExchange requests #3492
Comments
Ah…this is the issue you thought was resolved with the last Askar updates. I don't think the follow-up Askar changes would have impacted this, but who knows. Just to clarify: if you do fewer than the 10 parallel tenants, things work fine, and it's only once you reach a threshold of tenants that the problems occur, correct?
Indeed - I had this debug script ready when askar 0.4.2 came out, and when I upgraded it in acapy and reran these tests, I got 2 successful runs and thought it was fixed (after previously consistent test failures). The stack traces show a timeout on opening an askar session, which is why I thought the caching changes might be just the thing that fixed it, but I haven't done enough testing on each askar version to really verify the impact of all the changes. It's probably sensitive to resource constraints. In our dev environment, it happens less often on a slightly beefier node.
Exactly - I tested 2, 3, 4, 5, 6 - all working. I wanted to find the exact boundary, but alas, impatience. With 10 consecutive requests, they usually all fail. I'll have to add some more debug logs around the place to figure out what's causing the issue. Quick inspection shows a lot of

PS: We've recently begun open-sourcing acapy-cloud (previously some hidden helm charts were needed to deploy everything locally), and so we're partially curious to hear if others can succeed in setting up a local acapy-cloud environment. I think it should prove to be a very useful and powerful repo for simplifying work with acapy, and for debugging things like this. It can definitely benefit from more users and contributors! So please, to all maintainers here, check it out and feel free to let us know if you need help 🚀
Note: in my latest runs, I see this askar warning log pop up just before the exceptions occur:
Ok. After adding debug logs to the opening and closing of Askar sessions, I can see that it's the Issuer tenant's session that gets absolutely spammed with open-close requests. The AskarProfile _setup method gets called 80+ times for the 10 didexchange requests. Logs are truncated, so I can't see exactly. Right before the Timeout gets raised, there is a barrage of setup and teardown requests called in quick succession.

An issue appears to be that the AskarProfileSessions are short-lived -- they are repeatedly re-initialized. I would expect there to be one long-lived ProfileSession object in a given request context, but this does not appear to be the case, since I also notice that

The issue can probably be solved by increasing the 10s timeout, plus some minor optimization of how many times the Askar sessions are opened and closed in short succession. But I presume this is just a bandaid, and pushes the scalability issue to a higher threshold. Worth it as a short-term solution, but the reason for ProfileSessions being constantly re-initialized should be investigated further.
Ah, I see now why sessions are recreated. I missed this in askar.profile:

```python
class AskarProfile(Profile):
    ...

    def session(self, context: Optional[InjectionContext] = None) -> ProfileSession:
        """Start a new interactive session with no transaction support requested."""
        return AskarProfileSession(self, False, context=context)
```

Probably makes sense for that to be cached as part of the AskarProfile, but maybe there's a reason not to. Gonna test and see.
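Roughly, the kind of change I'm experimenting with looks like the sketch below. To be clear, this is only an illustration of the idea, not ACA-Py code: the subclass, attribute names and import paths are mine, and a real patch would also have to check that the cached session is still active and is safe to share between callers.

```python
# Sketch only - import paths assume a recent acapy_agent layout and may differ by version.
from typing import Optional

from acapy_agent.askar.profile import AskarProfile, AskarProfileSession
from acapy_agent.config.injection_context import InjectionContext
from acapy_agent.core.profile import ProfileSession


class CachedSessionAskarProfile(AskarProfile):
    """Hypothetical variant that reuses one session per profile (an experiment, not a fix)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._cached_session: Optional[AskarProfileSession] = None  # my addition

    def session(self, context: Optional[InjectionContext] = None) -> ProfileSession:
        """Reuse the previous session when no request-specific context is given."""
        if context is None:
            # A real patch would also verify the cached session is still open
            # and guard against concurrent use; omitted here for brevity.
            if self._cached_session is None:
                self._cached_session = AskarProfileSession(self, False, context=None)
            return self._cached_session
        # A request-specific context still gets its own session, as before.
        return AskarProfileSession(self, False, context=context)
```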
Caching the session (when context = None) speeds up live/status checks by ~40ms 👀 since
I wonder if there's any difference between single-wallet/db and multi-wallet/db scenarios? I agree we should try to prevent unnecessary opening and closing of sessions. I don't think this was ever really considered a problem in the code base, and it is probably done excessively. I'd need to think about a caching solution, but it looks like you are already doing that.
It's proving quite tricky for me to refactor, because there's a (classic) trade-off between keeping too many connections open and opening/closing them too frequently. For this particular issue, it's mostly because one wallet is trying to store all these new connection records simultaneously, and something causes the DB connections to freeze / lock up.

My idea was to modify the session teardown logic so that it becomes a delayed background task, which gets cancelled if the session needs to open again within a short time window (see the sketch just below) -- but that has the trade-off of too many connections staying open and preventing new sessions. A lot of new things here for me, so I'll have to explore a bit more. There's hopefully some config available in askar that can tweak the max concurrent connections. Another option is some minor re-grouping of tasks, so that they happen while a single session is open, instead of attempting the teardown refactoring.
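To make the delayed-teardown idea concrete, here's a rough asyncio sketch of what I mean - not ACA-Py code, all names are mine, and the real ProfileSession lifecycle would need locking and error handling on top of this:

```python
import asyncio
from typing import Optional


class DelayedTeardownSession:
    """Illustrative wrapper: keep the underlying session open for a grace period
    after the last exit, and cancel the teardown if it is re-entered in time."""

    def __init__(self, open_fn, close_fn, grace_seconds: float = 0.5):
        self._open_fn = open_fn          # coroutine that opens the backend session
        self._close_fn = close_fn        # coroutine that closes it
        self._grace = grace_seconds
        self._opened = False
        self._teardown_task: Optional[asyncio.Task] = None

    async def __aenter__(self):
        # Re-entering within the grace window cancels the pending teardown.
        if self._teardown_task and not self._teardown_task.done():
            self._teardown_task.cancel()
            self._teardown_task = None
        if not self._opened:
            await self._open_fn()
            self._opened = True
        return self

    async def __aexit__(self, exc_type, exc, tb):
        # Schedule teardown instead of closing immediately.
        self._teardown_task = asyncio.create_task(self._delayed_close())

    async def _delayed_close(self):
        try:
            await asyncio.sleep(self._grace)
            await self._close_fn()
            self._opened = False
        except asyncio.CancelledError:
            pass  # the session was re-entered before the grace period elapsed
```

The obvious downside, as mentioned, is that under load these grace periods keep connections checked out of the pool for longer, which can make the contention worse rather than better.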
The answer is with the `max_connections` setting.

We were running with 10 as the max_connections, and that's why things start acting up around 10 concurrent requests. When increasing this to 100, it works: I can then run my test script with 60 tenants, and I get just 1 connection that didn't succeed... which is obviously much better than 10/10 failing, as before. (With 90 requests, I got about ~15 errors.)

So, while there is a patch / workaround for our problem, I think it's definitely worth reviewing the didexchange flow + askar session logic, so that max_connections: 5 won't result in 5/5 concurrent requests all failing -- especially failing without the end user knowing. It's fine if the auto-accept flow fails, but the tenant should know that their request didn't succeed.

I've got some in-progress work that tries to make some minor improvements, for which I'll make a draft PR next week. But I'll definitely need some more help with this one.
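For anyone else who hits this before a proper fix lands: the pool size is set wherever the wallet storage is configured. The snippet below is only an illustration of where we changed it in our deployment - the exact key names and env plumbing depend on your setup, so treat it as an assumption to verify against your own config:

```yaml
# Illustrative only: the important part is the max_connections value in the
# Askar/Postgres wallet storage config; surrounding keys are deployment-specific.
env:
  ACAPY_WALLET_STORAGE_CONFIG: '{"url": "postgres-host:5432", "max_connections": 100}'
```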
🐛 This is a bandaid for the didexchange scalability issues (see openwallet-foundation/acapy#3492)
With 10 concurrent requests, Askar shouldn't be holding on to any sessions for nearly long enough to create contention issues. I would guess there's a cache for the multi-tenant sessions, or a session instance is being maintained just because an HTTP connection is still open. The connections are pooled, so opening and closing sessions is meant to be cheap. When caching on the ACA-Py side, it is helpful to keep Profile instances around, but not Session instances.
So it sounds like this might be an issue in ACA-Py and its handling of connections to Askar. Who wants to dive into that code? Presumably it is a relatively small section of code...
One thing worth mentioning, which I noticed in the didexchange methods, is that there is nested opening of sessions. For example:

```python
class DIDXManager(BaseConnectionManager):
    async def _extract_and_record_did_doc_info(self, request: DIDXRequest):
        """Extract and record DID Document information from the DID Exchange request.

        Extracting this info enables us to correlate messages from these keys back to a
        connection when we later receive inbound messages.
        """
        if request.did_doc_attach and request.did_doc_attach.data:
            self._logger.debug("Received DID Doc attachment in request")
            async with self.profile.session() as session:  # <-- first opening of session
                wallet = session.inject(BaseWallet)
                conn_did_doc = await self.verify_diddoc(wallet, request.did_doc_attach)
                await self.store_did_document(conn_did_doc)  # <-- let's follow into this method
```
And in BaseConnectionManager:

```python
class BaseConnectionManager:
    async def store_did_document(self, value: Union[DIDDoc, dict]):
        """Store a DID document.

        Args:
            value: The `DIDDoc` instance to persist
        """
        ...
        self._logger.debug("Storing DID document for %s: %s", did, doc)
        try:
            stored_doc, record = await self.fetch_did_document(did)  # <-- opens a session
        except StorageNotFoundError:
            record = StorageRecord(self.RECORD_TYPE_DID_DOC, doc, {"did": did})
            async with self._profile.session() as session:  # <-- opens a session if above not found
                storage: BaseStorage = session.inject(BaseStorage)
                await storage.add_record(record)
        else:
            async with self._profile.session() as session:  # <-- opens a session again if it was found
                storage: BaseStorage = session.inject(BaseStorage)
                await storage.update_record(record, doc, {"did": did})
        await self.remove_keys_for_did(did)  # <-- opens a session
        await self.record_keys_for_resolvable_did(did)
```

It looks like the design of the ProfileSession had this possibility of nested opening in mind, by keeping track of an `_entered` counter:
"""An active connection to the profile management backend."""
def __init__(
self,
profile: Profile,
*,
context: Optional[InjectionContext] = None,
settings: Mapping[str, Any] = None,
):
"""Initialize a base profile session."""
self._active = False
self._awaited = False
self._entered = 0 # <--
...
async def __aenter__(self): # <-- when stepping into async with contexts
"""Async context manager entry."""
LOGGER.debug( # my added debug lines
"Profile __aenter__ called. self._active: %s for profile: %s",
self._active,
self._profile,
)
if not self._active:
LOGGER.debug(
"Setting up profile session in def __aenter__ for profile: %s.",
self._profile,
)
await self._setup()
self._active = True
LOGGER.debug("Profile session active for profile: %s.", self._profile)
self._entered += 1 # <--
LOGGER.debug(
"__aenter__ returning. self._entered: %s for profile: %s",
self._entered,
self._profile,
)
return self
async def __aexit__(self, exc_type, exc_val, exc_tb):
"""Async context manager exit."""
self._entered -= 1 # <--
LOGGER.debug(
"Profile __aexit__ called. self._entered: %s for profile: %s",
self._entered,
self._profile,
)
if not self._awaited and not self._entered: # <--
LOGGER.debug(
"Tearing down profile session in def __aexit__ for profile: %s.",
self._profile,
)
await self._teardown() # <-- teardown only called if _entered is 0
self._active = False
LOGGER.debug("Profile session inactive for profile: %s.", self._profile) When inspecting debug lines, I can see that This is because a brand new ProfileSession is returned with each class AskarProfile(Profile):
```python
class AskarProfile(Profile):
    def session(self, context: Optional[InjectionContext] = None) -> ProfileSession:
        """Start a new interactive session with no transaction support requested."""
        return AskarProfileSession(self, False, context=context)
```

So, because a new ProfileSession object is created every time, the nested `async with` blocks never share a session, and each one pays its own setup and teardown. This is why I had the idea of caching and reusing the same session in the profile object somehow, if it's still active, but it gets a bit messy. My initial experimentation didn't solve the problem (it just seemed to create other issues), and it'll take some more clever thinking to get a proper solution.

Just sharing these as some notes, for what I think would be relevant improvements to make to ACA-Py. Besides that, there needs to be exception handling so that the client making the request will know that their request has failed.
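As a note on the direction I'd like to explore for the "minor improvements" PR: let the nested helpers accept an already-open session instead of each opening their own. The sketch below is hypothetical - the optional `session` parameter does not exist in ACA-Py today, and import paths may differ by version - but it shows the shape of the change:

```python
# Sketch only - import paths assume a recent acapy_agent layout and may differ by version.
from typing import Optional

from acapy_agent.core.profile import Profile, ProfileSession
from acapy_agent.storage.base import BaseStorage
from acapy_agent.storage.record import StorageRecord


async def add_record_in_session(
    profile: Profile,
    record: StorageRecord,
    session: Optional[ProfileSession] = None,
) -> None:
    """Store a record, reusing the caller's session when one is provided.

    The optional `session` parameter is the hypothetical change: today each
    helper opens (and tears down) its own session, even when the caller
    already has one open.
    """
    if session is not None:
        # Reuse the caller's session rather than opening a new one per operation.
        storage: BaseStorage = session.inject(BaseStorage)
        await storage.add_record(record)
    else:
        # Fall back to current behaviour: open a short-lived session.
        async with profile.session() as new_session:
            storage = new_session.inject(BaseStorage)
            await storage.add_record(record)
```

Threading a session through store_did_document and its helpers in this way would collapse several of the open/close cycles highlighted above into a single one per request step.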
When multiple tenants simultaneously request a DIDExchange connection with an issuer's public DID, several unhandled exceptions are raised, causing all requested connections to fail.
The handling logic and auto-complete flows associated with the DIDExchange request do not report any error to the clients that made the request, leaving their connection record in the `request-sent` state. The issuer does not receive any `request-received` records as expected - not even one of the many requests.

Note: this is running the latest ACA-Py release, with askar 0.4.3.
Steps to Reproduce
There are many steps required to reproduce this in acapy alone... so the simplest way to reproduce this would be to check out our acapy-cloud repo (previously aries-cloudapi-python), where a simple test script can do all the setup and replicate it for you: https://github.com/didx-xyz/acapy-cloud

As a summary - besides all the steps for onboarding an issuer, and registering their public DID - here's how to replicate this issue: have multiple tenants simultaneously request a connection (`POST /didexchange/create-request`), using `use_public_did` to set the issuer's public DID for the request. (A stripped-down sketch of just the concurrent requests is included at the bottom of this description.)

The above steps can be achieved with the test script `app/tests/e2e/test_many_connections.py`:

- Run `mise run tilt:up`, and wait for services to be up and running (visit localhost:10350)
- Run `pytest app/tests/e2e/test_many_connections.py`

The test should fail with "Connection 0 failed with exception" and then "expected webhook not received".
Under Multitenant-Agent logs, you'll see many exceptions being raised, one for each request.
The stack trace seems to reveal that it's to do with a timeout waiting to open an askar session:
PS: Log levels can be modified in `helm/acapy-cloud/conf/local/multitenant-agent.yaml`, e.g. set `ACAPY_LOG_LEVEL` to `debug`.
Please let me know if the replication steps are successful or not, or whether you need help with the acapy-cloud mise setup.
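For anyone who wants to approximate the failing behaviour without the full acapy-cloud setup, the essence is just firing the create-request calls concurrently against a multitenant agent. A stripped-down sketch (the URL, tokens and DID are placeholders for whatever your deployment uses, and acapy-cloud wraps this in its own API):

```python
import asyncio

import httpx

ADMIN_URL = "http://localhost:8021"          # multitenant agent admin API (placeholder)
ISSUER_PUBLIC_DID = "did:sov:..."            # the issuer's public DID (placeholder)
TENANT_TOKENS = ["<tenant-1-token>", "..."]  # one wallet token per tenant (placeholder)


async def create_request(client: httpx.AsyncClient, token: str) -> httpx.Response:
    # Each tenant asks for a DIDExchange connection with the issuer's public DID.
    return await client.post(
        f"{ADMIN_URL}/didexchange/create-request",
        params={"their_public_did": ISSUER_PUBLIC_DID},
        headers={"Authorization": f"Bearer {token}"},
    )


async def main() -> None:
    async with httpx.AsyncClient(timeout=30) as client:
        # Fire all requests at once; with ~10 concurrent tenants the agent
        # starts timing out while opening Askar sessions.
        responses = await asyncio.gather(
            *(create_request(client, tok) for tok in TENANT_TOKENS),
            return_exceptions=True,
        )
    for i, resp in enumerate(responses):
        print(i, resp if isinstance(resp, Exception) else resp.status_code)


if __name__ == "__main__":
    asyncio.run(main())
```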