Result counts returned to the user are inconsistent depending on whether the annotator is returning annotations #681

Open
sstemann opened this issue Aug 9, 2024 · 8 comments

sstemann (Contributor) commented Aug 9, 2024

ARS in Test has a recurring but intermittent issue that has two symptoms:

  1. UI only displays the first/fastest ARA (usually Improve)
  2. UI displays no results

But when I look up these PKs in the ARAX GUI, I see the majority of ARAs responding.

If I run the same query again, sometimes I get all expected results in the UI and sometimes I get whichever case (1 or 2) I didn't start with. It's very frustrating for testing.

I also think we cannot go to Prod with this behavior, as it's very inconsistent for users.

Attachment: ARS Error 444.xlsx

ShervinAbd92 (Collaborator) commented:

Hi @sstemann, I investigated all the PKs you shared in this file. For the partially shown UI results, what happens is that the first ARA returns and the node annotator behaves normally, but as soon as the second ARA sends results to the annotator, we get the following error:
Connection broken: IncompleteRead(??? bytes read, ??? more expected)', IncompleteRead(??? bytes read, ??? more expected))
For some cases we are getting this error from the get-go, so I believe the annotator may need to optimize its config settings to be able to handle different incoming sizes. What do you think @newgene?
(screenshot attached: Screen Shot 2024-08-09 at 12 02 59 PM)
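
For reference, this "Connection broken: IncompleteRead(...)" message is what urllib3 raises when the response body ends before the advertised length has been read; through `requests` it typically surfaces as a `ChunkedEncodingError` or `ConnectionError`. A minimal sketch of catching and retrying it on the caller's side, assuming a hypothetical annotator endpoint:

```python
import time

import requests

# Hypothetical annotator endpoint; the real URL is not given in this thread.
ANNOTATOR_URL = "https://annotator.example.org/annotate"


def annotate_with_retry(nodes: dict, attempts: int = 3, backoff: float = 2.0) -> dict:
    """POST nodes to the annotator, retrying when the body is cut off mid-transfer."""
    for attempt in range(1, attempts + 1):
        try:
            resp = requests.post(ANNOTATOR_URL, json=nodes, timeout=(10, 300))
            resp.raise_for_status()
            return resp.json()
        except (requests.exceptions.ChunkedEncodingError,
                requests.exceptions.ConnectionError):
            # "Connection broken: IncompleteRead(...)" surfaces through these
            # exception types when the server closes the connection early.
            if attempt == attempts:
                raise
            time.sleep(backoff * attempt)
```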

sierra-moxon changed the title from "Recurring ARS Error 444" to "Result counts returned to the user are inconsistent depending on whether the annotator is returning annotations" on Aug 9, 2024
sierra-moxon (Member) commented:

From TAQA: could this be an annotator stability issue? Right now the UI is not disambiguating between fatal and non-fatal error codes and won't display results if the annotator is not returning a 200.
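
As a purely illustrative sketch of that distinction (not the UI's actual code, and with hypothetical URLs and field names), an annotator failure would be treated as non-fatal and the results would simply be shown without annotations:

```python
import requests


def fetch_for_display(results_url: str, annotator_url: str) -> dict:
    """Treat a missing result set as fatal, but an annotator failure as non-fatal."""
    resp = requests.get(results_url, timeout=60)
    resp.raise_for_status()            # fatal: without results there is nothing to show
    message = resp.json()

    try:
        ann = requests.post(annotator_url, json=message, timeout=300)
        ann.raise_for_status()         # any non-200 from the annotator lands below
        message["annotations"] = ann.json()
    except requests.exceptions.RequestException:
        message["annotations"] = {}    # degrade gracefully: show unannotated results
    return message
```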

MarkDWilliams (Collaborator) commented:

It's not quite clear to me yet whether the connection is failing to finish due to an issue on the ARS side or because the Annotator is failing to finish transmitting the data. In the past, when we've seen similar errors on the ARS side, it was data size/timeout limitations in the configuration of the deployment. However, the payloads that the Annotator is dealing with aren't much bigger than the ARA returns that are coming back, and those seem to be getting received and written to the database fine.
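
For what it's worth, that size comparison is cheap to log on the ARS side; a minimal sketch, where the variable names in the usage comment are assumptions about what is in scope at that point:

```python
import json


def payload_size_mb(obj: dict) -> float:
    """Serialized size of a JSON payload in megabytes (UTF-8)."""
    return len(json.dumps(obj).encode("utf-8")) / 1_000_000

# Hypothetical usage just before the annotator call:
#   logger.info("annotator payload: %.1f MB", payload_size_mb(nodes))
#   logger.info("ARA response:      %.1f MB", payload_size_mb(ara_message))
```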

maximusunc commented:

Based on the automated tests from this past weekend (8/11), this issue is also observed in CI.

ShervinAbd92 (Collaborator) commented:

Update: ARS is going to implement a brotli compression feature on nodes before sending them to the annotator. This is suggested to help with the large data sizes that the annotator currently seems to have trouble processing.
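
For context, a minimal sketch of what that could look like on the sending side, assuming the annotator will accept a brotli-encoded request body (the endpoint and the Content-Encoding handling are assumptions, not confirmed in this thread):

```python
import json

import brotli    # pip install Brotli
import requests

# Hypothetical annotator endpoint.
ANNOTATOR_URL = "https://annotator.example.org/annotate"


def post_nodes_compressed(nodes: dict) -> requests.Response:
    """Brotli-compress the node payload before sending it to the annotator."""
    raw = json.dumps(nodes).encode("utf-8")
    compressed = brotli.compress(raw, quality=5)   # moderate quality keeps CPU cost down
    headers = {
        "Content-Type": "application/json",
        "Content-Encoding": "br",
    }
    return requests.post(ANNOTATOR_URL, data=compressed, headers=headers, timeout=300)
```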

sstemann (Contributor, Author) commented:

We're not moving forward with the new Annotator Service in Fugu.

@ShervinAbd92 has implemented a "local" annotator in CI and we may deploy ARS out-of-cycle. I believe this is not the permanent solution.

sierra-moxon (Member) commented:

From TAQA: this is intermittent and hard to debug; it requires ITRB and Annotator interaction/debugging on a dev instance.
The shim is a local annotator service inside the ARS in CI (to avoid ITRB issues); we are moving that forward to TEST.
On a parallel track, the annotator team is working to stabilize the standalone service, waiting on the Guppy deployment to TEST to test that.
Plan: push the ARS-contained annotator service to PROD off-cycle, then go back to working on the standalone service.

ctrl-schaff commented:

Some notes on the debugging effort for the annotator: I've disabled the compression middleware we were applying, due to the CPU bottleneck it seemed to create when handling responses. I've taken a batch of queries from our annotation logs and I'm now trying to systematically recreate the issue by querying a local instance of the server versus the CI environment, to highlight any differences in the responses. As I find more information from testing, I'll keep you informed.
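
A rough sketch of that kind of replay harness, assuming the logged queries are saved as JSON files and both deployments expose the same endpoint (both URLs and the directory layout are placeholders):

```python
import json
import pathlib

import requests

LOCAL_URL = "http://localhost:9000/annotate"           # placeholder
CI_URL = "https://annotator.ci.example.org/annotate"   # placeholder


def replay(query_dir: str) -> None:
    """Send each logged query to both deployments and flag mismatched responses."""
    for path in sorted(pathlib.Path(query_dir).glob("*.json")):
        payload = json.loads(path.read_text())
        local = requests.post(LOCAL_URL, json=payload, timeout=300)
        ci = requests.post(CI_URL, json=payload, timeout=300)
        if local.status_code != ci.status_code or len(local.content) != len(ci.content):
            print(f"{path.name}: local={local.status_code} ({len(local.content)} B), "
                  f"ci={ci.status_code} ({len(ci.content)} B)")


if __name__ == "__main__":
    replay("annotation_logs")   # directory of captured query payloads (assumed)
```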

sstemann transferred this issue from NCATSTranslator/Feedback on Aug 30, 2024