
refactor(api): Remove concurrency from LegacyContextPlugin #16098

Merged: 3 commits into edge from non_concurrent_action_dispatch, Sep 18, 2024

Conversation

@SyntaxColoring (Contributor) commented on Aug 22, 2024

Overview

This removes a potential source of deadlocks and race conditions.

I originally thought this would help with EXEC-417. I'm not sure it actually will at this point, but maybe we should still do it.

Background

When the robot executes a Python protocol, the ProtocolEngine and higher-level HTTP stuff live in the main thread, and the user's Python script runs in a worker thread. For older (apiLevel≤2.13) protocols, LegacyContextPlugin transports progress reports from the worker thread to the main thread.

When we initially built all of this, we had severe performance problems where the main thread would get bogged down serializing a big HTTP response or something, and then the backpressure through LegacyContextPlugin would stall the protocol thread, causing visible jankiness in robot motion. We did a bunch of stuff to try to fix this. One fix was to insert an infinite non-blocking queue between the two threads to remove the backpressure.
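
For illustration, here is a minimal sketch of that decoupling pattern. This is not the actual LegacyContextPlugin code, and all names are illustrative; the point is that put() on an unbounded queue never blocks, so the protocol thread never waits on the main thread:

# A minimal sketch of the decoupling pattern, not the real LegacyContextPlugin
# code; all names here are illustrative.
import queue
import threading

progress_queue = queue.SimpleQueue()  # unbounded, so put() never blocks


def protocol_thread():
    # Producer: even if the main thread is busy, these calls return
    # immediately, so robot motion is never stalled by reporting.
    for step in ["pickUpTip", "touchTip", "dropTip"]:
        progress_queue.put(step)
    progress_queue.put(None)  # sentinel: protocol finished


def main_thread_consumer():
    # Consumer: drains progress reports whenever the main thread gets to them.
    while (step := progress_queue.get()) is not None:
        print("report:", step)


worker = threading.Thread(target=protocol_thread)
worker.start()
main_thread_consumer()
worker.join()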

I strongly suspect that that fix is unnecessary today. As evidence: it only ever applied to older (apiLevel≤2.13) protocols. Newer (apiLevel≥2.14) protocols have always run through a different pipeline, which still has backpressure, and that newer pipeline is apparently not perceived to have janky motion.

Removing the fix would remove concurrency, which would be a meaningful simplification. For instance, I've seen hints that this part of run cleanup is racy, because it depends on whether all the pickUpTip/dropTip commands got consumed from the queue by the time we reach there.
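
To make that race concrete, here is an illustrative sketch (not the real cleanup code): whether cleanup sees a tip as attached depends on whether the consumer thread happens to have drained the queue yet.

# Illustrative sketch of the race, not the real cleanup code: the cleanup
# check can run before the consumer has drained all the tip events.
import queue
import threading

events = queue.SimpleQueue()
tip_state = {"attached": False}


def consumer():
    while (event := events.get()) is not None:
        tip_state["attached"] = event == "pickUpTip"


consumer_thread = threading.Thread(target=consumer)
consumer_thread.start()

events.put("pickUpTip")
events.put("dropTip")

# Racy: depending on scheduling, "dropTip" may not have been consumed yet, so
# cleanup can incorrectly conclude that a tip is still attached.
print("tip attached at cleanup?", tip_state["attached"])

events.put(None)  # sentinel to stop the consumer
consumer_thread.join()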

Test Plan and Hands-On Testing

  • On an OT-2, run some protocols that are movement-intensive (e.g. lots of touch_tip()s) and have apiLevel ≤ 2.13. Click around in the app and induce some network requests. There might be some jankiness, but make sure it's not severe. Really, I think we just need to make sure it's not any worse than apiLevel ≥ 2.14.
    • I'm noticing jank lasting ~0-1 seconds when you: navigate to the device details page (understandable); navigate through the device settings (weird); and navigate to the device list page (concerning). All of this applies equally to apiLevel≥2.14, so if we're concerned about this, retaining this non-blocking queue in LegacyContextPlugin is not the correct solution.

Changelog

  • Delete the non-blocking queue between the Python protocol thread and the main thread. Replace it with simple inter-thread function calls (see the sketch below).
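
For illustration, a minimal sketch of what those inter-thread function calls can look like, assuming the main thread runs an asyncio event loop (as the robot server does). The names here are illustrative, not the PR's actual API:

# A minimal sketch of the replacement, assuming the main thread runs an
# asyncio event loop; names are illustrative, not the PR's actual API.
import asyncio
import threading


async def dispatch_progress(step):
    # Runs on the main thread's event loop.
    print("report:", step)


def protocol_thread(loop):
    # Called from the worker thread. Blocking on .result() restores
    # backpressure: the protocol waits until the main thread has handled each
    # report, so no unbounded backlog can build up and cleanup ordering is
    # deterministic.
    for step in ["pickUpTip", "touchTip", "dropTip"]:
        future = asyncio.run_coroutine_threadsafe(dispatch_progress(step), loop)
        future.result()


async def main():
    loop = asyncio.get_running_loop()
    worker = threading.Thread(target=protocol_thread, args=(loop,))
    worker.start()
    # Join in a thread so the event loop stays free to run dispatch_progress().
    await asyncio.to_thread(worker.join)


asyncio.run(main())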

Review requests

Is this a good idea?

Risk assessment

Medium. There is a risk that this is covering up enduring jankiness problems, and removing it will bring the problems back. This may not be obvious to us in the office because we're more focused on Flexes, which don't run apiLevel≤2.13 protocols.

But there's also risk to having the concurrency, so... 🤷

@SyntaxColoring requested a review from a team as a code owner on August 22, 2024 15:59
@sfoster1 (Member) left a comment

I think it's a nice simplification, and I'm okay with doing it. I will say that OT-2 usage is on the decline, and this can't happen on the Flex, so it's possible that most OT-2 users are using the older protocol API levels, which is why we don't see the problem.

@SyntaxColoring (Contributor, Author) commented:

Closing for now: this isn't immediately helpful for anything, so why risk it.

If that changes, we can reopen.

@SyntaxColoring deleted the non_concurrent_action_dispatch branch on August 28, 2024 19:39
@SyntaxColoring restored the non_concurrent_action_dispatch branch on September 18, 2024 14:47
@SyntaxColoring (Contributor, Author) commented on Sep 18, 2024

Reopening after discussion with @sanni-t. #16171 noticed some concurrency problems. We want to improve the situation by:

  1. Making sure higher-level code cleans up the things that need to be cleaned up (fix(robot-server): matching RTPs check during analysis of non-RTP protocols #16171, and future work)
  2. Removing unnecessary concurrency and unnecessary sensitivity to cleanup order (this PR)
  3. Making sure server shutdown cleans things up in the correct order (a follow-up PR)

@sanni-t (Member) left a comment

This definitely helps improve our background tasks. I'm all for it. 🙌
Once it's merged, we should do some thorough benchmarking tests to document the changes in performance.

@SyntaxColoring (Contributor, Author) commented on Sep 18, 2024

I ran a quick test with opentrons_simulate on my laptop, using a protocol with 10000 touch_tip() calls. I didn't find a performance difference with or without this PR.

requirements = {"robotType": "OT-2", "apiLevel": "2.13"}


def run(protocol):
    # Load a tip rack and a plate, then stress the progress-report path with
    # a large number of touch_tip() calls.
    tip_rack = protocol.load_labware("opentrons_96_tiprack_300ul", 1)
    well_plate = protocol.load_labware("biorad_96_wellplate_200ul_pcr", 2)
    pipette = protocol.load_instrument(
        "p300_single_gen2", mount="right", tip_racks=[tip_rack]
    )
    pipette.pick_up_tip()
    for _ in range(10000):
        pipette.touch_tip(well_plate["H12"])
    pipette.return_tip()
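
For reference, a timing harness along these lines could reproduce the comparison. This assumes the opentrons.simulate.simulate() entry point from the opentrons package; the file name touch_tip_stress.py is hypothetical:

# A minimal timing harness for the protocol above, assuming the
# opentrons.simulate.simulate() entry point; "touch_tip_stress.py" is a
# hypothetical file name for the protocol shown above.
import time

from opentrons import simulate

start = time.perf_counter()
with open("touch_tip_stress.py") as protocol_file:
    simulate.simulate(protocol_file, file_name="touch_tip_stress.py")
print(f"simulated in {time.perf_counter() - start:.1f} s")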

@SyntaxColoring merged commit 6451bdb into edge on Sep 18, 2024
39 checks passed
@SyntaxColoring deleted the non_concurrent_action_dispatch branch on September 18, 2024 17:41
SyntaxColoring added a commit that referenced this pull request Sep 20, 2024