Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: allow custom exit function through dependency injection #389

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

honzajavorek
Copy link
Contributor

@honzajavorek honzajavorek commented Feb 5, 2025

In some contexts it is not desirable to call sys.exit() directly from the Actor object. This is already apparent from the surrounding code, which has to deal with such situations: IPython, pytest, or nested loop. I found yet another such situation: Twisted reactor.

I think this leads to a refactoring where the fail() and exit() functions are not responsible for ending the program, as explicit is better than implicit. However, such change would be a larger task I'm not up to, and it would possibly be a breaking change. I'll let the maintainers to discuss and decide the best solution.

My change is minimal and allows an escape route well within the existing API. If we use dependency injection and move sys.exit from being hardcoded to being a parameter of both fail() and exit(), the interface stays the same for existing code, but I'd get the ability to call these methods in contexts, where I don't want them to terminate the program:

from twisted.internet import asyncioreactor, defer
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerRunner
from scrapy.utils.defer import deferred_from_coro
from twisted.internet.task import react

asyncioreactor.install()

settings = get_project_settings()
runner = CrawlerRunner(settings)

@defer.inlineCallbacks
def crawl(reactor):
    yield deferred_from_coro(Actor.init())
    yield runner.crawl(MySpider)
    yield deferred_from_coro(Actor.exit())
    print("Done!")

react(crawl, [])

Currently, the above code won't print Done!, because the program ends and there is no straightforward way to prevent it. I invented a workaround like this:

builtins.__IPYTHON__ = True  # deception, Actor.exit() won't call sys.exit()
yield deferred_from_coro(Actor.exit())

If this PR is accepted and a new version of SDK released, I could do this:

yield deferred_from_coro(Actor.exit(sys_exit=lambda _: None))

In tests, for example, it would be possible to pass a mock and check that the sys_exit function got called with a correct exit code, etc.

Copy link
Contributor

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a problem in this, but let's wait for @janbuchar opinion as well.

@honzajavorek honzajavorek had a problem deploying to fork-pr-integration-tests February 5, 2025 09:55 — with GitHub Actions Failure
Copy link
Contributor

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this initiative! In all honesty, I'm not sure if we should call sys.exit at all, but this seems like a nice, non-breaking change. But there's stuff to discuss first.

src/apify/_actor.py Outdated Show resolved Hide resolved
@@ -231,17 +231,19 @@ async def exit(
event_listeners_timeout: timedelta | None = EVENT_LISTENERS_TIMEOUT,
status_message: str | None = None,
cleanup_timeout: timedelta = timedelta(seconds=30),
sys_exit: Callable = sys.exit,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be passed in when the context manager is used (async with Actor)? Otherwise, I'd be more inclined to just not calling sys.exit from Actor.exit. From my point of view, the proposed solution just lets you do Actor.exit(sys_exit=something) instead of Actor.exit(); something().

Comment on lines 271 to +278
if is_running_in_ipython():
self.log.debug(f'Not calling sys.exit({exit_code}) because Actor is running in IPython')
self.log.debug(f'Not calling sys_exit({exit_code}) because Actor is running in IPython')
elif os.getenv('PYTEST_CURRENT_TEST', default=False): # noqa: PLW1508
self.log.debug(f'Not calling sys.exit({exit_code}) because Actor is running in an unit test')
self.log.debug(f'Not calling sys_exit({exit_code}) because Actor is running in an unit test')
elif hasattr(asyncio, '_nest_patched'):
self.log.debug(f'Not calling sys.exit({exit_code}) because Actor is running in a nested event loop')
self.log.debug(f'Not calling sys_exit({exit_code}) because Actor is running in a nested event loop')
else:
sys.exit(exit_code)
sys_exit(exit_code)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this logic could be simplified a fair bit.

  • we don't need to check if we're running under pytest - we can instead pass in a mock sys_exit
  • didn't @vdusek manage to remove the nesting?
  • we probably still have to check for ipython 😕

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the nest-asyncio ...

We managed to resolve the problem of executing the Actor & Spider together. Here is more context and the core result structure - scrapy/scrapy#6639 (reply in thread).

However, there's still one remaining challenge - executing asyncio coroutines from the synchronous Scheduler (and HTTP cache storage) code. This is in-progress.

Once that's addressed, I'll open a new PR removing nest-asyncio completely.

@honzajavorek
Copy link
Contributor Author

Just for a context, I agree the library probably shouldn't call sys.exit() at all.

But I'm just a passing person who happened to need this in order to contribute to solving the nested loops 😄 I'm already deep in the mud over there. I can't work on two mud pools in parallel. Hence I've made this minimal change, with pincers and gloves, and I send it to you as a gift 💝

If I was a maintainer of the code, I'd take a bulldozer and yeah, sure, after a week of work and a few breaking changes, there would be no sys.exit() left and no conditions checking on pytest etc. 😅 🚜

@janbuchar
Copy link
Contributor

But I'm just a passing person who happened to need this in order to contribute to solving the nested loops 😄 I'm already deep in the mud over there. I can't work on two mud pools in parallel. Hence I've made this minimal change, with pincers and gloves, and I send it to you as a gift 💝

And we will forever be grateful for that. If you don't have the resources to work on this further, we can totally take over and polish it, but we value any input that you can provide.

If I was a maintainer of the code, I'd take a bulldozer and yeah, sure, after a week of work and a few breaking changes, there would be no sys.exit() left and no conditions checking on pytest etc. 😅 🚜

I am a maintainer and I'm still not sure we should remove it completely, so let's try and come up with a nice incremental upgrade for now.

@honzajavorek honzajavorek had a problem deploying to fork-pr-integration-tests February 5, 2025 11:20 — with GitHub Actions Failure
@honzajavorek
Copy link
Contributor Author

honzajavorek commented Feb 5, 2025

Mr. @vdusek, I see what you did there! 😂

Screenshot 2025-02-05 at 16 34 53

If you're okay to institutionalize my use case like this, I'm happy to close this 😄


Regarding about what's right and wrong... my hunch would be that the methods do too many things. They finalize and close the actor's resources, and they take care of ending the program at the same time. In some contexts, it's handy that I don't have to deal with the exit code myself (well, is it? more on that later), but in other contexts, it's a problem that I cannot opt out. So I think those should be separate steps:

  • Actor.finalize() without exit_code, but with all the rest
  • Actor.exit() which sets the exit code and calls sys.exit
  • Actor.init() takes a parameter whether it should end the program or not, saves it as a property
  • Actor.aexit() reads the property and either ends the program or not
  • I'm not sure whether Actor.fail() has to exist at all?

Hmm, and do we need ending the program at all? If Python program raises, it ends with non-zero exit code by default. Do we need to handle the program termination, or do we leave it up to the user of the SDK? I don't see a good reason to handle it at all, tbh. I struggle to see the added value, but I don't have a scraping platform and can't read user support and don't know about all the use cases. To me, ending the program as it is now feels like "we'll do this as a convenience to the user", but given the number of ifs for pytest, ipython... and the fact that raising an exception results in the same thing, I feel like in the end it's more of an inconvenience, which needs workarounds.

@B4nan
Copy link
Member

B4nan commented Feb 5, 2025

FYI the JS implementation allows to override the exit code explicitly, as well as opt out of the exit call completely, we should have it here too.

https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1978-L1981

Copy link
Contributor

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking into account the JS version actor.ts#L1978-L1981, I would say we should align the Py-SDK behavior with it. Which means add a flag for turning off the calling of sys.exit, rather than adding an custom exit function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants