Cold (?) starting curl_cffi session with ThreadPoolExecutor crashes silently #422

Open
3 tasks done
fireattack opened this issue Oct 29, 2024 · 3 comments
Labels
bug Something isn't working

Comments

fireattack commented Oct 29, 2024

Please check the following items before reporting a bug, otherwise it may be closed immediately.

  • This is NOT a site-related "bugs", e.g. some site blocks me when using curl_cffi,
    UNLESS it has been verified that the reason is missing pieces in the impersonation.
  • A code snippet that can reproduce this bug is provided, even if it's a one-liner.
  • Version information will be pasted as below.

Describe the bug

When I use concurrent.futures.ThreadPoolExecutor to run multiple threads that share one session created with requests_curl.Session(impersonate='chrome'), the script often crashes during the first few session.get() calls. No exception is raised when this happens; the process simply exits.

The bug is difficult to reproduce consistently, but I have found a pattern:

If the script hasn't been run for a long period (e.g., 10 hours), it always crashes the first time it's executed afterward (same computer, no restarts).
After that initial crash, any subsequent runs work fine without issues.

To Reproduce
My script is similar to the one below:

import concurrent.futures
from curl_cffi import requests as requests_curl
import requests

UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36'

session = requests_curl.Session(impersonate='chrome')
session.headers.update({"User-Agent": UA})
session.proxies = {'http': None, 'https': None} # disable system proxy

def parse(url):
    try:
        r = session.get(url, headers={"referer": "https://google.com"}, timeout=10)
    except Exception as e:
        print(f'Error: {e}')
        return None
    return r.status_code


def process():
    threads = 6
    futures = {}
    tasks = list(range(100000, 100000 + 100))

    with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as ex:
        for i in tasks:
            url = f'https://example.com/event/{i}/embed'
            futures[ex.submit(parse, url)] = url
        count = 0
        for fut in concurrent.futures.as_completed(futures):
            url = futures[fut]
            count += 1
            print(f'Finished: {count}/{len(tasks)}', end='\r')
        print()

process()
print('All done!')
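
Since the script above exits without raising anything, one way to gather more information is sketched below. It is a diagnostic aid, not a fix, and it rests on an assumption: that the silent exit is a native crash inside a C extension, which the thread has not confirmed. The helper names `enable_crash_dumps` and `run_all` are hypothetical. `faulthandler` is the standard-library module that prints Python tracebacks on fatal signals, and calling `fut.result()` re-raises anything that escaped a worker.

```python
import concurrent.futures
import faulthandler


def enable_crash_dumps():
    """Ask Python to dump all thread tracebacks on a fatal error.

    Returns False if stderr cannot be used (e.g. it has no file descriptor).
    """
    try:
        faulthandler.enable(all_threads=True)
    except (AttributeError, OSError, RuntimeError, ValueError):
        return False
    return True


def run_all(fn, items, max_workers=6):
    """Run fn over items in a thread pool and return (results, errors).

    Hypothetical helper: fut.result() re-raises whatever escaped the
    worker, so failures are recorded per item instead of being lost.
    """
    results, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as ex:
        futures = {ex.submit(fn, item): item for item in items}
        for fut in concurrent.futures.as_completed(futures):
            item = futures[fut]
            try:
                results[item] = fut.result()
            except Exception as exc:
                errors[item] = exc
    return results, errors
```

In the script above, this would amount to calling `enable_crash_dumps()` once at startup and replacing the submit loop in `process()` with `run_all(parse, urls)`. If the process still dies, any traceback printed to stderr shows which call each thread was in at that moment.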

Expected behavior

It should run without issue.

Observed:

>python bug.py
Finished: 3/100
>python bug.py
Finished: 100/100
All done!
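
The first run above returns to the prompt without printing "All done!", so the process exit code may help tell a native crash from a normal exit. A minimal sketch, assuming the script is saved as bug.py as in the output above; the helper name `run_and_report` is hypothetical:

```python
import subprocess
import sys


def run_and_report(argv):
    """Run a command and describe how the child process ended."""
    proc = subprocess.run(argv)
    code = proc.returncode
    if code == 0:
        status = "exited normally"
    elif code < 0:
        # On POSIX, a negative return code means "killed by signal -code".
        status = f"killed by signal {-code}"
    else:
        # On Windows, crash codes are large values best read in hex.
        status = f"exited with code {code} (0x{code & 0xFFFFFFFF:08X})"
    return code, status
```

Usage would be something like `print(run_and_report([sys.executable, "bug.py"]))`. On Windows, a code of 0xC0000005 is the standard access-violation status, which would point to a crash in native code rather than in Python.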

Versions

  • OS: Windows 10
  • curl_cffi version 0.7.1
  • pip freeze dump dump.txt

Additional context

  • Which session are you using? async or sync? sync
  • If using async session, which loop implementation are you using?
  • If you have tried, does this work with other http clients, e.g. requests, httpx or real browsers. I have never seen a similar crash before.
fireattack added the "bug" label on Oct 29, 2024
fireattack changed the title from "Cold starting curl_cffi session with with ThreadPoolExecutor crash silently" to "Cold (?) starting curl_cffi session with with ThreadPoolExecutor crashes silently" on Oct 29, 2024
Collaborator

perklet commented Oct 29, 2024

> After that initial crash, any subsequent runs work fine without issues.

Do you mean that the second time you run the script, it works? It's really weird that an exited process would have an effect on an unrelated second process.

Author

fireattack commented Oct 29, 2024

Yes, that's what I mean. I agree this is definitely weird. But it only started happening after I switched to curl_cffi, so I have to assume it's related.

Author

fireattack commented Oct 29, 2024

A small correction: it's not always the very first session.get(). As shown in the newly added console output example, it failed after 3 threads had finished successfully.

Also, I'm using it against a CF-protected website, but that turns out to be irrelevant: I can also reproduce it with https://example.com
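
One experiment that might narrow this down is to stop sharing a single session across worker threads. Whether sharing is actually the cause has not been established in this thread, so the sketch below is only a way to test that hypothesis. `PerThreadFactory` is a hypothetical helper built on the standard `threading.local`; it is not part of curl_cffi.

```python
import threading


class PerThreadFactory:
    """Lazily create one object per thread from a zero-argument factory."""

    def __init__(self, factory):
        self._factory = factory
        self._local = threading.local()

    def get(self):
        # Each thread sees its own attributes on a threading.local object.
        obj = getattr(self._local, "obj", None)
        if obj is None:
            obj = self._local.obj = self._factory()
        return obj
```

With the reproduction script, this could look like `sessions = PerThreadFactory(lambda: requests_curl.Session(impersonate='chrome'))` at module level and `sessions.get().get(url, ...)` inside `parse()`. If the crash disappears, the shared session becomes the main suspect; if it persists, sharing is probably not the trigger.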

fireattack changed the title from "Cold (?) starting curl_cffi session with with ThreadPoolExecutor crashes silently" to "Cold (?) starting curl_cffi session with ThreadPoolExecutor crashes silently" on Oct 31, 2024
2 participants