Proxy support to Trafilatura #330
Apparently urllib3 has chosen not to read the proxy environment variables (see urllib3/urllib3#1785).
In trafilatura/trafilatura/downloads.py, line 104 in 82043f7:

    if use_proxy:
        HTTP_POOL = urllib3.ProxyManager(
            retries=RETRY_STRATEGY,
            timeout=config.getint('DEFAULT', 'DOWNLOAD_TIMEOUT'),
            ca_certs=certifi.where(),
            num_pools=NUM_CONNECTIONS,
            proxy_url=PROXY_HOST,
            proxy_headers=PROXY_HEADERS,
        )
    else:
        HTTP_POOL = urllib3.PoolManager(
            retries=RETRY_STRATEGY,
            timeout=config.getint('DEFAULT', 'DOWNLOAD_TIMEOUT'),
            ca_certs=certifi.where(),
            num_pools=NUM_CONNECTIONS,
        )

PROXY_**** variables could come from

The behavior of ProxyManager is the same as PoolManager: https://urllib3.readthedocs.io/en/stable/reference/urllib3.poolmanager.html#urllib3.ProxyManager

What do you think?
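Since urllib3 does not read the proxy environment variables itself, one option is to read them in the calling code and pick the manager accordingly. The following is only a minimal sketch of that idea, not Trafilatura's actual code: the helper names `proxy_url_from_env` and `build_pool`, the variable lookup order, and the default timeout and retry values are all assumptions made for illustration.

```python
import os


def proxy_url_from_env(environ=None):
    """Return the first non-empty proxy URL found in the environment, else None.

    The lookup order below is an assumption for this sketch, not a standard.
    """
    env = os.environ if environ is None else environ
    for name in ("HTTPS_PROXY", "https_proxy", "HTTP_PROXY", "http_proxy"):
        value = env.get(name, "").strip()
        if value:
            return value
    return None


def build_pool(proxy_url=None, timeout=30.0, num_pools=10):
    """Build a ProxyManager if a proxy URL is given, otherwise a PoolManager."""
    # Imported here so that the helper above can be used without urllib3.
    import urllib3

    # Hypothetical defaults; the snippet in this thread reads them from a config.
    common = {
        "retries": urllib3.util.Retry(total=2),
        "timeout": timeout,
        "num_pools": num_pools,
    }
    if proxy_url:
        return urllib3.ProxyManager(proxy_url, **common)
    return urllib3.PoolManager(**common)
```

With this in place, `build_pool(proxy_url_from_env())` falls back to a plain `PoolManager` whenever no proxy variable is set, so both managers can be used through the same `request` interface.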
Trafilatura's download utilities should stay simple so as not to confuse users. There are lots of alternatives, and downloading at scale is a different challenge altogether. A worst-case solution would be to use another tool for downloads and to process the resulting HTML files with Trafilatura. That being said, if you can find an easy way to perform HTTP requests with a proxy, it could be an interesting additional feature.
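The fallback described above (download with another tool, then extract) could be sketched as follows. This assumes the pages were already saved as `*.html` files in one directory. The function name `extract_saved_pages` and its `extract_fn` parameter are hypothetical. When no extractor is passed, the sketch uses `trafilatura.extract`, which takes an HTML string.

```python
from pathlib import Path


def extract_saved_pages(directory, extract_fn=None):
    """Run an extractor over every *.html file in `directory`.

    Returns a dict mapping file names to the extractor's output.
    """
    if extract_fn is None:
        # Default to Trafilatura; imported lazily so that a different
        # extractor can be passed in without the package being installed.
        from trafilatura import extract as extract_fn

    results = {}
    for path in sorted(Path(directory).glob("*.html")):
        html = path.read_text(encoding="utf-8", errors="replace")
        results[path.name] = extract_fn(html)
    return results
```

The download step is left entirely to the external tool, which can handle the proxy however it likes.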
Second this. There are tons of efficient ways for downloading. Trafilatura should stay focused on its main task: extraction.
Currently, I'm having trouble accessing some websites, and I believe that using a proxy might help solve this issue.
If there is no native proxy support in Trafilatura (I didn't find any in the docs), I would like to suggest adding this functionality in a future version.