Proxy support to Trafilatura #330

Closed
andremacola opened this issue Apr 23, 2023 · 4 comments · Fixed by #682
Labels
enhancement New feature or request

Comments

@andremacola
Contributor

Currently, I'm having trouble accessing some websites, and I believe that using a proxy might help solve this issue.

If there is no native proxy support in Trafilatura (I didn't find any in the docs), I would like to suggest adding this functionality in a future version.

@andremacola
Contributor Author

Apparently urllib3 has chosen not to read the proxy environment variables: urllib3/urllib3#1785
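Since urllib3 leaves this to the caller, the variables would have to be read by hand. A minimal sketch of how that lookup might look, assuming the conventional `http_proxy` / `https_proxy` names and the same precedence as the standard library's `urllib.request.getproxies()` (lowercase wins over uppercase). The helper name `proxy_from_env` is hypothetical and not part of Trafilatura or urllib3:

```python
import os
from typing import Mapping, Optional


def proxy_from_env(scheme: str,
                   environ: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the proxy URL configured for `scheme` ("http" or "https"), or None.

    Hypothetical helper: it only looks at <scheme>_proxy / <SCHEME>_PROXY
    and ignores no_proxy exclusions.
    """
    env = os.environ if environ is None else environ
    # Check the lowercase name first so that it takes precedence.
    for name in (f"{scheme.lower()}_proxy", f"{scheme.upper()}_PROXY"):
        value = env.get(name, "").strip()
        if value:
            return value
    return None
```

The returned value could then be handed to whatever creates the connection pool; passing `environ` explicitly keeps the function easy to test.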

@andremacola
Contributor Author

andremacola commented Apr 23, 2023

In the current code, the pool is created as:

HTTP_POOL = urllib3.PoolManager(
    retries=RETRY_STRATEGY,
    timeout=config.getint('DEFAULT', 'DOWNLOAD_TIMEOUT'),
    ca_certs=certifi.where(),
    num_pools=NUM_CONNECTIONS,
)  # cert_reqs='CERT_REQUIRED'

Instead, we could use something like:

if use_proxy:
    HTTP_POOL = urllib3.ProxyManager(
        retries=RETRY_STRATEGY,
        timeout=config.getint('DEFAULT', 'DOWNLOAD_TIMEOUT'),
        ca_certs=certifi.where(),
        num_pools=NUM_CONNECTIONS,
        proxy_url=PROXY_HOST,
        proxy_headers=PROXY_HEADERS,
    )
else:
    HTTP_POOL = urllib3.PoolManager(
        retries=RETRY_STRATEGY,
        timeout=config.getint('DEFAULT', 'DOWNLOAD_TIMEOUT'),
        ca_certs=certifi.where(),
        num_pools=NUM_CONNECTIONS,
    )

The PROXY_**** variables could come from the config file, from OS environment variables, or from parameters of trafilatura.fetch_url(), in case the user wants to rotate between several proxies.

The behavior of ProxyManager is the same as PoolManager: https://urllib3.readthedocs.io/en/stable/reference/urllib3.poolmanager.html#urllib3.ProxyManager
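Because `ProxyManager` is a subclass of `PoolManager`, the branch above could be folded into one small factory so the shared arguments are written only once. This is only a sketch of the idea: `build_pool` is a hypothetical name, and the retry, timeout and certificate settings would be passed through as keyword arguments rather than hard-coded here:

```python
from typing import Any, Dict, Optional

import urllib3


def build_pool(proxy_url: Optional[str] = None,
               proxy_headers: Optional[Dict[str, str]] = None,
               **pool_kwargs: Any) -> urllib3.PoolManager:
    """Return a ProxyManager if `proxy_url` is set, else a plain PoolManager.

    Hypothetical helper: `pool_kwargs` is forwarded unchanged, so the same
    retries/timeout/ca_certs/num_pools values apply in both cases.
    """
    if proxy_url:
        # urllib3 accepts http:// and https:// proxy URLs here;
        # other schemes are rejected when the manager is created.
        return urllib3.ProxyManager(
            proxy_url, proxy_headers=proxy_headers, **pool_kwargs
        )
    return urllib3.PoolManager(**pool_kwargs)
```

Callers would keep using the returned object through the usual `request()` method, whichever class it is.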

What do you think?

@adbar added the enhancement label Apr 24, 2023
@adbar
Owner

adbar commented Apr 24, 2023

Trafilatura's download utilities should stay simple in order not to confuse users. There are lots of alternatives and downloading at scale is a different challenge altogether. A worst case solution would be to use another software for downloads and to process the resulting HTML files with Trafilatura.

That being said, if you can find an easy way to perform HTTP requests with a proxy, then it could be an interesting additional feature.

@fortyfourforty

> Trafilatura's download utilities should stay simple in order not to confuse users. There are lots of alternatives and downloading at scale is a different challenge altogether. A worst case solution would be to use another software for downloads and to process the resulting HTML files with Trafilatura.

Second this. There are tons of efficient ways for downloading. Trafilatura should stay focused on its main task: extraction.
It's not a good idea to make it more bloated with unnecessary features.
Just use scraping tools for scraping. Trafilatura should not be a Swiss army knife tool.

@adbar closed this as not planned Feb 5, 2024
@adbar reopened this Aug 27, 2024
@adbar linked a pull request Aug 27, 2024 that will close this issue