
Use proxy #99

Closed
vladsvd opened this issue May 16, 2023 · 7 comments


vladsvd commented May 16, 2023

Thanks for the great library! I didn't find how to use a proxy. Is it possible?


otsch commented May 17, 2023

Hey @vladsvd 👋
Currently it's not possible, but it's on my roadmap for the coming months.
Sidenote: since the library uses the PSR-18 ClientInterface, and neither that client nor the PSR-7 request has any functionality dedicated to proxying, it will only be possible with the default guzzle client or the headless browser.


ruerdev commented Sep 20, 2023

@otsch Hi! Just curious: have you already had a look at this? I will probably need it very soon, so otherwise I will start implementing it myself. Thanks :)


otsch commented Sep 20, 2023

Hey @ruerdev 👋
I was planning to implement it as part of Hacktoberfest 😅 the hacktoberfest.com website says it starts in 6 days.

If that's too late for you, you can pass your own guzzle instance with a proxy to your crawler's loader like this:

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\UserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use GuzzleHttp\Client;
use Psr\Log\LoggerInterface;

class MyCrawler extends HttpCrawler
{
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface|array
    {
        $httpClient = new Client(['proxy' => 'http://your-proxy/']);

        return new HttpLoader($userAgent, $httpClient);
    }

    protected function userAgent(): UserAgentInterface
    {
        return new UserAgent('Mozilla/5.0 (compatible)');
    }
}
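Running that crawler then follows the usual crwlr step API. A sketch for context; the URL is a placeholder, and the step and method names should be checked against your installed version:

```php
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = new MyCrawler();

// Placeholder start URL.
$crawler->input('https://www.example.com/');

// The request is sent through the proxied guzzle client from the loader.
$crawler->addStep(Http::get());

$crawler->runAndTraverse();
```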

If you want to use multiple proxies and rotate them, I think it'd be best to wait for the implementation. Or would you maybe want to try to implement it in the library?
Also: do you want to use it with a normal HTTP client, or with the headless chrome browser?


ruerdev commented Sep 21, 2023

Hi @otsch

Thanks for your quick reply and the example! I am actually already using that method, but I do need a bit more control over the proxy that is used: I want to use different proxy locations and rotating proxies for the URLs I crawl on a website. I am using the HTTP client for now, but will also use the headless browser in some cases in the future.

It's fine for me to wait a little. I was just checking whether it is something you are willing to spend time on :)

otsch added a commit that referenced this issue Sep 28, 2023
Add new methods `HttpLoader::useProxy()` and
`HttpLoader::useRotatingProxies([...])` to define proxies that the
loader shall use. They can be used with a guzzle HTTP client instance
(default) and when the loader uses the headless chrome browser. Using
them when providing some other PSR-18 implementation will throw an
exception.
(see #99)

Also, fix that the `HttpLoader::load()` implementation doesn't throw any
exceptions, because a loading error shouldn't kill a crawler run. When you
want any loading error to end the whole crawler execution,
`HttpLoader::loadOrFail()` should be used. Also adapted the phpdoc in
the `LoaderInterface`.

otsch commented Sep 28, 2023

@ruerdev so, the feature is on its way: #120

With that you can then do:

$crawler = HttpCrawler::make()->withUserAgent('MyUserAgent');

$crawler->getLoader()->useProxy('http://127.0.0.1:8001');

// or

$crawler->getLoader()->useRotatingProxies([
    'http://127.0.0.1:8001',
    'http://127.0.0.1:8002',
    'http://127.0.0.1:8003',
]);

useRotatingProxies() iterates through the defined proxies, and it works both with the guzzle HTTP client and with the headless chrome browser. Does that fit your needs?


ruerdev commented Sep 28, 2023

@otsch That should work great, thanks! The only suggestion I have to enhance it further is to include the option to retry a request using a different proxy in case of a failure.


otsch commented Sep 28, 2023

🤔 OK, thanks. I'd say this shouldn't be a feature specifically for the proxy use case, but a general one. There is already a feature that retries when receiving certain error responses (429 Too Many Requests and 503 Service Unavailable): https://www.crwlr.software/packages/crawler/v1.1/the-crawler/politeness#wait-and-retry
It is handled by the RetryErrorResponseHandler of the HttpLoader: https://github.com/crwlrsoft/crawler/blob/main/src/Loader/Http/Politeness/RetryErrorResponseHandler.php
Maybe we could make this a bit more flexible, so automatic retries are also possible for all other error response codes (with shorter wait times than for the 429 and 503 responses). Retrying with a different proxy would then happen automatically when rotating proxies are configured. But I'll create a new issue for this topic and close this one for now.
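Until such a generic retry feature exists, a stop-gap at the HTTP-client level could use guzzle's own retry middleware. A sketch; the proxy address and retry policy are placeholders, and note that the retry goes through the same proxy rather than switching to a different one:

```php
use GuzzleHttp\Client;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;

$stack = HandlerStack::create();

// Retry up to two times on any 5xx response. Middleware::retry() is part
// of guzzle itself, so this works independently of the crawler library.
$stack->push(Middleware::retry(
    function (int $retries, RequestInterface $request, ?ResponseInterface $response = null): bool {
        return $retries < 2 && $response !== null && $response->getStatusCode() >= 500;
    }
));

$httpClient = new Client([
    'handler' => $stack,
    'proxy' => 'http://your-proxy:8000', // placeholder proxy address
]);

// Pass $httpClient to the HttpLoader as in the earlier example.
```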

@otsch otsch closed this as completed Sep 28, 2023