
Use proxy #99

Closed
vladsvd opened this issue May 16, 2023 · 7 comments


vladsvd commented May 16, 2023

Thanks for the great library! I didn't find how to use a proxy. Is it possible?


otsch commented May 17, 2023

Hey @vladsvd 👋
Currently it's not possible, but it's on my roadmap for the coming months.
Sidenote: since the library uses the PSR-18 ClientInterface, and neither that client nor the PSR-7 request has any functionality dedicated to proxying, it will only be possible with the default guzzle client or the headless browser.


ruerdev commented Sep 20, 2023

@otsch Hi! Just curious: have you already had a look at this? I will probably need it very soon, so otherwise I will start implementing it myself. Thanks :)


otsch commented Sep 20, 2023

Hey @ruerdev 👋
I was planning to implement it as part of Hacktoberfest 😅 the hacktoberfest.com website says it starts in 6 days.

If that's too late for you, you can pass your own guzzle instance with a proxy to your crawler's loader like this:

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\UserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use GuzzleHttp\Client;
use Psr\Log\LoggerInterface;

class MyCrawler extends HttpCrawler
{
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface|array
    {
        $httpClient = new Client(['proxy' => 'http://your-proxy/']);

        return new HttpLoader($userAgent, $httpClient);
    }

    protected function userAgent(): UserAgentInterface
    {
        return new UserAgent('Mozilla/5.0 (compatible)');
    }
}
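Running that crawler then follows the usual crwlr step API. A sketch for context; the URL is a placeholder, and the step and method names should be checked against your installed version:

```php
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = new MyCrawler();

// Placeholder start URL.
$crawler->input('https://www.example.com/');

// The request is sent through the proxied guzzle client from the loader.
$crawler->addStep(Http::get());

$crawler->runAndTraverse();
```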

If you want to use multiple proxies and rotate them, I think it'd be best to wait for the implementation. Or would you maybe want to try to implement it in the library?
Also: do you want to use it with a normal HTTP client, or with the headless chrome browser?


ruerdev commented Sep 21, 2023

Hi @otsch

Thanks for your quick reply and the example! I am actually already using that method, but I do need a bit more control over the proxy that is used: I want to use different proxy locations and rotating proxies for the URLs I crawl on a website. I am using the HTTP client for now, but will also use the headless browser in some cases in the future.

It's fine for me to wait a little. I was just checking whether it is something you are willing to spend time on :)

otsch added a commit that referenced this issue Sep 28, 2023
Add new methods `HttpLoader::useProxy()` and
`HttpLoader::useRotatingProxies([...])` to define proxies that the
loader shall use. They can be used with a guzzle HTTP client instance
(default) and when the loader uses the headless chrome browser. Using
them when providing some other PSR-18 implementation will throw an
exception.
(see #99)

Also, fix that the `HttpLoader::load()` implementation doesn't throw any
exceptions, because a loading error shouldn't kill a crawler run. When you
want any loading error to end the whole crawler execution,
`HttpLoader::loadOrFail()` should be used. Also adapted the phpdoc in
the `LoaderInterface`.

otsch commented Sep 28, 2023

@ruerdev so, the feature is on its way: #120

With that you can then do:

$crawler = HttpCrawler::make()->withUserAgent('MyUserAgent');

$crawler->getLoader()->useProxy('http://127.0.0.1:8001');

// or

$crawler->getLoader()->useRotatingProxies([
    'http://127.0.0.1:8001',
    'http://127.0.0.1:8002',
    'http://127.0.0.1:8003',
]);

useRotatingProxies() iterates through the defined proxies, and it works both with the guzzle HTTP client and with the headless chrome browser. Does that fit your needs?


ruerdev commented Sep 28, 2023

@otsch That should work great, thanks! The only suggestion I have to enhance it further is to include the option to retry a request using a different proxy in case of a failure.


otsch commented Sep 28, 2023

🤔 OK, thanks. I'd say this shouldn't be a feature specifically for the proxy use case, but a general one. There is already a feature that retries when receiving certain error responses (429 Too Many Requests and 503 Service Unavailable): https://www.crwlr.software/packages/crawler/v1.1/the-crawler/politeness#wait-and-retry
It is handled by the RetryErrorResponseHandler of the HttpLoader: https://github.com/crwlrsoft/crawler/blob/main/src/Loader/Http/Politeness/RetryErrorResponseHandler.php
Maybe we could make this a bit more flexible, so automatic retries are also possible for all other error response codes (with shorter wait times than for the 429 and 503 responses). Retrying with a different proxy would then happen automatically when rotating proxies are configured. But I'll create a new issue for this topic and close this one for now.
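Until such a generic retry feature exists, a stop-gap at the HTTP-client level could use guzzle's own retry middleware. A sketch; the proxy address and retry policy are placeholders, and note that the retry goes through the same proxy rather than switching to a different one:

```php
use GuzzleHttp\Client;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;

$stack = HandlerStack::create();

// Retry up to two times on any 5xx response. Middleware::retry() is part
// of guzzle itself, so this works independently of the crawler library.
$stack->push(Middleware::retry(
    function (int $retries, RequestInterface $request, ?ResponseInterface $response = null): bool {
        return $retries < 2 && $response !== null && $response->getStatusCode() >= 500;
    }
));

$httpClient = new Client([
    'handler' => $stack,
    'proxy' => 'http://your-proxy:8000', // placeholder proxy address
]);

// Pass $httpClient to the HttpLoader as in the earlier example.
```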

@otsch otsch closed this as completed Sep 28, 2023