Use proxy #99
Comments
Hey @vladsvd 👋
@otsch Hi! Just curious: did you already have a look at this? I will probably need this very soon, so otherwise I will start implementing it myself. Thanks :)
Hey @ruerdev 👋 If that's too late for you, you can pass your own guzzle instance with a proxy to your crawler's loader like this:

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\UserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use GuzzleHttp\Client;
use Psr\Log\LoggerInterface;

class MyCrawler extends HttpCrawler
{
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface|array
    {
        // Guzzle client that sends every request through the given proxy.
        $httpClient = new Client(['proxy' => 'http://your-proxy/']);

        return new HttpLoader($userAgent, $httpClient);
    }

    protected function userAgent(): UserAgentInterface
    {
        return new UserAgent('Mozilla/5.0 (compatible)');
    }
}

If you want to use multiple proxies and rotate them, I think it'd be best to wait for the implementation. Or would you maybe want to try to implement it in the library?
Hi @otsch Thanks for your quick reply and example! I am actually already using that method, but I do need a little more control over the proxy that is used. I want to use different proxy locations and rotating proxies for the URLs I might crawl on a website. I am using the HTTP client for now, but will also use the headless browser in some cases in the future. It's fine for me to wait a little bit. I was just checking if it was something you are willing to spend some time on :)
Add new methods `HttpLoader::useProxy()` and `HttpLoader::useRotatingProxies([...])` to define proxies that the loader shall use. They can be used with a guzzle HTTP client instance (default) and when the loader uses the headless chrome browser. Using them when providing some other PSR-18 implementation will throw an exception. (see #99) Also, fix: the `HttpLoader::load()` implementation no longer throws any exception, because a loading error shouldn't kill a crawler run. When you want any loading error to end the whole crawler execution, use `HttpLoader::loadOrFail()`. Also adapted the phpdoc in the `LoaderInterface`.
@ruerdev so, the feature is on its way: #120 With that you can then do:

$crawler = HttpCrawler::make()->withUserAgent('MyUserAgent');

$crawler->getLoader()->useProxy('http://127.0.0.1:8001');

// or

$crawler->getLoader()->useRotatingProxies([
    'http://127.0.0.1:8001',
    'http://127.0.0.1:8002',
    'http://127.0.0.1:8003',
]);
@otsch That should work great, thanks! The only suggestion I have to enhance it further is to include the option to retry a request using a different proxy in case of a failure.
🤔 OK, thanks. I'd say this wouldn't be a feature specifically for the proxy use case, but a general one. There is already a feature that retries when receiving certain error responses (429 Too Many Requests and 503 Service Unavailable): https://www.crwlr.software/packages/crawler/v1.1/the-crawler/politeness#wait-and-retry
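For anyone who needs the retry-with-a-different-proxy behaviour suggested above before the library offers it, here is a minimal sketch in plain PHP. It is not part of crwlr/crawler: `sendWithProxyFallback()` and its `$send` callback are hypothetical names made up for illustration, and the sketch assumes that a failed request surfaces as a thrown exception.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not a crwlr/crawler API): tries each proxy in order
 * and returns the result of the first attempt that does not throw.
 *
 * @param string[] $proxies proxy URLs to try, in order
 * @param callable $send    receives one proxy URL and performs the request through it
 * @return mixed whatever $send returned on the first successful attempt
 */
function sendWithProxyFallback(array $proxies, callable $send): mixed
{
    if ($proxies === []) {
        throw new InvalidArgumentException('At least one proxy is required.');
    }

    $lastError = null;

    foreach ($proxies as $proxy) {
        try {
            return $send($proxy);
        } catch (Throwable $error) {
            // Remember the failure and fall through to the next proxy.
            $lastError = $error;
        }
    }

    // Every proxy failed: report it and keep the last error as the cause.
    throw new RuntimeException('All proxies failed.', 0, $lastError);
}
```

With a guzzle client, `$send` could be something like `fn (string $proxy) => $client->request('GET', $url, ['proxy' => $proxy])`, since guzzle accepts `proxy` as a per-request option. Keeping the request itself in a callback means the helper does not depend on any particular HTTP client.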
Thanks for the great library. I could not find out how to use a proxy. Is it possible?