Google Search Limitation #16
I've already set up a service that solves this: it waits 3-10 seconds and then makes the request to Google, which prevents blocking. I'm considering an option to introduce this behavior into the library. Would you be interested in implementing it?
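A minimal sketch of that idea as described above (the 3-10 second range comes from the comment; the function name and the use of `requests` are my assumptions, not the actual service code):

```python
import random
import time

import requests


def delayed_get(url, **kwargs):
    """Wait a random 3-10 seconds, then perform the request.

    Spacing requests out makes the traffic look less bot-like and
    helps avoid Google's rate limiting.
    """
    time.sleep(random.uniform(3, 10))
    return requests.get(url, **kwargs)
```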
In my use case, I want to avoid using search APIs, but I also want to perform (relatively) quick and numerous searches. That's why I ran into the problem of Google detecting my bot script and responding with a captcha page. So, regarding your solution: the problem is that a systematic 3-10 second wait slows down every request, even while Google is not blocking, which conflicts with my need for quick searches.
I propose the following solution: when Google starts answering with captcha pages, fall back to browser automation with human-like random waits, as in the sketch below.

Edit: I had mixed up two code blocks in the first version of this comment; this is now corrected.
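A minimal sketch of the proposal, assuming Splinter driving Firefox (the original code blocks were lost from this thread; the captcha-detection heuristics and all names here are my assumptions):

```python
import random
import time

import requests
from splinter import Browser  # pip install splinter; drives a real browser


def is_blocked(response):
    # Google typically redirects blocked clients to a /sorry/ captcha page.
    return "/sorry/" in response.url or response.status_code == 429


def fetch_with_browser(url):
    """Fetch a page through Firefox, mimicking a human with a random wait."""
    browser = Browser("firefox", headless=True)
    try:
        time.sleep(random.uniform(3, 10))
        browser.visit(url)
        return browser.html
    finally:
        browser.quit()


def google_search(url):
    """Try a plain HTTP request first; fall back to browser automation."""
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    if not is_blocked(response):
        return response.text
    return fetch_with_browser(url)
```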
There is a major issue with using your method: I made a whole bunch of requests and got myself blocked, and now I get the same captcha response from Google no matter how I open it (Firefox, requests, or selenium). I'm not sure if you've solved that; please let me know if you have.

These are my recommendations: if you're making many queries, space them out and back off when you get blocked. This is what I suggest, including an example of calling the function (sketch below):
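The code from this comment was lost in the thread; this is a reconstruction of an exponential-backoff retry under my own assumptions (the function name, the starting delay, and the /sorry/ check are invented):

```python
import random
import time

import requests


def search_with_backoff(url, max_retries=5):
    """Retry with exponentially growing, jittered pauses when blocked."""
    delay = 5.0  # hypothetical starting pause, in seconds
    for _ in range(max_retries):
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
        if "/sorry/" not in response.url:  # not the captcha page
            return response.text
        # Blocked: wait, then double the pause before the next attempt.
        time.sleep(delay + random.uniform(0, delay))
        delay *= 2
    raise RuntimeError("still blocked after %d attempts" % max_retries)


# Calling the function:
html = search_with_backoff("https://www.google.com/search?q=python")
```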
Sorry, the reason you received the captcha response is that I had mixed up two lines of code. This is now corrected. The effect is that failures now trigger a wait timer with exponential backoff, and browser automation is only tried if a normal request really does not work. So, if you want to retest...
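In other words, the corrected order combines the two sketches above: back off first, and only escalate to the browser as a last resort. A hypothetical combination, reusing `search_with_backoff()` and `fetch_with_browser()` from the earlier sketches:

```python
def reliable_search(url, max_retries=5):
    """Back off first; automate a browser only if requests keep failing."""
    try:
        return search_with_backoff(url, max_retries=max_retries)
    except RuntimeError:
        # Still blocked after exponential backoff: last resort.
        return fetch_with_browser(url)
```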
Anyway, this is of course your project, so I will not make you lose time by trying to convince you that there could be a better way to proceed. It is only a question of use case. In my own use case, I need to trigger asynchronous tasks running Google/Bing/[whatever search engine] searches and be sure that each one sends back a result, without having to tune a wait time myself. So, for my application, this should be handled transparently by the library, e.g.:
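The original example was lost; here is a hypothetical call illustrating what "transparent" means (the `reliable` flag is invented for illustration and is not a real parameter of this library):

```python
from googlesearch import search  # assuming this library's search function

# Hypothetical flag: the library itself waits, backs off, and falls
# back to browser automation as needed, so the caller never tunes delays.
for url in search("python asyncio tutorial", reliable=True):
    print(url)
```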
So, I suppose your solution is to let the caller handle the backoff. (As a side note: the suggested code could probably be written more simply.)

Regarding your enumerated list of remarks: if the points behind them are requirements for your project, then I understand that my proposal is not a solution for you. About the dependencies: Splinter does indeed pull in extra requirements (a web driver and a browser).
Regarding your requirements, your solution is surely what you need. Anyway, I have scripted the fallback function in my own application, and with your solution I can simply handle the exponential backoff myself on top of your function. Do you need collaboration on any further implementation?
I'm just trying to keep things simple and address a general issue. In many cases students who are new to scraping will be blocked for a long time if they don't back off and instead make rapid requests (especially with a static IP). Newcomers will make thousands of requests every second; trust me, I've been there. So I'd like the library to make it clear to them that there are serious issues. By providing this option we are pretty much handing over a weapon and hoping that they don't shoot themselves. It's not a technical issue but rather an issue of making sure that people don't get into trouble.

So I'd really like it if you implemented this as an alternative to *.search (and added a huge warning in the docs about its usage!). Maybe a shell function that is responsible for maintaining state and so on: it determines whether to call the normal search or the browser-based fallback (see the sketch below).

Sound good to you?
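A hypothetical sketch of such a stateful shell function, reusing `fetch_with_browser()` from the earlier sketch (the class name, fields, and escalation policy are all my assumptions):

```python
import random
import time

import requests


class SearchSession:
    """Stateful wrapper: remembers whether we have been blocked and
    escalates from plain HTTP requests to human-paced browser automation."""

    def __init__(self):
        self.blocked = False
        self.delay = 5.0  # grows exponentially while we stay blocked

    def fetch(self, url):
        if not self.blocked:
            response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
            if "/sorry/" not in response.url:  # not the captcha page
                return response.text
            self.blocked = True  # remember the block and escalate
        # Slow path: back off, then drive a real browser.
        time.sleep(self.delay + random.uniform(0, self.delay))
        self.delay *= 2
        return fetch_with_browser(url)
```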
This is something that I'm hoping to avoid. The above-mentioned shell function would be useful in this case too: we can import these packages only when required and let users perform a regular search even if they don't have Firefox installed. So this would be something like an advanced option.
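A sketch of that lazy-import pattern, applied to the hypothetical browser helper from earlier:

```python
def fetch_with_browser(url):
    # Import Splinter only when the advanced fallback is actually used,
    # so plain searches work even without Firefox or a web driver.
    try:
        from splinter import Browser
    except ImportError as exc:
        raise RuntimeError(
            "the browser fallback requires splinter: pip install splinter"
        ) from exc
    browser = Browser("firefox", headless=True)
    try:
        browser.visit(url)
        return browser.html
    finally:
        browser.quit()
```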
Problem: by performing a lot of searches consecutively, Google detects the bot nature of the Python script and starts answering with alternative pages protected by a captcha.

Solution: when Google starts sending these alternative pages, fall back to the Splinter library and perform browser automation with human-like behavior, spacing requests with a random wait timer. This is far slower, but it lets the script continue to work.

Note: if you are interested, let me know, as I have already implemented this solution. Be aware that it bypasses Google's bot control and will most likely only work for a limited time.