Ordering by most common #90
Interesting comment! One option is to add a field ("prevalence" or "priority") that reflects this information, which could then be used to generate the regexp in the right order. WDYT?
Sure - that would be a good solution.
@plbowers have you got any hard benchmark figures showing that your method would indeed be significantly faster?
This would be a strange optimisation to make unless more than 50% of your User-Agent tests match the crawler list. For non-crawler traffic, both regex groups have to report "no match" anyway, so you would be optimising for something that occurs rarely, assuming your User-Agent traffic is 95%+ non-crawler. If you are looking to lower latency, you should look at using a language (or maybe PHP has a C extension) that lets you compile the concatenated version of the regex, which makes ordering irrelevant. Some languages cache the compiled version automatically for you (I cannot see whether PHP does too):
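For illustration, here is a minimal sketch in Go, where a regexp compiled once at package scope is reused for every check; the three-name alternation is a placeholder, not the real crawler list:

```go
package main

import (
	"fmt"
	"regexp"
)

// Compiled once at program start; every call to isCrawler reuses the
// same compiled automaton, so the compilation cost is paid only once.
// The alternation below is a placeholder for the full crawler list.
var crawlerRe = regexp.MustCompile(`Googlebot|bingbot|Baiduspider`)

func isCrawler(ua string) bool {
	return crawlerRe.MatchString(ua)
}

func main() {
	fmt.Println(isCrawler("Mozilla/5.0 (compatible; Googlebot/2.1)")) // true
	fmt.Println(isCrawler("Mozilla/5.0 (Windows NT 10.0)"))           // false
}
```

With a pre-compiled combined pattern like this, the per-request cost no longer depends on where in the alternation the matching entry sits.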
I tried multiple cases using https://godoc.org/go.kelfa.io/kelfa/pkg/crawlerflagger (it's written in Go). It exposes two ways to query the crawler-user-agents list: an "instances"-based match and a "pattern"-based match.
I tried to match the 1st entry, the 100th entry, the 200th entry, the 300th entry, the 400th entry, and a non-existent entry; these are the results:
So it seems to suggest that for the "instances"-based match (at least in Go) the order has no relevance at all, while it does matter for the "pattern"-based match (at least in Go).
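For reference, here is a standalone sketch of how one could measure the effect of entry position; this is not the crawlerflagger API, and the pattern list and user agents are illustrative:

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

func main() {
	// Illustrative pattern list; in practice these would come from
	// the crawler-user-agents list.
	patterns := []string{`Googlebot`, `bingbot`, `Baiduspider`, `YandexBot`}

	compiled := make([]*regexp.Regexp, len(patterns))
	for i, p := range patterns {
		compiled[i] = regexp.MustCompile(p)
	}

	// Time how long the lookup takes depending on where the matching
	// entry sits in the list (the last user agent matches nothing).
	for _, ua := range []string{"Googlebot/2.1", "YandexBot/3.0", "Firefox/120.0"} {
		start := time.Now()
		matched := -1
		for i, re := range compiled {
			if re.MatchString(ua) {
				matched = i
				break
			}
		}
		fmt.Printf("%-14s entry=%d took=%s\n", ua, matched, time.Since(start))
	}
}
```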
Interesting, do you also test with a single pattern concatenating all patterns with `|`?
At the moment there are 400+ regexps (one per entry) and then a switch to analyse which case matched. The reason I implemented it this way is that I'm not really sure how to identify which entry is matching otherwise. With a single concatenated pattern it would be possible to decide whether at least one pattern matches the input string, but not which one.
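One way around that limitation, sketched in Go, is to wrap each entry in a named capture group so that a single compiled regexp can still report which entry matched; the two patterns and their names are placeholders:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

func main() {
	// Placeholder entries; in practice one group per crawler pattern.
	patterns := map[string]string{
		"googlebot": `Googlebot`,
		"bingbot":   `bingbot`,
	}
	groups := make([]string, 0, len(patterns))
	for name, p := range patterns {
		groups = append(groups, fmt.Sprintf("(?P<%s>%s)", name, p))
	}
	re := regexp.MustCompile(strings.Join(groups, "|"))

	ua := "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
	match := re.FindStringSubmatch(ua)
	if match == nil {
		fmt.Println("not a bot")
		return
	}
	// SubexpNames maps submatch indexes back to group names, which
	// tells us which entry matched.
	for i, name := range re.SubexpNames() {
		if i > 0 && name != "" && match[i] != "" {
			fmt.Println("matched entry:", name)
		}
	}
}
```

One caveat: group names in Go must be valid identifiers and unique, so the real entry names would need sanitising first.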
@monperrus Since most bot user agents contain bot, crawl, or spider, we could group all of those user agents under a single bot|crawl|spider pattern; a regex like that might help. Reducing these entries to a single pattern would reduce the number of patterns to be matched.
A generic regex like that is a good idea, but you do have to be very careful not to create false positives. You can't have a token as loose as bot on its own, for example, without matching legitimate user agents. The best way to increase the performance of a regex such as this is to remove common strings from the source user agent. As you can see here... We saw a 55% speed increase doing this.
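Here is a sketch of that common-string-stripping idea in Go; the tokens stripped below are an illustrative list (not the one used in the example above) and are assumed never to appear in any crawler pattern:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Tokens that appear in almost every browser user agent but, by
// assumption, in none of the crawler patterns (illustrative list).
var commonTokens = []string{"Mozilla/5.0 ", "AppleWebKit", "(KHTML, like Gecko)", "Safari"}

// Placeholder alternation, not the real crawler list.
var crawlerRe = regexp.MustCompile(`Googlebot|bingbot|Baiduspider`)

func isCrawler(ua string) bool {
	// Stripping the common browser tokens shortens the input the
	// regex engine has to scan, which is where the speed-up comes from.
	for _, t := range commonTokens {
		ua = strings.ReplaceAll(ua, t, "")
	}
	return crawlerRe.MatchString(ua)
}

func main() {
	fmt.Println(isCrawler("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")) // true
	fmt.Println(isCrawler("Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36")) // false
}
```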
Grouping patterns is on the user side, as in @JayBizzle's example. Note that we'd be happy to merge example code snippets for grouping into the README.
Most of the time, people using this code will be hoping to identify bots as quickly as possible. Ordering the patterns by the most commonly identified bots would speed up the process, letting the check match and exit early.
I did a very quick optimization using the frequency reported on this page:
https://deviceatlas.com/blog/list-of-web-crawlers-user-agents
And then I put all your patterns (concatenated with |) into two preg_match() calls:
if (preg_match('/most|common|patterns/', $_SERVER['HTTP_USER_AGENT'])
    || preg_match('/less|common|patterns/', $_SERVER['HTTP_USER_AGENT'])) {
    // is a bot
} else {
    // isn't a bot
}
Providing a script to produce that might be helpful...?
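For example, a sketch of such a generator in Go, assuming the entries in crawler-user-agents.json carry a hypothetical "priority" field (the prevalence score discussed above, which does not exist in the file today):

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"sort"
	"strings"
)

// Entry mirrors the objects in crawler-user-agents.json. The "priority"
// field is hypothetical: it is the prevalence score proposed above.
type Entry struct {
	Pattern  string `json:"pattern"`
	Priority int    `json:"priority"`
}

func main() {
	data, err := os.ReadFile("crawler-user-agents.json")
	if err != nil {
		panic(err)
	}
	var entries []Entry
	if err := json.Unmarshal(data, &entries); err != nil {
		panic(err)
	}

	// Most prevalent patterns first.
	sort.Slice(entries, func(i, j int) bool {
		return entries[i].Priority > entries[j].Priority
	})

	patterns := make([]string, len(entries))
	for i, e := range entries {
		patterns[i] = e.Pattern
	}

	// Split into a "most common" and a "less common" alternation,
	// matching the two preg_match() calls above.
	half := len(patterns) / 2
	fmt.Println("most common:", strings.Join(patterns[:half], "|"))
	fmt.Println("less common:", strings.Join(patterns[half:], "|"))
}
```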