Commit 9d88509 (0 parents). Showing 12 changed files with 345 additions and 0 deletions.

@@ -0,0 +1,23 @@
The MIT License (MIT)
=====================

Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
files (the “Software”), to deal in the Software without
restriction, including without limitation the rights to use,
copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following
conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.

@@ -0,0 +1,68 @@
# crawler-cleaner 🕷️🧹✨

Processes JSON web log input from stdin (one JSON log object per line),
removing any entries whose user agent is identified as a crawler, bot,
or scraper.

The project uses the
[crawler-user-agents](https://github.com/monperrus/crawler-user-agents)
project as its user agent database.

By default, crawler-cleaner looks for the user agent in the top-level
`http_user_agent` field of each JSON log. This may be configured using the
`-user-agent-key` flag (but the key must be top level).

Detected crawler logs can be discarded (the default) or written to a separate
file/stream. JSON parse errors (e.g. error messages interleaved between JSON
logs) can also be written to a separate file. The following strings have
special meaning for the output files:

* `0`, `/dev/null`, `null` - discard output
* `-`, `/dev/stdout`, `stdout` - write output to stdout
* `+`, `/dev/stderr`, `stderr` - write output to stderr

## Example usage

```
$ ./crawler-cleaner -help
Usage of ./crawler-cleaner:
  -crawler-output string
        File to write crawler output to (default "/dev/null")
  -error-output string
        File to write unparsable json input to (default "/dev/null")
  -extra-crawler-agents-file string
        File containing additional crawler user agent patterns, one per line
  -non-crawler-output string
        File to write non-crawler output to (default "/dev/stdout")
  -user-agent-key string
        Json key for user agent (default "http_user_agent")
```

```
$ cat web.log | ./crawler-cleaner -crawler-output ./crawlers.log \
    -non-crawler-output ./legit.log -error-output errors.log
```

or

```
$ <web.log ./crawler-cleaner -crawler-output stderr > ./legit.log 2> ./crawlers.log
```

## Reviewing results

After running, it's useful to examine the user agents in both the non-crawler
and crawler outputs to identify any adjustments needed. Example command
to view counts of user agents using [`jq`](https://jqlang.github.io/jq/):

```
<non-crawler.log jq '.http_user_agent' | sort | uniq -c | sort -n | less
```

## Adding extra crawler agents

To add crawler agents to the set obtained from
[crawler-user-agents](https://github.com/monperrus/crawler-user-agents/blob/master/crawler-user-agents.json),
you may create a text file (default `./extra-crawler-agents.txt`) and add user agent patterns,
one per line. Note that these patterns can be regular expressions, but forward slashes `/` do
not need to be escaped as they are in the `crawler-user-agents.json` file
(i.e. use `meta-externalagent/` and not `meta-externalagent\\/`).

@@ -0,0 +1,12 @@
aiohttp
Apache-HttpClient
^curl
Go-http-client
http_get
httpx
libwww-perl
node-fetch
okhttp
python-requests
Python-urllib
[wW]get

@@ -0,0 +1,164 @@
package main

import (
	"bufio"
	"bytes"
	_ "embed"
	"encoding/json"
	"flag"
	"fmt"
	"io"
	"os"
	"regexp"
	"slices"
	"strings"

	"github.com/monperrus/crawler-user-agents"
)

var (
	userAgentKeyConfig     string
	extraCrawlerAgentsFile string
	nonCrawlerOutput       string
	crawlerOutput          string
	errorOutput            string
)

//go:embed agent-allowlist.txt
var agentAllowListBytes []byte

var allowedAgentOverrides = func() []string {
	var allowedAgentOverrides []string
	s := bufio.NewScanner(bytes.NewReader(agentAllowListBytes))
	s.Split(bufio.ScanLines)
	for s.Scan() {
		allowedAgent := strings.TrimSpace(s.Text())
		if len(allowedAgent) == 0 {
			continue
		}
		allowedAgentOverrides = append(allowedAgentOverrides, allowedAgent)
	}
	return allowedAgentOverrides
}()

// adapted from crawler-user-agents/validate.go
var crawlerRegexps = func() []*regexp.Regexp {
	regexps := make([]*regexp.Regexp, 0, len(agents.Crawlers))
	for _, crawler := range agents.Crawlers {
		if !slices.Contains(allowedAgentOverrides, crawler.Pattern) {
			regexps = append(regexps, regexp.MustCompile(crawler.Pattern))
		}
	}
	return regexps
}()

// adapted from crawler-user-agents/validate.go
// Returns whether the user agent string matches any of the crawler patterns.
var isCrawler = func(userAgent string) bool {
	for _, re := range crawlerRegexps {
		if re.MatchString(userAgent) {
			return true
		}
	}
	return false
}

const defaultExtraCrawlerAgentsFile = "extra-crawler-agents.txt"
const defaultUserAgentKey = "http_user_agent"

func getWriter(w string) *os.File {
	switch w {
	case "-", "/dev/stdout", "stdout":
		// stdout
		return os.Stdout
	case "+", "/dev/stderr", "stderr":
		// stderr
		return os.Stderr
	case "0", "/dev/null", "null":
		// devnull: open os.DevNull for writing rather than wrapping fd 0 (stdin)
		devNull, err := os.OpenFile(os.DevNull, os.O_WRONLY, 0644)
		if err != nil {
			panic(err)
		}
		return devNull
	default:
		// file path
		file, err := os.OpenFile(w, os.O_CREATE|os.O_TRUNC|os.O_WRONLY, 0644)
		if err != nil {
			panic(err)
		}
		return file
	}
}

func addExtraCrawlerAgents(extraAgentsReader io.Reader) {
	extraAgentsScanner := bufio.NewScanner(extraAgentsReader)
	extraAgentsScanner.Split(bufio.ScanLines)

	for extraAgentsScanner.Scan() {
		extraAgent := strings.TrimSpace(extraAgentsScanner.Text())
		if len(extraAgent) == 0 {
			continue
		}
		crawlerRegexps = append(crawlerRegexps, regexp.MustCompile(extraAgent))
	}
}

func cleanCrawlers(userAgentKey string, logReader io.Reader, nonCrawlerWriter io.Writer,
	crawlerWriter io.Writer, errorWriter io.Writer) {
	s := bufio.NewScanner(logReader)
	for s.Scan() {
		var v map[string]interface{}
		if err := json.Unmarshal(s.Bytes(), &v); err != nil {
			// json parse error
			fmt.Fprintln(errorWriter, s.Text())
			continue
		}

		// comma-ok assertion: assume it's not a crawler if the agent
		// field is missing or isn't a string (avoids a panic)
		agent, ok := v[userAgentKey].(string)
		if ok && isCrawler(agent) {
			// crawler detected
			fmt.Fprintln(crawlerWriter, s.Text())
		} else {
			// crawler not detected
			fmt.Fprintln(nonCrawlerWriter, s.Text())
		}
	}
}

func main() {
	flag.StringVar(&extraCrawlerAgentsFile, "extra-crawler-agents-file", "", "File containing additional crawler user agent patterns, one per line")
	flag.StringVar(&userAgentKeyConfig, "user-agent-key", defaultUserAgentKey, "Json key for user agent")
	flag.StringVar(&nonCrawlerOutput, "non-crawler-output", "/dev/stdout", "File to write non-crawler output to")
	flag.StringVar(&crawlerOutput, "crawler-output", "/dev/null", "File to write crawler output to")
	flag.StringVar(&errorOutput, "error-output", "/dev/null", "File to write unparsable json input to")
	flag.Parse()

	// use default extra agents file if not specified and the default file exists
	if len(extraCrawlerAgentsFile) == 0 {
		if _, err := os.Stat(defaultExtraCrawlerAgentsFile); err == nil {
			extraCrawlerAgentsFile = defaultExtraCrawlerAgentsFile
		}
	}

	// load extra agents file if set
	if len(extraCrawlerAgentsFile) > 0 {
		if _, err := os.Stat(extraCrawlerAgentsFile); err == nil {
			extraAgents, err := os.Open(extraCrawlerAgentsFile)
			if err != nil {
				// report on stderr so errors don't mix into the data stream
				fmt.Fprintln(os.Stderr, err)
			} else {
				defer extraAgents.Close()
				addExtraCrawlerAgents(extraAgents)
			}
		} else {
			fmt.Fprintln(os.Stderr, "Error loading extra agents file", extraCrawlerAgentsFile, err)
		}
	}

	nonCrawlerWriter := getWriter(nonCrawlerOutput)
	defer nonCrawlerWriter.Close()
	crawlerWriter := getWriter(crawlerOutput)
	defer crawlerWriter.Close()
	errorWriter := getWriter(errorOutput)
	defer errorWriter.Close()

	cleanCrawlers(userAgentKeyConfig, os.Stdin, nonCrawlerWriter, crawlerWriter, errorWriter)
}

@@ -0,0 +1,53 @@
package main

/*
To generate test data:
<testdata/web.log go run cleaner.go \
  -crawler-output ./testdata/expected-crawler.log \
  -non-crawler-output ./testdata/expected-non-crawler.log \
  -error-output ./testdata/expected-error.log \
  -extra-crawler-agents-file testdata/extra-crawler-agents.txt
*/

import (
	"bytes"
	"io"
	"os"
	"testing"
)

func readFile(path string) []byte {
	data, err := os.ReadFile(path)
	if err != nil {
		panic(err)
	}
	return data
}

var extraCrawlerAgentsBytes []byte = readFile("testdata/extra-crawler-agents.txt")
var rawLogBytes []byte = readFile("testdata/web.log")
var expectedNonCrawlerBytes []byte = readFile("testdata/expected-non-crawler.log")
var expectedCrawlerBytes []byte = readFile("testdata/expected-crawler.log")
var expectedErrorBytes []byte = readFile("testdata/expected-error.log")

func TestCleaner(t *testing.T) {
	var nonCrawlerBytes, crawlerBytes, errorBytes bytes.Buffer

	addExtraCrawlerAgents(bytes.NewReader(extraCrawlerAgentsBytes))

	cleanCrawlers(defaultUserAgentKey, bytes.NewReader(rawLogBytes),
		io.Writer(&nonCrawlerBytes), io.Writer(&crawlerBytes), io.Writer(&errorBytes))

	if !bytes.Equal(expectedNonCrawlerBytes, nonCrawlerBytes.Bytes()) {
		t.Fatal("Non crawler logs did not match expected")
	}

	if !bytes.Equal(expectedCrawlerBytes, crawlerBytes.Bytes()) {
		t.Fatal("Crawler logs did not match expected")
	}

	if !bytes.Equal(expectedErrorBytes, errorBytes.Bytes()) {
		t.Fatal("Error logs did not match expected")
	}
}

@@ -0,0 +1,5 @@
module github.com/axiom-data-science/crawler-cleaner

go 1.21

require github.com/monperrus/crawler-user-agents v0.0.0-20240925083149-6c0133b66cc2

@@ -0,0 +1,2 @@
github.com/monperrus/crawler-user-agents v0.0.0-20240925083149-6c0133b66cc2 h1:+A6DL2F8K/8xq7YOXZVAbzXjP6GXQAAJfjGdpi0Zq0I=
github.com/monperrus/crawler-user-agents v0.0.0-20240925083149-6c0133b66cc2/go.mod h1:GfRyKbsbxSrRxTPYnVi4U/0stQd6BcFCxDy6i6IxQ0M=

@@ -0,0 +1,4 @@
{"body_bytes_sent":1169,"content_type":"-","http_host":"some.host.com","http_referer":"-","http_user_agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)","scheme":"https","request_uri":"/path/to.html","remote_addr":"1.2.3.4","request_method":"GET","request_time":0.002,"sent_http_content_type":"application/json","status":"200","time_iso8601":"2024-09-30T23:59:59+00:00"} | ||
{"body_bytes_sent":1059,"content_type":"-","http_host":"other.host.com","http_referer":"-","http_user_agent":"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36","scheme":"https","request_uri":"/yet/another/path.html","remote_addr":"2.4.6.8","request_method":"GET","request_time":0.091,"sent_http_content_type":"text/html;charset=UTF-8","status":"200","time_iso8601":"2024-09-30T23:59:58+00:00"} | ||
{"body_bytes_sent":1059,"content_type":"-","http_host":"other.host.com","http_referer":"-","http_user_agent":"add-this-crawler to be detected","scheme":"https","request_uri":"/yet/another/path.html","remote_addr":"2.4.6.8","request_method":"GET","request_time":0.091,"sent_http_content_type":"text/html;charset=UTF-8","status":"200","time_iso8601":"2024-09-30T23:59:58+00:00"} | ||
{"body_bytes_sent":1059,"content_type":"-","http_host":"other.host.com","http_referer":"-","http_user_agent":"Yes, andThisBot As Well","scheme":"https","request_uri":"/yet/another/path.html","remote_addr":"2.4.6.8","request_method":"GET","request_time":0.091,"sent_http_content_type":"text/html;charset=UTF-8","status":"200","time_iso8601":"2024-09-30T23:59:58+00:00"} |

@@ -0,0 +1 @@
ERROR THIS IS A GARBAGE LOG LINE (NON-JSON)

@@ -0,0 +1,3 @@
{"body_bytes_sent":1176,"content_type":"-","http_host":"some.host.com","http_referer":"-","http_user_agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5","scheme":"https","request_uri":"/other/path.html","remote_addr":"5.6.7.8","request_method":"GET","request_time":0.002,"sent_http_content_type":"text/html","status":"200","time_iso8601":"2024-09-30T23:59:58+00:00"} | ||
{"body_bytes_sent":1176,"content_type":"-","http_host":"hosting.host.com","http_referer":"-","http_user_agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36","scheme":"https","request_uri":"/pathy/path.html","remote_addr":"1.2.5.6","request_method":"GET","request_time":0.002,"sent_http_content_type":"text/html","status":"200","time_iso8601":"2024-09-30T23:59:57+00:00"} | ||
{"body_bytes_sent":1176,"content_type":"-","http_host":"your.host.net","http_referer":"-","http_user_agent":"python-requests/2.28.1","scheme":"https","request_uri":"/final.html","remote_addr":"1.3.5.7","request_method":"GET","request_time":0.002,"sent_http_content_type":"application/json","status":"200","time_iso8601":"2024-09-30T23:59:56+00:00"} |

@@ -0,0 +1,2 @@
add-this-crawler
andThisBot As Well

@@ -0,0 +1,8 @@
{"body_bytes_sent":1169,"content_type":"-","http_host":"some.host.com","http_referer":"-","http_user_agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)","scheme":"https","request_uri":"/path/to.html","remote_addr":"1.2.3.4","request_method":"GET","request_time":0.002,"sent_http_content_type":"application/json","status":"200","time_iso8601":"2024-09-30T23:59:59+00:00"} | ||
{"body_bytes_sent":1176,"content_type":"-","http_host":"some.host.com","http_referer":"-","http_user_agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5","scheme":"https","request_uri":"/other/path.html","remote_addr":"5.6.7.8","request_method":"GET","request_time":0.002,"sent_http_content_type":"text/html","status":"200","time_iso8601":"2024-09-30T23:59:58+00:00"} | ||
{"body_bytes_sent":1059,"content_type":"-","http_host":"other.host.com","http_referer":"-","http_user_agent":"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36","scheme":"https","request_uri":"/yet/another/path.html","remote_addr":"2.4.6.8","request_method":"GET","request_time":0.091,"sent_http_content_type":"text/html;charset=UTF-8","status":"200","time_iso8601":"2024-09-30T23:59:58+00:00"} | ||
{"body_bytes_sent":1059,"content_type":"-","http_host":"other.host.com","http_referer":"-","http_user_agent":"add-this-crawler to be detected","scheme":"https","request_uri":"/yet/another/path.html","remote_addr":"2.4.6.8","request_method":"GET","request_time":0.091,"sent_http_content_type":"text/html;charset=UTF-8","status":"200","time_iso8601":"2024-09-30T23:59:58+00:00"} | ||
ERROR THIS IS A GARBAGE LOG LINE (NON-JSON) | ||
{"body_bytes_sent":1059,"content_type":"-","http_host":"other.host.com","http_referer":"-","http_user_agent":"Yes, andThisBot As Well","scheme":"https","request_uri":"/yet/another/path.html","remote_addr":"2.4.6.8","request_method":"GET","request_time":0.091,"sent_http_content_type":"text/html;charset=UTF-8","status":"200","time_iso8601":"2024-09-30T23:59:58+00:00"} | ||
{"body_bytes_sent":1176,"content_type":"-","http_host":"hosting.host.com","http_referer":"-","http_user_agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36","scheme":"https","request_uri":"/pathy/path.html","remote_addr":"1.2.5.6","request_method":"GET","request_time":0.002,"sent_http_content_type":"text/html","status":"200","time_iso8601":"2024-09-30T23:59:57+00:00"} | ||
{"body_bytes_sent":1176,"content_type":"-","http_host":"your.host.net","http_referer":"-","http_user_agent":"python-requests/2.28.1","scheme":"https","request_uri":"/final.html","remote_addr":"1.3.5.7","request_method":"GET","request_time":0.002,"sent_http_content_type":"application/json","status":"200","time_iso8601":"2024-09-30T23:59:56+00:00"} |