Commit 9d88509

Initial commit

srstsavage committed Oct 4, 2024
0 parents commit 9d88509
Showing 12 changed files with 345 additions and 0 deletions.
23 changes: 23 additions & 0 deletions LICENSE
@@ -0,0 +1,23 @@
The MIT License (MIT)
=====================

Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
files (the “Software”), to deal in the Software without
restriction, including without limitation the rights to use,
copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following
conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.
68 changes: 68 additions & 0 deletions README.md
@@ -0,0 +1,68 @@
# crawler-cleaner 🕷️🧹✨

Processes JSON web log input from stdin (one JSON log object per line),
filtering out any log lines whose user agents are determined to belong to
crawlers/bots/scrapers.

The project uses the
[crawler-user-agents](https://github.com/monperrus/crawler-user-agents)
project as the user agent database.

By default crawler-cleaner looks for the user agent in the top-level
`http_user_agent` field in each JSON log. This may be configured using the
`-user-agent-key` flag (the field must still be top level).
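
For example, in a log line like the following (abridged from `testdata/web.log`),
the value of `http_user_agent` is what gets matched against the crawler patterns:

```
{"http_host":"your.host.net","http_user_agent":"python-requests/2.28.1","request_uri":"/final.html","status":"200"}
```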

Detected crawler logs can be discarded (default) or written to a separate
file/stream. JSON parse errors (error messages between JSON logs, etc.)
can also be written to a separate file. The following strings have
special meaning for the output files:

* `0`, `/dev/null`, `null` - discard output
* `-`, `/dev/stdout`, `stdout` - write output to stdout
* `+`, `/dev/stderr`, `stderr` - write output to stderr
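
For instance, to print only the detected crawler lines to the terminal while
discarding everything else:

```
$ <web.log ./crawler-cleaner -non-crawler-output null -crawler-output -
```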

## Example usage

```
$ ./crawler-cleaner -help
Usage of ./crawler-cleaner:
  -crawler-output string
        File to write crawler output to (default "/dev/null")
  -error-output string
        File to write unparsable JSON input to (default "/dev/null")
  -extra-crawler-agents-file string
        File containing additional crawler user agent patterns, one per line
  -non-crawler-output string
        File to write non-crawler output to (default "/dev/stdout")
  -user-agent-key string
        JSON key for user agent (default "http_user_agent")
```

```
$ cat web.log | ./crawler-cleaner -crawler-output ./crawlers.log \
    -non-crawler-output ./legit.log -error-output errors.log
```

or

```
$ <web.log ./crawler-cleaner -crawler-output stderr > ./legit.log 2> ./crawlers.log
```

## Reviewing results

After running, it's useful to examine the user agents in both non-crawler
and crawler outputs to identify any adjustments needed. Example command
to view counts of user agents using [`jq`](https://jqlang.github.io/jq/):

```
<non-crawler.log jq '.http_user_agent' | sort | uniq -c | sort -n | less
```
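
A similar command against the crawler output can help spot user agents that were
flagged incorrectly (assuming crawler logs were written to `crawler.log`):

```
<crawler.log jq '.http_user_agent' | sort | uniq -c | sort -n | less
```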

## Adding extra crawler agents

To add crawler agents to the set obtained from
[crawler-user-agents](https://github.com/monperrus/crawler-user-agents/blob/master/crawler-user-agents.json),
you may create a text file (default `./extra-crawler-agents.txt`, or another path set
with the `-extra-crawler-agents-file` flag) and add user agent patterns, one per line.
Note that these patterns can be regular expressions, but forward slashes `/` do
not need to be escaped as they are in the `crawler-user-agents.json` file
(i.e. use `meta-externalagent/` and not `meta-externalagent\\/`).
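
For example, the patterns file used by this repository's tests
(`testdata/extra-crawler-agents.txt`) contains:

```
add-this-crawler
andThisBot As Well
```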
12 changes: 12 additions & 0 deletions agent-allowlist.txt
@@ -0,0 +1,12 @@
aiohttp
Apache-HttpClient
^curl
Go-http-client
http_get
httpx
libwww-perl
node-fetch
okhttp
python-requests
Python-urllib
[wW]get
164 changes: 164 additions & 0 deletions cleaner.go
@@ -0,0 +1,164 @@
package main

import (
	"bufio"
	"bytes"
	_ "embed"
	"encoding/json"
	"flag"
	"fmt"
	"io"
	"os"
	"regexp"
	"slices"
	"strings"

	"github.com/monperrus/crawler-user-agents"
)

var (
	userAgentKeyConfig     string
	extraCrawlerAgentsFile string
	nonCrawlerOutput       string
	crawlerOutput          string
	errorOutput            string
)

// agent-allowlist.txt contains upstream crawler patterns that should be
// treated as legitimate clients (e.g. common HTTP libraries) and skipped.
//
//go:embed agent-allowlist.txt
var agentAllowListBytes []byte

var allowedAgentOverrides = func() []string {
	var allowedAgentOverrides []string
	s := bufio.NewScanner(bytes.NewReader(agentAllowListBytes))
	s.Split(bufio.ScanLines)
	for s.Scan() {
		allowedAgent := strings.TrimSpace(s.Text())
		if len(allowedAgent) == 0 {
			continue
		}
		allowedAgentOverrides = append(allowedAgentOverrides, allowedAgent)
	}
	return allowedAgentOverrides
}()

// adapted from crawler-user-agents/validate.go
var crawlerRegexps = func() []*regexp.Regexp {
	regexps := make([]*regexp.Regexp, 0, len(agents.Crawlers))
	for _, crawler := range agents.Crawlers {
		if !slices.Contains(allowedAgentOverrides, crawler.Pattern) {
			regexps = append(regexps, regexp.MustCompile(crawler.Pattern))
		}
	}
	return regexps
}()

// adapted from crawler-user-agents/validate.go
// isCrawler reports whether the user agent string matches any of the crawler patterns.
var isCrawler = func(userAgent string) bool {
	for _, re := range crawlerRegexps {
		if re.MatchString(userAgent) {
			return true
		}
	}
	return false
}

const defaultExtraCrawlerAgentsFile = "extra-crawler-agents.txt"
const defaultUserAgentKey = "http_user_agent"

func getWriter(w string) *os.File {
	switch w {
	case "-", "/dev/stdout", "stdout":
		return os.Stdout
	case "+", "/dev/stderr", "stderr":
		return os.Stderr
	case "0", "/dev/null", "null":
		// open the null device for writing rather than wrapping an
		// existing descriptor, so writes are reliably discarded
		devNull, err := os.OpenFile(os.DevNull, os.O_WRONLY, 0644)
		if err != nil {
			panic(err)
		}
		return devNull
	default:
		// file path
		file, err := os.OpenFile(w, os.O_CREATE|os.O_TRUNC|os.O_WRONLY, 0644)
		if err != nil {
			panic(err)
		}
		return file
	}
}

func addExtraCrawlerAgents(extraAgentsReader io.Reader) {
	extraAgentsScanner := bufio.NewScanner(extraAgentsReader)
	extraAgentsScanner.Split(bufio.ScanLines)

	for extraAgentsScanner.Scan() {
		extraAgent := strings.TrimSpace(extraAgentsScanner.Text())
		if len(extraAgent) == 0 {
			continue
		}
		crawlerRegexps = append(crawlerRegexps, regexp.MustCompile(extraAgent))
	}
}

func cleanCrawlers(userAgentKey string, logReader io.Reader, nonCrawlerWriter io.Writer,
	crawlerWriter io.Writer, errorWriter io.Writer) {
	s := bufio.NewScanner(logReader)
	for s.Scan() {
		var v map[string]interface{}
		if err := json.Unmarshal(s.Bytes(), &v); err != nil {
			// json parse error
			fmt.Fprintln(errorWriter, s.Text())
			continue
		}

		// assume it's ok if the agent field is missing or isn't a string
		agent, ok := v[userAgentKey].(string)
		if ok && isCrawler(agent) {
			// crawler detected
			fmt.Fprintln(crawlerWriter, s.Text())
		} else {
			// crawler not detected
			fmt.Fprintln(nonCrawlerWriter, s.Text())
		}
	}
}

func main() {
	flag.StringVar(&extraCrawlerAgentsFile, "extra-crawler-agents-file", "", "File containing additional crawler user agent patterns, one per line")
	flag.StringVar(&userAgentKeyConfig, "user-agent-key", defaultUserAgentKey, "JSON key for user agent")
	flag.StringVar(&nonCrawlerOutput, "non-crawler-output", "/dev/stdout", "File to write non-crawler output to")
	flag.StringVar(&crawlerOutput, "crawler-output", "/dev/null", "File to write crawler output to")
	flag.StringVar(&errorOutput, "error-output", "/dev/null", "File to write unparsable JSON input to")
	flag.Parse()

	// use default extra agents file if not specified and the default file exists
	if len(extraCrawlerAgentsFile) == 0 {
		if _, err := os.Stat(defaultExtraCrawlerAgentsFile); err == nil {
			extraCrawlerAgentsFile = defaultExtraCrawlerAgentsFile
		}
	}

	// load extra agents file if set
	if len(extraCrawlerAgentsFile) > 0 {
		if _, err := os.Stat(extraCrawlerAgentsFile); err == nil {
			extraAgents, err := os.Open(extraCrawlerAgentsFile)
			if err != nil {
				fmt.Fprintln(os.Stderr, err)
				os.Exit(1)
			}
			defer extraAgents.Close()

			addExtraCrawlerAgents(extraAgents)
		} else {
			fmt.Fprintln(os.Stderr, "Error loading extra agents file", extraCrawlerAgentsFile, err)
		}
	}

	nonCrawlerWriter := getWriter(nonCrawlerOutput)
	defer nonCrawlerWriter.Close()
	crawlerWriter := getWriter(crawlerOutput)
	defer crawlerWriter.Close()
	errorWriter := getWriter(errorOutput)
	defer errorWriter.Close()

	cleanCrawlers(userAgentKeyConfig, os.Stdin, nonCrawlerWriter, crawlerWriter, errorWriter)
}
53 changes: 53 additions & 0 deletions cleaner_test.go
@@ -0,0 +1,53 @@
package main

/*
To generate test data:
<testdata/web.log go run cleaner.go \
-crawler-output ./testdata/expected-crawler.log \
-non-crawler-output ./testdata/expected-non-crawler.log \
-error-output ./testdata/expected-error.log \
-extra-crawler-agents-file testdata/extra-crawler-agents.txt
*/

import (
	"bytes"
	"io"
	"os"
	"testing"
)

func readFile(path string) []byte {
	data, err := os.ReadFile(path)
	if err != nil {
		panic(err)
	}
	return data
}

var extraCrawlerAgentsBytes []byte = readFile("testdata/extra-crawler-agents.txt")
var rawLogBytes []byte = readFile("testdata/web.log")
var expectedNonCrawlerBytes []byte = readFile("testdata/expected-non-crawler.log")
var expectedCrawlerBytes []byte = readFile("testdata/expected-crawler.log")
var expectedErrorBytes []byte = readFile("testdata/expected-error.log")

func TestCleaner(t *testing.T) {
	var nonCrawlerBytes, crawlerBytes, errorBytes bytes.Buffer

	addExtraCrawlerAgents(bytes.NewReader(extraCrawlerAgentsBytes))

	cleanCrawlers(defaultUserAgentKey, bytes.NewReader(rawLogBytes),
		io.Writer(&nonCrawlerBytes), io.Writer(&crawlerBytes), io.Writer(&errorBytes))

	if !bytes.Equal(expectedNonCrawlerBytes, nonCrawlerBytes.Bytes()) {
		t.Fatal("Non crawler logs did not match expected")
	}

	if !bytes.Equal(expectedCrawlerBytes, crawlerBytes.Bytes()) {
		t.Fatal("Crawler logs did not match expected")
	}

	if !bytes.Equal(expectedErrorBytes, errorBytes.Bytes()) {
		t.Fatal("Error logs did not match expected")
	}
}
5 changes: 5 additions & 0 deletions go.mod
@@ -0,0 +1,5 @@
module github.com/axiom-data-science/crawler-cleaner

go 1.21

require github.com/monperrus/crawler-user-agents v0.0.0-20240925083149-6c0133b66cc2
2 changes: 2 additions & 0 deletions go.sum
@@ -0,0 +1,2 @@
github.com/monperrus/crawler-user-agents v0.0.0-20240925083149-6c0133b66cc2 h1:+A6DL2F8K/8xq7YOXZVAbzXjP6GXQAAJfjGdpi0Zq0I=
github.com/monperrus/crawler-user-agents v0.0.0-20240925083149-6c0133b66cc2/go.mod h1:GfRyKbsbxSrRxTPYnVi4U/0stQd6BcFCxDy6i6IxQ0M=
4 changes: 4 additions & 0 deletions testdata/expected-crawler.log
@@ -0,0 +1,4 @@
{"body_bytes_sent":1169,"content_type":"-","http_host":"some.host.com","http_referer":"-","http_user_agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)","scheme":"https","request_uri":"/path/to.html","remote_addr":"1.2.3.4","request_method":"GET","request_time":0.002,"sent_http_content_type":"application/json","status":"200","time_iso8601":"2024-09-30T23:59:59+00:00"}
{"body_bytes_sent":1059,"content_type":"-","http_host":"other.host.com","http_referer":"-","http_user_agent":"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36","scheme":"https","request_uri":"/yet/another/path.html","remote_addr":"2.4.6.8","request_method":"GET","request_time":0.091,"sent_http_content_type":"text/html;charset=UTF-8","status":"200","time_iso8601":"2024-09-30T23:59:58+00:00"}
{"body_bytes_sent":1059,"content_type":"-","http_host":"other.host.com","http_referer":"-","http_user_agent":"add-this-crawler to be detected","scheme":"https","request_uri":"/yet/another/path.html","remote_addr":"2.4.6.8","request_method":"GET","request_time":0.091,"sent_http_content_type":"text/html;charset=UTF-8","status":"200","time_iso8601":"2024-09-30T23:59:58+00:00"}
{"body_bytes_sent":1059,"content_type":"-","http_host":"other.host.com","http_referer":"-","http_user_agent":"Yes, andThisBot As Well","scheme":"https","request_uri":"/yet/another/path.html","remote_addr":"2.4.6.8","request_method":"GET","request_time":0.091,"sent_http_content_type":"text/html;charset=UTF-8","status":"200","time_iso8601":"2024-09-30T23:59:58+00:00"}
1 change: 1 addition & 0 deletions testdata/expected-error.log
@@ -0,0 +1 @@
ERROR THIS IS A GARBAGE LOG LINE (NON-JSON)
3 changes: 3 additions & 0 deletions testdata/expected-non-crawler.log
@@ -0,0 +1,3 @@
{"body_bytes_sent":1176,"content_type":"-","http_host":"some.host.com","http_referer":"-","http_user_agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5","scheme":"https","request_uri":"/other/path.html","remote_addr":"5.6.7.8","request_method":"GET","request_time":0.002,"sent_http_content_type":"text/html","status":"200","time_iso8601":"2024-09-30T23:59:58+00:00"}
{"body_bytes_sent":1176,"content_type":"-","http_host":"hosting.host.com","http_referer":"-","http_user_agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36","scheme":"https","request_uri":"/pathy/path.html","remote_addr":"1.2.5.6","request_method":"GET","request_time":0.002,"sent_http_content_type":"text/html","status":"200","time_iso8601":"2024-09-30T23:59:57+00:00"}
{"body_bytes_sent":1176,"content_type":"-","http_host":"your.host.net","http_referer":"-","http_user_agent":"python-requests/2.28.1","scheme":"https","request_uri":"/final.html","remote_addr":"1.3.5.7","request_method":"GET","request_time":0.002,"sent_http_content_type":"application/json","status":"200","time_iso8601":"2024-09-30T23:59:56+00:00"}
2 changes: 2 additions & 0 deletions testdata/extra-crawler-agents.txt
@@ -0,0 +1,2 @@
add-this-crawler
andThisBot As Well
8 changes: 8 additions & 0 deletions testdata/web.log
@@ -0,0 +1,8 @@
{"body_bytes_sent":1169,"content_type":"-","http_host":"some.host.com","http_referer":"-","http_user_agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)","scheme":"https","request_uri":"/path/to.html","remote_addr":"1.2.3.4","request_method":"GET","request_time":0.002,"sent_http_content_type":"application/json","status":"200","time_iso8601":"2024-09-30T23:59:59+00:00"}
{"body_bytes_sent":1176,"content_type":"-","http_host":"some.host.com","http_referer":"-","http_user_agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5","scheme":"https","request_uri":"/other/path.html","remote_addr":"5.6.7.8","request_method":"GET","request_time":0.002,"sent_http_content_type":"text/html","status":"200","time_iso8601":"2024-09-30T23:59:58+00:00"}
{"body_bytes_sent":1059,"content_type":"-","http_host":"other.host.com","http_referer":"-","http_user_agent":"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36","scheme":"https","request_uri":"/yet/another/path.html","remote_addr":"2.4.6.8","request_method":"GET","request_time":0.091,"sent_http_content_type":"text/html;charset=UTF-8","status":"200","time_iso8601":"2024-09-30T23:59:58+00:00"}
{"body_bytes_sent":1059,"content_type":"-","http_host":"other.host.com","http_referer":"-","http_user_agent":"add-this-crawler to be detected","scheme":"https","request_uri":"/yet/another/path.html","remote_addr":"2.4.6.8","request_method":"GET","request_time":0.091,"sent_http_content_type":"text/html;charset=UTF-8","status":"200","time_iso8601":"2024-09-30T23:59:58+00:00"}
ERROR THIS IS A GARBAGE LOG LINE (NON-JSON)
{"body_bytes_sent":1059,"content_type":"-","http_host":"other.host.com","http_referer":"-","http_user_agent":"Yes, andThisBot As Well","scheme":"https","request_uri":"/yet/another/path.html","remote_addr":"2.4.6.8","request_method":"GET","request_time":0.091,"sent_http_content_type":"text/html;charset=UTF-8","status":"200","time_iso8601":"2024-09-30T23:59:58+00:00"}
{"body_bytes_sent":1176,"content_type":"-","http_host":"hosting.host.com","http_referer":"-","http_user_agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36","scheme":"https","request_uri":"/pathy/path.html","remote_addr":"1.2.5.6","request_method":"GET","request_time":0.002,"sent_http_content_type":"text/html","status":"200","time_iso8601":"2024-09-30T23:59:57+00:00"}
{"body_bytes_sent":1176,"content_type":"-","http_host":"your.host.net","http_referer":"-","http_user_agent":"python-requests/2.28.1","scheme":"https","request_uri":"/final.html","remote_addr":"1.3.5.7","request_method":"GET","request_time":0.002,"sent_http_content_type":"application/json","status":"200","time_iso8601":"2024-09-30T23:59:56+00:00"}
