
Malware - Stats #1

Open
rothoma2 opened this issue Jun 16, 2024 · 6 comments
Labels
good first issue · help wanted · top-level-task

Comments

@rothoma2

It is important to have statistics on some of the commonly observed Malicious Delivery Methods and file extensions.

Requirements.

A web scraper tool that scrapes publicly disclosed information from several sources (malware sandbox sites) and aggregates it to produce statistics such as file extensions, malware families, etc.

Sources

Things to explore.

  • Amount of pages that can be scraped before rate limits or captchas kick in.
  • Parse HTML pages and extract valuable data (see the sketch below).
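
A minimal sketch of that parsing step, using requests and BeautifulSoup. The URL and the row selector are placeholders, since every sandbox site needs its own selectors, and some (like app.any.run) render results with JavaScript and need a real browser instead, as discussed further down:

import requests
from bs4 import BeautifulSoup

def fetch_rows(url):
    """Fetch one listing page and return the text of each result row.
    The 'div.result-row' selector is a placeholder; each sandbox site needs its own."""
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    resp.raise_for_status()  # an HTTP 429 here means the rate limit kicked in
    soup = BeautifulSoup(resp.text, "html.parser")
    return [row.get_text(" ", strip=True) for row in soup.select("div.result-row")]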

Example.

Collect the last 10K malicious files (for Windows) reported on each site, and aggregate them per file extension.
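
As a rough sketch of the aggregation itself, assuming we already have the list of reported file names (the sample names below are made up purely for illustration):

import os
from collections import Counter

def count_extensions(filenames):
    """Aggregate reported file names by their extension."""
    counts = Counter()
    for name in filenames:
        _, ext = os.path.splitext(name)
        counts[ext.lower() or "<no extension>"] += 1
    return counts

# Made-up sample input, just to show the output shape:
print(count_extensions(["invoice.docx", "setup.exe", "update.exe", "shortcut.lnk"]))
# Counter({'.exe': 2, '.docx': 1, '.lnk': 1})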

rothoma2 added the top-level-task, good first issue, and help wanted labels on Jun 16, 2024
@poneoneo

To avoid barriers like captchas, we should really think about a dedicated tool like the Bright Data website. Some YouTubers offer $10 of free credit to try it; you should take a look at this.

@rothoma2
Author

@poneoneo maybe we should see first whether these pages have a captcha or rate limit at all. I think the project would be severely limited if we need to depend on a pay-per-use service such as the Bright Data website.

@poneoneo

Ok @rothoma2, I will check this out. Maybe tools like Playwright or Selenium will be enough to behave like a real browser and overcome the user-agent and captcha barriers.

@Ohnoimded
Member

SeleniumBase is working fine for it with the UC (undetected-chromedriver) driver.

@rothoma2
Author

Cool, maybe someone can upload some base code and we can start extending from there.

@Ohnoimded
Member

Ohnoimded commented Jun 27, 2024

The code will bypass all checks on app.any.run but can only get to page 5, as going further is restricted by the site.
The actual scraping part for the extracted rows still needs to be implemented.

from seleniumbase import SB
import time
import random

with SB(uc=True) as sb:  # uc=True enables undetected-chromedriver mode
    print("Entering Website")
    sb.open("https://app.any.run/submissions/")  # open the public submissions feed
    sb.click("#history-filterBtn")  # open the filter panel
    sb.click("div.btn-group:nth-child(1) > button:nth-child(1)")
    time.sleep(random.randrange(0, 2))  # short randomized pause
    sb.click("div.btn-group:nth-child(1) > div:nth-child(2) > ul:nth-child(1) > li:nth-child(1) > a")
    sb.click("#historySearchBtn")  # apply the filter
    time.sleep(random.randrange(2, 3))
    for i in range(5):  # pagination beyond page 5 is restricted by the site
        time.sleep(random.randrange(1, 2))
        soup = sb.get_beautiful_soup()  # parse the current page with BeautifulSoup
        extracted_rows = soup.select("div.history-table--content__row")
        # I haven't done the bs part yet. Something like this.
        sb.click(".history-pagination__next")  # go to the next results page
