Newsletter being overrun by illicit repos #46

Open
cdhagmann opened this issue Nov 4, 2024 · 11 comments
@cdhagmann

Screenshot 2024-11-04 at 9 44 32 AM

The number of sketchy repos (the first 13 in the Nov 1 newsletter are illicit) has gotten to the point that Gmail automatically flags the newsletter as spam and disables all links, even after I explicitly mark it as not spam.

@cdhagmann
Author

I have noticed that most of these repos contain only a .zip file and have a lot more tags than normal (most including the word "free"). I will hopefully have time later this week to look at the code and see if my skillset is enough to offer a PR.

@alexjbarnes

I just came here to report the same thing. I really like the newsletter, but it's being completely overrun, as you say. I have zero Ruby skills (I'm a Go dev), but I think it should be pretty straightforward to filter the majority out. Most seem to contain the word "free" and the current year. We could have a regexp check for that as a start.
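
As a rough illustration of that regexp idea, here is a minimal sketch. The helper name `free_and_year?` is hypothetical and not part of the project; it only flags text that mentions both "free" as a standalone word and the given year, and it treats underscores and hyphens as separators since spam names often use them.

```ruby
# Hypothetical helper (not from the project): true when the text contains
# the standalone word "free" AND the given year as a standalone number.
def free_and_year?(text, year = Time.now.year)
  has_free = text.match?(/(?<![a-z])free(?![a-z])/i) # skips "freedom", "carefree"
  has_year = text.match?(/(?<!\d)#{year}(?!\d)/)     # skips "120245"
  has_free && has_year
end
```

Requiring both signals is narrower than blocking either word alone, which may reduce false positives, though it will still catch some legitimate repos.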

@jerodsanto
Member

It's a cat/mouse game. Here you can see a list of words we already consider malware:

def malware?
  %w(cheat ch3at 0ptions sk1n hack spoof sp00f spoofer sp00f3r aimbot godlike
     g0dlike d4rk s1d3 roblox r0blox r0bl0x crack cracked scr1pt ap3x unl0cker
     unl0ck3r h4ck m0ney 0day exploit expl0it skinchanger skin-changer swapper
     stealer keylogger miner crypto-bot cryptobot wallet autoclicker clicker).any? { |i| !!(self =~ /#{i}/i) }
end

The challenge with blocking repos that have the word "free" or the current year in the title/description: how many legit repos will you also block by doing that?

@alexjbarnes

Yes I agree, that is definitely a risk but I guess there comes a point where potentially blocking some legitimate repos may be better than having the list be 90% illegitimate repos?

@alexjbarnes

Interestingly, if you look at this morning's email, the illegitimate repos are now all 404 Not Found. I'm guessing they have all been removed by GitHub. Could we check that they still exist first, assuming the stats in BigQuery lag behind slightly?
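
A minimal sketch of such an existence check, using only Ruby's standard `net/http`. The names `repo_exists?` and `drop_removed` are hypothetical and not from the project, and the sketch assumes a removed repo answers a HEAD request with 404; it does not handle redirects, rate limits, or network errors.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: sends a HEAD request and treats anything other
# than a 404 response as "still exists".
def repo_exists?(url)
  uri = URI(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.head(uri.request_uri)
  end
  response.code != "404"
end

# Keeps only the URLs that pass the existence check. The check is
# injectable so it can be stubbed out in tests.
def drop_removed(urls, exists: method(:repo_exists?))
  urls.select { |url| exists.call(url) }
end
```

Each URL costs one extra HTTP request, so this would add to the run time of the nightly job.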

@jerodsanto
Member

> Yes I agree, that is definitely a risk but I guess there comes a point where potentially blocking some legitimate repos may be better than having the list be 90% illegitimate repos?

For sure. But that goes back to the cat/mouse game. I've played it for a while, but it never ends. This is just the latest iteration for the mouse. A few days after I block "free" or "download", the naming changes again...

> Interestingly, if you look at this morning's email, the illegitimate repos are now all 404 Not Found. I'm guessing they have all been removed by GitHub. Could we check that they still exist first, assuming the stats in BigQuery lag behind slightly?

GitHub often gets them removed by the next day, but they've rarely been removed by the time we publish. Another idea I had, which I was hoping would be more foolproof, was to identify a set of repos that are spam and a set that are not (given name, URL, and description only) and give both to an LLM, asking it to determine whether a given repo is spam/malware based on those two data sets. Unfortunately, in my testing this proved... inaccurate. (That was maybe a year ago, though, so maybe they've gotten better?)

There are other rules we could enforce, such as checking whether the repo only has one zip file, but that also requires more API calls, and this code is a bit ossified already, being a ten-year-old Ruby project.
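
For illustration, the "only one zip file" rule could look something like the sketch below once the file listing is in hand. Everything here is an assumption: `zip_only?` and `IGNORED_FILES` are hypothetical names, and the listing is taken as a plain array of file names that some earlier API call already fetched.

```ruby
# Hypothetical list of files that don't count as real content.
IGNORED_FILES = %w[readme readme.md license license.md .gitignore].freeze

# True when, apart from ignored files, the repo holds exactly one file
# and that file is a .zip archive.
def zip_only?(filenames)
  payload = filenames.reject { |name| IGNORED_FILES.include?(name.downcase) }
  payload.size == 1 && File.extname(payload.first).casecmp?(".zip")
end
```

The check itself is cheap; the cost is in fetching the listing, which is the extra API call mentioned above.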

@alexjbarnes

Ah ok yes that's fair enough, appreciate that must be frustrating.
Would you be open to PRs, or does it seem like a lost cause at this point? That would be a shame, as I've discovered lots of good stuff through the newsletter.

One thought I had: would it be possible to add a lag? Say the stats lag 24 hours behind where they are now. Then we could rely on GitHub removing the spam repos and check that each repo still exists before including it.
Happy to have a look if it could be of use.

@cdhagmann
Author

cdhagmann commented Nov 13, 2024

Could we take more of a rolling-window approach? Don't show any repo that isn't at least 3 days old, but count all three days toward making it into the new list.

EDIT: I'll learn to read some day. I like @alexjbarnes's idea and would be willing to look into implementing it.
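
A rough sketch of the delay-plus-window idea, under stated assumptions: `Candidate` and `eligible` are hypothetical names, and the per-day star counts are assumed to be available already (the thread does not say how the project stores them).

```ruby
require "date"

# Hypothetical record: a repo name, its creation date, and the stars it
# gained on each day since creation.
Candidate = Struct.new(:name, :created_on, :daily_stars)

# Keeps repos that are at least min_age_days old, ranked by the total
# stars gained across all of their days (highest first).
def eligible(candidates, today:, min_age_days: 3)
  candidates
    .select { |c| (today - c.created_on).to_i >= min_age_days }
    .sort_by { |c| -c.daily_stars.sum }
end
```

The minimum age gives GitHub time to take down spam before a repo can appear, at the cost of surfacing new projects later.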

@jerodsanto
Member

I'm not considering it a lost cause, just kinda in the dumps about it.

Will definitely accept PRs. I've considered delaying the Top Starred Repositories – First Timers and Top New Repositories lists by a day, but that kinda defeats the purpose of the email, which is to

> unearth the top new and top starred projects on GitHub before they blow up

For now I'll go ahead and block a few more keywords because it is getting ridiculous again. Specifically, I'm going to add 'free', 'download' and 'crypto' to the list of malware words. That will certainly exclude some legit repos, but it's probably a trade-off worth making at this point.

What would be totally cool is some kind of API (maybe a separate project?) that I could hit with a repo URL and it returns how likely it is to be spam/malware or not, maybe with a confidence score. I'd certainly integrate something like that...
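
To make the "confidence score" idea concrete, here is a minimal sketch of how such a service might combine signals into a single number between 0 and 1. The signal names and weights are invented for illustration, not measured or taken from the project; a real scorer would need tuning against labeled examples.

```ruby
# Hypothetical signal weights (they sum to 1.0). Purely illustrative.
SIGNAL_WEIGHTS = {
  keyword: 0.4,     # name/description matched a blocked word
  zip_only: 0.3,    # repo contains nothing but a zip file
  no_language: 0.2, # GitHub detected no programming language
  many_tags: 0.1    # unusually high number of topics/tags
}.freeze

# Adds up the weights of the signals that are present.
def spam_score(signals)
  SIGNAL_WEIGHTS.sum { |name, weight| signals[name] ? weight : 0.0 }.round(2)
end
```

A caller could then pick its own cutoff, for example dropping anything at or above some threshold, rather than relying on a single yes/no keyword match.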

@cdhagmann
Author

I'll see what I can do RE: confidence. Also, I have noticed that most of these repos do not register as having a programming language. Do you think we could have two pipelines, one where repos without a programming language are checked against a longer list of malware words?

@jerodsanto
Member

That's certainly a possibility. We already have a no_language? method on the Repo class, so the malware? method could call that first and branch from there. Currently I'm implementing malware? as a String method, but that could be moved to Repo pretty easily...
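
A sketch of what that branching might look like if `malware?` moved onto the repo object. This `Repo` struct is a stand-in, not the project's actual class: its fields, its `no_language?` body, and the shortened word lists are all assumptions for illustration.

```ruby
# Shortened, illustrative word lists (the real list is much longer).
BASE_WORDS  = %w[cheat aimbot keylogger stealer].freeze
EXTRA_WORDS = %w[free download crypto].freeze

# Stand-in for the project's Repo class; only the pieces needed here.
Repo = Struct.new(:name, :description, :language) do
  def no_language?
    language.nil? || language.empty?
  end

  # Repos with no detected language are checked against the longer list.
  def malware?
    words = no_language? ? BASE_WORDS + EXTRA_WORDS : BASE_WORDS
    text = "#{name} #{description}"
    words.any? { |word| text.match?(/#{Regexp.escape(word)}/i) }
  end
end
```

With this split, a Go project described as "a free VPN client" would pass, while a language-less repo with the same description would be blocked.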

Side note: Last night's email was pretty clean after adding those additional strings:

https://nightly.changelog.com/2024/11/14

That Solana repo is probably trash, but other than that...
