Newsletter being overrun by illicit repos #46

cdhagmann · 2024-11-04T14:48:40Z

The number of sketchy repos (the first 13 in the Nov 1 newsletter are illicit) has gotten to the point that Gmail automatically flags the newsletter as spam and disables all links, even after I explicitly mark it as not spam.

cdhagmann · 2024-11-04T14:57:19Z

I have noticed that most of these repos contain only a .zip file and have a lot more tags than normal (most including the word free). I will hopefully have time later this week to look at the code and see if my skill set is enough to offer a PR.

alexjbarnes · 2024-11-13T19:00:43Z

I've just come here to look at the same thing. I really like the newsletter, but it's being completely overrun, as you say. I have zero Ruby skills (I'm a Go dev), but I think it should be pretty straightforward to filter the majority out. Most seem to contain the word free and the current year. We could have a regexp check for that as a start.
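A minimal sketch of that regexp idea, for illustration only. The helper name `free_and_year?` is hypothetical and not part of the nightly codebase, and requiring both signals together is an assumption meant to limit false positives.

```ruby
# Hypothetical helper (not in the nightly codebase): flags text that contains
# both the standalone word "free" and the given year.
def free_and_year?(text, year: Time.now.year)
  has_free = text.match?(/\bfree\b/i)   # whole word only, so "freedom" does not match
  has_year = text.match?(/\b#{year}\b/) # e.g. "2024" as a standalone token
  has_free && has_year
end

free_and_year?("Photoshop Free Download 2024", year: 2024)
free_and_year?("freedom-toolkit 2024", year: 2024)
```

Because `\b` treats underscores as word characters, a name like `free_2024` would slip past this check, which hints at how easily such a rule can be evaded.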

jerodsanto · 2024-11-13T19:07:15Z

It's a cat/mouse game. Here you can see a list of words we already consider malware:

nightly/lib/core_ext/string.rb

Lines 43 to 48 in d1a9e73

    
    def malware?
      %w(cheat ch3at 0ptions sk1n hack spoof sp00f spoofer sp00f3r aimbot godlike
         g0dlike d4rk s1d3 roblox r0blox r0bl0x crack cracked scr1pt ap3x unl0cker
         unl0ck3r h4ck m0ney 0day exploit expl0it skinchanger skin-changer swapper
         stealer keylogger miner crypto-bot cryptobot wallet autoclicker clicker).any? { |i| !!(self =~ /#{i}/i) }
    end

The challenge with blocking repos with the words free or the current year in the title/description is how many legit repos will you also block doing that?

alexjbarnes · 2024-11-13T19:13:42Z

Yes I agree, that is definitely a risk but I guess there comes a point where potentially blocking some legitimate repos may be better than having the list be 90% illegitimate repos?

alexjbarnes · 2024-11-13T19:17:02Z

Interestingly, if you look at this morning's email, the illegitimate repos are now all 404 Not Found. I'm guessing they've all been removed by GitHub. Could we check that they still exist first, assuming the stats in BigQuery lag behind slightly?
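A rough sketch of such an existence check, assuming a plain HTTP request to the repo page is acceptable. The helper name `repo_still_exists?` and the injectable `fetcher` parameter are hypothetical; only `Net::HTTP.get_response` and the response's `code` string come from Ruby's standard library.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: treats a repo as gone only when the server answers 404.
# `fetcher` is injectable so the logic can be exercised without network access.
def repo_still_exists?(url, fetcher: ->(uri) { Net::HTTP.get_response(uri) })
  response = fetcher.call(URI(url))
  response.code != "404" # Net::HTTPResponse#code is a String such as "200"
end

# Usage with a stubbed response instead of a real request:
FakeResponse = Struct.new(:code)
repo_still_exists?("https://github.com/example/repo", fetcher: ->(_uri) { FakeResponse.new("404") })
```

Redirects and server errors count as "still exists" here, which errs on the side of keeping a repo in the list.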

jerodsanto · 2024-11-13T19:29:14Z

> Yes I agree, that is definitely a risk but I guess there comes a point where potentially blocking some legitimate repos may be better than having the list be 90% illegitimate repos?

For sure. But that goes back to the cat/mouse game. I've played it for a while, but it never ends. This is just the latest iteration for the mouse. A few days after I block free or download, the naming changes again...

> Interestingly, if you look at this morning's email, the illegitimate repos are now all 404 Not Found. I'm guessing they've all been removed by GitHub. Could we check that they still exist first, assuming the stats in BigQuery lag behind slightly?

GitHub often gets them removed by the next day, but rarely by the time we publish. Another idea I had, which I was hoping would be more foolproof, was to identify a set of repos that are spam and a set of repos that are not (given name, url, description only) and give them to an LLM, asking it to determine if a given repo is spam/malware based on those two data sets. Unfortunately, in my testing this proved... inaccurate. (That was maybe a year ago, though, so maybe they've gotten better?)

There are other rules we could enforce, such as checking whether the repo only has one zip file, but that also requires more API calls, and this code is a bit ossified already, being a ten-year-old Ruby project.
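For illustration, the single-zip rule could look something like the sketch below. It assumes the list of root-level file paths has already been fetched (the extra API call mentioned above); the helper name `only_a_zip?` and the choice to ignore README/LICENSE files are assumptions, not existing code.

```ruby
# Hypothetical helper: `paths` is assumed to be the repo's root-level file names.
# README and LICENSE files are ignored (an assumption), so a repo holding
# "README.md" plus one archive still counts as zip-only.
def only_a_zip?(paths)
  meaningful = paths.reject { |p| p.match?(/\A(readme|license)(\.\w+)?\z/i) }
  meaningful.size == 1 && File.extname(meaningful.first).casecmp?(".zip")
end

only_a_zip?(["README.md", "Loader.zip"])
only_a_zip?(["README.md", "main.go", "dist.zip"])
```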

alexjbarnes · 2024-11-13T20:10:45Z

Ah ok yes that's fair enough, appreciate that must be frustrating.
Would you be open to PRs or does it seem like a lost cause at this point? Would be a shame as I've discovered lots of good stuff through the newsletter.

One thought I had: would it be possible to add a lag? Say the stats lag 24 hours behind where they are now; then we could rely on GitHub removing the spam repos and check that the rest still exist.
Happy to have a look if it could be of use.

cdhagmann · 2024-11-13T20:18:12Z

Could we just take more of a rolling-average approach? Do not show any repo that isn't at least 3 days old, but let all three days count toward making it into the new list.

EDIT: I'll learn to read some day. I like @alexjbarnes's idea and would be willing to look into implementing it.
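As a sketch of the rolling-window suggestion above: the `Candidate` struct, its fields, and `rolling_top` are all hypothetical, and the data is assumed to be per-day star counts that the pipeline would have to collect.

```ruby
require "date"

# Hypothetical record: daily_stars holds one star count per day, oldest first.
Candidate = Struct.new(:name, :created_on, :daily_stars)

# Keeps only repos at least `min_age_days` old, then ranks them by the stars
# gained over the most recent `min_age_days` days.
def rolling_top(candidates, today:, min_age_days: 3, limit: 10)
  candidates
    .select { |c| (today - c.created_on).to_i >= min_age_days }
    .sort_by { |c| -c.daily_stars.last(min_age_days).sum }
    .first(limit)
    .map(&:name)
end

today = Date.new(2024, 11, 13)
repos = [
  Candidate.new("steady", Date.new(2024, 11, 10), [5, 10, 20]),
  Candidate.new("one-day-spike", Date.new(2024, 11, 12), [500]),
  Candidate.new("fading", Date.new(2024, 11, 9), [50, 1, 1, 1])
]
rolling_top(repos, today: today)
```

The trade-off is the one raised elsewhere in this thread: any minimum age delays legitimate new projects by the same amount.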

jerodsanto · 2024-11-14T14:53:51Z

I'm not considering it a lost cause, just kinda in the dumps about it.

Will definitely accept PRs. I've considered delaying the Top Starred Repositories – First Timers and Top New Repositories lists by a day, but that kinda defeats the purpose of the email, which is to "unearth the top new and top starred projects on GitHub before they blow up".

For now I'll go ahead and block a few more keywords because it is getting ridiculous again. Specifically, I'm going to add 'free', 'download' and 'crypto' to the list of malware words. That will certainly exclude some legit repos, but it's probably a trade-off worth making at this point.

What would be totally cool is some kind of API (maybe a separate project?) that I could hit with a repo URL and it returns how likely it is to be spam/malware or not, maybe with a confidence score. I'd certainly integrate something like that...
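One very rough shape such a scorer could take, combining the signals mentioned in this thread. Everything here is hypothetical: the signal names, the weights, and the 0.5 threshold are arbitrary placeholders, and the score is an uncalibrated heuristic rather than a real confidence value.

```ruby
# Hypothetical, hand-picked weights; a real service would need to tune or learn these.
SIGNAL_WEIGHTS = {
  keyword_hit: 0.4, # name/description matches a malware word
  only_zip:    0.3, # repo contains nothing but a zip archive
  no_language: 0.2, # GitHub detected no programming language
  many_tags:   0.1  # unusually large number of topics/tags
}.freeze

# `signals` is a hash of booleans keyed like SIGNAL_WEIGHTS; missing keys count as false.
def spam_score(signals)
  score = SIGNAL_WEIGHTS.sum { |name, weight| signals[name] ? weight : 0.0 }
  { score: score.round(2), spam: score >= 0.5 }
end

spam_score(keyword_hit: true, only_zip: true)
spam_score(no_language: true)
```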

cdhagmann · 2024-11-15T13:39:49Z

I'll see what I can do re: the confidence score. Also, I have noticed that most of these repos do not register as having a programming language. Do you think we could have two pipelines, one where repos without a programming language get a longer list of malware words?

jerodsanto · 2024-11-15T15:36:39Z

That's certainly a possibility. We already have a no_language? method on the Repo class, so the malware? method could call that first and branch from there. Currently I'm implementing malware? as a String method, but that could be moved to Repo pretty easily...
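A sketch of what that branch might look like if malware? lived on Repo. This `Repo` is a simplified stand-in, not the project's actual class: the constructor, the body of `no_language?`, and the shortened word lists are assumptions for illustration.

```ruby
# Simplified stand-in for the project's Repo class (assumed shape).
class Repo
  BASE_WORDS  = %w(cheat hack spoof aimbot crack keylogger stealer).freeze
  # Stricter words applied only when no language was detected.
  EXTRA_WORDS = %w(free download crypto).freeze

  attr_reader :name, :description, :language

  def initialize(name:, description: "", language: nil)
    @name = name
    @description = description
    @language = language
  end

  # Assumed implementation; the real method already exists on Repo.
  def no_language?
    language.nil? || language.empty?
  end

  def malware?
    words = no_language? ? BASE_WORDS + EXTRA_WORDS : BASE_WORDS
    text = "#{name} #{description}"
    words.any? { |word| text.match?(/#{Regexp.escape(word)}/i) }
  end
end

Repo.new(name: "free-vpn-2024").malware?
Repo.new(name: "free-vpn-2024", language: "Go").malware?
```

With this split, a word like free only penalizes repos that have no detected language, which should reduce false positives for ordinary projects.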

Side note: Last night's email was pretty clean after adding those additional strings:

https://nightly.changelog.com/2024/11/14

That Solana repo is probably trash, but other than that...

jerodsanto mentioned this issue Nov 14, 2024

Changelogs Nightly Email Newsletter #41

Closed

jerodsanto added the bug label Nov 14, 2024
