Newsletter being overrun by illicit repos #46
Comments
I have noticed that most of these repos only have a |
I've just come here to look at the same. I really like the newsletter, but it's being completely overrun, as you say. I have zero Ruby skills (I'm a Go dev), but I think it should be pretty straightforward to filter the majority out. Most seem to contain the word "free" and the current year. We could just have a regexp checking for that as a start. |
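A minimal sketch of that regexp idea, assuming the check runs against a repo's name or description as a plain string. The helper name `free_and_year?` is hypothetical and not part of the nightly codebase. It uses letter and digit lookarounds rather than `\b` so that underscore-separated names like `tool_free_2024` still match while "freedom" does not.

```ruby
# Hypothetical helper: flags text that contains both the standalone word
# "free" and the given year (defaults to the current year).
def free_and_year?(text, year: Time.now.year)
  has_free = text.match?(/(?<![a-z])free(?![a-z])/i) # "free", not "freedom"
  has_year = text.match?(/(?<!\d)#{year}(?!\d)/)     # "2024", not "120245"
  has_free && has_year
end

free_and_year?("Photoshop_FREE_download_2024", year: 2024)  # => true
free_and_year?("freedom-2024", year: 2024)                  # => false
free_and_year?("A free and open source router", year: 2024) # => false
```

Requiring both signals keeps ordinary "free and open source" descriptions out of the filter, at the cost of missing spam that omits the year.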
It's a cat/mouse game. Here you can see a list of words we already consider malware: nightly/lib/core_ext/string.rb Lines 43 to 48 in d1a9e73
The challenge with blocking repos with the words |
Yes I agree, that is definitely a risk but I guess there comes a point where potentially blocking some legitimate repos may be better than having the list be 90% illegitimate repos? |
Interestingly, if you look at this morning's email, the illegitimate repos are now all 404 Not Found. I'm guessing they've all been removed by GitHub. Could we possibly check that they still exist first, assuming the stats in BigQuery lag behind slightly? |
For sure. But that goes back to the cat/mouse game. I've played it for a while, but it never ends. This is just the latest iteration for the mouse. A few days after I block
GitHub often gets them removed by the next day, but it's rarely the case by the time we publish. Another idea I had, which I was hoping would be more foolproof, was to identify a set of repos that are spam and a set of repos that are not (given name, URL, and description only) and give them to an LLM, asking it to determine whether a given repo is spam/malware based on those two data sets. Unfortunately, in my testing this proved... inaccurate. (That was maybe a year ago, though, so maybe they've gotten better?) There are other rules we could enforce, such as flagging a repo that only has one zip file, but that also requires more API calls, and this code is a bit ossified already, being a ten-year-old Ruby project. |
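For illustration, the "only one zip file" rule could look roughly like this once a repo's top-level file names have been fetched (that extra API call is not shown). The helper name and the choice to ignore README and LICENSE files are assumptions, not anything the project does today.

```ruby
# Hypothetical rule: true when, apart from README/LICENSE files,
# the repo's top level contains exactly one file and it is a .zip.
def only_a_zip?(file_names)
  payload = file_names.reject { |name| name.match?(/\A(readme|license)/i) }
  payload.size == 1 && File.extname(payload.first).downcase == ".zip"
end

only_a_zip?(["README.md", "Installer.ZIP"])         # => true
only_a_zip?(["README.md", "main.go", "assets.zip"]) # => false
```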
Ah OK, yes, that's fair enough. I appreciate that must be frustrating. One thought I had: would it be possible to add a lag? Say the stats lag 24 hours behind where they are now. Then we could rely on GitHub removing the repos and check that they still exist before publishing. |
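A rough sketch of the "check they still exist" half of that idea, assuming each candidate is available as a plain `https://github.com/owner/repo` URL. `still_exists?`, `drop_removed`, and `DEFAULT_FETCH` are hypothetical names. The status lookup is injectable so the filter can be tried without network access.

```ruby
require "net/http"
require "uri"

# Default lookup: HEAD request against the repo page, returning the status code.
DEFAULT_FETCH = lambda do |url|
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.head(uri.request_uri).code.to_i
  end
end

# 2xx and 3xx count as "still there"; 3xx is included on the assumption that
# a redirect means the repo moved rather than disappeared. A 404 means gone.
def still_exists?(url, fetch_status: DEFAULT_FETCH)
  (200..399).cover?(fetch_status.call(url))
end

# Keeps only the URLs whose repos still resolve.
def drop_removed(repo_urls, fetch_status: DEFAULT_FETCH)
  repo_urls.select { |url| still_exists?(url, fetch_status: fetch_status) }
end
```

Run against yesterday's candidate list instead of today's, this would give GitHub roughly a day to take down reported repos before the newsletter goes out. It costs one extra request per repo.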
Could we just have more of a rolling-average approach? Do not show any repo that isn't at least 3 days old, but let all three days count toward making it into the new list. EDIT: I'll learn to read some day. I like @alexjbarnes' idea and would be willing to look into implementing it. |
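One way to picture the rolling approach, as a sketch only. It assumes per-repo daily star counts for the last three days and a creation date are already available. The data shapes and the name `rolling_ranking` are made up for illustration.

```ruby
require "date"

MIN_AGE_DAYS = 3

# daily_stars: { "owner/repo" => [stars_day1, stars_day2, stars_day3] } (assumed shape)
# created_on:  { "owner/repo" => Date }                                 (assumed shape)
# Returns [repo, total_stars] pairs, highest total first, skipping repos
# younger than MIN_AGE_DAYS.
def rolling_ranking(daily_stars, created_on, today:)
  daily_stars
    .select { |repo, _| (today - created_on.fetch(repo)).to_i >= MIN_AGE_DAYS }
    .map { |repo, counts| [repo, counts.sum] }
    .sort_by { |_, total| -total }
end

today = Date.new(2024, 11, 14)
stars = { "a/old" => [5, 10, 5], "b/new" => [500, 0, 0], "c/older" => [30, 1, 1] }
created = {
  "a/old"   => Date.new(2024, 1, 1),
  "b/new"   => Date.new(2024, 11, 13),
  "c/older" => Date.new(2024, 11, 11)
}
rolling_ranking(stars, created, today: today)
# => [["c/older", 32], ["a/old", 20]]
```

Here `b/new` has the biggest spike but is only a day old, so it is held back until it has survived three days.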
I'm not considering it a lost cause, just kinda in the dumps about it. Will definitely accept PRs. I've considered delaying the
For now I'll go ahead and block a few more keywords because it is getting ridiculous again. Specifically, I'm going to add 'free', 'download' and 'crypto' to the list of malware words. That will certainly exclude some legit repos, but it's probably a trade-off worth making at this point. What would be totally cool is some kind of API (maybe a separate project?) that I could hit with a repo URL and it returns how likely it is to be spam/malware or not, maybe with a confidence score. I'd certainly integrate something like that... |
I'll see what I can do RE: confidence. Also, I have noticed that most of these repos do not register as having a programming language. Do you think we could have two pipelines, one where repos without a programming language are checked against a longer list of malware words? |
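A sketch of how the two-pipeline idea might be wired up. The word lists are illustrative: `EXTRA_WORDS` uses the three words named earlier in this thread, while `BASE_WORDS` holds placeholder examples and is not the project's actual list in `lib/core_ext/string.rb`. The `Repo` struct and `suspicious?` are hypothetical names.

```ruby
# Placeholder examples only; not the real list from the project.
BASE_WORDS  = %w[crack keygen].freeze
# Broader words that also hit legit repos, so only applied when no language is detected.
EXTRA_WORDS = %w[free download crypto].freeze

Repo = Struct.new(:name, :description, :language, keyword_init: true)

# Repos with no detected language are checked against the longer list.
def suspicious?(repo)
  words = repo.language.to_s.empty? ? BASE_WORDS + EXTRA_WORDS : BASE_WORDS
  text = "#{repo.name} #{repo.description}".downcase
  words.any? { |word| text.include?(word) }
end

suspicious?(Repo.new(name: "free-router", description: "A free router", language: "Go")) # => false
suspicious?(Repo.new(name: "photo-tool-free", description: "download now", language: nil)) # => true
```

Because this uses plain substring matching, "freedom" would also trip the longer list. That may be acceptable for language-less repos, but it is a trade-off to decide on.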
That's certainly a possibility. We already have a Side note: Last night's email was pretty clean after adding those additional strings: https://nightly.changelog.com/2024/11/14 That Solana repo is probably trash, but other than that... |
The number of sketchy repos (the first 13 in the Nov 1 newsletter are illicit) has gotten to the point that Gmail automatically flagged the newsletter as spam and disabled all links, even after I explicitly marked it as not spam.