Tons of garbage on opensnp #559

Open
chaplin89 opened this issue Jun 16, 2024 · 2 comments


@chaplin89

Hey, not sure if you're aware but there's really a lot of garbage there, as OpenSNP is probably not checking what users are uploading.

Here's a normalized list of file types I've found in your db:

  • 7-zip
  • Apple binary property list
  • ASCII text
  • bgzip
  • bzip2
  • Composite-documents
  • CSV
  • data?
  • empty
  • Excel
  • EXE (???)
  • gzip
  • JPEG
  • PDF
  • PNG
  • RAR
  • RSID sidtune (?!)
  • Unicode Text
  • VCF
  • Word
  • XML
  • Zip
  • zlib

I was curious about the EXEs; at least they don't seem to contain viruses. One of them is from a tool called "MyHeritage Family Builder Genealogy Software" and all the rest are called "23andme to FASTA".
It shouldn't be too hard to clean this up and to add some checks when people upload something. I did this analysis with the Linux `file` utility; I think the same could probably be done on the server side as well? Watch out for command injection if you do. A neat improvement would be to store all the files in the same format.

I'm attaching a list of files with their format: file_type.csv

Also, the phenotype section doesn't seem very well monitored: someone created a "naked body phenotype" in order to share a naked picture of himself. Not sure about the scientific value of that lol

@gedankenstuecke
Member

Hey @chaplin89, thanks for getting in touch and for that list!

In our pre-parsing of uploaded files we already try to unzip files and get rid of the "wrong" files (aka everything that doesn't look like a 'correct' genotyping file; see the `if not file_is_ok` check), but for various reasons that doesn't always seem to work out!

I'll have a think about how we can keep a better eye on it!

@chaplin89
Author

chaplin89 commented Jun 16, 2024

I'm wondering what happens in that readline() when the input is a binary file. It looks like you're catching exceptions during the unzip, but what about errors raised later?

In any case, the way I would do this in Python is probably to run `file` on the upload and see what it reports. I think there's also a Python library for this; not sure about Ruby.
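
To make the idea concrete, here is a rough pure-Python sketch of the kind of check `file` does, so nothing is shelled out at all. It is a stand-in, not the `file` utility itself and not how OpenSNP works today: the signature table only covers some of the formats from the list above, and the names `MAGIC_SIGNATURES`, `sniff_type` and `sniff_file` are made up for illustration.

```python
import codecs

# Leading "magic" bytes for a handful of the formats seen in the dump.
# Hypothetical, deliberately incomplete table; `file` knows far more.
MAGIC_SIGNATURES = [
    (b"\x1f\x8b", "gzip"),            # also matches bgzip, which is gzip-compatible
    (b"BZh", "bzip2"),
    (b"PK\x03\x04", "zip"),           # also matches docx/xlsx containers
    (b"7z\xbc\xaf\x27\x1c", "7-zip"),
    (b"Rar!\x1a\x07", "rar"),
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"MZ", "exe"),                   # heuristic: a text file could start with "MZ" too
]


def sniff_type(head: bytes) -> str:
    """Classify the first bytes of an upload as a known format, text, or binary."""
    if not head:
        return "empty"
    for magic, name in MAGIC_SIGNATURES:
        if head.startswith(magic):
            return name
    if b"\x00" in head:
        return "binary"
    try:
        # Incremental decoding tolerates a multi-byte character cut off at the end.
        codecs.getincrementaldecoder("utf-8")().decode(head, final=False)
    except UnicodeDecodeError:
        return "binary"
    return "text"


def sniff_file(path: str, n: int = 4096) -> str:
    """Read only the first n bytes of the file, in binary mode, and classify them."""
    with open(path, "rb") as fh:
        return sniff_type(fh.read(n))
```

Anything that comes back as something other than "text" or a supported archive type could be rejected before the parser ever calls readline() on it.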

As a side note, I see there's a `system` call in that file that greps the input file for e-mail addresses. Not sure what the filename is at that point, but just a friendly reminder that if the uploader can control even a single part of that filename, they'll also be able to execute code on the server.
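
For illustration, this is roughly how I'd avoid that in Python: pass the command as an argument list so no shell ever parses the filename. It's only a sketch under assumptions; the e-mail regex and the names `EMAIL_PATTERN`, `grep_argv` and `count_email_lines` are mine, not taken from the OpenSNP code.

```python
import subprocess

# Hypothetical pattern; the expression used on the server may differ.
EMAIL_PATTERN = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"


def grep_argv(path: str) -> list:
    """Build the grep command as a list of arguments.

    The filename is one list element, so characters like ';' or '$(...)'
    are never interpreted. '--' ends option parsing, so a name starting
    with '-' is not mistaken for a flag.
    """
    return ["grep", "-E", "-c", "-e", EMAIL_PATTERN, "--", path]


def count_email_lines(path: str) -> int:
    """Count lines that contain something shaped like an e-mail address."""
    # No shell=True: the list is handed to the OS as-is.
    result = subprocess.run(grep_argv(path), capture_output=True, text=True, check=False)
    # grep exits 0 when lines matched, 1 when none matched, >1 on a real error.
    if result.returncode > 1:
        raise RuntimeError(result.stderr.strip())
    return int(result.stdout.strip() or 0)
```

The dangerous variant is building one string such as "grep ... " + filename and handing it to a shell. Ruby's `system` also accepts the command and each argument as separate parameters, which avoids the shell in the same way.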

EDIT
Side note 2: you're unzipping the file, but perhaps a better approach could be to unzip it and then re-zip with gzip? You'll save tons of space and bandwidth and on the other side you can read a gzip file without (fully) decompressing it first.
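
A minimal sketch of what I mean, using only Python's standard `gzip` and `shutil` modules. The function names `recompress_as_gzip` and `first_lines` are hypothetical, and this assumes the upload has already been unpacked to a plain file on disk.

```python
import gzip
import shutil


def recompress_as_gzip(src_path: str, dst_path: str) -> None:
    """Stream an already-unpacked upload into a .gz file without loading it into memory."""
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def first_lines(gz_path: str, n: int = 5) -> list:
    """Return the first n lines of a gzip file, decompressing only as much as needed."""
    lines = []
    # errors="replace" keeps a stray non-UTF-8 byte from aborting the read.
    with gzip.open(gz_path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            lines.append(line.rstrip("\n"))
            if len(lines) >= n:
                break
    return lines
```

Because gzip is a stream format, `first_lines` stops after a few lines and never touches the rest of the file, which is the "without (fully) decompressing" part.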
