Biological sequence data is related in complex ways and this makes validation difficult #202

Closed
jgreener64 opened this issue Sep 23, 2019 · 4 comments · Fixed by #219

Comments

@jgreener64

This is a cool project and the draft is looking nice.

I have one thing I would add, probably to "Tip 7: Address deep neural networks' increased tendency to overfit the dataset". I'm mentioning it here for discussion, and if wanted I would be happy to add some text describing it.

When splitting a dataset of biological sequences or structures, care should be taken that no sequence in the training set is evolutionarily related to a sequence in the validation or test set, and likewise between the validation and test sets. Many people split proteins into datasets using a threshold of 30% sequence identity, i.e. no sequence in the training set is 30% or more identical to any sequence in the validation set. However, it is known that many proteins share homology down to effectively 0% sequence identity - see Figure 1 of Chothia and Lesk 1986 for example.
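
To make the identity-threshold check concrete, here is a minimal sketch in Python. It assumes the sequences have already been aligned to equal length (for example, rows taken from one multiple sequence alignment), that `-` marks a gap, and that identity is counted over gap-free columns, which is only one of several definitions in use. The function names `percent_identity` and `find_leaks` are hypothetical, not from any library.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither sequence has a gap.

    Assumes both sequences come from the same alignment (equal length).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def find_leaks(train, validation, threshold=30.0):
    """Return (train_id, validation_id, identity) for every cross-set pair
    whose identity is at or above the threshold.

    `train` and `validation` map sequence IDs to aligned sequences.
    """
    leaks = []
    for t_id, t_seq in train.items():
        for v_id, v_seq in validation.items():
            identity = percent_identity(t_seq, v_seq)
            if identity >= threshold:
                leaks.append((t_id, v_id, identity))
    return leaks


# Toy sequences, made up for illustration only.
train = {"t1": "MKTAYIAKQR"}
validation = {"v1": "MKTAYLAKQR", "v2": "GSHWPLDNEC"}
print(find_leaks(train, validation))
```

An empty result from a check like this only shows that no pair exceeds the threshold. As noted above, it does not rule out homology between the sets.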

Poor dataset splitting means that the method being benchmarked appears to perform better than it does, because the benchmark is partially measuring an ability to detect homologs. This problem affects protein secondary structure prediction, tertiary contact prediction and protein design studies; in fact, it arises almost anywhere protein data is used for machine learning. One way round it is to use databases such as CATH and ECOD to split sequences based on structural and evolutionary relationships.
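
As a rough sketch of splitting by group, the Python below assigns whole groups to a single split so that related sequences never straddle the training, validation and test sets. It assumes each sequence already carries a group label, such as a superfamily identifier looked up in CATH or ECOD. The labels shown are placeholders, and `split_by_group` is a hypothetical helper, not part of any library.

```python
import random
from collections import defaultdict


def split_by_group(groups, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split sequence IDs into train/validation/test, keeping groups whole.

    `groups` maps sequence ID -> group label (e.g. a superfamily).
    `fractions` are target shares of sequences; they are only approximate
    because a group is never divided between splits.
    """
    members = defaultdict(list)
    for seq_id, label in groups.items():
        members[label].append(seq_id)

    labels = sorted(members)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(labels)

    names = ("train", "validation", "test")
    targets = [fraction * len(groups) for fraction in fractions]
    splits = {name: [] for name in names}
    for label in labels:
        # Place the group in the split that is furthest below its target size.
        idx = max(range(len(names)),
                  key=lambda i: targets[i] - len(splits[names[i]]))
        splits[names[idx]].extend(members[label])
    return splits


# Placeholder labels standing in for real superfamily identifiers.
groups = {
    "seq1": "superfamily_A", "seq2": "superfamily_A",
    "seq3": "superfamily_B",
    "seq4": "superfamily_C", "seq5": "superfamily_C",
    "seq6": "superfamily_D",
}
for name, ids in split_by_group(groups).items():
    print(name, sorted(ids))
```

How well this prevents leakage depends entirely on the quality of the group labels; the code only guarantees that sequences sharing a label end up in the same split.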

Tagging @Benjamin-Lee as I think he drafted this section.

@agitter
Collaborator

agitter commented Sep 28, 2019

@jgreener64 this should definitely be covered in one of the tips. The commentary in #203 discusses this too and gives references to other datasets where this is a problem in addition to protein sequences (e.g. gene networks). #190 is also related and links a paper describing evaluation in biochemistry.

Would you like to take a pass at drafting this text? The project has been dormant and could use new engaged contributors.

@jgreener64
Author

Great, I'll have a go at drafting some text.

@Benjamin-Lee
Owner

@jgreener64 sorry to bump this back up but are you still interested in drafting text? If not, I can try to add in a mention of this.

Benjamin-Lee added a commit that referenced this issue Sep 16, 2020
@jgreener64
Author

I don't think I'll have time to draft any text on this, feel free to mention it however you like of course.

Benjamin-Lee added a commit that referenced this issue Sep 16, 2020
rasbt added a commit that referenced this issue Sep 17, 2020
@Benjamin-Lee Benjamin-Lee mentioned this issue Oct 19, 2020
1 task