Biological sequence data is related in complex ways and this makes validation difficult #202

jgreener64 · 2019-09-23T09:53:43Z

This is a cool project and the draft is looking nice.

I have one thing I would add, probably to "Tip 7: Address deep neural networks' increased tendency to overfit the dataset". I'm mentioning it here for discussion, and I would be happy to add some text describing it if that's wanted.

When splitting a dataset of biological sequences or structures, care should be taken that sequences in the training, validation and test sets are not evolutionarily related to sequences in the other sets. Many people split proteins into datasets using a threshold of 30% sequence identity, i.e. no sequence in one set is 30% or more identical to a sequence in another set. However, it is known that many proteins share homology down to effectively 0% sequence identity - see Figure 1 of Chothia and Lesk 1986 for example.
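
To make the identity threshold concrete, here is a minimal Python sketch that checks a split for pairs at or above 30% identity. It assumes the sequences are already aligned to each other (equal length, `-` for gaps) and counts identity only over columns where neither sequence has a gap, which is just one of several conventions in use. The function names and the dict input format are made up for illustration.

```python
from itertools import product


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither sequence has a gap.

    Assumes both sequences come from the same alignment (equal length).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def cross_split_violations(train, test, threshold=30.0):
    """List (train_id, test_id, identity) for pairs at or above the threshold.

    `train` and `test` are dicts mapping an ID to an aligned sequence
    (a hypothetical input format chosen for this sketch).
    """
    hits = []
    for (id_a, seq_a), (id_b, seq_b) in product(train.items(), test.items()):
        identity = percent_identity(seq_a, seq_b)
        if identity >= threshold:
            hits.append((id_a, id_b, identity))
    return hits
```

An empty result here only means that no pair exceeds the identity cutoff; as noted above, homologs can fall well below any such threshold, so this check is necessary rather than sufficient.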

Poor dataset splitting makes the method being benchmarked appear to perform better than it really does, because one is partially measuring its ability to detect homologs rather than its ability to generalize. This problem affects protein secondary structure prediction, tertiary contact prediction and protein design studies; in fact, it arises almost anywhere protein data is used for machine learning. One way around it is to use databases such as CATH and ECOD to split sequences based on structural and evolutionary relationships.
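
A group-aware split along these lines could be sketched as follows. This is only an illustration: it assumes each sequence has already been given a group label (for example a superfamily identifier taken from a classification such as CATH or ECOD), and the function name, labels and greedy assignment rule are placeholders rather than anything either database provides.

```python
import random
from collections import defaultdict


def split_by_group(groups, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole groups to train/validation/test so no group is shared.

    `groups` maps a sequence ID to a group label (hypothetical input format).
    `fractions` are target sizes only; real sizes depend on group sizes.
    """
    by_group = defaultdict(list)
    for seq_id, label in groups.items():
        by_group[label].append(seq_id)

    # Shuffle the group labels reproducibly.
    labels = sorted(by_group)
    random.Random(seed).shuffle(labels)

    names = ("train", "validation", "test")
    targets = dict(zip(names, (f * len(groups) for f in fractions)))
    splits = {name: [] for name in names}
    for label in labels:
        # Put the whole group into the split furthest below its target size.
        name = max(names, key=lambda n: targets[n] - len(splits[n]))
        splits[name].extend(by_group[label])
    return splits


# Made-up IDs and labels, for illustration only.
example = {"p1": "sfA", "p2": "sfA", "p3": "sfB",
           "p4": "sfC", "p5": "sfC", "p6": "sfD"}
splits = split_by_group(example)
```

Because groups are kept intact, the resulting split sizes only approximate the requested fractions, especially when a few groups are very large.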

Tagging @Benjamin-Lee as I think he drafted this section.

agitter · 2019-09-28T15:18:37Z

@jgreener64 this should definitely be covered in one of the tips. The commentary in #203 discusses this too and gives references to other datasets where this is a problem in addition to protein sequences (e.g. gene networks). #190 is also related and links a paper describing evaluation in biochemistry.

Would you like to take a pass at drafting this text? The project has been dormant and could use new engaged contributors.

jgreener64 · 2019-09-30T13:43:24Z

Great, I'll have a go at drafting some text.

Benjamin-Lee · 2020-09-16T01:26:39Z

@jgreener64 sorry to bump this back up but are you still interested in drafting text? If not, I can try to add in a mention of this.

jgreener64 · 2020-09-16T08:42:00Z

I don't think I'll have time to draft any text on this; feel free to mention it however you like, of course.

Benjamin-Lee added a commit that referenced this issue Sep 16, 2020

Close #202

1ff7162

Benjamin-Lee mentioned this issue Sep 16, 2020

Close #202 #219

Merged

Benjamin-Lee added a commit that referenced this issue Sep 16, 2020

Apply suggestions from @cgreene review of #202

248c7ab

Co-authored-by: Casey Greene <[email protected]>

rasbt closed this as completed in #219 Sep 17, 2020

rasbt added a commit that referenced this issue Sep 17, 2020

Merge pull request #219 from Benjamin-Lee/close-202

8749b6d

Close #202

Benjamin-Lee mentioned this issue Oct 19, 2020

Mdkessler patch 10 #275

Merged
