[maker-devel] How to evaluate the results of gene prediction

Daniel Ence dence at genetics.utah.edu
Tue Mar 15 14:19:32 MDT 2016


Hi Wenbo, sorry for giving you a bogus suggestion. I should have realized that wouldn't work. The defaults for the parameters you're asking about are all "0.5", i.e., half of the exons, splice sites, etc. must be supported by EST alignment. It's your judgment call whether those are acceptable cutoffs for selecting your next set of training genes. We use those settings for all our training runs, and they generally give good results.
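
For what it's worth, a typical SNAP retraining round looks something like the sketch below (file names such as mygenome.all.gff are placeholders; -c, -e, and -o are the EST-confirmed splice site, EST-overlapped exon, and evidence-overlapped exon fractions, written out explicitly here at their 0.5 defaults):

    # filter MAKER's models into ZFF for SNAP training (writes genome.ann / genome.dna)
    maker2zff -c 0.5 -e 0.5 -o 0.5 mygenome.all.gff
    # sanity-check the training set, then export unique genes with 1 kb of flank
    fathom genome.ann genome.dna -validate
    fathom genome.ann genome.dna -categorize 1000
    fathom uni.ann uni.dna -export 1000 -plus
    # estimate parameters and assemble the new HMM
    forge export.ann export.dna
    hmm-assembler.pl mygenome . > mygenome.hmm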

~Daniel





Daniel Ence
Graduate Student
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

On Mar 15, 2016, at 2:07 PM, 陈文博 <chenwenbo1020 at gmail.com> wrote:

Hi Daniel,

Thanks for your help.

"In order to evaluate your final SNAP training files, you might try running SNAP with MAKER without any evidence and compare the distributions of AED (annotation edit distance) values with the distribution of AED values from your prior MAKER runs"

---- If I run SNAP in MAKER without any evidence, the AED will be 1 for every gene model, so I can't compare the AED distribution with that of the prior run.

When I examined the gene models in Apollo, I noticed that the introns predicted by SNAP are longer than those from the other predictors. Is there any parameter controlling this? Also, when I use the maker2zff script to filter the input models for training SNAP, do you have any suggestions for the "-c -e -o" parameters?

Here are my parameters in the CTL file:

alt_splice=0 #extra steps to report alternative splicing: 1 = yes, 0 = no
always_complete=1 #extra steps to force start and stop codons: 1 = yes, 0 = no
split_hit=257022 #length for the splitting of hits (expected max intron size for evidence alignments)
max_dna_len=1700000 #length for dividing up contigs into chunks (increases/decreases memory usage)

Thanks a lot!

Best,
Wenbo


2016-03-14 12:17 GMT-04:00 Daniel Ence <dence at genetics.utah.edu>:
Hi Wenbo, MAKER has been evaluated against gold-standard criteria in the MAKER, MAKER2, and MAKER-P publications. The difficulty when working with relatively unstudied organisms is that there may not be a gold standard for any given genome.

I think that the process you describe (using RNA-seq data, protein sequences, proteomes of related insects, and Swiss-Prot) would result in gene models that are probably ready for manual curation, not just for use as training input for another ab initio predictor (SNAP).

To answer your specific questions:

1) Evaluation of ab initio training is in terms of accuracy, sensitivity, and specificity. This is described in more detail in this review that Mark and I wrote several years ago: http://www.nature.com/nrg/journal/v13/n5/full/nrg3174.html
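
(For reference, those terms usually follow the gene-finding conventions of Burset & Guigó, computed at the nucleotide, exon, or gene level:

    sensitivity (Sn) = TP / (TP + FN)   i.e., the fraction of annotated features recovered
    specificity (Sp) = TP / (TP + FP)   i.e., the fraction of predictions that are correct

Note that "specificity" in this sense is what other fields call precision.)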
Augustus provides measures of accuracy, sensitivity, and specificity during its training procedures, although I can't recall exactly where it reports them. I believe that GeneMark provides similar reports during its own training process. I'm not certain about SNAP. In order to evaluate your final SNAP training files, you might try running SNAP with MAKER without any evidence and compare the distributions of AED (annotation edit distance) values with the distribution of AED values from your prior MAKER runs. I'd be surprised if two rounds of training improved the AED scores much, though.
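
As a rough sketch of that comparison (file names here are placeholders; MAKER writes an _AED attribute on every mRNA line of its GFF3 output):

    # pull the AED value from each mRNA line and sort for a quick look at the distribution
    awk -F'\t' '$3 == "mRNA"' round1.all.gff | grep -o '_AED=[0-9.]*' | cut -d= -f2 | sort -n > round1_aed.txt
    awk -F'\t' '$3 == "mRNA"' round2.all.gff | grep -o '_AED=[0-9.]*' | cut -d= -f2 | sort -n > round2_aed.txt

If your MAKER installation ships the AED_cdf_generator.pl helper, something like "perl AED_cdf_generator.pl -b 0.025 round1.all.gff" will print a cumulative AED distribution directly.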

2) If you have EST evidence that complements the RNA-seq data you already used, then feel free to include it. MAKER treats loci that are partially supported by EST sequences the same as it does all other loci: it evaluates the alignment evidence and chooses the ab initio prediction that is best supported by that evidence. Partial models result from loci where no complete ab initio prediction was produced by any of the predictors you used.

3) see above.

Let me know if that helps,
Daniel


Daniel Ence
Graduate Student
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330

> On Mar 13, 2016, at 8:22 PM, 陈文博 <chenwenbo1020 at gmail.com> wrote:
>
> Hi All,
>
> I am using MAKER to annotate an insect genome. First, I trained Augustus and GeneMark-ET outside of MAKER using aligned RNA-seq data. Then I gave them to MAKER. The evidence included assembled RNA-seq data, protein sequences from my insect, proteome sequences of three related insects, and Swiss-Prot. Finally, I used the gene models generated by MAKER with AED < 0.01 to train SNAP for two rounds. So my questions are:
>
> 1. How do I evaluate the results of ab initio training? How can I know whether these gene finders were well trained?
>
> 2. Should I add EST evidence? How does MAKER handle a locus where there is only partial EST evidence? Will partial EST sequences cause gene models to be partial?
>
> 3. Is there a gold standard for evaluating the results of gene prediction? How can I improve the predictions?
>
> Thank you!
>
> Best regards,
> Wenbo
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org




