[maker-devel] Improving gene prediction with Augustus and SNAP

Thu Feb 19 12:28:22 MST 2015

I would recommend using just the Trinity assembly. The Cufflinks results tend to be messy.

You shouldn’t need the est2genome or protein2genome results if you have already trained using the CEGMA results. You can then do one MAKER run (it can be on just part of the genome) with both SNAP and Augustus as the predictors and with est2genome and protein2genome turned off, and then give those results back to SNAP to train on again. This second round of bootstrap training is usually beneficial to SNAP (going beyond two rounds doesn’t really help). Also, don’t concatenate the previous training set with the new one for the second round of training. The idea is that the second-round training genes will be more correct than the first-round ones, so you want to use them instead.
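As a rough sketch, the relevant lines of maker_opts.ctl for that bootstrap run might look like the fragment below. The option names are the standard MAKER control-file keys. Every file name and the species name are hypothetical placeholders, and keeping the Trinity and protein files as evidence is my assumption about the setup, not something the run strictly requires.

```
# maker_opts.ctl: only the lines that matter for the bootstrap run
genome=genome_subset.fasta    # hypothetical path; can be just part of the genome
est=trinity.fasta             # hypothetical path to the Trinity assembly
protein=proteins.fasta        # hypothetical path to the protein evidence
snaphmm=snap_round1.hmm       # hypothetical name; SNAP HMM trained on the CEGMA output
augustus_species=my_species   # hypothetical name of the trained Augustus species
est2genome=0                  # do not infer gene models directly from ESTs
protein2genome=0              # do not infer gene models directly from proteins
```

For the second round, only snaphmm would change, pointing at the HMM retrained from this run's output.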

When you are done, look at one of the larger contigs in a viewer like Apollo and compare the raw Augustus calls, the raw SNAP calls, and the evidence-aware Augustus and SNAP calls produced by MAKER. If SNAP and Augustus are properly trained, they will produce similar calls, and those calls will also be similar to the evidence-aware calls from MAKER (this convergence is the result of the training). If one predictor still produces very divergent calls, just drop that predictor from the analysis. A bad predictor will make all results worse.
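If you want a rough number to go with the visual check, a minimal Python sketch along these lines could count how many calls from one predictor overlap a call from the other on a single contig. It is not part of MAKER. The function names are made up for illustration, and the source labels "snap_masked" and "augustus_masked" with feature type "match" are an assumption about how the raw predictions appear in your GFF3, so check the second and third columns of your own file first.

```python
from collections import defaultdict


def load_calls(gff_lines, contig, feature_type="match"):
    """Collect (start, end, strand) intervals per source for one contig."""
    calls = defaultdict(list)
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # the rest of the file is sequence, not annotation
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip anything that is not a 9-column GFF3 record
        seqid, source, ftype, start, end, _score, strand = cols[:7]
        if seqid == contig and ftype == feature_type:
            calls[source].append((int(start), int(end), strand))
    return calls


def fraction_overlapping(query, reference):
    """Fraction of query intervals overlapping a same-strand reference interval."""
    if not query:
        return 0.0
    hits = 0
    for q_start, q_end, q_strand in query:
        if any(q_strand == r_strand and q_start <= r_end and r_start <= q_end
               for r_start, r_end, r_strand in reference):
            hits += 1
    return hits / len(query)


# Tiny made-up example; real input would be the lines of a MAKER GFF3 file.
example = [
    "##gff-version 3",
    "scaffold_1\tsnap_masked\tmatch\t100\t900\t.\t+\t.\tID=s1",
    "scaffold_1\tsnap_masked\tmatch\t5000\t5600\t.\t-\t.\tID=s2",
    "scaffold_1\taugustus_masked\tmatch\t150\t950\t.\t+\t.\tID=a1",
    "scaffold_2\taugustus_masked\tmatch\t10\t500\t.\t+\t.\tID=a2",
]
calls = load_calls(example, "scaffold_1")
print(fraction_overlapping(calls["snap_masked"], calls["augustus_masked"]))
```

Overlap is a crude measure (it ignores exon structure), so treat a low fraction only as a hint to look at that predictor more closely in the viewer.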

--Carson

> On Feb 18, 2015, at 8:30 AM, Kai Kamm <kai.kamm at ecolevol.de> wrote:
> 
> Hello
> I have just started in this field of research and I want to annotate my assembled non-bilaterian invertebrate genome (100 Mb in 7000 scaffolds) with MAKER.
>  
> I have read the MAKER tutorials but I am still a little uncertain about the iterative procedure. What I have already done is:
>  
> - trained Augustus (using the web service) on the reference genome of a closely related species and its published dataset of "best transcripts", which are mainly based on gene prediction and some EST evidence. The published ESTs themselves were rejected by Augustus as not sufficient for training (too few long transcripts).
> - trained SNAP with the CEGMA output of my genome
> - assembled RNA-seq data with tophat/cufflinks and generated a GFF file with cufflinks2gff
> - de novo assembled RNA-seq data with Trinity
> 
> I have already done some preliminary MAKER runs with the initially trained Augustus and SNAP and some protein evidence, which gave good results.
>  
> Now my strategy is:
>  
> running maker with
> - the est2genome option using the cufflinks gff and the Trinity transcripts as EST evidence
>  
> - the protein2genome option using a protein file including all proteins of the closely related species, a less related non-bilaterian species and a collection of reviewed Swiss-Prot entries from one representative mammal and all protostomes
>  
> - Augustus and SNAP for gene prediction
>  
> When this is done I want to:
>  
> - create 2nd training set for SNAP from the merged gffs with maker2zff
> - train Augustus again with the Maker transcripts using the Augustus web service
> 
> And run Maker again
> 
> Is this a reasonable procedure? Or am I missing some important aspects here?
> Thanks in advance.