[maker-devel] About split genes problem in Maker annotations

Thu Nov 17 14:05:53 MST 2016

est2genome and protein2genome should only be used for initial training. They are not predictors; rather, they take an EST/protein alignment, find the longest ORF, and turn that ORF directly into a gene model. This is good enough to build a training dataset, but the models will almost always be partial and fragmented. Also, because the alignments both produce and support these models, they always score well, so their AED values are meaningless. Once you have a trained predictor, you should turn est2genome and protein2genome off. The alignments will then serve as hints to Augustus as to where introns and exons are likely to be, and this will give the desired behavior.
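
As a rough sketch of the settings described above, the relevant lines of maker_opts.ctl for a run after training might look like the following. The option names are standard MAKER control-file options; the file names and the Augustus species name are placeholders, not values from this thread.

```
#-----Evidence (used only as hints once the predictor is trained)
est=transcripts.fasta #placeholder: assembled RNAseq transcripts
protein=related_species_proteins.fasta #placeholder: proteins from the related species

#-----Gene Prediction
augustus_species=my_fish #placeholder: name of your trained Augustus species
est2genome=0 #off after training, so alignments are not turned directly into models
protein2genome=0 #off after training
```

All other options are assumed to stay at whatever values your existing control file already uses.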

Note that Augustus will attempt to build the most probable model given the hints and the assembly sequence. If there are any assembly issues affecting the ORF, the predictor will often skip exons or split the model at the locus. Also make sure you have built a species-specific repeat library to add to the default repeat libraries used by MAKER (you can use tools like RepeatModeler to do this). Otherwise much of your evidence will align spuriously and Augustus will generate false-positive results. You may also want to add a large dataset like UniProt/Swiss-Prot to the protein evidence.
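
For illustration, the repeat library and the extra protein set could be supplied in maker_opts.ctl roughly as follows. This is only a sketch: the file names are placeholders, and model_org=all is assumed to be the stock setting in your control file.

```
#-----Repeat Masking
model_org=all #assumed stock setting; kept so the default libraries are still used
rmlib=fish_repeats.fasta #placeholder: species-specific library, e.g. built with RepeatModeler

#-----Protein Homology Evidence
protein=related_species_proteins.fasta,uniprot_sprot.fasta #placeholders: comma-separated FASTA files
```

The rmlib library is used in addition to the default libraries, not in place of them, which matches the advice above.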

The best way to evaluate annotations and performance is to visually review the annotations in tools like Apollo. This will allow you to see whether evidence, gene predictions, and final models reach consensus, whether alignments don’t match (spurious alignment generally suggests a repeat masking or evidence quality issue), or whether raw ab initio predictions don’t match (which indicates insufficient training or an underlying assembly issue).

—Carson

> On Nov 16, 2016, at 8:01 PM, Prashant Narendra SHINGATE <prashantns at imcb.a-star.edu.sg> wrote:
> 
> Hi Carson,
>  
> We are annotating the genome of a fish with a relatively small genome (~450Mb) using Maker and encountering many genes that are split and predicted as multiple genes. We are using Augustus for de novo prediction. Fortunately we have full-length RNAseq for about 4000 genes (and total ~50k transcripts) from the same species, and whole-genome protein sequences from a very closely related species. 
>  
> First we trained Augustus using ~4000 full-length RNAseq transcripts from the same species. This trained Augustus model was used in the Maker annotation pipeline along with ~50k RNAseq transcripts (>1000bp) and whole-genome protein sequences from a closely related species.
>  
> We first tried annotating using the options est2genome=1, protein2genome=1 and Augustus ON. We found several genes were split, and the program seemed to give more weight to the Augustus prediction even though full-length RNAseq and protein sequences aligned to the predicted gene loci (visualized using JBrowse).
>  
> In the next trial we used est2genome=1, protein2genome=1 and Augustus OFF in the first step. In the second step we ran a further iteration with est2genome=0, protein2genome=0 and Augustus ON. The output still contained split genes.
>  
> In the third trial we used est2genome=1, protein2genome=1 and Augustus OFF and checked the output. In this output full-length genes were predicted whenever full-length RNAseq and/or protein sequences were available. This seems to suggest that when we use Augustus, more weight is given to Augustus de novo prediction and the synthesis of evidence from RNAseq and protein sequences is not happening.
>  
> Can you please let us know why we are getting split genes in spite of having full-length RNAseq and/or protein sequences? What changes would you suggest to the protocol to overcome this problem?
>  
> We thank you very much for your help and time.
>  
> Regards,
> Prashant Shingate, PhD <mailto:prashantns at imcb.a-star.edu.sg> :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR)
> 61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117 :: http://www.imcb.a-star.edu.sg/
