[maker-devel] Updating Reference Gene Models - Legacy - Strategy

Fri Feb 10 09:36:46 MST 2017

If the old models are poor, then I suggest you do new training using BUSCO, CEGMA, or the est2genome or protein2genome options within MAKER:
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors

Also see this thread: https://groups.google.com/forum/#!topic/maker-devel/FWMSTdqWQqI

model_gff is for existing gene models you want to keep, so none of these should go there: Cufflinks.gff, Stringtie.gff, Breaker.gff, Trinity.gff, Velvet.gff

Anything in model_gff will always make it into the final annotation set, even without any evidence support. By putting those files there, you are basically turning every feature in each of those files into a final gene model, no matter how bad it is.

Also, if the original models are poor, don’t put them there either. You can do reciprocal best BLAST hits between the final models and the old models to see how they match each other in the end. It will take a little data processing to make it work, though.
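As a rough illustration of that data processing, here is a minimal Python sketch of a reciprocal best hit comparison. It assumes you have already run BLAST in both directions (new models against old, and old against new) with tabular output (-outfmt 6), and that "best" means the highest bitscore per query. The function names and file names are hypothetical, not part of MAKER.

```python
from typing import Dict, Iterable, List, Tuple


def best_hits(rows: Iterable[str]) -> Dict[str, str]:
    """Map each query ID to its highest-bitscore subject.

    Expects BLAST -outfmt 6 lines: tab-separated, query in column 1,
    subject in column 2, bitscore in column 12.
    """
    best: Dict[str, Tuple[float, str]] = {}
    for line in rows:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, subject)
    return {query: subject for query, (_, subject) in best.items()}


def reciprocal_best_hits(
    new_vs_old: Iterable[str], old_vs_new: Iterable[str]
) -> List[Tuple[str, str]]:
    """Return sorted (new_id, old_id) pairs that are each other's best hit."""
    forward = best_hits(new_vs_old)
    backward = best_hits(old_vs_new)
    return sorted(
        (new_id, old_id)
        for new_id, old_id in forward.items()
        if backward.get(old_id) == new_id
    )


# Example call (file names are hypothetical):
# with open("new_vs_old.tsv") as fwd, open("old_vs_new.tsv") as rev:
#     pairs = reciprocal_best_hits(fwd, rev)
```

New models that end up in no pair are the ones worth inspecting by eye, since they either have no counterpart in the old set or match it ambiguously.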

All transcript-based files should go to est_gff, since they are evidence alignments and not model predictions. Breaker.gff should go to pred_gff, since it is a prediction model.

With Trinity, I suggest you provide the fasta file and allow MAKER to align and filter things, rather than providing a GFF3. The problem with using GFF3 is that you are basically short-circuiting upstream prioritization and filtering, saying “take this evidence as is.” Also, providing the same evidence from multiple sources is a bad idea. By purposely making the evidence dataset noisier, you are forcing lower accuracy.

My suggestion would be not to use Cufflinks (it will introduce a very high false positive rate). Provide the Trinity input as fasta (also make sure the jaccard_clip option was used when assembling). And you will have to manually review models with and without the Stringtie data to see if it hurts more than it helps.

Provide Breaker.gff to pred_gff, but still allow MAKER to run Augustus itself internally (otherwise you won’t be able to use protein evidence as hints).
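To put the assignments above side by side, here is a minimal Python sketch that groups input files by MAKER option and prints the matching maker_opts.ctl lines. The option names (est, est_gff, pred_gff, protein, model_gff) are real maker_opts.ctl keys; the file name Trinity.fasta and the helper itself are assumptions for illustration. Stringtie.gff is included only as the optional case to be reviewed with and without, and Cufflinks is left out.

```python
from typing import Dict, List

# Assumed file-to-option mapping, following the advice in this thread.
# "Trinity.fasta" is a hypothetical name for the Trinity assembly.
INPUT_ROLES: Dict[str, str] = {
    "Trinity.fasta": "est",         # let MAKER align and filter the transcripts
    "Stringtie.gff": "est_gff",     # optional: compare results with and without
    "Breaker.gff": "pred_gff",      # external predictions, not final models
    "swissprot.fasta": "protein",   # protein evidence
}


def render_ctl(roles: Dict[str, str]) -> List[str]:
    """Group files by option and render one 'option=file1,file2' line each."""
    # model_gff is always emitted, and stays empty unless a file maps to it.
    grouped: Dict[str, List[str]] = {"model_gff": []}
    for path, option in roles.items():
        grouped.setdefault(option, []).append(path)
    return [
        f"{option}={','.join(paths)}"
        for option, paths in sorted(grouped.items())
    ]


if __name__ == "__main__":
    print("\n".join(render_ctl(INPUT_ROLES)))
```

snaphmm and augustus_species are not set here because their values depend on the outcome of the training step; leaving augustus_species unset would also stop MAKER from running Augustus internally.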

Thanks,
Carson

> On Feb 10, 2017, at 8:50 AM, Alessandro Rossoni <Alessandro.Rossoni at uni-duesseldorf.de> wrote:
> 
> Dear makers of MAKER,
> first of all - thank you for this awesome program! In the context of my project, I have been running MAKER on a set of novel genomes and it worked very well :)
> 
> During the last days, I realized that the reference sequences of species A that I have been using as starting point for gene model prediction for species B, C and D are a bit flawed. How do I know that? Through visualization of novel RNA-Seq data that was mapped to the "old" gene models of species A, I came to the conclusion that the "old" gene models are not really accurate. In fact, between 52–62% of the reads map within annotated exonic regions of the genome and up to 47% map within intergenic regions. Intron/exon borders are pretty messed up and there are a lot more transcribed sequences in species A than previously thought.
> 
> Hence, I would love to update the gene models of species A including the new RNA-Seq evidence and hope to get more accurate gene models out of it. The more accurate gene models of species A would then be used to predict genes for species B, C and D (which are hopefully going to benefit from the more accurate input). However, I am having a little trouble understanding how to set the parameters in maker_opts.ctl.
> 
> My plan is to produce new RNA-Seq based gene models for species A using Cufflinks, Stringtie, Breaker, Trinity and Velvet. I would pass the output to the maker_opts.ctl as:
> 
> model_gff=Cufflinks.gff,Stringtie.gff,Breaker.gff,Trinity.gff,Velvet.gff #the new gene models
> est=Species_A_reference_ests.fasta  #the old/flawed gene models
> protein=swissprot.fasta
> 
> Question 1: is this correct so far?
> 
> But what sequences do I use for training the ab-initio predictors?
> snaphmm=
> augustus_species=
> 
> Question 2: Do I use the "old/flawed" sequences that I know are not really good? I am not sure what sequences to use for training.
> 
> Any help on this issue would be amazing!
> Best,
> Ale
> 
> 
> -- 
> Alessandro W. Rossoni, M.Sc.
> Institute for Plant Biochemistry
> Heinrich-Heine-University
> 
> --
> http:///www.plant-biochemistry.hhu.de/
> E-Mail:  alessandro.rossoni at hhu-duesseldorf.de
