[maker-devel] keep_preds option?

Fri Aug 31 10:52:10 MDT 2012

Hi Christoph,

Comments below:

On Aug 31, 2012, at 6:43 AM, Christoph Hahn wrote:

> Hello maker users and developers,
> 
> I am new to gene prediction and I am trying to use maker 2.25 on a newly assembled non-model organism's draft genome. Within maker I use genemark, SNAP and Augustus. I have a few questions:
> 

Welcome!

> 1. I was wondering what the keep_preds option means exactly.
> 
> I found two slightly different explanations on the option
> #Add unsupported gene prediction to final annotation set (maker2.25)
> #Add non-overlapping ab-initio gene prediction to final annotation set (found on the net - probably older maker version)
> 

It means to keep ab initio gene predictions for which there is no physical evidence.  By default, every MAKER annotation requires two pieces of information: a gene prediction and physical evidence that supports it.  MAKER runs the ab initio gene predictors and aligns transcript (EST/cDNA/mRNA-Seq) and protein sequences to the genome.  For each locus where one or more gene predictions exist, MAKER checks whether there is any physical evidence for gene expression at that locus (spliced EST or protein alignments).  If such evidence overlaps the predictions, MAKER decides which prediction best matches the evidence and promotes that prediction to an annotation.  If no evidence overlaps any of the predictions, those predictions are not included in the output annotation file (although they are saved).

If you set keep_preds=1, then for each locus where predictions exist MAKER keeps one and promotes it to an annotation even though there is no physical evidence.  The description 'non-overlapping ab-initio' means that MAKER has clustered all ab initio predictions at one locus and chosen one representative transcript to output.
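
The per-locus decision described above can be sketched in a few lines of Python.  This only illustrates the logic and is not MAKER's actual code: the Prediction class, the evidence_overlap score and the annotate_locus function are hypothetical names, and the sketch does not show how MAKER really scores predictions against evidence or picks a representative.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    name: str
    # Hypothetical score: 0.0 means no physical evidence overlaps the prediction.
    evidence_overlap: float


def annotate_locus(predictions, keep_preds=0):
    """Return the prediction promoted to an annotation at one locus, or None."""
    if not predictions:
        return None
    # Pick the prediction that best matches the evidence (first one wins ties).
    best = max(predictions, key=lambda p: p.evidence_overlap)
    if best.evidence_overlap > 0:
        return best  # supported by evidence: always promoted
    if keep_preds == 1:
        return best  # unsupported, but kept as the locus representative
    return None      # unsupported and keep_preds=0: no annotation


unsupported = [Prediction("snap", 0.0), Prediction("augustus", 0.0)]
supported = [Prediction("snap", 0.2), Prediction("augustus", 0.8)]
```

In this sketch, agreement between predictors never counts as evidence: the unsupported locus yields no annotation with keep_preds=0 no matter how many predictions it contains.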

> As far as I understood keep_preds=0 only retains gene models for which the ab initio predictions agree. But how many, all three? two of three?
> keep_preds=1 instead keeps all gene models regardless if the different programs agree, right?
> 

MAKER does not take the presence of multiple ab initio predictions as evidence and thus in the absence of aligned physical evidence MAKER will not output an annotation even if all three ab initio tools predict a gene at that locus.

> In my case I get substantial differences in the number of gene models found between the two settings, while with =1 I get a number that is close to what we would expect. How would you interpret that? What would you recommend me to do? Obviously =0 is the safer option.

If you think that the number of genes you are getting from a MAKER run is too few, you could run MAKER with keep_preds=1.  After the run is finished, use a tool like IPRScan to search all MAKER predictions for protein domain content and push that IPRScan output back into the MAKER GFF file with the ipr_update_gff script.  Then, as a final step, you can run over the GFF file and remove any gene model that has neither physical evidence (AED < 1) nor protein domain content (Dbxref=PFAM:XXX etc.).  Sorry, there's not a script prepackaged with MAKER for that yet.
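
A minimal Python sketch of that final filtering step might look like the following.  It is not part of MAKER and rests on several assumptions: that each mRNA line carries its AED in an `_AED` attribute, that domain hits appear in `Dbxref` with a `PFAM:` prefix, that a gene is kept if any of its transcripts passes, and that parent features appear before their children in the file.  The function names are made up for illustration.

```python
def parse_attrs(col9):
    """Turn a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in col9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def supported_gene_ids(gff_lines, db_prefix="PFAM:"):
    """IDs of genes with at least one mRNA that has AED < 1 or a domain hit."""
    keep = set()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attrs(cols[8])
        # Assumption: a missing _AED is treated as 1 (no evidence).
        aed = float(attrs.get("_AED", "1"))
        xrefs = attrs.get("Dbxref", "").split(",")
        if aed < 1 or any(x.startswith(db_prefix) for x in xrefs):
            keep.update(p for p in attrs.get("Parent", "").split(",") if p)
    return keep


def filter_gff(gff_lines, db_prefix="PFAM:"):
    """Drop unsupported genes and their child features; keep everything else."""
    lines = [line.rstrip("\n") for line in gff_lines]
    keep = supported_gene_ids(lines, db_prefix)
    dropped, out, in_fasta = set(), [], False
    for line in lines:
        cols = line.split("\t")
        if in_fasta or line.startswith("#") or len(cols) != 9:
            in_fasta = in_fasta or line.startswith("##FASTA")
            out.append(line)  # comments, directives and FASTA pass through
            continue
        attrs = parse_attrs(cols[8])
        fid = attrs.get("ID")
        parents = [p for p in attrs.get("Parent", "").split(",") if p]
        unsupported_gene = cols[2] == "gene" and fid not in keep
        if unsupported_gene or any(p in dropped for p in parents):
            if fid:
                dropped.add(fid)  # remember it so its children are dropped too
            continue
        out.append(line)
    return out
```

To use it, read the lines of the updated GFF file, pass them to filter_gff, and write the returned lines to a new file.  It is worth checking the attribute names against your own MAKER output before relying on it.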

> 
> 2. I tried to use EST data of an alternative organism in altest= (#EST/cDNA sequence file in fasta format from an alternate organism). The organism is quite distantly related, but it's the closest I have so I thought I'd give it a shot. I ran maker twice with identical settings except in altest and est2genome=0/1. The number of genes predicted is identical with both approaches, so I am not sure whether or not the EST data was actually used or it's just too distantly related. Any easy way to assess this?

Typically EST evidence from another organism with altest will add little in the way of additional support (compared to just using protein evidence from, say, Swiss-Prot), and this would be especially true if your altest organism is distantly related.  I'm not sure I understand your altest/est2genome combos well enough to comment in more detail.  I could see four possible combinations there: which two gave identical results?
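
One rough way to check whether the alternate-organism ESTs produced any alignments at all is to tally features by the source column (column 2) of the GFF3 output from each run and compare the counts.  The sketch below does only that.  The function name is made up, and which source label the altest alignments appear under is an assumption you would need to confirm in your own output; the labels in the sample data are placeholders.

```python
from collections import Counter


def count_by_source(gff_lines):
    """Count GFF3 feature lines per source label (column 2)."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # sequence section: no more features
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9:
            counts[cols[1]] += 1
    return counts


# Placeholder data: the source labels here are illustrative only.
sample = [
    "##gff-version 3",
    "c1\tmaker\tgene\t1\t100\t.\t+\t.\tID=g1",
    "c1\tsnap_masked\tmatch\t1\t100\t.\t+\t.\tID=m1",
    "c1\tsnap_masked\tmatch_part\t1\t100\t.\t+\t.\tID=m1:hsp1;Parent=m1",
]
counts = count_by_source(sample)
```

If a source label is present in the run with altest set and missing from the run without it, the alignments were at least computed, even if they did not change the number of gene models.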

> 
> 3. I am running maker in several passes and after each pass I am training SNAP using the result of the previous pass. Then for every pass I run maker from scratch. Would you recommend to supply the gff of the previous pass in "#-----Re-annotation Using MAKER Derived GFF3
> maker_gff= #re-annotate genome based on this gff3 file", instead?
> 

No, 'Re-annotation using MAKER Derived GFF3' is used for re-annotation of a genome when you want certain parts of the previous run to be passed through unchanged.  When retraining SNAP, however, you want MAKER to re-evaluate each annotation in light of the new predictions made by the retrained SNAP.  MAKER should run really fast in all of the runs after the first one because, as long as you haven't changed the evidence files, it won't have to redo any of the alignments.

B

> Thanks in advance for any thoughts/advice on these things! I appreciate your help!
> 
> much obliged,
> Christoph
> 

Barry Moore
Research Scientist
Dept. of Human Genetics
University of Utah
Salt Lake City, UT 84112
--------------------------------------------
(801) 585-3543
