[maker-devel] keep_preds option?

Fri Aug 31 13:03:14 MDT 2012

I concur with everything Barry said.  Let me add that alt-ESTs do not get
polished around splice sites (exonerate won't handle them), whereas ESTs
and proteins do.  This means that the utility of alt-ESTs in adding UTR and
splice information is zero.  They just function as an anchor to maintain
gene models that might otherwise have been rejected.  It also means that
altest=some.fasta together with est2genome=1 will produce virtually no
additional results, because est2genome requires that the splice sites make
sense.  You are better off using proteins with protein2genome=1 if you don't
have ESTs and want partial models for training.  Once you have a trained ab
initio gene predictor, turn the est2genome and protein2genome options off.
Otherwise they will give weird partial models that decrease the quality of
your final annotations (partial models are OK for training, but that's about
it).
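
As a rough sketch of switching from a training run to a final run, the
snippet below rewrites option lines in the text of a MAKER control file.
The helper name set_ctl_options and the three-line excerpt are made up for
illustration; the excerpt is not a complete maker_opts.ctl, and the
"key=value #comment" layout is assumed.

```python
import re

def set_ctl_options(ctl_text, updates):
    """Rewrite 'key=value #comment' lines for the keys in `updates`.

    Hypothetical helper: assumes one option per line and that anything
    after '#' is a comment to be preserved.
    """
    out = []
    for line in ctl_text.splitlines():
        m = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if m and m.group(1) in updates:
            comment = m.group(3) or ""
            line = f"{m.group(1)}={updates[m.group(1)]} {comment}".rstrip()
        out.append(line)
    return "\n".join(out)

# Illustrative excerpt of a training-phase configuration (no ESTs available).
training_ctl = """\
#-----Gene Prediction
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=1 #infer predictions from protein homology, 1 = yes, 0 = no
keep_preds=0 #Add unsupported gene prediction to final annotation set
"""

# Once the ab initio predictor is trained, turn both *2genome options off.
final_ctl = set_ctl_options(training_ctl, {"est2genome": 0, "protein2genome": 0})
print(final_ctl)
```

Header lines that start with '#' and options not named in the update
dictionary pass through unchanged.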

If you are getting too low a gene count with keep_preds=0, then you probably
need to add more evidence.  Try adding all proteins from a couple of related
species (the protein= option accepts comma-separated lists of files).  If
your genome is a fungus, oomycete, or prokaryote, then setting keep_preds=1
is usually safe.  These are genomes with high gene density and simple gene
structure, so ab initio predictors do really well and don't need as much
help from the evidence.  For other organisms, leave it set to 0 or you will
get a lot of false positives (on some genomes with complex gene structure,
false positives can outnumber the real genes by 10 to 1 if you turn this on).

Thanks,
Carson

From:  Barry Moore <barry.moore at genetics.utah.edu>
Date:  Friday, 31 August, 2012 12:52 PM
To:  Christoph Hahn <chrisi.hahni at gmail.com>
Cc:  <maker-devel at yandell-lab.org>
Subject:  Re: [maker-devel] keep_preds option?

Hi Christoph,

Comments below:

On Aug 31, 2012, at 6:43 AM, Christoph Hahn wrote:

> Hello maker users and developers,
> 
> I am new to gene prediction and I am trying to use maker 2.25 on a newly
> assembled draft genome of a non-model organism. Within maker I use genemark,
> SNAP and Augustus. I have a few questions:
> 

Welcome!

> 1. I was wondering what the keep_preds option means exactly.
> 
> I found two slightly different explanations on the option
> #Add unsupported gene prediction to final annotation set (maker2.25)
> #Add non-overlapping ab-inito gene prediction to final annotation set (found
> on the net - probably older maker version)
> 

It means to keep ab initio gene predictions for which there is no physical
evidence.  By default, two pieces of information are required for every
MAKER annotation: an ab initio gene prediction and physical evidence that
supports it.  MAKER runs the ab initio gene predictors and aligns transcript
(EST/cDNA/mRNA-Seq) and protein sequences to the genome.  For each locus
where one or more gene predictions exist, MAKER checks whether there is any
physical evidence for gene expression at that locus, i.e. spliced EST or
protein alignments overlapping the predictions.  If there is, MAKER decides
which prediction best matches the evidence and promotes that prediction to
an annotation.  If there is no evidence overlapping any of the predictions,
then those predictions are not included in the output annotation file
(although they are saved).  If you set keep_preds=1, then for each locus
where prediction(s) exist MAKER keeps one and promotes it to an annotation
even though there is no physical evidence.  The description
'non-overlapping ab-initio' means that MAKER has clustered all ab initio
predictions at one locus and chosen one representative transcript to output.
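
The per-locus decision described above can be sketched as a toy model.
This is not MAKER's code: the data layout, the 0..1 "agreement" score, and
the choice of the first prediction as the representative are all
assumptions made for illustration.

```python
def promote(loci, keep_preds=False):
    """Toy model of the per-locus decision (not MAKER's implementation).

    Each locus is a dict with:
      'predictions': list of (name, agreement) tuples, where agreement is a
                     hypothetical 0..1 score of how well the prediction
                     matches the aligned evidence
      'has_evidence': True if any EST/protein alignment overlaps the locus
    """
    annotations = []
    for locus in loci:
        preds = locus["predictions"]
        if not preds:
            continue
        if locus["has_evidence"]:
            # Promote the prediction that best matches the evidence.
            best = max(preds, key=lambda p: p[1])
            annotations.append(best[0])
        elif keep_preds:
            # No evidence: keep one representative (the first, arbitrarily).
            annotations.append(preds[0][0])
        # Otherwise the predictions are left out of the annotation set.
    return annotations

loci = [
    {"predictions": [("snap-1", 0.4), ("augustus-1", 0.9)], "has_evidence": True},
    {"predictions": [("genemark-2", 0.0), ("snap-2", 0.0)], "has_evidence": False},
]
print(promote(loci))                   # only the evidence-supported locus
print(promote(loci, keep_preds=True))  # plus one prediction from the other locus
```

Note that in the second locus two predictors agree, yet nothing is promoted
unless keep_preds is on; agreement between ab initio tools is not counted
as evidence.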

> As far as I understood keep_preds=0 only retains gene models for which the ab
> initio predictions agree. But how many, all three? two of three?
> keep_preds=1 instead keeps all gene models regardless if the different
> programs agree, right?
> 

MAKER does not take the presence of multiple ab initio predictions as
evidence and thus in the absence of aligned physical evidence MAKER will not
output an annotation even if all three ab initio tools predict a gene at
that locus.

> In my case I get substantial differences in the number of gene models found
> between the two settings, while with =1 I get a number that is close to what
> we would expect. How would you interpret that? What would you recommend me to
> do? Obviously =0 is the safer option.

If you think that the number of genes you are getting from a MAKER run is
too few, you could run MAKER with keep_preds=1.  After the run is finished,
use a tool like IPRScan to search all MAKER predictions for protein domain
content and push that IPRScan output back into the MAKER GFF file with the
ipr_update_gff script.  Then as a final step you can run over the GFF file
and remove any gene model that has neither physical evidence (AED < 1) nor
protein domain content (Dbxref=PFAM:XXX etc.).  Sorry, there's not a script
prepackaged with MAKER for that yet.
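
A minimal sketch of such a filter follows.  It is not part of MAKER, and it
rests on several assumptions about the GFF3 layout: annotation rows carry
"maker" in the source column, each mRNA row has an _AED attribute, and
ipr_update_gff writes domain hits into Dbxref with a "Pfam:" (or "PFAM:")
prefix.  Check these against your own file before relying on it.

```python
def parse_attrs(col9):
    """Split a GFF3 attribute column into a dict."""
    return dict(kv.split("=", 1) for kv in col9.strip().split(";") if "=" in kv)

def filter_gff(lines):
    """Drop maker transcripts that have neither AED < 1 nor a Pfam Dbxref,
    then drop maker genes left without any transcript.  All other rows
    (comments, evidence alignments, FASTA) pass through unchanged."""
    rows = [l.rstrip("\n") for l in lines]
    keep_mrna, keep_gene = set(), set()
    # Pass 1: decide which transcripts (and hence genes) survive.
    for row in rows:
        cols = row.split("\t")
        if len(cols) != 9 or cols[1] != "maker" or cols[2] != "mRNA":
            continue
        a = parse_attrs(cols[8])
        supported = float(a.get("_AED", 1.0)) < 1.0   # missing AED = unsupported
        has_domain = "pfam:" in a.get("Dbxref", "").lower()
        if supported or has_domain:
            keep_mrna.add(a.get("ID"))
            keep_gene.add(a.get("Parent"))
    # Pass 2: emit surviving rows in the original order.
    out = []
    for row in rows:
        cols = row.split("\t")
        if len(cols) != 9 or cols[1] != "maker":
            out.append(row)
            continue
        a = parse_attrs(cols[8])
        if cols[2] == "gene":
            ok = a.get("ID") in keep_gene
        elif cols[2] == "mRNA":
            ok = a.get("ID") in keep_mrna
        else:  # exon, CDS, UTR: keep if any parent transcript survives
            ok = bool(set(a.get("Parent", "").split(",")) & keep_mrna)
        if ok:
            out.append(row)
    return out

def row(ftype, attrs, source="maker"):
    """Build a made-up GFF3 row for the demonstration below."""
    return "\t".join(["chr1", source, ftype, "1", "100", ".", "+", ".", attrs])

demo = [
    "##gff-version 3",
    row("gene", "ID=g1"), row("mRNA", "ID=g1-t1;Parent=g1;_AED=0.20"),
    row("gene", "ID=g2"), row("mRNA", "ID=g2-t1;Parent=g2;_AED=1.00"),
    row("gene", "ID=g3"),
    row("mRNA", "ID=g3-t1;Parent=g3;_AED=1.00;Dbxref=InterPro:IPR000001,Pfam:PF00001"),
]
for line in filter_gff(demo):
    print(line)
```

In the made-up demo, g1 survives on evidence, g3 survives on domain content,
and g2 is dropped.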

> 
> 2. I tried to use EST data of an alternative organism in altest= (#EST/cDNA
> sequence file in fasta format from an alternate organism). The organism is
> quite distantly related, but it's the closest I have so I thought I'd give it
> a shot. I ran maker twice with identical settings except in altest and
> est2genome=0/1. The number of genes predicted is identical with both
> approaches, so I am not sure whether or not the EST data was actually used or
> it's just too distantly related. Any easy way to assess this?

Typically EST evidence from another organism with altest will add little in
the way of additional support (compared to just using protein evidence from,
say, Swiss-Prot), and this would be especially true if your altest data are
distantly related.  I'm not sure I understand your altest/est2genome
combinations well enough to comment in more detail.  I could see four
possible combinations there: which two gave identical results?

> 
> 3. I am running maker in several passes and after each pass I am training SNAP
> using the result of the previous pass. Then for every pass I run maker from
> scratch. Would you recommend to supply the gff of the previous pass in
> "#-----Re-annotation Using MAKER Derived GFF3
> maker_gff= #re-annotate genome based on this gff3 file", instead?
> 

No, 'Re-annotation using MAKER Derived GFF3' is used for re-annotation of a
genome when you want certain parts of the previous run to be passed through
unchanged, but with retraining SNAP you want MAKER to re-evaluate each
annotation in light of the new predictions made by the retrained SNAP.
MAKER should run really fast in all of the runs after the first one because
as long as you haven't changed the evidence files it won't have to redo any
of the alignments.

B

> Thanks in advance for any thoughts/advice on these things! I appreciate your
> help!
> 
> much obliged,
> Christoph
> 
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Barry Moore
Research Scientist
Dept. of Human Genetics
University of Utah
Salt Lake City, UT 84112
--------------------------------------------
(801) 585-3543
