[maker-devel] keep_preds option?

Mon Sep 10 05:10:33 MDT 2012

The final annotations are really produced by GeneMark, SNAP, Augustus etc.
MAKER basically takes the physical evidence, turns it into 'hints' for these
programs (i.e. exon and CDS scoring bonuses), and then lets them run again.
If you retrain them they will behave differently.  If you add a new one you
may also get a model from the third algorithm that seems to match the
evidence better.  Sometimes one algorithm will call one gene where another
thinks it should be two genes while the third thinks it is really a larger
gene merged with exons from a neighboring gene.  MAKER will add hints to try
and improve the ab initio predictors' performance and choose the model that
seems to make the most sense given the evidence.  The source of a gene model
is evident from the name MAKER gives it.  If snap is in the name,
then the model was derived from snap.  If maker and snap are in the name,
then it is a model derived from snap with MAKER hints (otherwise it was
derived completely from SNAP's training with no MAKER input).
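
As a rough illustration of the naming rule above, here is a minimal Python
sketch that guesses the origin of a model from its name.  The function name,
the predictor list and the two example names are assumptions made for
illustration; real names in your output may look different, and a plain
substring test will misfire if a contig name happens to contain one of these
words.

```python
def model_source(name: str) -> str:
    """Guess how a gene model was produced from the name MAKER gave it.

    Assumption: the predictor name (and 'maker', when hints were used)
    appears somewhere in the model name, as described in the thread.
    """
    lowered = name.lower()
    predictors = ("snap", "augustus", "genemark")
    used = next((p for p in predictors if p in lowered), None)
    if used is None:
        return "unknown"
    if "maker" in lowered:
        return f"{used} with MAKER hints"
    return f"{used} without MAKER hints"


# Hypothetical names, for illustration only.
examples = [
    "maker-scaffold1-snap-gene-0.3-mRNA-1",
    "snap_masked-scaffold1-abinit-gene-0.1-mRNA-1",
]
for example in examples:
    print(example, "->", model_source(example))
```

Tallying these labels over the models of two runs gives a quick view of which
predictor the final models came from in each run.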

Thanks,
Carson

From:  Christoph Hahn <chrisi.hahni at gmail.com>
Date:  Wednesday, 5 September, 2012 7:59 AM
To:  Carson Holt <carsonhh at gmail.com>, Barry Moore
<barry.moore at genetics.utah.edu>
Cc:  <maker-devel at yandell-lab.org>
Subject:  Re: [maker-devel] keep_preds option?

Hello Barry and Carson,

 Thank you very much for the extensive replies!! Very helpful!!

> 2. I tried to use EST data of an alternative organism in altest= (#EST/cDNA
> sequence file in fasta format from an alternate organism). The organism is
> quite distantly related, but it's the closest I have so I thought I'd give it
> a shot. I ran maker twice with identical settings except in altest and
> est2genome=0/1. The number of genes predicted is identical with both
> approaches, so I am not sure whether or not the EST data was actually used or
> it's just too distantly related. Any easy way to assess this?

> Typically EST evidence from another organism with alt_est will add little in
> the way of additional support (compared to just using protein evidence from
> say Swiss-prot) and this would be especially true if your alt_est is
> distantly related.  I'm not sure I really understand your
> alt_est/est2genome combos to comment in more detail.  I could see four
> possible combinations there: which two gave identical results?
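
One rough way to check whether an evidence set left any trace in a run is to
tally the features in the output GFF3 by their source and type columns
(columns 2 and 3) and compare the tallies between runs.  The sketch below is
plain Python under that assumption.  The thread does not say which source
label MAKER writes for alt-EST alignments, so the code just reports whatever
labels appear, and the example rows are made up.

```python
from collections import Counter


def count_by_source(lines):
    """Count GFF3 features per (source, type) pair."""
    counts = Counter()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # sequence section follows; no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature row
        counts[(cols[1], cols[2])] += 1
    return counts


# Hypothetical rows, for illustration only.
demo = [
    "##gff-version 3",
    "\t".join(["c1", "maker", "gene", "1", "100", ".", "+", ".", "ID=g1"]),
    "\t".join(["c1", "maker", "mRNA", "1", "100", ".", "+", ".", "ID=m1;Parent=g1"]),
    "\t".join(["c1", "blastx", "protein_match", "5", "90", ".", "+", ".", "ID=p1"]),
]
for (source, feature_type), n in sorted(count_by_source(demo).items()):
    print(source, feature_type, n)
```

On a real file you would pass an open file handle instead of the demo list.
If the tallies for two runs are identical in every row, the extra evidence
most likely produced no alignments at all.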

 What I meant was that I ran maker once without any alt_est evidence and
est2genome=0 and a second time with alt_est=some.fasta and est2genome=1. The
result was the same. Sorry for not making myself clear enough. I thought
that the est2genome=1 switch just enables physical EST evidence to be
used. Therefore, I thought neither alt_est=some.fasta, est2genome=0 nor
alt_est=nothing, est2genome=1 would make any sense. I had misunderstood
this.

 I will follow Carson's advice and try to use more protein evidence from
related species (in addition to uniprot). Running right now - let's see
where that leaves me. The IPRScan approach suggested by Barry to assess gene
models without physical evidence sounds very interesting. I will definitely
look into that.

 A question concerning an issue I just discovered:
 I ran maker twice with the same physical evidence. First time using SNAP and
Genemark, second time using SNAP, Genemark and AUGUSTUS (set to the closest
related species available - same phylum, different class). The second run gives
fewer gene models. In another context I found that the second pass of Maker
using SNAP and Genemark (after training SNAP on the predictions of the first
pass) and the same physical evidence yields fewer gene annotations. How can
that be given the same physical evidence?

 Thanks again for your help! It is much appreciated!

 cheers,
 Christoph

 Am 31.08.2012 21:03, schrieb Carson Holt:

>  
> I concur with everything Barry said.  Also let me add that alt-ESTs do not get
> polished around splice sites (exonerate won't handle them).  However ESTs and
> proteins do.  This means that the utility of alt-ESTs in adding UTR and
> splice information is zero.  They just function as an anchor to maintain gene
> models that might have otherwise been rejected.  This also means
> alt_est=some.fasta together with est2genome=1 will produce virtually no
> additional results because est2genome requires that the splice site makes
> sense.  You are better off using proteins with protein2genome=1 if you don't
> have ESTs and want partial models for training.  Once you have a trained ab
> initio gene predictor, turn the est2genome and protein2genome options off.
> Otherwise they will give weird partial models that decrease the quality of
> your final annotations (partial models are ok for training but that's about
> it).
>
> If you are getting too low a gene count with keep_preds=0, then you probably
> need to add more evidence.  Try adding all proteins from a couple of related
> species (the protein= option accepts comma separated lists of files). If your
> genome is a fungus, oomycete, or prokaryote, then setting keep_preds=1 is
> usually safe.  These are genomes with high gene density and simple gene
> structure, so ab initio predictors do really well and don't need as much help
> from the evidence.  For other organisms, leave it set to 0 or you will get a
> lot of false positives (false positives on some genomes with complex gene
> structure can outnumber the genes by 10 to 1 if you turn this on).
>
> Thanks,
> Carson
>
> From:  Barry Moore <barry.moore at genetics.utah.edu>
>  Date:  Friday, 31 August, 2012 12:52 PM
>  To:  Christoph Hahn <chrisi.hahni at gmail.com>
>  Cc:  <maker-devel at yandell-lab.org>
>  Subject:  Re: [maker-devel] keep_preds option?
>
> Hi Christopher,
>
> Comments below:
>
> On Aug 31, 2012, at 6:43 AM, Christoph Hahn wrote:
>
>> Hello maker users and developers,
>>
>>  I am new to gene prediction and I am trying to use maker 2.25 on a newly
>> assembled draft genome of a non-model organism. Within maker I use genemark,
>> SNAP and Augustus. I have a few questions:
>>
>
> Welcome!
>
>>
>> 1. I was wondering what the keep_preds option means exactly.
>>
>>  I found two slightly different explanations of the option:
>>  #Add unsupported gene prediction to final annotation set (maker2.25)
>>  #Add non-overlapping ab-inito gene prediction to final annotation set (found
>> on the net - probably older maker version)
>>
>
> It means to keep ab initio gene predictions for which there is no physical
> evidence.  There are two pieces of information that are required for every
> MAKER annotation (by default): an ab initio gene prediction and physical
> evidence that supports it.  MAKER runs the ab initio gene predictors and
> aligns transcript (EST/cDNA/mRNASeq transcripts) and protein sequences to the
> genome.  For each locus where one or more gene predictions exist, MAKER checks
> to see if there is any physical evidence for gene expression at that locus
> (RNA/protein sequence alignments).  If there is evidence (spliced EST or
> protein alignments) overlapping the predictions, MAKER decides which
> prediction best matches the evidence and promotes that prediction to an
> annotation.  If there is no evidence overlapping any of the predictions then
> those predictions are not included in the output annotation file (although
> they are saved).  If you set keep_preds=1 then for each locus where
> prediction(s) exist MAKER keeps one and promotes it to an annotation even
> though there is no physical evidence.  The description 'non-overlapping
> ab-initio' means that MAKER has clustered all ab initio predictions at one
> locus and chosen one representative transcript to output.
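
The per-locus decision described above can be pictured with a toy Python
sketch.  This is not MAKER's code: the Prediction class, the evidence_overlap
score and the choice of the first prediction as the representative are all
assumptions made for illustration.  The thread does not say how MAKER scores
the match to evidence or how it picks the representative when keep_preds=1.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Prediction:
    name: str
    # Hypothetical score: 0.0 means no physical evidence overlaps the
    # prediction, larger values mean a better match to the evidence.
    evidence_overlap: float


def choose_annotation(locus_predictions: List[Prediction],
                      keep_preds: bool) -> Optional[Prediction]:
    """Pick at most one prediction to promote to an annotation at a locus."""
    if not locus_predictions:
        return None
    best = max(locus_predictions, key=lambda p: p.evidence_overlap)
    if best.evidence_overlap > 0:
        return best  # supported by evidence: promote the best match
    if keep_preds:
        # No evidence at all: keep one representative anyway.
        # Taking the first one is a placeholder choice.
        return locus_predictions[0]
    return None  # unsupported predictions are left out of the annotations


# Three predictors agree on a locus, but nothing aligns there.
unsupported = [Prediction("snap-1", 0.0),
               Prediction("augustus-1", 0.0),
               Prediction("genemark-1", 0.0)]
print(choose_annotation(unsupported, keep_preds=False))
print(choose_annotation(unsupported, keep_preds=True))
```

The first call shows that agreement between predictors alone does not produce
an annotation, and the second shows the effect of switching keep_preds on.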
>
>>
>> As far as I understood keep_preds=0 only retains gene models for which the ab
>> initio predictions agree. But how many, all three? two of three?
>>  keep_preds=1 instead keeps all gene models regardless of whether the
>> different programs agree, right?
>>
>
> MAKER does not take the presence of multiple ab initio predictions as evidence
> and thus in the absence of aligned physical evidence MAKER will not output an
> annotation even if all three ab initio tools predict a gene at that locus.
>
>>
>> In my case I get substantial differences in the number of gene models found
>> between the two settings, while with =1 I get a number that is close to what
>> we would expect. How would you interpret that? What would you recommend I
>> do? Obviously =0 is the safer option.
>>
>
> If you think that the number of genes you are getting from a MAKER run is too
> few, you could run MAKER with keep_preds=1.  After the run is finished, use a
> tool like IPRScan to search all MAKER predictions for protein domain content
> and push that IPRScan output back into the MAKER GFF file with the
> ipr_update_gff script.  Then as a final step you can run over the GFF file and
> remove any gene model that doesn't have either physical evidence (AED < 1) or
> protein domain content (Dbxref=PFAM:XXX etc.).  Sorry, there's not a script
> prepackaged with MAKER for that yet.
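
The final filtering step described above could look roughly like the Python
sketch below.  It rests on assumptions that should be checked against a real
file: that annotation rows carry "maker" in the source column, that the AED
is stored in an _AED attribute, and that the domain hits show up in a Dbxref
attribute containing a Pfam tag (matched here without regard to case).  A
gene is kept if it, or any of its mRNAs, passes either test.

```python
def parse_attributes(column9):
    """Turn 'key=value;key=value' into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def supported(attrs):
    """True if the feature has evidence (AED < 1) or a Pfam Dbxref."""
    aed = attrs.get("_AED")  # assumed attribute name
    has_evidence = aed is not None and float(aed) < 1.0
    has_domain = "pfam:" in attrs.get("Dbxref", "").lower()
    return has_evidence or has_domain


def filter_gff(lines):
    """Return the GFF3 rows with unsupported maker gene models removed."""
    rows = [line.rstrip("\n") for line in lines]
    keep_genes, mrna_parent = set(), {}
    # Pass 1: find genes that have support themselves or via an mRNA.
    for row in rows:
        cols = row.split("\t")
        if row.startswith("#") or len(cols) != 9 or cols[1] != "maker":
            continue
        attrs = parse_attributes(cols[8])
        if cols[2] == "gene" and supported(attrs):
            keep_genes.add(attrs.get("ID"))
        elif cols[2] == "mRNA":
            mrna_parent[attrs.get("ID")] = attrs.get("Parent")
            if supported(attrs):
                keep_genes.add(attrs.get("Parent"))
    # Pass 2: drop every maker row that belongs to an unsupported gene.
    kept = []
    for row in rows:
        cols = row.split("\t")
        if row.startswith("#") or len(cols) != 9 or cols[1] != "maker":
            kept.append(row)  # headers, evidence rows, FASTA: pass through
            continue
        attrs = parse_attributes(cols[8])
        if cols[2] == "gene":
            gene = attrs.get("ID")
        elif cols[2] == "mRNA":
            gene = attrs.get("Parent")
        else:  # exon, CDS, UTR: look up the gene through the mRNA
            gene = mrna_parent.get(attrs.get("Parent", "").split(",")[0])
        if gene in keep_genes:
            kept.append(row)
    return kept


def row(feature_type, attributes):
    """Build a made-up maker row for the demo below."""
    return "\t".join(["c1", "maker", feature_type, "1", "100", ".", "+", ".", attributes])


# Hypothetical input: g1 has evidence, g2 has nothing, g3 has a Pfam hit.
demo = [
    "##gff-version 3",
    row("gene", "ID=g1"),
    row("mRNA", "ID=g1-mRNA-1;Parent=g1;_AED=0.10"),
    row("exon", "ID=g1-mRNA-1:exon:1;Parent=g1-mRNA-1"),
    row("gene", "ID=g2"),
    row("mRNA", "ID=g2-mRNA-1;Parent=g2;_AED=1.00"),
    row("gene", "ID=g3"),
    row("mRNA", "ID=g3-mRNA-1;Parent=g3;_AED=1.00;Dbxref=Pfam:PF00069"),
]
for kept_row in filter_gff(demo):
    print(kept_row)
```

Keeping or dropping whole genes, rather than single transcripts, is a design
choice of this sketch, not something stated in the thread.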
>
>>
>>  2. I tried to use EST data of an alternative organism in altest= (#EST/cDNA
>> sequence file in fasta format from an alternate organism). The organism is
>> quite distantly related, but it's the closest I have so I thought I'd give it
>> a shot. I ran maker twice with identical settings except in altest and
>> est2genome=0/1. The number of genes predicted is identical with both
>> approaches, so I am not sure whether or not the EST data was actually used or
>> it's just too distantly related. Any easy way to assess this?
>>
>
> Typically EST evidence from another organism with alt_est will add little in
> the way of additional support (compared to just using protein evidence from
> say Swiss-prot) and this would be especially true if your alt_est is distantly
> related.  I'm not sure I really understand your alt_est/est2genome combos to
> comment in more detail.  I could see four possible combinations there: which
> two gave identical results?
>
>>
>>  3. I am running maker in several passes and after each pass I am training
>> SNAP using the result of the previous pass. Then for every pass I run maker
>> from scratch. Would you recommend supplying the gff of the previous pass in
>> "#-----Re-annotation Using MAKER Derived GFF3
>>  maker_gff= #re-annotate genome based on this gff3 file", instead?
>>
>
> No, 'Re-annotation using MAKER Derived GFF3' is used for re-annotation of a
> genome when you want certain parts of the previous run to be passed through
> unchanged, but with retraining SNAP you want MAKER to re-evaluate each
> annotation in light of the new predictions made by the retrained SNAP.  MAKER
> should run really fast in all of the runs after the first one because as long
> as you haven't changed the evidence files it won't have to redo any of the
> alignments.
>
>  B
>
>> Thanks in advance for any thoughts/advice on these things! I appreciate your
>> help!
>>  
>>  much obliged,
>>  Christoph
>>  
>>  _______________________________________________
>>  maker-devel mailing list
>>  maker-devel at box290.bluehost.com
>>  http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
> Barry Moore
> Research Scientist
> Dept. of Human Genetics
> University of Utah
> Salt Lake City, UT 84112
> --------------------------------------------
> (801) 585-3543
>