[maker-devel] FW: maker-control file

Thu Mar 6 08:03:10 MST 2014

MAKER wiki -->
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Main_Page

Thanks,
Carson

From:  Marc Höppner <marc.hoeppner at imbim.uu.se>
Date:  Thursday, March 6, 2014 at 7:40 AM
To:  Carson Holt <carsonhh at gmail.com>
Cc:  "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
Subject:  Re: [maker-devel] FW: maker-control file

Hi Carson, 

Thanks for the detailed feedback, this has cleared up a few things. I don’t
necessarily share your view on the problematic nature of RNA-seq data,
especially with newer protocols that give near-perfect strandedness. We work
a lot on transcriptome assembly, and with a stringent approach to transcript
assembly I think I got better results with est2genome than by trying to let
Maker work with a semi-refined ab-initio model. But it can be a bit tricky to
hit that sweet spot (we did validate > 4000 models manually in order to make
that sort of assessment, though).

But I will have another look at this and see if I can get Maker to do what I
need with the approach you describe. That reminds me, I think it would be
fantastic if you guys could put together a Wiki for Maker. This is such a
useful and powerful tool, but clearly there are many things that people
should get a proper explanation of that have only ever been discussed on this
list: best practices, experimental features, etc.

Regards,

Marc

On 06 Mar 2014, at 15:29, Carson Holt <carsonhh at gmail.com> wrote:

>> Wouldn’t it be more sensible to rely on the evidence over probabilistic
>> models?
> 
> Yes.  In fact that is the backbone of MAKER.  The evidence is used to derive
> hints that are passed back into the predictors, and the resulting models are
> reviewed in light of the evidence to decide on final models (no longer
> strictly probabilistic).  Take a look at the MAKER2 paper (Table 2 and
> Figure 1) and you will see that even when you use the wrong species
> parameters in the predictor (e.g. A. thaliana to annotate C. elegans) you get
> as much as a 3-fold increase in exon-level accuracy by using the hint
> feedback from MAKER.  With the est2genome option you don’t get that hint
> feedback (normally probabilistic models, EST evidence, and protein evidence
> would all work together), and the models are overall poorer and contain more
> false positives (we have looked at this a lot).
> 
> 
>> The annotation would be partial, but on the other hand the chance of
>> incorporating false signals is smaller (assuming I can generate a clean set
>> of transcripts from RNA-seq data)?
> 
> False signals are abundant.  It’s just the nature of how ESTs and especially
> mRNAseq reads are generated and anchored back to the assembly.  By letting
> there be feedback between the probabilistic model and the evidence (both
> protein and EST/mRNAseq) a lot of this is eliminated.
> 
> 
>> As an example, using SNAP and Augustus on a bird genome - with Augustus
>> achieving nucleotide and exon sensitivities in the 70-90% range - gave a host
>> of false exons that were simply not supported by the RNAseq data, yet made it
>> into the final gene build.
> 
> You will get false positives from an est2genome-only approach as well.  Models
> will be more partial, and the false negative rate will be very high (often
> 30-70%).  Also look at the MAKER2 paper, Figure 1.  The false positive rate
> from ab initio prediction alone can be quite high, but with the evidence
> feedback it is substantially reduced (especially for poorly trained
> predictors).
> 
> 
>> Is it possible to get some more details on how Maker uses ab-initio
>> predictions and reconciles them with evidence alignments? At the moment it
>> seems to me that Maker gives higher weight to the ab-initio predictions,
>> which to me seems problematic.
> 
> Take a look at the MAKER, MAKER2, and MAKER-P papers.  Final genes are chosen
> based on evidence overlap using AED (completely evidence based).  It is
> the model generation that leverages the hint-based feedback.  The names of
> MAKER genes tell you the source of the model.  Any time a hint-based
> model matches the evidence better, the name will look like this:
> maker-<contig>-<predictor>-gene-<ID> (e.g. maker-chr1-snap-gene-0.4)
> 
> When the ab initio model matches better than the hint-based model, the name
> looks like this:
> <predictor>-<contig>-abinit-gene-<ID> (e.g. snap-chr1-abinit-gene-0.2)
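The two naming patterns described above can be told apart mechanically. The following is a minimal sketch, assuming gene names follow exactly the two forms given in the message; the function name `classify_gene_name` and the regular expressions are illustrative and not part of MAKER, and names produced by other MAKER code paths would fall through as "unknown".

```python
import re

# Hypothetical patterns built only from the two forms quoted in the message.
HINT_BASED = re.compile(
    r"^maker-(?P<contig>.+)-(?P<predictor>[^-]+)-gene-(?P<id>[\d.]+)$"
)
AB_INITIO = re.compile(
    r"^(?P<predictor>[^-]+)-(?P<contig>.+)-abinit-gene-(?P<id>[\d.]+)$"
)


def classify_gene_name(name):
    """Return (source, predictor, contig) for a MAKER-style gene name."""
    match = HINT_BASED.match(name)
    if match:
        return ("hint-based", match.group("predictor"), match.group("contig"))
    match = AB_INITIO.match(name)
    if match:
        return ("ab initio", match.group("predictor"), match.group("contig"))
    # Anything else is outside the two patterns covered by this sketch.
    return ("unknown", None, None)


if __name__ == "__main__":
    for gene in ("maker-chr1-snap-gene-0.4", "snap-chr1-abinit-gene-0.2"):
        print(gene, classify_gene_name(gene))
```

Counting the two classes over a finished annotation gives a rough idea of how often the hint feedback changed the chosen model.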
> 
> 
> In summary, using est2genome alone (while good for generating training sets)
> undercuts the power of combining the evidence feedback with the probabilistic
> models.
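For readers unfamiliar with AED (annotation edit distance), mentioned above as the basis for choosing final genes, here is a rough sketch of the base-level calculation. It assumes the commonly cited definition AED = 1 - (SN + SP) / 2, where SN is the fraction of evidence bases covered by the model and SP is the fraction of model bases covered by evidence. The function names and the simple interval representation are illustrative, and MAKER's own implementation may differ in detail.

```python
def covered_bases(intervals):
    """Expand 1-based inclusive (start, end) intervals into a set of positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def aed(model_exons, evidence_exons):
    """Base-level AED sketch: 0.0 is perfect agreement, 1.0 is no overlap."""
    model = covered_bases(model_exons)
    evidence = covered_bases(evidence_exons)
    if not model or not evidence:
        # With nothing to compare against, treat the model as unsupported.
        return 1.0
    shared = len(model & evidence)
    sensitivity = shared / len(evidence)  # evidence bases recovered by the model
    specificity = shared / len(model)     # model bases backed by evidence
    return 1.0 - (sensitivity + specificity) / 2.0


if __name__ == "__main__":
    # A model that overlaps half of the evidence and vice versa.
    print(aed([(1, 100)], [(51, 150)]))
```

A model with exons that the evidence does not support is penalised through the specificity term, which is why unsupported exons raise the AED of a model.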
> 
> 
> Thanks,
> Carson
> 
> From: Marc Höppner <marc.hoeppner at imbim.uu.se>
> Date: Thursday, March 6, 2014 at 12:26 AM
> To: Carson Holt <carsonhh at gmail.com>
> Cc: "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
> Subject: Re: [maker-devel] FW: maker-control file
> 
> Hi,
> 
> I think this is an interesting comment that I would like a bit more
> information on:
> 
>> 
>> correct_est_fusion should not be used together with est2genome.  It won’t
>> fail, you just get odd results.  Actually est2genome should never be
>> used to generate the final annotation set.  It is a convenience method
>> that allows you to generate rough models for training gene predictors like
>> SNAP and Augustus.  But once they are trained it should be turned off,
>> because the models it produces will be partial (ESTs rarely cover the
>> whole transcript) and the results will have many false positives from
>> background transcription events in your EST data.  These models are good
>> enough to train with, but make very poor final annotations. So in the end
>> you should be using correct_est_fusion=1 with the SNAP or Augustus set and
>> not est2genome (which should already have been turned off by then).
>> 
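The advice above can be turned into a small sanity check on a control file before a final annotation run. This is a sketch only: it assumes the control file uses plain `key=value` lines with `#` comments, and the helper names `parse_ctl` and `check_final_run` are hypothetical, not part of MAKER. Only the two option names quoted above (`est2genome`, `correct_est_fusion`) are taken from the message.

```python
def parse_ctl(text):
    """Parse key=value lines, ignoring blank lines and '#' comments."""
    options = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()
    return options


def check_final_run(options):
    """Return warnings for settings the message advises against in a final run."""
    warnings = []
    if options.get("est2genome", "0") == "1":
        warnings.append(
            "est2genome=1: meant for building training sets, "
            "not for the final annotation"
        )
        if options.get("correct_est_fusion", "0") == "1":
            warnings.append(
                "correct_est_fusion=1 together with est2genome=1 "
                "gives odd results"
            )
    return warnings


if __name__ == "__main__":
    sample = "est2genome=1 # training pass\ncorrect_est_fusion=1\n"
    for warning in check_final_run(parse_ctl(sample)):
        print("WARNING:", warning)
```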
> 
> My experience has been that training gene finders, especially
> for complex genomes like vertebrates, is a very slow and painful process. And
> ultimately, the results are far from accurate, even with a sizeable, manually
> curated training set. Wouldn’t it be more sensible to rely on the evidence
> over probabilistic models? The annotation would be partial, but on the other
> hand the chance of incorporating false signals is smaller (assuming I can
> generate a clean set of transcripts from RNA-seq data)? And I’d rather
> underestimate the exon inventory slightly than put out an annotation with
> ~10% false exon calls.
> 
> As an example, using SNAP and Augustus on a bird genome - with Augustus
> achieving nucleotide and exon sensitivities in the 70-90% range - gave a host
> of false exons that were simply not supported by the RNAseq data, yet made it
> into the final gene build. Not sure what to think about that to be honest. Is
> it possible to get some more details on how Maker uses ab-initio predictions
> and reconciles them with evidence alignments? At the moment it seems to me
> that Maker gives higher weight to the ab-initio predictions, which to me seems
> problematic. 
> 
> 
> /Marc
