[maker-devel] FW: maker-control file
Carson Holt
carsonhh at gmail.com
Thu Mar 6 07:29:35 MST 2014
> Wouldn’t it be more sensible to rely on the evidence over probabilistic
> models?
Yes. In fact, that is the backbone of MAKER. The evidence is used to derive
hints that are passed back into the predictors, and the resulting models are
reviewed in light of the evidence to decide on final models (no longer
strictly probabilistic). Take a look at the MAKER2 paper (Table 2 and
Figure 1) and you will see that even when you use the wrong species
parameters in the predictor (e.g. A. thaliana parameters to annotate
C. elegans) you get as much as a 3-fold increase in exon-level accuracy by
using the hint feedback from MAKER. With the est2genome option you don’t get
that hint feedback (normally the probabilistic models, EST evidence, and
protein evidence would all work together), and the models are overall poorer
and contain more false positives (we have looked at this a lot).
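For concreteness, the relevant maker_opts.ctl settings look roughly like
this (the file names and species labels are only placeholders for your own
data):

# training run (only to get rough models for training SNAP/Augustus)
est=assembled_transcripts.fasta
protein=related_proteins.fasta
est2genome=1
protein2genome=1

# final annotation run (trained predictors plus evidence feedback)
est=assembled_transcripts.fasta
protein=related_proteins.fasta
est2genome=0
protein2genome=0
snaphmm=my_species.hmm
augustus_species=my_species
correct_est_fusion=1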
> The annotation would be partial, but on the other hand the chance of
> incorporating false signals is smaller (assuming I can generate a clean set
> of transcripts from RNA-seq data)?
False signals are abundant. It’s just the nature of how ESTs and especially
mRNAseq reads are generated and anchored back to the assembly. By allowing
feedback between the probabilistic models and the evidence (both protein and
EST/mRNAseq), a lot of this is eliminated.
> As an example, using SNAP and Augustus on a bird genome - with Augustus
> achieving nucleotide and exon sensitivities in the 70-90% range - gave a host
> of false exons that were simply not supported by the RNAseq data, yet made it
> into the final gene build.
You will get false positives from an est2genome-only approach as well.
Models will be more partial, and the false negative rate will be very high
(often 30-70%). Also look at Figure 1 in the MAKER2 paper. The false
positive rate from ab initio alone can be quite high, but with the evidence
feedback it is substantially reduced (especially for poorly trained
predictors).
> Is it possible to get some more details on how MAKER uses ab initio predictions
> and reconciles them with evidence alignments? At the moment it seems to me
> that MAKER gives higher weight to the ab initio predictions, which to me seems
> problematic.
Take a look at the MAKER, MAKER2, and MAKER-P papers. Final genes are
chosen based on evidence overlap using AED (a completely evidence-based
measure).
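For reference, AED scores a model against the overlapping evidence at the
nucleotide level; here is a minimal Python sketch of the calculation as I
understand the published definition (an illustration only, not MAKER’s
actual code):

def aed(model_bases, evidence_bases):
    # model_bases / evidence_bases are sets of genomic positions covered by
    # the gene model and by the aligned evidence, respectively.
    # AED = 1 - (SN + SP) / 2; 0 means perfect agreement, 1 means none.
    if not model_bases or not evidence_bases:
        return 1.0
    overlap = len(model_bases & evidence_bases)
    sn = overlap / len(evidence_bases)  # fraction of evidence covered by the model
    sp = overlap / len(model_bases)     # fraction of the model supported by evidence
    return 1.0 - (sn + sp) / 2.0

# example: a model covering bases 100-200 vs. evidence covering bases 150-250
print(round(aed(set(range(100, 201)), set(range(150, 251))), 3))  # -> 0.495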
It is the model generation that leverages the hint-based feedback. The
names of MAKER genes can let you know what the source of the model is. Any
time the hint-based model matches the evidence better, the name will look
like this:
maker-<contig>-<predictor>-gene-<ID> (e.g. maker-chr1-snap-gene-0.4)
When the ab initio model matches better than the hint-based model, the name
looks like this:
<predictor>-<contig>-abinit-gene-<ID> (e.g. snap-chr1-abinit-gene-0.2)
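If you want a quick tally of how many final models came from each source,
something like this rough Python sketch over the output GFF3 works (the file
name is just a placeholder; it keys on the name patterns above):

import re
from collections import Counter

counts = Counter()
with open("genome.all.gff") as gff:  # placeholder: MAKER's merged GFF3
    for line in gff:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue
        m = re.search(r"Name=([^;\s]+)", cols[8])
        if not m:
            continue
        if "-abinit-" in m.group(1):
            counts["ab initio model kept"] += 1
        elif m.group(1).startswith("maker-"):
            counts["hint-based (maker) model"] += 1

print(counts)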
In summary, using est2genome alone (while good for generating training sets)
undercuts the power of the evidence feedback together with the probabilistic
models.
Thanks,
Carson
From: Marc Höppner <marc.hoeppner at imbim.uu.se>
Date: Thursday, March 6, 2014 at 12:26 AM
To: Carson Holt <carsonhh at gmail.com>
Cc: "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
Subject: Re: [maker-devel] FW: maker-control file
Hi,
I think this is an interesting comment that I would like some more
information on:
>
> correct_est_fusion should not be used together with est2genome. It won’t
> fail, you just get odd results. Actually, est2genome should not ever be
> used to generate the final annotation set. It is a convenience method
> that allows you to generate rough models for training gene predictors like
> SNAP and Augustus. But once they are trained it should be turned off,
> because the models it produces will be partial (ESTs rarely cover the
> whole transcript) and the results will have many false positives from
> background transcription events in your EST data. These models are good
> enough to train with, but make very poor final annotations. So in the end
> you should be using correct_est_fusion=1 with the SNAP or Augustus
> predictors set and not est2genome (which should already have been turned
> off by then).
>
My experience has been that the process of training gene finders, especially
for complex genomes like vertebrates, is a very slow and painful process.
And ultimately, the results are far from accurate, even with a sizeable,
manually curated training set. Wouldn’t it be more sensible to rely on the
evidence over probabilistic models? The annotation would be partial, but on
the other hand the chance of incorporating false signals is smaller
(assuming I can generate a clean set of transcripts from RNA-seq data)? And
I’d rather underestimate the exon inventory slightly than put out an
annotation with ~10% false exon calls.
As an example, using SNAP and Augustus on a bird genome - with Augustus
achieving nucleotide and exon sensitivities in the 70-90% range - gave a host
of false exons that were simply not supported by the RNAseq data, yet made
it into the final gene build. Not sure what to think about that, to be
honest. Is it possible to get some more details on how MAKER uses ab initio
predictions and reconciles them with evidence alignments? At the moment it
seems to me that MAKER gives higher weight to the ab initio predictions,
which to me seems problematic.
/Marc