[maker-devel] FW: maker-control file

Marc Höppner marc.hoeppner at imbim.uu.se
Thu Mar 6 07:40:48 MST 2014


Hi Carson,

Thanks for the detailed feedback, this has cleared up a few things. I don’t necessarily share your view on the problematic nature of RNA-seq data - especially with newer protocols offering near-perfect strandedness. We work a lot on transcriptome assembly, and with a stringent approach to transcript assembly I think I have gotten better results with est2genome than by letting Maker work with a semi-refined ab initio model. But it can be a bit tricky to hit that sweet spot (we did validate > 4000 models manually in order to make that sort of assessment, though).

But I will have another look at this and see if I can get Maker to do what I need with the approach you describe. That reminds me: I think it would be fantastic if you guys could put together a wiki for Maker. This is such a useful and powerful tool, but clearly there are many things that people should get a proper explanation of - best practices, experimental features, etc. - that have so far only been discussed on this list.

Regards,

Marc



On 06 Mar 2014, at 15:29, Carson Holt <carsonhh at gmail.com> wrote:

Wouldn’t it be more sensible to rely on the evidence over probabilistic models?

Yes.  In fact, that is the backbone of MAKER.  The evidence is used to derive hints that are passed back into the predictors and reviewed in light of the evidence to decide on final models (no longer strictly probabilistic).  Take a look at the MAKER2 paper (Table 2 and Figure 1) and you will see that even when you use the wrong species parameters in the predictor (i.e. A. thaliana parameters to annotate C. elegans) you get as much as a 3-fold increase in exon-level accuracy by using the hint feedback from MAKER.  With the est2genome option you don’t get that hint feedback (normally the probabilistic models, EST evidence, and protein evidence would all work together), and the models are overall poorer and contain more false positives (we have looked at this a lot).


The annotation would be partial, but on the other hand the chance of incorporating false signals is smaller (assuming I can generate a clean set of transcripts from RNA-seq data)?

False signals are abundant.  It’s just the nature of how ESTs and especially mRNA-seq reads are generated and anchored back to the assembly.  By letting there be feedback between the probabilistic model and the evidence (both protein and EST/mRNA-seq), a lot of this is eliminated.


As an example, using SNAP and Augustus on a bird genome - with Augustus achieving nucleotide and exon sensitivities in the 70-90% range - gave a host of false exons that were simply not supported by the RNA-seq data, yet made it into the final gene build.

You will get false positives from the est2genome-only approach as well.  Models will be more partial, and the false negative rate will be very high (often a 30-70% false negative rate).  Also look at Figure 1 of the MAKER2 paper.  The false positive rate from ab initio prediction alone can be quite high, but with the evidence feedback it is substantially reduced (especially for poorly trained predictors).


Is it possible to get some more details on how Maker uses ab initio predictions and reconciles them with evidence alignments? At the moment it seems to me that Maker gives higher weight to the ab initio predictions, which to me seems problematic.

Take a look at the MAKER, MAKER2, and MAKER-P papers.  Final genes are chosen based on evidence overlap using AED (completely evidence based).  It is the model generation that leverages the hint-based feedback.  The names of MAKER genes tell you what the source of each model is.  Any time the hint-based model matches the evidence better, the name will look like this —>
maker-<contig>-<predictor>-gene-<ID> (e.g. maker-chr1-snap-gene-0.4)

When the ab initio model matches the evidence better than the hint-based model, the name looks like this —>
<predictor>-<contig>-abinit-gene-<ID> (e.g. snap-chr1-abinit-gene-0.2)
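
If you want to see how many of your final models came from the hint-based feedback versus a raw ab initio prediction, a quick tally over the merged output GFF3 based on that naming convention works. Here is a rough Python sketch; the file name genome.all.gff and the reliance on the Name attribute are just assumptions, so adjust to your own output:

import re
from collections import Counter

counts = Counter()
with open("genome.all.gff") as gff:          # assumed: merged MAKER GFF3
    for line in gff:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue
        match = re.search(r"Name=([^;]+)", cols[8])
        if not match:
            continue
        name = match.group(1)
        if "-abinit-" in name:
            counts["ab initio model kept"] += 1   # e.g. snap-chr1-abinit-gene-0.2
        elif name.startswith("maker-"):
            counts["hint-based model"] += 1       # e.g. maker-chr1-snap-gene-0.4
        else:
            counts["other"] += 1

for source, n in counts.items():
    print(source, n, sep="\t")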


In summary, using est2genome alone (while good for generating training sets) undercuts the power of the evidence feedback working together with the probabilistic models.
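
In practice that means two rounds of control files: a first pass with est2genome=1 just to produce rough models for training SNAP/Augustus, then a final pass with the trained predictor files plugged in, est2genome turned back off, and correct_est_fusion=1. As a rough Python sketch of that switch (the option names are standard maker_opts.ctl keys, but the HMM/profile names and file names below are just placeholders for your own):

# Patch maker_opts.ctl from the training round to the final round.
FINAL_ROUND = {
    "est2genome": "0",            # stop promoting raw EST alignments to annotations
    "correct_est_fusion": "1",    # as recommended once real predictors are set
    "snaphmm": "my_species.hmm",       # placeholder: SNAP HMM trained in round 1
    "augustus_species": "my_species",  # placeholder: trained Augustus profile
}

def patch_ctl(in_path="maker_opts.ctl", out_path="maker_opts.final.ctl"):
    """Rewrite the listed key=value lines, leaving everything else untouched."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            key = line.split("=", 1)[0].strip()
            if "=" in line and key in FINAL_ROUND:
                comment = ""
                if "#" in line:                  # keep MAKER's inline comments
                    comment = " #" + line.split("#", 1)[1].rstrip("\n")
                dst.write(key + "=" + FINAL_ROUND[key] + comment + "\n")
            else:
                dst.write(line)

if __name__ == "__main__":
    patch_ctl()

Editing the existing file line by line rather than regenerating it means all the other settings (evidence files, repeat options, etc.) carry over unchanged from the training round.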


Thanks,
Carson

From: Marc Höppner <marc.hoeppner at imbim.uu.se>
Date: Thursday, March 6, 2014 at 12:26 AM
To: Carson Holt <carsonhh at gmail.com>
Cc: "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
Subject: Re: [maker-devel] FW: maker-control file

Hi,

I think this is an interesting comment that I would like a bit more information on:


correct_est_fusion should not be used together with est2genome.  It won’t
fail, you just get odd results.  Actually, est2genome should never be
used to generate the final annotation set.  It is a convenience method
that allows you to generate rough models for training gene predictors like
SNAP and Augustus.  But once they are trained it should be turned off,
because the models it produces will be partial (ESTs rarely cover the
whole transcript) and the results will have many false positives from
background transcription events in your EST data.  These models are good
enough to train with, but make very poor final annotations.  So in the end
you should be using correct_est_fusion=1 with SNAP or Augustus set, and
not est2genome (which should already have been turned off by then).


My experience has been that the process of training gene finders, especially for complex genomes like vertebrates, is a very slow and painful process. And ultimately, the results are far from accurate, even with a sizeable, manually curated training set. Wouldn’t it be more sensible to rely on the evidence over probabilistic models? The annotation would be partial, but on the other hand the chance of incorporating false signals is smaller (assuming I can generate a clean set of transcripts from RNA-seq data)? And I’d rather underestimate the exon inventory slightly than put out an annotation with ~10% false exon calls.

As an example, using SNAP and Augustus on a bird genome - with Augustus achieving nucleotide and exon sensitivities in the 70-90% range - gave a host of false exons that were simply not supported by the RNA-seq data, yet made it into the final gene build. Not sure what to think about that, to be honest. Is it possible to get some more details on how Maker uses ab initio predictions and reconciles them with evidence alignments? At the moment it seems to me that Maker gives higher weight to the ab initio predictions, which to me seems problematic.


/Marc
