[maker-devel] Better resolve conflicting gene models
Kai Kamm
kai.kamm at ecolevol.de
Thu Mar 5 09:47:02 MST 2015
Hello, thanks for your previous advice.
(Btw, how can one reply to an existing thread such that the reply will
be added to the same thread?)
I am trying to find the best parameters with Maker for the annotation of
my genome. I have run Maker with several combinations of parameters and
predictors on my three biggest scaffolds and looked at the results in
JBrowse. Overall most predictions seem fine, but a few genes show
conflicts between models that I cannot explain.
I have:
- 100 Mb assembled genome
- Trinity RNA-seq assembly
- Cufflinks data (in my case it does not seem to be messy as suggested,
but rather a good complement to the Trinity data)
- protein evidence (related and unrelated species)
- repeat library from RepeatModeler
Gene predictors used:
- Augustus trained with transcripts from related species: seems to
perform fine
- SNAP: no convergence with Augustus even after a second round of
training. Dropped it because it predicted lots of additional low-quality
transcripts and sometimes disrupted final Maker transcripts.
- Genemark: converged with Augustus after training (introns taken
from TopHat2 output). Tends to predict some additional transcripts
(compared to Augustus). A few of these are covered by evidence
and thus become final Maker transcripts.
So the combination of Augustus and Genemark seems optimal. In general
both perform well in Maker and tend to predict the same transcripts.
However, I still observe some problems in the behavior of Maker which I
don't understand:
Example 1: One of the predictors predicts a small additional exon at the
start, which is also covered by protein or EST data. But sometimes Maker
chooses the other predictor's model for the final transcript. Mostly
these are minor differences, but I don't understand this behavior.
Example 2: There are some extreme cases, like an Augustus prediction with
17 exons which are all covered by Trinity and Cufflinks isoforms.
Genemark instead predicts two separate small genes with 2 and 4 exons
respectively. The resulting final transcript has 7 exons and the
additional evidence from the Trinity and Cufflinks data is treated as UTR.
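Cases like Example 2 could also be searched for systematically instead
of by eye in JBrowse. Below is a rough Python sketch that counts
exon-like features per transcript and per source in GFF3 lines, so that
final models with far fewer exons than an overlapping prediction stand
out. The function name exon_counts, the feature types it counts (exon
and match_part) and the source labels in the usage lines are assumptions
made for illustration; they should be checked against the actual GFF3
output before relying on this.

```python
from collections import Counter


def exon_counts(gff_lines, part_types=("exon", "match_part")):
    """Count exon-like features per (source, parent ID) in GFF3 lines.

    Assumption: exon-like parts carry a Parent attribute and have a
    feature type listed in part_types. Adjust to the real file.
    """
    counts = Counter()
    for line in gff_lines:
        # Skip blank lines, directives and comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # A GFF3 feature line has exactly nine tab-separated columns.
        if len(cols) != 9 or cols[2] not in part_types:
            continue
        # Column 9 holds key=value pairs separated by semicolons.
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        # A feature may list several parents, separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                counts[(cols[1], parent)] += 1
    return counts
```

Usage would be along these lines, with a hypothetical file name:

```python
with open("scaffold1.gff") as handle:
    for (source, parent), n in sorted(exon_counts(handle).items()):
        print(source, parent, n, sep="\t")
```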
So I thought Augustus seemed a little more accurate and ran Maker only
with Augustus to resolve such conflicts, even though I would lose the
few additional transcripts from Genemark.
This is what happened:
- The gene in Example 2 now has all the 17 exons. This is good!
- Sadly, another gene with several exons, which was formerly predicted by
both Augustus and Genemark and is also covered by Cufflinks and Trinity
transcripts, now consists of only two small exons in the final
transcript. This happens even though Augustus still predicts the same
exons and the same evidence is present. The only thing missing is the
Genemark prediction, which was almost identical to the Augustus one.
This I don't understand at all.
I don't worry about the minor differences. The extreme cases are rare,
about two genes in a hundred, but I don't understand the behavior. I was
thinking that in case of conflicting models Maker will choose the one
that best fits the evidence. For most conflicts this is evidently what
happens, because the majority of the final models look OK. But it does
not happen in the cases described above, and I don't understand why.
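To make "best fits the evidence" concrete: as far as I understand, Maker
scores this kind of fit with the Annotation Edit Distance (AED), defined
as 1 - (sensitivity + specificity) / 2 at the base level. The Python
sketch below computes only this textbook formula on exon coordinates.
It is not Maker's implementation, and the function names and the toy
coordinates are made up for illustration. How Maker selects and weights
the evidence that goes into the score is not modeled here.

```python
def covered_bases(intervals):
    """Return the set of positions covered by inclusive (start, end) pairs."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def aed(model_exons, evidence_exons):
    """Base-level AED: 0.0 is perfect agreement, 1.0 is no overlap.

    Simplified sketch; uses sets of positions, so it is only meant
    for small examples, not whole genomes.
    """
    model = covered_bases(model_exons)
    evidence = covered_bases(evidence_exons)
    if not model or not evidence:
        return 1.0
    shared = len(model & evidence)
    sensitivity = shared / len(evidence)  # evidence bases the model recovers
    specificity = shared / len(model)     # model bases supported by evidence
    return 1.0 - (sensitivity + specificity) / 2.0


# Toy example: evidence supports two exons, the model keeps only the first.
evidence = [(1, 100), (201, 300)]
print(aed([(1, 100), (201, 300)], evidence))  # full model
print(aed([(1, 100)], evidence))              # truncated model
```

Under this formula a truncated model that drops well-supported exons
gets a worse (higher) score than the full model, which is why the cases
above surprise me.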
Is there any parameter I missed to better resolve such conflicts?
Best