[maker-devel] Better resolve conflicting gene models

Thu Mar 5 09:47:02 MST 2015

Hello, thanks for your previous advice.

(Btw, how can one reply to an existing thread such that the reply will 
be added to the same thread?)

I am trying to find the best parameters with Maker for the annotation of 
my genome. I have run Maker with several combinations of parameters and 
predictors on my three biggest scaffolds and looked at the results in 
JBrowse. Overall most predictions seem fine, but some genes show
conflicts that I cannot explain.

I have:

- 100Mb assembled genome
- Trinity RNAseq assembly
- cufflinks data (in my case they don't seem to be messy as suggested,
but rather a good complement to the Trinity data)
- protein evidence (related and unrelated species)
- repeat library from RepeatModeler

Gene predictors used:

- Augustus trained with transcripts from related species: seems to 
perform fine

- SNAP: no convergence with Augustus even after a second round of
training. Dropped it because it predicted lots of additional low-quality
transcripts and sometimes disrupted final Maker transcripts.

- Genemark: converged with Augustus after training (introns taken
from TopHat2 output). Tends to predict some additional transcripts
(compared to Augustus). A few of these are covered by evidence
and thus become final Maker transcripts.

So the combination of Augustus and Genemark seems optimal. In general 
both perform well in Maker and tend to predict the same transcripts.

However, I still observe some problems in the behavior of Maker which I 
don't understand:

Example 1: One of the predictors predicts a small additional exon at the
start which is also covered by protein or EST data. But sometimes Maker
chooses the other predictor's model for the final transcript. Mostly
these are minor differences, but I don't understand this behavior.

Example 2: There are some extreme cases, like an Augustus prediction with
17 exons that are all covered by Trinity and cufflinks isoforms.
Genemark instead predicts two separate small genes with 2 and 4 exons
respectively. The resulting final transcript has 7 exons, and the
additional evidence from the Trinity and cufflinks data is treated as UTR.

So I thought Augustus seemed a little more accurate and ran Maker only
with Augustus to resolve such conflicts, even though I would lose the
few additional transcripts from Genemark.

This is what happened:

- The gene in Example 2 now has all the 17 exons. This is good!

- Sadly, another gene with several exons, which was formerly predicted by
both Augustus and Genemark and is also covered by cufflinks and Trinity
transcripts, now consists of only two small exons in the final
transcript. Augustus still predicts the same exons and the same evidence
is present; only the Genemark prediction, which was almost identical to
the Augustus one, is absent. I don't understand this at all.

I am not worried about the minor differences. The extreme cases affect
about two genes in a hundred, and I don't understand the behavior. I was
thinking that in case of conflicting models Maker will choose the one
that best fits the evidence. Obviously with most conflicts this is what
happens, because the majority of the final models look OK. But not in the
cases mentioned above, and I don't understand why.
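
To make concrete what I mean by "best fits the evidence", here is a
minimal Python sketch of the comparison I have in mind: a base-level
score in the style of annotation edit distance (AED), computed between
the exons of a model and the exons of the evidence. The coordinates and
function names are made up for illustration, and this is only my
assumption of how such a score could be computed, not Maker's actual
code.

```python
def covered_bases(exons):
    """Return the set of positions covered by (start, end) exons (1-based, inclusive)."""
    bases = set()
    for start, end in exons:
        bases.update(range(start, end + 1))
    return bases


def aed_like(model_exons, evidence_exons):
    """AED-style distance: 0.0 means perfect agreement, 1.0 means no overlap."""
    model = covered_bases(model_exons)
    evidence = covered_bases(evidence_exons)
    if not model or not evidence:
        return 1.0
    shared = len(model & evidence)
    sensitivity = shared / len(evidence)  # fraction of evidence bases in the model
    specificity = shared / len(model)     # fraction of model bases backed by evidence
    return 1.0 - (sensitivity + specificity) / 2.0


# Hypothetical coordinates, not taken from my scaffolds.
evidence = [(100, 199), (300, 399), (500, 599)]
full_model = [(100, 199), (300, 399), (500, 599)]   # covers all evidence exons
split_model = [(100, 199), (300, 399)]              # misses the last exon

scores = {
    "full": aed_like(full_model, evidence),
    "split": aed_like(split_model, evidence),
}
best = min(scores, key=scores.get)  # lower distance = better fit
print(scores, best)
```

By this kind of measure I would expect the model that covers all
evidence exons to win, which is why the cases above surprise me.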

Is there a parameter I missed that would resolve such conflicts better?

Best