[maker-devel] Better resolve conflicting gene models

Thu Mar 12 13:50:44 MDT 2015

Sorry for the slow reply.

> how can one reply to an existing thread such that the reply will be added to the same thread?

Threads show up under the Google archive, but that is really just an artifact of how things are archived. The mailing list itself doesn't have threads, just a subject line like any other e-mail. Because we archive to a message board, any messages with the same subject get turned into parts of the same thread. So to land in an existing thread, reply with the subject line unchanged.

> Example 1: One of the predictors predicts a small additional exon at the start which is also covered by protein or EST data. But sometimes Maker chooses the other predictors model for the final transcript. Mostly these are minor differences but I don't understand this behavior?

The gene chosen by MAKER is the one that best matches the evidence. The match is scored with a numeric value called AED, Annotation Edit Distance (lower means better match). If a model predicts an exon or a base pair that is not supported by an evidence alignment, then it will be penalized. If a model fails to predict a base pair that is supported by evidence, then it will also be penalized. The differences in your results are because you changed something between runs: either you added or removed a predictor (which changes the set of models MAKER can choose from) or the evidence changed (which changes the AED scores). Whichever model has the lowest AED score will always be kept, so some change you made resulted in either a different set of available models to choose from or a different AED score.
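To make the two penalties concrete, here is a minimal sketch in Python. It assumes AED as it is usually described, 1 - (SN + SP) / 2, where SN is the fraction of evidence bases the model covers and SP is the fraction of model bases the evidence supports. The function names and coordinates are made up for illustration, and this is not MAKER's actual code, which also has to decide which alignments count as evidence for a given model.

```python
def bases(intervals):
    """Expand 1-based inclusive (start, end) intervals into a set of positions."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end + 1))
    return covered


def aed(model_exons, evidence_alignments):
    """Simplified base-level AED: 0.0 is a perfect match, 1.0 is no support."""
    model = bases(model_exons)
    evidence = bases(evidence_alignments)
    if not model or not evidence:
        return 1.0
    shared = len(model & evidence)
    sensitivity = shared / len(evidence)  # drops when supported bases are missed
    specificity = shared / len(model)     # drops when unsupported bases are predicted
    return 1.0 - (sensitivity + specificity) / 2.0


# Hypothetical evidence: three aligned blocks, the first one a small 5' exon.
evidence = [(100, 150), (200, 300), (400, 500)]

model_a = [(100, 150), (200, 300), (400, 500)]              # matches the evidence exactly
model_b = [(200, 300), (400, 500)]                          # misses the supported first exon
model_c = [(100, 150), (200, 300), (400, 500), (600, 650)]  # adds an unsupported exon

scores = {
    "model_a": aed(model_a, evidence),
    "model_b": aed(model_b, evidence),
    "model_c": aed(model_c, evidence),
}
best = min(scores, key=scores.get)
```

With these made-up coordinates model_a scores 0.0, model_b about 0.10 and model_c about 0.08, so model_a is kept. If model_a were not available (for example because its predictor was dropped from the run), the choice would fall to whichever remaining model scores lower.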

Also be aware that models cannot overlap on the same strand, so if you made a change that results in a model being removed from consideration, you may have removed overlap that was precluding another model with a slightly higher AED from being chosen.
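The overlap effect can be sketched as a greedy pass over candidate models sorted by AED. This is an assumption made for illustration (MAKER's real selection logic is more involved), and the Model type, names, coordinates and scores are all hypothetical.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    strand: str
    start: int
    end: int
    aed: float


def overlaps(a, b):
    """True if two models share at least one base on the same strand."""
    return a.strand == b.strand and a.start <= b.end and b.start <= a.end


def pick_models(candidates):
    """Keep models in order of increasing AED, skipping any that overlap a kept one."""
    chosen = []
    for model in sorted(candidates, key=lambda m: m.aed):
        if not any(overlaps(model, kept) for kept in chosen):
            chosen.append(model)
    return sorted(chosen, key=lambda m: m.start)


a = Model("A", "+", 100, 900, 0.10)   # long model with the best score
b = Model("B", "+", 800, 1500, 0.15)  # overlaps A
c = Model("C", "+", 100, 700, 0.20)   # overlaps A but not B

with_a = [m.name for m in pick_models([a, b, c])]
without_a = [m.name for m in pick_models([b, c])]
```

In this toy case with_a is ["A"] and without_a is ["C", "B"]: once A is no longer available, two models with higher AED that A was blocking are both kept.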

> 
> Example 2: there are some extreme cases like an Augustus prediction with 17 exons which are all covered by Trinity and cufflinks isoforms. Genemark instead predicts two separate small genes with 2 and 4 exons respectively. The resulting final transcript has 7 exons and the additional evidence from the trinity and cufflinks data is treated as UTR.
> 
> - Sadly another gene with several exons, which was formerly predicted by both Augustus and Genemark and is also covered by cufflinks and trinity transcripts, now consists only of two small exons in the final transcript. 
> Even though Augustus still predicts the same exons and the same evidence is present - only the Genemark prediction is absent which was almost identical to Augustus. This I completely don't understand.

The model chosen will always be the one with the lowest AED. The changes you describe all come down to a model with a lower AED supplanting another, which may have opened up a region that was previously overlapped by a longer gene with a poor AED score.

I would also recommend not including Cufflinks output if you have Trinity data. Cufflinks results often falsely bridge regions and will cause gene mergers by falsely extending evidence clusters. This makes it appear as if genes should extend into regions where they shouldn't. Given those hints, predictors will then try to predict at least something in those regions, and if they do, the models will get a good AED score because they are supported by evidence, even if that evidence is false.
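As a rough illustration of how a single bridging transcript extends an evidence cluster, the sketch below groups alignments into clusters purely by coordinate overlap. That clustering rule, the function name and the coordinates are assumptions for the example, not a description of how MAKER builds its clusters.

```python
def cluster(alignments):
    """Merge overlapping (start, end) alignments into clusters."""
    clusters = []
    for start, end in sorted(alignments):
        if clusters and start <= clusters[-1][1]:
            # Overlaps the current cluster, so extend it.
            clusters[-1][1] = max(clusters[-1][1], end)
        else:
            clusters.append([start, end])
    return [tuple(c) for c in clusters]


# Hypothetical Trinity alignments supporting two separate genes.
trinity = [(100, 500), (300, 900), (2000, 2600), (2400, 3000)]
# Hypothetical Cufflinks transcript that falsely spans the gap between them.
bridging = (800, 2100)

separate = cluster(trinity)
merged = cluster(trinity + [bridging])
```

Here separate is [(100, 900), (2000, 3000)], while merged collapses to the single cluster [(100, 3000)]. A model spanning that whole region would then look well supported.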

—Carson