[maker-devel] Evaluating Genome Annotation
Jason Gallant
jason.r.gallant at gmail.com
Wed Feb 25 09:40:40 MST 2015
Hi Folks,
I'm in the process of evaluating the genome annotation that I produced
using AWS (see earlier message). This is a de novo genome assembly, for
which there is no closely related species. As such, I followed the
standard procedure using only SNAP (for starters).
Genome: 4,668 scaffolds with N50 > 1.7 Mb
Custom RepeatMasker database
Round1:
est2genome=1
protein2genome=1
Concatenated RefSeq proteins from 3 vertebrates
Cufflinks assembly of 12 tissues (~253,000,000 reads)
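For reference, the round-1 settings above correspond to roughly the following in maker_opts.ctl (file names here are placeholders, not my actual paths):

```shell
# maker_opts.ctl (round 1) -- gene models built directly from evidence
genome=genome.fasta               # placeholder for my assembly
est=cufflinks_transcripts.fasta   # Cufflinks assembly of 12 tissues
protein=refseq_3vert.fasta        # concatenated RefSeq proteins, 3 vertebrates
rmlib=custom_repeats.fasta        # custom RepeatMasker library
est2genome=1                      # infer gene models directly from ESTs
protein2genome=1                  # infer gene models directly from proteins
```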
Using this entire set of genes, I created a SNAP HMM, following the online
tutorial, and ran a second round of Maker:
Round 2:
est2genome=0
protein2genome=0
snaphmm=round1.hmm
Concatenated RefSeq proteins from 3 vertebrates
Cufflinks assembly of 12 tissues (~253,000,000 reads)
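In each case the HMM training itself followed the standard maker2zff/fathom/forge recipe from the tutorial, roughly like this (file names are placeholders for my actual outputs):

```shell
# Sketch of the SNAP training recipe from the MAKER tutorial
maker2zff round1.all.gff                  # export MAKER gene models to ZFF
fathom genome.ann genome.dna -categorize 1000   # pull genes + 1 kb flanks
fathom uni.ann uni.dna -export 1000 -plus       # export unique genes, plus strand
forge export.ann export.dna               # estimate model parameters
hmm-assembler.pl round1 . > round1.hmm    # assemble the HMM for the next round
```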
I used the resulting genes to train a second SNAP HMM, as suggested by the
tutorial and ran a third round of Maker:
Round 3:
est2genome=0
protein2genome=0
snaphmm=round2.hmm
Concatenated RefSeq proteins from 3 vertebrates
Cufflinks assembly of 12 tissues (~253,000,000 reads)
I'm concerned that the multiple iterations did not really improve my
annotation. Here are some of the metrics that I've been able to calculate
thus far:
Using Fathom:
Round 1 contains 44,883 genes (43,364 multi-exon) over 1,410 sequences
Round 2 contains 15,946 genes (15,812 multi-exon) over 1,166 sequences
Round 3 contains 15,514 genes (15,389 multi-exon) over 1,147 sequences
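Those counts came from fathom's gene statistics on the exported annotations, along these lines (file names are placeholders):

```shell
# Gene-level statistics for each round's exported annotation set
fathom genome.ann genome.dna -gene-stats
```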
Using the AED_cdf_generator.pl script, I calculated the cumulative AED
distribution for each round. I suspect the first-round distribution is not
very informative, since those gene models are derived directly from the
evidence. Interestingly, rounds 2 and 3 had remarkably similar AED
distributions throughout the table: in round 2, 92% of my genes had an AED
of 0.5 or lower, whereas in round 3, 91.9% did.
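For anyone without the script handy, the cumulative table it reports can be reproduced in a few lines of Python (a sketch, assuming the AED values have already been parsed out of the _AED attributes on the mRNA lines of the GFF3):

```python
# Sketch: cumulative AED distribution, as reported by AED_cdf_generator.pl.
# `aeds` stands in for AED values parsed from the GFF3; 0 = perfect
# agreement with evidence, 1 = no evidence support.
def aed_cdf(aeds, step=0.025):
    """Fraction of gene models at or below each AED cutoff."""
    n = len(aeds)
    cutoffs = [round(i * step, 3) for i in range(int(1 / step) + 1)]
    return {c: sum(1 for a in aeds if a <= c) / n for c in cutoffs}

cdf = aed_cdf([0.0, 0.1, 0.2, 0.45, 0.6, 0.9])
print(cdf[0.5])  # fraction of models with AED <= 0.5
```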
Here are my questions:
1) I would have expected each iteration to predict more genes; instead,
each iteration produces fewer (although the difference between rounds 2
and 3 is very slight).
2) The AED distributions are nearly identical between rounds 2 and 3,
though from the tutorials and my reading it sounds like there should be a
marked improvement.
3) It strikes me that roughly 75% of the genome is not being used (my
suspicion is that many scaffolds fall below MAKER's minimum length
threshold for consideration). Should I lower this threshold so that more
of the genome is considered?
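(For reference, I believe the knob in question is the minimum contig length in maker_opts.ctl; something like:)

```shell
# maker_opts.ctl -- scaffolds shorter than this are skipped entirely;
# lowering it should bring more of the assembly into the annotation run
min_contig=1000
```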
4) I realize that after round 1 I trained the HMM on all of the genes
predicted from the evidence, whereas some have taken the approach of
restricting this set to the "best" genes, or of using something like
CEGMA. Is my method of HMM construction to blame?
5) Am I worried about nothing here? Is this a pretty decent annotation?
Thanks for any input you folks are able to provide!
Happy annotating!
Jason Gallant