[maker-devel] Evaluating Genome Annotation
Carson Holt
carsonhh at gmail.com
Wed Feb 25 10:25:30 MST 2015
> Here are my questions:
> 1) I suppose I would have expected more genes predicted; instead each iteration seems to produce fewer genes (although there is only a very slight difference between rounds 2 & 3)
Your first round should over-predict, especially if it is based off of Cufflinks results (which are very noisy). Your second and third rounds look about right for many organisms (the two should be similar in gene count), but if you believe the count is low for your organism, run CEGMA to estimate your genome's completeness (i.e. if your genome is 85% complete, then you expect your final number from MAKER to represent about 85% of the true number of genes).

Also, you may want to increase your protein database. If the RefSeq genes you are using represent just a subset of the 3 vertebrate genomes rather than the whole genomes of those organisms, then you will want to get a couple of full genomes to work with. Also, not having a high completion level is not out of the ordinary for vertebrate genomes. In lamprey (an extreme case) the low completion level actually led to the discovery that its cells undergo programmed somatic deletion of about 25% of the genome, and since its genome was sequenced from somatic tissue, that portion was obviously missing from the assembly.
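As a purely illustrative back-of-the-envelope calculation (the MAKER gene count below is invented; only the 85% figure comes from the example above):

    CEGMA completeness estimate:      85% (0.85)
    MAKER gene models (hypothetical): 17,000
    rough true gene count:            17,000 / 0.85  ~  20,000 genes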
> 2) AED distributions are identical between rounds 2 & 3, though it sounds like there should be a vast improvement from the tutorials and my readings
That’s what you expect. The third round should show just minor improvement (AED is not a highly precise number, so a difference of 1% basically means the second and third round results are identical with respect to evidence support). The real improvement from the second round to the third is the quality of the unaided SNAP models (you really only get a sense of this by using Apollo to view a few contigs). Because the MAKER models are derived from evidence-based hints, they will always be similar between runs, but the raw SNAP models in round 3 will be much more like the MAKER models than the unaided SNAP models from round 2. This convergence is how you know that your gene predictor is trained.

You may also want to train Augustus and add it to your set of predictors (look for convergence between the MAKER, SNAP, and Augustus models to indicate that training has worked). Augustus generally performs better on vertebrates than SNAP. On some vertebrates you actually have to drop SNAP completely (SNAP performs very poorly on the human genome, for example). On genomes where you drop SNAP, you would just use Augustus (look at the evidence alignments and the convergence between MAKER/SNAP/Augustus models to make that decision).
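Adding Augustus alongside SNAP is just a matter of pointing MAKER at both trained parameter sets in maker_opts.ctl; a minimal sketch, assuming you have already trained an Augustus species model (the path and species name below are placeholders):

    # relevant lines from maker_opts.ctl (path and name are placeholders)
    snaphmm=/path/to/round2/my_species.hmm   #SNAP HMM file from your latest training round
    augustus_species=my_species              #Augustus gene prediction species model

MAKER will then generate models from both predictors and keep whichever are best supported by the evidence at each locus, which is also what makes the MAKER/SNAP/Augustus convergence check described above possible.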
> 3) It strikes me that roughly 75% of the genome is not being used (my suspicion is that scaffolds are smaller than the MAKER threshold for consideration). Should I lower this threshold so that more of the genome is considered?
The default threshold for consideration is 1 bp, but when you actually run the predictors you will find that they cannot physically fit a multi-exon gene into contigs below about 10 kb in length. So MAKER will run those small contigs; you just won’t get any results from them.
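That threshold is the min_contig setting in maker_opts.ctl; the default looks roughly like this, and you could instead raise it to around 10000 simply to skip scaffolds too short to hold a multi-exon gene and save compute time:

    # maker_opts.ctl
    min_contig=1    #skip genome contigs below this length (in bp)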
> 4) I realize that after round 1 I generated the HMM based on all of the genes predicted by the evidence, and it seems some have taken the approach of restricting this list to the "best" genes, or using something like CEGMA. Is my method of HMM construction to blame?
Your HMMs are probably fine (look for convergence between the raw SNAP models and the evidence-based MAKER models to see if SNAP is behaving well). I think you probably need a better protein database, and perhaps need to improve repeat masking as well (try running RepeatModeler; I can’t overstate the importance of this, since unmasked repeats can essentially break a gene predictor). Try adding Augustus to the analysis. Also, in general I’ve found that Cufflinks-processed evidence is far too noisy and adversely affects annotation results. Try processing the transcript data with Trinity instead (you will get better gene models). I doubt additional training of SNAP is necessary.
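On the RepeatModeler point, a minimal sketch of building a species-specific repeat library and handing it back to MAKER (the exact flags and the name of the output library file vary between RepeatModeler versions, so treat this as an outline rather than a recipe):

    # build a RepeatModeler database from the assembly, then run the de novo search
    BuildDatabase -name my_genome genome.fa
    RepeatModeler -database my_genome -pa 8    # -pa controls parallel search jobs in older versions

    # then point MAKER at the resulting consensus library in maker_opts.ctl, e.g.:
    #   rmlib=/path/to/my_genome_repeats.fa    #organism-specific repeat library in FASTA format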
> 5) Am I worried about nothing here? Is this a pretty decent annotation?
A reasonable expectation of accuracy for a first draft genome annotation is probably in the upper 70s to high 80s (percent). Extremely high quality assemblies with lots of good transcript data might break into the 90s. For example, more than 40% of the genes from the original draft of the mouse genome have since been thrown out (http://www.biomedcentral.com/1471-2105/10/67). The total gene count has remained similar, but those counts are now based on new genes in new locations in the genome. Also, the honeybee genome recently got major improvements in its annotation (a 50% increase in gene count) after problems with the original assembly and annotation process were fixed (http://www.biomedcentral.com/1471-2164/15/86).
—Carson