[maker-devel] Help debugging a MAKER result

Lior Glick liorglic at mail.tau.ac.il
Sun Sep 30 12:27:20 MDT 2018


Hi MAKER users,
I am new to Maker and have just finished running my first annotations.
Although the results make sense in general, I have reasons to suspect some
gene models are wrong and would like your help in understanding and
optimizing the results.
My research project involves the annotation of multiple tomato varieties
(individuals) which are a bit different from the published reference
genome. To this end, I created de-novo assemblies of these genomes and also
generated an evidence set to be used as input for Maker. The evidence
consists of a large set of transcripts from various tomato varieties and
conditions, as well as full protein sets from 6 plant species, including the
proteins derived from the annotation of the reference, called ITAG.
For an initial QA, I tried annotating the reference genome using my
evidence data and Augustus as gene predictor. This should allow me to
compare my result to the ITAG annotation, which I assume to be the
"correct" answer, and see how well I'm doing. I should mention that ITAG
annotation was also created using Maker, followed by manual curation.
I started by comparing the protein sets from my result and the ITAG set.
Specifically, I ran an all-vs-all blast and took the top hits. I discovered
that only about 70% of the ITAG proteins are covered by a protein from my
result with a high quality alignment (evalue < 10e-5, coverage > 90%). I
further investigated by running BUSCO on both protein sets and looking at
BUSCOs found in ITAG but missing in my result. Attached is a screenshot
from a genome browser where you can see such a case. The top track is the
ITAG gene model, and below it is my result. The third track shows the protein
evidence alignments (i.e. blastx and protein2genome features), and the bottom
track shows the masked repeats.
As you can see, there seem to be two issues with my result:
1. The two genes in ITAG were fused into one. I guess this is a difficult
case as the genes are really close together.
2. The last (3') CDS of the ITAG gene was predicted to be the 3' UTR in my
result. This is in fact the reason I ended up with a truncated protein and
a missing BUSCO.
This is a bit surprising to me, since there seems to be quite a lot of
protein evidence supporting this region as a CDS. Can you help me figure
out why Maker produced this result? Could it be due to the small repeats
detected in this region?
Any ideas on how my result can be improved without manual curation?
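
For concreteness, the coverage check described above (top BLAST hit per ITAG
protein, filtered by e-value and query coverage) could be sketched roughly as
follows. This is only an illustration, not the exact script that was run: it
assumes tabular BLAST output produced with
-outfmt "6 qseqid sseqid evalue qstart qend qlen", with ITAG proteins as the
queries, and it measures coverage on the single best alignment only. The
function names are made up for this example.

```python
import csv

# Thresholds quoted in the message; everything else here is an assumption.
MAX_EVALUE = 10e-5
MIN_COVERAGE = 0.9


def read_blast_table(handle):
    """Read tab-separated rows: qseqid, sseqid, evalue, qstart, qend, qlen."""
    return list(csv.reader(handle, delimiter="\t"))


def best_hits(rows):
    """Keep, for each query, the alignment with the lowest e-value."""
    best = {}
    for qseqid, sseqid, evalue, qstart, qend, qlen in rows:
        hit = (float(evalue), sseqid, int(qstart), int(qend), int(qlen))
        if qseqid not in best or hit[0] < best[qseqid][0]:
            best[qseqid] = hit
    return best


def covered_queries(rows, max_evalue=MAX_EVALUE, min_coverage=MIN_COVERAGE):
    """Return the queries whose best hit passes both thresholds."""
    covered = set()
    for qseqid, (evalue, _sseqid, qstart, qend, qlen) in best_hits(rows).items():
        # Fraction of the query protein spanned by the best alignment.
        coverage = (qend - qstart + 1) / qlen
        if evalue < max_evalue and coverage > min_coverage:
            covered.add(qseqid)
    return covered


def fraction_covered(rows, n_queries):
    """Share of all query proteins that are covered.

    n_queries must be the total number of ITAG proteins (e.g. counted from
    the FASTA file), because proteins without any hit do not appear in the
    BLAST table at all.
    """
    return len(covered_queries(rows)) / n_queries
```

Proteins that fall out of the covered set can then be inspected in the genome
browser, as in the attached example.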

Many thanks!
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker.png
Type: image/png
Size: 30422 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180930/cc5639ab/attachment-0002.png>

