[maker-devel] Help debugging a MAKER result

Tue Oct 2 00:50:32 MDT 2018

Hi Xabier, and thanks for your reply.
I forgot to mention it, but I used the annotated repeats derived from the
ITAG annotation as the repeat library, so I expect these to be quite
appropriate. I guess my question is about the way Maker makes
decisions: is the fact that some repeats (simple repeats in this case) were
predicted enough to change a CDS into a UTR, despite sufficient protein
evidence?
I did not train Augustus myself; rather, I used the species (tomato) profile
that comes with the Augustus release. Does that make sense?
As for the haploid/diploid issue - fortunately I don't have to deal with
that since cultivated tomato varieties are repeatedly selfed, so they are
(almost) completely homozygous.
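
To put a number on how much of that last CDS is actually masked, something
like the following could be used. It is only a rough sketch: the coordinates
are made up for illustration, the function names are my own, and in practice
the intervals would be read from the CDS and repeat features in the GFF3
files.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (GFF3-style) coordinates


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or adjacent intervals so no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def masked_fraction(cds: Interval, repeats: List[Interval]) -> float:
    """Fraction of the CDS bases that fall inside any repeat interval."""
    cds_start, cds_end = cds
    covered = 0
    for rep_start, rep_end in merge_intervals(repeats):
        overlap = min(cds_end, rep_end) - max(cds_start, rep_start) + 1
        if overlap > 0:
            covered += overlap
    return covered / (cds_end - cds_start + 1)


# Hypothetical coordinates, for illustration only.
terminal_cds = (1000, 1299)
simple_repeats = [(1050, 1079), (1070, 1099), (1400, 1450)]
print(f"{masked_fraction(terminal_cds, simple_repeats):.2%}")
```

If only a small fraction of the exon is masked, the repeats alone seem an
unlikely explanation for losing the whole CDS.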

On Tue, 2 Oct 2018 at 3:01, Xabier Vázquez-Campos <xvazquezc at gmail.com> wrote:

> Hi Lior,
>
> Without getting into a lot of detail, a good model covering the repeats in
> your genome is extremely important, especially in genomes with a lot of
> repeats. If the repeat library does not have appropriate coverage,
> anything based on the masked genome will be affected.
>
> The evidence you pass into Augustus to generate the gene model can have a
> huge impact. Aside from the repeats, BUSCO-generated gene models can
> under-predict:
> https://groups.google.com/forum/?hl=en-GB#!topic/maker-devel/ocnDG4nq1A8
> We have also seen in our lab that the gene models generated by Augustus can
> be very different if you provide a haploid assembly vs haploid + alternate
> contigs vs diploid. In general, a purely haploid assembly generates a less
> biased model, as it contains fewer duplicated conserved genes that would
> otherwise skew the gene model towards them (at least in BUSCO-based
> models, but this should extend to any Augustus model).
>
> Note that in the end the generated annotation is just a model/hypothesis
> and may require more than a bit of curation... usually increasing with more
> complex genomes.
>
> Cheers,
> Xabi
>
> On Tue, 2 Oct 2018 at 05:23, Lior Glick <liorglic at mail.tau.ac.il> wrote:
>
>> Hi MAKER users,
>> I am new to Maker and have just finished running my first annotations.
>> Although the results make sense in general, I have reasons to suspect some
>> gene models are wrong and would like your help in understanding and
>> optimizing the results.
>> My research project involves the annotation of multiple tomato varieties
>> (individuals) which are a bit different from the published reference
>> genome. To this end, I created de-novo assemblies of these genomes and also
>> generated an evidence set to be used as input for Maker. The evidence consists
>> of a large set of transcripts from various tomato varieties and conditions,
>> as well as full protein sets from 6 plant species, including the proteins
>> derived from the annotation of the reference - called ITAG.
>> For an initial QA, I tried annotating the reference genome using my
>> evidence data and Augustus as gene predictor. This should allow me to
>> compare my result to the ITAG annotation, which I assume to be the
>> "correct" answer, and see how well I'm doing. I should mention that the ITAG
>> annotation was also created using Maker, followed by manual curation.
>> I started by comparing the protein sets from my result and the ITAG set.
>> Specifically, I ran an all-vs-all blast and took the top hits. I discovered
>> that only about 70% of the ITAG proteins are covered by a protein from my
>> result with a high quality alignment (evalue < 10e-5, coverage > 90%). I
>> further investigated by running BUSCO on both protein sets and looking at
>> BUSCOs found in ITAG but missing in my result. Attached is a screenshot
>> from a genome browser where you can see such a case. The top track is the ITAG
>> gene model, and below it is my result. The third track shows the protein evidence
>> alignments (i.e. blastx and protein2genome features), and the bottom track shows
>> masked repeats.
>> As you can see, there seem to be two issues with my result:
>> 1. The two genes in ITAG were fused into one. I guess this is a difficult
>> case as the genes are really close together.
>> 2. The last (3') CDS of the ITAG gene was predicted to be the 3' UTR in
>> my result. This is in fact the reason I ended up with a truncated protein
>> and a missing BUSCO.
>> This is a bit surprising to me, since there seems to be quite a lot of
>> protein evidence supporting this region as a CDS. Can you help me figure
>> out why the result came out this way? Could it be due to the small repeats
>> detected in this region?
>> Any ideas on how my result can be improved without manual curation?
>>
>> Many thanks!
>>
>
>
> --
> Xabier Vázquez-Campos, *PhD*
> *Research Associate*
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA
>