<div dir="ltr"><div dir="ltr"><div>Hi Lior,</div><div><br></div><div>Without getting into a lot of detail, a good model covering the repeats in your genome is extremely important, especially in genomes with a lot of repeats. If the repeat library does not have appropriate coverage, anything based on the masked genome will be affected.<br></div><div><br></div><div>The evidence you pass into Augustus to generate the gene model can have a huge impact. Aside from the repeats, BUSCO-generated gene models can under-predict:<br></div><div><a href="https://groups.google.com/forum/?hl=en-GB#!topic/maker-devel/ocnDG4nq1A8" target="_blank">https://groups.google.com/forum/?hl=en-GB#!topic/maker-devel/ocnDG4nq1A8</a></div><div>We have also seen in our lab that the gene models generated by Augustus can be very different if you provide a haploid assembly vs. haploid + alternate contigs vs. diploid. In general, a purely haploid assembly generates a less biased model, as it contains fewer duplicated conserved genes, which would otherwise skew the gene model towards them (at least in BUSCO-based models, but this should extend to any Augustus model).<br></div><div><br></div><div>Note that in the end the generated annotation is just a model/hypothesis and may require more than a bit of curation, usually more so with more complex genomes.</div><div><br></div><div>Cheers,</div><div>Xabi<br></div></div></div><br><div class="gmail_quote"><div dir="ltr">On Tue, 2 Oct 2018 at 05:23, Lior Glick &lt;<a href="mailto:liorglic@mail.tau.ac.il" target="_blank">liorglic@mail.tau.ac.il</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="rtl"><div dir="ltr">Hi MAKER users,</div><div dir="ltr">I am new to Maker and have just finished running my first annotations. 
Although the results make sense in general, I have reasons to suspect some gene models are wrong and would like your help in understanding and optimizing the results.</div><div dir="ltr">My research project involves the annotation of multiple tomato varieties (individuals) which are a bit different from the published reference genome. To this end, I created de-novo assemblies of these genomes and also generated an evidence set to be used as input for Maker. The evidence consists of a large set of transcripts from various tomato varieties and conditions, as well as full protein sets from 6 plant species, including the proteins derived from the annotation of the reference, called ITAG.</div><div dir="ltr">For an initial QA, I tried annotating the reference genome using my evidence data and Augustus as the gene predictor. This should allow me to compare my result to the ITAG annotation, which I assume to be the "correct" answer, and see how well I'm doing. I should mention that the ITAG annotation was also created using Maker, followed by manual curation.</div><div dir="ltr">I started by comparing the protein sets from my result and the ITAG set. Specifically, I ran an all-vs-all BLAST and took the top hits. I discovered that only about 70% of the ITAG proteins are covered by a protein from my result with a high-quality alignment (e-value &lt; 10e-5, coverage &gt; 90%). I further investigated by running BUSCO on both protein sets and looking at BUSCOs found in ITAG but missing in my result. Attached is a screenshot from a genome browser where you can see such a case. The top track is the ITAG gene model, and below it is my result. The third track shows the protein evidence alignments (i.e. blastx and protein2genome features), and the bottom track shows masked repeats.</div><div dir="ltr">As you can see, there seem to be two issues with my result:</div><div dir="ltr">1. The two genes in ITAG were fused into one. I guess this is a difficult case as the genes are really close together.</div><div dir="ltr">2. 
The last (3') CDS of the ITAG gene was predicted to be the 3' UTR in my result. This is in fact the reason I ended up with a truncated protein and a missing BUSCO.</div><div dir="ltr">This is a bit surprising to me, since there seems to be quite a lot of protein evidence supporting this region as a CDS. Can you help me figure out why this is the result? Could it be due to the small repeats detected in this region?</div><div dir="ltr">Any ideas on how my result can be improved without manual curation?</div><div dir="ltr"><br></div><div dir="ltr">Many thanks!</div></div>
</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="m_-7236745163160119211m_-7034574888123341356m_719052513224709693m_-7287022544034250215gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div>Xabier Vázquez-Campos, <i>PhD</i><br><i>Research Associate</i><br>NSW Systems Biology Initiative<br>School of Biotechnology and Biomolecular Sciences<br>
The University of New South Wales<br>Sydney NSW 2052 AUSTRALIA<br></div></div></div></div></div></div></div></div></div></div></div>
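<div><br></div><div>For reference, the protein-set comparison described in the quoted message (all-vs-all BLAST, then counting reference proteins with a hit passing e-value and coverage thresholds) could be sketched roughly as follows. This is only an illustration under assumptions that are not in the original message: the function name <code>covered_fraction</code>, the five-column tabular layout (qseqid, sseqid, evalue, bitscore, qcovs), the default thresholds, and the toy rows are all hypothetical.</div>

```python
import csv
import io


def covered_fraction(blast_tsv, query_ids, max_evalue=1e-5, min_qcov=90.0):
    """Fraction of query proteins with at least one hit passing both thresholds.

    Assumes tab-separated rows with the columns
    qseqid, sseqid, evalue, bitscore, qcovs (an assumed custom tabular layout).
    """
    covered = set()
    for row in csv.reader(blast_tsv, delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        qseqid, _sseqid, evalue, _bitscore, qcovs = row[:5]
        if float(evalue) <= max_evalue and float(qcovs) >= min_qcov:
            covered.add(qseqid)
    query_ids = set(query_ids)
    if not query_ids:
        return 0.0
    # Queries with no hit at all still count in the denominator.
    return len(covered & query_ids) / len(query_ids)


# Toy input: reference proteins as queries, hits against the new annotation.
example = io.StringIO(
    "geneA\tmine1\t1e-50\t300\t98\n"  # passes both thresholds
    "geneB\tmine2\t1e-20\t150\t60\n"  # coverage too low
    "geneC\tmine3\t0.5\t30\t95\n"     # e-value too high
)
fraction = covered_fraction(example, ["geneA", "geneB", "geneC", "geneD"])
print(f"{fraction:.0%} of reference proteins covered")
```

<div>The full list of reference protein IDs is passed in separately so that proteins with no hit at all are still counted as uncovered; the thresholds are parameters and should be set to whatever cut-offs were actually used.</div>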