[maker-devel] Help debugging a MAKER result

Fri Oct 5 00:51:41 MDT 2018

Thank you both for your helpful ideas. I'm going to give them a try and see
how this affects my results. Will update when I have them.
Cheers indeed.

On Fri, 5 Oct 2018 at 3:10, Carson Holt <carsonhh at gmail.com> wrote:

> One correction. I meant to say set unmask=1.
>
> —Carson
>
>
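The unmask setting mentioned above is one of the options in MAKER's maker_opts.ctl control file, where each line has the form key=value followed by an optional #comment. As a minimal sketch (not part of MAKER; the helper name set_ctl_option and the sample text are made up for illustration), the option could be flipped from a script like this:

```python
import re


def set_ctl_option(ctl_text: str, key: str, value: str) -> str:
    """Return ctl_text with `key=` set to `value`, keeping any trailing #comment.

    Assumes the control file uses one `key=value #comment` entry per line.
    """
    pattern = re.compile(
        rf"^({re.escape(key)}=)([^#\n]*?)([ \t]*(?:#.*)?)$", re.MULTILINE
    )
    # A function replacement avoids backreference clashes when value is a digit.
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}{value}{m.group(3)}", ctl_text
    )
    if count == 0:
        raise KeyError(f"option {key!r} not found in control file text")
    return new_text


# Illustrative two-line excerpt, not a complete control file.
sample = (
    "softmask=1 #use soft-masking rather than hard-masking in BLAST\n"
    "unmask=0 #also run ab-initio prediction programs on unmasked sequence\n"
)
updated = set_ctl_option(sample, "unmask", "1")
print(updated)
```

In practice, editing the line by hand in a text editor is just as good; the sketch only shows which line changes.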
> On Oct 4, 2018, at 5:52 PM, Carson Holt <carsonhh at gmail.com> wrote:
>
> I’d just like to add info on how MAKER builds predictions. MAKER itself
> does not generate models. In your case, Augustus produces the models.
> Augustus will run twice. Once on its own (this will be on a repeat masked
> version of the assembly), and once again where MAKER provides it with a
> hints file as part of the command line used to run Augustus. The hints file
> is generated from the evidence alignments you provided to MAKER. The hints
> usually get Augustus to perform a little better than it does with training
> alone on a masked assembly.
>
> Under-masking or overmasking the assembly can both confound Augustus.
> MAKER hard masks complex repeats in the assembly (turns them from ATCG into
> N’s), and soft-masks simple repeats (turns ATCG into lower case actg). The
> lower case “soft-masking” affects BLAST alignment but not Augustus
> predictions (Augustus ignores it). MAKER also removes the hard-masking when
> it runs Augustus with the hints file. This is done because we’ve
> constrained Augustus to a smaller padded evidence cluster at the locus, and
> Augustus can no longer see the whole assembly.
>
> If you want to explore how masking affects the models, you can
> set unmask=0. Then Augustus will run 3 times (one extra run on the unmasked
> assembly). You can then look at contigs in a browser to see how the masked
> vs unmasked models compare to each other.
>
> —Carson
>
>
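To make the hard-masking versus soft-masking distinction described above concrete, here is a minimal Python sketch. It is not MAKER's code; the function names and the example coordinates are hypothetical, and repeat regions are assumed to be 0-based, half-open (start, end) intervals.

```python
from typing import Iterable, Tuple

Region = Tuple[int, int]  # 0-based, half-open interval (assumption for this sketch)


def hard_mask(seq: str, regions: Iterable[Region]) -> str:
    """Replace every base inside the given regions with N."""
    chars = list(seq)
    for start, end in regions:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = "N"
    return "".join(chars)


def soft_mask(seq: str, regions: Iterable[Region]) -> str:
    """Lower-case every base inside the given regions."""
    chars = list(seq)
    for start, end in regions:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)


seq = "ATCGATCGATCG"
hard = hard_mask(seq, [(2, 5)])   # complex repeat -> N's
soft = soft_mask(seq, [(8, 12)])  # simple repeat -> lower case
print(hard)  # ATNNNTCGATCG
print(soft)  # ATCGATCGatcg
```

Note that soft-masking keeps the original bases (upper-casing restores them), whereas hard-masking discards them, so an unmasked run has to start from the original sequence.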
> On Oct 2, 2018, at 10:39 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com>
> wrote:
>
> Yeah, tomato should be rather well annotated.
>
> I would double check how good the tomato genome was at the time the
> gene model was created. Also, creating a new Augustus model based on
> the first prediction run might improve things.
>
> You have tomato in Repbase. To be sure you are not missing anything, I
> would still run the advanced repeat library protocol, if it isn't
> computationally prohibitive.
>
> I don't know how good SNAP is for plant genomes, so it could be worth
> trying on top of the Augustus predictions.
>
> On top of this, I'd take a look at reference-based annotation tools like
> RATT. This would annotate all the regions shared with the reference, so
> you would only need to curate the regions that cannot be annotated from
> the reference, using your Maker annotation.
>
>
> On Tue, 2 Oct 2018 at 16:50, Lior Glick <liorglic at mail.tau.ac.il> wrote:
>
>> Hi Xabier, and thanks for your reply.
>> I forgot to mention it, but I used the annotated repeats derived from the
>> ITAG annotation as repeats library, so I expect these to be quite
>> appropriate. I guess my question is regarding the way Maker makes
>> decisions: Is the fact that some repeats (simple repeats in this case) were
>> predicted enough to change a CDS into a UTR, despite sufficient protein
>> evidence?
>> I did not train Augustus myself, rather I used the species (tomato)
>> profile that comes with the Augustus release. Does that make sense?
>> As for the haploid/diploid issue - fortunately I don't have to deal with
>> that since cultivated tomato varieties are repeatedly selfed, so they are
>> (almost) completely homozygous.
>>
>> On Tue, 2 Oct 2018 at 3:01, Xabier Vázquez-Campos <xvazquezc at gmail.com> wrote:
>>
>>> Hi Lior,
>>>
>>> Without getting into a lot of detail, a good model covering the repeats in
>>> your genome is extremely important, especially in genomes with a lot of
>>> repeats. If the repeat library does not have appropriate coverage,
>>> anything based on the masked genome will be affected.
>>>
>>> The evidence you pass into Augustus to generate the gene model can have
>>> a huge impact. Aside from the repeats, BUSCO-generated gene models can
>>> under-predict:
>>> https://groups.google.com/forum/?hl=en-GB#!topic/maker-devel/ocnDG4nq1A8
>>> And we have seen in our lab that the gene models generated by Augustus
>>> can be very different if you provide a haploid assembly vs haploid +
>>> alternate contigs vs diploid. In general, a purely haploid assembly
>>> generates a less biased model, as it contains fewer duplicated
>>> conserved genes, which would otherwise bias the gene model towards them
>>> (at least in BUSCO-based models, but it should extend to any
>>> Augustus model).
>>>
>>> Note that in the end the generated annotation is just a model/hypothesis
>>> and may require more than a bit of curation... usually increasing with more
>>> complex genomes.
>>>
>>> Cheers,
>>> Xabi
>>>
>>> On Tue, 2 Oct 2018 at 05:23, Lior Glick <liorglic at mail.tau.ac.il> wrote:
>>>
>>>> Hi MAKER users,
>>>> I am new to Maker and have just finished running my first annotations.
>>>> Although the results make sense in general, I have reasons to suspect some
>>>> gene models are wrong and would like your help in understanding and
>>>> optimizing the results.
>>>> My research project involves the annotation of multiple tomato
>>>> varieties (individuals) which are a bit different from the published
>>>> reference genome. To this end, I created de-novo assemblies of these
>>>> genomes and also generated an evidence set to be used as input for Maker.
>>>> The evidence consists of a large set of transcripts from various tomato
>>>> varieties and conditions, as well as full protein sets from 6 plant
>>>> species, including the proteins derived from the annotation of the
>>>> reference - called ITAG.
>>>> For an initial QA, I tried annotating the reference genome using my
>>>> evidence data and Augustus as gene predictor. This should allow me to
>>>> compare my result to the ITAG annotation, which I assume to be the
>>>> "correct" answer, and see how well I'm doing. I should mention that ITAG
>>>> annotation was also created using Maker, followed by manual curation.
>>>> I started by comparing the protein sets from my result and the ITAG
>>>> set. Specifically, I ran an all-vs-all blast and took the top hits. I
>>>> discovered that only about 70% of the ITAG proteins are covered by a
>>>> protein from my result with a high quality alignment (evalue < 10e-5,
>>>> coverage > 90%). I further investigated by running BUSCO on both protein
>>>> sets and looking at BUSCOs found in ITAG but missing in my result. Attached
>>>> is a screenshot from a genome browser where you can see such a case. The top
>>>> track is the ITAG gene model, below is my result. The third track shows the
>>>> protein evidence alignments (i.e. blastx and protein2genome features), and
>>>> the bottom track shows masked repeats.
>>>> As you can see, there seem to be two issues with my result:
>>>> 1. The two genes in ITAG were fused into one. I guess this is a
>>>> difficult case as the genes are really close together.
>>>> 2. The last (3') CDS of the ITAG gene was predicted to be the 3' UTR in
>>>> my result. This is in fact the reason I ended up with a truncated protein
>>>> and a missing BUSCO.
>>>> This is a bit surprising to me, since there seems to be quite a lot of
>>>> protein evidence supporting this region as a CDS. Can you help me figure
>>>> out why the result came out this way? Could it be due to the small repeats
>>>> detected in this region?
>>>> Any ideas on how my result can be improved without manual curation?
>>>>
>>>> Many thanks!
>>>> _______________________________________________
>>>> maker-devel mailing list
>>>> maker-devel at box290.bluehost.com
>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>>
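The protein-set comparison described in the message above (top BLAST hit per ITAG protein, filtered by e-value and query coverage) can be sketched in Python as follows. This is an illustration only, not the script actually used: the Hit fields assume tabular BLAST+ output that includes the query length, for example -outfmt "6 qseqid sseqid qstart qend qlen evalue bitscore", the thresholds passed in are placeholders, and coverage is approximated from the span of the single best alignment.

```python
from typing import Collection, Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    query: str       # reference (e.g. ITAG) protein ID
    subject: str     # protein ID from the new annotation
    qstart: int      # 1-based alignment start on the query
    qend: int        # 1-based alignment end on the query
    qlen: int        # full query length
    evalue: float
    bitscore: float


def covered_fraction(
    hits: Iterable[Hit],
    all_queries: Collection[str],
    max_evalue: float,
    min_coverage: float,
) -> float:
    """Fraction of queries whose top hit passes both thresholds."""
    queries = set(all_queries)
    if not queries:
        raise ValueError("all_queries must not be empty")
    best: Dict[str, Hit] = {}
    for hit in hits:
        # Keep the highest-scoring hit per query as its "top hit".
        if hit.query not in best or hit.bitscore > best[hit.query].bitscore:
            best[hit.query] = hit
    covered = {
        q
        for q, h in best.items()
        if q in queries
        and h.evalue <= max_evalue
        and (h.qend - h.qstart + 1) / h.qlen >= min_coverage
    }
    return len(covered) / len(queries)


# Toy data with made-up IDs: p1 is well covered, p2 only half, p3 has no hit.
hits = [
    Hit("p1", "m1", 1, 95, 100, 1e-50, 180.0),
    Hit("p1", "m9", 1, 40, 100, 1e-10, 60.0),
    Hit("p2", "m2", 1, 50, 100, 1e-20, 90.0),
]
fraction = covered_fraction(hits, ["p1", "p2", "p3"], max_evalue=1e-5, min_coverage=0.9)
print(f"{fraction:.2f}")
```

Queries with no hit at all count as not covered, which is why the full list of reference protein IDs is passed in separately from the hits.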
>>>
>>>
>>> --
>>> Xabier Vázquez-Campos, *PhD*
>>> *Research Associate*
>>> NSW Systems Biology Initiative
>>> School of Biotechnology and Biomolecular Sciences
>>> The University of New South Wales
>>> Sydney NSW 2052 AUSTRALIA
>>>
>>
>