[maker-devel] Some questions regarding ab-initio training

Thu Jun 5 12:28:55 MDT 2014

One thing you might want to try is adding another predictor like SNAP
together with Augustus and then process the MAKER results using EVM.  We
actually have a collaboration with the EVM group to produce a MAKER-EVM
pipeline (MAKER 3.0).  EVM will produce consensus models using the
predictions and the evidence in the MAKER GFF3 which are generally better
than just SNAP and Augustus with hints, so it might be able to remove some
of the artifacts you are worried about.

--Carson

On 6/5/14, 12:24 PM, "Carson Holt" <carsonhh at gmail.com> wrote:

>Like I said, the predictors do the best they can, so there is probably
>something about these regions that requires exons to be dropped or added
>to make the CDS, reading frame, or start/stop work.  In several ant genomes
>we saw something like this caused by incorrect homopolymers in the
>assembly, which forced the predictor to slightly alter the intron/exon
>structure because otherwise the reading frame made no sense (the EST
>alignments were used to confirm that the assembly homopolymers were
>incorrect - lots of bad single base pair deletions).
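The effect of a single-base homopolymer deletion on the reading frame can be sketched with a toy example. The sequence below is made up for illustration and is not taken from any of the genomes discussed; the helper names are hypothetical.

```python
# Toy illustration: one missing base in a homopolymer run shifts the frame,
# so the in-frame stop codon is no longer read where it should be.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons(seq):
    """Split a sequence into complete codons, starting at position 0."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def first_stop_index(seq):
    """Return the index of the first in-frame stop codon, or None."""
    for i, codon in enumerate(codons(seq)):
        if codon in STOP_CODONS:
            return i
    return None


# Made-up "true" CDS with a run of five A's (positions 5-9).
true_cds = "ATGGCAAAAAGTTGGACCTAA"
# Hypothetical assembly error: one A of the homopolymer is missing.
assembled = true_cds[:5] + true_cds[6:]

print(codons(true_cds), first_stop_index(true_cds))
print(codons(assembled), first_stop_index(assembled))
```

In the intact sequence the ORF ends at the expected stop codon; in the "assembled" copy the frame shifts after the homopolymer and the stop is lost, which is the kind of situation where a predictor has to drop or add an exon to recover a usable frame.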
>
>The way hints work is as follows.  At the simplest level, ab initio
>predictors are calculating the probability of being in different states
>(intergenic, intron, exon, etc.).  The hints increase the probability of
>being in the intron state where MAKER gives an intron hint, or of being in
>an exon/CDS state where MAKER gives an exon/CDS hint.  So this bends the
>likelihood of the ab initio gene predictor toward calling something similar
>in structure to the evidence overlapping it.  That being said, if there is
>a strong enough signal from something else in the sequence, or the hints
>won't work with the splice sites in the region, or the reading frame
>breaks, then no amount of hints can force Augustus to make that model.
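The idea can be sketched with a toy two-state Viterbi decoder. This is only an illustration of the principle, not Augustus's actual model: the states, probabilities, the `bonus` factor, and the `forbidden` set (standing in for something like a missing splice site or a broken frame) are all assumptions made up for the example.

```python
import math

# Toy model: observations are 'c' (coding-like) or 'n' (noncoding-like).
STATES = ("exon", "intron")
LOG_START = {s: math.log(0.5) for s in STATES}
LOG_TRANS = {(a, b): math.log(0.8 if a == b else 0.2)
             for a in STATES for b in STATES}
LOG_EMIT = {
    "exon": {"c": math.log(0.7), "n": math.log(0.3)},
    "intron": {"c": math.log(0.3), "n": math.log(0.7)},
}


def viterbi(obs, hints=None, bonus=10.0, forbidden=frozenset()):
    """Most likely state path; a hint multiplies the hinted state's
    probability at that position by `bonus`, and (position, state) pairs
    in `forbidden` are impossible regardless of any hint."""
    hints = hints or {}

    def score(t, state):
        if (t, state) in forbidden:
            return -math.inf
        s = LOG_EMIT[state][obs[t]]
        if hints.get(t) == state:
            s += math.log(bonus)
        return s

    best = [{s: LOG_START[s] + score(0, s) for s in STATES}]
    back = []
    for t in range(1, len(obs)):
        row, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[-1][p] + LOG_TRANS[(p, s)])
            row[s] = best[-1][prev] + LOG_TRANS[(prev, s)] + score(t, s)
            ptr[s] = prev
        best.append(row)
        back.append(ptr)

    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


obs = "ccnccc"
print(viterbi(obs))                                    # no hints
print(viterbi(obs, hints={2: "intron"}))               # intron hint at position 2
print(viterbi(obs, hints={2: "intron"}, bonus=1e6,
              forbidden={(2, "intron")}))              # hint cannot be honoured
```

Without hints the toy decoder calls one continuous exon; an intron hint at position 2 tips the balance toward an intron there; and when that state is ruled out at that position, even a very large bonus changes nothing.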
>
>--Carson
>
>
>
>On 6/5/14, 2:15 AM, "Marc Höppner" <marc.hoeppner at bils.se> wrote:
>
>>Hi,
>>
>>thanks for the feedback. I spent some more time on this and am still
>>somewhat unsatisfied with the whole thing…
>>
>>A few comments:
>>
>>I quite frequently see Augustus, and by extension Maker, including exons
>>that are not supported by EST/protein evidence and are not critical for
>>the gene model (i.e. I can take them out and still get a proper CDS).
>>Maybe I don’t know enough about how Maker creates hints and, more
>>importantly, what role these hints play for Augustus, but I cannot really
>>see a great effect (any, really) on the gene models, even if both ESTs and
>>proteins contradict an Augustus gene model and the surplus exon is
>>non-essential.
>>
>>(all evidence is provided as fasta files, protein2genome and est2genome
>>are set to 0)
>>
>>As for the repeat library, I suppose this is a critical point. I am using
>>repeats from a closely related species via RepeatMasker, modelled and
>>filtered repeats from RepeatModeler and repeats derived from a
>>high-coverage 454 data set. Not sure what else I can do to improve that.
>>
>>As for evidence, I am using the curated reference proteome from a related
>>species (<5 million years), uniprot_swissprot and high-quality ESTs + 454
>>reads. I don’t think it gets a whole lot better, in terms of what data
>>can be used.
>>
>>So in summary, I am just not getting where I want to be using Augustus and
>>Maker - specifically, the gene models are full of weird, unsupported
>>artefacts despite manually curating > 850 models for training. I suppose I
>>was hoping for some secret trick to improve on this - but I guess there is
>>none? Actually, if I only do a pure evidence build (seeing that my input
>>data is very high quality), it looks better - which sort of goes against
>>the premise of Maker :/
>>
>>Regards,
>>
>>Marc
>>
>>
>>
>>
>>Marc P. Hoeppner, PhD
>>Team Leader
>>Department for Medical Biochemistry and Microbiology
>>Uppsala University, Sweden
>>marc.hoeppner at bils.se
>>
>>On 27 May 2014, at 17:25, Carson Holt <carsonhh at gmail.com> wrote:
>>
>>> Extra exons can be required for predictors to make sense of a region
>>> (they do the best they can).  This can be due to imperfect assemblies or
>>> repeats.  For plants, the repeat database is the one thing that will
>>> most affect the annotation quality.  You may need to spend some time
>>> building the best repeat library you can.  The repeat library is the
>>> next most important thing after training the predictor, because repeats
>>> confuse the predictor (sometimes a lot), causing it to behave oddly in
>>> those regions (because repeats do encode real proteins and protein
>>> fragments).  Also, when running with MAKER now, make sure to include the
>>> entire proteome of a related species and not just UniProt, and you will
>>> get better performance.  Now that you have Augustus trained, using it
>>> inside of MAKER with an improved repeat library and additional protein
>>> evidence should give it the feedback that will allow it to perform
>>> better than it would with just naked ab initio prediction.
>>> 
>>> Thanks,
>>> Carson
>>> 
>>> 
>>> On 5/27/14, 2:12 AM, "Marc Höppner" <marc.hoeppner at bils.se> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I wanted to get some feedback regarding the training of ab-initio gene
>>>> finders - it’s not strictly Maker related, but I suppose there are many
>>>> people on this list that have encountered and solved this issue in one
>>>> way or another.
>>>> 
>>>> Specifically, I am trying to train Augustus (and possibly SNAP) for a
>>>> plant genome. This has always been a very frustrating process for me,
>>>> but while I have a better idea now how to do it, I still don’t get the
>>>> sort of accuracy that I am hoping for. A quick run-through of my
>>>> process:
>>>> 
>>>> Evidence build with maker on level 1 and 2 proteins from Uniprot +
>>>> Sanger-sequenced EST data
>>>> 
>>>> Filtered for Models with an AED <= 0.3
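A minimal sketch of such an AED filter, assuming the models come from a MAKER GFF3 file in which each mRNA feature carries an `_AED=` attribute in column 9. The function names and the example lines are made up for illustration.

```python
def parse_attributes(column9):
    """Turn a GFF3 attribute column ('key=value;key=value') into a dict."""
    return dict(f.split("=", 1) for f in column9.strip().split(";") if "=" in f)


def low_aed_mrna_ids(gff3_lines, max_aed=0.3):
    """Return IDs of mRNA features whose _AED attribute is <= max_aed."""
    kept = []
    for line in gff3_lines:
        cols = line.rstrip("\n").split("\t")
        # Skip comments, directives, and anything that is not an mRNA row.
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        if "_AED" in attrs and float(attrs["_AED"]) <= max_aed:
            kept.append(attrs["ID"])
    return kept


# Made-up example records, not real MAKER output.
EXAMPLE = [
    "##gff-version 3",
    "scaffold_1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene-1",
    "scaffold_1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=mRNA-1;Parent=gene-1;_AED=0.12;_eAED=0.12",
    "scaffold_1\tmaker\tmRNA\t2000\t2900\t.\t-\t.\tID=mRNA-2;Parent=gene-2;_AED=0.45;_eAED=0.40",
    "scaffold_1\tmaker\tmRNA\t5000\t5900\t.\t+\t.\tID=mRNA-3;Parent=gene-3;_AED=0.30;_eAED=0.30",
]
print(low_aed_mrna_ids(EXAMPLE))
```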
>>>> 
>>>> Loaded that into WebApollo, together with an existing reference
>>>> annotation and the evidence tracks
>>>> 
>>>> Manually curated/selected 750 gene models using the following rules:
>>>> - Must have start/stop codon
>>>> - Must have as many exons as possible
>>>> - Must agree with evidence
>>>> - Must be >= 2 kb apart from other gene models (provided as flanking
>>>> regions for Augustus to train intergenic sequence)
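The spacing rule in the list above could be checked automatically along these lines. This is a sketch under assumptions: gene models are given as hypothetical `(seqid, start, end, name)` tuples with 1-based inclusive coordinates, the input contains all gene models on the sequence (not only the candidates), and distance to the contig ends is not considered.

```python
from collections import defaultdict


def well_separated(genes, min_gap=2000):
    """Return names of genes with at least min_gap bases of sequence
    between them and every other gene on the same sequence."""
    by_seq = defaultdict(list)
    for seqid, start, end, name in genes:
        by_seq[seqid].append((start, end, name))

    keep = []
    for items in by_seq.values():
        items.sort()
        max_prev_end = None  # furthest end among genes starting earlier
        for i, (start, end, name) in enumerate(items):
            left_ok = max_prev_end is None or start - max_prev_end - 1 >= min_gap
            right_ok = i + 1 == len(items) or items[i + 1][0] - end - 1 >= min_gap
            if left_ok and right_ok:
                keep.append(name)
            max_prev_end = end if max_prev_end is None else max(max_prev_end, end)
    return keep


# Made-up coordinates for illustration.
genes = [
    ("chr1", 1000, 3000, "g1"),
    ("chr1", 6000, 8000, "g2"),
    ("chr1", 9000, 9500, "g3"),
    ("chr1", 20000, 22000, "g4"),
]
print(well_separated(genes))
```

Here `g2` and `g3` sit less than 2 kb apart, so both are rejected, while `g1` and `g4` have enough flanking sequence on each side.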
>>>> 
>>>> From these models, I created a GBK file, split it into 650 (train) and
>>>> 100 (test) models and created a new profile using the documented
>>>> procedure.
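A random train/test split of this kind can be sketched as follows. The function and the model identifiers are hypothetical; in practice the split would be applied to the records of the GBK file rather than to plain IDs.

```python
import random


def split_train_test(model_ids, n_test=100, seed=42):
    """Shuffle reproducibly and hold out n_test models for testing."""
    if not 0 < n_test < len(model_ids):
        raise ValueError("n_test must be between 1 and len(model_ids) - 1")
    shuffled = list(model_ids)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


# 750 placeholder identifiers standing in for the curated gene models.
ids = [f"model_{i}" for i in range(750)]
train, test = split_train_test(ids)
print(len(train), len(test))
```

Fixing the seed keeps the held-out set stable between training runs, so accuracy figures from different parameter settings stay comparable.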
>>>> 
>>>> But:
>>>> 
>>>> While the naked ab-initio models created through Maker get a lot of
>>>> genes ‘sort of right’, I still see too many issues to be really
>>>> satisfied. Problems include:
>>>> 
>>>> - random exon calls which are not supported by any line of evidence
>>>> (~1 per gene model, I would guess)
>>>> - poor congruency with some gene models (especially ones not used for
>>>> training/testing)
>>>> 
>>>> Is there any best-practice guide on how to improve this? The Augustus
>>>> website is unfortunately quite poor on detail… My impression so far is
>>>> that ramping up the number of training models isn’t really doing too
>>>> much beyond a certain point (tried 400, 500 and 750).
>>>> 
>>>> Regards,
>>>> 
>>>> Marc
>>>> 
>>>> 
>>>> Marc P. Hoeppner, PhD
>>>> Team Leader
>>>> BILS Genome Annotation Platform
>>>> Department for Medical Biochemistry and Microbiology
>>>> Uppsala University, Sweden
>>>> marc.hoeppner at bils.se
>>>> 
>>>> 
>>>> _______________________________________________
>>>> maker-devel mailing list
>>>> maker-devel at box290.bluehost.com
>>>> 
>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org