[maker-devel] Some questions regarding ab-initio training

Thu Jun 5 12:24:03 MDT 2014

As I said, the predictors do the best they can, so there is probably
something about those regions that requires exons to be dropped or added
for the CDS, reading frame, or start/stop codons to work.  In several ant
genomes we saw something like this caused by incorrect homopolymers in the
assembly, which forced the predictor to slightly alter the intron/exon
structure because otherwise the reading frame made no sense (the EST
alignments were used to confirm that the assembly homopolymers were
incorrect - lots of bad single base pair deletions).

The way hints work is as follows.  At the simplest level, ab initio
predictors calculate the probability of being in different states
(intergenic, intron, exon, etc.).  The hints increase the probability of
being in the intron state where MAKER gives an intron hint, or of being in
an exon/CDS state where MAKER gives an exon/CDS hint.  This biases the
ab initio gene predictor toward calling something similar in structure to
the evidence overlapping it.  That being said, if there is a strong enough
signal from something else in the sequence, or the hints won't work with
the splice sites in the region, or the reading frame breaks, then no
amount of hints can force Augustus to make that model.
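
To make the idea concrete, here is a toy sketch in Python. It is not Augustus's actual algorithm (Augustus uses a generalized HMM with far richer signal models); it is a plain three-state Viterbi decoder in which a hint simply multiplies the score of the hinted state at a position. The state names, the `viterbi` function, the `stay` parameter, and the bonus value of 10 are all assumptions made up for illustration.

```python
import math

STATES = ("intergenic", "exon", "intron")

def viterbi(emissions, hints=None, stay=0.6):
    """Return the most likely state path for a toy 3-state model.

    emissions: list of dicts mapping state -> score in (0, 1], standing in
               for whatever the sequence itself says at each position.
    hints:     dict mapping position -> (state, bonus); the bonus (> 1)
               multiplies the score of the hinted state at that position.
    stay:      probability of remaining in the same state (hypothetical).
    """
    hints = hints or {}
    switch = (1.0 - stay) / (len(STATES) - 1)

    def trans(a, b):
        return math.log(stay if a == b else switch)

    def score(pos, state):
        s = math.log(emissions[pos][state])
        hint = hints.get(pos)
        if hint is not None and hint[0] == state:
            s += math.log(hint[1])  # the hint raises this state's score
        return s

    best = {s: score(0, s) for s in STATES}
    back = []
    for pos in range(1, len(emissions)):
        new_best, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + trans(p, s))
            new_best[s] = best[prev] + trans(prev, s) + score(pos, s)
            pointers[s] = prev
        best = new_best
        back.append(pointers)

    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

exon_like = {"intergenic": 0.1, "exon": 0.6, "intron": 0.3}
ambiguous = {"intergenic": 0.1, "exon": 0.45, "intron": 0.45}
strong_exon = {"intergenic": 0.01, "exon": 0.98, "intron": 0.01}

# Eight positions; positions 3 and 4 are where the evidence suggests an intron.
weak_region = [exon_like] * 3 + [ambiguous] * 2 + [exon_like] * 3
strong_region = [exon_like] * 3 + [strong_exon] * 2 + [exon_like] * 3
intron_hints = {3: ("intron", 10.0), 4: ("intron", 10.0)}

no_hints = viterbi(weak_region)
with_hints = viterbi(weak_region, intron_hints)
overridden = viterbi(strong_region, intron_hints)

print(no_hints)    # all exon: nothing favors an intron
print(with_hints)  # intron called at positions 3 and 4
print(overridden)  # all exon again: the sequence signal beats the hint
```

In the ambiguous region the hint is enough to tip the path into an intron, but where the sequence itself strongly says "exon" the same hint changes nothing, which is the behavior described above.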

--Carson

On 6/5/14, 2:15 AM, "Marc Höppner" <marc.hoeppner at bils.se> wrote:

>Hi,
>
>thanks for the feedback. I spent some more time on this and am still
>somewhat unsatisfied with the whole thing…
>
>A few comments:
>
>I quite frequently see augustus and by extension Maker including exons
>that are not supported by EST/Protein evidence and are not critical for
>the gene model (i.e. I can take them out and still get a proper CDS).
>Maybe I don’t know enough about how Maker creates hints and more
>importantly what role these hints play for augustus, but I cannot really
>see a great effect (any, really) on the gene models even if both ESTs and
>proteins contradict an augustus gene model and the surplus exon is
>non-essential. 
>
>(all evidence is provided as fasta files, protein2genome and est2genome
>are set to 0)
>
>As for the repeat library, I suppose this is a critical point. I am using
>repeats from a closely related species via Repeatmasker, modelled and
>filtered repeats from RepeatModeler and repeats derived from a
>high-coverage 454 data set. Not sure what else I can do to improve that.
>
>As for evidence, I am using the curated reference proteome from a related
>species (<5 million years), uniprot_swissprot and high-quality ESTs + 454
>reads. I don’t think it gets a whole lot better, in terms of what data
>can be used.
>
>So in summary, I just don’t get where I want to be using Augustus and Maker
>- specifically, the gene models are full of weird, unsupported artefacts
>despite manually curating > 850 models for training. I suppose I was
>hoping for some secret trick to improve on this - but I guess there is
>none? Actually, if I only do a pure evidence build (seeing that my input
>data is very high quality), it looks better - which sort of goes against
>the premise of Maker :/
>
>Regards,
>
>Marc
>
>
>
>
>Marc P. Hoeppner, PhD
>Team Leader
>Department for Medical Biochemistry and Microbiology
>Uppsala University, Sweden
>marc.hoeppner at bils.se
>
>On 27 May 2014, at 17:25, Carson Holt <carsonhh at gmail.com> wrote:
>
>> Extra exons can be required for predictors to make sense of a region
>> (they do the best they can).  This can be due to imperfect assemblies or
>> repeats.  For plants, the repeat database is the one thing that will
>> most affect the annotation quality.  You may need to spend some time
>> building the best repeat library you can.  The repeat library is the
>> next most important thing after training the predictor, because repeats
>> confuse the predictor (sometimes a lot), causing it to behave oddly in
>> those regions (because repeats do encode real protein and protein
>> fragments).  Also, when running with MAKER now, make sure to include the
>> entire proteome of a related species and not just UniProt, and you will
>> get better performance.  Now that you have Augustus trained, using it
>> inside of MAKER with an improved repeat library and additional protein
>> evidence should give it the feedback that will allow it to perform
>> better than it would with just naked ab initio prediction.
>> 
>> Thanks,
>> Carson
>> 
>> 
>> On 5/27/14, 2:12 AM, "Marc Höppner" <marc.hoeppner at bils.se> wrote:
>> 
>>> Hi,
>>> 
>>> I wanted to get some feedback regarding the training of ab-initio gene
>>> finders - it’s not strictly Maker related, but I suppose there are many
>>> people on this list that have encountered and solved this issue in one
>>> way or another.
>>> 
>>> Specifically, I am trying to train Augustus (and possibly SNAP) for a
>>> plant genome. This has always been a very frustrating process for me,
>>> but while I have a better idea now how to do it, I still don’t get the
>>> sort of accuracy that I am hoping for. A quick run-through of my process:
>>> 
>>> Evidence build with maker on level 1 and 2 proteins from Uniprot +
>>> Sanger-sequenced EST data
>>> 
>>> Filtered for Models with an AED <= 0.3
>>> 
>>> Loaded that into WebApollo, together with an existing reference
>>> annotation and the evidence tracks
>>> 
>>> Manually curated/selected 750 gene models using the following rules:
>>> - Must have start/stop codon
>>> - Must have as many exons as possible
>>> - Must agree with evidence
>>> - Must be >= 2kb apart from other gene models (provided as flanking
>>> regions for augustus to train intergenic sequence)
>>> 
>>> From these models, I created  a GBK file, split it into 650 (train) and
>>> 100 (test) models and created a new profile using the documented
>>> procedure.
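
The mechanical parts of the selection described above (AED cutoff, spacing between models, train/test split) can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Model` record, `select_candidates`, and `split_train_test` are hypothetical names, the coordinates and AED values would have to be parsed from the MAKER GFF3 beforehand, and the checks for start/stop codons and agreement with evidence are not covered.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Model:
    # Hypothetical minimal record for one gene model.
    name: str
    seqid: str
    start: int
    end: int
    aed: float

def select_candidates(models, max_aed=0.3, min_gap=2000):
    """Keep models with AED <= max_aed that lie at least min_gap bp
    away from every other model on the same sequence."""
    by_seq = {}
    for m in models:
        by_seq.setdefault(m.seqid, []).append(m)

    kept = []
    for seq_models in by_seq.values():
        seq_models.sort(key=lambda m: m.start)
        furthest_end = None  # largest end coordinate seen so far
        for i, m in enumerate(seq_models):
            left_ok = furthest_end is None or m.start - furthest_end >= min_gap
            right_ok = (i + 1 == len(seq_models)
                        or seq_models[i + 1].start - m.end >= min_gap)
            if m.aed <= max_aed and left_ok and right_ok:
                kept.append(m)
            furthest_end = m.end if furthest_end is None else max(furthest_end, m.end)
    return kept

def split_train_test(models, n_test, seed=0):
    """Shuffle reproducibly and hold out n_test models for testing."""
    shuffled = list(models)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]

# Made-up example data, not taken from any real annotation.
models = [
    Model("A", "chr1", 1000, 3000, 0.1),
    Model("B", "chr1", 6000, 8000, 0.2),     # too close to C
    Model("C", "chr1", 9000, 10000, 0.5),    # AED too high
    Model("D", "chr1", 20000, 22000, 0.3),
    Model("E", "chr2", 500, 900, 0.0),
]
candidates = select_candidates(models)
train, test = split_train_test(candidates, n_test=1)
print([m.name for m in candidates])
```

For the 750 curated models in the message, the split would correspond to `split_train_test(curated, n_test=100)`, leaving 650 for training.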
>>> 
>>> But:
>>> 
>>> While the naked ab initio models created through maker get a lot of genes
>>> ‘sort of right’, I still see too many issues to be really satisfied.
>>> Problems include:
>>> 
>>> - random exon calls which are not supported by any line of evidence (~1
>>> per gene model, I would guess)
>>> - poor congruency with some gene models (especially ones not used for
>>> training/testing)
>>> 
>>> Is there any best-practice guide on how to improve this? The Augustus
>>> website is unfortunately quite poor on detail… My impression so far is
>>> that ramping up the number of training models isn’t really doing too
>>> much beyond a certain point (tried 400, 500 and 750).
>>> 
>>> Regards,
>>> 
>>> Marc
>>> 
>>> 
>>> Marc P. Hoeppner, PhD
>>> Team Leader
>>> BILS Genome Annotation Platform
>>> Department for Medical Biochemistry and Microbiology
>>> Uppsala University, Sweden
>>> marc.hoeppner at bils.se
>>> 
>>> 
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org