[maker-devel] Some questions regarding ab-initio training

Tue May 27 09:25:39 MDT 2014

Predictors sometimes need extra exons to make sense of a region (they do
the best they can).  This can be due to imperfect assemblies or repeats.
For plants, the repeat database is the one thing that will most affect
annotation quality, so you may need to spend some time building the best
repeat library you can.  After training the predictor, the repeat library
is the next most important thing, because unmasked repeats confuse the
predictor (sometimes a lot) and cause it to behave oddly in those regions
(repeats do encode real proteins and protein fragments).  Also, when you
run MAKER now, make sure to include the entire proteome of a related
species and not just UniProt, and you will get better performance.  Now
that you have Augustus trained, using it inside of MAKER with an improved
repeat library and additional protein evidence should give it the feedback
that will allow it to perform better than it would with just naked ab
initio prediction.

Thanks,
Carson

On 5/27/14, 2:12 AM, "Marc Höppner" <marc.hoeppner at bils.se> wrote:

>Hi,
>
>I wanted to get some feedback regarding the training of ab-initio gene
>finders - it’s not strictly Maker related, but I suppose there are many
>people on this list that have encountered and solved this issue in one
>way or another.
>
>Specifically, I am trying to train Augustus (and possibly SNAP) for a
>plant genome. This has always been a very frustrating process for me, but
>while I have a better idea now how to do it, I still don’t get the sort
>of accuracy that I am hoping for. A quick run-through of my process:
>
>Evidence build with MAKER on level 1 and 2 proteins from UniProt +
>Sanger-sequenced EST data
>
>Filtered for models with an AED <= 0.3
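
For anyone repeating the AED filter above, here is a minimal sketch in Python. It assumes the models come from a MAKER GFF3 file in which each mRNA feature carries an `_AED=` attribute in column 9; the function name `mrna_ids_passing_aed` and the 0.3 default are only illustrative.

```python
import re

# Match "_AED=" only at the start of an attribute, so "_eAED=" is not picked up.
AED_RE = re.compile(r"(?:^|;)_AED=([0-9.]+)")
ID_RE = re.compile(r"(?:^|;)ID=([^;]+)")


def mrna_ids_passing_aed(gff_lines, max_aed=0.3):
    """Return the IDs of mRNA features whose _AED value is <= max_aed."""
    keep = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue  # also skips any embedded FASTA section
        aed = AED_RE.search(cols[8])
        ident = ID_RE.search(cols[8])
        if aed and ident and float(aed.group(1)) <= max_aed:
            keep.append(ident.group(1))
    return keep
```

Typical use would be `mrna_ids_passing_aed(open("models.gff"))`, where `models.gff` is a placeholder file name; the returned IDs can then be used to pull the matching gene models.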
>
>Loaded that into WebApollo, together with an existing reference
>annotation and the evidence tracks
>
>Manually curated/selected 750 gene models using the following rules:
>- Must have start/stop codon
>- Must have as many exons as possible
>- Must agree with evidence
>- Must be >= 2kb apart from other gene models (provided as flanking
>regions for Augustus to train intergenic sequence)
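
The spacing rule in the list above can be checked automatically rather than by eye. The sketch below is one possible way to do it, assuming gene coordinates have already been extracted as (seqid, start, end, gene_id) tuples with 1-based inclusive positions, and that the list holds every gene model on the assembly, not only the curated ones. The name `well_separated` is hypothetical. It compares all pairs per sequence, which is slow for large sets but fine for a few thousand genes.

```python
from collections import defaultdict


def well_separated(genes, min_gap=2000):
    """Return gene IDs that are at least min_gap bases from every other gene.

    genes: iterable of (seqid, start, end, gene_id), 1-based inclusive.
    """
    genes = list(genes)
    by_seq = defaultdict(list)
    for seqid, start, end, gid in genes:
        by_seq[seqid].append((start, end, gid))

    crowded = set()
    for items in by_seq.values():
        for i, (s1, e1, g1) in enumerate(items):
            for s2, e2, g2 in items[i + 1:]:
                # Bases strictly between the two intervals;
                # zero or negative when they touch or overlap.
                gap = max(s1, s2) - min(e1, e2) - 1
                if gap < min_gap:
                    crowded.update((g1, g2))

    # Preserve the input order of the surviving genes.
    return [gid for _, _, _, gid in genes if gid not in crowded]
```

Both members of a too-close pair are dropped here; keeping one of them instead would be an equally reasonable choice.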
>
>From these models, I created a GBK file, split it into 650 (train) and
>100 (test) models and created a new profile using the documented
>procedure.
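
The 650/100 split above should be random so that the test set is not biased toward one region or gene family. As far as I know the Augustus distribution includes a randomSplit.pl script for this purpose; the Python below is only a rough equivalent for illustration, assuming a GenBank flat file in which every record ends with a `//` line. The function names and the fixed seed are my own choices.

```python
import random


def split_records(gbk_text):
    """Split GenBank flat-file text into records, each ending with '//'."""
    records, current = [], []
    for line in gbk_text.splitlines(keepends=True):
        current.append(line)
        if line.strip() == "//":
            records.append("".join(current))
            current = []
    return records


def train_test_split(records, n_test=100, seed=1):
    """Shuffle reproducibly and return (train, test) lists of records."""
    if n_test > len(records):
        raise ValueError("n_test exceeds the number of records")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]
```

With 750 records and `n_test=100` this yields 650 training and 100 test records; writing each list back out with `"".join(...)` gives two valid GenBank files.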
>
>But:
>
>While the naked ab-init models created through maker get a lot of genes
>‘sort of right’, I still see too many issues to be really satisfied.
>Problems include:
>
>- random exon calls which are not supported by any line of evidence (~1
>per gene model, I would guess)
>- poor congruency with some gene models (especially ones not used for
>training/testing)
>
>Is there any best-practice guide on how to improve this? The Augustus
>website is unfortunately quite poor on detail… My impression so far is
>that ramping up the number of training models isn’t really doing too much
>beyond a certain point (tried 400, 500 and 750).
>
>Regards,
>
>Marc
>
>
>Marc P. Hoeppner, PhD
>Team Leader
>BILS Genome Annotation Platform
>Department for Medical Biochemistry and Microbiology
>Uppsala University, Sweden
>marc.hoeppner at bils.se
>
>
>_______________________________________________
>maker-devel mailing list
>maker-devel at box290.bluehost.com
>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org