[maker-devel] Some questions regarding ab-initio training
Marc Höppner
marc.hoeppner at bils.se
Thu Jun 5 02:15:55 MDT 2014
Hi,
thanks for the feedback. I spent some more time on this and am still somewhat unsatisfied with the whole thing…
A few comments:
I quite frequently see Augustus, and by extension Maker, including exons that are not supported by EST/protein evidence and are not critical for the gene model (i.e. I can take them out and still get a proper CDS). Maybe I don’t know enough about how Maker creates hints and, more importantly, what role these hints play for Augustus, but I cannot really see a great effect (any, really) on the gene models, even when both ESTs and proteins contradict an Augustus gene model and the surplus exon is non-essential.
(all evidence is provided as fasta files, protein2genome and est2genome are set to 0)
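To put a number on the "unsupported exon" problem rather than eyeballing it, something like the following could be used. It is only a minimal sketch, assuming that exon and evidence alignment coordinates have already been pulled out of the Maker GFF3 output for a single contig and strand; the coordinates and the function names `overlaps` and `unsupported_exons` are hypothetical and not part of Maker or Augustus.

```python
from typing import List, Tuple

# 1-based, inclusive (start, end) coordinates, as used in GFF3.
Interval = Tuple[int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two closed intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def unsupported_exons(exons: List[Interval],
                      evidence: List[Interval]) -> List[Interval]:
    """Return the predicted exons that overlap no evidence alignment block."""
    return [ex for ex in exons
            if not any(overlaps(ex, ev) for ev in evidence)]


# Hypothetical coordinates for one gene model (same contig, same strand).
predicted = [(100, 250), (400, 520), (700, 760), (900, 1100)]
# Hypothetical EST and protein alignment blocks covering the same locus.
est_and_protein_blocks = [(95, 250), (405, 520), (900, 1080)]

flagged = unsupported_exons(predicted, est_and_protein_blocks)
print(flagged)  # exons with no overlapping evidence
```

Counting flagged exons per gene model before and after a change to the training set or repeat library would give a rough, comparable measure of whether the change helped. Any overlap counts as support here, which is deliberately lenient; a stricter version could require agreement at the splice sites.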
As for the repeat library, I suppose this is a critical point. I am using repeats from a closely related species via RepeatMasker, modelled and filtered repeats from RepeatModeler and repeats derived from a high-coverage 454 data set. Not sure what else I can do to improve that.
As for evidence, I am using the curated reference proteome from a related species (<5 million years), uniprot_swissprot and high-quality ESTs + 454 reads. I don’t think it gets a whole lot better, in terms of what data can be used.
So in summary, I just don’t get where I want to be using Augustus and Maker - specifically, the gene models are full of weird, unsupported artefacts despite manually curating > 850 models for training. I suppose I was hoping for some secret trick to improve on this - but I guess there is none? Actually, if I only do a pure evidence build (seeing that my input data is very high quality), it looks better - which sort of goes against the premise of Maker :/
Regards,
Marc
Marc P. Hoeppner, PhD
Team Leader
Department for Medical Biochemistry and Microbiology
Uppsala University, Sweden
marc.hoeppner at bils.se
On 27 May 2014, at 17:25, Carson Holt <carsonhh at gmail.com> wrote:
> Extra exons can be required for predictors to make sense of a region (they
> do the best they can). This can be due to imperfect assemblies or
> repeats. For plants the repeat database is the one thing that will
> most affect the annotation quality. You may need to spend some time
> building the best repeat library you can. The repeat library is the most
> important thing next to training the predictor, because repeats confuse
> the predictor (sometimes a lot), causing it to behave oddly in those
> regions (because repeats do encode real protein and protein fragments).
> Also when running now with MAKER make sure to include the entire proteome
> of a related species and not just UniProt, and you will get better
> performance. Now that you have Augustus trained, using it inside of MAKER
> with an improved repeat library and additional protein evidence should
> give it the feedback that will allow it to perform better than it would
> with just naked ab initio prediction.
>
> Thanks,
> Carson
>
>
> On 5/27/14, 2:12 AM, "Marc Höppner" <marc.hoeppner at bils.se> wrote:
>
>> Hi,
>>
>> I wanted to get some feedback regarding the training of ab-initio gene
>> finders - it’s not strictly Maker related, but I suppose there are many
>> people on this list that have encountered and solved this issue in one
>> way or another.
>>
>> Specifically, I am trying to train Augustus (and possibly SNAP) for a
>> plant genome. This has always been a very frustrating process for me, but
>> while I have a better idea now how to do it, I still don’t get the sort
>> of accuracy that I am hoping for. A quick run-through of my process:
>>
>> Evidence build with maker on level 1 and 2 proteins from Uniprot +
>> Sanger-sequenced EST data
>>
>> Filtered for models with an AED <= 0.3
>>
>> Loaded that into WebApollo, together with an existing reference
>> annotation and the evidence tracks
>>
>> Manually curated/selected 750 gene models using the following rules:
>> - Must have start/stop codon
>> - Must have as many exons as possible
>> - Must agree with evidence
>> - Must be >= 2 kb apart from other gene models (provided as flanking
>> regions for augustus to train intergenic sequence)
>>
>> From these models, I created a GBK file, split it into 650 (train) and
>> 100 (test) models and created a new profile using the documented
>> procedure.
>>
>> But:
>>
>> While the naked ab-init models created through maker get a lot of genes
>> ‘sort of right’, I still see too many issues to be really satisfied.
>> Problems include:
>>
>> - random exon calls which are not supported by any line of evidence (~1
>> per gene model, I would guess)
>> - poor congruency with some gene models (especially ones not used for
>> training/testing)
>>
>> Is there any best-practice guide on how to improve this? The Augustus
>> website is unfortunately quite poor on detail… My impression so far is
>> that ramping up the number of training models isn’t really doing too much
>> beyond a certain point (tried 400, 500 and 750).
>>
>> Regards,
>>
>> Marc
>>
>>
>> Marc P. Hoeppner, PhD
>> Team Leader
>> BILS Genome Annotation Platform
>> Department for Medical Biochemistry and Microbiology
>> Uppsala University, Sweden
>> marc.hoeppner at bils.se
>>
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org