[maker-devel] Augustus retraining
Carson Holt
carsonhh at gmail.com
Tue Mar 24 09:38:08 MDT 2015
I’d pick a couple of species that are as closely related as you can find. Proteins will still align over large evolutionary distances, but you need a breadth of protein data that UniRef and even databases like Swiss-Prot won’t have (those databases are usually a little too conservative).
The gene-level sensitivity/specificity values make certain assumptions about the models you are comparing with, chiefly that those reference models are themselves correct. Unfortunately the only species these assumptions hold true for are Drosophila and, to a point, maybe C. elegans. This is the reason you will rarely see species other than those two used for the gene and exon sensitivity/specificity metrics.
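
To make the difference between the levels concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the evaluation code that Augustus or MAKER uses. It assumes one transcript per gene, exons given as 1-based inclusive (start, end) pairs on a single sequence and strand, and the gene-finding convention where specificity means TP / (TP + FP). The function names and the toy coordinates are made up for the example.

```python
def covered_bases(exons):
    """Return the set of positions covered by 1-based inclusive (start, end) exons."""
    bases = set()
    for start, end in exons:
        bases.update(range(start, end + 1))
    return bases


def nucleotide_level(reference, predicted):
    """Sensitivity and specificity counted per coding base."""
    ref = covered_bases(exon for gene in reference for exon in gene)
    pred = covered_bases(exon for gene in predicted for exon in gene)
    tp = len(ref & pred)
    sensitivity = tp / len(ref) if ref else 0.0
    specificity = tp / len(pred) if pred else 0.0  # TP / (TP + FP)
    return sensitivity, specificity


def gene_level(reference, predicted):
    """Sensitivity and specificity where a gene counts only if every exon matches exactly.

    Assumption: one transcript per gene, so a gene is just its exon list.
    """
    ref = {tuple(sorted(gene)) for gene in reference}
    pred = {tuple(sorted(gene)) for gene in predicted}
    tp = len(ref & pred)
    sensitivity = tp / len(ref) if ref else 0.0
    specificity = tp / len(pred) if pred else 0.0
    return sensitivity, specificity


# Toy data (hypothetical): the first predicted gene misses one exon end by 10 bp.
reference = [[(1, 100), (201, 300)], [(501, 600)]]
predicted = [[(1, 100), (201, 290)], [(501, 600)]]

print("nucleotide level:", nucleotide_level(reference, predicted))
print("gene level:", gene_level(reference, predicted))
```

In this toy case a single exon boundary that is off by 10 bp drops gene-level sensitivity to 50%, while nucleotide-level sensitivity stays near 97%.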
Thanks,
Carson
> On Mar 24, 2015, at 9:05 AM, Panos Ioannidis <panos.ioannidis at gmail.com> wrote:
>
> Yes, 6% is gene-level sensitivity. Exon-level is 62% and nucleotide-level is 88%. I only mentioned gene-level because that's the only metric mentioned on the Augustus website.
>
> I got these numbers outside of Maker. Actually, I only used Maker to generate the gff files needed to start the training (ran it using only EST evidence and only on a subset of my assembly, using this <http://avrilomics.blogspot.ch/2013/04/training-augustus-gene-finding-software.html> as a guide).
>
> Now, I've started running the second round of training, as you suggested. However, since I don't have data from closely related species, I'm only using UniRef50 as protein evidence.
>
> P
>
> On Tue, Mar 24, 2015 at 3:39 PM, Carson Holt <carsonhh at gmail.com> wrote:
> On your first round it is fine. It gives the predictor enough to work with, and on the second round you use improved models. When you say 6% sensitivity, is that Augustus running on its own? If it’s inside of MAKER, that means you are not providing sufficient protein evidence (you need the full proteome of at least two related species). Also, is that the gene-level, exon-level, or nucleotide-level sensitivity? If you are looking at the gene-level sensitivity measure, you only get a match when you perfectly match all transcripts in a gene (models that may not be correct in the first place). This value will rarely go above 10% for any predictor. You need to use the nucleotide-level sensitivity/specificity metrics. The gene- and exon-level metrics are basically meaningless (unless it’s Drosophila, which is the only species annotated correctly enough to use them).
>
> —Carson
>
>
>> On Mar 24, 2015, at 8:31 AM, Panos Ioannidis <panos.ioannidis at gmail.com> wrote:
>>
>> Hi Carson,
>>
>> So you think it's okay to include incomplete gene models when training Augustus?
>>
>> I'll certainly try the bootstrap method you're suggesting. Even though I did it for SNAP, for some weird reason I forgot it for Augustus :p Do you think, however, that I can get a big improvement in gene-level sensitivity? Currently, I have only 6%...
>>
>> Thanks,
>> Panos
>>
>>
>> On Tue, Mar 24, 2015 at 3:14 PM, Carson Holt <carsonhh at gmail.com> wrote:
>> Hi Panos,
>>
>> ESTs and mRNA-seq assemblies will by their nature be partial. After a first round of training you can run MAKER together with protein and EST evidence and the newly trained Augustus species file. Because MAKER gives hints to Augustus as it runs, the models it produces will be improved over what you would get from just running Augustus on its own. Then take these gene models and use them to retrain Augustus. This is the standard bootstrap retraining procedure, and it can be repeated as needed.
>>
>> More info on bootstrap training is here (the info is for SNAP, but the procedure is similar for Augustus): http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014#Training_ab_initio_Gene_Predictors
>> Here is an excellent explanation of Augustus training: http://brie4.cshl.edu/pipermail/gmod-help/2012-June/001724.html
>> And here is a tool to convert SNAP training files to Augustus training files (MAKER comes with a tool that converts GFF3 for SNAP training, so just take that output and convert it for Augustus): https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl
>>
>> Finally, you can also manually edit the GFF3 file in Apollo (it is easier to use the legacy standalone version), and then convert that file for bootstrap training.
>>
>> —Carson
>>
>>
>>> On Mar 24, 2015, at 6:24 AM, Panos Ioannidis <panos.ioannidis at gmail.com> wrote:
>>>
>>> Hi Xabier,
>>>
>>> Thanks for your quick reply!
>>>
>>> No, I haven't used WebAugustus, but I just checked it out and it looks like my training set is too big (~300 Mbp), so I can't even upload it!
>>>
>>> Anyway, I prefer to train it locally because I have better control over each step. Also, I have done the entire training procedure with fewer genes, but didn't get a good gene-level sensitivity (~5%). So now I'm trying to replicate it using more of my scaffolds, but it appears I get a lot more incomplete models from exonerate (run through Maker).
>>>
>>> P
>>>
>>>
>>>
>>> On Tue, Mar 24, 2015 at 1:06 PM, Xabier Vázquez Campos <xvazquezc at gmail.com> wrote:
>>> Hi Panos,
>>>
>>> Have you tried using webAugustus for the (re)training? I found it very convenient for generating the models for Augustus.
>>>
>>> Cheers,
>>>
>>> 2015-03-24 19:29 GMT+11:00 Panos Ioannidis <panos.ioannidis at gmail.com>:
>>> Hello All,
>>>
>>> I'm trying to retrain Augustus using EST data from the same species and realized that quite a few of the gene models I get based on EST data are incomplete (i.e. no start and/or stop codon).
>>>
>>> Now, when I get to the "etraining" step in Augustus retraining (right after the time-consuming "optimize_augustus.pl" step), I get a warning for each gene that doesn't contain a start or stop codon.
>>>
>>> .....
>>> gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.1-mRNA-1 transcr. 1 in sequence scaffold4|size2210279_2021791-2044735: Initial exon does not begin with start codon but with acg
>>> gene maker-scaffold4|size2210279-exonerate_est2genome-gene-20.2-mRNA-1 transcr. 1 in sequence scaffold4|size2210279_2045713-2064983: Terminal exon doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right?
>>> ....
>>>
>>> Does anyone know whether training is compromised by such incomplete gene models? Do you usually exclude them from the training set?
>>>
>>> Oh, and by the way, the best guide to retraining Augustus is here <http://avrilomics.blogspot.ch/2013/04/training-augustus-gene-finding-software.html>. The official web page <http://bioinf.uni-greifswald.de/augustus/binaries/retraining.html> isn't bad, but it doesn't explain certain things in detail.
>>>
>>> Thanks,
>>> Panos
>>>
>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>
>>>
>>>
>>>
>>> --
>>> Xabier Vázquez Campos
>>> PhD Candidate
>>> Water Research Centre
>>> School of Civil and Environmental Engineering
>>> The University of New South Wales
>>> Sydney NSW 2052 AUSTRALIA
>>>
>>
>>
>
>