[maker-devel] Advice for optimizing augustus training on fungal genome?
Jason Stajich
jason.stajich at gmail.com
Fri Jul 27 16:52:07 MDT 2012
Best option is to get RNA-Seq or some transcript evidence of course so you have something empirical.
I'm really surprised you are getting such a low count. Is this just from running augustus on the command line?
Alternatively, you have ~16,000 known proteins, which I expect are derived from proteins of related species aligned to your genome?
You could also try to generate a best set of gene predictions from these alignments. When I did this for Coprinus, I gathered full-length alignments of fungal proteins and used (at the time) genomewise and genewise to get the best set of loci with gene calls. Today I'd probably just use FASTA to find the locus and exonerate to pull out the spliced gene model, then provide these as training data. Basically this amounts to a fungal-specific CEGMA, something we ought to build but just don't have time for.
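A minimal sketch of the exonerate step, under some assumptions: the file names are placeholders, the 90% self-score cutoff is an arbitrary starting point, and the FASTA pre-screen for the locus is skipped so that exonerate searches the whole genome directly. The GFF it emits would still need filtering and conversion to the GenBank training format, which is not shown.

```shell
#!/usr/bin/env bash
# Sketch only: align known proteins to the genome and report spliced gene
# models as GFF. The function name and the --percent value are illustrative
# assumptions, not settings taken from this thread.
protein_models() {
    local proteins="$1" genome="$2"
    exonerate --model protein2genome \
        --query "$proteins" --target "$genome" \
        --bestn 1 --percent 90 \
        --showalignment no --showvulgar no --showtargetgff yes
}

# Usage (hypothetical file names):
#   protein_models fungal_proteins.fa genome.fa > protein_models.gff
```

Keeping only the best hit per protein (--bestn 1) is one simple way to favour full-length models over paralogous or partial matches.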
Do you get similar results when running genemarkHMM with its self-training?
jason
On Jul 27, 2012, at 9:38 AM, Carson Holt wrote:
> Regarding this older post: if you aren't getting good results using CEGMA
> for training augustus for a new species, then one option would be to just
> make a copy of Neurospora's species directory. Edit the necessary file
> contents to make it list as a different species, then run the augustus
> training steps as before, but this time use the Neurospora copy as the
> base, so augustus will be optimizing Neurospora's parameters to be more
> like your species of interest.
>
> Thanks,
> Carson
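The copy-and-rename step described above could be sketched as follows. This assumes the usual Augustus layout, where each species lives in $AUGUSTUS_CONFIG_PATH/species/<name>/ and every file in that directory carries the <name> prefix. The function name and the species identifiers in the usage lines are illustrative, so check the directory names in your own installation.

```shell
#!/usr/bin/env bash
# Sketch only: clone an existing Augustus species directory under a new name.
# Assumes species names contain only letters, digits and underscores, so they
# are safe to use in the sed pattern below.
clone_species() {
    local src="$1" dst="$2" cfg="${3:-$AUGUSTUS_CONFIG_PATH}"
    local from="$cfg/species/$src" to="$cfg/species/$dst"

    [ -d "$from" ] || { echo "no such species: $from" >&2; return 1; }
    [ ! -e "$to" ] || { echo "already exists: $to" >&2; return 1; }

    cp -r "$from" "$to" || return 1

    # Rename <src>_parameters.cfg, <src>_weightmatrix.txt, ... to the new prefix.
    local f base
    for f in "$to"/"$src"*; do
        [ -e "$f" ] || continue
        base=$(basename "$f")
        mv "$f" "$to/$dst${base#"$src"}"
    done

    # Point references inside the .cfg files at the renamed files.
    for f in "$to"/*.cfg; do
        [ -f "$f" ] || continue
        sed "s/$src/$dst/g" "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done
}

# Usage (hypothetical names), followed by the training steps from this thread:
#   clone_species neurospora_crassa myspecies_nc
#   optimize_augustus.pl --species=myspecies_nc genes.gb.train
#   etraining --species=myspecies_nc genes.gb.train
```

The function refuses to overwrite an existing species directory, so the original Neurospora parameters and any earlier training run stay untouched.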
>
>
>
> On 12-06-28 11:11 AM, "Fourie Joubert" <fourie.joubert at up.ac.za> wrote:
>
>> Hi Everyone
>>
>> Apologies if this is not the relevant list to mail to.
>>
>> I am looking for advice in training augustus for a novel fungal genome.
>>
>> I generated a gene set using CEGMA (below), and have subsequently been
>> following the instructions at
>> http://www.molecularevolution.org/molevolfiles/exercises/augustus/scipio.html
>> and at
>> http://www.molecularevolution.org/molevolfiles/exercises/augustus/training.html.
>>
>> My training set is 339 genes and the test set is 100 genes.
>>
>> My initial output is below.
>>
>> It does not improve much with optimize_augustus.
>>
>> When using the training parameters to predict genes in the genome, I seem
>> to only find around 2,000 of the known ~16,000 genes. When I use the
>> training data from a distantly related fungus (Neurospora), I get
>> roughly the correct number of genes.
>>
>> I am obviously doing something wrong here... (commands below).
>>
>> I would really appreciate any advice on where to start looking for
>> improvement.
>>
>> Kindest regards!
>>
>> Fourie
>>
>>
>>
>>
>>
>> Augustus commands (Edited myspecies_parameters.cfg and set
>> stopCodonExcludedFromCDS to true.):
>>
>>> etraining --species=myspecies genes.gb.train
>>
>>> augustus --species=myspecies genes.gb.test | tee firsttest.out
>>
>>> grep -A 22 Evaluation firsttest.out
>>
>>> optimize_augustus.pl --species=myspecies genes.gb.train
>>
>>> etraining --species=myspecies genes.gb.train
>>
>>> augustus --species=myspecies genes.gb.test | tee secondtest.out
>>
>>> grep -A 22 Evaluation secondtest.out
>>
>>
>>
>> CEGMA output:
>>
>> #  Statistics of the completeness of the genome based on 248 CEGs  #
>>
>>             #Prots  %Completeness  -  #Total  Average  %Ortho
>>
>> Complete       240          96.77  -     278     1.16   11.67
>>
>>   Group 1        64          96.97  -      72     1.12    7.81
>>   Group 2        54          96.43  -      66     1.22   18.52
>>   Group 3        58          95.08  -      70     1.21   13.79
>>   Group 4        64          98.46  -      70     1.09    7.81
>>
>> Partial        245          98.79  -     290     1.18   13.88
>>
>>   Group 1        65          98.48  -      73     1.12    7.69
>>   Group 2        56         100.00  -      70     1.25   21.43
>>   Group 3        59          96.72  -      75     1.27   18.64
>>   Group 4        65         100.00  -      72     1.11    9.23
>>
>>
>>
>>
>> Augustus output:
>>
>> *******      Evaluation of gene prediction     *******
>>
>> ---------------------------------------------\
>>                  | sensitivity | specificity |
>> ---------------------------------------------|
>> nucleotide level |       0.933 |       0.772 |
>> ---------------------------------------------/
>>
>> ----------------------------------------------------------------------------------------------------------\
>>            |  #pred |  #anno |      |    FP = false pos. |    FN = false neg. |             |             |
>>            | total/ | total/ |   TP |--------------------|--------------------| sensitivity | specificity |
>>            | unique | unique |      | part | ovlp | wrng | part | ovlp | wrng |             |             |
>> ----------------------------------------------------------------------------------------------------------|
>>            |        |        |      |                229 |                 85 |             |             |
>> exon level |    475 |    331 |  246 | ------------------ | ------------------ |       0.743 |       0.518 |
>>            |    475 |    331 |      |   59 |    9 |  161 |   56 |    2 |   27 |             |             |
>> ----------------------------------------------------------------------------------------------------------/
>>
>> ----------------------------------------------------------------------------\
>> transcript | #pred | #anno |   TP |   FP |   FN | sensitivity | specificity |
>> ----------------------------------------------------------------------------|
>> gene level |   158 |   100 |   45 |  113 |   55 |        0.45 |       0.285 |
>> ----------------------------------------------------------------------------/
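The exon- and gene-level figures in the evaluation table above follow directly from the counts: sensitivity is TP / #anno and specificity is TP / #pred. A small sketch to recompute them when comparing runs (the function name is illustrative):

```shell
#!/usr/bin/env bash
# Recompute sensitivity (TP / #anno) and specificity (TP / #pred) from the
# counts in an Augustus evaluation table.
accuracy() {
    local tp="$1" pred="$2" anno="$3"
    awk -v tp="$tp" -v pred="$pred" -v anno="$anno" 'BEGIN {
        printf "sensitivity=%.3f specificity=%.3f\n", tp / anno, tp / pred
    }'
}

accuracy 246 475 331   # exon level: sensitivity=0.743 specificity=0.518
accuracy 45 158 100    # gene level: sensitivity=0.450 specificity=0.285
```

At the gene level, only 45 of the 100 annotated test genes are reproduced exactly, while 113 of the 158 predicted genes do not match an annotation.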
>>
>>
>>
>>
>> --
>> --------------
>> Prof Fourie Joubert
>> Bioinformatics and Computational Biology Unit
>> Department of Biochemistry
>> University of Pretoria
>> fourie.joubert at up.ac.za
>> http://www.bi.up.ac.za
>> Tel. +27-12-420-5825
>> Fax. +27-12-420-5800
>>
>> -------------------------------------------------------------------------
>> This message and attachments are subject to a disclaimer. Please refer
>> to www.it.up.ac.za/documentation/governance/disclaimer/ for full details.
>>
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org