[maker-devel] Consensus gene model

Tue Oct 30 18:52:43 MDT 2012

Parul -

If you are asking how to go from SNAP training to Augustus training, I think I've posted about this on the list before:
http://sourceforge.net/mailarchive/message.php?msg_id=29361270

I do this type of training a lot, so here are some pointers.

I often train by running CEGMA on the genome and using the 400 or so good models it produces as my training set.  When I have EST or RNA-Seq data, I use PASA to generate the best set of annotations.

For CEGMA, I then run this script that comes with MAKER:
cegma2zff output.cegma.gff genome.fa

Then I follow the SNAP directions

fathom genome.ann genome.dna -categorize 1000
fathom uni.ann uni.dna -export 1000 -plus
mkdir MYGENOME
cd MYGENOME
forge ../export.ann ../export.dna  --OPTIONS
cd ..
hmm-assembler.pl MYGENOME MYGENOME > MYGENOME.snap.hmm
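
As a rough sketch, the SNAP steps above could be wrapped in a shell function like the one below. The names run and train_snap and the DRY_RUN switch are made up for illustration and are not part of SNAP, and the --OPTIONS placeholder for forge is left out. The function steps back up to the parent directory before calling hmm-assembler.pl, since that script takes the parameter directory as its second argument.

```shell
#!/usr/bin/env bash
# Illustrative wrapper around the SNAP training steps (not part of SNAP).
# Set DRY_RUN=1 to print each command instead of running it.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

train_snap() {
    # $1: name used for both the parameter directory and the HMM
    # (MYGENOME is only a placeholder default)
    local name
    name="${1:-MYGENOME}"

    # genome.ann / genome.dna are the ZFF files from the cegma2zff step
    run fathom genome.ann genome.dna -categorize 1000 || return 1
    run fathom uni.ann uni.dna -export 1000 -plus || return 1

    # build the parameter files inside their own directory
    run mkdir -p "$name" || return 1
    run cd "$name" || return 1
    run forge ../export.ann ../export.dna || return 1
    run cd .. || return 1

    # the redirect cannot go through run(), so handle it separately
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "hmm-assembler.pl $name $name > $name.snap.hmm"
    else
        hmm-assembler.pl "$name" "$name" > "$name.snap.hmm"
    fi
}
```

With DRY_RUN=1 set, calling train_snap MYGENOME only prints the commands, which is a cheap way to check the sequence before running it for real.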

I then also make the Augustus training data like this, running in the directory that has the export.ann and export.dna files:
perl gene_prediction/zff2augustus_gbk.pl > train.gb

using this script:
https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl
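
A quick sanity check after the conversion might look like the sketch below. It assumes the converter writes one GenBank record per sequence in export.ann, which I have not verified against the script, so treat a mismatch as a prompt to look closer rather than as proof of an error. The function names are hypothetical.

```shell
# Illustrative check: compare the number of sequences in the ZFF
# annotation with the number of GenBank records written for Augustus.

count_zff_seqs() {
    # ZFF starts each sequence with a FASTA-style ">" header line
    grep -c '^>' "$1"
}

count_gbk_records() {
    # each GenBank record starts with a LOCUS line
    grep -c '^LOCUS' "$1"
}

check_training_counts() {
    # $1: ZFF annotation (e.g. export.ann), $2: GenBank file (e.g. train.gb)
    local zff gbk
    # grep -c exits non-zero on a count of 0, so do not treat that as fatal
    zff=$(count_zff_seqs "$1") || true
    gbk=$(count_gbk_records "$2") || true
    echo "ZFF sequences: $zff  GenBank records: $gbk"
    # succeed only when the two counts agree
    [ "$zff" -eq "$gbk" ]
}
```

For example, check_training_counts export.ann train.gb prints both counts and returns non-zero if they differ.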

I also make ZFF from GFF with this script when I have RNA-Seq aligned and the best models from PASA. I incorporate all of these data into my SNAP training set, and then export back to gbk again for the Augustus training:
https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/pasatraining2zff.pl

Then you just need to run the Augustus training (autoAugTrain.pl) on the train.gb file.
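
A minimal sketch of that last step is below. It only checks that the training file has at least one record and then prints the command to run. The --trainingset and --species flags are written from memory of the script's usage text, so check autoAugTrain.pl's own help output before relying on them. The function name and the MYGENOME species name are placeholders.

```shell
# Illustrative helper: refuse to start Augustus training on an empty file,
# otherwise print the autoAugTrain.pl command line to run.

augustus_train_cmd() {
    # $1: GenBank training file, $2: species name (both placeholder defaults)
    local gb species n
    gb="${1:-train.gb}"
    species="${2:-MYGENOME}"

    # count GenBank records; a missing file leaves n empty
    n=$(grep -c '^LOCUS' "$gb" 2>/dev/null) || true
    if [ "${n:-0}" -eq 0 ]; then
        echo "no GenBank records found in $gb" >&2
        return 1
    fi

    # flags assumed from the script's usage text; verify before use
    echo "autoAugTrain.pl --trainingset=$gb --species=$species"
}
```

Running augustus_train_cmd train.gb MYGENOME prints the command, which can then be pasted or piped to a shell once the flags have been confirmed.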

Jason

On Oct 30, 2012, at 2:18 PM, Parul Kudtarkar <parulk at caltech.edu> wrote:

> Hello Carson and maker community,
> 
> Thank you very much for your guidelines on using the maker-pipeline.  Yes,
> it is the green sea urchin genome that we are trying to annotate.
> We are running it on scaffolds, and most of these scaffolds are small in
> size (this is the very first genome assembly). We would typically expect
> 20,000 genes in this genome. So we are running maker using ESTs and proteins
> from the genome and out-groups to generate a training dataset for SNAP and
> Augustus. Depending on the resulting predictions we may bootstrap the
> predicted genes once again using ESTs and proteins.
> 
> Do you have any further suggestions? Also, could you point out how to convert
> the training set generated for SNAP so that it can be used as a training set
> for Augustus as well? Would maker give equal weight to SNAP and Augustus
> predictions when generating a gene model?
> 
> Thanks and regards,
> Parul Kudtarkar
> 
>> One thing you seem to be missing is protein evidence.
>> 
>> Is this a sea urchin (I looked up some of the ESTs)?  If so, I would
>> recommend adding all proteins from the Strongylocentrotus purpuratus
>> genome, then throw in another deuterostome of your choice. Perhaps you
>> should also add a couple of outgroup organisms like Nematostella
>> vectensis (cnidaria) and a protostome of your choice.  Be careful about
>> adding too many protostome outgroups (e.g. C. elegans and Drosophila)
>> because a big part of their evolution is gene loss (so distant cnidaria
>> often match deuterostomes better than most protostomes do).
>> 
>> You could take the maker results when protein data is included and use
>> it to retrain SNAP again.
>> 
>> Even a 22 kb contig is still really short.  Is this genome primarily
>> made up of short contigs like this?  I would recommend running CEGMA
>> once on this genome to get an appropriate estimate of how recoverable
>> the genes are going to be (http://korflab.ucdavis.edu/datasets/cegma/).
>> CEGMA will give you an estimate of genome completeness as well as
>> estimates of what percentage of genes will be found in their entirety
>> and what percentage will be partial genes.  This is important to do if
>> your genome is fragmented, as it will give you a reasonable expectation
>> of what you can expect to recover (short contigs don't annotate very
>> well, and you tend to lose a lot).
>> 
>> Thanks,
>> Carson
>> 
>> 
>> On 12-10-15 3:45 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>> 
>>> Hi Carson,
>>> Thanks. I have attached another contig, which is 22 kb, with EST
>>> alignments of as many as 3 exons. Could you please recommend additional
>>> training? We are currently running maker on the entire contig set and
>>> will eventually merge all the gff3 contig predictions. Then, using the
>>> suggested parameters/methods, we would like to get a consensus gene set
>>> with minimal false positives/negatives.
>>> Thanks,
>>> Parul
>>>> The contig in question is really too small to get much out of it (only 5
>>>> kb).  There was only a single-exon EST alignment and a couple of
>>>> predictions with no evidence support.  Anything smaller than 10 kb is
>>>> mostly useless for annotation purposes.  You would really need a few
>>>> contigs of 100 kb or longer to glean enough information for optimizing
>>>> your parameters.
>>>> The general suggestions for any maker run are to use proteins from a
>>>> closely related organism or a couple of closely related organisms for the
>>>> protein= option in maker.  Also leave single_exon set to 0, except for
>>>> certain eukaryotes that have a bias for single exon transcripts (e.g. some
>>>> fungi and oomycetes).  And leave keep_preds set to 0 because ab initio
>>>> predictors tend to over-predict by a wide margin (lots of false
>>>> positives).
>>>> Additional training would really depend on what your other contigs look
>>>> like.  Do you have any large contigs?  I could look at one of those and
>>>> give suggestions, but the provided contig is just too short to glean
>>>> much.
>>>> Thanks,
>>>> Carson
>>>> On 12-10-15 1:41 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>>> Hello,
>>>>> Please advise on the aforementioned query.
>>>>> Thanks,
>>>>> Parul Kudtarkar
>>>>> ---------------------------- Original Message ----------------------------
>>>>> Subject: [maker-devel] Consensus gene model
>>>>> From:    "Parul Kudtarkar" <parulk at caltech.edu>
>>>>> Date:    Fri, October 12, 2012 2:46 pm
>>>>> To:      maker-devel at yandell-lab.org
>>>>> --------------------------------------------------------------------------
>>>>> Hi,
>>>>> We are using SNAP (training set [hmm file] generated using EST, protein
>>>>> and contig files), Augustus, and genemarkE (we ran it outside maker and
>>>>> have the gff3 file as input). The output that we get is a combination of
>>>>> various gene predictors and evidence. I have attached a sample result
>>>>> file. What would you recommend to get a consensus result set?
>>>>> Bootstrapping the resulting gff3 file (rerunning maker)?
>>>>> Thanks,
>>>>> Parul Kudtarkar
>>>>> --
>>>>> Scientific Programmer
>>>>> Center for Computational Regulatory Genomics
>>>>> Beckman Institute,
>>>>> California Institute of Technology
>>>>> http://www.spbase.org
>>>>> _______________________________________________
>>>>> maker-devel mailing list
>>>>> maker-devel at box290.bluehost.com
>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org