[maker-devel] Consensus gene model

Wed Oct 31 18:04:33 MDT 2012

Hi Jason, thanks for the directions on generating a training set for Augustus.
Also, when we provide protein sequences as alignment evidence from the
primary organism as well as from other closely related species, is there an
option to give the primary protein file precedence over the others?
At the moment I have all the proteins (from the primary organism as well as
related species) in a single file, given as the protein option in maker_opts.ctl.
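
A minimal sketch of how the single combined file could be built so that each
protein still shows where it came from. The function name tag_fasta, the
labels and the file names are hypothetical, and this is not a MAKER
precedence setting; it only keeps the origin visible in the sequence IDs.

```shell
# tag_fasta LABEL FILE: prefix every FASTA header in FILE with "LABEL|".
# Sequence lines are passed through unchanged.
tag_fasta() {
    awk -v label="$1" '/^>/ { sub(/^>/, ">" label "|") } { print }' "$2"
}
```

For example, assuming files named primary.fa and related.fa:
tag_fasta primary primary.fa > all_proteins.fa
tag_fasta related related.fa >> all_proteins.fa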

Thanks and regards,
Parul Kudtarkar

> Parul -
>
> I think I've posted on this before here if you are asking how to go from
> SNAP training to Augustus training.
> http://sourceforge.net/mailarchive/message.php?msg_id=29361270
>
> I do this type of training a lot - here are some pointers.
>
> I often train by generating models using CEGMA on the genome and take these
> 400 or so good models as my training set.  When I have EST or RNA-Seq data I
> use PASA to generate the best set of annotations.
>
> For CEGMA, I then run this script that comes with MAKER:
> cegma2zff output.cegma.gff genome.fa
>
> Then I follow the SNAP directions
>
> fathom genome.ann genome.dna -categorize 1000
> fathom uni.ann uni.dna -export 1000 -plus
> mkdir MYGENOME
> cd MYGENOME
> forge ../export.ann ../export.dna  --OPTIONS
> cd ..
> hmm-assembler.pl MYGENOME MYGENOME > MYGENOME.snap.hmm
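
A quick way to see how many gene models made it into a ZFF file such as
genome.ann or export.ann, sketched under an assumption about the layout:
sequence headers start with ">" and each feature line carries the model name
in its last column. The function name count_zff_models is hypothetical.

```shell
# count_zff_models FILE: count distinct (sequence, model) pairs in a ZFF file.
count_zff_models() {
    awk '
        /^>/    { seq = $1; next }               # remember the current sequence
        NF >= 4 { seen[seq SUBSEP $NF] = 1 }     # last column is the model name
        END     { n = 0; for (k in seen) n++; print n }
    ' "$1"
}
```

Running count_zff_models export.ann after the export step shows how many
models are left for training.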
>
> I then also make the Augustus training data like this, running in the
> directory that has the export.ann and export.dna files:
> perl gene_prediction/zff2augustus_gbk.pl > train.gb
>
> using this script:
> https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl
>
> I also make ZFF from GFF with this script when I have RNA-Seq aligned and
> best models from PASA. I incorporate all these data into my SNAP
> training set, and then export back to gbk again for the Augustus training.
> https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/pasatraining2zff.pl
>
> Then you just need to run the Augustus training (autoAugTrain.pl) on the
> train.gb file.
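
Since the steps above chain several separate programs, a small pre-flight
check can save a failed run. This is only a sketch: missing_tools is a
hypothetical helper, and it assumes the programs named in this thread are
expected to be on PATH.

```shell
# missing_tools TOOL...: print each named tool that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

For example:
missing_tools cegma2zff fathom forge hmm-assembler.pl autoAugTrain.pl
prints nothing when everything is installed.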
>
> Jason
>
> On Oct 30, 2012, at 2:18 PM, Parul Kudtarkar <parulk at caltech.edu> wrote:
>
>> Hello Carson and maker community,
>>
>> Thank you very much for your guidelines on using the maker pipeline.
>> Yes, it is the green sea urchin genome that we are trying to annotate.
>> We are running the pipeline on scaffolds, and most of these scaffolds are
>> small (this is the very first genome assembly). We would typically expect
>> 20,000 genes in this genome. So we are running maker using ESTs and
>> proteins from the genome and out-groups to generate a training dataset for
>> SNAP and Augustus. Depending on the resulting predictions we may bootstrap
>> the predicted genes once again using ESTs and proteins.
>>
>> Do you have any further suggestions? Also, could you point out how to
>> convert the training set generated for SNAP into a training set for
>> Augustus as well? Would maker give equal weight to SNAP and Augustus
>> predictions when generating a gene model?
>>
>> Thanks and regards,
>> Parul Kudtarkar
>>
>>> One thing you seem to be missing is protein evidence.
>>>
>>> Is this a sea urchin (I looked up some of the ESTs)?  If so, I would
>>> recommend adding all proteins from the Strongylocentrotus purpuratus
>>> genome, then throw in another deuterostome of your choice.  Perhaps you
>>> should also add a couple of outgroup organisms like Nematostella
>>> vectensis (cnidaria) and a protostome of your choice.  Be careful about
>>> adding too many protostome outgroups (e.g. C. elegans and Drosophila),
>>> because a big part of their evolution is gene loss (so distant cnidaria
>>> often match deuterostomes better than most protostomes do).
>>>
>>> You could take the maker results when protein data is included and use
>>> them to retrain SNAP again.
>>>
>>> Even a 22 kb contig is still really short.  Is this genome primarily
>>> made up of short contigs like this?  I would recommend running CEGMA
>>> once on this genome to get an appropriate estimate of how recoverable the
>>> genes are going to be (http://korflab.ucdavis.edu/datasets/cegma/).
>>> CEGMA will give you an estimate of genome completeness as well as
>>> estimates of what percentage of genes will be found in their entirety and
>>> what percentage will be partial genes.  This is important to do if your
>>> genome is fragmented, as it will give you a reasonable expectation of what
>>> you can expect to recover (short contigs don't annotate very well; you
>>> tend to lose a lot).
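
To get a first impression of how fragmented an assembly is, the contig
lengths can be tallied directly from the FASTA file. This is a rough sketch:
contig_summary is a hypothetical helper, and the 10000 default is taken from
the 10 kb rule of thumb mentioned elsewhere in this thread, not from any
tool.

```shell
# contig_summary FASTA [MIN_LEN]: report how many sequences there are and
# how many are at least MIN_LEN bases long (default 10000).
contig_summary() {
    awk -v min="${2:-10000}" '
        function tally() { total++; if (len >= min) big++ }
        /^>/ { if (seen) tally(); seen = 1; len = 0; next }
        { len += length($0) }                    # sequence may span many lines
        END {
            if (seen) tally()
            printf "contigs=%d min_len=%d at_least_min=%d\n", total, min, big
        }
    ' "$1"
}
```

For example, contig_summary genome.fa 100000 counts the contigs of 100 kb
or more.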
>>>
>>> Thanks,
>>> Carson
>>>
>>>
>>> On 12-10-15 3:45 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>
>>>> Hi Carson,
>>>> Thanks. I have attached another contig, which is 22 kb, with EST
>>>> alignments covering as many as 3 exons. Could you please recommend
>>>> additional training? We are currently running maker on the entire contig
>>>> set and will eventually merge all the gff3 contig predictions. Then,
>>>> using the suggested parameters/methods, we would like to get a consensus
>>>> gene set with minimal false positives/negatives.
>>>> Thanks,
>>>> Parul
>>>>> The contig in question is really too small to get much out of it
>>>>> (only 5 kb).  There was only one single-exon EST alignment and a couple
>>>>> of predictions with no evidence support.  Anything smaller than 10 kb is
>>>>> mostly useless for annotation purposes.  You would really need a few
>>>>> contigs of 100 kb or longer to glean enough information for optimizing
>>>>> your parameters.
>>>>>
>>>>> The general suggestions for any maker run are to use proteins from a
>>>>> closely related organism or a couple of closely related organisms for
>>>>> the protein= option in maker.  Also leave single_exon set to 0, except
>>>>> for certain eukaryotes that have a bias for single-exon transcripts
>>>>> (e.g. some fungi and oomycetes).  And leave keep_preds set to 0, because
>>>>> ab initio predictors tend to over-predict by a wide margin (lots of
>>>>> false positives).
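
To check how these three settings currently stand in a control file, a
one-line filter is enough. A sketch only: show_maker_opts is a hypothetical
helper, and it assumes each option starts its own line as name=value, as in
maker_opts.ctl.

```shell
# show_maker_opts FILE: print the protein, single_exon and keep_preds lines.
show_maker_opts() {
    grep -E '^(protein|single_exon|keep_preds)=' "$1"
}
```

For example: show_maker_opts maker_opts.ctl
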
>>>>>
>>>>> Additional training would really depend on what your other contigs
>>>>> look like.  Do you have any large contigs?  I could look at one of those
>>>>> and give suggestions, but the provided contig is just too short to glean
>>>>> much.
>>>>>
>>>>> Thanks,
>>>>> Carson
>>>>> On 12-10-15 1:41 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>>>> Hello,
>>>>>> Please advise on the aforementioned query.
>>>>>> Thanks,
>>>>>> Parul Kudtarkar
>>>>>> ---------------------------- Original Message ----------------------------
>>>>>> Subject: [maker-devel] Consensus gene model
>>>>>> From:    "Parul Kudtarkar" <parulk at caltech.edu>
>>>>>> Date:    Fri, October 12, 2012 2:46 pm
>>>>>> To:      maker-devel at yandell-lab.org
>>>>>> --------------------------------------------------------------------------
>>>>>> Hi,
>>>>>> We are using snap (training set [hmm file] generated using EST, protein
>>>>>> and contig files), augustus and GeneMark-ES (we ran it outside maker and
>>>>>> have its gff3 file as input). The output that we get is a combination of
>>>>>> various gene predictors and evidence. I have attached a sample result
>>>>>> file. What would you recommend to get a consensus result set?
>>>>>> Bootstrapping the resulting gff3 file (rerunning maker)?
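
To see which predictors and evidence types contributed to a combined result
file, the source column can be tallied. A minimal sketch, assuming standard
GFF3 (nine tab-separated columns with the source in column 2 and an optional
##FASTA section at the end); gff3_sources is a hypothetical helper.

```shell
# gff3_sources FILE: count feature lines per source (GFF3 column 2).
gff3_sources() {
    awk -F '\t' '
        /^##FASTA/      { exit }                 # stop before embedded sequence
        /^#/ || NF < 9  { next }                 # skip comments and short lines
                        { n[$2]++ }
        END             { for (s in n) print n[s] "\t" s }
    ' "$1" | LC_ALL=C sort -k2
}
```

For example: gff3_sources scaffold1.gff (the file name is a placeholder).
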
>>>>>> Thanks,
>>>>>> Parul Kudtarkar
>>>>>> --
>>>>>> Scientific Programmer
>>>>>> Center for Computational Regulatory Genomics
>>>>>> Beckman Institute,
>>>>>> California Institute of Technology
>>>>>> http://www.spbase.org
>>>>>> _______________________________________________
>>>>>> maker-devel mailing list
>>>>>> maker-devel at box290.bluehost.com
>>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>
>>>
>>>
>>
>>
>
> Jason Stajich
> jason.stajich at gmail.com
> jason at bioperl.org
>
>

--
Scientific Programmer
Center for Computational Regulatory Genomics
Beckman Institute,
California Institute of Technology
http://www.spbase.org