[maker-devel] Consensus gene model

Mon Nov 5 14:40:46 MST 2012

Dear Carson,

Thank you very much, this is helpful.

> The way models are generated, it really doesn't so much matter where the
> protein alignments came from.  Basically the protein alignment is just
> creating a region of potential CDS.  MAKER then gives that region as a
> hint to the gene predictors, but the gene predictors really make the call
> on how to finally structure the gene based on their training sets.  You
> can short circuit this by using the protein2genome option as a separate
> run with only your primary proteins.  MAKER will then try and turn those
> protein alignments directly into genes. Results from that run can
> sometimes be useful for generating training sets as well, or can be passed
> back into MAKER as pred_gff so MAKER has the option to turn those into
> models as an alternative to the models produced by the ab initio
> predictors.
>
> --Carson
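
As a rough sketch of the two-run setup described above: protein,
protein2genome and pred_gff are the maker_opts.ctl options named in this
thread, while the set_ctl_opt helper and every file name below are made up
for illustration. It assumes the usual "KEY=value #comment" layout of the
control file and values that contain no "|", "&" or "#" characters.

```shell
# set_ctl_opt FILE KEY VALUE  (hypothetical helper, not part of MAKER)
# Rewrite the "KEY=..." line of a MAKER control file in place and keep any
# trailing "#comment". A FILE.bak backup is left next to the edited file.
set_ctl_opt() {
    sed -i.bak -E "s|^($2)=[^#]*|\1=$3 |" "$1"
}

# Run 1 (separate run, primary proteins only, turned directly into models):
#   cp maker_opts.ctl run1_opts.ctl
#   set_ctl_opt run1_opts.ctl protein primary_proteins.fasta
#   set_ctl_opt run1_opts.ctl protein2genome 1
#
# Run 2 (main run: offer the run 1 models as an alternative to the
# ab initio predictions):
#   set_ctl_opt maker_opts.ctl pred_gff run1_models.gff
#   set_ctl_opt maker_opts.ctl protein2genome 0
```

The combined protein file (primary organism plus related species) can stay
in place for run 2; only run 1 is restricted to the primary proteins.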
>
>
> On 12-10-31 8:04 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>
>>Hi Jason, thanks for the directions on generating a training set for
>>Augustus. Also, if we provide protein sequences from the primary organism
>>as well as from other closely related species as alignment evidence, is
>>there an option to give the primary protein file precedence over the
>>others? At the moment I have all the proteins (from the primary organism
>>as well as related species) in a single file, given as the protein option
>>in maker_opts.ctl.
>>
>>Thanks and regards,
>>Parul Kudtarkar
>>
>>> Parul -
>>>
>>> I think I've posted on this before here, if you are asking how to go
>>> from SNAP training to Augustus training.
>>> http://sourceforge.net/mailarchive/message.php?msg_id=29361270
>>>
>>> I do this type of training a lot - here are some pointers.
>>>
>>> I often train by generating models using CEGMA on the genome and use
>>> these 400 or so good models as my training set.  When I have EST or
>>> RNA-Seq data I use PASA to generate the best set of annotations.
>>>
>>> For CEGMA - then I run this script that comes with MAKER:
>>> cegma2zff output.cegma.gff genome.fa
>>>
>>> Then I follow the SNAP directions
>>>
>>> fathom genome.ann genome.dna -categorize 1000
>>> fathom uni.ann uni.dna -export 1000 -plus
>>> mkdir MYGENOME
>>> cd MYGENOME
>>> forge ../export.ann ../export.dna  --OPTIONS
>>> cd ..
>>> hmm-assembler.pl MYGENOME MYGENOME > MYGENOME.snap.hmm
>>>
>>> I then also make the augustus training data like this running in the
>>> directory that has the export.ann and export.dna files:
>>> perl gene_prediction/zff2augustus_gbk.pl > train.gb
>>>
>>> using this script:
>>>
>>> https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl
>>>
>>> I also make ZFF from GFF with this script if I have the RNA-Seq aligned
>>> and the best models from PASA, incorporate all these data into my SNAP
>>> training set, and then export again back to gbk for the Augustus
>>> training.
>>>
>>> https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/pasatraining2zff.pl
>>>
>>> Then you just need to run the Augustus training (autoAugTrain.pl) on
>>> the train.gb file.
>>>
>>> Jason
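
Before running autoAugTrain.pl it can be worth checking how many records
actually ended up in train.gb. A minimal sketch, assuming the GenBank file
holds one LOCUS record per training gene; count_gb_loci is a hypothetical
helper name, not something shipped with Augustus or MAKER.

```shell
# count_gb_loci FILE
# Print the number of GenBank records in FILE by counting LOCUS header lines.
count_gb_loci() {
    grep -c '^LOCUS' "$1"
}

# Example (train.gb as produced in the step above):
#   count_gb_loci train.gb
```

If the count is far below the 400 or so CEGMA models mentioned above, the
export step is the first place to look.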
>>>
>>> On Oct 30, 2012, at 2:18 PM, Parul Kudtarkar <parulk at caltech.edu>
>>> wrote:
>>>
>>>> Hello Carson and maker community,
>>>>
>>>> Thank you very much for your guidelines on using the maker-pipeline.
>>>> Yes, it is the green sea urchin genome that we are trying to annotate.
>>>> We are running the pipeline on scaffolds, and most of these scaffolds
>>>> are small (this is the very first genome assembly). We would typically
>>>> expect 20,000 genes in this genome. So we are running maker using ESTs
>>>> and proteins from the genome and out-groups to generate a training
>>>> dataset for SNAP and Augustus. Depending on the resulting predictions
>>>> we may bootstrap the predicted genes once again using ESTs and
>>>> proteins.
>>>>
>>>> Do you have any further suggestions? Also, could you point out how to
>>>> convert the training set generated for SNAP so that it can be used as
>>>> a training set for Augustus as well? Would maker give equal weight to
>>>> SNAP and Augustus predictions when generating a gene model?
>>>>
>>>> Thanks and regards,
>>>> Parul Kudtarkar
>>>>
>>>>> One thing you seem to be missing is protein evidence.
>>>>>
>>>>> Is this a sea urchin (I looked up some of the ESTs)?  If so, I would
>>>>> recommend adding all proteins from the Strongylocentrotus purpuratus
>>>>> genome, then throw in another deuterostome of your choice. Perhaps you
>>>>> should also add a couple of outgroup organisms like Nematostella
>>>>> vectensis (cnidaria) and a protostome of your choice.  Be careful about
>>>>> adding too many protostome outgroups (e.g. C. elegans and Drosophila),
>>>>> because a big part of their evolution is gene loss (so distant cnidaria
>>>>> often match deuterostomes better than most protostomes do).
>>>>>
>>>>> You could take the maker results when protein data is included and use
>>>>> them to retrain SNAP again.
>>>>>
>>>>> Even a 22 kb contig is still really short.  Is this genome primarily
>>>>> made up of short contigs like this?  I would recommend running CEGMA
>>>>> once on this genome to get an appropriate estimate of how recoverable
>>>>> the genes are going to be (http://korflab.ucdavis.edu/datasets/cegma/).
>>>>> CEGMA will give you an estimate of genome completeness as well as
>>>>> estimates of what percentage of genes will be found in their entirety
>>>>> and what percentage will be partial genes.  This is important to do if
>>>>> your genome is fragmented, as it will give you a reasonable expectation
>>>>> of what you can expect to recover (short contigs don't annotate very
>>>>> well - you tend to lose a lot).
>>>>>
>>>>> Thanks,
>>>>> Carson
>>>>>
>>>>>
>>>>> On 12-10-15 3:45 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>>>
>>>>>> Hi Carson,
>>>>>> Thanks. I have attached another contig, which is 22 kb and has EST
>>>>>> alignments with as many as 3 exons. Could you please recommend
>>>>>> additional training? We are currently running maker on the entire
>>>>>> contig set and will eventually merge all the gff3 contig predictions.
>>>>>> Then, using the suggested parameters/methods, we would like to get a
>>>>>> consensus gene set with minimal false positives/negatives.
>>>>>> Thanks,
>>>>>> Parul
>>>>>>> The contig in question is really too small to get much out of it
>>>>>>> (only 5 kb).  There was only one single-exon EST alignment and a
>>>>>>> couple of predictions with no evidence support.  Anything smaller
>>>>>>> than 10 kb is mostly useless for annotation purposes.  You would
>>>>>>> really need a few contigs of 100 kb or longer to glean enough
>>>>>>> information for optimizing your parameters.
>>>>>>> The general suggestions for any maker run are to use proteins from a
>>>>>>> closely related organism or a couple of closely related organisms for
>>>>>>> the protein= option in maker.  Also leave single_exon set to 0, except
>>>>>>> for certain eukaryotes that have a bias for single-exon transcripts
>>>>>>> (e.g. some fungi and oomycetes).  And leave keep_preds set to 0,
>>>>>>> because ab initio predictors tend to over-predict by a wide margin
>>>>>>> (lots of false positives).
>>>>>>> Additional training would really depend on what your other contigs
>>>>>>> look like.  Do you have any large contigs?  I could look at one of
>>>>>>> those and give suggestions, but the provided contig is just too short
>>>>>>> to glean much.
>>>>>>> Thanks,
>>>>>>> Carson
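
To see how much of an assembly clears the 10 kb rule of thumb above, contig
lengths can be tallied straight from the FASTA file. A minimal sketch using
awk; the function names and genome.fa are placeholders, and the 10000
threshold simply restates the figure from this message.

```shell
# contig_lengths FASTA
# Print "name<TAB>length" for every sequence in a FASTA file. The name is the
# first word of the header line without the leading ">".
contig_lengths() {
    awk '
        /^>/ { if (seen) print name "\t" len
               name = substr($1, 2); len = 0; seen = 1; next }
             { len += length($0) }
        END  { if (seen) print name "\t" len }
    ' "$1"
}

# count_contigs_at_least FASTA MIN_BP
# Print how many sequences are MIN_BP bases or longer.
count_contigs_at_least() {
    contig_lengths "$1" | awk -v min="$2" '$2 >= min { n++ } END { print n + 0 }'
}

# Example:
#   count_contigs_at_least genome.fa 10000
```

Comparing that count with the total number of contigs gives a quick feel
for how much of the assembly is likely to be annotatable at all.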
>>>>>>> On 12-10-15 1:41 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>>>>>> Hello,
>>>>>>>> Could you please advise on the aforementioned query?
>>>>>>>> Thanks,
>>>>>>>> Parul Kudtarkar
>>>>>>>> ---------------------------- Original Message ----------------------------
>>>>>>>> Subject: [maker-devel] Consensus gene model
>>>>>>>> From:    "Parul Kudtarkar" <parulk at caltech.edu>
>>>>>>>> Date:    Fri, October 12, 2012 2:46 pm
>>>>>>>> To:      maker-devel at yandell-lab.org
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>> We are using SNAP (training set [hmm file] generated using the EST,
>>>>>>>> protein and contig files), Augustus and GeneMark-ES (we ran it
>>>>>>>> outside maker and have a gff3 file as input). The output that we get
>>>>>>>> is a combination of various gene predictors and evidence. I have
>>>>>>>> attached a sample result file. What would you recommend to get a
>>>>>>>> consensus result set? Bootstrapping the resulting gff3 file
>>>>>>>> (rerunning maker)?
>>>>>>>> Thanks,
>>>>>>>> Parul Kudtarkar
>>>>>>>> --
>>>>>>>> Scientific Programmer
>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>> Beckman Institute,
>>>>>>>> California Institute of Technology
>>>>>>>>
>>>>>>>> http://www.spbase.org
>>>>>>>> _______________________________________________
>>>>>>>> maker-devel mailing list
>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>
>>> Jason Stajich
>>> jason.stajich at gmail.com
>>> jason at bioperl.org
>>>
>>>
>>
>>
>
>
>

--
Scientific Programmer
Center for Computational Regulatory Genomics
Beckman Institute,
California Institute of Technology
http://www.spbase.org