[maker-devel] Consensus gene model

Parul Kudtarkar parulk at caltech.edu
Tue Oct 30 15:18:06 MDT 2012


Hello Carson and maker community,

Thank you very much for your guidelines on using the maker pipeline.  Yes,
it is the green sea urchin genome that we are trying to annotate.
We are running the pipeline on scaffolds, and most of these scaffolds are
small (this is the very first genome assembly). We would typically expect
20,000 genes in this genome. So we are running maker using ESTs and proteins
from the genome and out-groups to generate a training dataset for SNAP and
Augustus. Depending on the resulting predictions, we may bootstrap the
predicted genes once again using ESTs and proteins.
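
Since most of the scaffolds are small, a quick length summary may help show
how much of the assembly falls under the size that the reply quoted below
describes as mostly useless for annotation (10 kb). The following is only a
rough sketch and not part of maker: the function names are made up for
illustration, the 10 kb default is taken from that quoted advice, and the
input is assumed to be a plain FASTA file.

```python
from typing import Dict, Iterable, List


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {sequence_id: length} for FASTA-formatted lines."""
    lengths: Dict[str, int] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the ID (hypothetical choice).
            parts = line[1:].split()
            name = parts[0] if parts else "seq%d" % (len(lengths) + 1)
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def n50(values: List[int]) -> int:
    """Length L such that sequences of length >= L hold half of all bases."""
    total = sum(values)
    running = 0
    for length in sorted(values, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def summarize(lengths: Dict[str, int], min_len: int = 10_000) -> Dict[str, float]:
    """Summarize how much of the assembly is in sequences shorter than min_len."""
    values = list(lengths.values())
    short = [v for v in values if v < min_len]
    total = sum(values)
    return {
        "sequences": len(values),
        "total_bases": total,
        "n50": n50(values),
        "short_sequences": len(short),
        "short_base_fraction": sum(short) / total if total else 0.0,
    }
```

It could be run as `summarize(fasta_lengths(open("scaffolds.fasta")))`, where
`scaffolds.fasta` is a placeholder name for the assembly file.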

Do you have any further suggestions? Also, could you point out how to convert
the training set generated for SNAP so that it can be used as a training set
for Augustus as well? Would maker give equal weight to SNAP and Augustus
predictions when generating the gene model?

Thanks and regards,
Parul Kudtarkar

> One thing you seem to be missing is protein evidence.
>
> Is this a sea urchin (I looked up some of the ESTs)?  If so, I would
> recommend adding all proteins from the Strongylocentrotus purpuratus
> genome, then throw in another deuterostome of your choice.  Perhaps you
> should also add a couple of outgroup organisms like Nematostella
> vectensis (cnidaria) and a protostome of your choice.  Be careful about
> adding too many protostome outgroups (e.g. C. elegans and Drosophila)
> because a big part of their evolution is gene loss (so distant cnidaria
> often match deuterostomes better than most protostomes do).
>
> You could take the maker results when protein data is included and use
> them to retrain SNAP again.
>
> Even a 22 kb contig is still really short.  Is this genome primarily
> made up of short contigs like this?  I would recommend running CEGMA
> once on this genome to get an appropriate estimate of how recoverable
> the genes are going to be (http://korflab.ucdavis.edu/datasets/cegma/).
> CEGMA will give you an estimate of genome completeness as well as
> estimates of what percentage of genes will be found in their entirety
> and what percentage will be partial genes.  This is important to do if
> your genome is fragmented, as it will give you a reasonable expectation
> of what you can expect to recover (short contigs don't annotate very
> well; you tend to lose a lot).
>
> Thanks,
> Carson
>
>
> On 12-10-15 3:45 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>
>>Hi Carson,
>>Thanks. I have attached another contig, which is 22 kb, with EST
>>alignments of as many as 3 exons. Could you please recommend additional
>>training? We are currently running maker on the entire contig set and will
>>eventually merge all the gff3 contig predictions. Then, using the suggested
>>parameters/methods, we would like to get a consensus gene set with minimal
>>false positives/negatives.
>>Thanks,
>>Parul
>>> The contig in question is really too small to get much out of it (only
>>> 5 kb).  There was only one single-exon EST alignment and a couple of
>>> predictions with no evidence support.  Anything smaller than 10 kb is
>>> mostly useless for annotation purposes.  You would really need a few
>>> contigs of 100 kb or longer to glean enough information for optimizing
>>> your parameters.
>>>
>>> The general suggestions for any maker run are to use proteins from a
>>> closely related organism or a couple of closely related organisms for
>>> the protein= option in maker.  Also leave single_exon set to 0, except
>>> for certain eukaryotes that have a bias for single exon transcripts
>>> (e.g. some fungi and oomycetes).  And leave keep_preds set to 0 because
>>> ab initio predictors tend to over-predict by a wide margin (lots of
>>> false positives).
>>>
>>> Additional training would really depend on what your other contigs look
>>> like.  Do you have any large contigs?  I could look at one of those and
>>> give suggestions, but the provided contig is just too short to glean
>>> much.
>>>
>>> Thanks,
>>> Carson
>>> On 12-10-15 1:41 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>>Hello,
>>>>Please advise on the aforementioned query.
>>>>Thanks,
>>>>Parul Kudtarkar
>>>>---------------------------- Original Message ----------------------------
>>>>Subject: [maker-devel] Consensus gene model
>>>>From:    "Parul Kudtarkar" <parulk at caltech.edu>
>>>>Date:    Fri, October 12, 2012 2:46 pm
>>>>To:      maker-devel at yandell-lab.org
>>>>--------------------------------------------------------------------------
>>>>Hi,
>>>>We are using SNAP (training set [HMM file] generated using EST, protein
>>>>and contig files), Augustus, and genemarkE (we ran it outside maker and
>>>>have a gff3 file as input). The output that we get is a combination of
>>>>various gene predictors and evidence. I have attached a sample result
>>>>file. What would you recommend to get a consensus result set?
>>>>Bootstrapping the resulting gff3 file (rerunning maker)?
>>>>Thanks,
>>>>Parul Kudtarkar
>>>>--
>>>>Scientific Programmer
>>>>Center for Computational Regulatory Genomics
>>>>Beckman Institute,
>>>>California Institute of Technology
>>>>http://www.spbase.org
>>>>_______________________________________________
>>>>maker-devel mailing list
>>>>maker-devel at box290.bluehost.com
>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org


--
Scientific Programmer
Center for Computational Regulatory Genomics
Beckman Institute,
California Institute of Technology
http://www.spbase.org