[maker-devel] Size of initial EST training set for SNAP

Tue Mar 18 10:59:39 MDT 2014

Thanks, guys, for the swift and informative response!  I will try to train
Augustus again, but at the very least, will include it with an arthropod
HMM in my final run (in addition to my custom SNAP HMM).

Cheers,

Felipe

On Tue, Mar 18, 2014 at 9:26 AM, Barry Moore <barry.utah at gmail.com> wrote:

> Hi Felipe,
>
> I think that plan sounds quite reasonable.  To address your primary
> concern, most gene prediction tools recommend a minimum of a few hundred
> gene models to train on.  Since you're an order of magnitude above that, I
> think you're in good shape.  Having said that, if you have concerns about
> biases in your training set, you may be able to supplement it further by
> using a tool like CEGMA (
> http://korflab.ucdavis.edu/datasets/cegma/) to include high confidence
> genes that your set is missing.
>
> Since the final gene set will only be as complete as the gene predictions
> that MAKER has to choose from, I would suggest that you also consider
> including at least one other gene predictor.  Augustus works well on a wide
> variety of genomes, and while it is more difficult to train than SNAP, it
> does accept hints from MAKER and will likely add to the diversity of the
> final gene set, even if you choose to use an existing HMM that has some
> reasonable relationship to your genome.  This is one of the advantages of
> MAKER supervision: while it would be best to train Augustus as well, MAKER
> will ensure that the final models are not too far out of line with the
> evidence, and you'll likely see quite good results using a custom SNAP HMM
> and an existing Augustus HMM as predictors within MAKER.
>
> Thanks,
>
> B
>
> On Mar 18, 2014, at 10:08 AM, Felipe Barreto wrote:
>
> Hi, all,
>
> I've been learning a lot from reading posts from this group, and finally
> started doing actual runs of Maker on our current genome assembly
> (arthropod, genome size ~230Mb).  I started by training SNAP, but would
> like to check my approach before continuing with longer runs.
>
> From our full set of ~40,000 ESTs (RNA-seq assembly), I chose ~2000 that I
> deemed of very high quality based on BLAST alignments to Swiss-Prot (judged
> by query-subject coverage, bit score, etc.).  I then used only these 2000
> ESTs in a first Maker run using est2genome=1.  The output returned 1500
> models (with the 500 "missing" models probably a result of single-exon
> issues; not a concern at this point).
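
[Editor's sketch] The EST selection described above could look roughly like the following. This is only an illustration, not the poster's actual script: it assumes BLAST+ tabular output produced with `-outfmt "6 std qlen slen"`, and the function name and all three thresholds are hypothetical placeholders, since the original cutoffs are not given.

```python
def select_high_quality(blast_lines, min_query_cov=0.9,
                        min_subject_cov=0.9, min_bitscore=200.0):
    """Return the set of query IDs with at least one hit passing all cutoffs.

    Expects tab-separated lines in BLAST+ '6 std qlen slen' layout:
    qseqid sseqid pident length mismatch gapopen qstart qend
    sstart send evalue bitscore qlen slen
    The default thresholds are illustrative assumptions only.
    """
    keep = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        qseqid = fields[0]
        qstart, qend, sstart, send = (int(x) for x in fields[6:10])
        bitscore = float(fields[11])
        qlen, slen = int(fields[12]), int(fields[13])
        # Fraction of each sequence spanned by the alignment; abs() handles
        # hits reported on the reverse strand (start > end).
        query_cov = (abs(qend - qstart) + 1) / qlen
        subject_cov = (abs(send - sstart) + 1) / slen
        if (query_cov >= min_query_cov
                and subject_cov >= min_subject_cov
                and bitscore >= min_bitscore):
            keep.add(qseqid)
    return keep
```

The returned IDs would then be used to pull the matching sequences out of the full EST FASTA file before the first est2genome run.
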
>
> I now plan on training SNAP with this first output, and then doing another
> Maker run now using: 1) all ESTs (but est2genome=0), 2) my chosen protein
> evidence, and 3) SNAP with the first HMM file.  The output of this second
> run will be used to re-train SNAP, and this second HMM file will be used in
> a final "official" run (while continuing to provide the EST and protein
> evidence, of course).
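
[Editor's sketch] The three-run plan above can be summarized as the settings that change between runs in `maker_opts.ctl`. The option names (`est`, `protein`, `snaphmm`, `est2genome`) are MAKER control-file keys; every file name below is a hypothetical placeholder, and `est2genome=0` for the final run is an assumption, since the post does not state it.

```python
# Settings that differ between runs; all file names are hypothetical.
ROUNDS = {
    "run1_est2genome": {
        "est": "high_quality_ests.fasta",  # the ~2000 selected ESTs
        "est2genome": 1,                   # build models directly from ESTs
    },
    "run2_first_snap": {
        "est": "all_ests.fasta",
        "protein": "proteins.fasta",
        "snaphmm": "snap_run1.hmm",        # trained on run 1 output
        "est2genome": 0,
    },
    "run3_final": {
        "est": "all_ests.fasta",
        "protein": "proteins.fasta",
        "snaphmm": "snap_run2.hmm",        # re-trained on run 2 output
        "est2genome": 0,                   # assumed; not stated in the post
    },
}


def render_ctl_lines(settings):
    """Format a settings dict as key=value lines for a control file."""
    return [f"{key}={value}" for key, value in settings.items()]


if __name__ == "__main__":
    for name, settings in ROUNDS.items():
        print(f"# {name}")
        print("\n".join(render_ctl_lines(settings)))
```

The printed lines are meant to be merged by hand into a full control file; the sketch does not generate the remaining MAKER options.
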
>
> Does this sound like a reasonable approach?  Simply put, my main concern
> is whether I'm using too few ESTs in my first est2genome step.
>
> Thanks for any insight!
>
> --
> Felipe Barreto
> Post-doctoral Scholar
> Scripps Institution of Oceanography
> University of California, San Diego
> La Jolla, CA 92093
>
>
> Barry Moore
> Research Scientist
> Dept. of Human Genetics
> University of Utah
> Salt Lake City, UT 84112
> --------------------------------------------
> (801) 585-3543

-- 
Felipe Barreto
Post-doctoral Scholar
Scripps Institution of Oceanography
University of California, San Diego
La Jolla, CA 92093