[maker-devel] Size of initial EST training set for SNAP

Tue Mar 18 10:26:45 MDT 2014

Hi Felipe,

I think that plan sounds quite reasonable.  To address your primary concern, most gene prediction tools recommend a minimum of a few hundred gene models to train on.  Since you're an order of magnitude above that, I think you're in good shape.  Having said that, if you have concerns about biases in your training set, you may be able to supplement it by using a tool like CEGMA (http://korflab.ucdavis.edu/datasets/cegma/) to include high-confidence genes that your set is missing.

Since the final gene set will only be as complete as the gene predictions that MAKER has to choose from, I would suggest that you also consider including at least one other gene predictor.  Augustus works well on a wide variety of genomes.  While it is more difficult to train than SNAP, it does accept hints from MAKER and will likely add to the diversity of the final gene set, even if you choose to use an existing HMM that has some reasonable relationship to your genome.  This is one of the advantages of MAKER supervision: while it would be best to train Augustus as well, MAKER will ensure that the final models are not too far out of line with the evidence, and you'll likely see quite good results using a custom SNAP HMM and an existing Augustus HMM as predictors within MAKER.
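
As a rough illustration, the three runs described in the quoted message, plus an existing Augustus model in the final run, could be written down as per-run overrides to maker_opts.ctl.  This is only a sketch: est, protein, est2genome, snaphmm and augustus_species are MAKER control file options, but every file name and the "fly" species value below are placeholders I made up, not values from this thread.

```python
# Sketch only: all file names and the Augustus species are hypothetical
# placeholders; substitute your own evidence files and HMMs.
STAGES = [
    # Run 1: build gene models directly from the high-quality EST subset.
    ("run1_est2genome", {
        "est": "high_quality_ests.fasta",
        "est2genome": "1",
    }),
    # Run 2: all ESTs and proteins as evidence, first SNAP HMM as predictor.
    ("run2_first_snap", {
        "est": "all_ests.fasta",
        "protein": "proteins.fasta",
        "est2genome": "0",
        "snaphmm": "snap_round1.hmm",
    }),
    # Final run: retrained SNAP HMM plus an existing Augustus species model.
    ("final", {
        "est": "all_ests.fasta",
        "protein": "proteins.fasta",
        "est2genome": "0",
        "snaphmm": "snap_round2.hmm",
        "augustus_species": "fly",
    }),
]


def render_overrides(options):
    """Format a dict of options as key=value lines, as in maker_opts.ctl."""
    return "\n".join(f"{key}={value}" for key, value in options.items())


if __name__ == "__main__":
    for name, options in STAGES:
        print(f"# {name}")
        print(render_overrides(options))
        print()
```

Running the script prints the lines to change for each run; the rest of maker_opts.ctl (genome, repeat masking, and so on) stays as you already have it.  The Augustus model could also be switched on in the second run if you want its predictions to feed into the SNAP retraining.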

Thanks,

B

On Mar 18, 2014, at 10:08 AM, Felipe Barreto wrote:

> Hi, all,
> 
> I've been learning a lot from reading posts from this group, and finally started doing actual runs of Maker on our current genome assembly (arthropod, genome size ~230Mb).  I started by training SNAP, but would like to check my approach before continuing with longer runs.  
> 
> From our full set of ~40,000 ESTs (RNA-seq assembly), I chose ~2000 that I deemed of very high quality based on BLAST alignments to Swiss-Prot (query-subject coverage, bit score, etc.).  I then used only these 2000 ESTs in a first Maker run using est2genome=1.  The output returned 1500 models (with the 500 "missing" models probably a result of single-exon issues; not a concern at this point).
> 
> I now plan on training SNAP with this first output, and then doing another Maker run using: 1) all ESTs (but est2genome=0), 2) my chosen protein evidence, and 3) SNAP with the first HMM file.  The output of this second run will be used to re-train SNAP, and this second HMM file will be used in a final "official" run (while continuing to provide the EST and protein evidence, of course).
> 
> Does this sound like a reasonable approach?  Simply put, my main concern is whether I'm using too few ESTs in my first est2genome step.
> 
> Thanks for any insight!
> 
> -- 
> Felipe Barreto
> Post-doctoral Scholar
> Scripps Institution of Oceanography
> University of California, San Diego
> La Jolla, CA 92093

Barry Moore
Research Scientist
Dept. of Human Genetics
University of Utah
Salt Lake City, UT 84112
--------------------------------------------
(801) 585-3543