<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Hi Felipe,<div><br></div><div>I think that plan sounds quite reasonable.  To address your primary concern, most gene prediction tools recommend a minimum of a few hundred gene models for training.  Since you're an order of magnitude above that, I think you're in good shape.  Having said that, if you have concerns about biases in your training set, you may be able to supplement it further by using a tool like CEGMA (<a href="http://korflab.ucdavis.edu/datasets/cegma/">http://korflab.ucdavis.edu/datasets/cegma/</a>) to include high-confidence genes that your set is missing.</div><div><br></div><div>Since the final gene set will only be as complete as the gene predictions that MAKER has to choose from, I would suggest that you also consider including at least one other gene predictor.  Augustus works well on a wide variety of genomes, and while it is more difficult to train than SNAP, it does accept hints from MAKER and will likely add to the diversity of the final gene set, even if you choose to use an existing HMM that has some reasonable relationship to your genome.  
This is one of the advantages of MAKER supervision: while it would be best to train Augustus as well, MAKER will ensure that the final models are not too far out of line with the evidence, and you'll likely see quite good results using a custom SNAP HMM and an existing Augustus HMM as predictors within MAKER.</div><div><br></div><div>Thanks,</div><div><br></div><div>B</div><div><br><div><div>On Mar 18, 2014, at 10:08 AM, Felipe Barreto wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"><div dir="ltr"><span style="font-family:arial,sans-serif;font-size:12.727272033691406px">Hi, all,</span><div style="font-family:arial,sans-serif;font-size:12.727272033691406px"><br></div><div style="font-family:arial,sans-serif;font-size:12.727272033691406px">

I've been learning a lot from reading posts from this group, and finally started doing actual runs of Maker on our current genome assembly (arthropod, genome size ~230Mb).  I started by training SNAP, but would like to check my approach before continuing with longer runs.  </div>

<div style="font-family:arial,sans-serif;font-size:12.727272033691406px"><br></div><div style="font-family:arial,sans-serif;font-size:12.727272033691406px">From our full set of ~40,000 ESTs (RNA-seq assembly), I chose ~2000 that I deemed of very high quality based on blast alignments to Swiss-Prot (based on query-subject coverage, bit score, etc).  I then used only these 2000 ESTs in a first Maker run using est2genome=1.  The output returned 1500 models (with the 500 "missing" models probably a result of single-exon issues; not a concern at this point).</div>

<div style="font-family:arial,sans-serif;font-size:12.727272033691406px"><br></div><div style="font-family:arial,sans-serif;font-size:12.727272033691406px">I now plan on training SNAP with this first output, and then doing another Maker run now using: 1) all ESTs (but est2genome=0), 2) my chosen protein evidence, and 3) SNAP with the first HMM file.  The output of this second run will be used to re-train SNAP, and this second HMM file will be used in a final "official" run (while continuing to provide the EST and protein evidence, of course).</div>

<div style="font-family:arial,sans-serif;font-size:12.727272033691406px"><br></div><div style="font-family:arial,sans-serif;font-size:12.727272033691406px">Does this sound like a reasonable approach?  Simply put, my main concern is whether I'm using too few ESTs in my first est2genome step.</div>

<div style="font-family:arial,sans-serif;font-size:12.727272033691406px"><br></div><div style="font-family:arial,sans-serif;font-size:12.727272033691406px">Thanks for any insight!</div><div><br></div>-- <br>Felipe Barreto<br>

Post-doctoral Scholar<br>Scripps Institution of Oceanography<br>University of California, San Diego<br>La Jolla, CA 92093

</div>

</blockquote></div><br><div>

<span class="Apple-style-span" style="border-collapse: separate; color: rgb(0, 0, 0); font-family: Helvetica; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-align: auto; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-border-horizontal-spacing: 0px; -webkit-border-vertical-spacing: 0px; -webkit-text-decorations-in-effect: none; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; "><div><span class="Apple-style-span" style="font-family: Arial; font-size: 12px; "><div>Barry Moore</div><div>Research Scientist</div><div>Dept. of Human Genetics</div><div>University of Utah</div><div>Salt Lake City, UT 84112</div><div>--------------------------------------------</div><div>(801) 585-3543</div><div><br class="khtml-block-placeholder"></div></span></div><div><br></div></span><br class="Apple-interchange-newline">

</div>

<br></div></body></html>