<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; color: rgb(0, 0, 0); font-size: 14px; font-family: Calibri, sans-serif;"><div>That sounds good.  1,500 initial models should be more than sufficient for the first round of training.</div><div><br></div><div>—Carson</div><div><br></div><div><br></div><span id="OLK_SRC_BODY_SECTION"><div style="font-family:Calibri; font-size:11pt; text-align:left; color:black; BORDER-BOTTOM: medium none; BORDER-LEFT: medium none; PADDING-BOTTOM: 0in; PADDING-LEFT: 0in; PADDING-RIGHT: 0in; BORDER-TOP: #b5c4df 1pt solid; BORDER-RIGHT: medium none; PADDING-TOP: 3pt"><span style="font-weight:bold">From: </span> Felipe Barreto <<a href="mailto:fbarreto@ucsd.edu">fbarreto@ucsd.edu</a>><br><span style="font-weight:bold">Date: </span> Tuesday, March 18, 2014 at 10:08 AM<br><span style="font-weight:bold">To: </span> MAKER group <<a href="mailto:maker-devel@yandell-lab.org">maker-devel@yandell-lab.org</a>><br><span style="font-weight:bold">Subject: </span> [maker-devel] Size of initial EST training set for SNAP<br></div><div><br></div><div dir="ltr"><span style="font-family: arial, sans-serif; font-size: 12.727272033691406px;">Hi, all,</span><div style="font-family:arial,sans-serif;font-size:12.727272033691406px"><br></div><div style="font-family:arial,sans-serif;font-size:12.727272033691406px">

I've been learning a lot from reading posts from this group, and finally started doing actual runs of MAKER on our current genome assembly (arthropod, genome size ~230 Mb). I started by training SNAP, but would like to check my approach before continuing with longer runs.</div><div style="font-family:arial,sans-serif;font-size:12.727272033691406px"><br></div><div style="font-family:arial,sans-serif;font-size:12.727272033691406px">From our full set of ~40,000 ESTs (RNA-seq assembly), I chose ~2,000 that I deemed to be of very high quality based on BLAST alignments to Swiss-Prot (judged by query-subject coverage, bit score, etc.). I then used only these 2,000 ESTs in a first MAKER run with est2genome=1. The output returned 1,500 models (the 500 "missing" models are probably a result of single-exon issues; not a concern at this point).</div><div style="font-family:arial,sans-serif;font-size:12.727272033691406px"><br></div><div style="font-family:arial,sans-serif;font-size:12.727272033691406px">I now plan to train SNAP with this first output, and then do another MAKER run using: 1) all ESTs (but with est2genome=0), 2) my chosen protein evidence, and 3) SNAP with the first HMM file. The output of this second run will be used to re-train SNAP, and this second HMM file will be used in a final "official" run (while continuing to provide the EST and protein evidence, of course).</div><div style="font-family:arial,sans-serif;font-size:12.727272033691406px"><br></div><div style="font-family:arial,sans-serif;font-size:12.727272033691406px">Does this sound like a reasonable approach? Simply put, my main concern is whether I'm using too few ESTs in my first est2genome step.</div><div style="font-family:arial,sans-serif;font-size:12.727272033691406px"><br></div><div style="font-family:arial,sans-serif;font-size:12.727272033691406px">Thanks for any insight!</div><div><br></div>-- <br>Felipe Barreto<br>

Post-doctoral Scholar<br>Scripps Institution of Oceanography<br>University of California, San Diego<br>La Jolla, CA 92093

</div>
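<div><br></div><div>For readers who want to reproduce the EST pre-selection step described above, here is a minimal sketch of one way to filter on query/subject coverage and bit score. It is not the script used in the message: the <code>BlastHit</code> record, the <code>select_high_quality</code> function, and both thresholds are hypothetical, and the sketch assumes all lengths have already been converted to the same units (for example, amino acids).</div>

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    """One EST-vs-Swiss-Prot alignment (hypothetical record layout).

    All lengths are assumed to be in the same units, e.g. amino acids.
    """
    query_id: str
    query_len: int
    subject_len: int
    align_len: int
    bitscore: float


def select_high_quality(hits, min_cov=0.9, min_bitscore=200.0):
    """Return sorted query IDs that have at least one hit passing every cutoff.

    Coverage is approximated as alignment length divided by sequence length.
    The default cutoffs are illustrative assumptions, not recommended values.
    """
    keep = set()
    for h in hits:
        query_cov = h.align_len / h.query_len
        subject_cov = h.align_len / h.subject_len
        if (query_cov >= min_cov
                and subject_cov >= min_cov
                and h.bitscore >= min_bitscore):
            keep.add(h.query_id)
    return sorted(keep)
```

<div>The selected IDs could then be used to pull the matching sequences into a separate FASTA file for the first est2genome=1 run. Requiring coverage on both the query and the subject is a design choice meant to favor ESTs that look full-length.</div>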


</span></body></html>