<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">Hi Xabier,<div class=""><br class=""></div><div class="">See below --&gt; </div><div class=""><br class=""></div><div class=""><br class=""></div><div class=""><div><blockquote type="cite" class=""><div class=""><div dir="ltr" class="">I have to annotate two fungal genomes and I only have the DNA assembly (no EST or protein files). </div></div></blockquote><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class=""><div class="">I understand that lacking of EST and protein files I should provide them as alt-est and protein from the closest species I can, but is it enough with one EST file from one organism for the alt-est?</div></div></div></div></blockquote><div><br class=""></div><div>Provide alt-EST if you have ESTs from a closely related species but do not have the proteome for that species.  If you have the proteome, use that.  
Both are aligned in amino acid space and provide the same hint information; the only difference is that alt-EST takes 10x longer because both target and query must be translated into all 6 reading frames.</div><br class=""><br class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class=""><div class="">Regarding the steps to process would this be correct?:<br class=""><ol class=""><li class="">run Maker with the genome, alt-est and protein files, with est2genome=1 and protein2genome=1 (softmask=1 ?)</li><li class="">with this first output, create the hmm file for SNAP based on the first output</li><li class="">Set est2genome=0 and protein2genome=0, set the snaphmm file and run again (using -base option)<br class=""></li><li class="">repeat2 and 3 as necessary*<br class=""></li></ol></div></div></div></div></blockquote><div>If you don’t have ESTs, don’t do est2genome (alt-ESTs don’t count).  Just do protein2genome.  In general, two rounds of training is the maximum you should do.  At that point, ab initio predictions and hint-based predictions will start to look like each other (so the ab initio models are doing well on their own).</div><div><br class=""></div><br class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class=""><div class="">*How do you know when you get to the point where no more refinement is possible? Would that the final model? It should be based on the AED scores? How can I get it without looking into individual sequence headings? Also, do you perform the bootstrapping on the same folder? In the <a href="http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014" class="">tutorial </a>I saw different folders, (e.g. 
pyu_contig1, pyu_contig2) used on each repetition, not sure if just for demonstration purposes or if it is the proper way to go..<br class=""></div></div></div></div></blockquote><div><br class=""></div><div>Run it in the same folder.  This will allow MAKER to recycle raw reports from BLAST etc. from the previous run (i.e. MAKER will run faster).  In the tutorial we ran them in separate folders just to be able to open old results and compare.</div><br class=""><div><br class=""></div><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class=""><div class="">I'm trying to run also a gene prediction with Augustus and GeneMark. The first run will include an already trained profile for Augustus and the native hmm file of genemark-ES**. Do they need to repeat the prediction by bootstrap like with SNAP? If so, do I need to generate new hmm files or prediction models based on results?<br class=""></div></div></div></div></blockquote><div><br class=""></div><div>You do with Augustus, but not with GeneMark, which is self-training.</div><br class=""><br class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class=""><div class="">**I have been trying to make the hmm file for genemark-ES using the <a href="http://gm_es.pl/" class="">gm_es.pl</a> script but no matter what parameters I use the cluster shut the job down as it exceeds 128GB of memory in use. The genome I've been testing for this is about 42Mbp in a roughly 40-50 MB fasta file</div></div></div></div></blockquote><div><br class=""></div><div>You can train GeneMark with just part of the genome. Try using 10 Mb made up of the longest contigs.  Also, I only recommend using GeneMark on fungi; it tends not to work well on organisms with more complex intron/exon structures. You should also build a species-specific repeat database to supplement RepeatMasker’s internal libraries.  
I’d recommend using RepeatModeler.</div><div class=""><br class=""></div></div><br class=""><div class="">Thanks,</div></div><div class="">Carson</div><div class=""><br class=""></div></body></html>