<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">est2genome and protein2genome should only be used for initial training. They are not predictors; rather, they take an EST/protein alignment, find the longest ORF, and turn that ORF directly into a gene model. This is good enough to build a training dataset, but the models will almost always be partial and fragmented. Also, because the alignments both produce and support the models, they always score well, so their AED values are meaningless. Once you have a trained predictor, you should turn est2genome and protein2genome off. The alignments will then serve as hints to Augustus as to where introns/exons are likely to be, and this will give the desired behavior.<div class=""><br class=""></div><div class="">Note that Augustus will attempt to build the most probable model given the hints and the assembly sequence. If there are any assembly issues affecting the ORF, the predictor will often skip exons or split the model at the locus. Also make sure you have built a species-specific repeat library to add to the default repeat libraries used by MAKER (you can use tools like RepeatModeler to do this). Otherwise, much of your evidence will align spuriously and Augustus will generate false positives. You may also want to add a large dataset like UniProt/Swiss-Prot to the protein evidence. <div class=""><div class=""><br class=""></div><div class="">The best way to evaluate annotations and performance is to visually review the annotations in a tool like Apollo. 
This will let you see whether the evidence, gene predictions, and final models reach consensus, whether alignments don’t match (spurious alignments generally suggest a repeat-masking or evidence-quality issue), or whether raw ab initio predictions don’t match (which indicates insufficient training or underlying assembly issues).</div><div class=""><br class=""></div><div class="">--Carson</div><div class=""><br class=""><div class=""><br class=""></div><div class=""><br class=""><div><blockquote type="cite" class=""><div class="">On Nov 16, 2016, at 8:01 PM, Prashant Narendra SHINGATE &lt;<a href="mailto:prashantns@imcb.a-star.edu.sg" class="">prashantns@imcb.a-star.edu.sg</a>&gt; wrote:</div><br class="Apple-interchange-newline"><div class=""><div class="WordSection1" style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;"><div style="margin: 0cm 0cm 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif;" class="">Hi Carson,</div><p class="MsoNormal" style="margin: 0cm 0cm 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif;"> </p><div style="margin: 0cm 0cm 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif;" class="">We are annotating the genome of a fish with a relatively small genome (~450 Mb) using MAKER and are encountering many genes that are split and predicted as multiple genes. <span style="" class="">We are using Augustus for de novo prediction. 
Fortunately, we have full-length RNAseq transcripts for about 4,000 genes (~50k transcripts in total) from the same species, and whole-genome protein sequences from a very closely related species.</span></div><p class="MsoNormal" style="margin: 0cm 0cm 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif;"><span style="font-size: 10pt; font-family: Arial, sans-serif;" class=""> </span></p><div style="margin: 0cm 0cm 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif;" class="">First, we trained Augustus using the ~4,000 full-length RNAseq transcripts from the same species. This trained Augustus model was used in the MAKER annotation pipeline along with the ~50k RNAseq transcripts (&gt;1000 bp) and whole-genome protein sequences from a closely related species.</div><p class="MsoNormal" style="margin: 0cm 0cm 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif;"><span style="" class=""> </span></p><div style="margin: 0cm 0cm 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif;" class=""><span style="font-size: 10pt; font-family: Arial, sans-serif;" class="">We first tried annotating 
using the options </span>est2genome=1, protein2genome=1 and Augustus ON. We found that several genes were split, and the program seemed to give more weight to the Augustus prediction despite full-length RNAseq and protein sequences being aligned to the predicted gene loci (visualized using JBrowse). </div><p class="MsoNormal" style="margin: 0cm 0cm 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif;"> </p><div style="margin: 0cm 0cm 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif;" class="">In the next trial, we used est2genome=1, protein2genome=1 and Augustus OFF in the first step. In the second step, we reran the annotation with est2genome=0, protein2genome=0 and Augustus ON. The output still contained split genes.</div><p class="MsoNormal" style="margin: 0cm 0cm 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif;"> </p><div style="margin: 0cm 0cm 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif;" class="">In the third trial, we used est2genome=1, protein2genome=1 and Augustus OFF and checked the output. In this output, full-length genes were predicted whenever full-length RNAseq and/or protein sequences were available. This seems to suggest that when we use Augustus, more weight is given to the Augustus de novo prediction and the evidence from RNAseq and protein sequences is not being synthesized.</div><p class="MsoNormal" style="margin: 0cm 0cm 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif;"> </p><div style="margin: 0cm 0cm 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif;" class="">Can you please let us know why we are getting split genes in spite of having full-length RNAseq and/or protein sequences? 
What changes would you suggest to the protocol to overcome this problem?</div><p class="MsoNormal" style="margin: 0cm 0cm 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif;"> </p><div style="margin: 0cm 0cm 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif;" class="">We thank you very much for your help and time.</div><p class="MsoNormal" style="margin: 0cm 0cm 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif;"> </p><div style="margin: 0cm 0cm 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif;" class="">Regards,</div><div style="margin: 0cm 0cm 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif;" class=""><b class=""><u class=""><span lang="EN-US" style="font-size: 9pt; color: rgb(14, 36, 242);" class=""><a href="mailto:prashantns@imcb.a-star.edu.sg" style="color: purple; text-decoration: underline;" class=""><span style="color: blue;" class="">Prashant Shingate, PhD</span></a></span></u></b><b class=""><span lang="EN-US" style="font-size: 9pt; color: rgb(31, 73, 125);" class=""> :: Research Fellow :: Comparative and Medical Genomics Lab :: Institute of Molecular and Cell Biology (IMCB) :: Agency for Science, Technology and Research (A*STAR)</span></b></div><div style="margin: 0cm 0cm 0.0001pt; font-size: 11pt; font-family: Calibri, sans-serif;" class=""><span lang="EN-US" style="font-size: 9pt; color: rgb(31, 73, 125);" class="">61 Biopolis Drive :: #05-04 Proteos :: Singapore 138673 :: DID (+65) 6586 9570 :: Fax (+65) 6779 1117 :: </span><span lang="EN-US" style="font-size: 9pt; color: rgb(14, 36, 242);" class=""><a href="http://www.imcb.a-star.edu.sg/" style="color: purple; text-decoration: underline;" class=""><span style="color: rgb(14, 36, 242);" class="">http://www.imcb.a-star.edu.sg/</span></a></span></div></div></div></blockquote></div><br class=""></div></div></div></div></body></html>