<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><div style="color: rgb(0, 0, 0); font-family: Calibri, sans-serif; font-size: 14px; ">The non_overlapping_ab_initio.proteins.fasta file contains ab initio models that were rejected for lack of homology evidence and that do not overlap the maker models. So if you accept everything, the total is not 21,000 proteins but 33,000 (12,000 + 21,000). This is because the contents of that file do not occupy the same regions as the gene models (i.e. they are non-overlapping). Ab initio predictors have a tendency to overpredict, which is why there are so many models. If you think you are missing genes, you can try adding additional evidence to the maker run (e.g. all proteins from a few related species). Also try running CEGMA (<a href="http://korflab.ucdavis.edu/datasets/cegma/">http://korflab.ucdavis.edu/datasets/cegma/</a>) to estimate the completeness of your assembly. Sometimes a lower gene count than expected can be attributed to an incomplete assembly.</div><div style="color: rgb(0, 0, 0); font-family: Calibri, sans-serif; font-size: 14px; "><br></div><div style="color: rgb(0, 0, 0); font-family: Calibri, sans-serif; font-size: 14px; ">You can also run something like InterProScan to identify models in the non_overlapping_ab_initio.proteins.fasta file that contain Uniprot protein domains (likely to be real genes), then add them to your results as a second step using maker's model_gff option.</div><div style="color: rgb(0, 0, 0); font-family: Calibri, sans-serif; font-size: 14px; "><br></div><div><font class="Apple-style-span" face="Calibri,sans-serif">I've attached a couple of scripts that can help with that. gff3_preds2models will turn a set of match/match_part features into </font><span class="Apple-style-span" style="font-family: Calibri, sans-serif; ">gene/mRNA/exon/CDS features. 
It doesn't check that the CDS translates, though, </span><span class="Apple-style-span" style="font-family: Calibri, sans-serif; ">so only use it on gene predictions, which should be all CDS. </span><font class="Apple-style-span" face="Calibri,sans-serif">gff3_select is used to select some subset of features from a GFF3 file. </font><span class="Apple-style-span" style="font-family: Calibri, sans-serif; ">It is useful for slicing sections of data from a GFF3 file.</span></div><div><span class="Apple-style-span" style="font-family: Calibri, sans-serif; "><br></span></div><div><span class="Apple-style-span" style="font-family: Calibri, sans-serif; ">Thanks,</span></div><div><span class="Apple-style-span" style="font-family: Calibri, sans-serif; ">Carson</span></div><div style="color: rgb(0, 0, 0); font-family: Calibri, sans-serif; font-size: 14px; "><br></div><span id="OLK_SRC_BODY_SECTION" style="color: rgb(0, 0, 0); font-size: 14px; font-family: Calibri, sans-serif; "><div style="font-family:Calibri; font-size:11pt; text-align:left; color:black; BORDER-BOTTOM: medium none; BORDER-LEFT: medium none; PADDING-BOTTOM: 0in; PADDING-LEFT: 0in; PADDING-RIGHT: 0in; BORDER-TOP: #b5c4df 1pt solid; BORDER-RIGHT: medium none; PADDING-TOP: 3pt"><span style="font-weight:bold">From: </span> Kipp Johnson &lt;<a href="mailto:kippjohnson@uchicago.edu">kippjohnson@uchicago.edu</a>&gt;<br><span style="font-weight:bold">Date: </span> Thursday, 11 October, 2012 3:43 PM<br><span style="font-weight:bold">To: </span> &lt;<a href="mailto:maker-devel@yandell-lab.org">maker-devel@yandell-lab.org</a>&gt;, Carson Holt &lt;<a href="mailto:carsonhh@gmail.com">carsonhh@gmail.com</a>&gt;<br><span style="font-weight:bold">Subject: </span> Interpreting Maker Results<br></div><div><br></div>Hi Carson,<div><br></div><div> I'm trying to get my genetic annotation out of maker. My maker run on a non-model eukaryote finished, and I used your gff3_merge script to merge the resulting files. 
This file is enormous because I used snap, augustus, genemark, repeatmasker, exonerate, and blast, and it has a lot of entries from all of these different programs.</div><div><br></div><div> I want to extract only the genetic regions predicted by maker, so I used the gff3_merge script with the "-g" option. However, when I do this, I get a maker file that only has about 12,000 genes, while I was expecting around 20,000 genes for our genome. When I use the fasta merge tool, however, I get output files (for example, "merged.fasta.all.maker.non_overlapping_ab_initio.proteins.fasta") with about 21,000 proteins, which is closer to the gene number that I was expecting. Does the "-g" option ignore evidence from blast/exonerate or similar? How should I extract the complete set of genetic regions to blast against, so that I can go about further working on the annotation?</div><div><br></div><div>Also, what is maker using to find these 9,000 extra proteins? Are these all alternatively spliced or something along those lines? I can't find any documentation online for how to actually get the final annotations out of maker correctly.</div><div><br></div><div>Thanks for your time!</div><div><br></div><div>Best,</div><div><br></div><div> Kipp Johnson</div><div> <a href="mailto:kippjohnson@uchicago.edu">kippjohnson@uchicago.edu</a></div></span></body></html>