<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div class="">MAKER requires every gene to have at least some evidence support. This is very important for most eukaryotes, as false positive predictions will otherwise dominate what is called by snap/augustus. However, it is not such a large problem in fungi because of their high gene density and less frequent introns. Setting keep_preds=1 will maximize sensitivity at the cost of specificity (a bad idea in most eukaryotes, but not so much in fungi). I would not be surprised if most fungal annotation projects are biased toward sensitivity, with every gene that can be annotated being annotated (even if it does increase false positives). It is a tactic that can work, at least in fungi.</div><div class=""><br class=""></div><div class="">Also, if the assembly is fragmented, you will be less likely to have evidence support for all genes, as the evidence alignments will not meet the % coverage thresholds in the maker_bopts.ctl file.
You may want to separate out your shorter contigs and annotate them separately with more relaxed thresholds for pcov_blast=, pid_blast=, ep_score_limit=, and en_score_limit=.</div><div class=""><br class=""></div><div class="">--Carson</div><div class=""><br class=""></div><br class=""><div><blockquote type="cite" class=""><div class="">On Jan 31, 2015, at 4:21 PM, Jason Stajich <<a href="mailto:jason.stajich@gmail.com" class="">jason.stajich@gmail.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class="">Xabier -<div class=""> FYI - though you have probably already compared them, those stats are on par with the Hortaea v1 assembly (we do have an improved Hortaea assembly now, and the genome size is still in the same range, supporting the duplication hypothesis) </div><div class=""> Hw version 1 assembly -</div><div class="">N50 9623; Max 71563</div><div class="">CEGMA for Hw1 </div><div class=""><div class="">             #Prots  %Completeness  -  #Total  Average  %Ortho</div><div class=""><br class=""></div><div class="">  Complete      196       79.03      -   498     2.54     81.12</div><div class="">   Partial      228       91.94      -   673     2.95     95.18<br class=""></div><div class=""><br class=""></div><div class=""><br class=""></div><div class="">Mikael - yes - we should compare notes on the models JGI is calling which have little support in MAKER - I am not sure if their pipeline runs augustus/snap with informant hints, though usually they are bringing RNAseq into the mix - I don't know if your approach for reannotation assembled the RNAseq and used it as evidence?</div><div class=""><br class=""></div><div class="">We'll be trying to assess some of this when we compare the proportion of shared genes in the first 1KFG paper, so we may be able to say with more certainty whether these extra predictions are shared more widely, and get a handle on singleton/false positive rates.</div><div class=""><br 
class=""></div></div><div class="">Jason</div></div><div class="gmail_extra"><br clear="all" class=""><div class=""><div class="gmail_signature"><div dir="ltr" class="">Jason Stajich<br class=""><a href="mailto:jason.stajich@gmail.com" target="_blank" class="">jason.stajich@gmail.com</a><br class=""></div></div></div>

<br class=""><div class="gmail_quote">On Sat, Jan 31, 2015 at 12:51 AM, Xabier Vázquez Campos <span dir="ltr" class=""><<a href="mailto:xvazquezc@gmail.com" target="_blank" class="">xvazquezc@gmail.com</a>></span> wrote:<br class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr" class=""><div class="">Thanks Mikael,<br class=""><br class=""></div>These are the assembly stats as reported by abyss-fac. Indeed, it isn't a great N50, but it isn't that bad either:<br class=""><br class="">   n       n:500   n:N50   min     N80      N50     N20      E-size    max     sum     <br class="">14277   7099    1185    500     4698    10771   20438   14530   154519  42.68e6<br class=""><br class=""><br class=""></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br class=""><div class="gmail_quote">2015-01-31 19:42 GMT+11:00 Mikael Brandström Durling <span dir="ltr" class=""><<a href="mailto:mikael.durling@slu.se" target="_blank" class="">mikael.durling@slu.se</a>></span>:<br class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div style="word-wrap:break-word" class="">

Hi Xabier,<br class="">

<div class=""></div>

<br class="">

<div class="">

<blockquote type="cite" class="">

<div class="">31 jan 2015 kl. 05:48 skrev Xabier Vázquez Campos <<a href="mailto:xvazquezc@gmail.com" target="_blank" class="">xvazquezc@gmail.com</a>>:</div>

<br class="">

<div class="">

<div dir="ltr" class=""><span class="">Hi all,<br class="">

<br class="">

One of the fungal genomes I'm annotating is relatively fragmented, with many contigs/scaffolds, and the CEGMA analysis alone may indicate a potential widespread duplication of the genome:<br class="">

<br class="">

</span><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span class="">

#      Statistics of the completeness of the genome based on 248 CEGs      #<br class="">

              #Prots  %Completeness  -  #Total  Average  %Ortho<br class="">

<br class="">

  Complete      181       72.98      -   365     2.02     67.40<br class=""></span><span class="">

   Partial      230       92.74      -   528     2.30     77.83<br class="">

</span></blockquote>

</div>

</div>

</blockquote>

<div class=""><br class="">

</div>

<div class=""><br class="">

</div>

<div class="">Judging from these figures, you seem to have a very fragmented assembly. What N50 have you reached? In my experience, assemblies with an N50 below 5-10 times the average gene length tend to cause problems in producing good gene sets. That is not to say that the gene sets are unusable, but when comparing e.g. gene complements to other species, it will be hard to draw any conclusions if a high proportion of the genes are incomplete.</div><span class="">
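<div class="">A rough way to check an assembly against the N50 rule of thumb above is sketched below in Python. The function names and the toy contig lengths are made up for illustration (they are not output from abyss-fac or MAKER), the input is assumed to be a non-empty list of contig lengths, and the mean gene length has to come from your own annotation.</div>
<pre class="">
```python
from bisect import bisect_left
from itertools import accumulate


def n50(lengths):
    """Return the N50 of a non-empty collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    running = list(accumulate(ordered))
    # Index of the first contig at which the running total reaches
    # half of the total assembly size.
    index = bisect_left(running, running[-1] / 2)
    return ordered[index]


def n50_to_gene_length_ratio(lengths, mean_gene_length):
    """How many average-length genes fit into an N50-sized contig."""
    return n50(lengths) / mean_gene_length


# Toy contig lengths, not taken from the assembly discussed in this thread.
contigs = [100, 200, 300, 400, 500]
print(n50(contigs))
# The mean gene length of 200 is a placeholder value.
print(n50_to_gene_length_ratio(contigs, 200))
```
</pre>
<div class="">A ratio well below 5 would put an assembly in the range where, by the rule of thumb above, a high proportion of gene models is likely to be incomplete.</div>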

<br class="">

<blockquote type="cite" class="">

<div class="">

<div dir="ltr" class="">The expected genome size is relatively low (~42 Mb by abyss-fac) in comparison with <i class="">Hortaea werneckii</i> (51.6 Mb, 23333 genes), a related fungus with nearly 90% of its genes present in at least two copies.<br class="">

Paper: <a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0071328" target="_blank" class="">http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0071328</a><br class="">

<div class=""><br class="">

</div>

<div class="">Now to the Maker part... As part of the Maker annotation, I trained SNAP and Augustus, and I generated a specific RepeatModeler library. I recorded the predicted outputs from each Maker run (AED, number of predicted proteins and transcripts...). Both Augustus and SNAP gave quite high numbers (~19000 and ~23000 respectively) in comparison with the xxx.all.maker.proteins.fasta (about 13600). So, my first question is: how does Maker deal with gene duplications? Or is this just a consequence of the lack of support from the protein files initially provided to Maker? I've used 4 different protein files for the annotation; could it be that they weren't the best choices? I picked them from the closest relatives and similar environments.</div>

</div>

</div>

</blockquote>

<div class=""><br class="">

</div>

</span><div class="">Unless you mistakenly filter out duplicated gene families as repeats with RepeatModeler, MAKER should not care about duplicated genes. However, without keep_preds=1, MAKER reports only genes with some kind of support (be it EST or protein homology). This is rather conservative, but if you enable keep_preds, you will get more genes, as you have noted. Just for the sake of comparison, I have reannotated more than ten genomes downloaded from JGI, providing MAKER with similar evidence as JGI, and MAKER consistently reports fewer gene models. I have yet to do a more thorough comparison to tell which genes JGI reports that don’t appear in the MAKER annotations.</div><span class="">

<br class="">

<blockquote type="cite" class="">

<div class="">

<div dir="ltr" class="">

<div class=""><br class="">

</div>

<div class="">So, in my last run I set keep_preds=1 and the number of proteins in the xxx.all.maker.proteins.fasta reached <br class="">

</div>

<div class=""><br class="">

</div>

<div class="">Last question, regarding the protein files. I downloaded the annotated genomes from the JGI, and most of them have two annotation folders, "All_models,_Filtered_and_Not" and "Filtered_Models___best__". I've been using the protein files found in the latter, as I expected them to have real evidence and a lower chance of containing falsely predicted genes. Am I right?</div>

</div>

</div>

</blockquote>

<div class=""><br class="">

</div>

</span><div class="">Yes, I would say so. The FilteredModels have passed through their model selection pipeline, while all_models contains models from all predictors, as well as combinations of predictors and EST evidence.</div>

<div class=""><br class="">

</div>

<div class="">Just some 2 cents of observations of mine,</div>

<div class="">cheers,</div>

<div class="">Mikael</div>

<br class="">

<blockquote type="cite" class="">

<div class=""><span class="">

<div dir="ltr" class="">

<div class=""><br class="">

</div>

<div class="">Thank you in advance,</div>

<div class=""><br class="">

</div>

<div class="">Xabier</div>

<div class=""><br class="">

<br class="">

-- <br class="">

Xabier Vázquez Campos<br class="">

PhD Candidate<br class="">

Water Research Centre<br class="">

School of Civil and Environmental Engineering<br class="">

The University of New South Wales<br class="">

Sydney NSW 2052 AUSTRALIA</div>

</div></span>

_______________________________________________<br class="">

maker-devel mailing list<br class="">

<a href="mailto:maker-devel@box290.bluehost.com" target="_blank" class="">maker-devel@box290.bluehost.com</a><br class="">

<a href="http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org" target="_blank" class="">http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org</a><br class="">

</div>

</blockquote>

</div>

<br class="">

</div>

</blockquote></div><br class=""><br clear="all" class=""><br class="">-- <br class=""><div class="">Xabier Vázquez Campos<br class=""><i class="">PhD Candidate</i><br class="">Water Research Centre<br class="">School of Civil and Environmental Engineering<br class="">

The University of New South Wales<br class="">Sydney NSW 2052 AUSTRALIA<br class=""></div>

</div>

</div></div><br class="">

<br class=""></blockquote></div><br class=""></div>

</div></blockquote></div><br class=""></body></html>