<div dir="ltr">Hi Alice,<div><br></div><div>In my experience, having fewer, longer genes is generally a good thing (and very normal); it results from the merging of split models and the extension of incomplete models. I find it helpful to load the annotations and evidence into a genome browser to get a visual idea of what is happening.</div><div><br></div><div>Mike </div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Mar 26, 2015 at 4:34 AM, Alice Dennis <span dir="ltr"><<a href="mailto:alicebdennis@gmail.com" target="_blank">alicebdennis@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hello again,<br>

<br>

I posted a while ago about a genome I'm running through the Maker2<br>

pipeline. I was concerned because my results were still changing with<br>

3 and 4 iterations.<br>

<br>

Following the very useful advice of Carson (below), I've made a few<br>

modifications (adding a RepeatModeler run, using a big protein<br>

database), but my gene predictions are still changing between the 3rd<br>

and 4th iterations. Perhaps this is ok, but these increasing gene<br>

lengths make me worry that I haven't built stable models.<br>

<br>

Here is the short version of what I've done.<br>

1. Run RepeatModeler, but this only produced 47 sequences in the<br>

resulting .fasta... so that seemed a bit small.<br>

<br>

2. Run Maker2 using:<br>

- RepeatModeler output + "model_org=all" and "softmask=1" in the<br>

Repeat Masking section.<br>

- protein evidence from 2 distantly related species AND all of Uniprot<br>

- ESTs from a different strain of my species (a parasitoid wasp)<br>

- the .hmm from Nasonia, one of the 2 distantly related species whose<br>

proteome I also provided as protein evidence<br>

- my assembled genome of 1,509 scaffolds.<br>

<br>

3. After this, I did three subsequent rounds of Maker2 (cleverly named<br>

Rounds 2, 3 and 4). Each one used the same input, except the Nasonia<br>

.hmm was replaced by a SNAP generated .hmm from the previous round.<br>

Also, est2genome and protein2genome were changed from 1 to 0 in all<br>

runs after the first.<br>

<br>

Here are some results:<br>

Round1: 14,647 genes, average length 2,491<br>

Round2: 12,158 genes, average length 3,760<br>

Round3: 13,515 genes, average length 3,090<br>

Round4: 12,169 genes, average length 3,918<br>

<br>
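The round-to-round swings in these figures are easier to judge as percentages. Below is a minimal Python sketch that tabulates them; the numbers are the ones listed above, while the function names are only illustrative and not part of Maker2.<br>
<pre>
```python
# Gene counts and average gene lengths per round, copied from the list above.
rounds = [
    ("Round1", 14647, 2491),
    ("Round2", 12158, 3760),
    ("Round3", 13515, 3090),
    ("Round4", 12169, 3918),
]


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


def compare_rounds(rows):
    """Return (label, gene count change %, average length change %) per consecutive pair."""
    changes = []
    for (name_a, genes_a, len_a), (name_b, genes_b, len_b) in zip(rows, rows[1:]):
        changes.append((
            f"{name_a} to {name_b}",
            round(percent_change(genes_a, genes_b), 1),
            round(percent_change(len_a, len_b), 1),
        ))
    return changes


if __name__ == "__main__":
    for label, d_genes, d_len in compare_rounds(rounds):
        print(f"{label}: genes {d_genes:+.1f}%, average length {d_len:+.1f}%")
```
</pre>
Seen this way, Rounds 2 and 4 differ by only 11 genes (12,158 vs 12,169), so the results look more like an alternation between two states than a steady drift.<br>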

This is a bit confusing because the number of genes predicted goes up<br>

and down, as do their lengths. I've double-checked the dates of my<br>

files, and they are all labeled such that I don't think anything could<br>

be swapped.<br>

<br>

So my questions are:<br>

Is this an indication that my models are unstable and I shouldn't<br>

trust these predictions?<br>

Is a decreasing number of genes, with the genes also getting longer, perhaps a<br>

good thing?<br>

How do I know when to stop if genes keep getting longer?<br>

<br>

<br>

Thanks very much,<br>

Alice<br>

<div><div class="h5"><br>

<br>

On Fri, Dec 12, 2014 at 4:41 PM, Carson Holt <<a href="mailto:carsonhh@gmail.com">carsonhh@gmail.com</a>> wrote:<br>

> The gene models are actually produced by SNAP, Augustus, or whatever gene<br>

> predictor you are using, so if you change the HMM every round, then the<br>

> models will change too.  But I have one concern.  You are using a very<br>

> sparse protein evidence dataset.  The protein dataset is very important to<br>

> MAKER’s performance, and for iterative training of the ab initio<br>

> predictors.  Normally after the second iteration, additional training should<br>

> not be beneficial, but if you are getting wildly different results on 3rd<br>

> and 4th round, then you probably aren’t getting sufficient good models to<br>

> train with.<br>

><br>

> For a protein dataset you should be using the entire proteome from a<br>

> minimum of two related species and perhaps all of UniProt/Swiss-prot to get<br>

> a broad protein database.  Don’t use the proteins extracted by CEGMA and<br>

> HaMSTr.  CEGMA can be used to guide the first HMM creation (cegma2zff script<br>

> that comes with MAKER), but don’t give the proteins to MAKER as evidence,<br>

> also the HaMSTr results will be redundant with the ESTs.  You need proteins<br>

> from related species to look for homology not found in the EST dataset.<br>

><br>

> Also repeat masking is important for any genome and has a huge effect on ab<br>

> initio predictor performance.  Make sure you run something like<br>

> RepeatModeler to look for species specific repeats that will not already be<br>

> in RepBase.  Then add those results to the rmlib= option in the maker<br>

> control files.<br>

><br>

> Thanks,<br>

> Carson<br>

><br>

><br>

><br>

><br>

> On Dec 12, 2014, at 7:10 AM, Dennis, Alice <<a href="mailto:Alice.Dennis@eawag.ch">Alice.Dennis@eawag.ch</a>> wrote:<br>

><br>

</div></div><div><div class="h5">> Hi all,<br>

><br>

> I am a relatively new user to Maker2, and I’m looking for advice on running<br>

> many iterations of the same dataset in Maker2.<br>

><br>

> I have a relatively small genome (~124 MB) from a wasp that is assembled<br>

> into ~1,500 scaffolds. I have run several iterations of Maker2 by<br>

> re-generating .hmms in SNAP and feeding them into the next round, and my<br>

> gene predictions keep increasing (in number and in size).  The only thing<br>

> that changes at each round is the .hmm.<br>

> This is the evidence that I give:<br>

> -          de novo assembled ESTs from a different strain of the same<br>

> species (70,000 contigs… I am currently working on improving this assembly<br>

> with the hope that this will be helpful here)<br>

> -          610 proteins extracted from the genome scaffolds using CEGMA and<br>

> HaMSTr<br>

><br>

> For my 1st iteration, I used the Nasonia .hmm from SNAP, and the<br>

> est2genome/protein2genome option.<br>

><br>

> For the 2nd, 3rd and 4th rounds I have used .hmms generated from the<br>

> previous round, all without the est2genome/protein2genome option. All other<br>

> files are the same as in the original run.<br>

><br>

> As I understand it, after the second round, nothing should change in Maker2.<br>

> But the differences are obvious between runs. Some entirely new exons are<br>

> annotated. For example,  just counting “exon” in the .gff file gives me<br>

> 73,000 after the third iteration and 96,000 after the fourth! Actually the<br>

> biggest leap in this number is between the third and fourth round. I can<br>

> also see that many features are longer when I look at the files in Geneious.<br>

><br>
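Counting the word "exon" anywhere in a .gff file may also pick up lines where it only appears in an attribute, so the totals can be inflated. Below is a minimal Python sketch that counts features by the type column instead. It assumes a standard nine-column, tab-separated GFF3 layout; the optional source filter, the function name, and the example rows are made up for illustration and are not taken from any run in this thread.<br>
<pre>
```python
import io


def summarize_gff3(handle, source=None):
    """Count gene and exon features and the mean gene length in a GFF3 stream.

    Features are selected by the type column (column 3), not by searching
    each line for the text "exon". If source is given, only rows whose
    source column (column 2) equals it are counted.
    """
    genes = 0
    exons = 0
    total_gene_length = 0
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # a GFF3 file may end with embedded sequence
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a feature row
        if source is not None and cols[1] != source:
            continue
        if cols[2] == "gene":
            genes += 1
            total_gene_length += int(cols[4]) - int(cols[3]) + 1
        elif cols[2] == "exon":
            exons += 1
    mean_length = total_gene_length / genes if genes else 0.0
    return {"genes": genes, "exons": exons, "mean_gene_length": mean_length}


# Made-up rows for illustration only.
example = "\n".join("\t".join(row) for row in [
    ["scaf1", "maker", "gene", "101", "1100", ".", "+", ".", "ID=g1"],
    ["scaf1", "maker", "exon", "101", "400", ".", "+", ".", "Parent=g1-mRNA-1"],
    ["scaf1", "maker", "exon", "801", "1100", ".", "+", ".", "Parent=g1-mRNA-1"],
    ["scaf1", "est2genome", "match_part", "101", "400", ".", "+", ".", "ID=m1;Name=exon_like"],
    ["scaf2", "maker", "gene", "1", "3000", ".", "-", ".", "ID=g2"],
])

if __name__ == "__main__":
    print(summarize_gff3(io.StringIO(example), source="maker"))
```
</pre>
On the made-up rows, the function reports 2 genes and 2 exons, while a plain text search for "exon" matches 3 lines. Running the same summary on each round's output would give counts that are directly comparable between rounds.<br>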

> Is this sort of change possible after the second round of Maker2? Is there<br>

> something I have done wrong in my runs, or am I understanding this output<br>

> incorrectly?<br>

><br>

> Thank you,<br>

> Alice<br>

><br>

</div></div><span class="">> _______________________________________________<br>

> maker-devel mailing list<br>

> <a href="mailto:maker-devel@box290.bluehost.com">maker-devel@box290.bluehost.com</a><br>

> <a href="http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org" target="_blank">http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org</a><br>

><br>

><br>

<br>

<br>

<br>

</span>--<br>

<br>

<br>

Alice Dennis<br>

<a href="mailto:alicebdennis@gmail.com">alicebdennis@gmail.com</a><br>

<br>

Postdoctoral Researcher<br>

Institute for Integrative Biology, ETH Zürich & EAWAG<br>

Überlandstrasse 133<br>

P.O. Box 611<br>

8600 Dübendorf, Switzerland<br>

<br>

<a href="https://adennis5.wordpress.com/" target="_blank">https://adennis5.wordpress.com/</a><br>

<div class="HOEnZb"><div class="h5"><br>


</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature"><div dir="ltr">Michael Campbell MS, RD.<br>Doctoral Candidate<br>Eccles Institute of Human Genetics<br>

University of Utah<br>

15 North 2030 East, Room 2100<br>

Salt Lake City, UT 84112-5330<br>ph:585-3543<br><br></div></div>

</div>