<div dir="ltr">Hi Alice,<div><br></div><div>In my experience, ending up with fewer, longer genes is generally a good thing (and very normal); it results from split models being merged and incomplete models being extended. I find it helpful to load the annotations and evidence into a genome browser to get a visual idea of what is happening.</div><div><br></div><div>Mike </div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Mar 26, 2015 at 4:34 AM, Alice Dennis <span dir="ltr"><<a href="mailto:alicebdennis@gmail.com" target="_blank">alicebdennis@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hello again,<br>
<br>
I posted a while ago about a genome I'm running through the Maker2<br>
pipeline. I was concerned because my results were still changing with<br>
3 and 4 iterations.<br>
<br>
Following the very useful advice of Carson (below), I've made a few<br>
modifications (adding a RepeatModeler run, using a big protein<br>
database), but my gene predictions are still changing between the 3rd<br>
and 4th iterations. Perhaps this is ok, but these increasing gene<br>
lengths make me worry that I haven't built stable models.<br>
<br>
Here is the short version of what I've done.<br>
1. Run RepeatModeler, but this only produced 47 sequences in the<br>
resulting .fasta... so that seemed a bit small.<br>
<br>
2. Run Maker2 using:<br>
- RepeatModeler output + "model_org=all" and "softmask=1" in the<br>
Repeat Masking section.<br>
- protein evidence from 2 distantly related species AND all of Uniprot<br>
- ests from a different strain of my species (a parasitoid wasp)<br>
- the .hmm from Nasonia, one of the 2 distantly related species whose<br>
proteome I also provided as protein evidence<br>
- my assembled genome of 1,509 scaffolds.<br>
<br>
3. After this, I did three subsequent rounds of Maker2 (cleverly named<br>
Rounds 2, 3 and 4). Each one used the same input, except the Nasonia<br>
.hmm was replaced by a SNAP generated .hmm from the previous round.<br>
Also, the est2genome and protein2genome options were changed from 1 to 0 in all<br>
runs after the first.<br>
<br>
Here are some results:<br>
Round1: 14,647 genes, average length 2,491<br>
Round2: 12,158 genes, average length 3,760<br>
Round3: 13,515 genes, average length 3,090<br>
Round4: 12,169 genes, average length 3,918<br>
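One way to compare these rounds is to look at the relative change and at the approximate total annotated span (gene count times average length), rather than at the raw counts alone. The following is a minimal Python sketch that uses only the four figures listed above; the names ROUNDS, total_span, pct_change and summarize are illustrative and are not part of MAKER, and the span is only approximate because the average lengths are rounded.<br>

```python
# Figures copied from the round summaries above: (gene count, average gene length).
ROUNDS = {
    "Round1": (14647, 2491),
    "Round2": (12158, 3760),
    "Round3": (13515, 3090),
    "Round4": (12169, 3918),
}


def total_span(gene_count, avg_length):
    """Approximate total length covered by gene models (count x average length)."""
    return gene_count * avg_length


def pct_change(old, new):
    """Percent change from old to new."""
    return 100.0 * (new - old) / old


def summarize(rounds):
    """Return one row per round: (name, genes, avg length, span, % change in genes)."""
    rows = []
    previous = None
    for name, (genes, avg_len) in rounds.items():
        change = None if previous is None else pct_change(previous, genes)
        rows.append((name, genes, avg_len, total_span(genes, avg_len), change))
        previous = genes
    return rows


if __name__ == "__main__":
    for name, genes, avg_len, span, change in summarize(ROUNDS):
        delta = "n/a" if change is None else f"{change:+.1f}%"
        print(f"{name}: genes={genes} avg_len={avg_len} span~{span} gene_change={delta}")
```

By these figures the approximate span rises in Rounds 2 and 4 and falls back in Round 3, so the oscillation shows up in total annotated length as well as in gene count.<br>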
<br>
This is a bit confusing because the number of genes predicted goes up<br>
and down, as do their lengths. I've double-checked the dates of my<br>
files, and they are all labeled such that I don't think anything could<br>
be swapped.<br>
<br>
So my questions are:<br>
Is this an indication that my models are unstable and I shouldn't<br>
trust these predictions?<br>
Is a decreasing number of genes, combined with increasing gene length,<br>
perhaps a good thing?<br>
How do I know when to stop if genes keep getting longer?<br>
<br>
<br>
Thanks very much,<br>
Alice<br>
<div><div class="h5"><br>
<br>
On Fri, Dec 12, 2014 at 4:41 PM, Carson Holt <<a href="mailto:carsonhh@gmail.com">carsonhh@gmail.com</a>> wrote:<br>
> The gene models are actually produced by SNAP, Augustus, or whatever gene<br>
> predictor you are using, so if you change the HMM every round, then the<br>
> models will change too. But I have one concern. You are using a very<br>
> sparse protein evidence dataset. The protein dataset is very important to<br>
> MAKER’s performance, and for iterative training of the ab initio<br>
> predictors. Normally after the second iteration, additional training should<br>
> not be beneficial, but if you are getting wildly different results in the 3rd<br>
> and 4th rounds, then you probably aren’t getting enough good models to<br>
> train with.<br>
><br>
> For a protein dataset you should be using the entire proteome from a<br>
> minimum of two related species and perhaps all of UniProt/Swiss-Prot to get<br>
> a broad protein database. Don’t use the proteins extracted by CEGMA and<br>
> HaMSTr. CEGMA can be used to guide the first HMM creation (cegma2zff script<br>
> that comes with MAKER), but don’t give the proteins to MAKER as evidence.<br>
> Also, the HaMSTr results will be redundant with the ESTs. You need proteins<br>
> from related species to look for homology not found in the EST dataset.<br>
><br>
> Also, repeat masking is important for any genome and has a huge effect on ab<br>
> initio predictor performance. Make sure you run something like<br>
> RepeatModeler to look for species-specific repeats that will not already be<br>
> in RepBase. Then add those results to the rmlib= option in the maker<br>
> control files.<br>
><br>
> Thanks,<br>
> Carson<br>
><br>
><br>
><br>
><br>
> On Dec 12, 2014, at 7:10 AM, Dennis, Alice <<a href="mailto:Alice.Dennis@eawag.ch">Alice.Dennis@eawag.ch</a>> wrote:<br>
><br>
</div></div><div><div class="h5">> Hi all,<br>
><br>
> I am a relatively new user of Maker2, and I’m looking for advice on running<br>
> many iterations of the same dataset in Maker2.<br>
><br>
> I have a relatively small genome (~124 Mb) from a wasp that is assembled<br>
> into ~1,500 scaffolds. I have run several iterations of Maker2 by<br>
> re-generating .hmms in SNAP and feeding them into the next round, and my<br>
> gene predictions keep increasing (in number and in size). The only thing<br>
> that changes at each round is the .hmm.<br>
> The evidence that I give is:<br>
> - de novo assembled ESTs from a different strain of the same<br>
> species (70,000 contigs… I am currently working on improving this assembly<br>
> with the hope that this will be helpful here)<br>
> - 610 proteins extracted from the genome scaffolds using CEGMA and<br>
> HaMSTr<br>
><br>
> For my 1st iteration, I used the Nasonia .hmm from SNAP, and the<br>
> est2genome/protein2genome option.<br>
><br>
> For the 2nd, 3rd and 4th rounds I have used .hmms generated from the<br>
> previous round, all without the est2genome/protein2genome option. All other<br>
> files are the same as in the original run.<br>
><br>
> As I understand it, after the second round, nothing should change in Maker2.<br>
> But the differences are obvious between runs. Some entirely new exons are<br>
> annotated. For example, just counting “exon” in the .gff file gives me<br>
> 73,000 after the third iteration and 96,000 after the fourth! Actually the<br>
> biggest leap in this number is between the third and fourth round. I can<br>
> also see that many features are longer when I look at the files in Geneious.<br>
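Counting the string “exon” across the whole .gff may also pick up lines where the word appears outside the feature-type column, so a stricter tally can help when comparing rounds. The sketch below assumes a standard tab-separated GFF3 file with nine columns and counts only rows whose third column is exactly exon, grouped by the second (source) column; the function name count_exons_by_source and the example file names are hypothetical.<br>

```python
import sys
from collections import Counter


def count_exons_by_source(lines):
    """Count GFF3 rows whose type column is exactly 'exon', keyed by source column."""
    counts = Counter()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a well-formed feature row
        source, feature_type = fields[1], fields[2]
        if feature_type == "exon":
            counts[source] += 1
    return counts


if __name__ == "__main__":
    # Usage (file names are placeholders): python count_exons.py round3.gff round4.gff
    for path in sys.argv[1:]:
        with open(path) as handle:
            counts = count_exons_by_source(handle)
        print(path, dict(counts), "total:", sum(counts.values()))
```

Splitting the tally by source makes it easier to see whether a jump between rounds comes from the final annotations or from some other track in the same file.<br>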
><br>
> Is this sort of change possible after the second round of Maker2? Is there<br>
> something I have done wrong in my runs, or am I understanding this output<br>
> incorrectly?<br>
><br>
> Thank you,<br>
> Alice<br>
><br>
</div></div><span class="">> _______________________________________________<br>
> maker-devel mailing list<br>
> <a href="mailto:maker-devel@box290.bluehost.com">maker-devel@box290.bluehost.com</a><br>
> <a href="http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org" target="_blank">http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org</a><br>
><br>
><br>
<br>
<br>
<br>
</span>--<br>
<br>
<br>
Alice Dennis<br>
<a href="mailto:alicebdennis@gmail.com">alicebdennis@gmail.com</a><br>
<br>
Postdoctoral Researcher<br>
Institute for Integrative Biology, ETH Zürich & EAWAG<br>
Überlandstrasse 133<br>
P.O. Box 611<br>
8600 Dübendorf, Switzerland<br>
<br>
<a href="https://adennis5.wordpress.com/" target="_blank">https://adennis5.wordpress.com/</a><br>
<div class="HOEnZb"><div class="h5"><br>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature"><div dir="ltr">Michael Campbell MS, RD.<br>Doctoral Candidate<br>Eccles Institute of Human Genetics<br>
University of Utah<br>
15 North 2030 East, Room 2100<br>
Salt Lake City, UT 84112-5330<br>ph:585-3543<br><br></div></div>
</div>