[maker-devel] iterative Maker2

Fri Dec 12 08:41:42 MST 2014

The gene models are actually produced by SNAP, Augustus, or whatever gene predictor you are using, so if you change the HMM every round, then the models will change too.  But I have one concern.  You are using a very sparse protein evidence dataset.  The protein dataset is very important to MAKER’s performance, and to iterative training of the ab initio predictors.  Normally, additional training after the second iteration should not be beneficial, but if you are getting wildly different results on the 3rd and 4th rounds, then you probably aren’t getting enough good models to train with.
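
One rough way to see whether the rounds are settling down is to count feature types in the GFF3 from each round and compare them. The Python below is only a sketch of that check, not part of MAKER: the function names are made up for illustration, and it assumes standard tab-separated GFF3 with the feature type in column 3.

```python
from collections import Counter


def count_features(gff_lines):
    """Count GFF3 feature types (column 3), skipping comments and embedded FASTA."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # sequence section follows; no more feature rows
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 9:  # a well-formed GFF3 row has nine columns
            counts[cols[2]] += 1
    return counts


def relative_change(prev, curr, feature="exon"):
    """Fractional change in one feature count between two rounds."""
    if prev[feature] == 0:
        return float("inf") if curr[feature] else 0.0
    return (curr[feature] - prev[feature]) / prev[feature]
```

Calling count_features(open(path)) on the merged GFF3 of two consecutive rounds and passing both results to relative_change gives a single number to watch. For the exon counts reported in the original question (73,000 and then 96,000), that is an increase of roughly 32%, which is far from stable.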

For a protein dataset you should be using the entire proteome from a minimum of two related species, and perhaps all of UniProt/Swiss-Prot, to get a broad protein database.  Don’t use the proteins extracted by CEGMA and HaMSTr.  CEGMA can be used to guide the first HMM creation (with the cegma2zff script that comes with MAKER), but don’t give the proteins to MAKER as evidence.  The HaMSTr results will also be redundant with the ESTs.  You need proteins from related species to look for homology not found in the EST dataset.
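
If it helps, the proteomes can be combined into one FASTA file before handing them to MAKER. The following Python is a minimal sketch of that step. The function names are hypothetical, and dropping exact duplicate sequences is my own choice to keep the file smaller; MAKER does not require it.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_proteomes(sources):
    """Combine several FASTA sources, keeping the first copy of each distinct sequence."""
    seen = set()
    merged = []
    for lines in sources:
        for header, seq in read_fasta(lines):
            key = seq.upper().rstrip("*")  # ignore case and a trailing stop symbol
            if key and key not in seen:
                seen.add(key)
                merged.append((header, seq))
    return merged


def format_fasta(records, width=60):
    """Render (header, sequence) pairs as FASTA text wrapped at `width` columns."""
    out = []
    for header, seq in records:
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in out)
```

Each source can be an open file handle for one proteome. The text returned by format_fasta would be written to a single file, and that file named in the protein= option of the control files.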

Also, repeat masking is important for any genome and has a huge effect on ab initio predictor performance.  Make sure you run something like RepeatModeler to look for species-specific repeats that will not already be in RepBase.  Then supply the resulting repeat library through the rmlib= option in the MAKER control files.
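
Setting rmlib= is normally just a one-line edit in the control file. If you are scripting many runs, it could be done with something like the Python sketch below. This is a hypothetical helper, not a MAKER tool, and it assumes each option sits on its own line as key=value, optionally followed by a #comment. The library file name in the usage note is a placeholder.

```python
def set_ctl_option(ctl_text, key, value):
    """Return ctl_text with `key=` set to `value`, keeping any trailing #comment."""
    out, found = [], False
    for line in ctl_text.splitlines():
        name, sep, rest = line.partition("=")
        is_comment = line.lstrip().startswith("#")
        if sep and not is_comment and name.strip() == key:
            _, hash_mark, comment = rest.partition("#")
            new_line = f"{key}={value}"
            if hash_mark:
                new_line += " #" + comment  # preserve the explanatory comment
            out.append(new_line)
            found = True
        else:
            out.append(line)  # leave every other line untouched
    if not found:
        raise KeyError(f"option {key!r} not found in control file text")
    return "".join(line + "\n" for line in out)
```

For example, set_ctl_option(text, "rmlib", "my_species_repeats.fa") returns the control file text with only the rmlib line changed, and raises KeyError if no rmlib line is present, so a misspelled option name does not pass silently.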

Thanks,
Carson

> On Dec 12, 2014, at 7:10 AM, Dennis, Alice <Alice.Dennis at eawag.ch> wrote:
> 
> Hi all,
>  
> I am a relatively new user of Maker2, and I’m looking for advice on running many iterations of the same dataset in Maker2.
>  
> I have a relatively small genome (~124 MB) from a wasp that is assembled into ~1,500 scaffolds. I have run several iterations of Maker2 by re-generating .hmms in SNAP and feeding them into the next round, and my gene predictions keep increasing (in number and in size).  The only thing that changes at each round is the .hmm.
> The evidence that I give is:
> - de novo assembled ESTs from a different strain of the same species (70,000 contigs; I am currently working on improving this assembly with the hope that this will be helpful here)
> - 610 proteins extracted from the genome scaffolds using CEGMA and HaMSTr
>  
> For my 1st iteration, I used the Nasonia .hmm from SNAP, and the est2genome/protein2genome option.
>  
> For the 2nd, 3rd and 4th rounds I have used .hmms generated from the previous round, all without the est2genome/protein2genome option. All other files are the same as in the original run.
>  
> As I understand it, after the second round, nothing should change in Maker2. But the differences are obvious between runs. Some entirely new exons are annotated. For example,  just counting “exon” in the .gff file gives me 73,000 after the third iteration and 96,000 after the fourth! Actually the biggest leap in this number is between the third and fourth round. I can also see that many features are longer when I look at the files in Geneious.
>  
> Is this sort of change possible after the second round of Maker2? Is there something I have done wrong in my runs, or am I understanding this output incorrectly?
>  
> Thank you, 
> Alice
>  