[maker-devel] iterative Maker2

Michael Campbell michael.s.campbell1 at gmail.com
Thu Mar 26 09:50:41 MDT 2015


Hi Alice,

In my experience, fewer but longer genes are generally a good thing (and very
normal): they result from the merging of split models and the extension of
incomplete models. I find it helpful to load the annotations and evidence
into a genome browser to get a visual idea of what is happening.
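
If you want a quick numerical check alongside the browser view, here is a
minimal Python sketch that looks for merged split models between two rounds.
It assumes you have already pulled the gene coordinates out of each round's
GFF; the function name, the tuple layout and the toy coordinates are all made
up for illustration and are not part of MAKER.

```python
from collections import defaultdict

def find_merged_models(old_genes, new_genes):
    """Return new genes that overlap two or more old genes.

    Each gene is a tuple (gene_id, scaffold, start, end, strand) with
    1-based inclusive coordinates. Only genes on the same scaffold and
    strand are compared.
    """
    by_key = defaultdict(list)
    for gene_id, scaffold, start, end, strand in old_genes:
        by_key[(scaffold, strand)].append((start, end, gene_id))

    merged = {}
    for gene_id, scaffold, start, end, strand in new_genes:
        candidates = by_key.get((scaffold, strand), [])
        # Two closed intervals overlap when each starts before the other ends.
        hits = [old_id for o_start, o_end, old_id in candidates
                if o_start <= end and start <= o_end]
        if len(hits) >= 2:
            merged[gene_id] = sorted(hits)
    return merged

# Toy data (hypothetical): two short models from an earlier round are
# covered by one longer model in the later round.
earlier_round = [
    ("a1", "s1", 100, 500, "+"),
    ("a2", "s1", 600, 900, "+"),
    ("a3", "s1", 2000, 2500, "-"),
]
later_round = [
    ("b1", "s1", 100, 900, "+"),
    ("b2", "s1", 2000, 2600, "-"),
]
merged_models = find_merged_models(earlier_round, later_round)
```

A later-round gene that covers several earlier-round genes is a candidate
merge; one that overlaps a single earlier gene but is longer is a candidate
extension. Either way it is worth confirming against the evidence tracks.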

Mike

On Thu, Mar 26, 2015 at 4:34 AM, Alice Dennis <alicebdennis at gmail.com>
wrote:

> Hello again,
>
> I posted a while ago about a genome I'm running through the Maker2
> pipeline. I was concerned because my results were still changing with
> 3 and 4 iterations.
>
> Following the very useful advice of Carson (below), I've made a few
> modifications (adding a RepeatModeler run, using a big protein
> database), but my gene predictions are still changing between the 3rd
> and 4th iterations. Perhaps this is ok, but these increasing gene
> lengths make me worry that I haven't built stable models.
>
> Here is the short version of what I've done.
> 1. Run RepeatModeler, but this only produced 47 sequences in the
> resulting .fasta... so that seemed a bit small.
>
> 2. Run Maker2 using:
> - RepeatModeler output + "model_org=all" and "softmask=1" in the
> Repeat Masking section.
> - protein evidence from 2 distantly related species AND all of Uniprot
> - ESTs from a different strain of my species (a parasitoid wasp)
> - the .hmm from Nasonia, one of the 2 distantly related species whose
> proteome I also provided as protein evidence
> - my assembled genome of 1,509 scaffolds.
>
> 3. After this, I did three subsequent rounds of Maker2 (cleverly named
> Rounds 2, 3 and 4). Each one used the same input, except the Nasonia
> .hmm was replaced by a SNAP generated .hmm from the previous round.
> Also, est2genome and protein2genome were changed from 1 to 0 in all
> runs after the first.
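
The round setup described above can be sanity-checked with a short script.
This is only a sketch: the helper names and the toy control text are made
up, est2genome and protein2genome are taken from the message above, and
snaphmm is assumed to be the option that holds the path to the SNAP .hmm.

```python
def parse_ctl(text):
    """Parse 'key=value #comment' lines from control-file text into a dict."""
    opts = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue  # blank lines and section headers
        key, value = line.split("=", 1)
        opts[key.strip()] = value.strip()
    return opts

def check_retrained_round(opts):
    """List settings that do not match a round run on a retrained SNAP HMM."""
    problems = []
    for key in ("est2genome", "protein2genome"):
        if opts.get(key, "0") != "0":
            problems.append(f"{key} should be 0 after the first round")
    if not opts.get("snaphmm"):
        problems.append("snaphmm is empty; no retrained HMM supplied")
    return problems

# Toy control text (hypothetical file name and values).
toy_ctl = """\
#-----Gene Prediction
snaphmm=round1.hmm #hypothetical path to the retrained SNAP model
est2genome=0 #off after the first round
protein2genome=1 #left on by mistake in this toy example
"""
toy_problems = check_retrained_round(parse_ctl(toy_ctl))
```

Running the same check on each round's control file before launching it is a
cheap way to confirm that only the .hmm changed between rounds.
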
>
> Here are some results:
> Round1: 14,647 genes, average length 2,491
> Round2: 12,158 genes, average length 3,760
> Round3: 13,515 genes, average length 3,090
> Round4: 12,169 genes, average length 3,918
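
To make the round-to-round movement easier to see, here is a small Python
sketch that works only from the four numbers quoted above. The function names
are made up, and "gene space" (gene count times average length) is just a
rough derived figure, not something MAKER reports.

```python
# (round name, number of genes, average gene length) as quoted above
rounds = [
    ("Round1", 14647, 2491),
    ("Round2", 12158, 3760),
    ("Round3", 13515, 3090),
    ("Round4", 12169, 3918),
]

def round_changes(rounds):
    """Absolute and percent change in gene count and length between rounds."""
    changes = []
    for (prev_name, prev_n, prev_len), (name, n, length) in zip(rounds, rounds[1:]):
        changes.append({
            "step": f"{prev_name}->{name}",
            "genes_delta": n - prev_n,
            "genes_pct": 100.0 * (n - prev_n) / prev_n,
            "length_delta": length - prev_len,
            "length_pct": 100.0 * (length - prev_len) / prev_len,
        })
    return changes

def approx_gene_space(rounds):
    """Gene count times average length: a rough total annotated gene span."""
    return {name: n * length for name, n, length in rounds}

changes = round_changes(rounds)
gene_space = approx_gene_space(rounds)
```

One thing the arithmetic shows is that Rounds 2 and 4 differ by only 11 genes
and 158 in average length, so Round 3 looks like the outlier rather than a
step in a steady trend. That is only a reading of these four numbers, not a
diagnosis.
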
>
> This is a bit confusing because the number of genes predicted goes up
> and down, as do their lengths. I've double-checked the dates of my
> files, and they are all labeled such that I don't think anything could
> be swapped.
>
> So my questions are:
> Is this an indication that my models are unstable and I shouldn't
> trust these predictions?
> Is the decrease in gene number, with the genes also getting longer,
> perhaps a good thing?
> How do I know when to stop if genes keep getting longer?
>
>
> Thanks very much,
> Alice
>
>
> On Fri, Dec 12, 2014 at 4:41 PM, Carson Holt <carsonhh at gmail.com> wrote:
> > The gene models are actually produced by SNAP, Augustus, or whatever gene
> > predictor you are using, so if you change the HMM every round, then the
> > models will change too.  But I have one concern.  You are using a very
> > sparse protein evidence dataset.  The protein dataset is very important
> > to MAKER’s performance, and for iterative training of the ab initio
> > predictors.  Normally after the second iteration, additional training
> > should not be beneficial, but if you are getting wildly different results
> > in the 3rd and 4th rounds, then you probably aren’t getting enough good
> > models to train with.
> >
> > For a protein dataset you should be using the entire proteome from a
> > minimum of two related species and perhaps all of UniProt/Swiss-Prot to
> > get a broad protein database.  Don’t use the proteins extracted by CEGMA
> > and HaMSTr.  CEGMA can be used to guide the first HMM creation (the
> > cegma2zff script that comes with MAKER), but don’t give the proteins to
> > MAKER as evidence; the HaMSTr results will also be redundant with the
> > ESTs.  You need proteins from related species to look for homology not
> > found in the EST dataset.
> >
> > Also, repeat masking is important for any genome and has a huge effect on
> > ab initio predictor performance.  Make sure you run something like
> > RepeatModeler to look for species-specific repeats that will not already
> > be in RepBase.  Then add those results to the rmlib= option in the MAKER
> > control files.
> >
> > Thanks,
> > Carson
> >
> >
> >
> >
> > On Dec 12, 2014, at 7:10 AM, Dennis, Alice <Alice.Dennis at eawag.ch>
> > wrote:
> >
> > Hi all,
> >
> > I am a relatively new user of Maker2, and I’m looking for advice on
> > running many iterations of the same dataset in Maker2.
> >
> > I have a relatively small genome (~124 Mb) from a wasp that is assembled
> > into ~1,500 scaffolds. I have run several iterations of Maker2 by
> > re-generating .hmms in SNAP and feeding them into the next round, and my
> > gene predictions keep increasing (in number and in size).  The only thing
> > that changes at each round is the .hmm.
> > The evidence that I give is:
> > - de novo assembled ESTs from a different strain of the same species
> >   (70,000 contigs… I am currently working on improving this assembly
> >   with the hope that this will be helpful here)
> > - 610 proteins extracted from the genome scaffolds using CEGMA and
> >   HaMSTr
> >
> > For my 1st iteration, I used the Nasonia .hmm from SNAP, and the
> > est2genome/protein2genome option.
> >
> > For the 2nd, 3rd and 4th rounds I have used .hmms generated from the
> > previous round, all without the est2genome/protein2genome option. All
> > other files are the same as in the original run.
> >
> > As I understand it, after the second round, nothing should change in
> > Maker2. But the differences are obvious between runs. Some entirely new
> > exons are annotated. For example, just counting “exon” in the .gff file
> > gives me 73,000 after the third iteration and 96,000 after the fourth!
> > Actually the biggest leap in this number is between the third and fourth
> > round. I can also see that many features are longer when I look at the
> > files in Geneious.
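
A note on counting: a plain text search for "exon" can also match lines where
the word only appears in the attributes column, so the totals may be
inflated. Below is a minimal Python sketch that counts by the feature-type
column (column 3 of GFF3) instead, optionally restricted to one source. The
function name and the toy lines are made up; passing "maker" as the source
for final annotations is an assumption about how your GFF is labeled.

```python
from collections import Counter

def count_feature_types(gff_lines, source=None):
    """Count GFF3 features by type (column 3), optionally for one source."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a well-formed feature line
        if source is not None and cols[1] != source:
            continue
        counts[cols[2]] += 1
    return counts

# Toy lines (hypothetical): the CDS line mentions "exon" only in its attributes.
toy_gff = [
    "##gff-version 3\n",
    "s1\tmaker\texon\t1\t100\t.\t+\t.\tID=t1:exon:1;Parent=t1\n",
    "s1\tmaker\tCDS\t10\t90\t.\t+\t0\tID=t1:cds;Parent=t1;Note=single_exon\n",
    "##FASTA\n",
    ">s1\n",
    "ACGT\n",
]
by_type = count_feature_types(toy_gff, source="maker")
naive_exon_lines = sum("exon" in line for line in toy_gff)
```

Comparing the per-type counts between the third and fourth rounds should show
whether the jump is in annotated exons or in something else that happens to
contain the same word.
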
> >
> > Is this sort of change possible after the second round of Maker2? Is
> > there something I have done wrong in my runs, or am I understanding this
> > output incorrectly?
> >
> > Thank you,
> > Alice
> >
> > _______________________________________________
> > maker-devel mailing list
> > maker-devel at box290.bluehost.com
> > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
> >
> >
>
>
>
> --
>
>
> Alice Dennis
> alicebdennis at gmail.com
>
> Postdoctoral Researcher
> Institute for Integrative Biology, ETH Zürich & EAWAG
> Überlandstrasse 133
> P.O. Box 611
> 8600 Dübendorf, Switzerland
>
> https://adennis5.wordpress.com/
>
>



-- 
Michael Campbell MS, RD.
Doctoral Candidate
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
ph:585-3543