[maker-devel] Can maker select a gene model based on #algoritham predicted it

Gowthaman Ramasamy gowthaman.ramasamy at seattlebiomed.org
Fri Jun 1 13:06:30 MDT 2012


Hi Carson,
I mostly agree with you. It's better to have some form of evidence (BLAST, Pfam, etc.) before calling something a coding gene. Genes that don't have evidence are hard to interpret anyway. But in the organisms we work on (malaria and Leishmania parasites) we tend to see hundreds of genus-specific genes. Of course, as you would suspect, their biological/functional significance is not known. They remain hypothetical proteins for years. Still, the researchers would rather lean towards slight over-prediction than under-prediction.

Here is the approach I follow. I collect a non-redundant set of proteins from all the related genera to supply as evidence in MAKER. I run Augustus and GeneMark inside MAKER. I also supply gene models from another ab initio gene prediction suite (Automagi, our parasite-specific one). Automagi in turn runs 3 algorithms and chooses a consensus gene model. In short, I run 5 gene predictors and choose anything that is predicted by at least 3. The predictions need NOT overlap along their entire length (this helps us pull out genes that are split in two by frame shifts).

Yesterday I wrote a small script that takes all features predicted by MAKER and compares them with 3 GFFs (Automagi=3, Augustus, GeneMark). I set keep_pred=1. The script counts whether a MAKER-chosen gene overlaps with predictions from at least 3 of the 5 algorithms.
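
For anyone who wants to try the same kind of count, here is a minimal Python sketch. It is not the script mentioned above. The function names are made up for illustration, the weight of 3 for Automagi is my reading of "Automagi=3", and requiring the same sequence and strand for an overlap is an assumption.

```python
from collections import defaultdict

def read_genes(gff_lines, feature="gene"):
    """Collect (start, end) intervals per (seqid, strand) from GFF3 lines."""
    genes = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature:
            continue  # only look at the requested feature type
        genes[(cols[0], cols[6])].append((int(cols[3]), int(cols[4])))
    return genes

def overlaps(a, b):
    """True if two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]

def support_votes(maker_genes, predictor_sets):
    """Return {((seqid, strand), (start, end)): votes} for each MAKER gene.

    predictor_sets is a list of (genes, weight) pairs. A predictor adds its
    weight once if any of its genes overlaps the MAKER gene, even partially.
    """
    votes = {}
    for key, intervals in maker_genes.items():
        for iv in intervals:
            total = 0
            for genes, weight in predictor_sets:
                if any(overlaps(iv, other) for other in genes.get(key, [])):
                    total += weight
            votes[(key, iv)] = total
    return votes
```

Genes whose vote count reaches 3 would then be the ones to keep. Partial overlaps count on purpose, since the predictions need not overlap along their entire length.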

Hi Barry,
Thanks for letting me contribute to the wiki. Most of the edits I thought of came from discussions with Carson. I thought adding them would save him a bit of time answering emails in the future. It's possible you already have most of it in the latest wiki. One example is how to train GeneMark. It's really hard to find that in their documentation; I learned it from one of Carson's earlier discussions. My two cents.

Thanks,
Gowthaman

Carson: Thanks for the great tool. And thanks for the even GREATER support.

  
________________________________________
From: Carson Holt [carsonhh at gmail.com]
Sent: Friday, June 01, 2012 11:41 AM
To: Barry Moore
Cc: Gowthaman Ramasamy; maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Can maker select a gene model based on #algoritham predicted it

While I could add an option to keep them if there is more than one, the actual implementation is not as trivial as it seems.  On some organisms like fungi and oomycetes, the predictions that don't overlap evidence tend to be similar to each other across predictors.  But on other eukaryotes with difficult and complex intron/exon structure, like lamprey or even planaria, about the only time two predictors produce similar results is when there is evidence supporting them, and all the unsupported regions are messy with weird partial overlaps (sometimes even conflicting reading frames).  I have a figure in the MAKER2 paper showing how poorly these algorithms perform on such organisms and how the additional evidence-based feedback provided by MAKER produces dramatically improved results.

The way I get around these issues when choosing the non-redundant, non-overlapping proteins recorded at the end of a MAKER run is to use a complex variant of the AED calculation across the alternate predictions to build a consensus.  So in short, it's not as simple as just saying there are two predictions at a given locus.  It would require some thought (as well as good documentation), but it could probably be done.
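
To give a feel for what an AED-style comparison between two predictions looks like, here is a rough Python sketch of the plain pairwise, base-level calculation (AED = 1 - (SN + SP) / 2). This is not the variant MAKER uses internally, which is more involved; the function names and the exon-list input format are assumptions for illustration.

```python
def covered_bases(exons):
    """Return the set of positions covered by 1-based inclusive exons."""
    bases = set()
    for start, end in exons:
        bases.update(range(start, end + 1))
    return bases

def aed(model_a, model_b):
    """Base-level annotation edit distance between two exon lists.

    0.0 means the two models cover exactly the same bases,
    1.0 means they share no bases at all.
    """
    a, b = covered_bases(model_a), covered_bases(model_b)
    if not a or not b:
        return 1.0  # nothing to compare against
    shared = len(a & b)
    sensitivity = shared / len(b)  # fraction of model_b recovered by model_a
    specificity = shared / len(a)  # fraction of model_a confirmed by model_b
    return 1.0 - (sensitivity + specificity) / 2.0
```

Two predictions with a messy partial overlap land somewhere between 0 and 1, which is why a simple "there are two predictions here" test says little about whether they actually agree.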

--Carson

From: Barry Moore <barry.moore at genetics.utah.edu>
Date: Friday, 1 June, 2012 2:22 PM
To: Carson Holt <carsonhh at gmail.com>
Cc: Gowthaman Ramasamy <gowthaman.ramasamy at seattlebiomed.org>, "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
Subject: Re: [maker-devel] Can maker select a gene model based on #algoritham predicted it

Carson,

How hard would it be to have MAKER take an option, something like 'require_abinits=2', that would instruct MAKER to promote predictions that overlap with 2, 3 or more other predictions?  It seems like MAKER might already have all that info in one place at some point.

Gowthaman, your contributions to the maker tutorial would be most welcome.  I've got an offline copy of a newer tutorial wiki that is more up to date than the GMOD version.  It's on a server right now that we've got locked behind a firewall, but I'm hoping to move that to a public facing server in the next week and I'd be happy to give you an account on the wiki.

B

On May 30, 2012, at 6:54 AM, Carson Holt wrote:

It's not an option in exactly the way you are specifying, but there is
something I usually do for annotation that works well.  I run interproscan
or rpsblast on the non_overlapping.proteins.fasta file and select just
those non-overlapping models that have a recognizable protein domain (just
searching the pfam domain space is more than sufficient).  Then I provide
the selected results to model_gff, and provide the previous maker results
to the maker_gff option (with all reannotation pass options set to 1 and
all analysis options turned off).  This adds only models that have at
least one recognizable domain (as even multiple gene predictors can
overpredict in a similar way).
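
The selection step described above could be sketched in Python roughly as follows. This is only an illustration and not the attached script: it assumes a tab-delimited domain hit table with the protein ID in column 1 and the analysis name (for example "Pfam") in column 4, and the function names are made up.

```python
def ids_with_domain(tsv_lines, analysis="Pfam"):
    """Return the IDs of proteins with at least one hit from `analysis`."""
    ids = set()
    for line in tsv_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 4 and cols[3] == analysis:
            ids.add(cols[0])
    return ids

def filter_fasta(fasta_lines, keep_ids):
    """Keep only FASTA records whose first header word is in keep_ids."""
    kept, keep = [], False
    for line in fasta_lines:
        if line.startswith(">"):
            words = line[1:].split()
            keep = bool(words) and words[0] in keep_ids
        if keep:
            kept.append(line)
    return kept
```

The resulting ID set is what would then be used to pick the matching predictions out of the GFF3 before passing them to model_gff.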

Attached is a script to help select predictions and upgrade them to models
in GFF3 format.  If you have questions, let me know.

Thanks,
Carson



On 12-05-29 5:54 PM, "Gowthaman Ramasamy"
<gowthaman.ramasamy at seattlebiomed.org> wrote:

Hi Carson,
Thanks for all the help during the long weekend, in spite of that long
drive. I am still trying to imagine that.

I now have MAKER considering our own predictions via pred_gff, and using
Augustus and GeneMark (with our training model). I was also able to use
altest and protein evidence. MAKER happily picks one gene model when
there is an overlap between three different predictions. But when I look
at the GFF, it seems to pick a gene model only when there is
EST/protein evidence. It leaves out some genes even though they are
predicted by all three algorithms. Of course, keep_pred=1 helps to keep
all the models, but this leads to over-prediction.

But I am looking for something in between, and would like to know if
that is possible:
1) Pick a gene model if it has evidence (EST/protein etc.),
irrespective of how many algorithms predicted it.
2) In the absence of extrinsic evidence (EST/protein etc.), pick a gene model
if it is predicted by at least two algorithms.

Or even simpler:
I have ab initio predictions from three algorithms. Can I output those
genes that are supported by at least two of them? I care less about
the exactness of gene boundaries.
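
Written as code, the two rules above amount to something like the following Python sketch. It is not a MAKER feature; the function name and its inputs (an evidence flag and a list of predictor names per locus) are hypothetical.

```python
def keep_model(has_evidence, predictors, min_predictors=2):
    """Apply the two rules: evidence wins, otherwise count predictors.

    has_evidence: True if any EST/protein evidence overlaps the model.
    predictors: names of the ab initio predictors that called this locus.
    """
    if has_evidence:
        return True  # rule 1: evidence is enough on its own
    # rule 2: without evidence, require enough distinct predictors
    return len(set(predictors)) >= min_predictors
```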

Thanks,
Gowthaman

PS: With my recent attempts, I learned a couple of things about MAKER and
associated tools that are not documented in the gmod-maker wiki. Is it
possible/OK if I add content to it? I am okay with running it by you
before making it public.

<gff3_preds2models>
_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Barry Moore
Research Scientist
Dept. of Human Genetics
University of Utah
Salt Lake City, UT 84112
--------------------------------------------
(801) 585-3543







More information about the maker-devel mailing list