[maker-devel] Can maker select a gene model based on #algoritham predicted it

Fri Jun 1 14:39:25 MDT 2012

I like this a lot, Carson, for two reasons. First, it sounds fairly easy to implement with the data and code that already exist within MAKER! And second, it sounds like the right way to be doing this: the more the ab initio predictors agree, the more likely they are to be correct.

B

On Jun 1, 2012, at 1:23 PM, Carson Holt wrote:

> The metric AED is Annotation Edit Distance (original paper -->
> http://www.biomedcentral.com/1471-2105/10/67).  It's roughly related to
> the sensitivity/specificity measure used to quantify the performance of
> gene predictors and can be used to measure changes in gene models across
> releases.  I have further adapted it for a slightly different purpose
> than the one given in the original paper above.
> 
> This is copied from the MAKER2 paper -->
> "Given a gene prediction i and a reference j, the base pair level
> sensitivity can be calculated using the formula SN = |i∩j|/|j|; where
> |i∩j| represents the number overlapping nucleotides between i and j, and
> |j| represents the total number of nucleotides in the reference j.
> Alternatively, specificity is calculated using the formula SP = |i∩j|/|i|,
> and accuracy is the average of the two.  Because we are not comparing to a
> high quality reference (reference is arbitrary for AED), it is more
> correct to refer to the average of sensitivity and specificity as the
> congruency rather than accuracy; where C = (SN+SP)/2. The incongruency, or
> distance between i and j, then becomes D = 1-C, with a value of 0
> indicating complete agreement of an annotation to the evidence, and values
> at or near 1 indicating disagreement or no evidence support."
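
To make the formulas in the quoted passage concrete, here is a minimal Python sketch of the base-pair level calculation. It is not MAKER's code: the function names (covered_bases, aed) and the example coordinates are made up for illustration, and returning 1.0 when the two sides share no bases is an assumption based on the "no evidence support" wording above.

```python
def covered_bases(intervals):
    """Return the set of positions covered by inclusive (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def aed(prediction, reference):
    """Distance D = 1 - C, where C = (SN + SP) / 2 at the base-pair level."""
    i = covered_bases(prediction)
    j = covered_bases(reference)
    overlap = len(i & j)
    if overlap == 0:
        return 1.0  # assumption: no shared bases is treated as no support
    sn = overlap / len(j)  # SN = |i∩j| / |j|
    sp = overlap / len(i)  # SP = |i∩j| / |i|
    return 1.0 - (sn + sp) / 2.0


# Hypothetical exon coordinates for a prediction and its evidence.
pred = [(100, 199), (300, 399)]  # 200 bp in total
evid = [(150, 199), (300, 449)]  # 200 bp in total, 150 bp shared with pred
print(aed(pred, evid))  # SN = SP = 0.75, so D = 0.25
```

Using sets of positions keeps the sketch short; a real implementation would work on merged intervals rather than enumerating every base.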
> 
> 
> 
> The ab-initio AED, in comparison, is the pairwise AED calculated between
> each pair of overlapping predictions and then averaged.  Each prediction
> then has a score representing its average distance from the overlapping
> set of predictions as a whole.  So a value of 0.1 would be 10% average
> incongruency or 90% average congruency.
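
As a rough illustration of the averaging described above, the following Python sketch scores each prediction by its mean pairwise distance to the other predictions it overlaps. This is only one reading of the description, not MAKER's internal code. The names (bases, aed, abinit_aed), the example coordinates, and the choice to give a prediction with no overlapping partner a score of 1.0 are all assumptions.

```python
def bases(exons):
    """Positions covered by inclusive (start, end) exons."""
    return {pos for start, end in exons for pos in range(start, end + 1)}


def aed(a, b):
    """Base-pair level distance: 1 minus the mean of SN and SP."""
    i, j = bases(a), bases(b)
    shared = len(i & j)
    if shared == 0:
        return 1.0
    return 1.0 - (shared / len(j) + shared / len(i)) / 2.0


def abinit_aed(predictions):
    """Map each prediction name to its mean AED against overlapping predictions."""
    scores = {}
    for name, exons in predictions.items():
        dists = [
            aed(exons, other)
            for other_name, other in predictions.items()
            if other_name != name and bases(exons) & bases(other)
        ]
        # Assumption: a prediction that overlaps nothing gets the worst score.
        scores[name] = sum(dists) / len(dists) if dists else 1.0
    return scores


# Hypothetical predictions at one locus from three sources.
locus = {
    "augustus": [(100, 199)],
    "genemark": [(150, 249)],
    "pred_gff": [(100, 199)],
}
scores = abinit_aed(locus)
print(scores)
```

In this example the two identical predictions each average 0.25, while the shifted one averages 0.5, i.e. 50% average incongruency with the set.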
> 
> Thanks,
> Carson
> 
> 
> 
> On 12-06-01 3:07 PM, "Gowthaman Ramasamy"
> <gowthaman.ramasamy at seattlebiomed.org> wrote:
> 
>> That sounds really good.
>> 
>> Just wondering, what would that floating point value mean?
>> 
>> Is it the fraction of gene prediction algorithms that predicted that
>> region to contain a gene (irrespective of boundaries matching), so 0.2
>> means 20% of algorithms predicted it?
>> Or
>> does it just indicate the level of concordance (in maker language), and
>> the user has to try different values before settling on one?
>> 
>> Thanks,
>> gowthaman
>> ________________________________________
>> From: Carson Holt [carsonhh at gmail.com]
>> Sent: Friday, June 01, 2012 11:52 AM
>> To: Barry Moore
>> Cc: Gowthaman Ramasamy; maker-devel at yandell-lab.org
>> Subject: Re: [maker-devel] Can maker select a gene model based on
>> #algoritham predicted it
>> 
>> One idea related to this.  I could have keep_preds be a floating point
>> value between 0 and 1.  This would then represent a threshold for an
>> internal MAKER value called the ab-initio AED (it already exists
>> internally deep in MAKER).  0 would turn keep_preds off (as it does now),
>> 1 would keep everything (as it does now), and values in between would
>> allow the user to dial in the degree of consensus among overlapping
>> predictions when considering them without evidence.  The ab-initio AED
>> already works similarly to AED, with 0 being perfect concordance and 1
>> being complete discordance.
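
One way to picture the proposed behaviour is the short Python sketch below. It is a guess at the semantics described above, not an existing MAKER feature: keep_unsupported is a hypothetical helper, the scores are made-up ab-initio AED values for predictions without evidence, and treating the threshold as an upper bound on the score is an assumption.

```python
def keep_unsupported(abinit_scores, keep_preds):
    """Return names of unsupported predictions that pass the threshold.

    abinit_scores maps prediction name -> ab-initio AED (0 = concordant,
    1 = discordant). keep_preds is the proposed 0..1 threshold.
    """
    if not 0.0 <= keep_preds <= 1.0:
        raise ValueError("keep_preds must be between 0 and 1")
    if keep_preds == 0:
        return []  # 0 turns the option off, as described above
    # Assumption: keep predictions whose distance is at or below the threshold.
    return sorted(n for n, s in abinit_scores.items() if s <= keep_preds)


# Hypothetical scores for three predictions lacking evidence support.
example_scores = {"pred_a": 0.1, "pred_b": 0.4, "pred_c": 1.0}
print(keep_unsupported(example_scores, 0))     # nothing kept
print(keep_unsupported(example_scores, 0.25))  # only the most concordant
print(keep_unsupported(example_scores, 1))     # everything kept
```

Under this reading, lower values demand stronger agreement among the overlapping predictions before an unsupported one is kept.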
>> 
>> --Carson
>> 
>> 
>> 
>> From: Carson Holt <carsonhh at gmail.com>
>> Date: Friday, 1 June, 2012 2:41 PM
>> To: Barry Moore <barry.moore at genetics.utah.edu>
>> Cc: Gowthaman Ramasamy <gowthaman.ramasamy at seattlebiomed.org>,
>> "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
>> Subject: Re: [maker-devel] Can maker select a gene model based on
>> #algoritham predicted it
>> 
>> While I could add an option to keep them if there are more than one, the
>> actual implementation is not as trivial as it seems.  On some organisms
>> like fungi and oomycetes, the predictions that don't overlap evidence
>> tend to be similar to each other across predictors, but on other
>> eukaryotes with difficult and complex intron/exon structure like lamprey
>> or even planaria, about the only time two predictors will produce similar
>> results correlates with when there is evidence supporting them, and all
>> the unsupported regions are messy with weird partial overlaps (sometimes
>> even conflicting reading frames).  I have a figure in the MAKER2 paper
>> showing how poorly these algorithms perform on such organisms and how
>> additional evidence based feedback provided by MAKER produces
>> dramatically improved results.
>> 
>> The way I get around the issues when choosing the non-redundant
>> non-overlapping proteins recorded at the end of a MAKER run uses a
>> complex variant of the AED calculation across the alternate predictions
>> to build a consensus.  So in short it's not exactly as simple as just
>> saying there are two predictions at a given locus.  It would require some
>> thought (as well as good documentation), but it could probably be done.
>> 
>> --Carson
>> 
>> From: Barry Moore <barry.moore at genetics.utah.edu>
>> Date: Friday, 1 June, 2012 2:22 PM
>> To: Carson Holt <carsonhh at gmail.com>
>> Cc: Gowthaman Ramasamy <gowthaman.ramasamy at seattlebiomed.org>,
>> "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
>> Subject: Re: [maker-devel] Can maker select a gene model based on
>> #algoritham predicted it
>> 
>> Carson,
>> 
>> How hard would it be to have maker take an option something like
>> 'require_abinits=2' that would instruct maker to promote predictions that
>> overlap with (2, 3 or more) other predictions?  Seems like maker might
>> have all that info in one place at some point already?
>> 
>> Gowthaman, your contributions to the maker tutorial would be most
>> welcome.  I've got an offline copy of a newer tutorial wiki that is more
>> up to date than the GMOD version.  It's on a server right now that we've
>> got locked behind a firewall, but I'm hoping to move that to a public
>> facing server in the next week and I'd be happy to give you an account on
>> the wiki.
>> 
>> B
>> 
>> On May 30, 2012, at 6:54 AM, Carson Holt wrote:
>> 
>> It's not an option in exactly the way you are specifying, but there is
>> something I usually do for annotation that works well.  I run interproscan
>> or rpsblast on the non_overlapping.proteins.fasta file and select just
>> those non-overlapping models that have a recognizable protein domain (just
>> searching the Pfam domain space is more than sufficient).  Then I provide
>> the selected results to model_gff, and provide the previous maker results
>> to the maker_gff option (with all reannotation pass options set to 1 and
>> all analysis options turned off).  This adds models with at least
>> recognizable domains (as even multiple gene predictors can overpredict in
>> a similar way).
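
The selection step described above can be sketched in Python roughly as follows. This is not the attached script. It assumes the domain search output is tab-delimited with the query protein ID in the first column, and that the FASTA ID is the first whitespace-delimited token of each header; the function names and sample records are hypothetical.

```python
def ids_with_hits(hit_lines):
    """Collect query IDs (first tab-delimited column) that have a domain hit."""
    ids = set()
    for line in hit_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        ids.add(line.split("\t")[0])
    return ids


def filter_fasta(fasta_lines, keep_ids):
    """Keep only FASTA records whose ID is in keep_ids."""
    kept = []
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            keep = bool(header) and header[0] in keep_ids
        if keep:
            kept.append(line.rstrip("\n"))
    return kept


# Hypothetical domain hits and non-overlapping protein records.
hits = ["pred1\tPfam\tPF00001", "# comment", "", "pred3\tPfam\tPF00069"]
fasta = [">pred1 some description", "MKT", ">pred2", "MAA", ">pred3", "MGG", "LLL"]
selected = filter_fasta(fasta, ids_with_hits(hits))
print("\n".join(selected))
```

The resulting ID list is what would then be used to pull the matching predictions out of the GFF3 before passing them to model_gff.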
>> 
>> Attached is a script to help select predictions and upgrade them to models
>> in GFF3 format.  If you have questions, let me know.
>> 
>> Thanks,
>> Carson
>> 
>> 
>> 
>> On 12-05-29 5:54 PM, "Gowthaman Ramasamy"
>> <gowthaman.ramasamy at seattlebiomed.org> wrote:
>> 
>> Hi Carson,
>> Thanks for all the help during the long weekend, in spite of that long
>> drive. I am still trying to imagine that.
>> 
>> I now have maker considering our own predictions via pred_gff, and using
>> augustus and GeneMark (with our training model). And I was able to use
>> altest and protein evidence. Maker happily picks one gene model when
>> there is an overlap between three different predictions. But when I look
>> at the gff, it seems like it picks a gene model only when there is
>> est/protein evidence. It leaves out some genes even though they are
>> predicted by all three algorithms. Of course, keep_preds=1 helps to keep
>> all the models, but this kind of leads to over prediction.
>> 
>> But I am looking for something in between, and would like to know if
>> that is possible:
>> 1) Pick a gene model if it has evidence (est/prot etc.),
>> irrespective of how many algorithms predicted it.
>> 2) In the absence of extrinsic evidence (est/prot etc.), pick a gene model
>> if it is predicted by at least two algorithms.
>> 
>> Or even simpler:
>> I have ab-initio predictions from three algorithms. Can I output those
>> genes that are supported by at least two of them? I care less about
>> the exactness of gene boundaries.
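
Outside of MAKER, the "at least two of three" request can be approximated with a simple overlap vote. The Python sketch below groups predictions whose spans overlap on the same sequence and strand, then keeps groups backed by at least two distinct sources. It ignores exon structure and reading frame, so it is only a crude filter. The Pred tuple, supported_loci, and the coordinates are hypothetical.

```python
from collections import namedtuple

Pred = namedtuple("Pred", "seqid start end strand source name")


def supported_loci(preds, min_sources=2):
    """Group span-overlapping predictions; keep groups with enough sources."""
    loci, current, cur_end = [], [], None
    for p in sorted(preds, key=lambda x: (x.seqid, x.strand, x.start, x.end)):
        same_locus = (
            current
            and p.seqid == current[0].seqid
            and p.strand == current[0].strand
            and p.start <= cur_end
        )
        if same_locus:
            current.append(p)
            cur_end = max(cur_end, p.end)
        else:
            if current:
                loci.append(current)
            current, cur_end = [p], p.end
    if current:
        loci.append(current)
    return [l for l in loci if len({p.source for p in l}) >= min_sources]


# Hypothetical predictions from three sources.
preds = [
    Pred("chr1", 100, 500, "+", "augustus", "a1"),
    Pred("chr1", 120, 480, "+", "genemark", "g1"),
    Pred("chr1", 900, 1200, "+", "augustus", "a2"),
    Pred("chr1", 950, 1300, "-", "genemark", "g2"),
    Pred("chr1", 2000, 2500, "+", "pred_gff", "p1"),
    Pred("chr1", 2100, 2600, "+", "augustus", "a3"),
    Pred("chr1", 2050, 2400, "+", "genemark", "g3"),
]
names = [[p.name for p in locus] for locus in supported_loci(preds)]
print(names)
```

Here a2 and g2 overlap in coordinates but sit on opposite strands, so neither locus reaches two sources and both are dropped. Which single model to report per surviving locus is left open.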
>> 
>> Thanks,
>> Gowthaman
>> 
>> PS: With my recent attempts, I learned a couple of things about
>> maker/other associated tools that are not documented in the gmod-maker
>> wiki. Is it possible/ok if I add content to it? I am okay with running
>> it by you before making it public.
>> 
>> <gff3_preds2models>
>> 
>> Barry Moore
>> Research Scientist
>> Dept. of Human Genetics
>> University of Utah
>> Salt Lake City, UT 84112
>> --------------------------------------------
>> (801) 585-3543
>> 
>> 
>> 
>> 
> 
> 

Barry Moore
Research Scientist
Dept. of Human Genetics
University of Utah
Salt Lake City, UT 84112
--------------------------------------------
(801) 585-3543
