[maker-devel] Can maker select a gene model based on #algoritham predicted it

Gowthaman Ramasamy gowthaman.ramasamy at seattlebiomed.org
Fri Jun 1 13:28:21 MDT 2012


That sounds really neat. I will read those papers. Thanks for sharing.

Gowthaman
________________________________________
From: Carson Holt [carsonhh at gmail.com]
Sent: Friday, June 01, 2012 12:23 PM
To: Gowthaman Ramasamy; Barry Moore
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Can maker select a gene model based on #algoritham predicted it

The metric AED is Annotation Edit Distance (original paper -->
http://www.biomedcentral.com/1471-2105/10/67).  It's roughly related to
the sensitivity/specificity measure used to quantify the performance of
gene predictors and can be used to measure changes in gene models across
releases.  I have further adapted it for a slightly different purpose
than the one given in the original paper above.

This is copied from the MAKER2 paper -->
"Given a gene prediction i and a reference j, the base pair level
sensitivity can be calculated using the formula SN = |i∩j|/|j|; where
|i∩j| represents the number of overlapping nucleotides between i and j, and
|j| represents the total number of nucleotides in the reference j.
Alternatively, specificity is calculated using the formula SP = |i∩j|/|i|,
and accuracy is the average of the two.  Because we are not comparing to a
high quality reference (reference is arbitrary for AED), it is more
correct to refer to the average of sensitivity and specificity as the
congruency rather than accuracy; where C = (SN+SP)/2. The incongruency, or
distance between i and j, then becomes D = 1-C, with a value of 0
indicating complete agreement of an annotation to the evidence, and values
at or near 1 indicating disagreement or no evidence support."
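
A minimal sketch of the formulas quoted above, assuming each annotation is
given as a list of inclusive (start, end) exon coordinates.  The function
names are made up for illustration and are not MAKER's own code.

```python
def covered_bases(intervals):
    """Return the set of positions covered by inclusive (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def aed(prediction, reference):
    """Distance D = 1 - (SN + SP) / 2 between two annotations."""
    i = covered_bases(prediction)
    j = covered_bases(reference)
    if not i or not j:
        # Assumption: an empty annotation counts as complete disagreement.
        return 1.0
    overlap = len(i & j)       # |i∩j|
    sn = overlap / len(j)      # sensitivity, SN = |i∩j|/|j|
    sp = overlap / len(i)      # specificity, SP = |i∩j|/|i|
    congruency = (sn + sp) / 2
    return 1 - congruency


# 150 of 200 bases are shared on each side, so SN = SP = 0.75.
pred = [(101, 200), (301, 400)]
ref = [(151, 200), (301, 450)]
print(aed(pred, ref))  # 0.25
```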



The ab-initio AED, in comparison, is the pairwise AED calculated between
each pair of overlapping predictions and then averaged.  Each prediction
then has a score representing its average distance from the overlapping
set of predictions as a whole.  So a value of 0.1 would be 10% average
incongruency or 90% average congruency.
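
As a rough illustration of that averaging (a sketch only; the real
calculation inside MAKER is more involved), assume the overlapping
predictions at one locus are given as a dict of name -> inclusive
(start, end) exon lists.  All names here are hypothetical.

```python
def covered_bases(intervals):
    """Return the set of positions covered by inclusive (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def aed(a, b):
    """Distance 1 - (SN + SP) / 2 between two annotations."""
    i, j = covered_bases(a), covered_bases(b)
    if not i or not j:
        return 1.0
    overlap = len(i & j)
    return 1 - (overlap / len(j) + overlap / len(i)) / 2


def ab_initio_aed(predictions):
    """Average each prediction's pairwise AED against the others at the locus."""
    scores = {}
    for name, exons in predictions.items():
        distances = [aed(exons, other)
                     for other_name, other in predictions.items()
                     if other_name != name]
        # Assumption: a prediction with nothing to compare against scores 1.
        scores[name] = sum(distances) / len(distances) if distances else 1.0
    return scores


locus = {
    "augustus": [(1, 100)],
    "genemark": [(1, 100)],
    "custom": [(51, 150)],
}
print(ab_initio_aed(locus))
# {'augustus': 0.25, 'genemark': 0.25, 'custom': 0.5}
```

Here the two identical predictions each average 0 and 0.5, giving 0.25,
while the shifted one is 0.5 away from both of the others.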

Thanks,
Carson



On 12-06-01 3:07 PM, "Gowthaman Ramasamy"
<gowthaman.ramasamy at seattlebiomed.org> wrote:

>That sounds really good.
>
>Just wondering, what would that floating-point value mean?
>
>The fraction of gene prediction algorithms that predicted that region to
>contain a gene (irrespective of boundaries matching), so 0.2 means 20% of
>algorithms predicted it?
>Or
>does it just indicate the level of concordance (in maker language), so the
>user has to try different values before settling on one?
>
>Thanks,
>gowthaman
>________________________________________
>From: Carson Holt [carsonhh at gmail.com]
>Sent: Friday, June 01, 2012 11:52 AM
>To: Barry Moore
>Cc: Gowthaman Ramasamy; maker-devel at yandell-lab.org
>Subject: Re: [maker-devel] Can maker select a gene model based on
>#algoritham predicted it
>
>One idea related to this.  I could have keep_preds be a floating point
>value between 0 and 1.  This would then represent a threshold for an
>internal MAKER value called the ab-initio AED (it already exists
>internally deep in MAKER).  0 would turn keep_preds off (as it does now),
>1 would keep everything (as it does now), and values in between would
>allow the user to dial in the degree of consensus among overlapping
>predictions when considering them without evidence.  The ab-initio AED
>already works similarly to AED, with 0 being perfect concordance and 1
>being complete discordance.
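
A hypothetical sketch of how such a threshold could behave.  This is only
a proposal in the message above, so the function name and the "score at or
below the threshold" comparison are assumptions, not existing MAKER
behaviour.

```python
def select_unsupported(ab_initio_aed, keep_preds):
    """Pick evidence-free predictions using a 0..1 keep_preds threshold.

    ab_initio_aed maps a prediction name to its ab-initio AED
    (0 = perfect concordance, 1 = complete discordance).
    """
    if not 0.0 <= keep_preds <= 1.0:
        raise ValueError("keep_preds must be between 0 and 1")
    if keep_preds == 0:
        return []                      # off, as keep_preds=0 is described
    if keep_preds == 1:
        return sorted(ab_initio_aed)   # keep everything, as keep_preds=1 is described
    # Assumed rule for values in between: keep predictions whose score
    # is at or below the threshold.
    return sorted(name for name, score in ab_initio_aed.items()
                  if score <= keep_preds)


scores = {"pred-a": 0.05, "pred-b": 0.4, "pred-c": 1.0}
print(select_unsupported(scores, 0.1))  # ['pred-a']
```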
>
>--Carson
>
>
>
>From: Carson Holt <carsonhh at gmail.com>
>Date: Friday, 1 June, 2012 2:41 PM
>To: Barry Moore <barry.moore at genetics.utah.edu>
>Cc: Gowthaman Ramasamy <gowthaman.ramasamy at seattlebiomed.org>,
>"maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
>Subject: Re: [maker-devel] Can maker select a gene model based on
>#algoritham predicted it
>
>While I could add an option to keep them if there is more than one, the
>actual implementation is not as trivial as it seems.  On some organisms
>like fungi and oomycetes, the predictions that don't overlap evidence
>tend to be similar to each other across predictors, but on other
>eukaryotes with difficult and complex intron/exon structure, like lamprey
>or even planaria, about the only time two predictors will produce similar
>results is when there is evidence supporting them, and all
>the unsupported regions are messy with weird partial overlaps (sometimes
>even conflicting reading frames).  I have a figure in the MAKER2 paper
>showing how poorly these algorithms perform on such organisms and how
>additional evidence-based feedback provided by MAKER produces
>dramatically improved results.
>
>To get around these issues when choosing the non-redundant,
>non-overlapping proteins recorded at the end of a MAKER run, I use a
>complex variant of the AED calculation across the alternate predictions
>to build a consensus.  So in short, it's not as simple as just
>saying there are two predictions at a given locus.  It would require some
>thought (as well as good documentation), but it could probably be done.
>
>--Carson
>
>From: Barry Moore <barry.moore at genetics.utah.edu>
>Date: Friday, 1 June, 2012 2:22 PM
>To: Carson Holt <carsonhh at gmail.com>
>Cc: Gowthaman Ramasamy <gowthaman.ramasamy at seattlebiomed.org>,
>"maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
>Subject: Re: [maker-devel] Can maker select a gene model based on
>#algoritham predicted it
>
>Carson,
>
>How hard would it be to have maker take an option, something like
>'require_abinits=2', that would instruct maker to promote predictions that
>overlap with (2, 3 or more) other predictions?  It seems like maker
>might already have all that info in one place at some point.
>
>Gowthaman, your contributions to the maker tutorial would be most
>welcome.  I've got an offline copy of a newer tutorial wiki that is more
>up to date than the GMOD version.  It's on a server right now that we've
>got locked behind a firewall, but I'm hoping to move that to a public
>facing server in the next week and I'd be happy to give you an account on
>the wiki.
>
>B
>
>On May 30, 2012, at 6:54 AM, Carson Holt wrote:
>
>It's not an option in exactly the way you are specifying, but there is
>something I usually do for annotation that works well.  I run interproscan
>or rpsblast on the non_overlapping.proteins.fasta file and select just
>those non-overlapping models that have a recognizable protein domain (just
>searching the Pfam domain space is more than sufficient).  Then I provide
>the selected results to model_gff, and provide the previous maker results
>to the maker_gff option (with all reannotation pass options set to 1 and
>all analysis options turned off).  This adds models that at least have
>recognizable domains (as even multiple gene predictors can overpredict in
>a similar way).
>
>Attached is a script to help select predictions and upgrade them to models
>in GFF3 format.  If you have questions, let me know.
>
>Thanks,
>Carson
>
>
>
>On 12-05-29 5:54 PM, "Gowthaman Ramasamy"
><gowthaman.ramasamy at seattlebiomed.org> wrote:
>
>Hi Carson,
>Thanks for all the help during the long weekend, in spite of that long
>drive. I am still trying to imagine that.
>
>I now have maker considering our own predictions via pred_gff, and using
>augustus and GeneMark (with our training model). And I was able to use
>altest and protein evidence. Maker happily picks one gene model when
>there is an overlap between three different predictions. But when I look
>at the gff, it seems like it picks a gene model only when there is
>est/protein evidence. It leaves out some genes even though they are
>predicted by all three algorithms. Of course, keep_preds=1 helps to keep
>all the models, but this leads to overprediction.
>
>But I am looking for something in between, and would like to know if
>that is possible:
>1) Pick a gene model if it has evidence (est/prot etc.),
>irrespective of how many algorithms predicted it.
>2) In the absence of extrinsic evidence (est/prot etc.), pick a gene model
>if it is predicted by at least two algorithms.
>
>Or even simpler:
>I have ab-initio predictions from three algorithms. Can I output those
>genes that are supported by at least two of them? I care less about
>the exactness of gene boundaries.
>
>Thanks,
>Gowthaman
>
>PS: With my recent attempts, I learned a couple of things about maker and
>its associated tools that are not documented in the gmod-maker wiki. Is it
>possible/ok if I add content to it? I am okay with running it by you
>before making it public.
>
><gff3_preds2models>
>
>Barry Moore
>Research Scientist
>Dept. of Human Genetics
>University of Utah
>Salt Lake City, UT 84112
>--------------------------------------------
>(801) 585-3543
>
>
>
>