[maker-devel] Can maker select a gene model based on #algoritham predicted it
Barry Moore
barry.moore at genetics.utah.edu
Fri Jun 1 14:39:25 MDT 2012
I like this a lot Carson, for two reasons. First, it sounds like it's fairly easy to implement with the data and code that already exist within MAKER. Second, it sounds like the right way to be doing this: the more the ab initio predictors agree, the more likely they are to be correct.
B
On Jun 1, 2012, at 1:23 PM, Carson Holt wrote:
> The metric AED is Annotation Edit Distance (original paper -->
> http://www.biomedcentral.com/1471-2105/10/67). It's roughly related to
> the sensitivity/specificity measure used to quantify the performance of
> gene predictors and can be used to measure changes in gene models across
> releases, and I have further adapted it for a slightly different purpose
> than the one given in the original paper above.
>
> This is copied from the MAKER2 paper -->
> "Given a gene prediction i and a reference j, the base pair level
> sensitivity can be calculated using the formula SN = |i∩j|/|j|; where
> |i∩j| represents the number overlapping nucleotides between i and j, and
> |j| represents the total number of nucleotides in the reference j.
> Alternatively, specificity is calculated using the formula SP = |i∩j|/|i|,
> and accuracy is the average of the two. Because we are not comparing to a
> high quality reference (reference is arbitrary for AED), it is more
> correct to refer to the average of sensitivity and specificity as the
> congruency rather than accuracy; where C = (SN+SP)/2. The incongruency, or
> distance between i and j, then becomes D = 1-C, with a value of 0
> indicating complete agreement of an annotation to the evidence, and values
> at or near 1 indicating disagreement or no evidence support."
>
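The formulas quoted above can be tried out on toy coordinates. This is only a minimal sketch, not MAKER's internal code: the function names, the 1-based inclusive (start, end) exon representation, and the choice to return 1.0 when either side has no bases are all assumptions made for illustration.

```python
def covered_bases(exons):
    """Return the set of positions covered by 1-based inclusive (start, end) exons."""
    bases = set()
    for start, end in exons:
        bases.update(range(start, end + 1))
    return bases


def aed(prediction, reference):
    """Distance D = 1 - C, where C = (SN + SP) / 2.

    SN = |i∩j| / |j| and SP = |i∩j| / |i|, with i the prediction and
    j the (arbitrary) reference, as in the quoted passage.
    """
    i = covered_bases(prediction)
    j = covered_bases(reference)
    if not i or not j:
        # Assumption: an empty side is treated as "no support".
        return 1.0
    shared = len(i & j)
    sn = shared / len(j)
    sp = shared / len(i)
    return 1.0 - (sn + sp) / 2.0


# Half of each model overlaps the other, so SN = SP = 0.5 and D = 0.5.
print(aed([(1, 100)], [(51, 150)]))
```

Identical models give 0, models with no shared bases give 1, and a prediction that fully contains a shorter reference is penalised only through SP.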
>
>
> The ab-initio AED, in comparison, is the pairwise AED calculated between
> each pair of overlapping predictions and then averaged. Each prediction
> then has a score representing its average distance from the overlapping
> set of predictions as a whole. So a value of 0.1 would mean 10% average
> incongruency, or 90% average congruency.
>
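A rough sketch of the averaging described above, under stated assumptions: predictions are given as named lists of 1-based inclusive exon intervals that are already known to overlap, and a prediction with no overlapping partner is scored 1.0. The names `aed` and `ab_initio_aed` are illustrative and are not MAKER functions.

```python
def bases(exons):
    """Positions covered by 1-based inclusive (start, end) exons."""
    return {pos for start, end in exons for pos in range(start, end + 1)}


def aed(a, b):
    """Pairwise distance 1 - (SN + SP) / 2 between two exon lists."""
    i, j = bases(a), bases(b)
    if not i or not j:
        return 1.0
    shared = len(i & j)
    return 1.0 - (shared / len(j) + shared / len(i)) / 2.0


def ab_initio_aed(predictions):
    """Map each prediction name to its mean AED against all the others.

    predictions: dict of name -> exon list, all from one overlapping set.
    """
    scores = {}
    for name, exons in predictions.items():
        dists = [aed(exons, other)
                 for other_name, other in predictions.items()
                 if other_name != name]
        # Assumption: a lone prediction has no consensus, so it scores 1.0.
        scores[name] = sum(dists) / len(dists) if dists else 1.0
    return scores


cluster = {
    "augustus": [(1, 100)],
    "genemark": [(1, 100)],
    "custom": [(51, 150)],
}
print(ab_initio_aed(cluster))
```

In this toy cluster the two identical models each average 0.25 and the shifted one averages 0.5, so a lower score marks the prediction closest to the consensus.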
> Thanks,
> Carson
>
>
>
> On 12-06-01 3:07 PM, "Gowthaman Ramasamy"
> <gowthaman.ramasamy at seattlebiomed.org> wrote:
>
>> That sounds really good.
>>
>> Just wondering what would that floating point value mean?
>>
>> The fraction of gene prediction algorithms that predicted that region to
>> contain a gene (irrespective of boundaries matching), so 0.2 means 20% of
>> algorithms predicted it?
>> Or
>> does it just indicate the level of concordance (in maker language), and
>> the user has to try different values before settling on one?
>>
>> Thanks,
>> gowthaman
>> ________________________________________
>> From: Carson Holt [carsonhh at gmail.com]
>> Sent: Friday, June 01, 2012 11:52 AM
>> To: Barry Moore
>> Cc: Gowthaman Ramasamy; maker-devel at yandell-lab.org
>> Subject: Re: [maker-devel] Can maker select a gene model based on
>> #algoritham predicted it
>>
>> One idea related to this. I could have keep_preds be a floating point
>> value between 0 and 1. This would then represent a threshold for an
>> internal MAKER value called the ab-initio AED (it already exists
>> internally deep in MAKER). 0 would turn keep_preds off (as it does now),
>> 1 would keep everything (as it does now), and values in between would
>> allow the user to dial in the degree of consensus among overlapping
>> predictions when considering them without evidence. The ab-initio AED
>> already works similarly to AED, with 0 being perfect concordance and 1
>> being complete discordance.
>>
>> --Carson
>>
>>
>>
>> From: Carson Holt <carsonhh at gmail.com>
>> Date: Friday, 1 June, 2012 2:41 PM
>> To: Barry Moore <barry.moore at genetics.utah.edu>
>> Cc: Gowthaman Ramasamy <gowthaman.ramasamy at seattlebiomed.org>,
>> "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
>> Subject: Re: [maker-devel] Can maker select a gene model based on
>> #algoritham predicted it
>>
>> While I could add an option to keep them if there is more than one, the
>> actual implementation is not as trivial as it seems. On some organisms
>> like fungi and oomycetes, the predictions that don't overlap evidence
>> tend to be similar to each other across predictors. But on other
>> eukaryotes with difficult and complex intron/exon structure, like lamprey
>> or even planaria, about the only time two predictors produce similar
>> results is when there is evidence supporting them, and all the
>> unsupported regions are messy, with weird partial overlaps (sometimes
>> even conflicting reading frames). I have a figure in the MAKER2 paper
>> showing how poorly these algorithms perform on such organisms and how
>> additional evidence-based feedback provided by MAKER produces
>> dramatically improved results.
>>
>> To get around these issues when choosing the non-redundant,
>> non-overlapping proteins recorded at the end of a MAKER run, I use a
>> complex variant of the AED calculation across the alternate predictions
>> to build a consensus. So in short, it's not as simple as just saying
>> there are two predictions at a given locus. It would require some
>> thought (as well as good documentation), but it could probably be done.
>>
>> --Carson
>>
>> From: Barry Moore <barry.moore at genetics.utah.edu>
>> Date: Friday, 1 June, 2012 2:22 PM
>> To: Carson Holt <carsonhh at gmail.com>
>> Cc: Gowthaman Ramasamy <gowthaman.ramasamy at seattlebiomed.org>,
>> "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
>> Subject: Re: [maker-devel] Can maker select a gene model based on
>> #algoritham predicted it
>>
>> Carson,
>>
>> How hard would it be to have maker take an option, something like
>> 'require_abinits=2', that would instruct maker to promote predictions that
>> overlap with (2, 3 or more) other predictions? It seems like maker
>> might already have all that info in one place at some point.
>>
>> Gowthaman, your contributions to the maker tutorial would be most
>> welcome. I've got an offline copy of a newer tutorial wiki that is more
>> up to date than the GMOD version. It's on a server right now that we've
>> got locked behind a firewall, but I'm hoping to move that to a public
>> facing server in the next week and I'd be happy to give you an account on
>> the wiki.
>>
>> B
>>
>> On May 30, 2012, at 6:54 AM, Carson Holt wrote:
>>
>> It's not an option in exactly the way you are specifying, but there is
>> something I usually do for annotation that works well. I run interproscan
>> or rpsblast on the non_overlapping.proteins.fasta file and select just
>> those non-overlapping models that have a recognizable protein domain (just
>> searching the Pfam domain space is more than sufficient). Then I provide
>> the selected results to model_gff, and provide the previous maker results
>> to the maker_gff option (with all reannotation pass options set to 1 and
>> all analysis options turned off). This adds models with at least
>> recognizable domains (as even multiple gene predictors can overpredict in
>> a similar way).
>>
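The domain-based selection step described above could be sketched like this. It is not the attached script: it assumes tab-separated domain search output with the protein ID in column 1 and the analysis name in column 4 (the layout InterProScan 5 uses for its TSV output), and the function name and sample IDs are made up for illustration.

```python
def ids_with_domain_hit(tsv_lines, analysis="Pfam"):
    """Collect protein IDs that have at least one hit from `analysis`.

    Assumes tab-separated rows: column 1 is the protein ID and
    column 4 is the analysis name. Other layouts need other indexes.
    """
    keep = set()
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4 and fields[3] == analysis:
            keep.add(fields[0])
    return keep


# Hypothetical rows: ID, checksum, length, analysis, signature accession.
rows = [
    "\t".join(["pred-gene-0.1", "abc", "310", "Pfam", "PF00069"]),
    "\t".join(["pred-gene-0.2", "def", "120", "Coils", "Coil"]),
    "\t".join(["pred-gene-0.3", "ghi", "450", "Pfam", "PF00096"]),
]
print(sorted(ids_with_domain_hit(rows)))
```

The resulting ID set is what one would then use to pull the matching predictions out of the GFF3 before handing them to model_gff.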
>> Attached is a script to help select predictions and upgrade them to models
>> in GFF3 format. If you have questions, let me know.
>>
>> Thanks,
>> Carson
>>
>>
>>
>> On 12-05-29 5:54 PM, "Gowthaman Ramasamy"
>> <gowthaman.ramasamy at seattlebiomed.org> wrote:
>>
>> Hi Carson,
>> Thanks for all the help during the long weekend, in spite of that long
>> drive. I am still trying to imagine that.
>>
>> I now have maker considering our own prediction via pred_gff, and using
>> augustus and GeneMark (with our training model). And I was able to use
>> altest and protein evidence. Maker happily picks one gene model when
>> there is an overlap between three different predictions. But when I look
>> at the gff, it seems like it picks a gene model only when there is
>> est/protein evidence. It leaves out some genes even though they are
>> predicted by all three algorithms. Of course, keep_preds=1 helps to keep
>> all the models, but this kind of leads to overprediction.
>>
>> But I am looking for something in between, and would like to know if
>> that is possible:
>> 1) Pick a gene model if it has evidence (est/prot etc...),
>> irrespective of how many algorithms predicted it.
>> 2) In the absence of extrinsic evidence (est/prot etc), pick a gene model
>> if it is predicted by at least two algorithms.
>>
>> Or even simpler:
>> I have ab-initio predictions from three algorithms. Can I output those
>> genes that are supported by at least two of them? I care less about
>> exactness of gene boundaries.
>>
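The "supported by at least two algorithms" idea, ignoring exact boundaries, might be sketched as follows. This is a naive illustration and not something MAKER does: it assumes all predictions are gene spans on one contig and strand, given as hypothetical (source, start, end) tuples, and it merges any overlapping spans into a single locus.

```python
def loci_with_support(predictions, min_sources=2):
    """Merge overlapping gene spans and keep loci with enough distinct sources.

    predictions: iterable of (source, start, end), 1-based inclusive.
    Returns a list of (start, end, sorted source names).
    """
    loci = []
    for source, start, end in sorted(predictions, key=lambda p: (p[1], p[2])):
        if loci and start <= loci[-1][1]:
            # Overlaps the current locus: extend it and record the source.
            loci[-1][1] = max(loci[-1][1], end)
            loci[-1][2].add(source)
        else:
            loci.append([start, end, {source}])
    return [(s, e, sorted(src)) for s, e, src in loci if len(src) >= min_sources]


preds = [
    ("augustus", 100, 500),
    ("genemark", 120, 480),
    ("custom", 900, 1200),
    ("augustus", 2000, 2500),
    ("custom", 2400, 3000),
]
for locus in loci_with_support(preds):
    print(locus)
```

Because spans are chained by any overlap, two adjacent genes bridged by one long prediction would be merged into one locus, so this is only a first-pass filter.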
>> Thanks,
>> Gowthaman
>>
>> PS: With my recent attempts, I learned a couple of things about
>> maker/other associated tools that are not documented in the gmod-maker
>> wiki. Is it possible/ok if I add content to it? I am okay with running
>> it by you before making it public.
>>
>> <gff3_preds2models>
>>
>> Barry Moore
>> Research Scientist
>> Dept. of Human Genetics
>> University of Utah
>> Salt Lake City, UT 84112
>> --------------------------------------------
>> (801) 585-3543
>>
>>
>>
>>
>
>