[maker-devel] A way to compare 2 annotation runs?

Wed Apr 20 07:16:43 MDT 2016

I suspect the Jaccard distance would let you see the annotation sets converging over iterations. The distance between run one and run three should be greater than the distance between run one and two or run two and three.
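
A minimal sketch of that convergence check, assuming each run has been reduced to a set of exon coordinates. The tuples below are made-up toy data, not MAKER output, and the measure is the plain Jaccard distance rather than anything MAKER-specific:

```python
from itertools import combinations

def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two sets; 0.0 if both are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

# Hypothetical exon sets (seqid, start, end) for three MAKER iterations.
runs = {
    "run1": {("chr1", 100, 200), ("chr1", 300, 400), ("chr1", 500, 600)},
    "run2": {("chr1", 100, 200), ("chr1", 300, 400), ("chr1", 520, 600)},
    "run3": {("chr1", 100, 200), ("chr1", 310, 400), ("chr1", 520, 600)},
}

# Distance for every pair of runs.
distances = {
    (x, y): jaccard_distance(runs[x], runs[y])
    for x, y in combinations(sorted(runs), 2)
}
for (x, y), d in distances.items():
    print(f"{x} vs {y}: {d:.2f}")
```

If the runs are converging, the run1 vs run3 distance should be the largest of the three, as it is in this toy data.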

MAKER calculates a modified Jaccard distance between the MAKER-generated gene models and the aligned evidence, called Annotation Edit Distance (AED). Comparing the distributions of AEDs between annotation sets is a way to tell which set matches the evidence best. As a rule of thumb, an annotation set is pretty good if more than ~95% of the annotations have an AED below 0.5.
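
For reference, AED is usually defined as 1 minus the average of the sensitivity and specificity of the overlap between an annotation and its evidence. A rough base-level sketch, assuming both are given as sets of base positions. The function names and toy coordinates are hypothetical, and this is not MAKER's own implementation:

```python
def aed(annotation_bases, evidence_bases):
    """Annotation edit distance: 1 - (sensitivity + specificity) / 2."""
    if not annotation_bases or not evidence_bases:
        return 1.0  # no annotation or no supporting evidence
    overlap = len(annotation_bases & evidence_bases)
    sensitivity = overlap / len(evidence_bases)    # evidence covered by the annotation
    specificity = overlap / len(annotation_bases)  # annotation supported by evidence
    return 1.0 - (sensitivity + specificity) / 2.0

def fraction_below(aeds, cutoff=0.5):
    """Fraction of AED scores strictly below the cutoff."""
    return sum(1 for a in aeds if a < cutoff) / len(aeds)

# Toy example: 100 annotated bases, 100 evidence bases, 50 shared.
annotation = set(range(100, 200))
evidence = set(range(150, 250))
score = aed(annotation, evidence)
share = fraction_below([0.0, 0.1, 0.3, 0.6])
print(score, share)
```

An AED of 0 means the annotation matches the evidence exactly, and 1 means there is no overlap at all.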

There is an accessory script in the MAKER bin directory called AED_cdf_generator.pl that helps in comparing AED scores. This script is described in the protocols paper Carson mentioned. The paper also describes using protein family domains and homology to manually curated proteins in Swiss-Prot as quality metrics. Here is a link to the paper; let me know if you need me to send you a PDF.
http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract
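
The idea behind a cumulative AED comparison can be sketched in a few lines. This is not AED_cdf_generator.pl, just an illustration that assumes the mRNA features in MAKER's GFF3 output carry an _AED=... attribute in column 9. The inline GFF3 lines are invented sample data:

```python
import re

# Match _AED=... but not _eAED=... in the GFF3 attribute column.
AED_RE = re.compile(r"(?:^|;)_AED=([0-9.]+)")

def read_aeds(gff_lines):
    """Collect the _AED value of every mRNA feature in GFF3 lines."""
    aeds = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        match = AED_RE.search(cols[8])
        if match:
            aeds.append(float(match.group(1)))
    return aeds

def aed_cdf(aeds, step=0.25):
    """Return (cutoff, fraction of AEDs <= cutoff) pairs from 0 to 1."""
    n_bins = round(1 / step)
    cutoffs = [i * step for i in range(n_bins + 1)]
    return [(c, sum(a <= c for a in aeds) / len(aeds)) for c in cutoffs]

sample = [
    "##gff-version 3",
    "chr1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1",
    "chr1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=g1-RA;Parent=g1;_AED=0.10;_eAED=0.12",
    "chr1\tmaker\tmRNA\t2000\t2900\t.\t+\t.\tID=g2-RA;Parent=g2;_AED=0.40;_eAED=0.40",
    "chr1\tmaker\tmRNA\t4000\t4900\t.\t-\t.\tID=g3-RA;Parent=g3;_AED=0.75;_eAED=0.80",
]
aeds = read_aeds(sample)
cdf = aed_cdf(aeds)
for cutoff, frac in cdf:
    print(f"{cutoff:.2f}\t{frac:.3f}")
```

Running the same functions on the GFF3 from each iteration and comparing the curves shows which run has more of its annotations at low AED.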

I also have a "use at your own risk" script on GitHub that I use to compare MAKER runs two at a time. The script is called compare_annotations_3.2.pl. It has had a long evolution, so the code is a little hard to follow, but it might be helpful.
https://github.com/mscampbell/Genome_annotation

The SOBA tool that Barry mentioned is a lot more flexible, and if you are familiar with Perl, the GAL library does a lot of the heavy lifting for you.

Mike
On Apr 19, 2016, at 5:44 PM, Cook, Malcolm <MEC at stowers.org> wrote:

Just a quick thought

The smallest summary of what you’re after might be the Jaccard statistic between your annotations, as computed by bedtools: http://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html

??
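
For illustration, the statistic in question is the number of base pairs in the intersection of two interval sets divided by the number of base pairs in their union. A small sketch of that calculation, not a replacement for bedtools. It assumes half-open (start, end) intervals on a single sequence, and the coordinates are invented:

```python
def merge(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def total_length(intervals):
    """Total number of bases covered by non-overlapping intervals."""
    return sum(end - start for start, end in intervals)

def intersection_length(a, b):
    """Base pairs shared by two merged, sorted interval lists."""
    i = j = shared = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            shared += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared

def jaccard(a, b):
    """Intersection length over union length; 0.0 if both are empty."""
    a, b = merge(a), merge(b)
    shared = intersection_length(a, b)
    union = total_length(a) + total_length(b) - shared
    return shared / union if union else 0.0

run_a = [(100, 200), (300, 400)]
run_b = [(150, 200), (300, 450)]
stat = jaccard(run_a, run_b)
print(f"jaccard={stat:.3f} distance={1 - stat:.3f}")
```

A value of 1 means the two runs cover exactly the same bases, and 0 means they share none.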

From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf Of Barry Moore
Sent: Tuesday, April 19, 2016 4:37 PM
To: Florian <fdolze at students.uni-mainz.de>; maker-devel <maker-devel at yandell-lab.org>
Cc: Campbell, Michael <mcampbel at cshl.edu>
Subject: Re: [maker-devel] A way to compare 2 annotation runs?

The Sequence Ontology provides some tools for this:

SOBAcl has some pre-configured reports/graphs with some flexibility to modify their content/layout.
https://github.com/The-Sequence-Ontology/SOBA

This simple example provides a table for two GFF3 files of the count of feature types:

SOBAcl --columns file --rows type --data type --data_type count \
  data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff

More complex examples are available in the test file SOBA/t/sobacl_test.sh
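
The kind of table that command produces can be illustrated without SOBA. The sketch below is not SOBAcl; it simply counts column 3 (the feature type) per input. The file names and the in-memory GFF3 lines are invented stand-ins for the two files:

```python
from collections import Counter

def count_types(gff_lines):
    """Count GFF3 feature types (column 3), skipping comments and blanks."""
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) >= 3:
            counts[cols[2]] += 1
    return counts

# Hypothetical inputs standing in for two annotation runs.
files = {
    "run1.gff": [
        "##gff-version 3",
        "chr1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1",
        "chr1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=g1-RA;Parent=g1",
        "chr1\tmaker\texon\t100\t400\t.\t+\t.\tParent=g1-RA",
    ],
    "run2.gff": [
        "##gff-version 3",
        "chr1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1",
        "chr1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=g1-RA;Parent=g1",
        "chr1\tmaker\texon\t100\t400\t.\t+\t.\tParent=g1-RA",
        "chr1\tmaker\texon\t600\t900\t.\t+\t.\tParent=g1-RA",
    ],
}

tables = {name: count_types(lines) for name, lines in files.items()}
types = sorted(set().union(*tables.values()))

# One row per feature type, one column per file.
print("type\t" + "\t".join(files))
for t in types:
    print(t + "\t" + "\t".join(str(tables[name][t]) for name in files))
```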

The GAL library is a Perl library that works well with MAKER output and other valid GFF3 documents. It has some scripts that would provide metrics along the lines of what you’re looking for, but it is primarily a programming library that makes it easy to roll your own.
https://github.com/The-Sequence-Ontology/GAL

If you’re OK with a little bit of Perl code, you can modify the synopsis code in the README a bit to produce the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712):

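use strict;
use warnings;
use GAL::Annotation;

# Load the GFF3 annotation together with its genome FASTA
my $annot = GAL::Annotation->new(qw(file.gff file.fasta));
my $features = $annot->features;

# Select all gene features
my $genes = $features->search( {type => 'gene'} );

# Print one tab-separated line per gene: feature ID and splice complexity
while (my $gene = $genes->next) {
    print $gene->feature_id        . "\t";
    print $gene->splice_complexity . "\n";
}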

Hope that helps,

Barry

On Apr 19, 2016, at 9:08 AM, Carson Holt <carsonhh at gmail.com> wrote:

I’m going to ask Michael Campbell to answer this. He wrote a protocols paper that will help.

—Carson

On Apr 19, 2016, at 6:08 AM, Florian <fdolze at students.uni-mainz.de> wrote:

Hello All,

We ran MAKER on a newly assembled genome for 3 iterations: 2 seems to be the recommended standard, and while I was on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved or changed, but I can't really come up with a clever way to do this.

I reckon this must be an often-solved problem, but I couldn't find a solution apart from an older entry on this mailing list, and that wasn't helpful.

So how are people assessing the quality of a MAKER run? How do you say one run was 'better' than another?

best regards & thanks for your input,
Florian

_______________________________________________
maker-devel mailing list
maker-devel at yandell-lab.org<mailto:maker-devel at yandell-lab.org>
http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
