[maker-devel] A way to compare 2 annotation runs?

Mon Apr 25 13:46:23 MDT 2016

Hi Florian,

SomethinmRNA like this should work:

SOBAcl -data +_AED t/data/refseq_short.gff3 --data_type mean

Sorry this feature was undocumented and I discovered a bug in it while I was looking at it just now, so you’ll need to pull an update from git for it to work correctly.  Basically if you add a ‘+’ to the valued passed to —data SOBAcl will treat the —data value (with the + removed) as a key to look up the value in the attributes from column 9, so if +_AED is given on the command line then the value of the _AED attribute will be used for the summary statistics.  Note if the attribute is missing for a given feature then 0 is used as the value (which is of course different than treating it as NULL).  Also note if —data_type is count then feature that have the given attribute are counted regardless of the value of the attribute.

Just FYI, grabbing those values with a GAL script would look like this (untested):

use GAL::Annotation;
my $annot = GAL::Annotation->new(qw(file.gff);
my $features = $annot->features;
my $mRNAs = $features->search( {type => ‘mRNA'} );
while (my $mRNA = $mRNAs->next) {
    print $mRNA->feature_id;
    print “\t";
    print $mRNA->attribute_value(‘_AED’);
    print “\n”;
    }
}

B

On Apr 25, 2016, at 5:55 AM, Florian <fdolze at students.uni-mainz.de<mailto:fdolze at students.uni-mainz.de>> wrote:

Hello All,

First off, thank you all for your input! I took a look at all your suggestions and have some questions:

The SOBAcl tool is nice but I cant seem to find a way to get to the AED values MAKER produces. For example here is a line from my GFF file:

scaffold2278_size3634    maker    mRNA    124    2128    .    -    .    ID=CRIP_012390-RA;Parent=CRIP_012390;Name=CRIP_012390-RA;Alias=maker-scaffold2278_size3634-augustus-gene-0.3-mRNA-1;_AED=0.16;_QI=0|1|0.33|1|1|1|3|0|574;_eAED=0.16;Note=Similar to Tbc1d25: TBC1 domain family member 25 (Mus musculus);
Notice the _AED entry is in the 9th "field" combined with all the other descriptive data. Is there a way to get to this? The information about number and mean/distribution of length of genes, while certainly valuable, is hard to interpret for me. How would one classify improvement? More genes annotated?  Less genes but longer averages?

For the moment I will take a look at GAL, though perl is not my strongest language.

For the scripts Michael provided I have attached the results. It would be great if you could send me a pdf version of the paper you mentioned.

The comparison script lists SN/SP/AC with >98% which indicates there should be no big changes between annotations right? But the cumulative AED graph shows a LOT entries have an AED value of 1 which would indicate the exact opposite?

You said 95% with less than 0.5 AED would be pretty good, soo only ~55% would mean this is a pretty bad annotation?

I am not sure if this is maybe to far off topic for the maker mailing list, but thank you for any clarification / input.

kind regards,
Florian

On 20.04.2016 15:16, Campbell, Michael wrote:

I suspect the Jaccard distance would let you see the annotation sets converging over iterations. The distance between run one and run three should be greater than the distance between run one and two or run two and three.

MAKER calculates a modified Jaccard distance between the MAKER generated gene models and the aligned evidence called Annotation Edit Distance or AED. Comparing the distribution of AEDs between annotations is a way to tell which annotation set matches the evidence the best. As a rule of thumb an annotation set is pretty good if greater than ~95% of the annotations have an AED less than 0.5.

There is an accessory script in the MAKER bin called AED_cdf_generator.pl that helps in comparing AED scores. This script is mentioned in the protocols paper Carson mentioned. This paper also describes using protein family domains and homology to manually curated proteins in swissprot as quality metrics.  Here is a link to the paper. Let me know if you need me to send you a pdf.
http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract

I also have a "use at your own risk" script on github that I use to compare MAKER runs two at a time. the script is called compare_annotations_3.2.pl. This particular script has had a long evolution, so it is a little hard to follow the code, but it might be helpful.
https://github.com/mscampbell/Genome_annotation

The SOBA tool that Barry mentioned is a lot more flexible and if you are familiar with perl the GAL library does a lot of heavy lifting for you.

Mike
On Apr 19, 2016, at 5:44 PM, Cook, Malcolm <MEC at stowers.org<mailto:MEC at stowers.org><mailto:MEC at stowers.org><mailto:MEC at stowers.org>> wrote:

Just a quick thought

The smallest summary of what you’re after might be the jaccard difference between you annotation as computed by bedtoolshttp://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html

??

From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf Of Barry Moore
Sent: Tuesday, April 19, 2016 4:37 PM
To: Florian <fdolze at students.uni-mainz.de<mailto:fdolze at students.uni-mainz.de><mailto:fdolze at students.uni-mainz.de><mailto:fdolze at students.uni-mainz.de>>; maker-devel <maker-devel at yandell-lab.org<mailto:maker-devel at yandell-lab.org><mailto:maker-devel at yandell-lab.org><mailto:maker-devel at yandell-lab.org>>
Cc: Campbell, Michael <mcampbel at cshl.edu<mailto:mcampbel at cshl.edu><mailto:mcampbel at cshl.edu><mailto:mcampbel at cshl.edu>>
Subject: Re: [maker-devel] A way to compare 2 annotation runs?

The Sequence Ontology provides some tools for this:

SOBAcl has some pre-configured reports/graphs with some flexibility to modify their content/layout.
https://github.com/The-Sequence-Ontology/SOBA

This simple example provides a table for two GFF3 files of the count of feature types:

SOBAcl --columns file --rows type --data type --data_type count   \

  data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff

More complex examples are available in the test file SOBA/t/sobacl_test.sh

The GAL library is a perl library that works well with MAKER output and other valid GFF3 documents.  I has some scripts that would provide metrics along the lines of what you’re looking for, but is primarily a programing library to make it easy to roll your own
https://github.com/The-Sequence-Ontology/GAL<https://github.com/The-Sequence-Ontology/SOBA><https://github.com/The-Sequence-Ontology/SOBA>

If you’re OK with a little bit of perl code, modifying the synopsis code in the README a bit you can generate the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712) are easy to produce:

use GAL::Annotation;

my $annot = GAL::Annotation->new(qw(file.gff file.fasta);

my $features = $annot->features;

my $genes = $features->search( {type => ‘gene'} );

while (my $gene = $genes->next) {

    print $gene->feature_id        . “\t";

    print $gene->splice_complexity . “\n”;

    }

}

Hope that helps,

Barry

On Apr 19, 2016, at 9:08 AM, Carson Holt <carsonhh at gmail.com<mailto:carsonhh at gmail.com><mailto:carsonhh at gmail.com><mailto:carsonhh at gmail.com>> wrote:

I’m going to ask Michael Campbell to answer this. He wrote a protocols paper that will help.

—Carson

On Apr 19, 2016, at 6:08 AM, Florian <fdolze at students.uni-mainz.de<mailto:fdolze at students.uni-mainz.de><mailto:fdolze at students.uni-mainz.de><mailto:fdolze at students.uni-mainz.de>> wrote:

Hello All,

We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed but I cant really come up with a clever way to this.

I reckon this has to be an often solved problem though I couldnt find a solution except an older entry in this mail-list but that wasnt helpful.

So how are people assessing quality of a maker run? How do you say one run was 'better' than another?

best regards & thanks for your input,
Florian

_______________________________________________
maker-devel mailing list
maker-devel at yandell-lab.org<mailto:maker-devel at yandell-lab.org><mailto:maker-devel at yandell-lab.org><mailto:maker-devel at yandell-lab.org>
http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

_______________________________________________
maker-devel mailing list
maker-devel at yandell-lab.org<mailto:maker-devel at yandell-lab.org><mailto:maker-devel at yandell-lab.org><mailto:maker-devel at yandell-lab.org>
http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

_______________________________________________
maker-devel mailing list
maker-devel at yandell-lab.org<mailto:maker-devel at yandell-lab.org><mailto:maker-devel at yandell-lab.org><mailto:maker-devel at yandell-lab.org>
http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org

<AED_cummulative.png><comparison.txt>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20160425/fdb5915b/attachment-0003.html>