[maker-devel] A way to compare 2 annotation runs?

Mon Apr 25 11:43:50 MDT 2016

I updated the AED_cdf_generator.pl script on github so it only looks at mRNA lines. The only time that it would get AEDs from the gene predictions is if pred_stats was set to 1. Was pred_stats=1 set in the maker_opts.ctl file?

Thanks,
Mike
> On Apr 25, 2016, at 1:29 PM, Campbell, Michael <mcampbel at cshl.edu> wrote:
> 
> Hi Florian,
> 
> I just looked at the code for the AED_cdf_generator.plscript and it is probably grabbing the AEDs off of the raw gene predictions. I’ll modify the code so it only grabs the mRNA lines. In the mean time you can use  the gff3_merge withe the -g flag and it will output the MAKER genes only. The command would look like this gff3_merge -g all.gff -o genes_only.gff
> 
> Mike 
>> On Apr 25, 2016, at 1:00 PM, Dolze, Florian <fdolze at students.uni-mainz.de> wrote:
>> 
>> This might be the case, I simply used the script on my complete output gff with all features in it without manually filtering for only mRNA. 
>> 
>> On s side note regarding keep_preds, if I wanted to call genes somewhat less stringent because I am expecting to find more, would I set this to e.g. 0.5 to increase the number called of genes?
>> 
>> -Florian
>> 
>>> Am 25.04.2016 um 17:30 schrieb Carson Holt <carsonhh at gmail.com>:
>>> 
>>> If you’re running with keep_preds=0, then you either passed in models with model_gff (always kept even without evidence support), or you are parsing out the AED of non-gene reference models from the GFF3 when building your CDF graph. If that is the case, make sure you only pull AED off of features labeled as mRNA in column 2 of the GFF3 and not match features.
>>> 
>>> —Carson
>>> 
>>> 
>>>> On Apr 25, 2016, at 9:05 AM, Florian <fdolze at students.uni-mainz.de> wrote:
>>>> 
>>>> 
>>>> Hi Mike,
>>>> 
>>>> We have run MAKER with keep_preds=0. For completeness I attached the options file we used. We used a SNAP model trained on CEGMA data, GeneMark and Augustus trained with their webservice for the first run and then iterated on the results.
>>>> 
>>>> We expect around 17.000-18.000 genes, but our annotation contains ~12.5k according to SOBAcl. If I remove ~40% with AED values of 1 I will be left with very few compared to the expected number.
>>>> 
>>>> 
>>>> 
>>>>                                     type X file type (count)
>>>> =========================================================================================================
>>>> | |../v2_second_round_functional_blast.gff|../v2_third_round_functional_blast.gff|
>>>> =========================================================================================================
>>>> |CDS                     |  63953 |  65160                               |
>>>> +------------------------+---------------------------------------+--------------------------------------+
>>>> |contig                  |   5292 |   5292                               |
>>>> +------------------------+---------------------------------------+--------------------------------------+
>>>> |exon                    |  60381 |  61233                               |
>>>> +------------------------+---------------------------------------+--------------------------------------+
>>>> |expressed_sequence_match| 275160                                | 275160                               |
>>>> +------------------------+---------------------------------------+--------------------------------------+
>>>> |five_prime_UTR          |   9424 |   8764                               |
>>>> +------------------------+---------------------------------------+--------------------------------------+
>>>> |gene                    |  12654 |  12235                               |
>>>> +------------------------+---------------------------------------+--------------------------------------+
>>>> |mRNA                    |  13698 |  13137                               |
>>>> +------------------------+---------------------------------------+--------------------------------------+
>>>> |match                   | 146111                                | 136852                               |
>>>> +------------------------+---------------------------------------+--------------------------------------+
>>>> |match_part              |1704978 |1697601                               |
>>>> +------------------------+---------------------------------------+--------------------------------------+
>>>> |protein_match           | 421814                                | 421814                               |
>>>> +------------------------+---------------------------------------+--------------------------------------+
>>>> |three_prime_UTR         |   6894 |   6325                               |
>>>> ---------------------------------------------------------------------------------------------------------
>>>> 
>>>> 
>>>> regards,
>>>> Florian
>>>> 
>>>> 
>>>>> On 25.04.2016 16:16, Campbell, Michael wrote:
>>>>> Hi Florian,
>>>>> 
>>>>> Your not off topic here. I’ve attached the paper.
>>>>> 
>>>>> Looking at the plot you sent I’m guessing that there is red dot right underneath the turquoise do t at at (1,1), that would be consistent with the compare annotation script output. Do you have keep_preds=1 set in the maker_opts.ctl file? If so that would explain the abundance of AED=1 annotations. When keep_preds is set to 1 all of the gene predictions are reported as gene models, when keep_preds is set to 0 only the models with evidence support are reported. Also, how many genes are you expecting and how many are you getting?
>>>>> 
>>>>> The paper I attached goes over different approaches to building final gene sets. The plot attached suggests to me that you have a bunch of unsupported gene models that need to be cleaned out. I will commonly filter out any gene model with an AED of 1 unless it has a protein family domain. This will almost certainly bring the fraction of annotated gene models with an AED <0.5 up to around 90% or more.
>>>>> 
>>>>> As annotations improve you do usually see fewer total genes but they are longer.
>>>>> 
>>>>> One of the best ways to get a feel for annotation quality is to load the annotations in to a browser like apollo or jbrowse and look at a few of your favorite genes
>>>>> 
>>>>> Thanks,
>>>>> Mike
>>>>> 
>>>>> 
>>>>> On Apr 25, 2016, at 7:55 AM, Florian <fdolze at students.uni-mainz.de<mailto:fdolze at students.uni-mainz.de>> wrote:
>>>>> 
>>>>> Hello All,
>>>>> 
>>>>> First off, thank you all for your input! I took a look at all your suggestions and have some questions:
>>>>> 
>>>>> The SOBAcl tool is nice but I cant seem to find a way to get to the AED values MAKER produces. For example here is a line from my GFF file:
>>>>> 
>>>>> scaffold2278_size3634    maker    mRNA    124    2128    .    -    .    ID=CRIP_012390-RA;Parent=CRIP_012390;Name=CRIP_012390-RA;Alias=maker-scaffold2278_size3634-augustus-gene-0.3-mRNA-1;_AED=0.16;_QI=0|1|0.33|1|1|1|3|0|574;_eAED=0.16;Note=Similar to Tbc1d25: TBC1 domain family member 25 (Mus musculus);
>>>>> Notice the _AED entry is in the 9th "field" combined with all the other descriptive data. Is there a way to get to this? The information about number and mean/distribution of length of genes, while certainly valuable, is hard to interpret for me. How would one classify improvement? More genes annotated?  Less genes but longer averages?
>>>>> 
>>>>> For the moment I will take a look at GAL, though perl is not my strongest language.
>>>>> 
>>>>> 
>>>>> For the scripts Michael provided I have attached the results. It would be great if you could send me a pdf version of the paper you mentioned.
>>>>> 
>>>>> The comparison script lists SN/SP/AC with >98% which indicates there should be no big changes between annotations right? But the cumulative AED graph shows a LOT entries have an AED value of 1 which would indicate the exact opposite?
>>>>> 
>>>>> You said 95% with less than 0.5 AED would be pretty good, soo only ~55% would mean this is a pretty bad annotation?
>>>>> 
>>>>> I am not sure if this is maybe to far off topic for the maker mailing list, but thank you for any clarification / input.
>>>>> 
>>>>> 
>>>>> 
>>>>> kind regards,
>>>>> Florian
>>>>> 
>>>>> On 20.04.2016 15:16, Campbell, Michael wrote:
>>>>> 
>>>>> I suspect the Jaccard distance would let you see the annotation sets converging over iterations. The distance between run one and run three should be greater than the distance between run one and two or run two and three.
>>>>> 
>>>>> MAKER calculates a modified Jaccard distance between the MAKER generated gene models and the aligned evidence called Annotation Edit Distance or AED. Comparing the distribution of AEDs between annotations is a way to tell which annotation set matches the evidence the best. As a rule of thumb an annotation set is pretty good if greater than ~95% of the annotations have an AED less than 0.5.
>>>>> 
>>>>> There is an accessory script in the MAKER bin called AED_cdf_generator.pl that helps in comparing AED scores. This script is mentioned in the protocols paper Carson mentioned. This paper also describes using protein family domains and homology to manually curated proteins in swissprot as quality metrics.  Here is a link to the paper. Let me know if you need me to send you a pdf.
>>>>> http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0411s48/abstract
>>>>> 
>>>>> I also have a "use at your own risk" script on github that I use to compare MAKER runs two at a time. the script is called compare_annotations_3.2.pl. This particular script has had a long evolution, so it is a little hard to follow the code, but it might be helpful.
>>>>> https://github.com/mscampbell/Genome_annotation
>>>>> 
>>>>> The SOBA tool that Barry mentioned is a lot more flexible and if you are familiar with perl the GAL library does a lot of heavy lifting for you.
>>>>> 
>>>>> Mike
>>>>> On Apr 19, 2016, at 5:44 PM, Cook, Malcolm <MEC at stowers.org<mailto:MEC at stowers.org><mailto:MEC at stowers.org><mailto:MEC at stowers.org>> wrote:
>>>>> 
>>>>> Just a quick thought
>>>>> 
>>>>> The smallest summary of what you’re after might be the jaccard difference between you annotation as computed by bedtoolshttp://bedtools.readthedocs.org/en/latest/content/tools/jaccard.html
>>>>> 
>>>>> ??
>>>>> 
>>>>> 
>>>>> From: maker-devel [mailto:maker-devel-bounces at yandell-lab.org] On Behalf Of Barry Moore
>>>>> Sent: Tuesday, April 19, 2016 4:37 PM
>>>>> To: Florian <fdolze at students.uni-mainz.de<mailto:fdolze at students.uni-mainz.de><mailto:fdolze at students.uni-mainz.de><mailto:fdolze at students.uni-mainz.de>>; maker-devel <maker-devel at yandell-lab.org<mailto:maker-devel at yandell-lab.org><mailto:maker-devel at yandell-lab.org><mailto:maker-devel at yandell-lab.org>>
>>>>> Cc: Campbell, Michael <mcampbel at cshl.edu<mailto:mcampbel at cshl.edu><mailto:mcampbel at cshl.edu><mailto:mcampbel at cshl.edu>>
>>>>> Subject: Re: [maker-devel] A way to compare 2 annotation runs?
>>>>> 
>>>>> The Sequence Ontology provides some tools for this:
>>>>> 
>>>>> SOBAcl has some pre-configured reports/graphs with some flexibility to modify their content/layout.
>>>>> https://github.com/The-Sequence-Ontology/SOBA
>>>>> 
>>>>> This simple example provides a table for two GFF3 files of the count of feature types:
>>>>> 
>>>>> 
>>>>> SOBAcl --columns file --rows type --data type --data_type count   \
>>>>> 
>>>>> data/dmel-all-r5.30_0001000.gff data/dmel-all-r5.30_0010000.gff
>>>>> 
>>>>> More complex examples are available in the test file SOBA/t/sobacl_test.sh
>>>>> 
>>>>> 
>>>>> The GAL library is a perl library that works well with MAKER output and other valid GFF3 documents.  I has some scripts that would provide metrics along the lines of what you’re looking for, but is primarily a programing library to make it easy to roll your own
>>>>> https://github.com/The-Sequence-Ontology/GAL<https://github.com/The-Sequence-Ontology/SOBA><https://github.com/The-Sequence-Ontology/SOBA>
>>>>> 
>>>>> If you’re OK with a little bit of perl code, modifying the synopsis code in the README a bit you can generate the splice complexity metrics described here (http://www.ncbi.nlm.nih.gov/pubmed/19236712) are easy to produce:
>>>>> 
>>>>> use GAL::Annotation;
>>>>> 
>>>>> my $annot = GAL::Annotation->new(qw(file.gff file.fasta);
>>>>> 
>>>>> my $features = $annot->features;
>>>>> 
>>>>> 
>>>>> 
>>>>> my $genes = $features->search( {type => ‘gene'} );
>>>>> 
>>>>> while (my $gene = $genes->next) {
>>>>> 
>>>>>  print $gene->feature_id        . “\t";
>>>>> 
>>>>>  print $gene->splice_complexity . “\n”;
>>>>> 
>>>>>  }
>>>>> 
>>>>> }
>>>>> 
>>>>> 
>>>>> Hope that helps,
>>>>> 
>>>>> Barry
>>>>> 
>>>>> 
>>>>> 
>>>>> On Apr 19, 2016, at 9:08 AM, Carson Holt <carsonhh at gmail.com<mailto:carsonhh at gmail.com><mailto:carsonhh at gmail.com><mailto:carsonhh at gmail.com>> wrote:
>>>>> 
>>>>> I’m going to ask Michael Campbell to answer this. He wrote a protocols paper that will help.
>>>>> 
>>>>> —Carson
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Apr 19, 2016, at 6:08 AM, Florian <fdolze at students.uni-mainz.de<mailto:fdolze at students.uni-mainz.de><mailto:fdolze at students.uni-mainz.de><mailto:fdolze at students.uni-mainz.de>> wrote:
>>>>> 
>>>>> 
>>>>> Hello All,
>>>>> 
>>>>> We ran MAKER on a newly assembled genome for 3 iterations, since 2 seems to be the recommended standard and while on holiday I just ran it a third time. Now I want to compare the results of the iterations to see where the annotation (hopefully) improved/changed but I cant really come up with a clever way to this.
>>>>> 
>>>>> I reckon this has to be an often solved problem though I couldnt find a solution except an older entry in this mail-list but that wasnt helpful.
>>>>> 
>>>>> 
>>>>> So how are people assessing quality of a maker run? How do you say one run was 'better' than another?
>>>>> 
>>>>> 
>>>>> best regards & thanks for your input,
>>>>> Florian
>>>>> 
>>>>> _______________________________________________
>>>>> maker-devel mailing list
>>>>> maker-devel at yandell-lab.org<mailto:maker-devel at yandell-lab.org><mailto:maker-devel at yandell-lab.org><mailto:maker-devel at yandell-lab.org>
>>>>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> maker-devel mailing list
>>>>> maker-devel at yandell-lab.org<mailto:maker-devel at yandell-lab.org><mailto:maker-devel at yandell-lab.org><mailto:maker-devel at yandell-lab.org>
>>>>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
>>>>> 
>>>>> _______________________________________________
>>>>> maker-devel mailing list
>>>>> maker-devel at yandell-lab.org<mailto:maker-devel at yandell-lab.org><mailto:maker-devel at yandell-lab.org><mailto:maker-devel at yandell-lab.org>
>>>>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
>>>>> 
>>>>> 
>>>>> 
>>>>> <AED_cummulative.png><comparison.txt>
>>>> 
>>>> <maker_opts_run2.log>_______________________________________________
>>>> maker-devel mailing list
>>>> maker-devel at yandell-lab.org
>>>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
>>> 
>