[maker-devel] AED score

Thu Nov 29 07:22:42 MST 2012

Several characteristics are apparent in this contig.  First, it seems to
be repeat rich with a very low gene density.  You also have very short
ESTs, and because of their length many of them are probably aligning
spuriously, which produces very short gene models that are more than
likely false positives or at the very least just a piece of a gene.  I
would turn off est2genome as a predictor for this reason unless you can
get longer EST assemblies (i.e. from mRNAseq).  Your protein alignments
also seem to be few and far between.  You probably need to add more
proteins from a couple of related species, and you might consider using
protein2genome rather than est2genome as a predictor if you are still
working to generate a training set.  Also, est2genome-produced models
almost always have an AED score near 0 (they are built directly from the
aligned EST, so they match it almost perfectly), so mixing est2genome with
the AED_threshold when protein support is this limited creates an
artificial bias toward getting back very short and incomplete models.
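
To make the point about est2genome and AED concrete, here is a minimal
Python sketch of the general AED definition, AED = 1 - (SN + SP) / 2,
computed at the nucleotide level.  It is only an illustration under
simplifying assumptions (one model against one set of evidence exons); the
function names and coordinates are made up for the example, and MAKER's own
calculation handles more cases than this.

```python
def covered_bases(intervals):
    """Return the set of 1-based positions covered by inclusive (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def aed(model_exons, evidence_exons):
    """Simplified annotation edit distance: 0 is a perfect match, 1 is no overlap.

    Assumption: sensitivity and specificity are measured as shared bases over
    evidence bases and over model bases respectively.
    """
    model = covered_bases(model_exons)
    evidence = covered_bases(evidence_exons)
    if not model or not evidence:
        return 1.0
    shared = len(model & evidence)
    sensitivity = shared / len(evidence)  # fraction of the evidence the model explains
    specificity = shared / len(model)     # fraction of the model backed by evidence
    return 1.0 - (sensitivity + specificity) / 2


# Hypothetical coordinates, for illustration only.
short_est = [(100, 180), (300, 350)]
est2genome_model = list(short_est)                 # model copied from the EST itself
longer_model = short_est + [(500, 699)]            # same start, plus an unsupported exon

print(aed(est2genome_model, short_est))
print(round(aed(longer_model, short_est), 3))
```

The model copied from the short EST scores 0 no matter how incomplete it
is, while the longer model is penalized for the exon that lacks evidence,
which is the bias described above.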

How many contigs do you have in total, and what is the N50 value for the
assembly?  If you have a large number of very short contigs, you will get
very inflated gene counts because genes get split across contigs, and
many contigs tend to be subtle rearrangements of other contigs, just
assembled in a slightly different way (so you can get bits and pieces of
the same genes just rearranged).  This scenario is another confounding
factor if using the est2genome predictor with short ESTs.  I would
recommend running CEGMA to get an estimate of genome completeness as well
as of fragmentation, since one of the statistics it produces is the
percentage of genes found complete (end to end) vs. those that are
partial.  CEGMA identifies housekeeping genes that tend to be shorter and
less intron rich than other genes in the genome, so if CEGMA gives a high
partial percentage and a low complete percentage, this pattern can be
expected to be even more exaggerated for the other genes in the genome.
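
If you do not have the N50 at hand, it can be computed from the contig
lengths alone.  The following is a small Python sketch, not part of MAKER
or CEGMA; the function names are my own and the lengths in the example are
toy numbers, not values from your assembly.

```python
def fasta_lengths(lines):
    """Return the sequence length of each record in an iterable of FASTA lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the length L such that contigs of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


# Toy contig lengths (hypothetical): total is 300, so half is 150.
toy = [100, 80, 50, 30, 20, 10, 10]
print(len(toy), "contigs, N50 =", n50(toy))
```

For a real assembly you could pass an open FASTA file handle to
fasta_lengths() and feed the result to n50().  A low N50 together with a
very large contig count is the situation in which inflated, fragmented gene
counts are to be expected.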

If your genome is highly fragmented or proteins do not align well, then
there are other strategies.  For example, some vertebrate genomes end up
having extremely fragmented assemblies (on the order of 100,000 contigs),
and if they are distantly related to other annotated species, few proteins
may align to the contigs because the introns in the alignments tend to be
so long and the exons so short that the significance scores are pushed
down too much.  In those cases heavy mRNAseq seems to be the best, if not
the only, way to get enough evidence to stitch gene models together.

Thanks,
Carson

On 12-11-28 4:40 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:

>Dear Carson and Daniel,
>
>Thanks. I ran a sample file for filtering genes based on AED score. The
>input gff3 file was provided to the option model_pred (see attached file
>Scaffold1.gff), and the cutoff AED score was set to 0.75. There are at
>least 5 genes with an AED score less than 0.75. However, there were no
>genes predicted in the output file (see attached file Scaffold1_out). I
>have also attached the maker_opts.ctl. Could you please advise on this?
>
>Thanks and regards,
>Parul Kudtarkar
>
>> Use the AED_threshold option in the maker_opts.ctl file if you just want
>> to restrict final gene models to close matches directly within maker. On
>> the other hand, if you are trying to build a dataset for training gene
>> predictors, use the maker2zff script for generating a filtered dataset
>> for SNAP training.  There are a number of filters available. Just call
>> the script once without parameters to see the options.
>>
>> Thanks,
>> Carson
>>
>>
>>
>>
>> On 12-11-27 5:55 PM, "Daniel Ence" <dence at genetics.utah.edu> wrote:
>>
>>>Hi Parul,
>>>
>>>I think the way you described (with the maker_opts.ctl file) is how you
>>>want to proceed. You still need to give the genome too.
>>>
>>>Daniel
>>>
>>>
>>>Daniel Ence
>>>Graduate Student
>>>Eccles Institute of Human Genetics
>>>University of Utah
>>>15 North 2030 East, Room 2100
>>>Salt Lake City, UT 84112-5330
>>>________________________________________
>>>From: maker-devel-bounces at yandell-lab.org
>>>[maker-devel-bounces at yandell-lab.org] on behalf of Parul Kudtarkar
>>>[parulk at caltech.edu]
>>>Sent: Tuesday, November 27, 2012 3:41 PM
>>>To: Parul Kudtarkar
>>>Cc: maker-devel at yandell-lab.org
>>>Subject: Re: [maker-devel] AED score
>>>
>>>Also, are there any other parameters that are required when filtering
>>>based on AED score?
>>>
>>>> Hello Carson,
>>>>
>>>> Just to confirm, is there a script that would filter gene models at a
>>>> specific AED score?
>>>> Alternatively, if I were to do this within maker via the parameters in
>>>> the maker_opts.ctl file, would I have to provide my predicted genes
>>>> gff3 file to model_gff and set AED_threshold to the desired threshold?
>>>>
>>>> Thanks and regards,
>>>> Parul Kudtarkar
>>>>
>>>>> Models with an AED score of 1 are the ones you don't want.  0 is best
>>>>> and 1 is worst, as it is a distance metric.  You can use the
>>>>> AED_threshold parameter to require better matching to the evidence by
>>>>> setting it closer to 0.  You can also try to increase protein homology
>>>>> evidence, as some of your calls may be split genes due to lack of
>>>>> evidence linking them.
>>>>>
>>>>> --Carson
>>>>>
>>>>>
>>>>> On 12-11-26 4:35 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>>>
>>>>>>Dear Maker community,
>>>>>>
>>>>>>For gene prediction I get a training data set from evidence-based
>>>>>>prediction, and I use this data set to train SNAP as well as Augustus,
>>>>>>followed by bootstrapping. I would typically expect 20-30K genes;
>>>>>>however, I am getting 8 times the expected gene count, indicating too
>>>>>>many false positives. Is there a way to further refine these
>>>>>>predictions, or a script to retain predictions with AED score 1, and
>>>>>>if yes, how do I go about this?
>>>>>>
>>>>>>Thanks and regards,
>>>>>>Parul Kudtarkar
>>>>>>
>>>>>>--
>>>>>>Scientific Programmer
>>>>>>Center for Computational Regulatory Genomics
>>>>>>Beckman Institute,
>>>>>>California Institute of Technology
>>>>>>http://www.spbase.org
>>>>>>
>>>>>>
>>>>>>_______________________________________________
>>>>>>maker-devel mailing list
>>>>>>maker-devel at box290.bluehost.com
>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Scientific Programmer
>>>> Center for Computational Regulatory Genomics
>>>> Beckman Institute,
>>>> California Institute of Technology
>>>> http://www.spbase.org
>>>>
>>>
>>>
>>>--
>>>Scientific Programmer
>>>Center for Computational Regulatory Genomics
>>>Beckman Institute,
>>>California Institute of Technology
>>>http://www.spbase.org
>>>
>>>
>>>_______________________________________________
>>>maker-devel mailing list
>>>maker-devel at box290.bluehost.com
>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>
>>
>>
>>
>>
>
>
>--
>Scientific Programmer
>Center for Computational Regulatory Genomics
>Beckman Institute,
>California Institute of Technology
>http://www.spbase.org