[maker-devel] AED score

Thu Nov 29 18:39:31 MST 2012

Wow, 330,000 contigs is a lot. A large portion of genes are likely to be
partial at best. You should seriously consider using mRNAseq to capture
those: pass the results from Cufflinks or Trinity to MAKER through the
est_gff option. Also, I wouldn't even try to annotate contigs shorter than
10 kb; just have MAKER skip them by setting the min_contig option in the
maker_opts.ctl file.
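
Before picking a cutoff, it may help to see how much of the assembly a
10 kb min_contig value would actually leave for annotation. The following is
only a rough Python sketch, not part of MAKER: the function names are made up
for illustration, the input is assumed to be a plain FASTA file, and it
assumes contigs shorter than the cutoff are the ones that get skipped.

```python
def contig_lengths(fasta_handle):
    """Return the length of every sequence in a FASTA stream, in file order."""
    lengths = []
    current = None  # running length of the record being read
    for line in fasta_handle:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def summarize(lengths, min_contig=10_000):
    """Count contigs kept or skipped at a given length cutoff (hypothetical helper)."""
    kept = [n for n in lengths if n >= min_contig]
    total = sum(lengths)
    return {
        "contigs": len(lengths),
        "n50": n50(lengths),
        "kept": len(kept),
        "skipped": len(lengths) - len(kept),
        "kept_fraction_of_bases": sum(kept) / total if total else 0.0,
    }


# Example use (the file name is a placeholder):
#   with open("genome.fasta") as handle:
#       print(summarize(contig_lengths(handle)))
```

If most of the bases sit in contigs above the cutoff, skipping the short ones
costs little sequence while removing many of the fragments that inflate the
gene count.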

Thanks,
Carson

On 12-11-29 7:31 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:

>Thanks for the guidance, Carson. The total contig count is 330,611 with an
>N50 of 39.17 kb. I agree we have short ESTs. So is this the likely reason
>that no gene models are predicted when filtering at an AED score of 0.75,
>even though the model_gff file has a few genes with scores below 0.75?
>
>Thanks and regards,
>Parul Kudtarkar
>
>> There are certain characteristics that are apparent in this contig. First,
>> it seems to be repeat rich with a very low gene density. You also have very
>> short ESTs, and because of their lengths many of them are probably aligning
>> spuriously, which produces very short gene models that are more than likely
>> false positives or at the very least just a piece of a gene. I would turn
>> off est2genome as a predictor for this reason unless you can get longer EST
>> assemblies (i.e. from mRNAseq). Your protein alignments also seem to be few
>> and far between. You probably need to add more proteins from a couple of
>> related species, and you might consider using protein2genome rather than
>> est2genome as a predictor if you are still working to generate a training
>> set. Also, est2genome-produced models almost always have an AED score near
>> 0, so mixing est2genome with the AED_threshold with such limited protein
>> support does create an artificial bias toward getting back very short and
>> incomplete models.
>>
>> How many contigs do you have in total and what is the N50 value for the
>> assembly? If you have a large number of very short contigs, you will get
>> very inflated gene counts because you get genes split across contigs and
>> many contigs tend to be subtle rearrangements of other contigs just
>> assembled in a slightly different way (so you can get bits and pieces of
>> the same genes just rearranged). This scenario is another confounding
>> factor if using the est2genome predictor with short ESTs. I would
>> recommend running CEGMA to get an estimate of the genome completeness as
>> well as an estimate of fragmentation, as one of the statistics produced is
>> the percent of genes that are found complete (end to end) vs. those that
>> are partial. CEGMA identifies housekeeping genes that tend to be shorter
>> and less intron rich than other genes in the genome, so if CEGMA gives a
>> high partial percentage and a low complete percentage, then this pattern
>> can be expected to be even more exaggerated for other genes in the genome.
>>
>> If your genome is highly fragmented or proteins do not align well then
>> there are other strategies. For example, some vertebrate genomes end up
>> having extremely fragmented assemblies (on the order of 100,000 contigs),
>> and if they are distantly related to other annotated species, few proteins
>> may align to the contigs because the introns in the alignments tend to be
>> so long and the exons so short that it pushes down the significance scores
>> too much. In those cases heavy mRNAseq seems to be the best, if not the
>> only, way to get enough evidence to stitch gene models together.
>>
>> Thanks,
>> Carson
>>
>> On 12-11-28 4:40 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>
>>>Dear Carson and Daniel,
>>>Thanks. I ran a sample file to filter genes based on AED score. The input
>>>gff3 file was provided to the option model_pred (see attached file
>>>Scaffold1.gff), and the cutoff AED score was set to 0.75. There are at
>>>least 5 genes with an AED score less than 0.75. However, there were no
>>>genes predicted in the output file (see attached file Scaffold1_out). I
>>>have also attached the maker_opts.ctl. Could you please advise on this?
>>>Thanks and regards,
>>>Parul Kudtarkar
>>>> Use the AED_threshold option in the maker_opts.ctl file if you just want
>>>> to restrict final gene models to close matches directly within maker. On
>>>> the other hand, if you are trying to build a dataset for training gene
>>>> predictors, use the maker2zff script for generating a filtered dataset
>>>> for SNAP training. There are a number of filters available. Just call
>>>> the script once without parameters to see the options.
>>>> Thanks,
>>>> Carson
>>>> On 12-11-27 5:55 PM, "Daniel Ence" <dence at genetics.utah.edu> wrote:
>>>>>Hi Parul,
>>>>>I think the way you described (with the maker_opts.ctl file) is how you
>>>>>want to proceed. You still need to provide the genome as well.
>>>>>Daniel
>>>>>Daniel Ence
>>>>>Graduate Student
>>>>>Eccles Institute of Human Genetics
>>>>>University of Utah
>>>>>15 North 2030 East, Room 2100
>>>>>Salt Lake City, UT 84112-5330
>>>>>________________________________________
>>>>>From: maker-devel-bounces at yandell-lab.org
>>>>>[maker-devel-bounces at yandell-lab.org] on behalf of Parul Kudtarkar
>>>>>[parulk at caltech.edu]
>>>>>Sent: Tuesday, November 27, 2012 3:41 PM
>>>>>To: Parul Kudtarkar
>>>>>Cc: maker-devel at yandell-lab.org
>>>>>Subject: Re: [maker-devel] AED score
>>>>>Also, are there any other parameters that are required when filtering
>>>>>based on AED score?
>>>>>> Hello Carson,
>>>>>> Just to confirm, is there a script that would filter gene models at a
>>>>>> specific AED score?
>>>>>> Alternatively, if I were to do this within maker using the parameters
>>>>>> in the maker_opts.ctl file, would I have to provide my predicted genes
>>>>>> gff3 file to model_gff and set AED_threshold to the desired threshold?
>>>>>> Thanks and regards,
>>>>>> Parul Kudtarkar
>>>>>>> Models with an AED score of 1 are the ones you don't want. 0 is best
>>>>>>> and 1 is worst, as it is a distance metric. You can use the
>>>>>>> AED_threshold parameter to require better matching to the evidence by
>>>>>>> setting it closer to 0. You can also try to increase protein homology
>>>>>>> evidence, as some of your calls may be split genes due to lack of
>>>>>>> evidence linking them.
>>>>>>> --Carson
>>>>>>> On 12-11-26 4:35 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>>>>>>Dear Maker community,
>>>>>>>>For gene prediction I get a training data set from evidence-based
>>>>>>>>prediction. I use this data set to train SNAP as well as Augustus,
>>>>>>>>followed by bootstrapping. I would typically expect 20-30K genes;
>>>>>>>>however, I am getting 8 times the expected gene count, indicating too
>>>>>>>>many false positives. Is there a way to further refine these
>>>>>>>>predictions, or a script to retain predictions with AED score 1, and
>>>>>>>>if yes, how do I go about this?
>>>>>>>>Thanks and regards,
>>>>>>>>Parul Kudtarkar
>>>>>>>>--
>>>>>>>>Scientific Programmer
>>>>>>>>Center for Computational Regulatory Genomics
>>>>>>>>Beckman Institute,
>>>>>>>>California Institute of Technology
>>>>>>>>http://www.spbase.org
>>>>>>>>_______________________________________________
>>>>>>>>maker-devel mailing list
>>>>>>>>maker-devel at box290.bluehost.com
>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org