[maker-devel] AED score

Parul Kudtarkar parulk at caltech.edu
Thu Nov 29 17:31:40 MST 2012


Thanks for the guidance Carson, total contig size is 330,611 with N50 of
39.17kb. I agree we have short ESTs. So this is the possible reason when
filtering based on AED score 0.75 there are no gene models predicted
despite the model_gff file has few genes with scores less than 0.75?

Thanks and regards,
Parul Kudtarkar

> There are certain characteristics that are apparent in this contig.
First
> it seems to be repeat rich with a very low gene density.  You also have
very short ESTs, and because of the lengths you are probably getting
many
> of them to align spuriously which produces very short gene models that
are
> more than likely false positives or at the very least just a piece of a
gene.  I would turn off est2genome as a predictor for this reason unless
you can get longer EST assemblies (i.e. From mRNAseq).    Your protein
alignments also seem to be few and far between.  You probably need to
add
> more proteins from a couple of related species, and you might consider
using protein2genome rather than est2genome as a predictor if you are
still working to generate a training set. Also est2genome produced
models
> almost always have an AED score near 0 so mixing est2genome with the
AED_threshold with such limited protein support does create an
artificial
> bias to get back very short and incomplete models.
>
> How many contigs do you have in total and what is the N50 value for the
assembly? If you have a large number of very short contigs, you will get
very inflated gene counts because you get genes split across contigs and
many contigs tend t be subtle rearrangements of other contigs just
assembled in a slightly different way (so you can get bits and pieces of
the same genes just rearranged).  This scenario is another confounding
factor if using the est2genome predictor with short ESTs.  I would
recommend running CEGMA to get an estimate for the genome completeness
as
> well as get an estimate of fragmentation as one of the statistics
produced
> is a percent of genes that are found complete (end to end) vs those that
are partial.  CEGMA identifies house keeping genes that tend to be
shorter
> and less intron rich than other genes in the genome, so if CEGMA gives a
high partial percentage and a low complete percentage, then this pattern
can be expected to be even more exaggerated for other genes in the
genome.
>
> If your genome is highly fragmented or proteins do not align well then
there are other strategies.  For example, some vertebrate genomes end up
having extremely fragmented assemblies (on the order of 100,000
contigs),
> and if they are distantly related to other annotated species few
proteins
> may align to the contigs because the introns in the alignments tend to
be
> so long and exons so short that it pushes down the significance scores
too
> much.  In those cases heavy mRNAseq seems to be the best if not only way
to get enough evidence to stitch gene models together.
>
> Thanks,
> Carson
>
>
>
> On 12-11-28 4:40 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>
>>Dear Carson and Daniel,
>>Thanks. I ran sample file for filtering genes based on AED score. The
input gff3 file was provided to option model_pred(see attached file
Scaffold1.gff), the cutoff AED score was set to 0.75. There are at least
>> 5
>>genes with AED score less than 0.75. However there were no genes
>> predicted
>>in the output file(see attached file Scaffold1_out). I have also
attached
>>the maker_opts.ctl. Could you please advice on this.
>>Thanks and regards,
>>Parul Kudtarkar
>>> Use the AED_threshold option in the maker_opts.ctl file if you just want
>>> to restrict final gene models to close matches directly within maker.
>>>On
>>> the other hand, if you are trying to build a dataset for training gene
predictors, use the maker2zff script for generating a filtered dataset
>>>for
>>> SNAP training.  There are a number of filters available. Just call the
script once without parameters to see the options.
>>> Thanks,
>>> Carson
>>> On 12-11-27 5:55 PM, "Daniel Ence" <dence at genetics.utah.edu> wrote:
>>>>Hi Parul,
>>>>I think the way you described (with the maker_opts.ctl file) is how
you
>>>>want to proceed. You still need to give the genome too.
>>>>Daniel
>>>>Daniel Ence
>>>>Graduate Student
>>>>Eccles Institute of Human Genetics
>>>>University of Utah
>>>>15 North 2030 East, Room 2100
>>>>Salt Lake City, UT 84112-5330
>>>>________________________________________
>>>>From: maker-devel-bounces at yandell-lab.org
>>>>[maker-devel-bounces at yandell-lab.org] on behalf of Parul Kudtarkar
[parulk at caltech.edu]
>>>>Sent: Tuesday, November 27, 2012 3:41 PM
>>>>To: Parul Kudtarkar
>>>>Cc: maker-devel at yandell-lab.org
>>>>Subject: Re: [maker-devel] AED score
>>>>Also, are there any other parameters that are required when filtering
based on AED score?
>>>>> Hello Carson,
>>>>> Just to confirm, Is there a script that would filter gene models at
specific AED score.
>>>>> Alternatively if I were to do this within maker with regards to
>>>>>parameters
>>>>> in maker_opts.ctl file I would have to provide my predicted genes gff3
>>>>> file to model_gff and  set AED_threshold at desired threshold?
Thanks and regards,
>>>>> Parul Kudtarkar
>>>>>> AED score with 1 are the ones you don't want.  0 is best and 1 is
worst
>>>>>> as
>>>>>> it is a distance metric.  You can use the AED_threshold parameter
to
>>>>>> require better matching to the evidence by setting it closer to 0.
>>>>>>You
>>>>>> can
>>>>>> also try to increase protein homology evidence as some of your
calls
>>>>>>may
>>>>>> be split genes due to lack of evidence linking them.
>>>>>> --Carson
>>>>>> On 12-11-26 4:35 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>>>>>Dear Maker community,
>>>>>>>For gene-prediction I get training data-set from evidence based
prediction, I use this data-set to train SNAP as well as Augustus
predictions, followed by boot-strapping. I would typically expect
20-30K
>>>>>>>genes however I am getting 8 times the expected gene count
>>>>>>> indicating
>>>>>>> too
>>>>>>>many false positives. Is there a way to further refine these
predication/script to retain predictions with AED score 1 and if
yes
>>>>>>>how
>>>>>>>to go about this?
>>>>>>>Thanks and regards,
>>>>>>>Parul Kudtarkar
>>>>>>>--
>>>>>>>Scientific Programmer
>>>>>>>Center for Computational Regulatory Genomics
>>>>>>>Beckman Institute,
>>>>>>>California Institute of Technology
>>>>>>>http://www.spbase.org
>>>>>>>_______________________________________________
>>>>>>>maker-devel mailing list
>>>>>>>maker-devel at box290.bluehost.com
>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.o
rg
>>>>> --
>>>>> Scientific Programmer
>>>>> Center for Computational Regulatory Genomics
>>>>> Beckman Institute,
>>>>> California Institute of Technology
>>>>> http://www.spbase.org
>>>>--
>>>>Scientific Programmer
>>>>Center for Computational Regulatory Genomics
>>>>Beckman Institute,
>>>>California Institute of Technology
>>>>http://www.spbase.org
>>>>_______________________________________________
>>>>maker-devel mailing list
>>>>maker-devel at box290.bluehost.com
>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
_______________________________________________
>>>>maker-devel mailing list
>>>>maker-devel at box290.bluehost.com
>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>--
>>Scientific Programmer
>>Center for Computational Regulatory Genomics
>>Beckman Institute,
>>California Institute of Technology
>>http://www.spbase.org
>
>
>


--
Scientific Programmer
Center for Computational Regulatory Genomics
Beckman Institute,
California Institute of Technology
http://www.spbase.org








More information about the maker-devel mailing list