[maker-devel] AED score

Tue Dec 4 17:27:18 MST 2012

Dear Carson,

Thanks once again, we have limited experimental data with very short ESTs.
CEGMA is useful for us to gauge our gene-model.
On a different note to re-annotate genome(post evidence based
prediction(used as training dataset)and abinitio gene-prediction). Here
are the control parameters I am using with AED score set to 0.5, however I
get predictions that includes the ones with AED score of 1.00 in resulting
gff3 file. Though I do see the number of genes reduced to 1/3 of initial
gff3 file.

#-----Genome (Required for De-Novo Annotation)
genome=Scaffold1.fa #genome sequence file in fasta format
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

#-----Re-annotation Using MAKER Derived GFF3
genome_gff= Scaffold1.gff #re-annotate genome based on this gff3 file
est_pass=1 #use ests in genome_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ests in genome_gff: 1 = yes, 0 = no
protein_pass=1 #use proteins in genome_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in genome_gff: 1 = yes, 0 = no
model_pass=1 #use gene models in genome_gff: 1 = yes, 0 = no
pred_pass=1 #use ab-initio predictions in genome_gff: 1 = yes, 0 = no
other_pass=0 #passthrough everything else in genome_gff: 1 = yes, 0 = no

#-----MAKER Behavior Options
AED_threshold=0.5 #Maximum Annotation Edit Distance allowed (bound by 0
and 1)

Thanks and regards,
Parul Kudtarkar

> Wow 330,000 is a lot. a large portion of genes are likely to be partial at
> best.  You should seriously consider using mRNAseq to capture those by
> using maker's est_gff option to pass in results from cufflinks or trinity.
>  Also I wouldn't even try to annotate contigs less than 10kb in size, just
> have maker skip them by setting the min_contig filter in the
> maker_opts.ctl file.
>
> Thanks,
> Carson
>
>
>
>
> On 12-11-29 7:31 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>
>>Thanks for the guidance Carson, total contig size is 330,611 with N50 of
>>39.17kb. I agree we have short ESTs. So this is the possible reason when
>>filtering based on AED score 0.75 there are no gene models predicted
>>despite the model_gff file has few genes with scores less than 0.75?
>>
>>Thanks and regards,
>>Parul Kudtarkar
>>
>>> There are certain characteristics that are apparent in this contig.
>>First
>>> it seems to be repeat rich with a very low gene density.  You also have
>>very short ESTs, and because of the lengths you are probably getting
>>many
>>> of them to align spuriously which produces very short gene models that
>>are
>>> more than likely false positives or at the very least just a piece of a
>>gene.  I would turn off est2genome as a predictor for this reason unless
>>you can get longer EST assemblies (i.e. From mRNAseq).    Your protein
>>alignments also seem to be few and far between.  You probably need to
>>add
>>> more proteins from a couple of related species, and you might consider
>>using protein2genome rather than est2genome as a predictor if you are
>>still working to generate a training set. Also est2genome produced
>>models
>>> almost always have an AED score near 0 so mixing est2genome with the
>>AED_threshold with such limited protein support does create an
>>artificial
>>> bias to get back very short and incomplete models.
>>>
>>> How many contigs do you have in total and what is the N50 value for the
>>assembly? If you have a large number of very short contigs, you will get
>>very inflated gene counts because you get genes split across contigs and
>>many contigs tend t be subtle rearrangements of other contigs just
>>assembled in a slightly different way (so you can get bits and pieces of
>>the same genes just rearranged).  This scenario is another confounding
>>factor if using the est2genome predictor with short ESTs.  I would
>>recommend running CEGMA to get an estimate for the genome completeness
>>as
>>> well as get an estimate of fragmentation as one of the statistics
>>produced
>>> is a percent of genes that are found complete (end to end) vs those
>>> that
>>are partial.  CEGMA identifies house keeping genes that tend to be
>>shorter
>>> and less intron rich than other genes in the genome, so if CEGMA gives
>>> a
>>high partial percentage and a low complete percentage, then this pattern
>>can be expected to be even more exaggerated for other genes in the
>>genome.
>>>
>>> If your genome is highly fragmented or proteins do not align well then
>>there are other strategies.  For example, some vertebrate genomes end up
>>having extremely fragmented assemblies (on the order of 100,000
>>contigs),
>>> and if they are distantly related to other annotated species few
>>proteins
>>> may align to the contigs because the introns in the alignments tend to
>>be
>>> so long and exons so short that it pushes down the significance scores
>>too
>>> much.  In those cases heavy mRNAseq seems to be the best if not only
>>> way
>>to get enough evidence to stitch gene models together.
>>>
>>> Thanks,
>>> Carson
>>>
>>>
>>>
>>> On 12-11-28 4:40 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>
>>>>Dear Carson and Daniel,
>>>>Thanks. I ran sample file for filtering genes based on AED score. The
>>input gff3 file was provided to option model_pred(see attached file
>>Scaffold1.gff), the cutoff AED score was set to 0.75. There are at least
>>>> 5
>>>>genes with AED score less than 0.75. However there were no genes
>>>> predicted
>>>>in the output file(see attached file Scaffold1_out). I have also
>>attached
>>>>the maker_opts.ctl. Could you please advice on this.
>>>>Thanks and regards,
>>>>Parul Kudtarkar
>>>>> Use the AED_threshold option in the maker_opts.ctl file if you just
>>>>>want
>>>>> to restrict final gene models to close matches directly within maker.
>>>>>On
>>>>> the other hand, if you are trying to build a dataset for training
>>>>> gene
>>predictors, use the maker2zff script for generating a filtered dataset
>>>>>for
>>>>> SNAP training.  There are a number of filters available. Just call
>>>>> the
>>script once without parameters to see the options.
>>>>> Thanks,
>>>>> Carson
>>>>> On 12-11-27 5:55 PM, "Daniel Ence" <dence at genetics.utah.edu> wrote:
>>>>>>Hi Parul,
>>>>>>I think the way you described (with the maker_opts.ctl file) is how
>>you
>>>>>>want to proceed. You still need to give the genome too.
>>>>>>Daniel
>>>>>>Daniel Ence
>>>>>>Graduate Student
>>>>>>Eccles Institute of Human Genetics
>>>>>>University of Utah
>>>>>>15 North 2030 East, Room 2100
>>>>>>Salt Lake City, UT 84112-5330
>>>>>>________________________________________
>>>>>>From: maker-devel-bounces at yandell-lab.org
>>>>>>[maker-devel-bounces at yandell-lab.org] on behalf of Parul Kudtarkar
>>[parulk at caltech.edu]
>>>>>>Sent: Tuesday, November 27, 2012 3:41 PM
>>>>>>To: Parul Kudtarkar
>>>>>>Cc: maker-devel at yandell-lab.org
>>>>>>Subject: Re: [maker-devel] AED score
>>>>>>Also, are there any other parameters that are required when filtering
>>based on AED score?
>>>>>>> Hello Carson,
>>>>>>> Just to confirm, Is there a script that would filter gene models at
>>specific AED score.
>>>>>>> Alternatively if I were to do this within maker with regards to
>>>>>>>parameters
>>>>>>> in maker_opts.ctl file I would have to provide my predicted genes
>>>>>>>gff3
>>>>>>> file to model_gff and  set AED_threshold at desired threshold?
>>Thanks and regards,
>>>>>>> Parul Kudtarkar
>>>>>>>> AED score with 1 are the ones you don't want.  0 is best and 1 is
>>worst
>>>>>>>> as
>>>>>>>> it is a distance metric.  You can use the AED_threshold parameter
>>to
>>>>>>>> require better matching to the evidence by setting it closer to 0.
>>>>>>>>You
>>>>>>>> can
>>>>>>>> also try to increase protein homology evidence as some of your
>>calls
>>>>>>>>may
>>>>>>>> be split genes due to lack of evidence linking them.
>>>>>>>> --Carson
>>>>>>>> On 12-11-26 4:35 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>>>>>>>Dear Maker community,
>>>>>>>>>For gene-prediction I get training data-set from evidence based
>>prediction, I use this data-set to train SNAP as well as Augustus
>>predictions, followed by boot-strapping. I would typically expect
>>20-30K
>>>>>>>>>genes however I am getting 8 times the expected gene count
>>>>>>>>> indicating
>>>>>>>>> too
>>>>>>>>>many false positives. Is there a way to further refine these
>>predication/script to retain predictions with AED score 1 and if
>>yes
>>>>>>>>>how
>>>>>>>>>to go about this?
>>>>>>>>>Thanks and regards,
>>>>>>>>>Parul Kudtarkar
>>>>>>>>>--
>>>>>>>>>Scientific Programmer
>>>>>>>>>Center for Computational Regulatory Genomics
>>>>>>>>>Beckman Institute,
>>>>>>>>>California Institute of Technology
>>>>>>>>>http://www.spbase.org
>>>>>>>>>_______________________________________________
>>>>>>>>>maker-devel mailing list
>>>>>>>>>maker-devel at box290.bluehost.com
>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab
>>>>>>>>>.o
>>rg
>>>>>>> --
>>>>>>> Scientific Programmer
>>>>>>> Center for Computational Regulatory Genomics
>>>>>>> Beckman Institute,
>>>>>>> California Institute of Technology
>>>>>>> http://www.spbase.org
>>>>>>--
>>>>>>Scientific Programmer
>>>>>>Center for Computational Regulatory Genomics
>>>>>>Beckman Institute,
>>>>>>California Institute of Technology
>>>>>>http://www.spbase.org
>>>>>>_______________________________________________
>>>>>>maker-devel mailing list
>>>>>>maker-devel at box290.bluehost.com
>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.or
>>>>>>g
>>_______________________________________________
>>>>>>maker-devel mailing list
>>>>>>maker-devel at box290.bluehost.com
>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.or
>>>>>>g
>>>>> _______________________________________________
>>>>> maker-devel mailing list
>>>>> maker-devel at box290.bluehost.com
>>>>>
>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>>--
>>>>Scientific Programmer
>>>>Center for Computational Regulatory Genomics
>>>>Beckman Institute,
>>>>California Institute of Technology
>>>>http://www.spbase.org
>>>
>>>
>>>
>>
>>
>>--
>>Scientific Programmer
>>Center for Computational Regulatory Genomics
>>Beckman Institute,
>>California Institute of Technology
>>http://www.spbase.org
>>
>>
>>
>>
>
>
>

--
Scientific Programmer
Center for Computational Regulatory Genomics
Beckman Institute,
California Institute of Technology
http://www.spbase.org