[maker-devel] AED score

Wed Dec 5 13:03:46 MST 2012

Dear Carson,

Thanks for a quick response. keep_preds is set to 0
Though for previous step(ab-intio gene predictions keep_preds was set to
1, see Scaffold1_input.gff). I have also attached the output file
Scaffold1_out.gff.
Please advice.

Thanks and regards,
Parul Kudtarkar

> You should always get all predictions, but should only get models
> (match/match_part) with AED scores less than 0.5.  You should never get
> models (gene/mRNA/CDS) with an AED score of 1.00 unless you have
> keep_preds set.
>
> Do you have keeps_preds set or gene models with AED values above your
> threshold (not match/match_part features)?  If the first one is the case,
> just set it to 0.  If the second one in the case, could you send me
> example GFF3?
>
> Thanks,
> Carson
>
>
>
> On 12-12-04 7:27 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>
>>Dear Carson,
>>
>>Thanks once again, we have limited experimental data with very short
>> ESTs.
>>CEGMA is useful for us to gauge our gene-model.
>>On a different note to re-annotate genome(post evidence based
>>prediction(used as training dataset)and abinitio gene-prediction). Here
>>are the control parameters I am using with AED score set to 0.5, however
>> I
>>get predictions that includes the ones with AED score of 1.00 in
>> resulting
>>gff3 file. Though I do see the number of genes reduced to 1/3 of initial
>>gff3 file.
>>
>>#-----Genome (Required for De-Novo Annotation)
>>genome=Scaffold1.fa #genome sequence file in fasta format
>>organism_type=eukaryotic #eukaryotic or prokaryotic. Default is
>> eukaryotic
>>
>>#-----Re-annotation Using MAKER Derived GFF3
>>genome_gff= Scaffold1.gff #re-annotate genome based on this gff3 file
>>est_pass=1 #use ests in genome_gff: 1 = yes, 0 = no
>>altest_pass=0 #use alternate organism ests in genome_gff: 1 = yes, 0 = no
>>protein_pass=1 #use proteins in genome_gff: 1 = yes, 0 = no
>>rm_pass=0 #use repeats in genome_gff: 1 = yes, 0 = no
>>model_pass=1 #use gene models in genome_gff: 1 = yes, 0 = no
>>pred_pass=1 #use ab-initio predictions in genome_gff: 1 = yes, 0 = no
>>other_pass=0 #passthrough everything else in genome_gff: 1 = yes, 0 = no
>>
>>#-----MAKER Behavior Options
>>AED_threshold=0.5 #Maximum Annotation Edit Distance allowed (bound by 0
>>and 1)
>>
>>Thanks and regards,
>>Parul Kudtarkar
>>
>>> Wow 330,000 is a lot. a large portion of genes are likely to be partial
>>>at
>>> best.  You should seriously consider using mRNAseq to capture those by
>>> using maker's est_gff option to pass in results from cufflinks or
>>>trinity.
>>>  Also I wouldn't even try to annotate contigs less than 10kb in size,
>>>just
>>> have maker skip them by setting the min_contig filter in the
>>> maker_opts.ctl file.
>>>
>>> Thanks,
>>> Carson
>>>
>>>
>>>
>>>
>>> On 12-11-29 7:31 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>
>>>>Thanks for the guidance Carson, total contig size is 330,611 with N50
>>>> of
>>>>39.17kb. I agree we have short ESTs. So this is the possible reason
>>>> when
>>>>filtering based on AED score 0.75 there are no gene models predicted
>>>>despite the model_gff file has few genes with scores less than 0.75?
>>>>
>>>>Thanks and regards,
>>>>Parul Kudtarkar
>>>>
>>>>> There are certain characteristics that are apparent in this contig.
>>>>First
>>>>> it seems to be repeat rich with a very low gene density.  You also
>>>>>have
>>>>very short ESTs, and because of the lengths you are probably getting
>>>>many
>>>>> of them to align spuriously which produces very short gene models
>>>>> that
>>>>are
>>>>> more than likely false positives or at the very least just a piece of
>>>>>a
>>>>gene.  I would turn off est2genome as a predictor for this reason
>>>> unless
>>>>you can get longer EST assemblies (i.e. From mRNAseq).    Your protein
>>>>alignments also seem to be few and far between.  You probably need to
>>>>add
>>>>> more proteins from a couple of related species, and you might
>>>>> consider
>>>>using protein2genome rather than est2genome as a predictor if you are
>>>>still working to generate a training set. Also est2genome produced
>>>>models
>>>>> almost always have an AED score near 0 so mixing est2genome with the
>>>>AED_threshold with such limited protein support does create an
>>>>artificial
>>>>> bias to get back very short and incomplete models.
>>>>>
>>>>> How many contigs do you have in total and what is the N50 value for
>>>>>the
>>>>assembly? If you have a large number of very short contigs, you will
>>>> get
>>>>very inflated gene counts because you get genes split across contigs
>>>> and
>>>>many contigs tend t be subtle rearrangements of other contigs just
>>>>assembled in a slightly different way (so you can get bits and pieces
>>>> of
>>>>the same genes just rearranged).  This scenario is another confounding
>>>>factor if using the est2genome predictor with short ESTs.  I would
>>>>recommend running CEGMA to get an estimate for the genome completeness
>>>>as
>>>>> well as get an estimate of fragmentation as one of the statistics
>>>>produced
>>>>> is a percent of genes that are found complete (end to end) vs those
>>>>> that
>>>>are partial.  CEGMA identifies house keeping genes that tend to be
>>>>shorter
>>>>> and less intron rich than other genes in the genome, so if CEGMA
>>>>> gives
>>>>> a
>>>>high partial percentage and a low complete percentage, then this
>>>> pattern
>>>>can be expected to be even more exaggerated for other genes in the
>>>>genome.
>>>>>
>>>>> If your genome is highly fragmented or proteins do not align well
>>>>> then
>>>>there are other strategies.  For example, some vertebrate genomes end
>>>> up
>>>>having extremely fragmented assemblies (on the order of 100,000
>>>>contigs),
>>>>> and if they are distantly related to other annotated species few
>>>>proteins
>>>>> may align to the contigs because the introns in the alignments tend
>>>>> to
>>>>be
>>>>> so long and exons so short that it pushes down the significance
>>>>> scores
>>>>too
>>>>> much.  In those cases heavy mRNAseq seems to be the best if not only
>>>>> way
>>>>to get enough evidence to stitch gene models together.
>>>>>
>>>>> Thanks,
>>>>> Carson
>>>>>
>>>>>
>>>>>
>>>>> On 12-11-28 4:40 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>>>
>>>>>>Dear Carson and Daniel,
>>>>>>Thanks. I ran sample file for filtering genes based on AED score. The
>>>>input gff3 file was provided to option model_pred(see attached file
>>>>Scaffold1.gff), the cutoff AED score was set to 0.75. There are at
>>>> least
>>>>>> 5
>>>>>>genes with AED score less than 0.75. However there were no genes
>>>>>> predicted
>>>>>>in the output file(see attached file Scaffold1_out). I have also
>>>>attached
>>>>>>the maker_opts.ctl. Could you please advice on this.
>>>>>>Thanks and regards,
>>>>>>Parul Kudtarkar
>>>>>>> Use the AED_threshold option in the maker_opts.ctl file if you just
>>>>>>>want
>>>>>>> to restrict final gene models to close matches directly within
>>>>>>>maker.
>>>>>>>On
>>>>>>> the other hand, if you are trying to build a dataset for training
>>>>>>> gene
>>>>predictors, use the maker2zff script for generating a filtered dataset
>>>>>>>for
>>>>>>> SNAP training.  There are a number of filters available. Just call
>>>>>>> the
>>>>script once without parameters to see the options.
>>>>>>> Thanks,
>>>>>>> Carson
>>>>>>> On 12-11-27 5:55 PM, "Daniel Ence" <dence at genetics.utah.edu> wrote:
>>>>>>>>Hi Parul,
>>>>>>>>I think the way you described (with the maker_opts.ctl file) is how
>>>>you
>>>>>>>>want to proceed. You still need to give the genome too.
>>>>>>>>Daniel
>>>>>>>>Daniel Ence
>>>>>>>>Graduate Student
>>>>>>>>Eccles Institute of Human Genetics
>>>>>>>>University of Utah
>>>>>>>>15 North 2030 East, Room 2100
>>>>>>>>Salt Lake City, UT 84112-5330
>>>>>>>>________________________________________
>>>>>>>>From: maker-devel-bounces at yandell-lab.org
>>>>>>>>[maker-devel-bounces at yandell-lab.org] on behalf of Parul Kudtarkar
>>>>[parulk at caltech.edu]
>>>>>>>>Sent: Tuesday, November 27, 2012 3:41 PM
>>>>>>>>To: Parul Kudtarkar
>>>>>>>>Cc: maker-devel at yandell-lab.org
>>>>>>>>Subject: Re: [maker-devel] AED score
>>>>>>>>Also, are there any other parameters that are required when
>>>>>>>>filtering
>>>>based on AED score?
>>>>>>>>> Hello Carson,
>>>>>>>>> Just to confirm, Is there a script that would filter gene models
>>>>>>>>>at
>>>>specific AED score.
>>>>>>>>> Alternatively if I were to do this within maker with regards to
>>>>>>>>>parameters
>>>>>>>>> in maker_opts.ctl file I would have to provide my predicted genes
>>>>>>>>>gff3
>>>>>>>>> file to model_gff and  set AED_threshold at desired threshold?
>>>>Thanks and regards,
>>>>>>>>> Parul Kudtarkar
>>>>>>>>>> AED score with 1 are the ones you don't want.  0 is best and 1
>>>>>>>>>> is
>>>>worst
>>>>>>>>>> as
>>>>>>>>>> it is a distance metric.  You can use the AED_threshold
>>>>>>>>>> parameter
>>>>to
>>>>>>>>>> require better matching to the evidence by setting it closer to
>>>>>>>>>>0.
>>>>>>>>>>You
>>>>>>>>>> can
>>>>>>>>>> also try to increase protein homology evidence as some of your
>>>>calls
>>>>>>>>>>may
>>>>>>>>>> be split genes due to lack of evidence linking them.
>>>>>>>>>> --Carson
>>>>>>>>>> On 12-11-26 4:35 PM, "Parul Kudtarkar" <parulk at caltech.edu>
>>>>>>>>>>wrote:
>>>>>>>>>>>Dear Maker community,
>>>>>>>>>>>For gene-prediction I get training data-set from evidence based
>>>>prediction, I use this data-set to train SNAP as well as Augustus
>>>>predictions, followed by boot-strapping. I would typically expect
>>>>20-30K
>>>>>>>>>>>genes however I am getting 8 times the expected gene count
>>>>>>>>>>> indicating
>>>>>>>>>>> too
>>>>>>>>>>>many false positives. Is there a way to further refine these
>>>>predication/script to retain predictions with AED score 1 and if
>>>>yes
>>>>>>>>>>>how
>>>>>>>>>>>to go about this?
>>>>>>>>>>>Thanks and regards,
>>>>>>>>>>>Parul Kudtarkar
>>>>>>>>>>>--
>>>>>>>>>>>Scientific Programmer
>>>>>>>>>>>Center for Computational Regulatory Genomics
>>>>>>>>>>>Beckman Institute,
>>>>>>>>>>>California Institute of Technology
>>>>>>>>>>>http://www.spbase.org
>>>>>>>>>>>_______________________________________________
>>>>>>>>>>>maker-devel mailing list
>>>>>>>>>>>maker-devel at box290.bluehost.com
>>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-l
>>>>>>>>>>>ab
>>>>>>>>>>>.o
>>>>rg
>>>>>>>>> --
>>>>>>>>> Scientific Programmer
>>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>>> Beckman Institute,
>>>>>>>>> California Institute of Technology
>>>>>>>>> http://www.spbase.org
>>>>>>>>--
>>>>>>>>Scientific Programmer
>>>>>>>>Center for Computational Regulatory Genomics
>>>>>>>>Beckman Institute,
>>>>>>>>California Institute of Technology
>>>>>>>>http://www.spbase.org
>>>>>>>>_______________________________________________
>>>>>>>>maker-devel mailing list
>>>>>>>>maker-devel at box290.bluehost.com
>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.
>>>>>>>>or
>>>>>>>>g
>>>>_______________________________________________
>>>>>>>>maker-devel mailing list
>>>>>>>>maker-devel at box290.bluehost.com
>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.
>>>>>>>>or
>>>>>>>>g
>>>>>>> _______________________________________________
>>>>>>> maker-devel mailing list
>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>
>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.o
>>>>>>>rg
>>>>>>--
>>>>>>Scientific Programmer
>>>>>>Center for Computational Regulatory Genomics
>>>>>>Beckman Institute,
>>>>>>California Institute of Technology
>>>>>>http://www.spbase.org
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>--
>>>>Scientific Programmer
>>>>Center for Computational Regulatory Genomics
>>>>Beckman Institute,
>>>>California Institute of Technology
>>>>http://www.spbase.org
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>--
>>Scientific Programmer
>>Center for Computational Regulatory Genomics
>>Beckman Institute,
>>California Institute of Technology
>>http://www.spbase.org
>>
>
>
>

--
Scientific Programmer
Center for Computational Regulatory Genomics
Beckman Institute,
California Institute of Technology
http://www.spbase.org
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Scaffold1_input.gff
Type: application/octet-stream
Size: 448707 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20121205/f8b88fef/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Scaffold1_out.gff
Type: application/octet-stream
Size: 402716 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20121205/f8b88fef/attachment-0003.obj>