[maker-devel] AED score

Thu Dec 6 13:28:18 MST 2012

Dear Daniel,

Thanks for clearing doubts. For augustus I am using the closest
species(lamprey) to species we are annotating. For SNAP the training
set(hmm file) is generated using predictions made from evidence based
gene-predictions(sorry the file name was mis-leading). I think STEP 3-6
are not required(more or less repetitive without further improving
genemodel). Post Step 1(generating training data for SNAP using evidence
based prediction) and Step2(ab-initio gene-prediction using SNAP and
augustus) do you have recommendations for further improving the gene
model?

Thanks and regards,
Parul Kudtarkar

> Hi Parul,
>
> In step one, if protein2genome isn't turned on, then the protein
> alignments wont be used to generate gene models. Carson said before to use
> protein2genome to generate gene models for your trainings set, but you
> should know that those data aren't being used right now.
>
> In step 2, you can include the est and protein evidence that you used in
> step one, and the ab-initio predictors will takes those data as hints to
> guide their predictions. Also, are you reannotating human here? I'm just
> wondering why the snap file is called Pult, but the augustus species model
> is human. Also, I think you should include the same masking files from
> step 1, otherwise the ab-initio predictors will be predicting on the
> unmasked sequence which will give you many spurious predictions.
>
> In step 3, I'm not certain what you mean by boot-strapping. The control
> file you sent for step 3 will just pass-through all of the data from
> before.
>
> In step 4, I think what you'll get is pretty close to what you already had
> in step 3. The gene models from step 3 are already based on the est and
> protein data from step 1, so without giving any different evidence or
> ab-initio predictors, I think that you will just get the same gene models
> that you got from step 3. Those gene models are what you should be using
> to train snap. Also, is this the same genome as in the other steps? The
> genome file here is called genome.linear.fa, but before it was called
> genome.fa.
>
> Step 5. So maker uses both evidence and ab-initio predictions to create
> models. If you don't give any evidence (EST or protein) maker will not
> annotate any gene models. You have also changed the augustus species model
> in this step, but I don't know what you've gained by going from one
> species to another. You should be training augustus with the gene models
> and then creating a new species model in augustus. As I understand it, the
> filter only operates on gene models, not ab-initio predictions or
> alignments, so it probably isn't doing anything the way you have it set.
>
> Step 6. I think step 5 and 6 should be combined. I don't know what you
> mean by boot-strapping here too.
>
> Hopefully that clears up some of the confusion with how maker works.
> Carson will probably have a lot of suggestions too.
>
> Thanks,
> Daniel
>
> Daniel Ence
> Graduate Student
> Eccles Institute of Human Genetics
> University of Utah
> 15 North 2030 East, Room 2100
> Salt Lake City, UT 84112-5330
> ________________________________________
> From: Parul Kudtarkar [parulk at caltech.edu]
> Sent: Thursday, December 06, 2012 10:08 AM
> To: carsonhh at gmail.com
> Cc: Daniel Ence; maker-devel at yandell-lab.org
> Subject: Re: [maker-devel] AED score
>
> Hi Carson,
>
> Once again thanks for a quick response. We expect to get 25,000 to 30,000
> genes.
> Here are the step
> Step1. EST and proteins are used to predict gene-models. the resulting
> file is used as training data-set for SNAP in step2. est2genome is turned
> ON
> Step2. Augustus and SNAP is used to predict genes.
> Step3. The results are re-annoated(boot-strap)
> Step4. EST and proteins are used to predict gene-models. The gff file from
> step3. is used as model_gff. The resulting file is used as training
> data-set for SNAP in step2
> Step 5. Augustus and SNAP is used to predict genes. with AED set to 0.5
> Step6. The results are re-annoated(boot-strap) with AED set to 0.5 and
> contig_size to 10kb
>
> Thanks and regards,
> Parul Kudtarkar
>
> I have attached the configuration file for each step.
>> Your output has no genes.  They've all been filtered out.  The gene
> predictions are left for reference purposes, but there are no gene models
>> in the file.  You need to look at the type columns in the GFF3 file -->
> match/match_part features are evidence and reference data but not
> models.
>> All models will have types gene/mRNA/exon/CDS.  For example if you just
> want gene models in your file you can use the gff3_merge script with the
> -g option, and it will only print out the gene models.
>> I think you may be misinterpreting what is happening at different steps,
> as well as how to read the result files.  Could you give me a detailed
> explanation of what you expect to get back together with your control
> files and I can walk you through the configuration, and indicate what to
> expect.  Also what was the report like from CEGMA. Could you include the
> report file that shows how complete your genome is and how fragmented it
> is?
>> Thanks,
>> Carson
>> On 12-12-05 3:03 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>Dear Carson,
>>>Thanks for a quick response. keep_preds is set to 0
>>>Though for previous step(ab-intio gene predictions keep_preds was set to
> 1, see Scaffold1_input.gff). I have also attached the output file
> Scaffold1_out.gff.
>>>Please advice.
>>>Thanks and regards,
>>>Parul Kudtarkar
>>>> You should always get all predictions, but should only get models
> (match/match_part) with AED scores less than 0.5.  You should never get
>>>> models (gene/mRNA/CDS) with an AED score of 1.00 unless you have
> keep_preds set.
>>>> Do you have keeps_preds set or gene models with AED values above your
> threshold (not match/match_part features)?  If the first one is the
>>>>case,
>>>> just set it to 0.  If the second one in the case, could you send me
> example GFF3?
>>>> Thanks,
>>>> Carson
>>>> On 12-12-04 7:27 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>>>Dear Carson,
>>>>>Thanks once again, we have limited experimental data with very short
>>>>> ESTs.
>>>>>CEGMA is useful for us to gauge our gene-model.
>>>>>On a different note to re-annotate genome(post evidence based
>>>>>prediction(used as training dataset)and abinitio gene-prediction).
> Here
>>>>>are the control parameters I am using with AED score set to 0.5,
>>>>> however
>>>>> I
>>>>>get predictions that includes the ones with AED score of 1.00 in
>>>>> resulting
>>>>>gff3 file. Though I do see the number of genes reduced to 1/3 of
>>>>> initial
>>>>>gff3 file.
>>>>>#-----Genome (Required for De-Novo Annotation)
>>>>>genome=Scaffold1.fa #genome sequence file in fasta format
>>>>>organism_type=eukaryotic #eukaryotic or prokaryotic. Default is
>>>>> eukaryotic
>>>>>#-----Re-annotation Using MAKER Derived GFF3
>>>>>genome_gff= Scaffold1.gff #re-annotate genome based on this gff3 file
> est_pass=1 #use ests in genome_gff: 1 = yes, 0 = no
>>>>>altest_pass=0 #use alternate organism ests in genome_gff: 1 = yes, 0 =
> no
>>>>>protein_pass=1 #use proteins in genome_gff: 1 = yes, 0 = no
>>>>>rm_pass=0 #use repeats in genome_gff: 1 = yes, 0 = no
>>>>>model_pass=1 #use gene models in genome_gff: 1 = yes, 0 = no
>>>>>pred_pass=1 #use ab-initio predictions in genome_gff: 1 = yes, 0 = no
> other_pass=0 #passthrough everything else in genome_gff: 1 = yes, 0 =
>>>>> no
>>>>>#-----MAKER Behavior Options
>>>>>AED_threshold=0.5 #Maximum Annotation Edit Distance allowed (bound by
> 0
>>>>>and 1)
>>>>>Thanks and regards,
>>>>>Parul Kudtarkar
>>>>>> Wow 330,000 is a lot. a large portion of genes are likely to be
>>>>>>partial
>>>>>>at
>>>>>> best.  You should seriously consider using mRNAseq to capture those
> by
>>>>>> using maker's est_gff option to pass in results from cufflinks or
>>>>>>trinity.
>>>>>>  Also I wouldn't even try to annotate contigs less than 10kb in
> size,
>>>>>>just
>>>>>> have maker skip them by setting the min_contig filter in the
> maker_opts.ctl file.
>>>>>> Thanks,
>>>>>> Carson
>>>>>> On 12-11-29 7:31 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>>>>>Thanks for the guidance Carson, total contig size is 330,611 with
> N50
>>>>>>> of
>>>>>>>39.17kb. I agree we have short ESTs. So this is the possible reason
>>>>>>> when
>>>>>>>filtering based on AED score 0.75 there are no gene models predicted
> despite the model_gff file has few genes with scores less than 0.75?
> Thanks and regards,
>>>>>>>Parul Kudtarkar
>>>>>>>> There are certain characteristics that are apparent in this
> contig.
>>>>>>>First
>>>>>>>> it seems to be repeat rich with a very low gene density.  You also
>>>>>>>>have
>>>>>>>very short ESTs, and because of the lengths you are probably getting
> many
>>>>>>>> of them to align spuriously which produces very short gene models
> that
>>>>>>>are
>>>>>>>> more than likely false positives or at the very least just a piece
>>>>>>>>of
>>>>>>>>a
>>>>>>>gene.  I would turn off est2genome as a predictor for this reason
>>>>>>> unless
>>>>>>>you can get longer EST assemblies (i.e. From mRNAseq).    Your
>>>>>>> protein
>>>>>>>alignments also seem to be few and far between.  You probably need
> to
>>>>>>>add
>>>>>>>> more proteins from a couple of related species, and you might
> consider
>>>>>>>using protein2genome rather than est2genome as a predictor if you
> are
>>>>>>>still working to generate a training set. Also est2genome produced
> models
>>>>>>>> almost always have an AED score near 0 so mixing est2genome with
> the
>>>>>>>AED_threshold with such limited protein support does create an
> artificial
>>>>>>>> bias to get back very short and incomplete models.
>>>>>>>> How many contigs do you have in total and what is the N50 value
> for
>>>>>>>>the
>>>>>>>assembly? If you have a large number of very short contigs, you will
>>>>>>> get
>>>>>>>very inflated gene counts because you get genes split across contigs
>>>>>>> and
>>>>>>>many contigs tend t be subtle rearrangements of other contigs just
> assembled in a slightly different way (so you can get bits and
> pieces
>>>>>>> of
>>>>>>>the same genes just rearranged).  This scenario is another
>>>>>>> confounding
>>>>>>>factor if using the est2genome predictor with short ESTs.  I would
> recommend running CEGMA to get an estimate for the genome
>>>>>>> completeness
>>>>>>>as
>>>>>>>> well as get an estimate of fragmentation as one of the statistics
>>>>>>>produced
>>>>>>>> is a percent of genes that are found complete (end to end) vs
> those
>>>>>>>> that
>>>>>>>are partial.  CEGMA identifies house keeping genes that tend to be
> shorter
>>>>>>>> and less intron rich than other genes in the genome, so if CEGMA
> gives
>>>>>>>> a
>>>>>>>high partial percentage and a low complete percentage, then this
>>>>>>> pattern
>>>>>>>can be expected to be even more exaggerated for other genes in the
> genome.
>>>>>>>> If your genome is highly fragmented or proteins do not align well
> then
>>>>>>>there are other strategies.  For example, some vertebrate genomes
> end
>>>>>>> up
>>>>>>>having extremely fragmented assemblies (on the order of 100,000
> contigs),
>>>>>>>> and if they are distantly related to other annotated species few
>>>>>>>proteins
>>>>>>>> may align to the contigs because the introns in the alignments
> tend
>>>>>>>> to
>>>>>>>be
>>>>>>>> so long and exons so short that it pushes down the significance
> scores
>>>>>>>too
>>>>>>>> much.  In those cases heavy mRNAseq seems to be the best if not
> only
>>>>>>>> way
>>>>>>>to get enough evidence to stitch gene models together.
>>>>>>>> Thanks,
>>>>>>>> Carson
>>>>>>>> On 12-11-28 4:40 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>>>>>>>Dear Carson and Daniel,
>>>>>>>>>Thanks. I ran sample file for filtering genes based on AED score.
> The
>>>>>>>input gff3 file was provided to option model_pred(see attached file
> Scaffold1.gff), the cutoff AED score was set to 0.75. There are at
>>>>>>> least
>>>>>>>>> 5
>>>>>>>>>genes with AED score less than 0.75. However there were no genes
>>>>>>>>> predicted
>>>>>>>>>in the output file(see attached file Scaffold1_out). I have also
>>>>>>>attached
>>>>>>>>>the maker_opts.ctl. Could you please advice on this.
>>>>>>>>>Thanks and regards,
>>>>>>>>>Parul Kudtarkar
>>>>>>>>>> Use the AED_threshold option in the maker_opts.ctl file if you
>>>>>>>>>>just
>>>>>>>>>>want
>>>>>>>>>> to restrict final gene models to close matches directly within
>>>>>>>>>>maker.
>>>>>>>>>>On
>>>>>>>>>> the other hand, if you are trying to build a dataset for
> training
>>>>>>>>>> gene
>>>>>>>predictors, use the maker2zff script for generating a filtered
>>>>>>> dataset
>>>>>>>>>>for
>>>>>>>>>> SNAP training.  There are a number of filters available. Just
> call
>>>>>>>>>> the
>>>>>>>script once without parameters to see the options.
>>>>>>>>>> Thanks,
>>>>>>>>>> Carson
>>>>>>>>>> On 12-11-27 5:55 PM, "Daniel Ence" <dence at genetics.utah.edu>
>>>>>>>>>>wrote:
>>>>>>>>>>>Hi Parul,
>>>>>>>>>>>I think the way you described (with the maker_opts.ctl file) is
> how
>>>>>>>you
>>>>>>>>>>>want to proceed. You still need to give the genome too.
>>>>>>>>>>>Daniel
>>>>>>>>>>>Daniel Ence
>>>>>>>>>>>Graduate Student
>>>>>>>>>>>Eccles Institute of Human Genetics
>>>>>>>>>>>University of Utah
>>>>>>>>>>>15 North 2030 East, Room 2100
>>>>>>>>>>>Salt Lake City, UT 84112-5330
>>>>>>>>>>>________________________________________
>>>>>>>>>>>From: maker-devel-bounces at yandell-lab.org
>>>>>>>>>>>[maker-devel-bounces at yandell-lab.org] on behalf of Parul
>>>>>>>>>>> Kudtarkar
>>>>>>>[parulk at caltech.edu]
>>>>>>>>>>>Sent: Tuesday, November 27, 2012 3:41 PM
>>>>>>>>>>>To: Parul Kudtarkar
>>>>>>>>>>>Cc: maker-devel at yandell-lab.org
>>>>>>>>>>>Subject: Re: [maker-devel] AED score
>>>>>>>>>>>Also, are there any other parameters that are required when
> filtering
>>>>>>>based on AED score?
>>>>>>>>>>>> Hello Carson,
>>>>>>>>>>>> Just to confirm, Is there a script that would filter gene
> models
>>>>>>>>>>>>at
>>>>>>>specific AED score.
>>>>>>>>>>>> Alternatively if I were to do this within maker with regards
> to
>>>>>>>>>>>>parameters
>>>>>>>>>>>> in maker_opts.ctl file I would have to provide my predicted
>>>>>>>>>>>>genes
>>>>>>>>>>>>gff3
>>>>>>>>>>>> file to model_gff and  set AED_threshold at desired threshold?
>>>>>>>Thanks and regards,
>>>>>>>>>>>> Parul Kudtarkar
>>>>>>>>>>>>> AED score with 1 are the ones you don't want.  0 is best and
> 1
>>>>>>>>>>>>> is
>>>>>>>worst
>>>>>>>>>>>>> as
>>>>>>>>>>>>> it is a distance metric.  You can use the AED_threshold
> parameter
>>>>>>>to
>>>>>>>>>>>>> require better matching to the evidence by setting it closer
> to
>>>>>>>>>>>>>0.
>>>>>>>>>>>>>You
>>>>>>>>>>>>> can
>>>>>>>>>>>>> also try to increase protein homology evidence as some of
> your
>>>>>>>calls
>>>>>>>>>>>>>may
>>>>>>>>>>>>> be split genes due to lack of evidence linking them.
>>>>>>>>>>>>> --Carson
>>>>>>>>>>>>> On 12-11-26 4:35 PM, "Parul Kudtarkar" <parulk at caltech.edu>
>>>>>>>>>>>>>wrote:
>>>>>>>>>>>>>>Dear Maker community,
>>>>>>>>>>>>>>For gene-prediction I get training data-set from evidence
>>>>>>>>>>>>>> based
>>>>>>>prediction, I use this data-set to train SNAP as well as Augustus
> predictions, followed by boot-strapping. I would typically expect
> 20-30K
>>>>>>>>>>>>>>genes however I am getting 8 times the expected gene count
>>>>>>>>>>>>>> indicating
>>>>>>>>>>>>>> too
>>>>>>>>>>>>>>many false positives. Is there a way to further refine these
>>>>>>>predication/script to retain predictions with AED score 1 and if yes
>>>>>>>>>>>>>>how
>>>>>>>>>>>>>>to go about this?
>>>>>>>>>>>>>>Thanks and regards,
>>>>>>>>>>>>>>Parul Kudtarkar
>>>>>>>>>>>>>>--
>>>>>>>>>>>>>>Scientific Programmer
>>>>>>>>>>>>>>Center for Computational Regulatory Genomics
>>>>>>>>>>>>>>Beckman Institute,
>>>>>>>>>>>>>>California Institute of Technology
>>>>>>>>>>>>>>http://www.spbase.org
>>>>>>>>>>>>>>_______________________________________________
>>>>>>>>>>>>>>maker-devel mailing list
>>>>>>>>>>>>>>maker-devel at box290.bluehost.com
>>>>>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell
> -l
>>>>>>>>>>>>>>ab
>>>>>>>>>>>>>>.o
>>>>>>>rg
>>>>>>>>>>>> --
>>>>>>>>>>>> Scientific Programmer
>>>>>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>>>>>> Beckman Institute,
>>>>>>>>>>>> California Institute of Technology
>>>>>>>>>>>> http://www.spbase.org
>>>>>>>>>>>--
>>>>>>>>>>>Scientific Programmer
>>>>>>>>>>>Center for Computational Regulatory Genomics
>>>>>>>>>>>Beckman Institute,
>>>>>>>>>>>California Institute of Technology
>>>>>>>>>>>http://www.spbase.org
>>>>>>>>>>>_______________________________________________
>>>>>>>>>>>maker-devel mailing list
>>>>>>>>>>>maker-devel at box290.bluehost.com
>>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-la
> b.
>>>>>>>>>>>or
>>>>>>>>>>>g
>>>>>>>_______________________________________________
>>>>>>>>>>>maker-devel mailing list
>>>>>>>>>>>maker-devel at box290.bluehost.com
>>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-la
> b.
>>>>>>>>>>>or
>>>>>>>>>>>g
>>>>>>>>>> _______________________________________________
>>>>>>>>>> maker-devel mailing list
>>>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab
> .o
>>>>>>>>>>rg
>>>>>>>>>--
>>>>>>>>>Scientific Programmer
>>>>>>>>>Center for Computational Regulatory Genomics
>>>>>>>>>Beckman Institute,
>>>>>>>>>California Institute of Technology
>>>>>>>>>http://www.spbase.org
>>>>>>>--
>>>>>>>Scientific Programmer
>>>>>>>Center for Computational Regulatory Genomics
>>>>>>>Beckman Institute,
>>>>>>>California Institute of Technology
>>>>>>>http://www.spbase.org
>>>>>--
>>>>>Scientific Programmer
>>>>>Center for Computational Regulatory Genomics
>>>>>Beckman Institute,
>>>>>California Institute of Technology
>>>>>http://www.spbase.org
>>>--
>>>Scientific Programmer
>>>Center for Computational Regulatory Genomics
>>>Beckman Institute,
>>>California Institute of Technology
>>>http://www.spbase.org
>
>
> --
> Scientific Programmer
> Center for Computational Regulatory Genomics
> Beckman Institute,
> California Institute of Technology
> http://www.spbase.org
>
> --
> Scientific Programmer
> Center for Computational Regulatory Genomics
> Beckman Institute,
> California Institute of Technology
> http://www.spbase.org
>
>
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>

--
Scientific Programmer
Center for Computational Regulatory Genomics
Beckman Institute,
California Institute of Technology
http://www.spbase.org