[maker-devel] AED score

Thu Dec 6 15:51:42 MST 2012

Dear Daniel,

Once again thanks for extensive help. We get over estimation of number of
genes at the end of step 2. So was wondering if there is a way to pick
only the best annotated. That been said assembly, ESTs and proteins are
not good and I am currently in process of running cegma.

Also for step 2(ab-initio gene-prediction using SNAP and Augustus) is it
best to turn off  keep_preds to remove any possible false-positives. Also
should est2genome turned ON? Of course as pointed in previous email use
est,proteins apart from training data to generate better gene-model and
repeat masking as well to mask repeats for ab-initio predictors?
I believe while running gff3_merge it is best to use option -g to get only
the gene-predictions and filter evidence.

Thanks and regards,
Parul Kudtarkar

> Hi Parul,
> I dont' really have many suggestions for improving the gene models after
> the annotation is done. Annotation is very dependent on the data you have
> at hand (the assembly, ESTs and proteins). One thing you could do to get
> more annotations is to run something like iprscan on the ab-initio
> predictions that didn't overlap any evidence and look for predictions that
> contain a pfam domain. Then you can send those predictions back through
> maker and promote them to gene model status.
>
> Do you have the CEGMA results that Carson asked about? That really will
> tell you what kind of annotation results you can expect. If the assembly
> doesn't have an N50 greater than the expected median gene size, then you
> can't expect very good results from automated annotation.
>
> Thanks,
> Daniel
>
> Daniel Ence
> Graduate Student
> Eccles Institute of Human Genetics
> University of Utah
> 15 North 2030 East, Room 2100
> Salt Lake City, UT 84112-5330
> ________________________________________
> From: Parul Kudtarkar [parulk at caltech.edu]
> Sent: Thursday, December 06, 2012 1:28 PM
> To: Daniel Ence
> Cc: maker-devel at yandell-lab.org
> Subject: Re: [maker-devel] AED score
>
> Dear Daniel,
>
> Thanks for clearing doubts. For augustus I am using the closest
> species(lamprey) to species we are annotating. For SNAP the training
> set(hmm file) is generated using predictions made from evidence based
> gene-predictions(sorry the file name was mis-leading). I think STEP 3-6
> are not required(more or less repetitive without further improving
> genemodel). Post Step 1(generating training data for SNAP using evidence
> based prediction) and Step2(ab-initio gene-prediction using SNAP and
> augustus) do you have recommendations for further improving the gene
> model?
>
> Thanks and regards,
> Parul Kudtarkar
>
>> Hi Parul,
>>
>> In step one, if protein2genome isn't turned on, then the protein
>> alignments wont be used to generate gene models. Carson said before to
>> use
>> protein2genome to generate gene models for your trainings set, but you
>> should know that those data aren't being used right now.
>>
>> In step 2, you can include the est and protein evidence that you used in
>> step one, and the ab-initio predictors will takes those data as hints to
>> guide their predictions. Also, are you reannotating human here? I'm just
>> wondering why the snap file is called Pult, but the augustus species
>> model
>> is human. Also, I think you should include the same masking files from
>> step 1, otherwise the ab-initio predictors will be predicting on the
>> unmasked sequence which will give you many spurious predictions.
>>
>> In step 3, I'm not certain what you mean by boot-strapping. The control
>> file you sent for step 3 will just pass-through all of the data from
>> before.
>>
>> In step 4, I think what you'll get is pretty close to what you already
>> had
>> in step 3. The gene models from step 3 are already based on the est and
>> protein data from step 1, so without giving any different evidence or
>> ab-initio predictors, I think that you will just get the same gene
>> models
>> that you got from step 3. Those gene models are what you should be using
>> to train snap. Also, is this the same genome as in the other steps? The
>> genome file here is called genome.linear.fa, but before it was called
>> genome.fa.
>>
>> Step 5. So maker uses both evidence and ab-initio predictions to create
>> models. If you don't give any evidence (EST or protein) maker will not
>> annotate any gene models. You have also changed the augustus species
>> model
>> in this step, but I don't know what you've gained by going from one
>> species to another. You should be training augustus with the gene models
>> and then creating a new species model in augustus. As I understand it,
>> the
>> filter only operates on gene models, not ab-initio predictions or
>> alignments, so it probably isn't doing anything the way you have it set.
>>
>> Step 6. I think step 5 and 6 should be combined. I don't know what you
>> mean by boot-strapping here too.
>>
>> Hopefully that clears up some of the confusion with how maker works.
>> Carson will probably have a lot of suggestions too.
>>
>> Thanks,
>> Daniel
>>
>> Daniel Ence
>> Graduate Student
>> Eccles Institute of Human Genetics
>> University of Utah
>> 15 North 2030 East, Room 2100
>> Salt Lake City, UT 84112-5330
>> ________________________________________
>> From: Parul Kudtarkar [parulk at caltech.edu]
>> Sent: Thursday, December 06, 2012 10:08 AM
>> To: carsonhh at gmail.com
>> Cc: Daniel Ence; maker-devel at yandell-lab.org
>> Subject: Re: [maker-devel] AED score
>>
>> Hi Carson,
>>
>> Once again thanks for a quick response. We expect to get 25,000 to
>> 30,000
>> genes.
>> Here are the step
>> Step1. EST and proteins are used to predict gene-models. the resulting
>> file is used as training data-set for SNAP in step2. est2genome is
>> turned
>> ON
>> Step2. Augustus and SNAP is used to predict genes.
>> Step3. The results are re-annoated(boot-strap)
>> Step4. EST and proteins are used to predict gene-models. The gff file
>> from
>> step3. is used as model_gff. The resulting file is used as training
>> data-set for SNAP in step2
>> Step 5. Augustus and SNAP is used to predict genes. with AED set to 0.5
>> Step6. The results are re-annoated(boot-strap) with AED set to 0.5 and
>> contig_size to 10kb
>>
>> Thanks and regards,
>> Parul Kudtarkar
>>
>> I have attached the configuration file for each step.
>>> Your output has no genes.  They've all been filtered out.  The gene
>> predictions are left for reference purposes, but there are no gene
>> models
>>> in the file.  You need to look at the type columns in the GFF3 file -->
>> match/match_part features are evidence and reference data but not
>> models.
>>> All models will have types gene/mRNA/exon/CDS.  For example if you just
>> want gene models in your file you can use the gff3_merge script with the
>> -g option, and it will only print out the gene models.
>>> I think you may be misinterpreting what is happening at different
>>> steps,
>> as well as how to read the result files.  Could you give me a detailed
>> explanation of what you expect to get back together with your control
>> files and I can walk you through the configuration, and indicate what to
>> expect.  Also what was the report like from CEGMA. Could you include the
>> report file that shows how complete your genome is and how fragmented it
>> is?
>>> Thanks,
>>> Carson
>>> On 12-12-05 3:03 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>>Dear Carson,
>>>>Thanks for a quick response. keep_preds is set to 0
>>>>Though for previous step(ab-intio gene predictions keep_preds was set
>>>> to
>> 1, see Scaffold1_input.gff). I have also attached the output file
>> Scaffold1_out.gff.
>>>>Please advice.
>>>>Thanks and regards,
>>>>Parul Kudtarkar
>>>>> You should always get all predictions, but should only get models
>> (match/match_part) with AED scores less than 0.5.  You should never get
>>>>> models (gene/mRNA/CDS) with an AED score of 1.00 unless you have
>> keep_preds set.
>>>>> Do you have keeps_preds set or gene models with AED values above your
>> threshold (not match/match_part features)?  If the first one is the
>>>>>case,
>>>>> just set it to 0.  If the second one in the case, could you send me
>> example GFF3?
>>>>> Thanks,
>>>>> Carson
>>>>> On 12-12-04 7:27 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>>>>Dear Carson,
>>>>>>Thanks once again, we have limited experimental data with very short
>>>>>> ESTs.
>>>>>>CEGMA is useful for us to gauge our gene-model.
>>>>>>On a different note to re-annotate genome(post evidence based
>>>>>>prediction(used as training dataset)and abinitio gene-prediction).
>> Here
>>>>>>are the control parameters I am using with AED score set to 0.5,
>>>>>> however
>>>>>> I
>>>>>>get predictions that includes the ones with AED score of 1.00 in
>>>>>> resulting
>>>>>>gff3 file. Though I do see the number of genes reduced to 1/3 of
>>>>>> initial
>>>>>>gff3 file.
>>>>>>#-----Genome (Required for De-Novo Annotation)
>>>>>>genome=Scaffold1.fa #genome sequence file in fasta format
>>>>>>organism_type=eukaryotic #eukaryotic or prokaryotic. Default is
>>>>>> eukaryotic
>>>>>>#-----Re-annotation Using MAKER Derived GFF3
>>>>>>genome_gff= Scaffold1.gff #re-annotate genome based on this gff3 file
>> est_pass=1 #use ests in genome_gff: 1 = yes, 0 = no
>>>>>>altest_pass=0 #use alternate organism ests in genome_gff: 1 = yes, 0
>>>>>> =
>> no
>>>>>>protein_pass=1 #use proteins in genome_gff: 1 = yes, 0 = no
>>>>>>rm_pass=0 #use repeats in genome_gff: 1 = yes, 0 = no
>>>>>>model_pass=1 #use gene models in genome_gff: 1 = yes, 0 = no
>>>>>>pred_pass=1 #use ab-initio predictions in genome_gff: 1 = yes, 0 = no
>> other_pass=0 #passthrough everything else in genome_gff: 1 = yes, 0 =
>>>>>> no
>>>>>>#-----MAKER Behavior Options
>>>>>>AED_threshold=0.5 #Maximum Annotation Edit Distance allowed (bound by
>> 0
>>>>>>and 1)
>>>>>>Thanks and regards,
>>>>>>Parul Kudtarkar
>>>>>>> Wow 330,000 is a lot. a large portion of genes are likely to be
>>>>>>>partial
>>>>>>>at
>>>>>>> best.  You should seriously consider using mRNAseq to capture those
>> by
>>>>>>> using maker's est_gff option to pass in results from cufflinks or
>>>>>>>trinity.
>>>>>>>  Also I wouldn't even try to annotate contigs less than 10kb in
>> size,
>>>>>>>just
>>>>>>> have maker skip them by setting the min_contig filter in the
>> maker_opts.ctl file.
>>>>>>> Thanks,
>>>>>>> Carson
>>>>>>> On 12-11-29 7:31 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>>>>>>Thanks for the guidance Carson, total contig size is 330,611 with
>> N50
>>>>>>>> of
>>>>>>>>39.17kb. I agree we have short ESTs. So this is the possible reason
>>>>>>>> when
>>>>>>>>filtering based on AED score 0.75 there are no gene models
>>>>>>>> predicted
>> despite the model_gff file has few genes with scores less than 0.75?
>> Thanks and regards,
>>>>>>>>Parul Kudtarkar
>>>>>>>>> There are certain characteristics that are apparent in this
>> contig.
>>>>>>>>First
>>>>>>>>> it seems to be repeat rich with a very low gene density.  You
>>>>>>>>> also
>>>>>>>>>have
>>>>>>>>very short ESTs, and because of the lengths you are probably
>>>>>>>> getting
>> many
>>>>>>>>> of them to align spuriously which produces very short gene models
>> that
>>>>>>>>are
>>>>>>>>> more than likely false positives or at the very least just a
>>>>>>>>> piece
>>>>>>>>>of
>>>>>>>>>a
>>>>>>>>gene.  I would turn off est2genome as a predictor for this reason
>>>>>>>> unless
>>>>>>>>you can get longer EST assemblies (i.e. From mRNAseq).    Your
>>>>>>>> protein
>>>>>>>>alignments also seem to be few and far between.  You probably need
>> to
>>>>>>>>add
>>>>>>>>> more proteins from a couple of related species, and you might
>> consider
>>>>>>>>using protein2genome rather than est2genome as a predictor if you
>> are
>>>>>>>>still working to generate a training set. Also est2genome produced
>> models
>>>>>>>>> almost always have an AED score near 0 so mixing est2genome with
>> the
>>>>>>>>AED_threshold with such limited protein support does create an
>> artificial
>>>>>>>>> bias to get back very short and incomplete models.
>>>>>>>>> How many contigs do you have in total and what is the N50 value
>> for
>>>>>>>>>the
>>>>>>>>assembly? If you have a large number of very short contigs, you
>>>>>>>> will
>>>>>>>> get
>>>>>>>>very inflated gene counts because you get genes split across
>>>>>>>> contigs
>>>>>>>> and
>>>>>>>>many contigs tend t be subtle rearrangements of other contigs just
>> assembled in a slightly different way (so you can get bits and
>> pieces
>>>>>>>> of
>>>>>>>>the same genes just rearranged).  This scenario is another
>>>>>>>> confounding
>>>>>>>>factor if using the est2genome predictor with short ESTs.  I would
>> recommend running CEGMA to get an estimate for the genome
>>>>>>>> completeness
>>>>>>>>as
>>>>>>>>> well as get an estimate of fragmentation as one of the statistics
>>>>>>>>produced
>>>>>>>>> is a percent of genes that are found complete (end to end) vs
>> those
>>>>>>>>> that
>>>>>>>>are partial.  CEGMA identifies house keeping genes that tend to be
>> shorter
>>>>>>>>> and less intron rich than other genes in the genome, so if CEGMA
>> gives
>>>>>>>>> a
>>>>>>>>high partial percentage and a low complete percentage, then this
>>>>>>>> pattern
>>>>>>>>can be expected to be even more exaggerated for other genes in the
>> genome.
>>>>>>>>> If your genome is highly fragmented or proteins do not align well
>> then
>>>>>>>>there are other strategies.  For example, some vertebrate genomes
>> end
>>>>>>>> up
>>>>>>>>having extremely fragmented assemblies (on the order of 100,000
>> contigs),
>>>>>>>>> and if they are distantly related to other annotated species few
>>>>>>>>proteins
>>>>>>>>> may align to the contigs because the introns in the alignments
>> tend
>>>>>>>>> to
>>>>>>>>be
>>>>>>>>> so long and exons so short that it pushes down the significance
>> scores
>>>>>>>>too
>>>>>>>>> much.  In those cases heavy mRNAseq seems to be the best if not
>> only
>>>>>>>>> way
>>>>>>>>to get enough evidence to stitch gene models together.
>>>>>>>>> Thanks,
>>>>>>>>> Carson
>>>>>>>>> On 12-11-28 4:40 PM, "Parul Kudtarkar" <parulk at caltech.edu>
>>>>>>>>> wrote:
>>>>>>>>>>Dear Carson and Daniel,
>>>>>>>>>>Thanks. I ran sample file for filtering genes based on AED score.
>> The
>>>>>>>>input gff3 file was provided to option model_pred(see attached file
>> Scaffold1.gff), the cutoff AED score was set to 0.75. There are at
>>>>>>>> least
>>>>>>>>>> 5
>>>>>>>>>>genes with AED score less than 0.75. However there were no genes
>>>>>>>>>> predicted
>>>>>>>>>>in the output file(see attached file Scaffold1_out). I have also
>>>>>>>>attached
>>>>>>>>>>the maker_opts.ctl. Could you please advice on this.
>>>>>>>>>>Thanks and regards,
>>>>>>>>>>Parul Kudtarkar
>>>>>>>>>>> Use the AED_threshold option in the maker_opts.ctl file if you
>>>>>>>>>>>just
>>>>>>>>>>>want
>>>>>>>>>>> to restrict final gene models to close matches directly within
>>>>>>>>>>>maker.
>>>>>>>>>>>On
>>>>>>>>>>> the other hand, if you are trying to build a dataset for
>> training
>>>>>>>>>>> gene
>>>>>>>>predictors, use the maker2zff script for generating a filtered
>>>>>>>> dataset
>>>>>>>>>>>for
>>>>>>>>>>> SNAP training.  There are a number of filters available. Just
>> call
>>>>>>>>>>> the
>>>>>>>>script once without parameters to see the options.
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Carson
>>>>>>>>>>> On 12-11-27 5:55 PM, "Daniel Ence" <dence at genetics.utah.edu>
>>>>>>>>>>>wrote:
>>>>>>>>>>>>Hi Parul,
>>>>>>>>>>>>I think the way you described (with the maker_opts.ctl file) is
>> how
>>>>>>>>you
>>>>>>>>>>>>want to proceed. You still need to give the genome too.
>>>>>>>>>>>>Daniel
>>>>>>>>>>>>Daniel Ence
>>>>>>>>>>>>Graduate Student
>>>>>>>>>>>>Eccles Institute of Human Genetics
>>>>>>>>>>>>University of Utah
>>>>>>>>>>>>15 North 2030 East, Room 2100
>>>>>>>>>>>>Salt Lake City, UT 84112-5330
>>>>>>>>>>>>________________________________________
>>>>>>>>>>>>From: maker-devel-bounces at yandell-lab.org
>>>>>>>>>>>>[maker-devel-bounces at yandell-lab.org] on behalf of Parul
>>>>>>>>>>>> Kudtarkar
>>>>>>>>[parulk at caltech.edu]
>>>>>>>>>>>>Sent: Tuesday, November 27, 2012 3:41 PM
>>>>>>>>>>>>To: Parul Kudtarkar
>>>>>>>>>>>>Cc: maker-devel at yandell-lab.org
>>>>>>>>>>>>Subject: Re: [maker-devel] AED score
>>>>>>>>>>>>Also, are there any other parameters that are required when
>> filtering
>>>>>>>>based on AED score?
>>>>>>>>>>>>> Hello Carson,
>>>>>>>>>>>>> Just to confirm, Is there a script that would filter gene
>> models
>>>>>>>>>>>>>at
>>>>>>>>specific AED score.
>>>>>>>>>>>>> Alternatively if I were to do this within maker with regards
>> to
>>>>>>>>>>>>>parameters
>>>>>>>>>>>>> in maker_opts.ctl file I would have to provide my predicted
>>>>>>>>>>>>>genes
>>>>>>>>>>>>>gff3
>>>>>>>>>>>>> file to model_gff and  set AED_threshold at desired
>>>>>>>>>>>>> threshold?
>>>>>>>>Thanks and regards,
>>>>>>>>>>>>> Parul Kudtarkar
>>>>>>>>>>>>>> AED score with 1 are the ones you don't want.  0 is best and
>> 1
>>>>>>>>>>>>>> is
>>>>>>>>worst
>>>>>>>>>>>>>> as
>>>>>>>>>>>>>> it is a distance metric.  You can use the AED_threshold
>> parameter
>>>>>>>>to
>>>>>>>>>>>>>> require better matching to the evidence by setting it closer
>> to
>>>>>>>>>>>>>>0.
>>>>>>>>>>>>>>You
>>>>>>>>>>>>>> can
>>>>>>>>>>>>>> also try to increase protein homology evidence as some of
>> your
>>>>>>>>calls
>>>>>>>>>>>>>>may
>>>>>>>>>>>>>> be split genes due to lack of evidence linking them.
>>>>>>>>>>>>>> --Carson
>>>>>>>>>>>>>> On 12-11-26 4:35 PM, "Parul Kudtarkar" <parulk at caltech.edu>
>>>>>>>>>>>>>>wrote:
>>>>>>>>>>>>>>>Dear Maker community,
>>>>>>>>>>>>>>>For gene-prediction I get training data-set from evidence
>>>>>>>>>>>>>>> based
>>>>>>>>prediction, I use this data-set to train SNAP as well as Augustus
>> predictions, followed by boot-strapping. I would typically expect
>> 20-30K
>>>>>>>>>>>>>>>genes however I am getting 8 times the expected gene count
>>>>>>>>>>>>>>> indicating
>>>>>>>>>>>>>>> too
>>>>>>>>>>>>>>>many false positives. Is there a way to further refine these
>>>>>>>>predication/script to retain predictions with AED score 1 and if
>>>>>>>> yes
>>>>>>>>>>>>>>>how
>>>>>>>>>>>>>>>to go about this?
>>>>>>>>>>>>>>>Thanks and regards,
>>>>>>>>>>>>>>>Parul Kudtarkar
>>>>>>>>>>>>>>>--
>>>>>>>>>>>>>>>Scientific Programmer
>>>>>>>>>>>>>>>Center for Computational Regulatory Genomics
>>>>>>>>>>>>>>>Beckman Institute,
>>>>>>>>>>>>>>>California Institute of Technology
>>>>>>>>>>>>>>>http://www.spbase.org
>>>>>>>>>>>>>>>_______________________________________________
>>>>>>>>>>>>>>>maker-devel mailing list
>>>>>>>>>>>>>>>maker-devel at box290.bluehost.com
>>>>>>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell
>> -l
>>>>>>>>>>>>>>>ab
>>>>>>>>>>>>>>>.o
>>>>>>>>rg
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Scientific Programmer
>>>>>>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>>>>>>> Beckman Institute,
>>>>>>>>>>>>> California Institute of Technology
>>>>>>>>>>>>> http://www.spbase.org
>>>>>>>>>>>>--
>>>>>>>>>>>>Scientific Programmer
>>>>>>>>>>>>Center for Computational Regulatory Genomics
>>>>>>>>>>>>Beckman Institute,
>>>>>>>>>>>>California Institute of Technology
>>>>>>>>>>>>http://www.spbase.org
>>>>>>>>>>>>_______________________________________________
>>>>>>>>>>>>maker-devel mailing list
>>>>>>>>>>>>maker-devel at box290.bluehost.com
>>>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-la
>> b.
>>>>>>>>>>>>or
>>>>>>>>>>>>g
>>>>>>>>_______________________________________________
>>>>>>>>>>>>maker-devel mailing list
>>>>>>>>>>>>maker-devel at box290.bluehost.com
>>>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-la
>> b.
>>>>>>>>>>>>or
>>>>>>>>>>>>g
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> maker-devel mailing list
>>>>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab
>> .o
>>>>>>>>>>>rg
>>>>>>>>>>--
>>>>>>>>>>Scientific Programmer
>>>>>>>>>>Center for Computational Regulatory Genomics
>>>>>>>>>>Beckman Institute,
>>>>>>>>>>California Institute of Technology
>>>>>>>>>>http://www.spbase.org
>>>>>>>>--
>>>>>>>>Scientific Programmer
>>>>>>>>Center for Computational Regulatory Genomics
>>>>>>>>Beckman Institute,
>>>>>>>>California Institute of Technology
>>>>>>>>http://www.spbase.org
>>>>>>--
>>>>>>Scientific Programmer
>>>>>>Center for Computational Regulatory Genomics
>>>>>>Beckman Institute,
>>>>>>California Institute of Technology
>>>>>>http://www.spbase.org
>>>>--
>>>>Scientific Programmer
>>>>Center for Computational Regulatory Genomics
>>>>Beckman Institute,
>>>>California Institute of Technology
>>>>http://www.spbase.org
>>
>>
>> --
>> Scientific Programmer
>> Center for Computational Regulatory Genomics
>> Beckman Institute,
>> California Institute of Technology
>> http://www.spbase.org
>>
>> --
>> Scientific Programmer
>> Center for Computational Regulatory Genomics
>> Beckman Institute,
>> California Institute of Technology
>> http://www.spbase.org
>>
>>
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>
>
>
> --
> Scientific Programmer
> Center for Computational Regulatory Genomics
> Beckman Institute,
> California Institute of Technology
> http://www.spbase.org
>
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>

--
Scientific Programmer
Center for Computational Regulatory Genomics
Beckman Institute,
California Institute of Technology
http://www.spbase.org