[maker-devel] AED score

Thu Dec 6 10:08:23 MST 2012

Hi Carson,

Once again thanks for a quick response. We expect to get 25,000 to 30,000
genes.
Here are the step
Step1. EST and proteins are used to predict gene-models. the resulting
file is used as training data-set for SNAP in step2. est2genome is turned
ON
Step2. Augustus and SNAP is used to predict genes.
Step3. The results are re-annoated(boot-strap)
Step4. EST and proteins are used to predict gene-models. The gff file from
step3. is used as model_gff. The resulting file is used as training
data-set for SNAP in step2
Step 5. Augustus and SNAP is used to predict genes. with AED set to 0.5
Step6. The results are re-annoated(boot-strap) with AED set to 0.5 and
contig_size to 10kb

Thanks and regards,
Parul Kudtarkar

I have attached the configuration file for each step.
> Your output has no genes.  They've all been filtered out.  The gene
predictions are left for reference purposes, but there are no gene models
> in the file.  You need to look at the type columns in the GFF3 file -->
match/match_part features are evidence and reference data but not
models.
> All models will have types gene/mRNA/exon/CDS.  For example if you just
want gene models in your file you can use the gff3_merge script with the
-g option, and it will only print out the gene models.
> I think you may be misinterpreting what is happening at different steps,
as well as how to read the result files.  Could you give me a detailed
explanation of what you expect to get back together with your control
files and I can walk you through the configuration, and indicate what to
expect.  Also what was the report like from CEGMA. Could you include the
report file that shows how complete your genome is and how fragmented it
is?
> Thanks,
> Carson
> On 12-12-05 3:03 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>Dear Carson,
>>Thanks for a quick response. keep_preds is set to 0
>>Though for previous step(ab-intio gene predictions keep_preds was set to
1, see Scaffold1_input.gff). I have also attached the output file
Scaffold1_out.gff.
>>Please advice.
>>Thanks and regards,
>>Parul Kudtarkar
>>> You should always get all predictions, but should only get models
(match/match_part) with AED scores less than 0.5.  You should never get
>>> models (gene/mRNA/CDS) with an AED score of 1.00 unless you have
keep_preds set.
>>> Do you have keeps_preds set or gene models with AED values above your
threshold (not match/match_part features)?  If the first one is the
>>>case,
>>> just set it to 0.  If the second one in the case, could you send me
example GFF3?
>>> Thanks,
>>> Carson
>>> On 12-12-04 7:27 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>>Dear Carson,
>>>>Thanks once again, we have limited experimental data with very short
>>>> ESTs.
>>>>CEGMA is useful for us to gauge our gene-model.
>>>>On a different note to re-annotate genome(post evidence based
>>>>prediction(used as training dataset)and abinitio gene-prediction).
Here
>>>>are the control parameters I am using with AED score set to 0.5,
>>>> however
>>>> I
>>>>get predictions that includes the ones with AED score of 1.00 in
>>>> resulting
>>>>gff3 file. Though I do see the number of genes reduced to 1/3 of
>>>> initial
>>>>gff3 file.
>>>>#-----Genome (Required for De-Novo Annotation)
>>>>genome=Scaffold1.fa #genome sequence file in fasta format
>>>>organism_type=eukaryotic #eukaryotic or prokaryotic. Default is
>>>> eukaryotic
>>>>#-----Re-annotation Using MAKER Derived GFF3
>>>>genome_gff= Scaffold1.gff #re-annotate genome based on this gff3 file
est_pass=1 #use ests in genome_gff: 1 = yes, 0 = no
>>>>altest_pass=0 #use alternate organism ests in genome_gff: 1 = yes, 0 =
no
>>>>protein_pass=1 #use proteins in genome_gff: 1 = yes, 0 = no
>>>>rm_pass=0 #use repeats in genome_gff: 1 = yes, 0 = no
>>>>model_pass=1 #use gene models in genome_gff: 1 = yes, 0 = no
>>>>pred_pass=1 #use ab-initio predictions in genome_gff: 1 = yes, 0 = no
other_pass=0 #passthrough everything else in genome_gff: 1 = yes, 0 =
>>>> no
>>>>#-----MAKER Behavior Options
>>>>AED_threshold=0.5 #Maximum Annotation Edit Distance allowed (bound by
0
>>>>and 1)
>>>>Thanks and regards,
>>>>Parul Kudtarkar
>>>>> Wow 330,000 is a lot. a large portion of genes are likely to be
>>>>>partial
>>>>>at
>>>>> best.  You should seriously consider using mRNAseq to capture those
by
>>>>> using maker's est_gff option to pass in results from cufflinks or
>>>>>trinity.
>>>>>  Also I wouldn't even try to annotate contigs less than 10kb in
size,
>>>>>just
>>>>> have maker skip them by setting the min_contig filter in the
maker_opts.ctl file.
>>>>> Thanks,
>>>>> Carson
>>>>> On 12-11-29 7:31 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>>>>Thanks for the guidance Carson, total contig size is 330,611 with
N50
>>>>>> of
>>>>>>39.17kb. I agree we have short ESTs. So this is the possible reason
>>>>>> when
>>>>>>filtering based on AED score 0.75 there are no gene models predicted
despite the model_gff file has few genes with scores less than 0.75?
Thanks and regards,
>>>>>>Parul Kudtarkar
>>>>>>> There are certain characteristics that are apparent in this
contig.
>>>>>>First
>>>>>>> it seems to be repeat rich with a very low gene density.  You also
>>>>>>>have
>>>>>>very short ESTs, and because of the lengths you are probably getting
many
>>>>>>> of them to align spuriously which produces very short gene models
that
>>>>>>are
>>>>>>> more than likely false positives or at the very least just a piece
>>>>>>>of
>>>>>>>a
>>>>>>gene.  I would turn off est2genome as a predictor for this reason
>>>>>> unless
>>>>>>you can get longer EST assemblies (i.e. From mRNAseq).    Your
>>>>>> protein
>>>>>>alignments also seem to be few and far between.  You probably need
to
>>>>>>add
>>>>>>> more proteins from a couple of related species, and you might
consider
>>>>>>using protein2genome rather than est2genome as a predictor if you
are
>>>>>>still working to generate a training set. Also est2genome produced
models
>>>>>>> almost always have an AED score near 0 so mixing est2genome with
the
>>>>>>AED_threshold with such limited protein support does create an
artificial
>>>>>>> bias to get back very short and incomplete models.
>>>>>>> How many contigs do you have in total and what is the N50 value
for
>>>>>>>the
>>>>>>assembly? If you have a large number of very short contigs, you will
>>>>>> get
>>>>>>very inflated gene counts because you get genes split across contigs
>>>>>> and
>>>>>>many contigs tend t be subtle rearrangements of other contigs just
assembled in a slightly different way (so you can get bits and
pieces
>>>>>> of
>>>>>>the same genes just rearranged).  This scenario is another
>>>>>> confounding
>>>>>>factor if using the est2genome predictor with short ESTs.  I would
recommend running CEGMA to get an estimate for the genome
>>>>>> completeness
>>>>>>as
>>>>>>> well as get an estimate of fragmentation as one of the statistics
>>>>>>produced
>>>>>>> is a percent of genes that are found complete (end to end) vs
those
>>>>>>> that
>>>>>>are partial.  CEGMA identifies house keeping genes that tend to be
shorter
>>>>>>> and less intron rich than other genes in the genome, so if CEGMA
gives
>>>>>>> a
>>>>>>high partial percentage and a low complete percentage, then this
>>>>>> pattern
>>>>>>can be expected to be even more exaggerated for other genes in the
genome.
>>>>>>> If your genome is highly fragmented or proteins do not align well
then
>>>>>>there are other strategies.  For example, some vertebrate genomes
end
>>>>>> up
>>>>>>having extremely fragmented assemblies (on the order of 100,000
contigs),
>>>>>>> and if they are distantly related to other annotated species few
>>>>>>proteins
>>>>>>> may align to the contigs because the introns in the alignments
tend
>>>>>>> to
>>>>>>be
>>>>>>> so long and exons so short that it pushes down the significance
scores
>>>>>>too
>>>>>>> much.  In those cases heavy mRNAseq seems to be the best if not
only
>>>>>>> way
>>>>>>to get enough evidence to stitch gene models together.
>>>>>>> Thanks,
>>>>>>> Carson
>>>>>>> On 12-11-28 4:40 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>>>>>>Dear Carson and Daniel,
>>>>>>>>Thanks. I ran sample file for filtering genes based on AED score.
The
>>>>>>input gff3 file was provided to option model_pred(see attached file
Scaffold1.gff), the cutoff AED score was set to 0.75. There are at
>>>>>> least
>>>>>>>> 5
>>>>>>>>genes with AED score less than 0.75. However there were no genes
>>>>>>>> predicted
>>>>>>>>in the output file(see attached file Scaffold1_out). I have also
>>>>>>attached
>>>>>>>>the maker_opts.ctl. Could you please advice on this.
>>>>>>>>Thanks and regards,
>>>>>>>>Parul Kudtarkar
>>>>>>>>> Use the AED_threshold option in the maker_opts.ctl file if you
>>>>>>>>>just
>>>>>>>>>want
>>>>>>>>> to restrict final gene models to close matches directly within
>>>>>>>>>maker.
>>>>>>>>>On
>>>>>>>>> the other hand, if you are trying to build a dataset for
training
>>>>>>>>> gene
>>>>>>predictors, use the maker2zff script for generating a filtered
>>>>>> dataset
>>>>>>>>>for
>>>>>>>>> SNAP training.  There are a number of filters available. Just
call
>>>>>>>>> the
>>>>>>script once without parameters to see the options.
>>>>>>>>> Thanks,
>>>>>>>>> Carson
>>>>>>>>> On 12-11-27 5:55 PM, "Daniel Ence" <dence at genetics.utah.edu>
>>>>>>>>>wrote:
>>>>>>>>>>Hi Parul,
>>>>>>>>>>I think the way you described (with the maker_opts.ctl file) is
how
>>>>>>you
>>>>>>>>>>want to proceed. You still need to give the genome too.
>>>>>>>>>>Daniel
>>>>>>>>>>Daniel Ence
>>>>>>>>>>Graduate Student
>>>>>>>>>>Eccles Institute of Human Genetics
>>>>>>>>>>University of Utah
>>>>>>>>>>15 North 2030 East, Room 2100
>>>>>>>>>>Salt Lake City, UT 84112-5330
>>>>>>>>>>________________________________________
>>>>>>>>>>From: maker-devel-bounces at yandell-lab.org
>>>>>>>>>>[maker-devel-bounces at yandell-lab.org] on behalf of Parul
>>>>>>>>>> Kudtarkar
>>>>>>[parulk at caltech.edu]
>>>>>>>>>>Sent: Tuesday, November 27, 2012 3:41 PM
>>>>>>>>>>To: Parul Kudtarkar
>>>>>>>>>>Cc: maker-devel at yandell-lab.org
>>>>>>>>>>Subject: Re: [maker-devel] AED score
>>>>>>>>>>Also, are there any other parameters that are required when
filtering
>>>>>>based on AED score?
>>>>>>>>>>> Hello Carson,
>>>>>>>>>>> Just to confirm, Is there a script that would filter gene
models
>>>>>>>>>>>at
>>>>>>specific AED score.
>>>>>>>>>>> Alternatively if I were to do this within maker with regards
to
>>>>>>>>>>>parameters
>>>>>>>>>>> in maker_opts.ctl file I would have to provide my predicted
>>>>>>>>>>>genes
>>>>>>>>>>>gff3
>>>>>>>>>>> file to model_gff and  set AED_threshold at desired threshold?
>>>>>>Thanks and regards,
>>>>>>>>>>> Parul Kudtarkar
>>>>>>>>>>>> AED score with 1 are the ones you don't want.  0 is best and
1
>>>>>>>>>>>> is
>>>>>>worst
>>>>>>>>>>>> as
>>>>>>>>>>>> it is a distance metric.  You can use the AED_threshold
parameter
>>>>>>to
>>>>>>>>>>>> require better matching to the evidence by setting it closer
to
>>>>>>>>>>>>0.
>>>>>>>>>>>>You
>>>>>>>>>>>> can
>>>>>>>>>>>> also try to increase protein homology evidence as some of
your
>>>>>>calls
>>>>>>>>>>>>may
>>>>>>>>>>>> be split genes due to lack of evidence linking them.
>>>>>>>>>>>> --Carson
>>>>>>>>>>>> On 12-11-26 4:35 PM, "Parul Kudtarkar" <parulk at caltech.edu>
>>>>>>>>>>>>wrote:
>>>>>>>>>>>>>Dear Maker community,
>>>>>>>>>>>>>For gene-prediction I get training data-set from evidence
>>>>>>>>>>>>> based
>>>>>>prediction, I use this data-set to train SNAP as well as Augustus
predictions, followed by boot-strapping. I would typically expect
20-30K
>>>>>>>>>>>>>genes however I am getting 8 times the expected gene count
>>>>>>>>>>>>> indicating
>>>>>>>>>>>>> too
>>>>>>>>>>>>>many false positives. Is there a way to further refine these
>>>>>>predication/script to retain predictions with AED score 1 and if yes
>>>>>>>>>>>>>how
>>>>>>>>>>>>>to go about this?
>>>>>>>>>>>>>Thanks and regards,
>>>>>>>>>>>>>Parul Kudtarkar
>>>>>>>>>>>>>--
>>>>>>>>>>>>>Scientific Programmer
>>>>>>>>>>>>>Center for Computational Regulatory Genomics
>>>>>>>>>>>>>Beckman Institute,
>>>>>>>>>>>>>California Institute of Technology
>>>>>>>>>>>>>http://www.spbase.org
>>>>>>>>>>>>>_______________________________________________
>>>>>>>>>>>>>maker-devel mailing list
>>>>>>>>>>>>>maker-devel at box290.bluehost.com
>>>>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell
-l
>>>>>>>>>>>>>ab
>>>>>>>>>>>>>.o
>>>>>>rg
>>>>>>>>>>> --
>>>>>>>>>>> Scientific Programmer
>>>>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>>>>> Beckman Institute,
>>>>>>>>>>> California Institute of Technology
>>>>>>>>>>> http://www.spbase.org
>>>>>>>>>>--
>>>>>>>>>>Scientific Programmer
>>>>>>>>>>Center for Computational Regulatory Genomics
>>>>>>>>>>Beckman Institute,
>>>>>>>>>>California Institute of Technology
>>>>>>>>>>http://www.spbase.org
>>>>>>>>>>_______________________________________________
>>>>>>>>>>maker-devel mailing list
>>>>>>>>>>maker-devel at box290.bluehost.com
>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-la
b.
>>>>>>>>>>or
>>>>>>>>>>g
>>>>>>_______________________________________________
>>>>>>>>>>maker-devel mailing list
>>>>>>>>>>maker-devel at box290.bluehost.com
>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-la
b.
>>>>>>>>>>or
>>>>>>>>>>g
>>>>>>>>> _______________________________________________
>>>>>>>>> maker-devel mailing list
>>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab
.o
>>>>>>>>>rg
>>>>>>>>--
>>>>>>>>Scientific Programmer
>>>>>>>>Center for Computational Regulatory Genomics
>>>>>>>>Beckman Institute,
>>>>>>>>California Institute of Technology
>>>>>>>>http://www.spbase.org
>>>>>>--
>>>>>>Scientific Programmer
>>>>>>Center for Computational Regulatory Genomics
>>>>>>Beckman Institute,
>>>>>>California Institute of Technology
>>>>>>http://www.spbase.org
>>>>--
>>>>Scientific Programmer
>>>>Center for Computational Regulatory Genomics
>>>>Beckman Institute,
>>>>California Institute of Technology
>>>>http://www.spbase.org
>>--
>>Scientific Programmer
>>Center for Computational Regulatory Genomics
>>Beckman Institute,
>>California Institute of Technology
>>http://www.spbase.org

--
Scientific Programmer
Center for Computational Regulatory Genomics
Beckman Institute,
California Institute of Technology
http://www.spbase.org

--
Scientific Programmer
Center for Computational Regulatory Genomics
Beckman Institute,
California Institute of Technology
http://www.spbase.org

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts_step1.ctl
Type: application/octet-stream
Size: 4721 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20121206/c0d900b4/attachment-0006.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts_step2.ctl
Type: application/octet-stream
Size: 4587 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20121206/c0d900b4/attachment-0007.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts_step3.ctl
Type: application/octet-stream
Size: 4560 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20121206/c0d900b4/attachment-0008.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts_step4.ctl
Type: application/octet-stream
Size: 4632 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20121206/c0d900b4/attachment-0009.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts_step5.ctl
Type: application/octet-stream
Size: 4568 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20121206/c0d900b4/attachment-0010.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts_step6.ctl
Type: application/octet-stream
Size: 4577 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20121206/c0d900b4/attachment-0011.obj>