[maker-devel] AED score

Fri Dec 7 18:09:53 MST 2012

Sorry for not checking this before, just to answer my own question I just
found after reading augustus documentation that autoAugTrain.pl should
work for generating training dataset for Augustus from using the
gene-model hints from first pass run of maker and using
http://sourceforge.net/mailarchive/message.php?msg_id=29361270 script to
generate training.gb

autoAugTrain.pl [OPTIONS] --species=sname --trainingset=training.gb

I believe should work better than using lamprey to train augustus.

Thanks and regards,
Parul Kudtarkar

> Dear Carson,
>
> Thanks for detailed explanation and help. Now we know the exact parameters
> that should help us with generating good gene-model.
>
> The genome for which we are working on gene predictions is
> Echinoderm(sea-urchin). lamprey was the closest organism for training
> augustus that I could find. A quick question for training augustus, there
> is augustus_species option how would you go from training data generated
> by zff2augustus_gbk.pl > train.gb as specified here
> http://sourceforge.net/mailarchive/message.php?msg_id=29361270 to
> generating the species folder that can be specified to augustus_species
> option.
>
> The average expected gene-length in our case is ~15kb.
> We also have a good repeat library for our genome.
>
> Thanks and regards,
> Parul Kudtarkar
>
>> For step 2 and 3, use lamprey rather than human in augustus.  For some
>> reason the current version of augustus doesn't diplay it as an option
>> for
>> the species help menu (so I didn't see it), but it's there in
>> �/augustus/config/species/lamprey
>>
>> For any additional training, make copies into
>> �/augustus/config/species/lamprey2 and
>> �/augustus/config/species/lamprey3
>>
>> Thanks,
>> Carson
>>
>>
>>
>> From:  Carson Holt <carsonhh at gmail.com>
>> Date:  Friday, 7 December, 2012 10:22 AM
>> To:  Daniel Ence <dence at genetics.utah.edu>,
>> "maker-devel at yandell-lab.org"
>> <maker-devel at yandell-lab.org>, Parul Kudtarkar <parulk at caltech.edu>
>> Subject:  Re: [maker-devel] AED score
>>
>> Just to add to Daniels comments.
>>
>> Things to change in step 1:
>>> protein= <just try and add more protein evidence in general>
>>> est2genome=0
>>> protein2genome=1
>>> split_hit=20000
>>> min_contig=50000
>>
>> Reasoning:
>>> Your ESTs are very short especially if this is a lamprey species which
>>> have
>>> very long introns and really short exons.  In lamprey (i.e. Petromyzon
>>> marinus), genes tend to be very long (remember gene lengths include
>>> introns
>>> and UTR and is not just the size of the coding sequence), so contigs
>>> shorter
>>> than 50kb are useless for training as you are unlikely to get nice
>>> complete
>>> gene models on those.  Also lampreys have very long introns, so you
>>> have
>>> to
>>> allow for bigger introns in alignments (split_hit parameter).  Finally
>>> add as
>>> much protein evidence from as many sources as possible.  Your maker
>>> training
>>> run will take a long time as proteins take forever to align, but
>>> because
>>> of
>>> the evolutionary distance of lamprey from everything else and the short
>>> exon
>>> structure of its genome, very little aligns directly to its genome from
>>> other
>>> deuterstome and vertebrate species.  I'm assuming this is a lamprey
>>> species
>>> because of what you said about the  augustus species file you are
>>> using.
>>> Really the only thing closely related to lampreys unfortunately are
>>> other
>>> lampreys.  Lancelets, hagfish, and sharks are not closely related to
>>> lamprey
>>> (while they branch closely together on the tree of life, there are too
>>> many
>>> years since the last common ancestor).  So while they may have similar
>>> issues
>>> related to annotation (long introns and short exons etc.) they will not
>>> really
>>> match that well for the gene predictor or even protein alignments.
>>
>> Additional note:
>>> I have training files for the lamprey species Petromyzon marinus for
>>> both
>>> Augustus and SNAP that I could share with you in a few week, when the
>>> genome
>>> publication is is released.  But before that happens, new gene models
>>> will be
>>> available  through the UCSC browser (hopefully within a couple of
>>> weeks), and
>>> gene models are already available through ENSEMBL.  Get those protein
>>> files
>>> for training, it may be a big help for you.  If you want early access
>>> to
>>> the
>>> lamprey training files for Augustus and SNAP, you would have to request
>>> it
>>> from Weiming Li at Michigan State University (the head of that genome
>>> project).
>>>
>>
>>
>> Things to change in step 2:
>>> Optimally you would be doing de novo training using mRNAseq results,
>>> but
>>> with
>>> on;ly sparse protein alignments and such a fragmented assembly, you are
>>> probably better off just trying to adapt the human HMM files.  They
>>> won't
>>> match that well, but you probably won't have the evidence for De Novo
>>> training.  First make a copy of the augustus human species directory
>>> and
>>> rename it to lamprey (cp -R  �/augustus/config/species/human
>>> �/augustus/config/species/lamprey).  Use it as the base species for
>>> retraining
>>> augustus using your new models.  You will have to edit multiple files
>>> in
>>> the
>>> directory after you copy it so that they no longer say human or homo
>>> sapiens
>>> internally or in the file name.  Use maker2zff to generate the filtered
>>> ZFF
>>> file for training SNAP, but don't train SNAP.  Rather use the training
>>> file to
>>> better train Augustus info here (just ignore the CEGMA part) -->
>>> http://sourceforge.net/mailarchive/message.php?msg_id=29361270
>>>
>>> MAKE a backup of the step 1 maker output directory and run step 2 in
>>> the
>>> old
>>> step 1 directory (this allows you to change the parameters and reuse
>>> files
>>> form step 1 so you don't have to recalculate all the protein and EST
>>> alignments).  So control files for step 2 are identical to step 1
>>> except
>>> for
>>> these parameters.
>>>
>>> protein2genome=0
>>> augustus_species=lamprey
>>> snaphmm=lamprey.hmm #optional if you decide to use SNAP
>>>
>>> Don't both training SNAP here as you probably won't have enough data
>>> and
>>> you
>>> assembly is too fragmented for it to work well, so just stick to
>>> augustus.
>>> Try SNAP if you want just to see how well it works.  Manually open up
>>> the
>>> largest contigs in a viewer to look at the models produced from the
>>> MAKER run
>>> to see if they look reasonable (this will also help you decide whether
>>> to keep
>>> SNAP).
>>>
>>
>> Things to change in step 3:
>>> Step 3 should just be a clone of step 2 as it is bootstrapping.  But
>>> make
>>> copies of �/augustus/config/species/lamprey and save it to
>>> �/augustus/config/species/lamprey2 (editing all the files and names as
>>> you did
>>> in step 2). This way you don't loose that training data if you decide
>>> to
>>> step
>>> back.  Also Give your SNAP HMM a new name (I.e. lamprey2.hmm)
>>>
>>> augustus_species=lamprey2
>>> snaphmm=lamprey2.hmm  #optional if you decide to use SNAP
>>>
>>> Make a backup of Step 2 and run step 3 in the old Step 2 directory
>>> (This
>>> is
>>> for file reuse, so the step will run fast).  This must be the exact
>>> same
>>> step
>>> directory as step 2 for the reuse trick to work.
>>>
>>> Manually review the models and if you are satisfied move to step 4.
>>> Also note
>>> that most parameters including the protein, EST, and repeats should not
>>> change
>>> from step1-step3, and should not be removed for step 4 either, you can
>>> add
>>> more evidence, but don't remove evidence (like the repeats).
>>>
>>>
>> Things to change in step 4:
>>> For this step, just set min_contig=10000 and rerun MAKER inside the
>>> step
>>> 3
>>> directory to get the smaller contigs annotated.  This should be your
>>> final
>>> step, although you can try altering other parameters or adding more
>>> evidence
>>> sources here etc.
>>>
>>>
>> For other things to keep in mind, you should consider taking extra time
>> to
>> build a comprehensive library of repeats for your species, I know that
>> Petromyzon marinus was virtually unannotatable until we had a very deep
>> repeat library built for it.  Any repeat library should be used in all
>> steps.  Also for lamprey, you will expect between 20,000-30,000 genes
>> both
>> because of your assemblies fragmentation and because of some ancestral
>> genome duplication.  Also be aware that lampreys appear to undergo
>> programmed genome loss in somatic tissues, so any gene count you get is
>> only
>> going to represent a maximum of 75-80% of all genes unless the assembly
>> is
>> derived from germline tissue.
>>
>> Thanks,
>> Carson
>>
>>
>>>
>> On 12-12-06 2:17 PM, "Daniel Ence" <dence at genetics.utah.edu> wrote:
>>
>>> Hi Parul,
>>>
>>> In step one, if protein2genome isn't turned on, then the protein
>>> alignments
>>> wont be used to generate gene models. Carson said before to use
>>> protein2genome
>>> to generate gene models for your trainings set, but you should know
>>> that
>>> those
>>> data aren't being used right now.
>>>
>>> In step 2, you can include the est and protein evidence that you used
>>> in
>>> step
>>> one, and the ab-initio predictors will takes those data as hints to
>>> guide
>>> their predictions. Also, are you reannotating human here? I'm just
>>> wondering
>>> why the snap file is called Pult, but the augustus species model is
>>> human.
>>> Also, I think you should include the same masking files from step 1,
>>> otherwise
>>> the ab-initio predictors will be predicting on the unmasked sequence
>>> which
>>> will give you many spurious predictions.
>>>
>>> In step 3, I'm not certain what you mean by boot-strapping. The control
>>> file
>>> you sent for step 3 will just pass-through all of the data from before.
>>>
>>> In step 4, I think what you'll get is pretty close to what you already
>>> had in
>>> step 3. The gene models from step 3 are already based on the est and
>>> protein
>>> data from step 1, so without giving any different evidence or ab-initio
>>> predictors, I think that you will just get the same gene models that
>>> you
>>> got
>>> from step 3. Those gene models are what you should be using to train
>>> snap.
>>> Also, is this the same genome as in the other steps? The genome file
>>> here is
>>> called genome.linear.fa, but before it was called genome.fa.
>>>
>>> Step 5. So maker uses both evidence and ab-initio predictions to create
>>> models. If you don't give any evidence (EST or protein) maker will not
>>> annotate any gene models. You have also changed the augustus species
>>> model in
>>> this step, but I don't know what you've gained by going from one
>>> species
>>> to
>>> another. You should be training augustus with the gene models and then
>>> creating a new species model in augustus. As I understand it, the
>>> filter
>>> only
>>> operates on gene models, not ab-initio predictions or alignments, so it
>>> probably isn't doing anything the way you have it set.
>>>
>>> Step 6. I think step 5 and 6 should be combined. I don't know what you
>>> mean by
>>> boot-strapping here too.
>>>
>>> Hopefully that clears up some of the confusion with how maker works.
>>> Carson
>>> will probably have a lot of suggestions too.
>>>
>>> Thanks,
>>> Daniel
>>>
>>> Daniel Ence
>>> Graduate Student
>>> Eccles Institute of Human Genetics
>>> University of Utah
>>> 15 North 2030 East, Room 2100
>>> Salt Lake City, UT 84112-5330
>>> ________________________________________
>>> From: Parul Kudtarkar [parulk at caltech.edu]
>>> Sent: Thursday, December 06, 2012 10:08 AM
>>> To: carsonhh at gmail.com
>>> Cc: Daniel Ence; maker-devel at yandell-lab.org
>>> Subject: Re: [maker-devel] AED score
>>>
>>> Hi Carson,
>>>
>>> Once again thanks for a quick response. We expect to get 25,000 to
>>> 30,000
>>> genes.
>>> Here are the step
>>> Step1. EST and proteins are used to predict gene-models. the resulting
>>> file is used as training data-set for SNAP in step2. est2genome is
>>> turned
>>> ON
>>> Step2. Augustus and SNAP is used to predict genes.
>>> Step3. The results are re-annoated(boot-strap)
>>> Step4. EST and proteins are used to predict gene-models. The gff file
>>> from
>>> step3. is used as model_gff. The resulting file is used as training
>>> data-set for SNAP in step2
>>> Step 5. Augustus and SNAP is used to predict genes. with AED set to 0.5
>>> Step6. The results are re-annoated(boot-strap) with AED set to 0.5 and
>>> contig_size to 10kb
>>>
>>> Thanks and regards,
>>> Parul Kudtarkar
>>>
>>> I have attached the configuration file for each step.
>>>>  Your output has no genes.  They've all been filtered out.  The gene
>>> predictions are left for reference purposes, but there are no gene
>>> models
>>>>  in the file.  You need to look at the type columns in the GFF3 file
>>>> -->
>>> match/match_part features are evidence and reference data but not
>>> models.
>>>>  All models will have types gene/mRNA/exon/CDS.  For example if you
>>>> just
>>> want gene models in your file you can use the gff3_merge script with
>>> the
>>> -g option, and it will only print out the gene models.
>>>>  I think you may be misinterpreting what is happening at different
>>>> steps,
>>> as well as how to read the result files.  Could you give me a detailed
>>> explanation of what you expect to get back together with your control
>>> files and I can walk you through the configuration, and indicate what
>>> to
>>> expect.  Also what was the report like from CEGMA. Could you include
>>> the
>>> report file that shows how complete your genome is and how fragmented
>>> it
>>> is?
>>>>  Thanks,
>>>>  Carson
>>>>  On 12-12-05 3:03 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>>> Dear Carson,
>>>>> Thanks for a quick response. keep_preds is set to 0
>>>>> Though for previous step(ab-intio gene predictions keep_preds was set
>>>>> to
>>> 1, see Scaffold1_input.gff). I have also attached the output file
>>> Scaffold1_out.gff.
>>>>> Please advice.
>>>>> Thanks and regards,
>>>>> Parul Kudtarkar
>>>>>>  You should always get all predictions, but should only get models
>>> (match/match_part) with AED scores less than 0.5.  You should never get
>>>>>>  models (gene/mRNA/CDS) with an AED score of 1.00 unless you have
>>> keep_preds set.
>>>>>>  Do you have keeps_preds set or gene models with AED values above
>>>>>> your
>>> threshold (not match/match_part features)?  If the first one is the
>>>>>> case,
>>>>>>  just set it to 0.  If the second one in the case, could you send me
>>> example GFF3?
>>>>>>  Thanks,
>>>>>>  Carson
>>>>>>  On 12-12-04 7:27 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>>>>> Dear Carson,
>>>>>>> Thanks once again, we have limited experimental data with very
>>>>>>> short
>>>>>>>  ESTs.
>>>>>>> CEGMA is useful for us to gauge our gene-model.
>>>>>>> On a different note to re-annotate genome(post evidence based
>>>>>>> prediction(used as training dataset)and abinitio gene-prediction).
>>> Here
>>>>>>> are the control parameters I am using with AED score set to 0.5,
>>>>>>>  however
>>>>>>>  I
>>>>>>> get predictions that includes the ones with AED score of 1.00 in
>>>>>>>  resulting
>>>>>>> gff3 file. Though I do see the number of genes reduced to 1/3 of
>>>>>>>  initial
>>>>>>> gff3 file.
>>>>>>> #-----Genome (Required for De-Novo Annotation)
>>>>>>> genome=Scaffold1.fa #genome sequence file in fasta format
>>>>>>> organism_type=eukaryotic #eukaryotic or prokaryotic. Default is
>>>>>>>  eukaryotic
>>>>>>> #-----Re-annotation Using MAKER Derived GFF3
>>>>>>> genome_gff= Scaffold1.gff #re-annotate genome based on this gff3
>>>>>>> file
>>> est_pass=1 #use ests in genome_gff: 1 = yes, 0 = no
>>>>>>> altest_pass=0 #use alternate organism ests in genome_gff: 1 = yes,
>>>>>>> 0
>>>>>>> =
>>> no
>>>>>>> protein_pass=1 #use proteins in genome_gff: 1 = yes, 0 = no
>>>>>>> rm_pass=0 #use repeats in genome_gff: 1 = yes, 0 = no
>>>>>>> model_pass=1 #use gene models in genome_gff: 1 = yes, 0 = no
>>>>>>> pred_pass=1 #use ab-initio predictions in genome_gff: 1 = yes, 0 =
>>>>>>> no
>>> other_pass=0 #passthrough everything else in genome_gff: 1 = yes, 0 =
>>>>>>>  no
>>>>>>> #-----MAKER Behavior Options
>>>>>>> AED_threshold=0.5 #Maximum Annotation Edit Distance allowed (bound
>>>>>>> by
>>> 0
>>>>>>> and 1)
>>>>>>> Thanks and regards,
>>>>>>> Parul Kudtarkar
>>>>>>>>  Wow 330,000 is a lot. a large portion of genes are likely to be
>>>>>>>> partial
>>>>>>>> at
>>>>>>>>  best.  You should seriously consider using mRNAseq to capture
>>>>>>>> those
>>> by
>>>>>>>>  using maker's est_gff option to pass in results from cufflinks or
>>>>>>>> trinity.
>>>>>>>>   Also I wouldn't even try to annotate contigs less than 10kb in
>>> size,
>>>>>>>> just
>>>>>>>>  have maker skip them by setting the min_contig filter in the
>>> maker_opts.ctl file.
>>>>>>>>  Thanks,
>>>>>>>>  Carson
>>>>>>>>  On 12-11-29 7:31 PM, "Parul Kudtarkar" <parulk at caltech.edu>
>>>>>>>> wrote:
>>>>>>>>> Thanks for the guidance Carson, total contig size is 330,611 with
>>> N50
>>>>>>>>>  of
>>>>>>>>> 39.17kb. I agree we have short ESTs. So this is the possible
>>>>>>>>> reason
>>>>>>>>>  when
>>>>>>>>> filtering based on AED score 0.75 there are no gene models
>>>>>>>>> predicted
>>> despite the model_gff file has few genes with scores less than 0.75?
>>> Thanks and regards,
>>>>>>>>> Parul Kudtarkar
>>>>>>>>>  There are certain characteristics that are apparent in this
>>> contig.
>>>>>>>>> First
>>>>>>>>>  it seems to be repeat rich with a very low gene density.  You
>>>>>>>>> also
>>>>>>>>> have
>>>>>>>>> very short ESTs, and because of the lengths you are probably
>>>>>>>>> getting
>>> many
>>>>>>>>>  of them to align spuriously which produces very short gene
>>>>>>>>> models
>>> that
>>>>>>>>> are
>>>>>>>>>  more than likely false positives or at the very least just a
>>>>>>>>> piece
>>>>>>>>> of
>>>>>>>>> a
>>>>>>>>> gene.  I would turn off est2genome as a predictor for this reason
>>>>>>>>>  unless
>>>>>>>>> you can get longer EST assemblies (i.e. From mRNAseq).    Your
>>>>>>>>>  protein
>>>>>>>>> alignments also seem to be few and far between.  You probably
>>>>>>>>> need
>>> to
>>>>>>>>> add
>>>>>>>>>  more proteins from a couple of related species, and you might
>>> consider
>>>>>>>>> using protein2genome rather than est2genome as a predictor if you
>>> are
>>>>>>>>> still working to generate a training set. Also est2genome
>>>>>>>>> produced
>>> models
>>>>>>>>>  almost always have an AED score near 0 so mixing est2genome with
>>> the
>>>>>>>>> AED_threshold with such limited protein support does create an
>>> artificial
>>>>>>>>>  bias to get back very short and incomplete models.
>>>>>>>>>  How many contigs do you have in total and what is the N50 value
>>> for
>>>>>>>>> the
>>>>>>>>> assembly? If you have a large number of very short contigs, you
>>>>>>>>> will
>>>>>>>>>  get
>>>>>>>>> very inflated gene counts because you get genes split across
>>>>>>>>> contigs
>>>>>>>>>  and
>>>>>>>>> many contigs tend t be subtle rearrangements of other contigs
>>>>>>>>> just
>>> assembled in a slightly different way (so you can get bits and
>>> pieces
>>>>>>>>>  of
>>>>>>>>> the same genes just rearranged).  This scenario is another
>>>>>>>>>  confounding
>>>>>>>>> factor if using the est2genome predictor with short ESTs.  I
>>>>>>>>> would
>>> recommend running CEGMA to get an estimate for the genome
>>>>>>>>>  completeness
>>>>>>>>> as
>>>>>>>>>  well as get an estimate of fragmentation as one of the
>>>>>>>>> statistics
>>>>>>>>> produced
>>>>>>>>>  is a percent of genes that are found complete (end to end) vs
>>> those
>>>>>>>>>  that
>>>>>>>>> are partial.  CEGMA identifies house keeping genes that tend to
>>>>>>>>> be
>>> shorter
>>>>>>>>>  and less intron rich than other genes in the genome, so if CEGMA
>>> gives
>>>>>>>>>  a
>>>>>>>>> high partial percentage and a low complete percentage, then this
>>>>>>>>>  pattern
>>>>>>>>> can be expected to be even more exaggerated for other genes in
>>>>>>>>> the
>>> genome.
>>>>>>>>>  If your genome is highly fragmented or proteins do not align
>>>>>>>>> well
>>> then
>>>>>>>>> there are other strategies.  For example, some vertebrate genomes
>>> end
>>>>>>>>>  up
>>>>>>>>> having extremely fragmented assemblies (on the order of 100,000
>>> contigs),
>>>>>>>>>  and if they are distantly related to other annotated species few
>>>>>>>>> proteins
>>>>>>>>>  may align to the contigs because the introns in the alignments
>>> tend
>>>>>>>>>  to
>>>>>>>>> be
>>>>>>>>>  so long and exons so short that it pushes down the significance
>>> scores
>>>>>>>>> too
>>>>>>>>>  much.  In those cases heavy mRNAseq seems to be the best if not
>>> only
>>>>>>>>>  way
>>>>>>>>> to get enough evidence to stitch gene models together.
>>>>>>>>>  Thanks,
>>>>>>>>>  Carson
>>>>>>>>>  On 12-11-28 4:40 PM, "Parul Kudtarkar" <parulk at caltech.edu>
>>>>>>>>> wrote:
>>>>>>>>> Dear Carson and Daniel,
>>>>>>>>> Thanks. I ran sample file for filtering genes based on AED score.
>>> The
>>>>>>>>> input gff3 file was provided to option model_pred(see attached
>>>>>>>>> file
>>> Scaffold1.gff), the cutoff AED score was set to 0.75. There are at
>>>>>>>>>  least
>>>>>>>>>  5
>>>>>>>>> genes with AED score less than 0.75. However there were no genes
>>>>>>>>>  predicted
>>>>>>>>> in the output file(see attached file Scaffold1_out). I have also
>>>>>>>>> attached
>>>>>>>>> the maker_opts.ctl. Could you please advice on this.
>>>>>>>>> Thanks and regards,
>>>>>>>>> Parul Kudtarkar
>>>>>>>>>  Use the AED_threshold option in the maker_opts.ctl file if you
>>>>>>>>> just
>>>>>>>>> want
>>>>>>>>>  to restrict final gene models to close matches directly within
>>>>>>>>> maker.
>>>>>>>>> On
>>>>>>>>>  the other hand, if you are trying to build a dataset for
>>> training
>>>>>>>>>  gene
>>>>>>>>> predictors, use the maker2zff script for generating a filtered
>>>>>>>>>  dataset
>>>>>>>>> for
>>>>>>>>>  SNAP training.  There are a number of filters available. Just
>>> call
>>>>>>>>>  the
>>>>>>>>> script once without parameters to see the options.
>>>>>>>>>  Thanks,
>>>>>>>>>  Carson
>>>>>>>>>  On 12-11-27 5:55 PM, "Daniel Ence" <dence at genetics.utah.edu>
>>>>>>>>> wrote:
>>>>>>>>> Hi Parul,
>>>>>>>>> I think the way you described (with the maker_opts.ctl file) is
>>> how
>>>>>>>>> you
>>>>>>>>> want to proceed. You still need to give the genome too.
>>>>>>>>> Daniel
>>>>>>>>> Daniel Ence
>>>>>>>>> Graduate Student
>>>>>>>>> Eccles Institute of Human Genetics
>>>>>>>>> University of Utah
>>>>>>>>> 15 North 2030 East, Room 2100
>>>>>>>>> Salt Lake City, UT 84112-5330
>>>>>>>>> ________________________________________
>>>>>>>>> From: maker-devel-bounces at yandell-lab.org
>>>>>>>>> [maker-devel-bounces at yandell-lab.org] on behalf of Parul
>>>>>>>>>  Kudtarkar
>>>>>>>>> [parulk at caltech.edu]
>>>>>>>>> Sent: Tuesday, November 27, 2012 3:41 PM
>>>>>>>>> To: Parul Kudtarkar
>>>>>>>>> Cc: maker-devel at yandell-lab.org
>>>>>>>>> Subject: Re: [maker-devel] AED score
>>>>>>>>> Also, are there any other parameters that are required when
>>> filtering
>>>>>>>>> based on AED score?
>>>>>>>>>  Hello Carson,
>>>>>>>>>  Just to confirm, Is there a script that would filter gene
>>> models
>>>>>>>>> at
>>>>>>>>> specific AED score.
>>>>>>>>>  Alternatively if I were to do this within maker with regards
>>> to
>>>>>>>>> parameters
>>>>>>>>>  in maker_opts.ctl file I would have to provide my predicted
>>>>>>>>> genes
>>>>>>>>> gff3
>>>>>>>>>  file to model_gff and  set AED_threshold at desired threshold?
>>>>>>>>> Thanks and regards,
>>>>>>>>>  Parul Kudtarkar
>>>>>>>>>  AED score with 1 are the ones you don't want.  0 is best and
>>> 1
>>>>>>>>>  is
>>>>>>>>> worst
>>>>>>>>>  as
>>>>>>>>>  it is a distance metric.  You can use the AED_threshold
>>> parameter
>>>>>>>>> to
>>>>>>>>>  require better matching to the evidence by setting it closer
>>> to
>>>>>>>>> 0.
>>>>>>>>> You
>>>>>>>>>  can
>>>>>>>>>  also try to increase protein homology evidence as some of
>>> your
>>>>>>>>> calls
>>>>>>>>> may
>>>>>>>>>  be split genes due to lack of evidence linking them.
>>>>>>>>>  --Carson
>>>>>>>>>  On 12-11-26 4:35 PM, "Parul Kudtarkar" <parulk at caltech.edu>
>>>>>>>>> wrote:
>>>>>>>>> Dear Maker community,
>>>>>>>>> For gene-prediction I get training data-set from evidence
>>>>>>>>>  based
>>>>>>>>> prediction, I use this data-set to train SNAP as well as Augustus
>>> predictions, followed by boot-strapping. I would typically expect
>>> 20-30K
>>>>>>>>> genes however I am getting 8 times the expected gene count
>>>>>>>>>  indicating
>>>>>>>>>  too
>>>>>>>>> many false positives. Is there a way to further refine these
>>>>>>>>> predication/script to retain predictions with AED score 1 and if
>>>>>>>>> yes
>>>>>>>>> how
>>>>>>>>> to go about this?
>>>>>>>>> Thanks and regards,
>>>>>>>>> Parul Kudtarkar
>>>>>>>>> --
>>>>>>>>> Scientific Programmer
>>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>>> Beckman Institute,
>>>>>>>>> California Institute of Technology
>>>>>>>>> http://www.spbase.org
>>>>>>>>> _______________________________________________
>>>>>>>>> maker-devel mailing list
>>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell
>>> -l
>>>>>>>>> ab
>>>>>>>>> .o
>>>>>>>>> rg
>>>>>>>>>  --
>>>>>>>>>  Scientific Programmer
>>>>>>>>>  Center for Computational Regulatory Genomics
>>>>>>>>>  Beckman Institute,
>>>>>>>>>  California Institute of Technology
>>>>>>>>>  http://www.spbase.org
>>>>>>>>> --
>>>>>>>>> Scientific Programmer
>>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>>> Beckman Institute,
>>>>>>>>> California Institute of Technology
>>>>>>>>> http://www.spbase.org
>>>>>>>>> _______________________________________________
>>>>>>>>> maker-devel mailing list
>>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-la
>>> b.
>>>>>>>>> or
>>>>>>>>> g
>>>>>>>>> _______________________________________________
>>>>>>>>> maker-devel mailing list
>>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-la
>>> b.
>>>>>>>>> or
>>>>>>>>> g
>>>>>>>>>  _______________________________________________
>>>>>>>>>  maker-devel mailing list
>>>>>>>>>  maker-devel at box290.bluehost.com
>>>>>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab
>>> .o
>>>>>>>>> rg
>>>>>>>>> --
>>>>>>>>> Scientific Programmer
>>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>>> Beckman Institute,
>>>>>>>>> California Institute of Technology
>>>>>>>>> http://www.spbase.org
>>>>>>>>> --
>>>>>>>>> Scientific Programmer
>>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>>> Beckman Institute,
>>>>>>>>> California Institute of Technology
>>>>>>>>> http://www.spbase.org
>>>>>>> --
>>>>>>> Scientific Programmer
>>>>>>> Center for Computational Regulatory Genomics
>>>>>>> Beckman Institute,
>>>>>>> California Institute of Technology
>>>>>>> http://www.spbase.org
>>>>> --
>>>>> Scientific Programmer
>>>>> Center for Computational Regulatory Genomics
>>>>> Beckman Institute,
>>>>> California Institute of Technology
>>>>> http://www.spbase.org
>>>
>>>
>>> --
>>> Scientific Programmer
>>> Center for Computational Regulatory Genomics
>>> Beckman Institute,
>>> California Institute of Technology
>>> http://www.spbase.org
>>>
>>> --
>>> Scientific Programmer
>>> Center for Computational Regulatory Genomics
>>> Beckman Institute,
>>> California Institute of Technology
>>> http://www.spbase.org
>>>
>>>
>>>
>>
>>
>>
>
>
> --
> Scientific Programmer
> Center for Computational Regulatory Genomics
> Beckman Institute,
> California Institute of Technology
> http://www.spbase.org
>

--
Scientific Programmer
Center for Computational Regulatory Genomics
Beckman Institute,
California Institute of Technology
http://www.spbase.org