[maker-devel] AED score

Mon Dec 10 20:28:24 MST 2012

Dear Carson, Thanks a lot for detailed explanation and all the help with
running and understanding maker2.

> Yes.  Lamprey is not a good match for sea urchin.  When training Augusuts
> you can sometimes use other genomes as starting points and let Augustus
> then modify that species to match the models you provided, which can be
> better than de novo training under certain circumstances. But given sea
> urchin's evolutionary distance from all the other species bundled with
> augustus, its probably not a good starting point.  So protein2genome
> derived models and cegma based training are probably your best options.
>
> --Carson
>
>
> On 12-12-07 8:09 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>
>>Sorry for not checking this before, just to answer my own question I just
>>found after reading augustus documentation that autoAugTrain.pl should
>>work for generating training dataset for Augustus from using the
>>gene-model hints from first pass run of maker and using
>>http://sourceforge.net/mailarchive/message.php?msg_id=29361270 script to
>>generate training.gb
>>
>>autoAugTrain.pl [OPTIONS] --species=sname --trainingset=training.gb
>>
>>I believe should work better than using lamprey to train augustus.
>>
>>Thanks and regards,
>>Parul Kudtarkar
>>
>>> Dear Carson,
>>>
>>> Thanks for detailed explanation and help. Now we know the exact
>>>parameters
>>> that should help us with generating good gene-model.
>>>
>>> The genome for which we are working on gene predictions is
>>> Echinoderm(sea-urchin). lamprey was the closest organism for training
>>> augustus that I could find. A quick question for training augustus,
>>>there
>>> is augustus_species option how would you go from training data
>>> generated
>>> by zff2augustus_gbk.pl > train.gb as specified here
>>> http://sourceforge.net/mailarchive/message.php?msg_id=29361270 to
>>> generating the species folder that can be specified to augustus_species
>>> option.
>>>
>>> The average expected gene-length in our case is ~15kb.
>>> We also have a good repeat library for our genome.
>>>
>>> Thanks and regards,
>>> Parul Kudtarkar
>>>
>>>> For step 2 and 3, use lamprey rather than human in augustus.  For some
>>>> reason the current version of augustus doesn't diplay it as an option
>>>> for
>>>> the species help menu (so I didn't see it), but it's there in
>>>> �/augustus/config/species/lamprey
>>>>
>>>> For any additional training, make copies into
>>>> �/augustus/config/species/lamprey2 and
>>>> �/augustus/config/species/lamprey3
>>>>
>>>> Thanks,
>>>> Carson
>>>>
>>>>
>>>>
>>>> From:  Carson Holt <carsonhh at gmail.com>
>>>> Date:  Friday, 7 December, 2012 10:22 AM
>>>> To:  Daniel Ence <dence at genetics.utah.edu>,
>>>> "maker-devel at yandell-lab.org"
>>>> <maker-devel at yandell-lab.org>, Parul Kudtarkar <parulk at caltech.edu>
>>>> Subject:  Re: [maker-devel] AED score
>>>>
>>>> Just to add to Daniels comments.
>>>>
>>>> Things to change in step 1:
>>>>> protein= <just try and add more protein evidence in general>
>>>>> est2genome=0
>>>>> protein2genome=1
>>>>> split_hit=20000
>>>>> min_contig=50000
>>>>
>>>> Reasoning:
>>>>> Your ESTs are very short especially if this is a lamprey species
>>>>> which
>>>>> have
>>>>> very long introns and really short exons.  In lamprey (i.e.
>>>>> Petromyzon
>>>>> marinus), genes tend to be very long (remember gene lengths include
>>>>> introns
>>>>> and UTR and is not just the size of the coding sequence), so contigs
>>>>> shorter
>>>>> than 50kb are useless for training as you are unlikely to get nice
>>>>> complete
>>>>> gene models on those.  Also lampreys have very long introns, so you
>>>>> have
>>>>> to
>>>>> allow for bigger introns in alignments (split_hit parameter).
>>>>> Finally
>>>>> add as
>>>>> much protein evidence from as many sources as possible.  Your maker
>>>>> training
>>>>> run will take a long time as proteins take forever to align, but
>>>>> because
>>>>> of
>>>>> the evolutionary distance of lamprey from everything else and the
>>>>>short
>>>>> exon
>>>>> structure of its genome, very little aligns directly to its genome
>>>>>from
>>>>> other
>>>>> deuterstome and vertebrate species.  I'm assuming this is a lamprey
>>>>> species
>>>>> because of what you said about the  augustus species file you are
>>>>> using.
>>>>> Really the only thing closely related to lampreys unfortunately are
>>>>> other
>>>>> lampreys.  Lancelets, hagfish, and sharks are not closely related to
>>>>> lamprey
>>>>> (while they branch closely together on the tree of life, there are
>>>>> too
>>>>> many
>>>>> years since the last common ancestor).  So while they may have
>>>>> similar
>>>>> issues
>>>>> related to annotation (long introns and short exons etc.) they will
>>>>>not
>>>>> really
>>>>> match that well for the gene predictor or even protein alignments.
>>>>
>>>> Additional note:
>>>>> I have training files for the lamprey species Petromyzon marinus for
>>>>> both
>>>>> Augustus and SNAP that I could share with you in a few week, when the
>>>>> genome
>>>>> publication is is released.  But before that happens, new gene models
>>>>> will be
>>>>> available  through the UCSC browser (hopefully within a couple of
>>>>> weeks), and
>>>>> gene models are already available through ENSEMBL.  Get those protein
>>>>> files
>>>>> for training, it may be a big help for you.  If you want early access
>>>>> to
>>>>> the
>>>>> lamprey training files for Augustus and SNAP, you would have to
>>>>>request
>>>>> it
>>>>> from Weiming Li at Michigan State University (the head of that genome
>>>>> project).
>>>>>
>>>>
>>>>
>>>> Things to change in step 2:
>>>>> Optimally you would be doing de novo training using mRNAseq results,
>>>>> but
>>>>> with
>>>>> on;ly sparse protein alignments and such a fragmented assembly, you
>>>>>are
>>>>> probably better off just trying to adapt the human HMM files.  They
>>>>> won't
>>>>> match that well, but you probably won't have the evidence for De Novo
>>>>> training.  First make a copy of the augustus human species directory
>>>>> and
>>>>> rename it to lamprey (cp -R  �/augustus/config/species/human
>>>>> �/augustus/config/species/lamprey).  Use it as the base species for
>>>>> retraining
>>>>> augustus using your new models.  You will have to edit multiple files
>>>>> in
>>>>> the
>>>>> directory after you copy it so that they no longer say human or homo
>>>>> sapiens
>>>>> internally or in the file name.  Use maker2zff to generate the
>>>>>filtered
>>>>> ZFF
>>>>> file for training SNAP, but don't train SNAP.  Rather use the
>>>>> training
>>>>> file to
>>>>> better train Augustus info here (just ignore the CEGMA part) -->
>>>>> http://sourceforge.net/mailarchive/message.php?msg_id=29361270
>>>>>
>>>>> MAKE a backup of the step 1 maker output directory and run step 2 in
>>>>> the
>>>>> old
>>>>> step 1 directory (this allows you to change the parameters and reuse
>>>>> files
>>>>> form step 1 so you don't have to recalculate all the protein and EST
>>>>> alignments).  So control files for step 2 are identical to step 1
>>>>> except
>>>>> for
>>>>> these parameters.
>>>>>
>>>>> protein2genome=0
>>>>> augustus_species=lamprey
>>>>> snaphmm=lamprey.hmm #optional if you decide to use SNAP
>>>>>
>>>>> Don't both training SNAP here as you probably won't have enough data
>>>>> and
>>>>> you
>>>>> assembly is too fragmented for it to work well, so just stick to
>>>>> augustus.
>>>>> Try SNAP if you want just to see how well it works.  Manually open up
>>>>> the
>>>>> largest contigs in a viewer to look at the models produced from the
>>>>> MAKER run
>>>>> to see if they look reasonable (this will also help you decide
>>>>> whether
>>>>> to keep
>>>>> SNAP).
>>>>>
>>>>
>>>> Things to change in step 3:
>>>>> Step 3 should just be a clone of step 2 as it is bootstrapping.  But
>>>>> make
>>>>> copies of �/augustus/config/species/lamprey and save it to
>>>>> �/augustus/config/species/lamprey2 (editing all the files and names
>>>>> as
>>>>> you did
>>>>> in step 2). This way you don't loose that training data if you decide
>>>>> to
>>>>> step
>>>>> back.  Also Give your SNAP HMM a new name (I.e. lamprey2.hmm)
>>>>>
>>>>> augustus_species=lamprey2
>>>>> snaphmm=lamprey2.hmm  #optional if you decide to use SNAP
>>>>>
>>>>> Make a backup of Step 2 and run step 3 in the old Step 2 directory
>>>>> (This
>>>>> is
>>>>> for file reuse, so the step will run fast).  This must be the exact
>>>>> same
>>>>> step
>>>>> directory as step 2 for the reuse trick to work.
>>>>>
>>>>> Manually review the models and if you are satisfied move to step 4.
>>>>> Also note
>>>>> that most parameters including the protein, EST, and repeats should
>>>>>not
>>>>> change
>>>>> from step1-step3, and should not be removed for step 4 either, you
>>>>> can
>>>>> add
>>>>> more evidence, but don't remove evidence (like the repeats).
>>>>>
>>>>>
>>>> Things to change in step 4:
>>>>> For this step, just set min_contig=10000 and rerun MAKER inside the
>>>>> step
>>>>> 3
>>>>> directory to get the smaller contigs annotated.  This should be your
>>>>> final
>>>>> step, although you can try altering other parameters or adding more
>>>>> evidence
>>>>> sources here etc.
>>>>>
>>>>>
>>>> For other things to keep in mind, you should consider taking extra
>>>> time
>>>> to
>>>> build a comprehensive library of repeats for your species, I know that
>>>> Petromyzon marinus was virtually unannotatable until we had a very
>>>> deep
>>>> repeat library built for it.  Any repeat library should be used in all
>>>> steps.  Also for lamprey, you will expect between 20,000-30,000 genes
>>>> both
>>>> because of your assemblies fragmentation and because of some ancestral
>>>> genome duplication.  Also be aware that lampreys appear to undergo
>>>> programmed genome loss in somatic tissues, so any gene count you get
>>>> is
>>>> only
>>>> going to represent a maximum of 75-80% of all genes unless the
>>>> assembly
>>>> is
>>>> derived from germline tissue.
>>>>
>>>> Thanks,
>>>> Carson
>>>>
>>>>
>>>>>
>>>> On 12-12-06 2:17 PM, "Daniel Ence" <dence at genetics.utah.edu> wrote:
>>>>
>>>>> Hi Parul,
>>>>>
>>>>> In step one, if protein2genome isn't turned on, then the protein
>>>>> alignments
>>>>> wont be used to generate gene models. Carson said before to use
>>>>> protein2genome
>>>>> to generate gene models for your trainings set, but you should know
>>>>> that
>>>>> those
>>>>> data aren't being used right now.
>>>>>
>>>>> In step 2, you can include the est and protein evidence that you used
>>>>> in
>>>>> step
>>>>> one, and the ab-initio predictors will takes those data as hints to
>>>>> guide
>>>>> their predictions. Also, are you reannotating human here? I'm just
>>>>> wondering
>>>>> why the snap file is called Pult, but the augustus species model is
>>>>> human.
>>>>> Also, I think you should include the same masking files from step 1,
>>>>> otherwise
>>>>> the ab-initio predictors will be predicting on the unmasked sequence
>>>>> which
>>>>> will give you many spurious predictions.
>>>>>
>>>>> In step 3, I'm not certain what you mean by boot-strapping. The
>>>>>control
>>>>> file
>>>>> you sent for step 3 will just pass-through all of the data from
>>>>>before.
>>>>>
>>>>> In step 4, I think what you'll get is pretty close to what you
>>>>> already
>>>>> had in
>>>>> step 3. The gene models from step 3 are already based on the est and
>>>>> protein
>>>>> data from step 1, so without giving any different evidence or
>>>>>ab-initio
>>>>> predictors, I think that you will just get the same gene models that
>>>>> you
>>>>> got
>>>>> from step 3. Those gene models are what you should be using to train
>>>>> snap.
>>>>> Also, is this the same genome as in the other steps? The genome file
>>>>> here is
>>>>> called genome.linear.fa, but before it was called genome.fa.
>>>>>
>>>>> Step 5. So maker uses both evidence and ab-initio predictions to
>>>>>create
>>>>> models. If you don't give any evidence (EST or protein) maker will
>>>>> not
>>>>> annotate any gene models. You have also changed the augustus species
>>>>> model in
>>>>> this step, but I don't know what you've gained by going from one
>>>>> species
>>>>> to
>>>>> another. You should be training augustus with the gene models and
>>>>> then
>>>>> creating a new species model in augustus. As I understand it, the
>>>>> filter
>>>>> only
>>>>> operates on gene models, not ab-initio predictions or alignments, so
>>>>>it
>>>>> probably isn't doing anything the way you have it set.
>>>>>
>>>>> Step 6. I think step 5 and 6 should be combined. I don't know what
>>>>> you
>>>>> mean by
>>>>> boot-strapping here too.
>>>>>
>>>>> Hopefully that clears up some of the confusion with how maker works.
>>>>> Carson
>>>>> will probably have a lot of suggestions too.
>>>>>
>>>>> Thanks,
>>>>> Daniel
>>>>>
>>>>> Daniel Ence
>>>>> Graduate Student
>>>>> Eccles Institute of Human Genetics
>>>>> University of Utah
>>>>> 15 North 2030 East, Room 2100
>>>>> Salt Lake City, UT 84112-5330
>>>>> ________________________________________
>>>>> From: Parul Kudtarkar [parulk at caltech.edu]
>>>>> Sent: Thursday, December 06, 2012 10:08 AM
>>>>> To: carsonhh at gmail.com
>>>>> Cc: Daniel Ence; maker-devel at yandell-lab.org
>>>>> Subject: Re: [maker-devel] AED score
>>>>>
>>>>> Hi Carson,
>>>>>
>>>>> Once again thanks for a quick response. We expect to get 25,000 to
>>>>> 30,000
>>>>> genes.
>>>>> Here are the step
>>>>> Step1. EST and proteins are used to predict gene-models. the
>>>>> resulting
>>>>> file is used as training data-set for SNAP in step2. est2genome is
>>>>> turned
>>>>> ON
>>>>> Step2. Augustus and SNAP is used to predict genes.
>>>>> Step3. The results are re-annoated(boot-strap)
>>>>> Step4. EST and proteins are used to predict gene-models. The gff file
>>>>> from
>>>>> step3. is used as model_gff. The resulting file is used as training
>>>>> data-set for SNAP in step2
>>>>> Step 5. Augustus and SNAP is used to predict genes. with AED set to
>>>>>0.5
>>>>> Step6. The results are re-annoated(boot-strap) with AED set to 0.5
>>>>> and
>>>>> contig_size to 10kb
>>>>>
>>>>> Thanks and regards,
>>>>> Parul Kudtarkar
>>>>>
>>>>> I have attached the configuration file for each step.
>>>>>>  Your output has no genes.  They've all been filtered out.  The gene
>>>>> predictions are left for reference purposes, but there are no gene
>>>>> models
>>>>>>  in the file.  You need to look at the type columns in the GFF3 file
>>>>>> -->
>>>>> match/match_part features are evidence and reference data but not
>>>>> models.
>>>>>>  All models will have types gene/mRNA/exon/CDS.  For example if you
>>>>>> just
>>>>> want gene models in your file you can use the gff3_merge script with
>>>>> the
>>>>> -g option, and it will only print out the gene models.
>>>>>>  I think you may be misinterpreting what is happening at different
>>>>>> steps,
>>>>> as well as how to read the result files.  Could you give me a
>>>>> detailed
>>>>> explanation of what you expect to get back together with your control
>>>>> files and I can walk you through the configuration, and indicate what
>>>>> to
>>>>> expect.  Also what was the report like from CEGMA. Could you include
>>>>> the
>>>>> report file that shows how complete your genome is and how fragmented
>>>>> it
>>>>> is?
>>>>>>  Thanks,
>>>>>>  Carson
>>>>>>  On 12-12-05 3:03 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>>>>> Dear Carson,
>>>>>>> Thanks for a quick response. keep_preds is set to 0
>>>>>>> Though for previous step(ab-intio gene predictions keep_preds was
>>>>>>>set
>>>>>>> to
>>>>> 1, see Scaffold1_input.gff). I have also attached the output file
>>>>> Scaffold1_out.gff.
>>>>>>> Please advice.
>>>>>>> Thanks and regards,
>>>>>>> Parul Kudtarkar
>>>>>>>>  You should always get all predictions, but should only get models
>>>>> (match/match_part) with AED scores less than 0.5.  You should never
>>>>>get
>>>>>>>>  models (gene/mRNA/CDS) with an AED score of 1.00 unless you have
>>>>> keep_preds set.
>>>>>>>>  Do you have keeps_preds set or gene models with AED values above
>>>>>>>> your
>>>>> threshold (not match/match_part features)?  If the first one is the
>>>>>>>> case,
>>>>>>>>  just set it to 0.  If the second one in the case, could you send
>>>>>>>>me
>>>>> example GFF3?
>>>>>>>>  Thanks,
>>>>>>>>  Carson
>>>>>>>>  On 12-12-04 7:27 PM, "Parul Kudtarkar" <parulk at caltech.edu>
>>>>>>>> wrote:
>>>>>>>>> Dear Carson,
>>>>>>>>> Thanks once again, we have limited experimental data with very
>>>>>>>>> short
>>>>>>>>>  ESTs.
>>>>>>>>> CEGMA is useful for us to gauge our gene-model.
>>>>>>>>> On a different note to re-annotate genome(post evidence based
>>>>>>>>> prediction(used as training dataset)and abinitio
>>>>>>>>> gene-prediction).
>>>>> Here
>>>>>>>>> are the control parameters I am using with AED score set to 0.5,
>>>>>>>>>  however
>>>>>>>>>  I
>>>>>>>>> get predictions that includes the ones with AED score of 1.00 in
>>>>>>>>>  resulting
>>>>>>>>> gff3 file. Though I do see the number of genes reduced to 1/3 of
>>>>>>>>>  initial
>>>>>>>>> gff3 file.
>>>>>>>>> #-----Genome (Required for De-Novo Annotation)
>>>>>>>>> genome=Scaffold1.fa #genome sequence file in fasta format
>>>>>>>>> organism_type=eukaryotic #eukaryotic or prokaryotic. Default is
>>>>>>>>>  eukaryotic
>>>>>>>>> #-----Re-annotation Using MAKER Derived GFF3
>>>>>>>>> genome_gff= Scaffold1.gff #re-annotate genome based on this gff3
>>>>>>>>> file
>>>>> est_pass=1 #use ests in genome_gff: 1 = yes, 0 = no
>>>>>>>>> altest_pass=0 #use alternate organism ests in genome_gff: 1 =
>>>>>>>>> yes,
>>>>>>>>> 0
>>>>>>>>> =
>>>>> no
>>>>>>>>> protein_pass=1 #use proteins in genome_gff: 1 = yes, 0 = no
>>>>>>>>> rm_pass=0 #use repeats in genome_gff: 1 = yes, 0 = no
>>>>>>>>> model_pass=1 #use gene models in genome_gff: 1 = yes, 0 = no
>>>>>>>>> pred_pass=1 #use ab-initio predictions in genome_gff: 1 = yes, 0
>>>>>>>>> =
>>>>>>>>> no
>>>>> other_pass=0 #passthrough everything else in genome_gff: 1 = yes, 0 =
>>>>>>>>>  no
>>>>>>>>> #-----MAKER Behavior Options
>>>>>>>>> AED_threshold=0.5 #Maximum Annotation Edit Distance allowed
>>>>>>>>> (bound
>>>>>>>>> by
>>>>> 0
>>>>>>>>> and 1)
>>>>>>>>> Thanks and regards,
>>>>>>>>> Parul Kudtarkar
>>>>>>>>>>  Wow 330,000 is a lot. a large portion of genes are likely to be
>>>>>>>>>> partial
>>>>>>>>>> at
>>>>>>>>>>  best.  You should seriously consider using mRNAseq to capture
>>>>>>>>>> those
>>>>> by
>>>>>>>>>>  using maker's est_gff option to pass in results from cufflinks
>>>>>>>>>>or
>>>>>>>>>> trinity.
>>>>>>>>>>   Also I wouldn't even try to annotate contigs less than 10kb in
>>>>> size,
>>>>>>>>>> just
>>>>>>>>>>  have maker skip them by setting the min_contig filter in the
>>>>> maker_opts.ctl file.
>>>>>>>>>>  Thanks,
>>>>>>>>>>  Carson
>>>>>>>>>>  On 12-11-29 7:31 PM, "Parul Kudtarkar" <parulk at caltech.edu>
>>>>>>>>>> wrote:
>>>>>>>>>>> Thanks for the guidance Carson, total contig size is 330,611
>>>>>>>>>>>with
>>>>> N50
>>>>>>>>>>>  of
>>>>>>>>>>> 39.17kb. I agree we have short ESTs. So this is the possible
>>>>>>>>>>> reason
>>>>>>>>>>>  when
>>>>>>>>>>> filtering based on AED score 0.75 there are no gene models
>>>>>>>>>>> predicted
>>>>> despite the model_gff file has few genes with scores less than 0.75?
>>>>> Thanks and regards,
>>>>>>>>>>> Parul Kudtarkar
>>>>>>>>>>>  There are certain characteristics that are apparent in this
>>>>> contig.
>>>>>>>>>>> First
>>>>>>>>>>>  it seems to be repeat rich with a very low gene density.  You
>>>>>>>>>>> also
>>>>>>>>>>> have
>>>>>>>>>>> very short ESTs, and because of the lengths you are probably
>>>>>>>>>>> getting
>>>>> many
>>>>>>>>>>>  of them to align spuriously which produces very short gene
>>>>>>>>>>> models
>>>>> that
>>>>>>>>>>> are
>>>>>>>>>>>  more than likely false positives or at the very least just a
>>>>>>>>>>> piece
>>>>>>>>>>> of
>>>>>>>>>>> a
>>>>>>>>>>> gene.  I would turn off est2genome as a predictor for this
>>>>>>>>>>>reason
>>>>>>>>>>>  unless
>>>>>>>>>>> you can get longer EST assemblies (i.e. From mRNAseq).    Your
>>>>>>>>>>>  protein
>>>>>>>>>>> alignments also seem to be few and far between.  You probably
>>>>>>>>>>> need
>>>>> to
>>>>>>>>>>> add
>>>>>>>>>>>  more proteins from a couple of related species, and you might
>>>>> consider
>>>>>>>>>>> using protein2genome rather than est2genome as a predictor if
>>>>>>>>>>>you
>>>>> are
>>>>>>>>>>> still working to generate a training set. Also est2genome
>>>>>>>>>>> produced
>>>>> models
>>>>>>>>>>>  almost always have an AED score near 0 so mixing est2genome
>>>>>>>>>>>with
>>>>> the
>>>>>>>>>>> AED_threshold with such limited protein support does create an
>>>>> artificial
>>>>>>>>>>>  bias to get back very short and incomplete models.
>>>>>>>>>>>  How many contigs do you have in total and what is the N50
>>>>>>>>>>> value
>>>>> for
>>>>>>>>>>> the
>>>>>>>>>>> assembly? If you have a large number of very short contigs, you
>>>>>>>>>>> will
>>>>>>>>>>>  get
>>>>>>>>>>> very inflated gene counts because you get genes split across
>>>>>>>>>>> contigs
>>>>>>>>>>>  and
>>>>>>>>>>> many contigs tend t be subtle rearrangements of other contigs
>>>>>>>>>>> just
>>>>> assembled in a slightly different way (so you can get bits and
>>>>> pieces
>>>>>>>>>>>  of
>>>>>>>>>>> the same genes just rearranged).  This scenario is another
>>>>>>>>>>>  confounding
>>>>>>>>>>> factor if using the est2genome predictor with short ESTs.  I
>>>>>>>>>>> would
>>>>> recommend running CEGMA to get an estimate for the genome
>>>>>>>>>>>  completeness
>>>>>>>>>>> as
>>>>>>>>>>>  well as get an estimate of fragmentation as one of the
>>>>>>>>>>> statistics
>>>>>>>>>>> produced
>>>>>>>>>>>  is a percent of genes that are found complete (end to end) vs
>>>>> those
>>>>>>>>>>>  that
>>>>>>>>>>> are partial.  CEGMA identifies house keeping genes that tend to
>>>>>>>>>>> be
>>>>> shorter
>>>>>>>>>>>  and less intron rich than other genes in the genome, so if
>>>>>>>>>>>CEGMA
>>>>> gives
>>>>>>>>>>>  a
>>>>>>>>>>> high partial percentage and a low complete percentage, then
>>>>>>>>>>> this
>>>>>>>>>>>  pattern
>>>>>>>>>>> can be expected to be even more exaggerated for other genes in
>>>>>>>>>>> the
>>>>> genome.
>>>>>>>>>>>  If your genome is highly fragmented or proteins do not align
>>>>>>>>>>> well
>>>>> then
>>>>>>>>>>> there are other strategies.  For example, some vertebrate
>>>>>>>>>>>genomes
>>>>> end
>>>>>>>>>>>  up
>>>>>>>>>>> having extremely fragmented assemblies (on the order of 100,000
>>>>> contigs),
>>>>>>>>>>>  and if they are distantly related to other annotated species
>>>>>>>>>>>few
>>>>>>>>>>> proteins
>>>>>>>>>>>  may align to the contigs because the introns in the alignments
>>>>> tend
>>>>>>>>>>>  to
>>>>>>>>>>> be
>>>>>>>>>>>  so long and exons so short that it pushes down the
>>>>>>>>>>> significance
>>>>> scores
>>>>>>>>>>> too
>>>>>>>>>>>  much.  In those cases heavy mRNAseq seems to be the best if
>>>>>>>>>>> not
>>>>> only
>>>>>>>>>>>  way
>>>>>>>>>>> to get enough evidence to stitch gene models together.
>>>>>>>>>>>  Thanks,
>>>>>>>>>>>  Carson
>>>>>>>>>>>  On 12-11-28 4:40 PM, "Parul Kudtarkar" <parulk at caltech.edu>
>>>>>>>>>>> wrote:
>>>>>>>>>>> Dear Carson and Daniel,
>>>>>>>>>>> Thanks. I ran sample file for filtering genes based on AED
>>>>>>>>>>>score.
>>>>> The
>>>>>>>>>>> input gff3 file was provided to option model_pred(see attached
>>>>>>>>>>> file
>>>>> Scaffold1.gff), the cutoff AED score was set to 0.75. There are at
>>>>>>>>>>>  least
>>>>>>>>>>>  5
>>>>>>>>>>> genes with AED score less than 0.75. However there were no
>>>>>>>>>>> genes
>>>>>>>>>>>  predicted
>>>>>>>>>>> in the output file(see attached file Scaffold1_out). I have
>>>>>>>>>>> also
>>>>>>>>>>> attached
>>>>>>>>>>> the maker_opts.ctl. Could you please advice on this.
>>>>>>>>>>> Thanks and regards,
>>>>>>>>>>> Parul Kudtarkar
>>>>>>>>>>>  Use the AED_threshold option in the maker_opts.ctl file if you
>>>>>>>>>>> just
>>>>>>>>>>> want
>>>>>>>>>>>  to restrict final gene models to close matches directly within
>>>>>>>>>>> maker.
>>>>>>>>>>> On
>>>>>>>>>>>  the other hand, if you are trying to build a dataset for
>>>>> training
>>>>>>>>>>>  gene
>>>>>>>>>>> predictors, use the maker2zff script for generating a filtered
>>>>>>>>>>>  dataset
>>>>>>>>>>> for
>>>>>>>>>>>  SNAP training.  There are a number of filters available. Just
>>>>> call
>>>>>>>>>>>  the
>>>>>>>>>>> script once without parameters to see the options.
>>>>>>>>>>>  Thanks,
>>>>>>>>>>>  Carson
>>>>>>>>>>>  On 12-11-27 5:55 PM, "Daniel Ence" <dence at genetics.utah.edu>
>>>>>>>>>>> wrote:
>>>>>>>>>>> Hi Parul,
>>>>>>>>>>> I think the way you described (with the maker_opts.ctl file) is
>>>>> how
>>>>>>>>>>> you
>>>>>>>>>>> want to proceed. You still need to give the genome too.
>>>>>>>>>>> Daniel
>>>>>>>>>>> Daniel Ence
>>>>>>>>>>> Graduate Student
>>>>>>>>>>> Eccles Institute of Human Genetics
>>>>>>>>>>> University of Utah
>>>>>>>>>>> 15 North 2030 East, Room 2100
>>>>>>>>>>> Salt Lake City, UT 84112-5330
>>>>>>>>>>> ________________________________________
>>>>>>>>>>> From: maker-devel-bounces at yandell-lab.org
>>>>>>>>>>> [maker-devel-bounces at yandell-lab.org] on behalf of Parul
>>>>>>>>>>>  Kudtarkar
>>>>>>>>>>> [parulk at caltech.edu]
>>>>>>>>>>> Sent: Tuesday, November 27, 2012 3:41 PM
>>>>>>>>>>> To: Parul Kudtarkar
>>>>>>>>>>> Cc: maker-devel at yandell-lab.org
>>>>>>>>>>> Subject: Re: [maker-devel] AED score
>>>>>>>>>>> Also, are there any other parameters that are required when
>>>>> filtering
>>>>>>>>>>> based on AED score?
>>>>>>>>>>>  Hello Carson,
>>>>>>>>>>>  Just to confirm, Is there a script that would filter gene
>>>>> models
>>>>>>>>>>> at
>>>>>>>>>>> specific AED score.
>>>>>>>>>>>  Alternatively if I were to do this within maker with regards
>>>>> to
>>>>>>>>>>> parameters
>>>>>>>>>>>  in maker_opts.ctl file I would have to provide my predicted
>>>>>>>>>>> genes
>>>>>>>>>>> gff3
>>>>>>>>>>>  file to model_gff and  set AED_threshold at desired threshold?
>>>>>>>>>>> Thanks and regards,
>>>>>>>>>>>  Parul Kudtarkar
>>>>>>>>>>>  AED score with 1 are the ones you don't want.  0 is best and
>>>>> 1
>>>>>>>>>>>  is
>>>>>>>>>>> worst
>>>>>>>>>>>  as
>>>>>>>>>>>  it is a distance metric.  You can use the AED_threshold
>>>>> parameter
>>>>>>>>>>> to
>>>>>>>>>>>  require better matching to the evidence by setting it closer
>>>>> to
>>>>>>>>>>> 0.
>>>>>>>>>>> You
>>>>>>>>>>>  can
>>>>>>>>>>>  also try to increase protein homology evidence as some of
>>>>> your
>>>>>>>>>>> calls
>>>>>>>>>>> may
>>>>>>>>>>>  be split genes due to lack of evidence linking them.
>>>>>>>>>>>  --Carson
>>>>>>>>>>>  On 12-11-26 4:35 PM, "Parul Kudtarkar" <parulk at caltech.edu>
>>>>>>>>>>> wrote:
>>>>>>>>>>> Dear Maker community,
>>>>>>>>>>> For gene-prediction I get training data-set from evidence
>>>>>>>>>>>  based
>>>>>>>>>>> prediction, I use this data-set to train SNAP as well as
>>>>>>>>>>>Augustus
>>>>> predictions, followed by boot-strapping. I would typically expect
>>>>> 20-30K
>>>>>>>>>>> genes however I am getting 8 times the expected gene count
>>>>>>>>>>>  indicating
>>>>>>>>>>>  too
>>>>>>>>>>> many false positives. Is there a way to further refine these
>>>>>>>>>>> predication/script to retain predictions with AED score 1 and
>>>>>>>>>>> if
>>>>>>>>>>> yes
>>>>>>>>>>> how
>>>>>>>>>>> to go about this?
>>>>>>>>>>> Thanks and regards,
>>>>>>>>>>> Parul Kudtarkar
>>>>>>>>>>> --
>>>>>>>>>>> Scientific Programmer
>>>>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>>>>> Beckman Institute,
>>>>>>>>>>> California Institute of Technology
>>>>>>>>>>> http://www.spbase.org
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> maker-devel mailing list
>>>>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell
>>>>> -l
>>>>>>>>>>> ab
>>>>>>>>>>> .o
>>>>>>>>>>> rg
>>>>>>>>>>>  --
>>>>>>>>>>>  Scientific Programmer
>>>>>>>>>>>  Center for Computational Regulatory Genomics
>>>>>>>>>>>  Beckman Institute,
>>>>>>>>>>>  California Institute of Technology
>>>>>>>>>>>  http://www.spbase.org
>>>>>>>>>>> --
>>>>>>>>>>> Scientific Programmer
>>>>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>>>>> Beckman Institute,
>>>>>>>>>>> California Institute of Technology
>>>>>>>>>>> http://www.spbase.org
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> maker-devel mailing list
>>>>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>>>>>
>>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-l
>>>>>>>>>>>a
>>>>> b.
>>>>>>>>>>> or
>>>>>>>>>>> g
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> maker-devel mailing list
>>>>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>>>>>
>>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-l
>>>>>>>>>>>a
>>>>> b.
>>>>>>>>>>> or
>>>>>>>>>>> g
>>>>>>>>>>>  _______________________________________________
>>>>>>>>>>>  maker-devel mailing list
>>>>>>>>>>>  maker-devel at box290.bluehost.com
>>>>>>>>>>>
>>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-l
>>>>>>>>>>>ab
>>>>> .o
>>>>>>>>>>> rg
>>>>>>>>>>> --
>>>>>>>>>>> Scientific Programmer
>>>>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>>>>> Beckman Institute,
>>>>>>>>>>> California Institute of Technology
>>>>>>>>>>> http://www.spbase.org
>>>>>>>>>>> --
>>>>>>>>>>> Scientific Programmer
>>>>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>>>>> Beckman Institute,
>>>>>>>>>>> California Institute of Technology
>>>>>>>>>>> http://www.spbase.org
>>>>>>>>> --
>>>>>>>>> Scientific Programmer
>>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>>> Beckman Institute,
>>>>>>>>> California Institute of Technology
>>>>>>>>> http://www.spbase.org
>>>>>>> --
>>>>>>> Scientific Programmer
>>>>>>> Center for Computational Regulatory Genomics
>>>>>>> Beckman Institute,
>>>>>>> California Institute of Technology
>>>>>>> http://www.spbase.org
>>>>>
>>>>>
>>>>> --
>>>>> Scientific Programmer
>>>>> Center for Computational Regulatory Genomics
>>>>> Beckman Institute,
>>>>> California Institute of Technology
>>>>> http://www.spbase.org
>>>>>
>>>>> --
>>>>> Scientific Programmer
>>>>> Center for Computational Regulatory Genomics
>>>>> Beckman Institute,
>>>>> California Institute of Technology
>>>>> http://www.spbase.org
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Scientific Programmer
>>> Center for Computational Regulatory Genomics
>>> Beckman Institute,
>>> California Institute of Technology
>>> http://www.spbase.org
>>>
>>
>>
>>--
>>Scientific Programmer
>>Center for Computational Regulatory Genomics
>>Beckman Institute,
>>California Institute of Technology
>>http://www.spbase.org
>>
>
>
>

--
Scientific Programmer
Center for Computational Regulatory Genomics
Beckman Institute,
California Institute of Technology
http://www.spbase.org