[maker-devel] AED score

Tue Dec 11 07:21:44 MST 2012

Thanks Parul.  If you need any more help, just let us know.

--Carson

On 12-12-10 10:28 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:

>Dear Carson, Thanks a lot for detailed explanation and all the help with
>running and understanding maker2.
>
>
>> Yes.  Lamprey is not a good match for sea urchin.  When training
>>Augusuts
>> you can sometimes use other genomes as starting points and let Augustus
>> then modify that species to match the models you provided, which can be
>> better than de novo training under certain circumstances. But given sea
>> urchin's evolutionary distance from all the other species bundled with
>> augustus, its probably not a good starting point.  So protein2genome
>> derived models and cegma based training are probably your best options.
>>
>> --Carson
>>
>>
>> On 12-12-07 8:09 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>
>>>Sorry for not checking this before, just to answer my own question I
>>>just
>>>found after reading augustus documentation that autoAugTrain.pl should
>>>work for generating training dataset for Augustus from using the
>>>gene-model hints from first pass run of maker and using
>>>http://sourceforge.net/mailarchive/message.php?msg_id=29361270 script to
>>>generate training.gb
>>>
>>>autoAugTrain.pl [OPTIONS] --species=sname --trainingset=training.gb
>>>
>>>I believe should work better than using lamprey to train augustus.
>>>
>>>Thanks and regards,
>>>Parul Kudtarkar
>>>
>>>> Dear Carson,
>>>>
>>>> Thanks for detailed explanation and help. Now we know the exact
>>>>parameters
>>>> that should help us with generating good gene-model.
>>>>
>>>> The genome for which we are working on gene predictions is
>>>> Echinoderm(sea-urchin). lamprey was the closest organism for training
>>>> augustus that I could find. A quick question for training augustus,
>>>>there
>>>> is augustus_species option how would you go from training data
>>>> generated
>>>> by zff2augustus_gbk.pl > train.gb as specified here
>>>> http://sourceforge.net/mailarchive/message.php?msg_id=29361270 to
>>>> generating the species folder that can be specified to
>>>>augustus_species
>>>> option.
>>>>
>>>> The average expected gene-length in our case is ~15kb.
>>>> We also have a good repeat library for our genome.
>>>>
>>>> Thanks and regards,
>>>> Parul Kudtarkar
>>>>
>>>>> For step 2 and 3, use lamprey rather than human in augustus.  For
>>>>>some
>>>>> reason the current version of augustus doesn't diplay it as an option
>>>>> for
>>>>> the species help menu (so I didn't see it), but it's there in
>>>>> �/augustus/config/species/lamprey
>>>>>
>>>>> For any additional training, make copies into
>>>>> �/augustus/config/species/lamprey2 and
>>>>> �/augustus/config/species/lamprey3
>>>>>
>>>>> Thanks,
>>>>> Carson
>>>>>
>>>>>
>>>>>
>>>>> From:  Carson Holt <carsonhh at gmail.com>
>>>>> Date:  Friday, 7 December, 2012 10:22 AM
>>>>> To:  Daniel Ence <dence at genetics.utah.edu>,
>>>>> "maker-devel at yandell-lab.org"
>>>>> <maker-devel at yandell-lab.org>, Parul Kudtarkar <parulk at caltech.edu>
>>>>> Subject:  Re: [maker-devel] AED score
>>>>>
>>>>> Just to add to Daniels comments.
>>>>>
>>>>> Things to change in step 1:
>>>>>> protein= <just try and add more protein evidence in general>
>>>>>> est2genome=0
>>>>>> protein2genome=1
>>>>>> split_hit=20000
>>>>>> min_contig=50000
>>>>>
>>>>> Reasoning:
>>>>>> Your ESTs are very short especially if this is a lamprey species
>>>>>> which
>>>>>> have
>>>>>> very long introns and really short exons.  In lamprey (i.e.
>>>>>> Petromyzon
>>>>>> marinus), genes tend to be very long (remember gene lengths include
>>>>>> introns
>>>>>> and UTR and is not just the size of the coding sequence), so contigs
>>>>>> shorter
>>>>>> than 50kb are useless for training as you are unlikely to get nice
>>>>>> complete
>>>>>> gene models on those.  Also lampreys have very long introns, so you
>>>>>> have
>>>>>> to
>>>>>> allow for bigger introns in alignments (split_hit parameter).
>>>>>> Finally
>>>>>> add as
>>>>>> much protein evidence from as many sources as possible.  Your maker
>>>>>> training
>>>>>> run will take a long time as proteins take forever to align, but
>>>>>> because
>>>>>> of
>>>>>> the evolutionary distance of lamprey from everything else and the
>>>>>>short
>>>>>> exon
>>>>>> structure of its genome, very little aligns directly to its genome
>>>>>>from
>>>>>> other
>>>>>> deuterstome and vertebrate species.  I'm assuming this is a lamprey
>>>>>> species
>>>>>> because of what you said about the  augustus species file you are
>>>>>> using.
>>>>>> Really the only thing closely related to lampreys unfortunately are
>>>>>> other
>>>>>> lampreys.  Lancelets, hagfish, and sharks are not closely related to
>>>>>> lamprey
>>>>>> (while they branch closely together on the tree of life, there are
>>>>>> too
>>>>>> many
>>>>>> years since the last common ancestor).  So while they may have
>>>>>> similar
>>>>>> issues
>>>>>> related to annotation (long introns and short exons etc.) they will
>>>>>>not
>>>>>> really
>>>>>> match that well for the gene predictor or even protein alignments.
>>>>>
>>>>> Additional note:
>>>>>> I have training files for the lamprey species Petromyzon marinus for
>>>>>> both
>>>>>> Augustus and SNAP that I could share with you in a few week, when
>>>>>>the
>>>>>> genome
>>>>>> publication is is released.  But before that happens, new gene
>>>>>>models
>>>>>> will be
>>>>>> available  through the UCSC browser (hopefully within a couple of
>>>>>> weeks), and
>>>>>> gene models are already available through ENSEMBL.  Get those
>>>>>>protein
>>>>>> files
>>>>>> for training, it may be a big help for you.  If you want early
>>>>>>access
>>>>>> to
>>>>>> the
>>>>>> lamprey training files for Augustus and SNAP, you would have to
>>>>>>request
>>>>>> it
>>>>>> from Weiming Li at Michigan State University (the head of that
>>>>>>genome
>>>>>> project).
>>>>>>
>>>>>
>>>>>
>>>>> Things to change in step 2:
>>>>>> Optimally you would be doing de novo training using mRNAseq results,
>>>>>> but
>>>>>> with
>>>>>> on;ly sparse protein alignments and such a fragmented assembly, you
>>>>>>are
>>>>>> probably better off just trying to adapt the human HMM files.  They
>>>>>> won't
>>>>>> match that well, but you probably won't have the evidence for De
>>>>>>Novo
>>>>>> training.  First make a copy of the augustus human species directory
>>>>>> and
>>>>>> rename it to lamprey (cp -R  �/augustus/config/species/human
>>>>>> �/augustus/config/species/lamprey).  Use it as the base species for
>>>>>> retraining
>>>>>> augustus using your new models.  You will have to edit multiple
>>>>>>files
>>>>>> in
>>>>>> the
>>>>>> directory after you copy it so that they no longer say human or homo
>>>>>> sapiens
>>>>>> internally or in the file name.  Use maker2zff to generate the
>>>>>>filtered
>>>>>> ZFF
>>>>>> file for training SNAP, but don't train SNAP.  Rather use the
>>>>>> training
>>>>>> file to
>>>>>> better train Augustus info here (just ignore the CEGMA part) -->
>>>>>> http://sourceforge.net/mailarchive/message.php?msg_id=29361270
>>>>>>
>>>>>> MAKE a backup of the step 1 maker output directory and run step 2 in
>>>>>> the
>>>>>> old
>>>>>> step 1 directory (this allows you to change the parameters and reuse
>>>>>> files
>>>>>> form step 1 so you don't have to recalculate all the protein and EST
>>>>>> alignments).  So control files for step 2 are identical to step 1
>>>>>> except
>>>>>> for
>>>>>> these parameters.
>>>>>>
>>>>>> protein2genome=0
>>>>>> augustus_species=lamprey
>>>>>> snaphmm=lamprey.hmm #optional if you decide to use SNAP
>>>>>>
>>>>>> Don't both training SNAP here as you probably won't have enough data
>>>>>> and
>>>>>> you
>>>>>> assembly is too fragmented for it to work well, so just stick to
>>>>>> augustus.
>>>>>> Try SNAP if you want just to see how well it works.  Manually open
>>>>>>up
>>>>>> the
>>>>>> largest contigs in a viewer to look at the models produced from the
>>>>>> MAKER run
>>>>>> to see if they look reasonable (this will also help you decide
>>>>>> whether
>>>>>> to keep
>>>>>> SNAP).
>>>>>>
>>>>>
>>>>> Things to change in step 3:
>>>>>> Step 3 should just be a clone of step 2 as it is bootstrapping.  But
>>>>>> make
>>>>>> copies of �/augustus/config/species/lamprey and save it to
>>>>>> �/augustus/config/species/lamprey2 (editing all the files and names
>>>>>> as
>>>>>> you did
>>>>>> in step 2). This way you don't loose that training data if you
>>>>>>decide
>>>>>> to
>>>>>> step
>>>>>> back.  Also Give your SNAP HMM a new name (I.e. lamprey2.hmm)
>>>>>>
>>>>>> augustus_species=lamprey2
>>>>>> snaphmm=lamprey2.hmm  #optional if you decide to use SNAP
>>>>>>
>>>>>> Make a backup of Step 2 and run step 3 in the old Step 2 directory
>>>>>> (This
>>>>>> is
>>>>>> for file reuse, so the step will run fast).  This must be the exact
>>>>>> same
>>>>>> step
>>>>>> directory as step 2 for the reuse trick to work.
>>>>>>
>>>>>> Manually review the models and if you are satisfied move to step 4.
>>>>>> Also note
>>>>>> that most parameters including the protein, EST, and repeats should
>>>>>>not
>>>>>> change
>>>>>> from step1-step3, and should not be removed for step 4 either, you
>>>>>> can
>>>>>> add
>>>>>> more evidence, but don't remove evidence (like the repeats).
>>>>>>
>>>>>>
>>>>> Things to change in step 4:
>>>>>> For this step, just set min_contig=10000 and rerun MAKER inside the
>>>>>> step
>>>>>> 3
>>>>>> directory to get the smaller contigs annotated.  This should be your
>>>>>> final
>>>>>> step, although you can try altering other parameters or adding more
>>>>>> evidence
>>>>>> sources here etc.
>>>>>>
>>>>>>
>>>>> For other things to keep in mind, you should consider taking extra
>>>>> time
>>>>> to
>>>>> build a comprehensive library of repeats for your species, I know
>>>>>that
>>>>> Petromyzon marinus was virtually unannotatable until we had a very
>>>>> deep
>>>>> repeat library built for it.  Any repeat library should be used in
>>>>>all
>>>>> steps.  Also for lamprey, you will expect between 20,000-30,000 genes
>>>>> both
>>>>> because of your assemblies fragmentation and because of some
>>>>>ancestral
>>>>> genome duplication.  Also be aware that lampreys appear to undergo
>>>>> programmed genome loss in somatic tissues, so any gene count you get
>>>>> is
>>>>> only
>>>>> going to represent a maximum of 75-80% of all genes unless the
>>>>> assembly
>>>>> is
>>>>> derived from germline tissue.
>>>>>
>>>>> Thanks,
>>>>> Carson
>>>>>
>>>>>
>>>>>>
>>>>> On 12-12-06 2:17 PM, "Daniel Ence" <dence at genetics.utah.edu> wrote:
>>>>>
>>>>>> Hi Parul,
>>>>>>
>>>>>> In step one, if protein2genome isn't turned on, then the protein
>>>>>> alignments
>>>>>> wont be used to generate gene models. Carson said before to use
>>>>>> protein2genome
>>>>>> to generate gene models for your trainings set, but you should know
>>>>>> that
>>>>>> those
>>>>>> data aren't being used right now.
>>>>>>
>>>>>> In step 2, you can include the est and protein evidence that you
>>>>>>used
>>>>>> in
>>>>>> step
>>>>>> one, and the ab-initio predictors will takes those data as hints to
>>>>>> guide
>>>>>> their predictions. Also, are you reannotating human here? I'm just
>>>>>> wondering
>>>>>> why the snap file is called Pult, but the augustus species model is
>>>>>> human.
>>>>>> Also, I think you should include the same masking files from step 1,
>>>>>> otherwise
>>>>>> the ab-initio predictors will be predicting on the unmasked sequence
>>>>>> which
>>>>>> will give you many spurious predictions.
>>>>>>
>>>>>> In step 3, I'm not certain what you mean by boot-strapping. The
>>>>>>control
>>>>>> file
>>>>>> you sent for step 3 will just pass-through all of the data from
>>>>>>before.
>>>>>>
>>>>>> In step 4, I think what you'll get is pretty close to what you
>>>>>> already
>>>>>> had in
>>>>>> step 3. The gene models from step 3 are already based on the est and
>>>>>> protein
>>>>>> data from step 1, so without giving any different evidence or
>>>>>>ab-initio
>>>>>> predictors, I think that you will just get the same gene models that
>>>>>> you
>>>>>> got
>>>>>> from step 3. Those gene models are what you should be using to train
>>>>>> snap.
>>>>>> Also, is this the same genome as in the other steps? The genome file
>>>>>> here is
>>>>>> called genome.linear.fa, but before it was called genome.fa.
>>>>>>
>>>>>> Step 5. So maker uses both evidence and ab-initio predictions to
>>>>>>create
>>>>>> models. If you don't give any evidence (EST or protein) maker will
>>>>>> not
>>>>>> annotate any gene models. You have also changed the augustus species
>>>>>> model in
>>>>>> this step, but I don't know what you've gained by going from one
>>>>>> species
>>>>>> to
>>>>>> another. You should be training augustus with the gene models and
>>>>>> then
>>>>>> creating a new species model in augustus. As I understand it, the
>>>>>> filter
>>>>>> only
>>>>>> operates on gene models, not ab-initio predictions or alignments, so
>>>>>>it
>>>>>> probably isn't doing anything the way you have it set.
>>>>>>
>>>>>> Step 6. I think step 5 and 6 should be combined. I don't know what
>>>>>> you
>>>>>> mean by
>>>>>> boot-strapping here too.
>>>>>>
>>>>>> Hopefully that clears up some of the confusion with how maker works.
>>>>>> Carson
>>>>>> will probably have a lot of suggestions too.
>>>>>>
>>>>>> Thanks,
>>>>>> Daniel
>>>>>>
>>>>>> Daniel Ence
>>>>>> Graduate Student
>>>>>> Eccles Institute of Human Genetics
>>>>>> University of Utah
>>>>>> 15 North 2030 East, Room 2100
>>>>>> Salt Lake City, UT 84112-5330
>>>>>> ________________________________________
>>>>>> From: Parul Kudtarkar [parulk at caltech.edu]
>>>>>> Sent: Thursday, December 06, 2012 10:08 AM
>>>>>> To: carsonhh at gmail.com
>>>>>> Cc: Daniel Ence; maker-devel at yandell-lab.org
>>>>>> Subject: Re: [maker-devel] AED score
>>>>>>
>>>>>> Hi Carson,
>>>>>>
>>>>>> Once again thanks for a quick response. We expect to get 25,000 to
>>>>>> 30,000
>>>>>> genes.
>>>>>> Here are the step
>>>>>> Step1. EST and proteins are used to predict gene-models. the
>>>>>> resulting
>>>>>> file is used as training data-set for SNAP in step2. est2genome is
>>>>>> turned
>>>>>> ON
>>>>>> Step2. Augustus and SNAP is used to predict genes.
>>>>>> Step3. The results are re-annoated(boot-strap)
>>>>>> Step4. EST and proteins are used to predict gene-models. The gff
>>>>>>file
>>>>>> from
>>>>>> step3. is used as model_gff. The resulting file is used as training
>>>>>> data-set for SNAP in step2
>>>>>> Step 5. Augustus and SNAP is used to predict genes. with AED set to
>>>>>>0.5
>>>>>> Step6. The results are re-annoated(boot-strap) with AED set to 0.5
>>>>>> and
>>>>>> contig_size to 10kb
>>>>>>
>>>>>> Thanks and regards,
>>>>>> Parul Kudtarkar
>>>>>>
>>>>>> I have attached the configuration file for each step.
>>>>>>>  Your output has no genes.  They've all been filtered out.  The
>>>>>>>gene
>>>>>> predictions are left for reference purposes, but there are no gene
>>>>>> models
>>>>>>>  in the file.  You need to look at the type columns in the GFF3
>>>>>>>file
>>>>>>> -->
>>>>>> match/match_part features are evidence and reference data but not
>>>>>> models.
>>>>>>>  All models will have types gene/mRNA/exon/CDS.  For example if you
>>>>>>> just
>>>>>> want gene models in your file you can use the gff3_merge script with
>>>>>> the
>>>>>> -g option, and it will only print out the gene models.
>>>>>>>  I think you may be misinterpreting what is happening at different
>>>>>>> steps,
>>>>>> as well as how to read the result files.  Could you give me a
>>>>>> detailed
>>>>>> explanation of what you expect to get back together with your
>>>>>>control
>>>>>> files and I can walk you through the configuration, and indicate
>>>>>>what
>>>>>> to
>>>>>> expect.  Also what was the report like from CEGMA. Could you include
>>>>>> the
>>>>>> report file that shows how complete your genome is and how
>>>>>>fragmented
>>>>>> it
>>>>>> is?
>>>>>>>  Thanks,
>>>>>>>  Carson
>>>>>>>  On 12-12-05 3:03 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>>>>>> Dear Carson,
>>>>>>>> Thanks for a quick response. keep_preds is set to 0
>>>>>>>> Though for previous step(ab-intio gene predictions keep_preds was
>>>>>>>>set
>>>>>>>> to
>>>>>> 1, see Scaffold1_input.gff). I have also attached the output file
>>>>>> Scaffold1_out.gff.
>>>>>>>> Please advice.
>>>>>>>> Thanks and regards,
>>>>>>>> Parul Kudtarkar
>>>>>>>>>  You should always get all predictions, but should only get
>>>>>>>>>models
>>>>>> (match/match_part) with AED scores less than 0.5.  You should never
>>>>>>get
>>>>>>>>>  models (gene/mRNA/CDS) with an AED score of 1.00 unless you have
>>>>>> keep_preds set.
>>>>>>>>>  Do you have keeps_preds set or gene models with AED values above
>>>>>>>>> your
>>>>>> threshold (not match/match_part features)?  If the first one is the
>>>>>>>>> case,
>>>>>>>>>  just set it to 0.  If the second one in the case, could you send
>>>>>>>>>me
>>>>>> example GFF3?
>>>>>>>>>  Thanks,
>>>>>>>>>  Carson
>>>>>>>>>  On 12-12-04 7:27 PM, "Parul Kudtarkar" <parulk at caltech.edu>
>>>>>>>>> wrote:
>>>>>>>>>> Dear Carson,
>>>>>>>>>> Thanks once again, we have limited experimental data with very
>>>>>>>>>> short
>>>>>>>>>>  ESTs.
>>>>>>>>>> CEGMA is useful for us to gauge our gene-model.
>>>>>>>>>> On a different note to re-annotate genome(post evidence based
>>>>>>>>>> prediction(used as training dataset)and abinitio
>>>>>>>>>> gene-prediction).
>>>>>> Here
>>>>>>>>>> are the control parameters I am using with AED score set to 0.5,
>>>>>>>>>>  however
>>>>>>>>>>  I
>>>>>>>>>> get predictions that includes the ones with AED score of 1.00 in
>>>>>>>>>>  resulting
>>>>>>>>>> gff3 file. Though I do see the number of genes reduced to 1/3 of
>>>>>>>>>>  initial
>>>>>>>>>> gff3 file.
>>>>>>>>>> #-----Genome (Required for De-Novo Annotation)
>>>>>>>>>> genome=Scaffold1.fa #genome sequence file in fasta format
>>>>>>>>>> organism_type=eukaryotic #eukaryotic or prokaryotic. Default is
>>>>>>>>>>  eukaryotic
>>>>>>>>>> #-----Re-annotation Using MAKER Derived GFF3
>>>>>>>>>> genome_gff= Scaffold1.gff #re-annotate genome based on this gff3
>>>>>>>>>> file
>>>>>> est_pass=1 #use ests in genome_gff: 1 = yes, 0 = no
>>>>>>>>>> altest_pass=0 #use alternate organism ests in genome_gff: 1 =
>>>>>>>>>> yes,
>>>>>>>>>> 0
>>>>>>>>>> =
>>>>>> no
>>>>>>>>>> protein_pass=1 #use proteins in genome_gff: 1 = yes, 0 = no
>>>>>>>>>> rm_pass=0 #use repeats in genome_gff: 1 = yes, 0 = no
>>>>>>>>>> model_pass=1 #use gene models in genome_gff: 1 = yes, 0 = no
>>>>>>>>>> pred_pass=1 #use ab-initio predictions in genome_gff: 1 = yes, 0
>>>>>>>>>> =
>>>>>>>>>> no
>>>>>> other_pass=0 #passthrough everything else in genome_gff: 1 = yes, 0
>>>>>>=
>>>>>>>>>>  no
>>>>>>>>>> #-----MAKER Behavior Options
>>>>>>>>>> AED_threshold=0.5 #Maximum Annotation Edit Distance allowed
>>>>>>>>>> (bound
>>>>>>>>>> by
>>>>>> 0
>>>>>>>>>> and 1)
>>>>>>>>>> Thanks and regards,
>>>>>>>>>> Parul Kudtarkar
>>>>>>>>>>>  Wow 330,000 is a lot. a large portion of genes are likely to
>>>>>>>>>>>be
>>>>>>>>>>> partial
>>>>>>>>>>> at
>>>>>>>>>>>  best.  You should seriously consider using mRNAseq to capture
>>>>>>>>>>> those
>>>>>> by
>>>>>>>>>>>  using maker's est_gff option to pass in results from cufflinks
>>>>>>>>>>>or
>>>>>>>>>>> trinity.
>>>>>>>>>>>   Also I wouldn't even try to annotate contigs less than 10kb
>>>>>>>>>>>in
>>>>>> size,
>>>>>>>>>>> just
>>>>>>>>>>>  have maker skip them by setting the min_contig filter in the
>>>>>> maker_opts.ctl file.
>>>>>>>>>>>  Thanks,
>>>>>>>>>>>  Carson
>>>>>>>>>>>  On 12-11-29 7:31 PM, "Parul Kudtarkar" <parulk at caltech.edu>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> Thanks for the guidance Carson, total contig size is 330,611
>>>>>>>>>>>>with
>>>>>> N50
>>>>>>>>>>>>  of
>>>>>>>>>>>> 39.17kb. I agree we have short ESTs. So this is the possible
>>>>>>>>>>>> reason
>>>>>>>>>>>>  when
>>>>>>>>>>>> filtering based on AED score 0.75 there are no gene models
>>>>>>>>>>>> predicted
>>>>>> despite the model_gff file has few genes with scores less than 0.75?
>>>>>> Thanks and regards,
>>>>>>>>>>>> Parul Kudtarkar
>>>>>>>>>>>>  There are certain characteristics that are apparent in this
>>>>>> contig.
>>>>>>>>>>>> First
>>>>>>>>>>>>  it seems to be repeat rich with a very low gene density.  You
>>>>>>>>>>>> also
>>>>>>>>>>>> have
>>>>>>>>>>>> very short ESTs, and because of the lengths you are probably
>>>>>>>>>>>> getting
>>>>>> many
>>>>>>>>>>>>  of them to align spuriously which produces very short gene
>>>>>>>>>>>> models
>>>>>> that
>>>>>>>>>>>> are
>>>>>>>>>>>>  more than likely false positives or at the very least just a
>>>>>>>>>>>> piece
>>>>>>>>>>>> of
>>>>>>>>>>>> a
>>>>>>>>>>>> gene.  I would turn off est2genome as a predictor for this
>>>>>>>>>>>>reason
>>>>>>>>>>>>  unless
>>>>>>>>>>>> you can get longer EST assemblies (i.e. From mRNAseq).    Your
>>>>>>>>>>>>  protein
>>>>>>>>>>>> alignments also seem to be few and far between.  You probably
>>>>>>>>>>>> need
>>>>>> to
>>>>>>>>>>>> add
>>>>>>>>>>>>  more proteins from a couple of related species, and you might
>>>>>> consider
>>>>>>>>>>>> using protein2genome rather than est2genome as a predictor if
>>>>>>>>>>>>you
>>>>>> are
>>>>>>>>>>>> still working to generate a training set. Also est2genome
>>>>>>>>>>>> produced
>>>>>> models
>>>>>>>>>>>>  almost always have an AED score near 0 so mixing est2genome
>>>>>>>>>>>>with
>>>>>> the
>>>>>>>>>>>> AED_threshold with such limited protein support does create an
>>>>>> artificial
>>>>>>>>>>>>  bias to get back very short and incomplete models.
>>>>>>>>>>>>  How many contigs do you have in total and what is the N50
>>>>>>>>>>>> value
>>>>>> for
>>>>>>>>>>>> the
>>>>>>>>>>>> assembly? If you have a large number of very short contigs,
>>>>>>>>>>>>you
>>>>>>>>>>>> will
>>>>>>>>>>>>  get
>>>>>>>>>>>> very inflated gene counts because you get genes split across
>>>>>>>>>>>> contigs
>>>>>>>>>>>>  and
>>>>>>>>>>>> many contigs tend t be subtle rearrangements of other contigs
>>>>>>>>>>>> just
>>>>>> assembled in a slightly different way (so you can get bits and
>>>>>> pieces
>>>>>>>>>>>>  of
>>>>>>>>>>>> the same genes just rearranged).  This scenario is another
>>>>>>>>>>>>  confounding
>>>>>>>>>>>> factor if using the est2genome predictor with short ESTs.  I
>>>>>>>>>>>> would
>>>>>> recommend running CEGMA to get an estimate for the genome
>>>>>>>>>>>>  completeness
>>>>>>>>>>>> as
>>>>>>>>>>>>  well as get an estimate of fragmentation as one of the
>>>>>>>>>>>> statistics
>>>>>>>>>>>> produced
>>>>>>>>>>>>  is a percent of genes that are found complete (end to end) vs
>>>>>> those
>>>>>>>>>>>>  that
>>>>>>>>>>>> are partial.  CEGMA identifies house keeping genes that tend
>>>>>>>>>>>>to
>>>>>>>>>>>> be
>>>>>> shorter
>>>>>>>>>>>>  and less intron rich than other genes in the genome, so if
>>>>>>>>>>>>CEGMA
>>>>>> gives
>>>>>>>>>>>>  a
>>>>>>>>>>>> high partial percentage and a low complete percentage, then
>>>>>>>>>>>> this
>>>>>>>>>>>>  pattern
>>>>>>>>>>>> can be expected to be even more exaggerated for other genes in
>>>>>>>>>>>> the
>>>>>> genome.
>>>>>>>>>>>>  If your genome is highly fragmented or proteins do not align
>>>>>>>>>>>> well
>>>>>> then
>>>>>>>>>>>> there are other strategies.  For example, some vertebrate
>>>>>>>>>>>>genomes
>>>>>> end
>>>>>>>>>>>>  up
>>>>>>>>>>>> having extremely fragmented assemblies (on the order of
>>>>>>>>>>>>100,000
>>>>>> contigs),
>>>>>>>>>>>>  and if they are distantly related to other annotated species
>>>>>>>>>>>>few
>>>>>>>>>>>> proteins
>>>>>>>>>>>>  may align to the contigs because the introns in the
>>>>>>>>>>>>alignments
>>>>>> tend
>>>>>>>>>>>>  to
>>>>>>>>>>>> be
>>>>>>>>>>>>  so long and exons so short that it pushes down the
>>>>>>>>>>>> significance
>>>>>> scores
>>>>>>>>>>>> too
>>>>>>>>>>>>  much.  In those cases heavy mRNAseq seems to be the best if
>>>>>>>>>>>> not
>>>>>> only
>>>>>>>>>>>>  way
>>>>>>>>>>>> to get enough evidence to stitch gene models together.
>>>>>>>>>>>>  Thanks,
>>>>>>>>>>>>  Carson
>>>>>>>>>>>>  On 12-11-28 4:40 PM, "Parul Kudtarkar" <parulk at caltech.edu>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> Dear Carson and Daniel,
>>>>>>>>>>>> Thanks. I ran sample file for filtering genes based on AED
>>>>>>>>>>>>score.
>>>>>> The
>>>>>>>>>>>> input gff3 file was provided to option model_pred(see attached
>>>>>>>>>>>> file
>>>>>> Scaffold1.gff), the cutoff AED score was set to 0.75. There are at
>>>>>>>>>>>>  least
>>>>>>>>>>>>  5
>>>>>>>>>>>> genes with AED score less than 0.75. However there were no
>>>>>>>>>>>> genes
>>>>>>>>>>>>  predicted
>>>>>>>>>>>> in the output file(see attached file Scaffold1_out). I have
>>>>>>>>>>>> also
>>>>>>>>>>>> attached
>>>>>>>>>>>> the maker_opts.ctl. Could you please advice on this.
>>>>>>>>>>>> Thanks and regards,
>>>>>>>>>>>> Parul Kudtarkar
>>>>>>>>>>>>  Use the AED_threshold option in the maker_opts.ctl file if
>>>>>>>>>>>>you
>>>>>>>>>>>> just
>>>>>>>>>>>> want
>>>>>>>>>>>>  to restrict final gene models to close matches directly
>>>>>>>>>>>>within
>>>>>>>>>>>> maker.
>>>>>>>>>>>> On
>>>>>>>>>>>>  the other hand, if you are trying to build a dataset for
>>>>>> training
>>>>>>>>>>>>  gene
>>>>>>>>>>>> predictors, use the maker2zff script for generating a filtered
>>>>>>>>>>>>  dataset
>>>>>>>>>>>> for
>>>>>>>>>>>>  SNAP training.  There are a number of filters available. Just
>>>>>> call
>>>>>>>>>>>>  the
>>>>>>>>>>>> script once without parameters to see the options.
>>>>>>>>>>>>  Thanks,
>>>>>>>>>>>>  Carson
>>>>>>>>>>>>  On 12-11-27 5:55 PM, "Daniel Ence" <dence at genetics.utah.edu>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> Hi Parul,
>>>>>>>>>>>> I think the way you described (with the maker_opts.ctl file)
>>>>>>>>>>>>is
>>>>>> how
>>>>>>>>>>>> you
>>>>>>>>>>>> want to proceed. You still need to give the genome too.
>>>>>>>>>>>> Daniel
>>>>>>>>>>>> Daniel Ence
>>>>>>>>>>>> Graduate Student
>>>>>>>>>>>> Eccles Institute of Human Genetics
>>>>>>>>>>>> University of Utah
>>>>>>>>>>>> 15 North 2030 East, Room 2100
>>>>>>>>>>>> Salt Lake City, UT 84112-5330
>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>> From: maker-devel-bounces at yandell-lab.org
>>>>>>>>>>>> [maker-devel-bounces at yandell-lab.org] on behalf of Parul
>>>>>>>>>>>>  Kudtarkar
>>>>>>>>>>>> [parulk at caltech.edu]
>>>>>>>>>>>> Sent: Tuesday, November 27, 2012 3:41 PM
>>>>>>>>>>>> To: Parul Kudtarkar
>>>>>>>>>>>> Cc: maker-devel at yandell-lab.org
>>>>>>>>>>>> Subject: Re: [maker-devel] AED score
>>>>>>>>>>>> Also, are there any other parameters that are required when
>>>>>> filtering
>>>>>>>>>>>> based on AED score?
>>>>>>>>>>>>  Hello Carson,
>>>>>>>>>>>>  Just to confirm, Is there a script that would filter gene
>>>>>> models
>>>>>>>>>>>> at
>>>>>>>>>>>> specific AED score.
>>>>>>>>>>>>  Alternatively if I were to do this within maker with regards
>>>>>> to
>>>>>>>>>>>> parameters
>>>>>>>>>>>>  in maker_opts.ctl file I would have to provide my predicted
>>>>>>>>>>>> genes
>>>>>>>>>>>> gff3
>>>>>>>>>>>>  file to model_gff and  set AED_threshold at desired
>>>>>>>>>>>>threshold?
>>>>>>>>>>>> Thanks and regards,
>>>>>>>>>>>>  Parul Kudtarkar
>>>>>>>>>>>>  AED score with 1 are the ones you don't want.  0 is best and
>>>>>> 1
>>>>>>>>>>>>  is
>>>>>>>>>>>> worst
>>>>>>>>>>>>  as
>>>>>>>>>>>>  it is a distance metric.  You can use the AED_threshold
>>>>>> parameter
>>>>>>>>>>>> to
>>>>>>>>>>>>  require better matching to the evidence by setting it closer
>>>>>> to
>>>>>>>>>>>> 0.
>>>>>>>>>>>> You
>>>>>>>>>>>>  can
>>>>>>>>>>>>  also try to increase protein homology evidence as some of
>>>>>> your
>>>>>>>>>>>> calls
>>>>>>>>>>>> may
>>>>>>>>>>>>  be split genes due to lack of evidence linking them.
>>>>>>>>>>>>  --Carson
>>>>>>>>>>>>  On 12-11-26 4:35 PM, "Parul Kudtarkar" <parulk at caltech.edu>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> Dear Maker community,
>>>>>>>>>>>> For gene-prediction I get training data-set from evidence
>>>>>>>>>>>>  based
>>>>>>>>>>>> prediction, I use this data-set to train SNAP as well as
>>>>>>>>>>>>Augustus
>>>>>> predictions, followed by boot-strapping. I would typically expect
>>>>>> 20-30K
>>>>>>>>>>>> genes however I am getting 8 times the expected gene count
>>>>>>>>>>>>  indicating
>>>>>>>>>>>>  too
>>>>>>>>>>>> many false positives. Is there a way to further refine these
>>>>>>>>>>>> predication/script to retain predictions with AED score 1 and
>>>>>>>>>>>> if
>>>>>>>>>>>> yes
>>>>>>>>>>>> how
>>>>>>>>>>>> to go about this?
>>>>>>>>>>>> Thanks and regards,
>>>>>>>>>>>> Parul Kudtarkar
>>>>>>>>>>>> --
>>>>>>>>>>>> Scientific Programmer
>>>>>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>>>>>> Beckman Institute,
>>>>>>>>>>>> California Institute of Technology
>>>>>>>>>>>> http://www.spbase.org
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> maker-devel mailing list
>>>>>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>>>>>> 
>>>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell
>>>>>> -l
>>>>>>>>>>>> ab
>>>>>>>>>>>> .o
>>>>>>>>>>>> rg
>>>>>>>>>>>>  --
>>>>>>>>>>>>  Scientific Programmer
>>>>>>>>>>>>  Center for Computational Regulatory Genomics
>>>>>>>>>>>>  Beckman Institute,
>>>>>>>>>>>>  California Institute of Technology
>>>>>>>>>>>>  http://www.spbase.org
>>>>>>>>>>>> --
>>>>>>>>>>>> Scientific Programmer
>>>>>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>>>>>> Beckman Institute,
>>>>>>>>>>>> California Institute of Technology
>>>>>>>>>>>> http://www.spbase.org
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> maker-devel mailing list
>>>>>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>>>>>>
>>>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell
>>>>>>>>>>>>-l
>>>>>>>>>>>>a
>>>>>> b.
>>>>>>>>>>>> or
>>>>>>>>>>>> g
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> maker-devel mailing list
>>>>>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>>>>>>
>>>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell
>>>>>>>>>>>>-l
>>>>>>>>>>>>a
>>>>>> b.
>>>>>>>>>>>> or
>>>>>>>>>>>> g
>>>>>>>>>>>>  _______________________________________________
>>>>>>>>>>>>  maker-devel mailing list
>>>>>>>>>>>>  maker-devel at box290.bluehost.com
>>>>>>>>>>>>
>>>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell
>>>>>>>>>>>>-l
>>>>>>>>>>>>ab
>>>>>> .o
>>>>>>>>>>>> rg
>>>>>>>>>>>> --
>>>>>>>>>>>> Scientific Programmer
>>>>>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>>>>>> Beckman Institute,
>>>>>>>>>>>> California Institute of Technology
>>>>>>>>>>>> http://www.spbase.org
>>>>>>>>>>>> --
>>>>>>>>>>>> Scientific Programmer
>>>>>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>>>>>> Beckman Institute,
>>>>>>>>>>>> California Institute of Technology
>>>>>>>>>>>> http://www.spbase.org
>>>>>>>>>> --
>>>>>>>>>> Scientific Programmer
>>>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>>>> Beckman Institute,
>>>>>>>>>> California Institute of Technology
>>>>>>>>>> http://www.spbase.org
>>>>>>>> --
>>>>>>>> Scientific Programmer
>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>> Beckman Institute,
>>>>>>>> California Institute of Technology
>>>>>>>> http://www.spbase.org
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Scientific Programmer
>>>>>> Center for Computational Regulatory Genomics
>>>>>> Beckman Institute,
>>>>>> California Institute of Technology
>>>>>> http://www.spbase.org
>>>>>>
>>>>>> --
>>>>>> Scientific Programmer
>>>>>> Center for Computational Regulatory Genomics
>>>>>> Beckman Institute,
>>>>>> California Institute of Technology
>>>>>> http://www.spbase.org
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Scientific Programmer
>>>> Center for Computational Regulatory Genomics
>>>> Beckman Institute,
>>>> California Institute of Technology
>>>> http://www.spbase.org
>>>>
>>>
>>>
>>>--
>>>Scientific Programmer
>>>Center for Computational Regulatory Genomics
>>>Beckman Institute,
>>>California Institute of Technology
>>>http://www.spbase.org
>>>
>>
>>
>>
>
>
>--
>Scientific Programmer
>Center for Computational Regulatory Genomics
>Beckman Institute,
>California Institute of Technology
>http://www.spbase.org
>