[maker-devel] AED score

Fri Dec 14 14:17:55 MST 2012

You can either train with cegma first and then train a second time with
the protein2genome calls, or combine the cegma and protein2genome ZFF
files (from maker2zff and cegma2zff), then use that one file to generate
the training data for augustus.  Augustus allows you to keep training, so
rather than training a new species each time, you can specify an existing
species you trained before, and then refine the existing model using the
additional training data.

--Carson

On 12-12-13 5:16 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:

>Dear Carson,
>
>Could you please advice how to combine results from protein2genome
>derived models and cegma based training to be provided as training species
>for augustus or are theses supposed to be run as separate maker2 run(i.e.
>run maker2 with training set from protein2genome first followed by
>cegma-based training set)
>
>Thanks and regards,
>Parul
>
>> Thanks Parul.  If you need any more help, just let us know.
>>
>> --Carson
>>
>>
>>
>> On 12-12-10 10:28 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>
>>>Dear Carson, Thanks a lot for detailed explanation and all the help with
>>>running and understanding maker2.
>>>
>>>
>>>> Yes.  Lamprey is not a good match for sea urchin.  When training
>>>>Augusuts
>>>> you can sometimes use other genomes as starting points and let
>>>>Augustus
>>>> then modify that species to match the models you provided, which can
>>>>be
>>>> better than de novo training under certain circumstances. But given
>>>>sea
>>>> urchin's evolutionary distance from all the other species bundled with
>>>> augustus, its probably not a good starting point.  So protein2genome
>>>> derived models and cegma based training are probably your best
>>>>options.
>>>>
>>>> --Carson
>>>>
>>>>
>>>> On 12-12-07 8:09 PM, "Parul Kudtarkar" <parulk at caltech.edu> wrote:
>>>>
>>>>>Sorry for not checking this before, just to answer my own question I
>>>>>just
>>>>>found after reading augustus documentation that autoAugTrain.pl should
>>>>>work for generating training dataset for Augustus from using the
>>>>>gene-model hints from first pass run of maker and using
>>>>>http://sourceforge.net/mailarchive/message.php?msg_id=29361270 script
>>>>> to
>>>>>generate training.gb
>>>>>
>>>>>autoAugTrain.pl [OPTIONS] --species=sname --trainingset=training.gb
>>>>>
>>>>>I believe should work better than using lamprey to train augustus.
>>>>>
>>>>>Thanks and regards,
>>>>>Parul Kudtarkar
>>>>>
>>>>>> Dear Carson,
>>>>>>
>>>>>> Thanks for detailed explanation and help. Now we know the exact
>>>>>>parameters
>>>>>> that should help us with generating good gene-model.
>>>>>>
>>>>>> The genome for which we are working on gene predictions is
>>>>>> Echinoderm(sea-urchin). lamprey was the closest organism for
>>>>>>training
>>>>>> augustus that I could find. A quick question for training augustus,
>>>>>>there
>>>>>> is augustus_species option how would you go from training data
>>>>>> generated
>>>>>> by zff2augustus_gbk.pl > train.gb as specified here
>>>>>> http://sourceforge.net/mailarchive/message.php?msg_id=29361270 to
>>>>>> generating the species folder that can be specified to
>>>>>>augustus_species
>>>>>> option.
>>>>>>
>>>>>> The average expected gene-length in our case is ~15kb.
>>>>>> We also have a good repeat library for our genome.
>>>>>>
>>>>>> Thanks and regards,
>>>>>> Parul Kudtarkar
>>>>>>
>>>>>>> For step 2 and 3, use lamprey rather than human in augustus.  For
>>>>>>>some
>>>>>>> reason the current version of augustus doesn't diplay it as an
>>>>>>> option
>>>>>>> for
>>>>>>> the species help menu (so I didn't see it), but it's there in
>>>>>>> �/augustus/config/species/lamprey
>>>>>>>
>>>>>>> For any additional training, make copies into
>>>>>>> �/augustus/config/species/lamprey2 and
>>>>>>> �/augustus/config/species/lamprey3
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Carson
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> From:  Carson Holt <carsonhh at gmail.com>
>>>>>>> Date:  Friday, 7 December, 2012 10:22 AM
>>>>>>> To:  Daniel Ence <dence at genetics.utah.edu>,
>>>>>>> "maker-devel at yandell-lab.org"
>>>>>>> <maker-devel at yandell-lab.org>, Parul Kudtarkar <parulk at caltech.edu>
>>>>>>> Subject:  Re: [maker-devel] AED score
>>>>>>>
>>>>>>> Just to add to Daniels comments.
>>>>>>>
>>>>>>> Things to change in step 1:
>>>>>>>> protein= <just try and add more protein evidence in general>
>>>>>>>> est2genome=0
>>>>>>>> protein2genome=1
>>>>>>>> split_hit=20000
>>>>>>>> min_contig=50000
>>>>>>>
>>>>>>> Reasoning:
>>>>>>>> Your ESTs are very short especially if this is a lamprey species
>>>>>>>> which
>>>>>>>> have
>>>>>>>> very long introns and really short exons.  In lamprey (i.e.
>>>>>>>> Petromyzon
>>>>>>>> marinus), genes tend to be very long (remember gene lengths
>>>>>>>>include
>>>>>>>> introns
>>>>>>>> and UTR and is not just the size of the coding sequence), so
>>>>>>>> contigs
>>>>>>>> shorter
>>>>>>>> than 50kb are useless for training as you are unlikely to get nice
>>>>>>>> complete
>>>>>>>> gene models on those.  Also lampreys have very long introns, so
>>>>>>>>you
>>>>>>>> have
>>>>>>>> to
>>>>>>>> allow for bigger introns in alignments (split_hit parameter).
>>>>>>>> Finally
>>>>>>>> add as
>>>>>>>> much protein evidence from as many sources as possible.  Your
>>>>>>>>maker
>>>>>>>> training
>>>>>>>> run will take a long time as proteins take forever to align, but
>>>>>>>> because
>>>>>>>> of
>>>>>>>> the evolutionary distance of lamprey from everything else and the
>>>>>>>>short
>>>>>>>> exon
>>>>>>>> structure of its genome, very little aligns directly to its genome
>>>>>>>>from
>>>>>>>> other
>>>>>>>> deuterstome and vertebrate species.  I'm assuming this is a
>>>>>>>>lamprey
>>>>>>>> species
>>>>>>>> because of what you said about the  augustus species file you are
>>>>>>>> using.
>>>>>>>> Really the only thing closely related to lampreys unfortunately
>>>>>>>>are
>>>>>>>> other
>>>>>>>> lampreys.  Lancelets, hagfish, and sharks are not closely related
>>>>>>>> to
>>>>>>>> lamprey
>>>>>>>> (while they branch closely together on the tree of life, there are
>>>>>>>> too
>>>>>>>> many
>>>>>>>> years since the last common ancestor).  So while they may have
>>>>>>>> similar
>>>>>>>> issues
>>>>>>>> related to annotation (long introns and short exons etc.) they
>>>>>>>>will
>>>>>>>>not
>>>>>>>> really
>>>>>>>> match that well for the gene predictor or even protein alignments.
>>>>>>>
>>>>>>> Additional note:
>>>>>>>> I have training files for the lamprey species Petromyzon marinus
>>>>>>>> for
>>>>>>>> both
>>>>>>>> Augustus and SNAP that I could share with you in a few week, when
>>>>>>>>the
>>>>>>>> genome
>>>>>>>> publication is is released.  But before that happens, new gene
>>>>>>>>models
>>>>>>>> will be
>>>>>>>> available  through the UCSC browser (hopefully within a couple of
>>>>>>>> weeks), and
>>>>>>>> gene models are already available through ENSEMBL.  Get those
>>>>>>>>protein
>>>>>>>> files
>>>>>>>> for training, it may be a big help for you.  If you want early
>>>>>>>>access
>>>>>>>> to
>>>>>>>> the
>>>>>>>> lamprey training files for Augustus and SNAP, you would have to
>>>>>>>>request
>>>>>>>> it
>>>>>>>> from Weiming Li at Michigan State University (the head of that
>>>>>>>>genome
>>>>>>>> project).
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Things to change in step 2:
>>>>>>>> Optimally you would be doing de novo training using mRNAseq
>>>>>>>> results,
>>>>>>>> but
>>>>>>>> with
>>>>>>>> on;ly sparse protein alignments and such a fragmented assembly,
>>>>>>>>you
>>>>>>>>are
>>>>>>>> probably better off just trying to adapt the human HMM files.
>>>>>>>>They
>>>>>>>> won't
>>>>>>>> match that well, but you probably won't have the evidence for De
>>>>>>>>Novo
>>>>>>>> training.  First make a copy of the augustus human species
>>>>>>>> directory
>>>>>>>> and
>>>>>>>> rename it to lamprey (cp -R  �/augustus/config/species/human
>>>>>>>> �/augustus/config/species/lamprey).  Use it as the base species
>>>>>>>>for
>>>>>>>> retraining
>>>>>>>> augustus using your new models.  You will have to edit multiple
>>>>>>>>files
>>>>>>>> in
>>>>>>>> the
>>>>>>>> directory after you copy it so that they no longer say human or
>>>>>>>> homo
>>>>>>>> sapiens
>>>>>>>> internally or in the file name.  Use maker2zff to generate the
>>>>>>>>filtered
>>>>>>>> ZFF
>>>>>>>> file for training SNAP, but don't train SNAP.  Rather use the
>>>>>>>> training
>>>>>>>> file to
>>>>>>>> better train Augustus info here (just ignore the CEGMA part) -->
>>>>>>>> http://sourceforge.net/mailarchive/message.php?msg_id=29361270
>>>>>>>>
>>>>>>>> MAKE a backup of the step 1 maker output directory and run step 2
>>>>>>>> in
>>>>>>>> the
>>>>>>>> old
>>>>>>>> step 1 directory (this allows you to change the parameters and
>>>>>>>> reuse
>>>>>>>> files
>>>>>>>> form step 1 so you don't have to recalculate all the protein and
>>>>>>>> EST
>>>>>>>> alignments).  So control files for step 2 are identical to step 1
>>>>>>>> except
>>>>>>>> for
>>>>>>>> these parameters.
>>>>>>>>
>>>>>>>> protein2genome=0
>>>>>>>> augustus_species=lamprey
>>>>>>>> snaphmm=lamprey.hmm #optional if you decide to use SNAP
>>>>>>>>
>>>>>>>> Don't both training SNAP here as you probably won't have enough
>>>>>>>> data
>>>>>>>> and
>>>>>>>> you
>>>>>>>> assembly is too fragmented for it to work well, so just stick to
>>>>>>>> augustus.
>>>>>>>> Try SNAP if you want just to see how well it works.  Manually open
>>>>>>>>up
>>>>>>>> the
>>>>>>>> largest contigs in a viewer to look at the models produced from
>>>>>>>>the
>>>>>>>> MAKER run
>>>>>>>> to see if they look reasonable (this will also help you decide
>>>>>>>> whether
>>>>>>>> to keep
>>>>>>>> SNAP).
>>>>>>>>
>>>>>>>
>>>>>>> Things to change in step 3:
>>>>>>>> Step 3 should just be a clone of step 2 as it is bootstrapping.
>>>>>>>> But
>>>>>>>> make
>>>>>>>> copies of �/augustus/config/species/lamprey and save it to
>>>>>>>> �/augustus/config/species/lamprey2 (editing all the files and
>>>>>>>>names
>>>>>>>> as
>>>>>>>> you did
>>>>>>>> in step 2). This way you don't loose that training data if you
>>>>>>>>decide
>>>>>>>> to
>>>>>>>> step
>>>>>>>> back.  Also Give your SNAP HMM a new name (I.e. lamprey2.hmm)
>>>>>>>>
>>>>>>>> augustus_species=lamprey2
>>>>>>>> snaphmm=lamprey2.hmm  #optional if you decide to use SNAP
>>>>>>>>
>>>>>>>> Make a backup of Step 2 and run step 3 in the old Step 2 directory
>>>>>>>> (This
>>>>>>>> is
>>>>>>>> for file reuse, so the step will run fast).  This must be the
>>>>>>>>exact
>>>>>>>> same
>>>>>>>> step
>>>>>>>> directory as step 2 for the reuse trick to work.
>>>>>>>>
>>>>>>>> Manually review the models and if you are satisfied move to step
>>>>>>>>4.
>>>>>>>> Also note
>>>>>>>> that most parameters including the protein, EST, and repeats
>>>>>>>>should
>>>>>>>>not
>>>>>>>> change
>>>>>>>> from step1-step3, and should not be removed for step 4 either, you
>>>>>>>> can
>>>>>>>> add
>>>>>>>> more evidence, but don't remove evidence (like the repeats).
>>>>>>>>
>>>>>>>>
>>>>>>> Things to change in step 4:
>>>>>>>> For this step, just set min_contig=10000 and rerun MAKER inside
>>>>>>>>the
>>>>>>>> step
>>>>>>>> 3
>>>>>>>> directory to get the smaller contigs annotated.  This should be
>>>>>>>> your
>>>>>>>> final
>>>>>>>> step, although you can try altering other parameters or adding
>>>>>>>>more
>>>>>>>> evidence
>>>>>>>> sources here etc.
>>>>>>>>
>>>>>>>>
>>>>>>> For other things to keep in mind, you should consider taking extra
>>>>>>> time
>>>>>>> to
>>>>>>> build a comprehensive library of repeats for your species, I know
>>>>>>>that
>>>>>>> Petromyzon marinus was virtually unannotatable until we had a very
>>>>>>> deep
>>>>>>> repeat library built for it.  Any repeat library should be used in
>>>>>>>all
>>>>>>> steps.  Also for lamprey, you will expect between 20,000-30,000
>>>>>>> genes
>>>>>>> both
>>>>>>> because of your assemblies fragmentation and because of some
>>>>>>>ancestral
>>>>>>> genome duplication.  Also be aware that lampreys appear to undergo
>>>>>>> programmed genome loss in somatic tissues, so any gene count you
>>>>>>>get
>>>>>>> is
>>>>>>> only
>>>>>>> going to represent a maximum of 75-80% of all genes unless the
>>>>>>> assembly
>>>>>>> is
>>>>>>> derived from germline tissue.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Carson
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>> On 12-12-06 2:17 PM, "Daniel Ence" <dence at genetics.utah.edu> wrote:
>>>>>>>
>>>>>>>> Hi Parul,
>>>>>>>>
>>>>>>>> In step one, if protein2genome isn't turned on, then the protein
>>>>>>>> alignments
>>>>>>>> wont be used to generate gene models. Carson said before to use
>>>>>>>> protein2genome
>>>>>>>> to generate gene models for your trainings set, but you should
>>>>>>>>know
>>>>>>>> that
>>>>>>>> those
>>>>>>>> data aren't being used right now.
>>>>>>>>
>>>>>>>> In step 2, you can include the est and protein evidence that you
>>>>>>>>used
>>>>>>>> in
>>>>>>>> step
>>>>>>>> one, and the ab-initio predictors will takes those data as hints
>>>>>>>>to
>>>>>>>> guide
>>>>>>>> their predictions. Also, are you reannotating human here? I'm just
>>>>>>>> wondering
>>>>>>>> why the snap file is called Pult, but the augustus species model
>>>>>>>>is
>>>>>>>> human.
>>>>>>>> Also, I think you should include the same masking files from step
>>>>>>>> 1,
>>>>>>>> otherwise
>>>>>>>> the ab-initio predictors will be predicting on the unmasked
>>>>>>>> sequence
>>>>>>>> which
>>>>>>>> will give you many spurious predictions.
>>>>>>>>
>>>>>>>> In step 3, I'm not certain what you mean by boot-strapping. The
>>>>>>>>control
>>>>>>>> file
>>>>>>>> you sent for step 3 will just pass-through all of the data from
>>>>>>>>before.
>>>>>>>>
>>>>>>>> In step 4, I think what you'll get is pretty close to what you
>>>>>>>> already
>>>>>>>> had in
>>>>>>>> step 3. The gene models from step 3 are already based on the est
>>>>>>>> and
>>>>>>>> protein
>>>>>>>> data from step 1, so without giving any different evidence or
>>>>>>>>ab-initio
>>>>>>>> predictors, I think that you will just get the same gene models
>>>>>>>> that
>>>>>>>> you
>>>>>>>> got
>>>>>>>> from step 3. Those gene models are what you should be using to
>>>>>>>> train
>>>>>>>> snap.
>>>>>>>> Also, is this the same genome as in the other steps? The genome
>>>>>>>> file
>>>>>>>> here is
>>>>>>>> called genome.linear.fa, but before it was called genome.fa.
>>>>>>>>
>>>>>>>> Step 5. So maker uses both evidence and ab-initio predictions to
>>>>>>>>create
>>>>>>>> models. If you don't give any evidence (EST or protein) maker will
>>>>>>>> not
>>>>>>>> annotate any gene models. You have also changed the augustus
>>>>>>>> species
>>>>>>>> model in
>>>>>>>> this step, but I don't know what you've gained by going from one
>>>>>>>> species
>>>>>>>> to
>>>>>>>> another. You should be training augustus with the gene models and
>>>>>>>> then
>>>>>>>> creating a new species model in augustus. As I understand it, the
>>>>>>>> filter
>>>>>>>> only
>>>>>>>> operates on gene models, not ab-initio predictions or alignments,
>>>>>>>> so
>>>>>>>>it
>>>>>>>> probably isn't doing anything the way you have it set.
>>>>>>>>
>>>>>>>> Step 6. I think step 5 and 6 should be combined. I don't know what
>>>>>>>> you
>>>>>>>> mean by
>>>>>>>> boot-strapping here too.
>>>>>>>>
>>>>>>>> Hopefully that clears up some of the confusion with how maker
>>>>>>>> works.
>>>>>>>> Carson
>>>>>>>> will probably have a lot of suggestions too.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Daniel
>>>>>>>>
>>>>>>>> Daniel Ence
>>>>>>>> Graduate Student
>>>>>>>> Eccles Institute of Human Genetics
>>>>>>>> University of Utah
>>>>>>>> 15 North 2030 East, Room 2100
>>>>>>>> Salt Lake City, UT 84112-5330
>>>>>>>> ________________________________________
>>>>>>>> From: Parul Kudtarkar [parulk at caltech.edu]
>>>>>>>> Sent: Thursday, December 06, 2012 10:08 AM
>>>>>>>> To: carsonhh at gmail.com
>>>>>>>> Cc: Daniel Ence; maker-devel at yandell-lab.org
>>>>>>>> Subject: Re: [maker-devel] AED score
>>>>>>>>
>>>>>>>> Hi Carson,
>>>>>>>>
>>>>>>>> Once again thanks for a quick response. We expect to get 25,000 to
>>>>>>>> 30,000
>>>>>>>> genes.
>>>>>>>> Here are the step
>>>>>>>> Step1. EST and proteins are used to predict gene-models. the
>>>>>>>> resulting
>>>>>>>> file is used as training data-set for SNAP in step2. est2genome is
>>>>>>>> turned
>>>>>>>> ON
>>>>>>>> Step2. Augustus and SNAP is used to predict genes.
>>>>>>>> Step3. The results are re-annoated(boot-strap)
>>>>>>>> Step4. EST and proteins are used to predict gene-models. The gff
>>>>>>>>file
>>>>>>>> from
>>>>>>>> step3. is used as model_gff. The resulting file is used as
>>>>>>>>training
>>>>>>>> data-set for SNAP in step2
>>>>>>>> Step 5. Augustus and SNAP is used to predict genes. with AED set
>>>>>>>>to
>>>>>>>>0.5
>>>>>>>> Step6. The results are re-annoated(boot-strap) with AED set to 0.5
>>>>>>>> and
>>>>>>>> contig_size to 10kb
>>>>>>>>
>>>>>>>> Thanks and regards,
>>>>>>>> Parul Kudtarkar
>>>>>>>>
>>>>>>>> I have attached the configuration file for each step.
>>>>>>>>>  Your output has no genes.  They've all been filtered out.  The
>>>>>>>>>gene
>>>>>>>> predictions are left for reference purposes, but there are no gene
>>>>>>>> models
>>>>>>>>>  in the file.  You need to look at the type columns in the GFF3
>>>>>>>>>file
>>>>>>>>> -->
>>>>>>>> match/match_part features are evidence and reference data but not
>>>>>>>> models.
>>>>>>>>>  All models will have types gene/mRNA/exon/CDS.  For example if
>>>>>>>>> you
>>>>>>>>> just
>>>>>>>> want gene models in your file you can use the gff3_merge script
>>>>>>>> with
>>>>>>>> the
>>>>>>>> -g option, and it will only print out the gene models.
>>>>>>>>>  I think you may be misinterpreting what is happening at
>>>>>>>>>different
>>>>>>>>> steps,
>>>>>>>> as well as how to read the result files.  Could you give me a
>>>>>>>> detailed
>>>>>>>> explanation of what you expect to get back together with your
>>>>>>>>control
>>>>>>>> files and I can walk you through the configuration, and indicate
>>>>>>>>what
>>>>>>>> to
>>>>>>>> expect.  Also what was the report like from CEGMA. Could you
>>>>>>>> include
>>>>>>>> the
>>>>>>>> report file that shows how complete your genome is and how
>>>>>>>>fragmented
>>>>>>>> it
>>>>>>>> is?
>>>>>>>>>  Thanks,
>>>>>>>>>  Carson
>>>>>>>>>  On 12-12-05 3:03 PM, "Parul Kudtarkar" <parulk at caltech.edu>
>>>>>>>>> wrote:
>>>>>>>>>> Dear Carson,
>>>>>>>>>> Thanks for a quick response. keep_preds is set to 0
>>>>>>>>>> Though for previous step(ab-intio gene predictions keep_preds
>>>>>>>>>>was
>>>>>>>>>>set
>>>>>>>>>> to
>>>>>>>> 1, see Scaffold1_input.gff). I have also attached the output file
>>>>>>>> Scaffold1_out.gff.
>>>>>>>>>> Please advice.
>>>>>>>>>> Thanks and regards,
>>>>>>>>>> Parul Kudtarkar
>>>>>>>>>>>  You should always get all predictions, but should only get
>>>>>>>>>>>models
>>>>>>>> (match/match_part) with AED scores less than 0.5.  You should
>>>>>>>>never
>>>>>>>>get
>>>>>>>>>>>  models (gene/mRNA/CDS) with an AED score of 1.00 unless you
>>>>>>>>>>> have
>>>>>>>> keep_preds set.
>>>>>>>>>>>  Do you have keeps_preds set or gene models with AED values
>>>>>>>>>>> above
>>>>>>>>>>> your
>>>>>>>> threshold (not match/match_part features)?  If the first one is
>>>>>>>>the
>>>>>>>>>>> case,
>>>>>>>>>>>  just set it to 0.  If the second one in the case, could you
>>>>>>>>>>> send
>>>>>>>>>>>me
>>>>>>>> example GFF3?
>>>>>>>>>>>  Thanks,
>>>>>>>>>>>  Carson
>>>>>>>>>>>  On 12-12-04 7:27 PM, "Parul Kudtarkar" <parulk at caltech.edu>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> Dear Carson,
>>>>>>>>>>>> Thanks once again, we have limited experimental data with very
>>>>>>>>>>>> short
>>>>>>>>>>>>  ESTs.
>>>>>>>>>>>> CEGMA is useful for us to gauge our gene-model.
>>>>>>>>>>>> On a different note to re-annotate genome(post evidence based
>>>>>>>>>>>> prediction(used as training dataset)and abinitio
>>>>>>>>>>>> gene-prediction).
>>>>>>>> Here
>>>>>>>>>>>> are the control parameters I am using with AED score set to
>>>>>>>>>>>> 0.5,
>>>>>>>>>>>>  however
>>>>>>>>>>>>  I
>>>>>>>>>>>> get predictions that includes the ones with AED score of 1.00
>>>>>>>>>>>> in
>>>>>>>>>>>>  resulting
>>>>>>>>>>>> gff3 file. Though I do see the number of genes reduced to 1/3
>>>>>>>>>>>> of
>>>>>>>>>>>>  initial
>>>>>>>>>>>> gff3 file.
>>>>>>>>>>>> #-----Genome (Required for De-Novo Annotation)
>>>>>>>>>>>> genome=Scaffold1.fa #genome sequence file in fasta format
>>>>>>>>>>>> organism_type=eukaryotic #eukaryotic or prokaryotic. Default
>>>>>>>>>>>>is
>>>>>>>>>>>>  eukaryotic
>>>>>>>>>>>> #-----Re-annotation Using MAKER Derived GFF3
>>>>>>>>>>>> genome_gff= Scaffold1.gff #re-annotate genome based on this
>>>>>>>>>>>> gff3
>>>>>>>>>>>> file
>>>>>>>> est_pass=1 #use ests in genome_gff: 1 = yes, 0 = no
>>>>>>>>>>>> altest_pass=0 #use alternate organism ests in genome_gff: 1 =
>>>>>>>>>>>> yes,
>>>>>>>>>>>> 0
>>>>>>>>>>>> =
>>>>>>>> no
>>>>>>>>>>>> protein_pass=1 #use proteins in genome_gff: 1 = yes, 0 = no
>>>>>>>>>>>> rm_pass=0 #use repeats in genome_gff: 1 = yes, 0 = no
>>>>>>>>>>>> model_pass=1 #use gene models in genome_gff: 1 = yes, 0 = no
>>>>>>>>>>>> pred_pass=1 #use ab-initio predictions in genome_gff: 1 = yes,
>>>>>>>>>>>> 0
>>>>>>>>>>>> =
>>>>>>>>>>>> no
>>>>>>>> other_pass=0 #passthrough everything else in genome_gff: 1 = yes,
>>>>>>>>0
>>>>>>>>=
>>>>>>>>>>>>  no
>>>>>>>>>>>> #-----MAKER Behavior Options
>>>>>>>>>>>> AED_threshold=0.5 #Maximum Annotation Edit Distance allowed
>>>>>>>>>>>> (bound
>>>>>>>>>>>> by
>>>>>>>> 0
>>>>>>>>>>>> and 1)
>>>>>>>>>>>> Thanks and regards,
>>>>>>>>>>>> Parul Kudtarkar
>>>>>>>>>>>>>  Wow 330,000 is a lot. a large portion of genes are likely to
>>>>>>>>>>>>>be
>>>>>>>>>>>>> partial
>>>>>>>>>>>>> at
>>>>>>>>>>>>>  best.  You should seriously consider using mRNAseq to
>>>>>>>>>>>>>capture
>>>>>>>>>>>>> those
>>>>>>>> by
>>>>>>>>>>>>>  using maker's est_gff option to pass in results from
>>>>>>>>>>>>> cufflinks
>>>>>>>>>>>>>or
>>>>>>>>>>>>> trinity.
>>>>>>>>>>>>>   Also I wouldn't even try to annotate contigs less than 10kb
>>>>>>>>>>>>>in
>>>>>>>> size,
>>>>>>>>>>>>> just
>>>>>>>>>>>>>  have maker skip them by setting the min_contig filter in the
>>>>>>>> maker_opts.ctl file.
>>>>>>>>>>>>>  Thanks,
>>>>>>>>>>>>>  Carson
>>>>>>>>>>>>>  On 12-11-29 7:31 PM, "Parul Kudtarkar" <parulk at caltech.edu>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> Thanks for the guidance Carson, total contig size is 330,611
>>>>>>>>>>>>>>with
>>>>>>>> N50
>>>>>>>>>>>>>>  of
>>>>>>>>>>>>>> 39.17kb. I agree we have short ESTs. So this is the possible
>>>>>>>>>>>>>> reason
>>>>>>>>>>>>>>  when
>>>>>>>>>>>>>> filtering based on AED score 0.75 there are no gene models
>>>>>>>>>>>>>> predicted
>>>>>>>> despite the model_gff file has few genes with scores less than
>>>>>>>> 0.75?
>>>>>>>> Thanks and regards,
>>>>>>>>>>>>>> Parul Kudtarkar
>>>>>>>>>>>>>>  There are certain characteristics that are apparent in this
>>>>>>>> contig.
>>>>>>>>>>>>>> First
>>>>>>>>>>>>>>  it seems to be repeat rich with a very low gene density.
>>>>>>>>>>>>>> You
>>>>>>>>>>>>>> also
>>>>>>>>>>>>>> have
>>>>>>>>>>>>>> very short ESTs, and because of the lengths you are probably
>>>>>>>>>>>>>> getting
>>>>>>>> many
>>>>>>>>>>>>>>  of them to align spuriously which produces very short gene
>>>>>>>>>>>>>> models
>>>>>>>> that
>>>>>>>>>>>>>> are
>>>>>>>>>>>>>>  more than likely false positives or at the very least just
>>>>>>>>>>>>>>a
>>>>>>>>>>>>>> piece
>>>>>>>>>>>>>> of
>>>>>>>>>>>>>> a
>>>>>>>>>>>>>> gene.  I would turn off est2genome as a predictor for this
>>>>>>>>>>>>>>reason
>>>>>>>>>>>>>>  unless
>>>>>>>>>>>>>> you can get longer EST assemblies (i.e. From mRNAseq).
>>>>>>>>>>>>>> Your
>>>>>>>>>>>>>>  protein
>>>>>>>>>>>>>> alignments also seem to be few and far between.  You
>>>>>>>>>>>>>>probably
>>>>>>>>>>>>>> need
>>>>>>>> to
>>>>>>>>>>>>>> add
>>>>>>>>>>>>>>  more proteins from a couple of related species, and you
>>>>>>>>>>>>>> might
>>>>>>>> consider
>>>>>>>>>>>>>> using protein2genome rather than est2genome as a predictor
>>>>>>>>>>>>>>if
>>>>>>>>>>>>>>you
>>>>>>>> are
>>>>>>>>>>>>>> still working to generate a training set. Also est2genome
>>>>>>>>>>>>>> produced
>>>>>>>> models
>>>>>>>>>>>>>>  almost always have an AED score near 0 so mixing est2genome
>>>>>>>>>>>>>>with
>>>>>>>> the
>>>>>>>>>>>>>> AED_threshold with such limited protein support does create
>>>>>>>>>>>>>> an
>>>>>>>> artificial
>>>>>>>>>>>>>>  bias to get back very short and incomplete models.
>>>>>>>>>>>>>>  How many contigs do you have in total and what is the N50
>>>>>>>>>>>>>> value
>>>>>>>> for
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> assembly? If you have a large number of very short contigs,
>>>>>>>>>>>>>>you
>>>>>>>>>>>>>> will
>>>>>>>>>>>>>>  get
>>>>>>>>>>>>>> very inflated gene counts because you get genes split across
>>>>>>>>>>>>>> contigs
>>>>>>>>>>>>>>  and
>>>>>>>>>>>>>> many contigs tend t be subtle rearrangements of other
>>>>>>>>>>>>>>contigs
>>>>>>>>>>>>>> just
>>>>>>>> assembled in a slightly different way (so you can get bits and
>>>>>>>> pieces
>>>>>>>>>>>>>>  of
>>>>>>>>>>>>>> the same genes just rearranged).  This scenario is another
>>>>>>>>>>>>>>  confounding
>>>>>>>>>>>>>> factor if using the est2genome predictor with short ESTs.  I
>>>>>>>>>>>>>> would
>>>>>>>> recommend running CEGMA to get an estimate for the genome
>>>>>>>>>>>>>>  completeness
>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>  well as get an estimate of fragmentation as one of the
>>>>>>>>>>>>>> statistics
>>>>>>>>>>>>>> produced
>>>>>>>>>>>>>>  is a percent of genes that are found complete (end to end)
>>>>>>>>>>>>>> vs
>>>>>>>> those
>>>>>>>>>>>>>>  that
>>>>>>>>>>>>>> are partial.  CEGMA identifies house keeping genes that tend
>>>>>>>>>>>>>>to
>>>>>>>>>>>>>> be
>>>>>>>> shorter
>>>>>>>>>>>>>>  and less intron rich than other genes in the genome, so if
>>>>>>>>>>>>>>CEGMA
>>>>>>>> gives
>>>>>>>>>>>>>>  a
>>>>>>>>>>>>>> high partial percentage and a low complete percentage, then
>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>  pattern
>>>>>>>>>>>>>> can be expected to be even more exaggerated for other genes
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>> the
>>>>>>>> genome.
>>>>>>>>>>>>>>  If your genome is highly fragmented or proteins do not
>>>>>>>>>>>>>>align
>>>>>>>>>>>>>> well
>>>>>>>> then
>>>>>>>>>>>>>> there are other strategies.  For example, some vertebrate
>>>>>>>>>>>>>>genomes
>>>>>>>> end
>>>>>>>>>>>>>>  up
>>>>>>>>>>>>>> having extremely fragmented assemblies (on the order of
>>>>>>>>>>>>>>100,000
>>>>>>>> contigs),
>>>>>>>>>>>>>>  and if they are distantly related to other annotated
>>>>>>>>>>>>>>species
>>>>>>>>>>>>>>few
>>>>>>>>>>>>>> proteins
>>>>>>>>>>>>>>  may align to the contigs because the introns in the
>>>>>>>>>>>>>>alignments
>>>>>>>> tend
>>>>>>>>>>>>>>  to
>>>>>>>>>>>>>> be
>>>>>>>>>>>>>>  so long and exons so short that it pushes down the
>>>>>>>>>>>>>> significance
>>>>>>>> scores
>>>>>>>>>>>>>> too
>>>>>>>>>>>>>>  much.  In those cases heavy mRNAseq seems to be the best if
>>>>>>>>>>>>>> not
>>>>>>>> only
>>>>>>>>>>>>>>  way
>>>>>>>>>>>>>> to get enough evidence to stitch gene models together.
>>>>>>>>>>>>>>  Thanks,
>>>>>>>>>>>>>>  Carson
>>>>>>>>>>>>>>  On 12-11-28 4:40 PM, "Parul Kudtarkar" <parulk at caltech.edu>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> Dear Carson and Daniel,
>>>>>>>>>>>>>> Thanks. I ran sample file for filtering genes based on AED
>>>>>>>>>>>>>>score.
>>>>>>>> The
>>>>>>>>>>>>>> input gff3 file was provided to option model_pred(see
>>>>>>>>>>>>>> attached
>>>>>>>>>>>>>> file
>>>>>>>> Scaffold1.gff), the cutoff AED score was set to 0.75. There are at
>>>>>>>>>>>>>>  least
>>>>>>>>>>>>>>  5
>>>>>>>>>>>>>> genes with AED score less than 0.75. However there were no
>>>>>>>>>>>>>> genes
>>>>>>>>>>>>>>  predicted
>>>>>>>>>>>>>> in the output file(see attached file Scaffold1_out). I have
>>>>>>>>>>>>>> also
>>>>>>>>>>>>>> attached
>>>>>>>>>>>>>> the maker_opts.ctl. Could you please advice on this.
>>>>>>>>>>>>>> Thanks and regards,
>>>>>>>>>>>>>> Parul Kudtarkar
>>>>>>>>>>>>>>  Use the AED_threshold option in the maker_opts.ctl file if
>>>>>>>>>>>>>>you
>>>>>>>>>>>>>> just
>>>>>>>>>>>>>> want
>>>>>>>>>>>>>>  to restrict final gene models to close matches directly
>>>>>>>>>>>>>>within
>>>>>>>>>>>>>> maker.
>>>>>>>>>>>>>> On
>>>>>>>>>>>>>>  the other hand, if you are trying to build a dataset for
>>>>>>>> training
>>>>>>>>>>>>>>  gene
>>>>>>>>>>>>>> predictors, use the maker2zff script for generating a
>>>>>>>>>>>>>> filtered
>>>>>>>>>>>>>>  dataset
>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>  SNAP training.  There are a number of filters available.
>>>>>>>>>>>>>> Just
>>>>>>>> call
>>>>>>>>>>>>>>  the
>>>>>>>>>>>>>> script once without parameters to see the options.
>>>>>>>>>>>>>>  Thanks,
>>>>>>>>>>>>>>  Carson
>>>>>>>>>>>>>>  On 12-11-27 5:55 PM, "Daniel Ence"
>>>>>>>>>>>>>><dence at genetics.utah.edu>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> Hi Parul,
>>>>>>>>>>>>>> I think the way you described (with the maker_opts.ctl file)
>>>>>>>>>>>>>>is
>>>>>>>> how
>>>>>>>>>>>>>> you
>>>>>>>>>>>>>> want to proceed. You still need to give the genome too.
>>>>>>>>>>>>>> Daniel
>>>>>>>>>>>>>> Daniel Ence
>>>>>>>>>>>>>> Graduate Student
>>>>>>>>>>>>>> Eccles Institute of Human Genetics
>>>>>>>>>>>>>> University of Utah
>>>>>>>>>>>>>> 15 North 2030 East, Room 2100
>>>>>>>>>>>>>> Salt Lake City, UT 84112-5330
>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>> From: maker-devel-bounces at yandell-lab.org
>>>>>>>>>>>>>> [maker-devel-bounces at yandell-lab.org] on behalf of Parul
>>>>>>>>>>>>>>  Kudtarkar
>>>>>>>>>>>>>> [parulk at caltech.edu]
>>>>>>>>>>>>>> Sent: Tuesday, November 27, 2012 3:41 PM
>>>>>>>>>>>>>> To: Parul Kudtarkar
>>>>>>>>>>>>>> Cc: maker-devel at yandell-lab.org
>>>>>>>>>>>>>> Subject: Re: [maker-devel] AED score
>>>>>>>>>>>>>> Also, are there any other parameters that are required when
>>>>>>>> filtering
>>>>>>>>>>>>>> based on AED score?
>>>>>>>>>>>>>>  Hello Carson,
>>>>>>>>>>>>>>  Just to confirm, Is there a script that would filter gene
>>>>>>>> models
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> specific AED score.
>>>>>>>>>>>>>>  Alternatively if I were to do this within maker with
>>>>>>>>>>>>>>regards
>>>>>>>> to
>>>>>>>>>>>>>> parameters
>>>>>>>>>>>>>>  in maker_opts.ctl file I would have to provide my predicted
>>>>>>>>>>>>>> genes
>>>>>>>>>>>>>> gff3
>>>>>>>>>>>>>>  file to model_gff and  set AED_threshold at desired
>>>>>>>>>>>>>>threshold?
>>>>>>>>>>>>>> Thanks and regards,
>>>>>>>>>>>>>>  Parul Kudtarkar
>>>>>>>>>>>>>>  AED score with 1 are the ones you don't want.  0 is best
>>>>>>>>>>>>>>and
>>>>>>>> 1
>>>>>>>>>>>>>>  is
>>>>>>>>>>>>>> worst
>>>>>>>>>>>>>>  as
>>>>>>>>>>>>>>  it is a distance metric.  You can use the AED_threshold
>>>>>>>> parameter
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>  require better matching to the evidence by setting it
>>>>>>>>>>>>>>closer
>>>>>>>> to
>>>>>>>>>>>>>> 0.
>>>>>>>>>>>>>> You
>>>>>>>>>>>>>>  can
>>>>>>>>>>>>>>  also try to increase protein homology evidence as some of
>>>>>>>> your
>>>>>>>>>>>>>> calls
>>>>>>>>>>>>>> may
>>>>>>>>>>>>>>  be split genes due to lack of evidence linking them.
>>>>>>>>>>>>>>  --Carson
>>>>>>>>>>>>>>  On 12-11-26 4:35 PM, "Parul Kudtarkar" <parulk at caltech.edu>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> Dear Maker community,
>>>>>>>>>>>>>> For gene-prediction I get training data-set from evidence
>>>>>>>>>>>>>>  based
>>>>>>>>>>>>>> prediction, I use this data-set to train SNAP as well as
>>>>>>>>>>>>>>Augustus
>>>>>>>> predictions, followed by boot-strapping. I would typically expect
>>>>>>>> 20-30K
>>>>>>>>>>>>>> genes however I am getting 8 times the expected gene count
>>>>>>>>>>>>>>  indicating
>>>>>>>>>>>>>>  too
>>>>>>>>>>>>>> many false positives. Is there a way to further refine these
>>>>>>>>>>>>>> predication/script to retain predictions with AED score 1
>>>>>>>>>>>>>>and
>>>>>>>>>>>>>> if
>>>>>>>>>>>>>> yes
>>>>>>>>>>>>>> how
>>>>>>>>>>>>>> to go about this?
>>>>>>>>>>>>>> Thanks and regards,
>>>>>>>>>>>>>> Parul Kudtarkar
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Scientific Programmer
>>>>>>>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>>>>>>>> Beckman Institute,
>>>>>>>>>>>>>> California Institute of Technology
>>>>>>>>>>>>>> http://www.spbase.org
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> maker-devel mailing list
>>>>>>>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yande
>>>>>>>>>>>>>>ll
>>>>>>>> -l
>>>>>>>>>>>>>> ab
>>>>>>>>>>>>>> .o
>>>>>>>>>>>>>> rg
>>>>>>>>>>>>>>  --
>>>>>>>>>>>>>>  Scientific Programmer
>>>>>>>>>>>>>>  Center for Computational Regulatory Genomics
>>>>>>>>>>>>>>  Beckman Institute,
>>>>>>>>>>>>>>  California Institute of Technology
>>>>>>>>>>>>>>  http://www.spbase.org
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Scientific Programmer
>>>>>>>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>>>>>>>> Beckman Institute,
>>>>>>>>>>>>>> California Institute of Technology
>>>>>>>>>>>>>> http://www.spbase.org
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> maker-devel mailing list
>>>>>>>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yande
>>>>>>>>>>>>>>ll
>>>>>>>>>>>>>>-l
>>>>>>>>>>>>>>a
>>>>>>>> b.
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>> g
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> maker-devel mailing list
>>>>>>>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yande
>>>>>>>>>>>>>>ll
>>>>>>>>>>>>>>-l
>>>>>>>>>>>>>>a
>>>>>>>> b.
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>> g
>>>>>>>>>>>>>>  _______________________________________________
>>>>>>>>>>>>>>  maker-devel mailing list
>>>>>>>>>>>>>>  maker-devel at box290.bluehost.com
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yande
>>>>>>>>>>>>>>ll
>>>>>>>>>>>>>>-l
>>>>>>>>>>>>>>ab
>>>>>>>> .o
>>>>>>>>>>>>>> rg
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Scientific Programmer
>>>>>>>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>>>>>>>> Beckman Institute,
>>>>>>>>>>>>>> California Institute of Technology
>>>>>>>>>>>>>> http://www.spbase.org
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Scientific Programmer
>>>>>>>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>>>>>>>> Beckman Institute,
>>>>>>>>>>>>>> California Institute of Technology
>>>>>>>>>>>>>> http://www.spbase.org
>>>>>>>>>>>> --
>>>>>>>>>>>> Scientific Programmer
>>>>>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>>>>>> Beckman Institute,
>>>>>>>>>>>> California Institute of Technology
>>>>>>>>>>>> http://www.spbase.org
>>>>>>>>>> --
>>>>>>>>>> Scientific Programmer
>>>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>>>> Beckman Institute,
>>>>>>>>>> California Institute of Technology
>>>>>>>>>> http://www.spbase.org
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Scientific Programmer
>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>> Beckman Institute,
>>>>>>>> California Institute of Technology
>>>>>>>> http://www.spbase.org
>>>>>>>>
>>>>>>>> --
>>>>>>>> Scientific Programmer
>>>>>>>> Center for Computational Regulatory Genomics
>>>>>>>> Beckman Institute,
>>>>>>>> California Institute of Technology
>>>>>>>> http://www.spbase.org
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Scientific Programmer
>>>>>> Center for Computational Regulatory Genomics
>>>>>> Beckman Institute,
>>>>>> California Institute of Technology
>>>>>> http://www.spbase.org
>>>>>>
>>>>>
>>>>>
>>>>>--
>>>>>Scientific Programmer
>>>>>Center for Computational Regulatory Genomics
>>>>>Beckman Institute,
>>>>>California Institute of Technology
>>>>>http://www.spbase.org
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>--
>>>Scientific Programmer
>>>Center for Computational Regulatory Genomics
>>>Beckman Institute,
>>>California Institute of Technology
>>>http://www.spbase.org
>>>
>>
>>
>>
>
>
>--
>Scientific Programmer
>Center for Computational Regulatory Genomics
>Beckman Institute,
>California Institute of Technology
>http://www.spbase.org
>