[maker-devel] guidance for first and subsequent annotation parameters
Devon O'Rourke
devon.orourke at gmail.com
Fri Mar 20 05:30:56 MDT 2020
With so many posts on the forum, it's been challenging to determine the
best practices for running multiple rounds of annotation with Maker.
My first round used EST, alternate-EST (altest), and protein FASTA files
plus a custom repeat-masking GFF file. For this vertebrate genome, that
round produced 21,970 gene models with a mean length of about 9016 bp;
the BUSCO score was
C:66.0%[S:64.2%,D:1.8%],F:4.2%,M:29.8%,n:9226 (mammalia_odb10 set). Things
seemed to be on the right track, so I set up the next Maker round using
both SNAP and Augustus-trained information in the round2 maker_opts.ctl
file. At the end of that second round, I noticed a marked *decrease* in
BUSCO score (C:53.3%[S:51.0%,D:2.3%],F:11.6%,M:35.1%,n:9226), yet an
increase in the number of gene models (28,646) and mean length (16266 bp).
This made me wonder whether I set up the maker_opts.ctl file
incorrectly. I'm concerned about a few things (and may be missing others
I should be concerned about):
- I specified the evidence to come from EST/Protein instead of using the
section available under "#-----Re-annotation Using MAKER Derived GFF3".
Maybe that was a fundamental mistake? How would Maker's behavior change
if I moved my round1 Maker output into that section instead of using
the EST/Protein Homology evidence sections as I did below?
- I wasn't sure what to do with the RepeatMasking GFF files in Round2.
The RepeatMasker GFF I included in Round1 consisted of just complex repeats
(setting model_org=simple and softmask=1 to effectively only hard mask
those complex areas for the initial alignments). But what should be used in
Round2 - the output GFF of Round1, or the input GFF from Round1?
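
To make the round-to-round change easier to see, here is a minimal Python
sketch that parses the two BUSCO summary strings quoted above and reports
the difference per category. The function names (parse_busco, busco_delta)
are made up for illustration and are not part of BUSCO or Maker.

```python
import re

# Summary strings copied from the two Maker rounds described above.
ROUND1 = "C:66.0%[S:64.2%,D:1.8%],F:4.2%,M:29.8%,n:9226"
ROUND2 = "C:53.3%[S:51.0%,D:2.3%],F:11.6%,M:35.1%,n:9226"


def parse_busco(summary):
    """Return the fields of a BUSCO one-line summary as a dict.

    Percentages (C, S, D, F, M) become floats; n becomes an int.
    """
    fields = {}
    for key, value in re.findall(r"([CSDFMn]):([\d.]+)", summary):
        fields[key] = int(value) if key == "n" else float(value)
    return fields


def busco_delta(before, after):
    """Return after-minus-before for each category, in percentage points."""
    b, a = parse_busco(before), parse_busco(after)
    return {key: round(a[key] - b[key], 1) for key in "CSDFM"}


if __name__ == "__main__":
    for key, change in busco_delta(ROUND1, ROUND2).items():
        print(f"{key}: {change:+.1f} percentage points")
```

With the numbers above, complete BUSCOs drop by 12.7 percentage points,
while fragmented and missing rise by 7.4 and 5.3 respectively.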
Here's what I did for the Round2 maker_opts.ctl file:
#-----Genome (these are always required)
genome=/scratch/dro49/myluwork/annotation/input_files/mylu_hic_rails_noMasks.fa
organism_type=eukaryotic
#-----EST Evidence (for best results provide a file for at least one)
est_gff=/scratch/dro49/myluwork/annotation/maker_rd2/mylu_rnd1.all.maker.est2genome.gff
altest_gff=/scratch/dro49/myluwork/annotation/maker_rd2/mylu_rnd1.all.maker.cdna2genome.gff
#-----Protein Homology Evidence (for best results provide a file for at least one)
protein_gff=/scratch/dro49/myluwork/annotation/maker_rd2/mylu_rnd1.all.maker.protein2genome.gff
#-----Repeat Masking (leave values blank to skip repeat masking)
rm_gff=/scratch/dro49/myluwork/annotation/maker_rd2/mylu_rnd1.all.maker.repeats.gff
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
#-----Gene Prediction
snaphmm=/scratch/dro49/myluwork/annotation/maker_rd2/snap_rd1/lu_rnd1.zff.length50_aed0.25.hmm #SNAP HMM file
augustus_species=mylu #Augustus gene prediction species model
run_evm=0 #run EvidenceModeler, 1 = yes, 0 = no
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
allow_overlap= #allowed gene overlap fraction (value from 0 to 1, blank for default)
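
The per-evidence GFF files named in the control file above look like
subsets of the round 1 Maker GFF. A minimal Python sketch of one way such
subsets could be pulled out, assuming the features are selected by the GFF3
source column (column 2) and that the source labels match the file name
suffixes (est2genome, cdna2genome, protein2genome). The function name
split_gff_by_source is hypothetical and not part of Maker.

```python
def split_gff_by_source(lines, sources):
    """Group GFF3 feature lines by their source column (column 2).

    lines   -- iterable of text lines from a GFF3 file
    sources -- source labels to keep, e.g. ["est2genome", "protein2genome"]
    Returns a dict mapping each requested source to its matching lines.
    """
    buckets = {source: [] for source in sources}
    for line in lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) == 9 and columns[1] in buckets:
            buckets[columns[1]].append(line.rstrip("\n"))
    return buckets
```

Calling it on an open file handle for the round 1 GFF with
["est2genome", "cdna2genome", "protein2genome"] returns one list of
feature lines per label, which can then be written to separate files.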
Thank you for your insights and support,
Devon
--
Devon O'Rourke
Postdoctoral researcher, Northern Arizona University
Lab of Jeffrey T. Foster - https://fozlab.weebly.com/
twitter: @thesciencedork