[maker-devel] *maker.proteins and *non_overlapping_ab_initio.proteins files

Daniel Ence dence at genetics.utah.edu
Tue Apr 16 09:52:07 MDT 2013


Hi Huiquan,

1) The default behavior for Maker is that it will only annotate gene models when there is support from both the evidence (EST and protein alignments) and the ab-initio predictors.

How many transcripts did you get from PASA? I expect there are about 254 sequences, which is about how many genes you annotated. If you want to get more gene models, then you need to supply more evidence. For our annotation projects, we often use some derivative of Swiss-Prot, which is a hand-curated database of proteins across all kingdoms.

2) The non-overlapping ab-initio file includes ab-initio predictions that didn't overlap any gene models. If augustus and genemark predictions overlap each other, I think it should include both, but if one prediction completely covers the other, I think only the longer of the two would be included.
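
To make point 1 concrete, here is a minimal sketch of the maker_opts.ctl lines involved. The option names are the ones in the control file quoted further down; the protein file name is a placeholder rather than a real path, and the note on keep_preds is based only on that option's own comment, so treat it as an assumption to verify against your Maker version.

```
#-----Protein Homology Evidence
protein=uniprot_sprot.fasta #placeholder name; any protein FASTA (e.g. a Swiss-Prot subset) goes here

#-----MAKER Behavior Options
keep_preds=0 #as in the quoted file; per its comment, a value above 0 (up to 1) adds unsupported predictions
```

Adding protein evidence keeps the support requirement in place and gives more predictions a chance to meet it. Raising keep_preds relaxes the requirement instead, so any extra gene models it adds would have no evidence behind them.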

Does that answer your questions?

Thanks,
Daniel


Daniel Ence
Graduate Student
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
________________________________
From: maker-devel-bounces at yandell-lab.org [maker-devel-bounces at yandell-lab.org] on behalf of 刘慧泉 [liuhuiquan at nwsuaf.edu.cn]
Sent: Tuesday, April 16, 2013 2:16 AM
To: maker-devel at yandell-lab.org
Subject: [maker-devel] *maker.proteins and *non_overlapping_ab_initio.proteins files

Hello maker users and developers,

I’m trying to annotate a small fungal genome using Maker-2.27-beta. For testing purposes, I used only augustus and genemark for de novo gene prediction and supplied the PASA-assembled transcripts to the est option. When maker2 finished, I used the gff3_merge and fasta_merge scripts to extract the results. There were 5608, 6255, 5084, and 254 sequences in the resulting protein files: augustus_masked, genemark, non-overlapping ab initio, and maker, respectively. My questions are:

1. Looking at the gff file produced by maker2, I found that most of the predicted gene loci have est matches. Why, then, did maker2 produce only 254 gene annotations?

2. In the “non-overlapping ab initio” file, I found that the sequences are all from the augustus_masked prediction. Does the non-overlapping file only include the best gene models predicted by both augustus and genemark? Does it include genemark- or augustus-specific genes?

Thanks in advance for any advice. I appreciate your help!

best,
Huiquan

the maker_opts.ctl file:

#-----Genome (these are always required)
genome=my_gnm.fa #genome sequence (fasta file or fasta embeded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

#-----EST Evidence (for best results provide a file for at least one)
est=my_est.fa #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closly relate species in GFF3 format

#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=  #protein sequence file in fasta format (i.e. from mutiple oransisms)
protein_gff=  #aligned protein homology evidence from an external GFF3 file

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=fungi #select a model organism for RepBase masking in RepeatMasker
rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein=RepeatPeps.lib #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)

#-----Gene Prediction
snaphmm= #SNAP HMM file
gmhmm=my_ges.mod #GeneMark HMM file
augustus_species=my2 #Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no

#-----Other Annotation Feature Types (features MAKER doesn't recognize)
other_gff= #extra features to pass-through to final MAKER generated GFF3 file

#-----External Application Behavior Options
alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases
cpus=14 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)

#-----MAKER Behavior Options
max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage)
min_contig=1 #skip genome contigs below this length (under 10kb are often useless)

pred_flank=200 #flank for extending evidence clusters sent to gene predictors
pred_stats=0 #report AED and QI statistics for all predictions as well as models
AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
min_protein=20 #require at least this many amino acids in predicted proteins
alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
always_complete=1 #extra steps to force start and stop codons, 1 = yes, 0 = no
map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
keep_preds=0 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)

split_hit=1500 #length for the splitting of hits (expected max intron size for evidence alignments)
single_exon=1 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
single_length=200 #min length required for single exon ESTs if 'single_exon is enabled'
correct_est_fusion=1 #limits use of ESTs in annotation to avoid fusion genes

tries=2 #number of times to try a contig if there is a failure for some reason
clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no
clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
TMP= #specify a directory other than the system default temporary directory for temporary files