[maker-devel] Re-annotation, fewer gene predictions

Xabier Vázquez-Campos xvazquezc at gmail.com
Tue Feb 5 15:42:40 MST 2019


Don't you use SNAP? It usually produces quite decent results, and it is
easier to train than any of the other predictors.

In any case, the Augustus gene model is way off in both cases.
For the first genome, GM doesn't seem bad if your fungus has a fairly
typical genome. For the second, it looks bad.
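
To put numbers on "way off", the fraction of predictions retained between
the two runs can be computed per source. This is only a minimal sketch using
the counts reported in this thread; the function and variable names are
illustrative, not part of MAKER.

```python
# Counts reported in this thread: (Illumina-only run, hybrid-assembly run).
counts = {
    "augustus": (325, 25),
    "genemark": (10899, 857),
    "maker": (11243, 4418),
}


def retained_fraction(before: int, after: int) -> float:
    """Fraction of predictions kept in the new run relative to the old one."""
    if before <= 0:
        raise ValueError("the earlier count must be positive")
    return after / before


for source, (before, after) in counts.items():
    kept = retained_fraction(before, after)
    print(f"{source}: {before} -> {after} ({kept:.1%} retained)")
```

Both ab initio predictors keep under 10% of their earlier predictions, while
the final MAKER set keeps roughly 40%.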

I'm not too familiar with re-annotation, but I'd create the gene models
from scratch rather than reuse the ones from the Illumina-only genomes.
Note that assemblies built with long reads have a higher proportion of
repetitive elements that need masking, and RepeatMasker alone may not be
enough. In theory, this shouldn't affect the Augustus model if it was
trained through BUSCO, as BUSCO uses defined conserved markers to create
the gene model, but I'm not so sure about GM.
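
One way to compare how much of each assembly is actually masked is to count
masked bases in the masked FASTA files. The sketch below is an illustration
only: it assumes hard-masked bases appear as N or X and soft-masked bases as
lowercase letters, and `masked_fraction` is a hypothetical helper name.

```python
from typing import Dict, Iterable


def masked_fraction(fasta_lines: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of hard-masked (N/X) and soft-masked (lowercase) bases.

    Caveat: N is also used for assembly gaps, so the "hard" value is an
    upper bound on repeat masking.
    """
    total = hard = soft = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA headers
        total += len(line)
        for base in line:
            if base in "NnXx":
                hard += 1
            elif base.islower():
                soft += 1
    if total == 0:
        return {"hard": 0.0, "soft": 0.0}
    return {"hard": hard / total, "soft": soft / total}


# Example with an in-memory record; for a real file, pass an open file handle.
example = [">contig_1", "ACGTNNNN", "acgtACGT"]
print(masked_fraction(example))
```

Running this on both the Illumina-only and the hybrid masked assemblies gives
a rough idea of whether the masked proportion differs between them.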

If you trained Augustus with BUSCO, and this is the result, I'd discard the
gene model and train it again the "traditional way", i.e. as we used to
when we only had CEGMA. I got good results just by changing the training
method.
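
To compare runs before and after retraining, the features in each MAKER GFF3
file can be tallied by source and type. This is a minimal sketch, assuming a
standard nine-column GFF3 file; `tally_features` is a hypothetical helper,
and the source labels in the example are placeholders apart from `maker`,
which I believe is the label MAKER gives its final models.

```python
from collections import Counter
from typing import Iterable


def tally_features(gff_lines: Iterable[str]) -> Counter:
    """Count GFF3 features by (source, type), i.e. columns 2 and 3."""
    tally: Counter = Counter()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed feature line
        tally[(fields[1], fields[2])] += 1
    return tally


# Small in-memory example; for a real file, pass an open file handle.
example = [
    "##gff-version 3\n",
    "c1\tmaker\tgene\t1\t100\t.\t+\t.\tID=g1\n",
    "c1\tmaker\tmRNA\t1\t100\t.\t+\t.\tID=m1;Parent=g1\n",
    "c1\texample_predictor\tmatch\t1\t90\t.\t+\t.\tID=p1\n",
    "##FASTA\n",
    ">c1\n",
    "ACGT\n",
]
for (source, ftype), n in sorted(tally_features(example).items()):
    print(f"{source}\t{ftype}\t{n}")
```

Counting by source and type rather than hard-coding predictor names keeps
the tally independent of whatever labels a given run writes.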

Hope it helps,
Xabi




On Wed, 6 Feb 2019 at 02:19, morgan sobol <morgan_starr_s at live.com> wrote:

> Thank you, Xabi, for the response.
> The number of proteins from each source is much lower than before.
> Previous numbers were 325, 10,899, and 11,243 for augustus, genemark, and
> maker, respectively.
> The more recent numbers are 25, 857, and 4,418, respectively.
>
> So do you think this hints that something is wrong with genemark?
>
> Morgan
>
>
> ------------------------------
> *From:* Xabier Vázquez-Campos <xvazquezc at gmail.com>
> *Sent:* Sunday, February 3, 2019 4:43 PM
> *To:* morgan sobol
> *Cc:* maker-devel at yandell-lab.org
> *Subject:* Re: [maker-devel] Re-annotation, fewer gene predictions
>
> Hi Morgan,
>
> We had a similar issue with AUGUSTUS underpredicting when using a
> BUSCO-derived gene model
> https://groups.google.com/d/msg/maker-devel/ocnDG4nq1A8/NyCPzzRgAgAJ
>
> Also, check the number of proteins by each individual predictor. If the
> numbers from one of them are off, you may find a possible source of issues.
> We didn't have a very good experience with GM, as it used to overpredict
> an absurd number of proteins.
>
> Xabi
>
> On Mon, 4 Feb 2019 at 06:15, morgan sobol <morgan_starr_s at live.com> wrote:
>
> Hello,
>
> I previously used Maker to annotate two different fungal genomes that were
> created using Illumina sequences only. For these genomes, I had over 11,000
> genes predicted.
> I recently obtained PacBio sequences for the same genomes, so I created
> two hybrid assemblies. Both assemblies were very similar to the
> Illumina-only assemblies in length and in number of complete orthologs,
> but had far fewer, longer contigs.
>
> I re-ran Maker using the settings below. For one of my genomes, I got
> around 11,000 genes predicted again, as expected. However, for the other
> genome, I am continuously getting ~4,400 predicted genes.
>
> How can I determine why I keep getting fewer predicted genes for only one
> of my genomes, even though I ran both the same way?
>
> Thanks,
> Morgan S.
>
> maker_opts.log
> #-----Genome (these are always required)
> genome=/work/Geomicrobiology/msobol/IODP_329_SPG/1368D2H1/repeatmasker/unicycler/1368D_unicycler_contigs.fasta.masked #genome sequence (fasta file or$
> organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic
>
> #-----Re-annotation Using MAKER Derived GFF3
> maker_gff=/work/Geomicrobiology/msobol/IODP_329_SPG/1368D2H1/maker/1368D_2H1_contigs.fasta.maker.output/1368D_2H1_contigs.fasta.all.gff #MAKER derive$
> est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
> altest_pass=1 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
> protein_pass=1 #use protein alignments in maker_gff: 1 = yes, 0 = no
> rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
> model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
> pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
> other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no
>
> #-----EST Evidence (for best results provide a file for at least one)
> est= #set of ESTs or assembled mRNA-seq in fasta format
> altest= #EST/cDNA sequence file in fasta format from an alternate organism
> est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
> altest_gff= #aligned ESTs from a closly relate species in GFF3 format
>
> #-----Protein Homology Evidence (for best results provide a file for at least one)
> protein=/work/Geomicrobiology/msobol/IODP_329_SPG/uniprot_sprot.fasta #protein sequence file in fasta format (i.e. from mutiple oransisms)
> protein_gff=  #aligned protein homology evidence from an external GFF3 file
>
> #-----Repeat Masking (leave values blank to skip repeat masking)
> model_org= #select a model organism for RepBase masking in RepeatMasker
> rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
> repeat_protein= #provide a fasta file of transposable element proteins for RepeatRunner
> rm_gff= #pre-identified repeat elements from an external GFF3 file
> prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
> softmask=0 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
>
> #-----Gene Prediction
> snaphmm= #SNAP HMM file
> gmhmm=/home/msobol/genemark/68D_2/output/gmhmm.mod #GeneMark HMM file
> augustus_species=1368D_uni #Augustus gene prediction species model
> fgenesh_par_file= #FGENESH parameter file
> pred_gff= #ab-initio predictions from an external GFF3 file
> model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
> est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
> protein2genome=1 #infer predictions from protein homology, 1 = yes, 0 = no
> trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
> snoscan_rrna= #rRNA file to have Snoscan find snoRNAs
> unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
>
> #-----Other Annotation Feature Types (features MAKER doesn't recognize)
> other_gff= #extra features to pass-through to final MAKER generated GFF3 file
>
> #-----External Application Behavior Options
> alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases
> cpus=1 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)
>
> #-----MAKER Behavior Options
> max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage)
> min_contig=1 #skip genome contigs below this length (under 10kb are often useless)
>
> pred_flank=200 #flank for extending evidence clusters sent to gene predictors
> pred_stats=1 #report AED and QI statistics for all predictions as well as models
> AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
> min_protein=0 #require at least this many amino acids in predicted proteins
> alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
> always_complete=0 #extra steps to force start and stop codons, 1 = yes, 0 = no
> map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
> keep_preds=1 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)
>
> split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)
> single_exon=1 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
> single_length=250 #min length required for single exon ESTs if 'single_exon is enabled'
> correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes
>
> tries=2 #number of times to try a contig if there is a failure for some reason
> clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no
> clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
> TMP= #specify a directory other than the system default temporary directory for temporary files
>
>
>
> --
> Xabier Vázquez-Campos, *PhD*
> *Research Associate*
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA
>

