[maker-devel] Re-annotation, fewer gene predictions
Xabier Vázquez-Campos
xvazquezc at gmail.com
Wed Feb 6 15:33:47 MST 2019
SNAP is easy to train, works well in fungal genomes and it's explained in
Maker's wiki:
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018#Training_ab_initio_Gene_Predictors
Oh, sorry, I didn't explain myself well. What I was trying to say is that
before BUSCO, when we only had CEGMA, we would proceed in a different way
to train Augustus as CEGMA wouldn't produce Augustus gene models
automatically. I am not suggesting that you use CEGMA.
This is what I have in my own documentation about how to train Augustus
"the old way":
> AUGUSTUS… the old way
>
> Alternatively, you can train AUGUSTUS in a more “manual” way, like when we
> were using CEGMA. The training starts with the output from the second
> instance of fathom in the SNAP training section.
>
> cd ${MYGENOME_DIR}/maker/snap1
> perl ~/bin/zff2augustus_gbk.pl > ${MYGENOME}.train1.gb
>
> zff2augustus_gbk.pl generates a GenBank file from export.dna.
>
> The actual training of AUGUSTUS will be through the *webAUGUSTUS server*.
>
> Before proceeding, it is recommended to rename the FASTA headers, especially
> if they are very long or contain special characters. This is the
> main cause of failure for jobs submitted to webAUGUSTUS. You can use
> the simplifyFastaHeaders.pl
> <http://bioinf.uni-greifswald.de/bioinf/downloads/simplifyFastaHeaders.pl>
> script for that:
>
> perl ~/bin/simplifyFastaHeaders.pl ${MYGENOME}_assembly.fasta nameStem ${MYGENOME}_contigs_rename.fasta ${MYGENOME}_contigs.map
>
> perl ~/bin/simplifyFastaHeaders.pl ${MYGENOME}_transcripts_assembled.fasta nameStem ${MYGENOME}_rna_rename.fasta ${MYGENOME}_rna.map
>
> nameStem is the base name used for each of the sequences in the
> multi-FASTA files. Choose something appropriate, such as *contig*
> and *rna* for the assembly and RNA-seq files, respectively, or something
> based on that. For example, ‘pgcontig’ and ‘pgrna’ for contigs and RNA from *Puccinia
> graminis*.
> *DO NOT* give the same nameStem to both FASTA files, and don’t use any
> special characters.
>
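To make the renaming step concrete, here is a minimal awk sketch of what it amounts to. It is not the actual simplifyFastaHeaders.pl script: the function name rename_fasta_headers and the two-column layout of the map file (new name, tab, old header) are assumptions made for illustration.

```shell
# Hypothetical stand-in for simplifyFastaHeaders.pl (illustration only).
# usage: rename_fasta_headers in.fasta nameStem out.fasta out.map
rename_fasta_headers() {
    awk -v stem="$2" -v map="$4" '
        /^>/ {
            # Number the sequences in order of appearance: stem1, stem2, ...
            n++
            # Record "new name <TAB> original header" in the map file.
            printf "%s%d\t%s\n", stem, n, substr($0, 2) > map
            print ">" stem n
            next
        }
        # Sequence lines are passed through unchanged.
        { print }
    ' "$1" > "$3"
}
```

For example, rename_fasta_headers ${MYGENOME}_assembly.fasta pgcontig ${MYGENOME}_contigs_rename.fasta ${MYGENOME}_contigs.map would give headers pgcontig1, pgcontig2, and so on, with the original headers kept in the map file.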
> We need the following files (minimum):
>
> - ${MYGENOME}_assembly.fasta as *Genome file*
> - ${MYGENOME}.train1.gb as *Training gene structure file*
>
> If we also have RNA-seq data:
>
> - ${MYGENOME}_assembled_transcripts.fasta as *cDNA file*
>
> Use ${MYGENOME}_v1 as *Species name*. We will need a different
> species name in the retraining step. Otherwise, when Maker2 is rerun, it
> will see the same name and will not rerun AUGUSTUS, even though the species
> profile is different. So ${MYGENOME}_v1 does the job and tracks the
> version.
>
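Since the species name has to change between training rounds, the augustus_species line in the MAKER control file has to be updated each time. A small helper along these lines could do it. This is only a sketch: set_maker_opt is a hypothetical function name, and it assumes the usual key=value #comment layout of maker_opts.ctl.

```shell
# Hypothetical helper (illustration only): set one key in a MAKER control file.
# usage: set_maker_opt maker_opts.ctl key value
set_maker_opt() {
    awk -v key="$2" -v val="$3" '
        # Only touch the line that starts with exactly "key=".
        index($0, key "=") == 1 {
            # Keep the trailing "#comment" if the line has one.
            comment = ""
            i = index($0, "#")
            if (i > 0) comment = " " substr($0, i)
            print key "=" val comment
            next
        }
        # Every other line is copied as is.
        { print }
    ' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}
```

For the retraining round this would be called as set_maker_opt maker_opts.ctl augustus_species ${MYGENOME}_v2, where the _v2 suffix is just the obvious continuation of the naming scheme above.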
> Once the job is finished, the *Species parameter archive* (
> parameters.tar.gz) will contain a folder with the model files for your
> species. Copy it to the species folder of your AUGUSTUS installation.
>
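The copy step can be sketched roughly as follows. Treat it as an assumption-laden example rather than the official procedure: install_augustus_species is a hypothetical function name, and it assumes that the folder inside the archive is named exactly like the species name you entered and that AUGUSTUS_CONFIG_PATH points at the config directory of your AUGUSTUS installation.

```shell
# Hypothetical helper (illustration only): unpack the webAUGUSTUS archive
# and copy the species folder into the AUGUSTUS config directory.
# usage: install_augustus_species parameters.tar.gz species_name
install_augustus_species() {
    archive=$1
    species=$2
    if [ -z "${AUGUSTUS_CONFIG_PATH:-}" ]; then
        echo "AUGUSTUS_CONFIG_PATH is not set" >&2
        return 1
    fi
    # Unpack into a scratch directory so nothing is left behind.
    workdir=$(mktemp -d) || return 1
    status=1
    if tar -xzf "$archive" -C "$workdir"; then
        # Assumption: the species folder carries the species name.
        src=$(find "$workdir" -type d -name "$species" | head -n 1)
        if [ -n "$src" ]; then
            cp -r "$src" "$AUGUSTUS_CONFIG_PATH/species/" && status=0
        else
            echo "no folder named $species found in $archive" >&2
        fi
    fi
    rm -rf "$workdir"
    return "$status"
}
```

Usage would be install_augustus_species parameters.tar.gz ${MYGENOME}_v1, after which ${MYGENOME}_v1 can be given as augustus_species in the MAKER control file.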
Hope this helps
PS: hit reply all so this is logged in Maker's mailing list in case anybody
else experiences similar issues
On Thu, 7 Feb 2019 at 06:36, morgan sobol <morgan_starr_s at live.com> wrote:
> I have not used SNAP or CEGMA, however, I see that CEGMA was discontinued
> in 2015.
> Do you think that will be a problem, or is it still worth using the old
> version?
>
>
> ------------------------------
> *From:* Xabier Vázquez-Campos <xvazquezc at gmail.com>
> *Sent:* Tuesday, February 5, 2019 4:42 PM
> *To:* morgan sobol; Maker Mailing List
> *Subject:* Re: [maker-devel] Re-annotation, fewer gene predictions
>
> Don't you use SNAP? It usually produces quite decent results, and it is easier
> to train than any of the other predictors.
>
> In any case, the Augustus gene model is way off in both cases.
> GM doesn't seem bad for the first genome, if your fungus has a rather
> typical genome. For the second, it looks bad.
>
> I'm not too familiar with reannotation, but I'd rather create the gene
> models from scratch than reuse the ones from the Illumina-only
> genomes.
> Note that long-read assemblies have a higher proportion of
> repetitive elements that need masking, and RepeatMasker alone may not be
> enough. In theory, this shouldn't affect the Augustus model if it is trained through
> BUSCO, as BUSCO uses defined conserved markers to create the gene model, but
> I'm not so sure about GM.
>
> If you trained Augustus with BUSCO and this is the result, I'd discard
> the gene model and train it again the "traditional way", i.e. as it used
> to be done when we only had CEGMA. I had good results just by changing the
> training method.
>
> Hope it helps,
> Xabi
>
>
>
>
> On Wed, 6 Feb 2019 at 02:19, morgan sobol <morgan_starr_s at live.com> wrote:
>
> Thank you, Xabi for the response.
> The number of proteins from each source is much lower than before.
> Previous numbers were 325, 10,899, and 11,243 for augustus, genemark, and
> maker respectively.
> The more recent numbers are 25, 857, 4418 respectively.
>
> So do you think this hints that something is wrong with genemark?
>
> Morgan
>
>
> ------------------------------
> *From:* Xabier Vázquez-Campos <xvazquezc at gmail.com>
> *Sent:* Sunday, February 3, 2019 4:43 PM
> *To:* morgan sobol
> *Cc:* maker-devel at yandell-lab.org
> *Subject:* Re: [maker-devel] Re-annotation, fewer gene predictions
>
> Hi Morgan,
>
> We had a similar issue with AUGUSTUS underpredicting when using a
> BUSCO-derived gene model
> https://groups.google.com/d/msg/maker-devel/ocnDG4nq1A8/NyCPzzRgAgAJ
>
> Also, check the number of proteins by each individual predictor. If the
> numbers from one of them are off, you may find a possible source of issues.
> We didn't have a very good experience with GM, as it used to overpredict
> an absurd number of proteins.
>
> Xabi
>
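One way to get the per-predictor numbers is to count the top-level features by source in the MAKER GFF3 output. The following is only a sketch: count_gff_features is a hypothetical function name, and it assumes that ab initio predictions are stored as "match" features and final annotations as "gene" features, with the predictor name in the second column. The exact source labels depend on how MAKER was run, and other evidence stored as "match" (repeats, for instance) will show up in the same table.

```shell
# Hypothetical helper (illustration only): tally gene and match features
# per source (column 2) in a GFF3 file.
# usage: count_gff_features annotations.gff
count_gff_features() {
    awk -F '\t' '
        # Stop at the embedded FASTA section, if there is one.
        /^##FASTA/ { exit }
        # Skip comments, directives and malformed lines.
        /^#/ || NF < 9 { next }
        # Count only top-level features, not match_part, mRNA, exon, etc.
        $3 == "gene" || $3 == "match" { count[$2 "\t" $3]++ }
        END { for (k in count) print k "\t" count[k] }
    ' "$1" | sort
}
```

The output has three tab-separated columns: source, feature type and count. A predictor whose count is far off from the others, or from a previous run, is a good place to start looking.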
> On Mon, 4 Feb 2019 at 06:15, morgan sobol <morgan_starr_s at live.com> wrote:
>
> Hello,
>
> I previously used Maker to annotate two different fungal genomes that were
> created using Illumina sequences only. For these genomes, I had over 11,000
> genes predicted.
> I recently obtained PacBio sequences for the same genomes, so I created
> two hybrid assemblies. Both assemblies were very similar in length and
> number of complete orthologs to the Illumina-only assembly, but had far
> fewer, but longer, contigs.
>
> I re-ran Maker using the settings below. For one of my genomes, I got
> around 11,000 genes predicted again, as expected. However, for the other
> genome, I am continuously getting ~4,400 predicted genes.
>
> I am asking for help with how I can determine why I keep getting fewer
> predicted genes for only one of my genomes, even though I ran them the same way.
>
> Thanks,
> Morgan S.
>
> maker_opts.log
> #-----Genome (these are always required)
> genome=/work/Geomicrobiology/msobol/IODP_329_SPG/1368D2H1/repeatmasker/unicycler/1368D_unicycler_contigs.fasta.masked
> #genome sequence (fasta file or$
> organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic
>
> #-----Re-annotation Using MAKER Derived GFF3
> maker_gff=/work/Geomicrobiology/msobol/IODP_329_SPG/1368D2H1/maker/1368D_2H1_contigs.fasta.maker.output/1368D_2H1_contigs.fasta.all.gff
> #MAKER derive$
> est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
> altest_pass=1 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
> protein_pass=1 #use protein alignments in maker_gff: 1 = yes, 0 = no
> rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
> model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
> pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
> other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no
>
> #-----EST Evidence (for best results provide a file for at least one)
> est= #set of ESTs or assembled mRNA-seq in fasta format
> altest= #EST/cDNA sequence file in fasta format from an alternate organism
> est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
> altest_gff= #aligned ESTs from a closly relate species in GFF3 format
>
> #-----Protein Homology Evidence (for best results provide a file for at least one)
> protein=/work/Geomicrobiology/msobol/IODP_329_SPG/uniprot_sprot.fasta #protein sequence file in fasta format (i.e. from mutiple oransisms)
> protein_gff= #aligned protein homology evidence from an external GFF3 file
>
> #-----Repeat Masking (leave values blank to skip repeat masking)
> model_org= #select a model organism for RepBase masking in RepeatMasker
> rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
> repeat_protein= #provide a fasta file of transposable element proteins for RepeatRunner
> rm_gff= #pre-identified repeat elements from an external GFF3 file
> prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
> softmask=0 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
>
> #-----Gene Prediction
> snaphmm= #SNAP HMM file
> gmhmm=/home/msobol/genemark/68D_2/output/gmhmm.mod #GeneMark HMM file
> augustus_species=1368D_uni #Augustus gene prediction species model
> fgenesh_par_file= #FGENESH parameter file
> pred_gff= #ab-initio predictions from an external GFF3 file
> model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
> est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
> protein2genome=1 #infer predictions from protein homology, 1 = yes, 0 = no
> trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
> snoscan_rrna= #rRNA file to have Snoscan find snoRNAs
> unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
>
> #-----Other Annotation Feature Types (features MAKER doesn't recognize)
> other_gff= #extra features to pass-through to final MAKER generated GFF3 file
>
> #-----External Application Behavior Options
> alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases
> cpus=1 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)
>
> #-----MAKER Behavior Options
> max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage)
> min_contig=1 #skip genome contigs below this length (under 10kb are often useless)
>
> pred_flank=200 #flank for extending evidence clusters sent to gene predictors
> pred_stats=1 #report AED and QI statistics for all predictions as well as models
> AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
> min_protein=0 #require at least this many amino acids in predicted proteins
> alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
> always_complete=0 #extra steps to force start and stop codons, 1 = yes, 0 = no
> map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
> keep_preds=1 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)
>
> split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)
> single_exon=1 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
> single_length=250 #min length required for single exon ESTs if 'single_exon is enabled'
> correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes
>
> tries=2 #number of times to try a contig if there is a failure for some reason
> clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no
> clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
> TMP= #specify a directory other than the system default temporary directory for temporary files
>
> --
> Xabier Vázquez-Campos, *PhD*
> *Research Associate*
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA
>
>
>
--
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA