[maker-devel] First time using maker- Train or not to train?

Wed Dec 16 16:41:48 MST 2015

Hi Daniel,

Have you guys heard about BUSCO <http://busco.ezlab.org/>? It's kind of a
replacement for CEGMA, which was based on a rather limited set of genes
(according to its developers, we should stop using it). BUSCO not only
produces a more thorough completeness profile, it also generates the
Augustus species training profile (it needs access to your local Augustus
species folder). According to the manual, using the --long option is
similar to a training and retraining step in the old training method.

I recently used it for training Augustus for my fungal genomes and it works
well. Unfortunately, it may not apply in this case, as they don't have the
plant profile dataset ready yet. You may request early access to it, though.

I used to use the CEGMA output plus the webAugustus training service, a bit
more tedious but not that complicated. I copy below what I had in my old
protocol. Nonetheless, I would recommend that anyone not working with
plant genomes use BUSCO instead:

> Augustus gff files are a bit different from CEGMA ones. Get the CEGMA
> output and run the following script:
>     cegma2zff output.cegma.gff > augustus.gff
>
> Upload the genome file (e.g. contigs.fa from velvet) and the "training
> gene structure file" (augustus.gff) to
> http://bioinf.uni-greifswald.de/webaugustus/training/create
>
> Once finished, the "Species parameter archive" (parameters.tar.gz) will
> contain a folder with the model files for your species. Copy it to the
> species folder of Augustus (augustus/config/species).
>
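
A minimal sketch of that copy step, assuming the archive unpacks to a folder
holding a <species>_parameters.cfg file (as Augustus species folders do).
install_species_params is a hypothetical helper, not part of Augustus or
WebAugustus, and the destination is assumed to be the species directory
under $AUGUSTUS_CONFIG_PATH:

```shell
#!/bin/sh
# Hypothetical helper: unpack a WebAugustus parameter archive and copy the
# species folder it contains into an Augustus species directory.
install_species_params() {
    archive=$1      # e.g. parameters.tar.gz
    species_dir=$2  # e.g. "$AUGUSTUS_CONFIG_PATH/species" (assumed layout)
    [ -d "$species_dir" ] || { echo "no such directory: $species_dir" >&2; return 1; }
    tmp=$(mktemp -d) || return 1
    tar -xzf "$archive" -C "$tmp" || { rm -rf "$tmp"; return 1; }
    # Locate the species folder through its <species>_parameters.cfg file.
    cfg=$(find "$tmp" -name '*_parameters.cfg' | head -n 1)
    if [ -z "$cfg" ]; then
        echo "no *_parameters.cfg found in $archive" >&2
        rm -rf "$tmp"
        return 1
    fi
    src=$(dirname "$cfg")
    name=$(basename "$src")
    # Refuse to overwrite an existing species of the same name.
    if [ -e "$species_dir/$name" ]; then
        echo "$species_dir/$name already exists; not overwriting" >&2
        rm -rf "$tmp"
        return 1
    fi
    # Print the species name on success.
    cp -R "$src" "$species_dir/" && echo "$name"
    status=$?
    rm -rf "$tmp"
    return $status
}
```

Called as install_species_params parameters.tar.gz "$AUGUSTUS_CONFIG_PATH/species",
it prints the folder name, which is the value to give Maker in
augustus_species=.
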
> Re-training
>
> From Maker's output, follow the same initial instructions as for SNAP
> training detailed in the Maker tutorial:
> In the directory that contains MYGENOME.maker.output/ folder:
>     mkdir snap
>     cd snap
>     gff3_merge -d ../MYGENOME.maker.output/MYGENOME_master_datastore_index.log
>     maker2zff -n MYGENOME.all.gff
> The -n option is not included in the original tutorial, but without it
> you may end up with empty genome.ann and genome.dna files.
> From this point we generate training files for both SNAP and Augustus:
>
>     fathom genome.ann genome.dna -categorize 1000
>     fathom uni.ann uni.dna -export 1000 -plus
>     forge export.ann export.dna
>
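
The steps so far can be wrapped in a small script that stops as soon as an
intermediate file turns out empty. This is only a sketch: require_nonempty
and make_training_files are hypothetical names, the commands are the ones
listed above, and it assumes gff3_merge, maker2zff, fathom and forge are on
your PATH and that you run it from inside snap/:

```shell
#!/bin/sh
# Fail early if any of the listed files is missing or empty.
require_nonempty() {
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
    done
}

# Run from inside the snap/ directory.
# $1 is the Maker run name (MYGENOME in the text above).
make_training_files() {
    base=$1
    gff3_merge -d "../$base.maker.output/${base}_master_datastore_index.log" || return 1
    maker2zff -n "$base.all.gff" || return 1
    # An empty genome.ann or genome.dna is the failure mentioned above.
    require_nonempty genome.ann genome.dna || return 1
    fathom genome.ann genome.dna -categorize 1000 || return 1
    fathom uni.ann uni.dna -export 1000 -plus || return 1
    require_nonempty export.ann export.dna || return 1
    forge export.ann export.dna
}
```

For example: cd snap && make_training_files MYGENOME
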
> For Augustus, we need the script "zff2augustus_gbk.pl". This takes the
> export.dna generated by fathom and produces a *.gb file to be used as
> the "training gene structure file" in a new training submission in
> WebAugustus. Remember to give the species a new name in the submission,
> e.g. MYGENOME_v2; if the name is unchanged, Maker won't see the difference:
>     perl PATH/TO/SCRIPT/zff2augustus_gbk.pl > MYGENOME_v2.train.gb
>

Xabier

On 17 December 2015 at 05:07, Daniel Ence <dence at genetics.utah.edu> wrote:

> Hi Elyssa,
>
> Setting est2genome=1 tells MAKER to promote all of the est2genome
> alignments to a gene model, which is not what you want for a final gene
> set. That being said, since your gene models are basically the unmodified
> alignments, I’m surprised that all of them have an AED of 1, since that
> means that they’re not supported by any of the evidence (either est or
> protein).
>
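
To check how many models are affected, the AED values can be tallied
straight from the merged GFF. A rough sketch, assuming the file is MAKER's
GFF3 output with an _AED= attribute on each mRNA line; aed_summary is a
hypothetical helper, not a MAKER script:

```shell
#!/bin/sh
# Count mRNA features in a MAKER GFF3 file and how many of them have AED = 1
# (i.e. no support from the evidence alignments).
aed_summary() {
    awk -F '\t' '
        $3 == "mRNA" && match($9, /_AED=[0-9.]+/) {
            # Skip the 5 characters of "_AED=" to get the numeric value.
            aed = substr($9, RSTART + 5, RLENGTH - 5) + 0
            n++
            if (aed >= 1) unsupported++
        }
        END { printf "mRNAs=%d AED1=%d\n", n, unsupported }
    ' "$1"
}
```

For example: aed_summary MYGENOME.all.gff
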
> Did you get gene models from snap or augustus? You can gather those with
> the fasta_merge script. Those should be a good starting place for training
> ab initio predictors. Instructions for training snap can be found here:
> http://gmod.org/wiki/MAKER_Tutorial#Training_ab_initio_Gene_Predictors
>
> Augustus can also be trained but is much more involved.
>
> ~Daniel
>
>
> Daniel Ence
> Graduate Student
> Eccles Institute of Human Genetics
> University of Utah
> 15 North 2030 East, Room 2100
> Salt Lake City, UT 84112-5330
>
> On Dec 11, 2015, at 10:43 AM, Elyssa Garza <elyssa_garza at yahoo.com> wrote:
>
> Hello,
>
> I have recently begun running Maker. I am currently trying to annotate my
> Caulanthus genome (~372 Mb), a relative of Arabidopsis. I am unsure about
> the parameters I have chosen for my first Maker run, which include:
>
> genome=CAB_assembly.fasta (1044 contigs)
> est=Representative_transcript_loci.fasta (assembled transcripts btw
> 200-20000bp long)
> protein=TAIR10pep.fasta (Arabidopsis proteins)
> —
> *Repeat masking*
> model_org=arabidopsis
> rmlib=list of Brassicaceae and common plant repeats
> repeat_protein=te_proteins.fasta
> *Gene Prediction*
> snaphmm=A.thaliana.hmm
> augustus_species=arabidopsis
> est2genome=1
>
> I have run a sample file of scaffolds, as well as the entire genome.
> In the sample file of scaffolds, I merged the GFFs with gff3_merge and then
> ran evaluator. I noticed that my AED values are all 1. Is this bad? What
> should I try next?
>
> I am also unsure on how to train files and if this should be done in my
> case.
>
> Can anyone advise me on these issues?
>
> -Elyssa
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

-- 
Xabier Vázquez-Campos, *PhD*
*Research Associate*
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA