[maker-devel] maker problem

Mon Oct 8 11:31:04 MDT 2018

Alright, I have gone through all of those tutorials. But my question is: why is MAKER generating only a GFF file as output? There is neither a transcripts.fasta nor a proteins.fasta in the output directory, so I can use gff3_merge but not fasta_merge, because there are no FASTA files. This happened for all scaffolds.

Below is an example from my datastore_index.log file for one scaffold:

ScJhAqd_1;HRSCAF=2 Sh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/ STARTED
ScJhAqd_1;HRSCAF=2 Sh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/ FINISHED
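
A quick way to see how many scaffolds ended in each state is to tally the last status recorded per sequence. The following is only a minimal sketch: it assumes each log line has the sequence name first, the output directory second, and the status last, as in the two lines above. The function names are made up for illustration and are not part of MAKER.

```python
from collections import Counter


def final_statuses(lines):
    """Map each sequence name to the last status recorded for it."""
    last = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        name, status = parts[0], parts[-1]
        last[name] = status  # later entries overwrite earlier ones
    return last


def summarize(lines):
    """Count how many sequences ended in each status."""
    return Counter(final_statuses(lines).values())


# The two log lines quoted in this message.
example = [
    "ScJhAqd_1;HRSCAF=2 Sh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/ STARTED",
    "ScJhAqd_1;HRSCAF=2 Sh_masked_rd2_datastore/18/62/ScJhAqd_1%3BHRSCAF=2/ FINISHED",
]
print(summarize(example))
```

On a full log, the same tally would show how many scaffolds ended as RETRY or FAILED instead of FINISHED.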

Output directory of that scaffold looks like:

[Linux at waterman ScJhAqd_1%3BHRSCAF=2]$ ll
total 160
drwxr-xr-x 3 guptapa pi     3 Oct  5 15:51 ../
-rw-r--r-- 1 guptapa pi 27740 Oct  5 15:51 run.log
-rw-r--r-- 1 guptapa pi 34268 Oct  5 15:51 ScJhAqd_1%3BHRSCAF=2.gff
drwxr-xr-x 2 guptapa pi    75 Oct  5 15:51 theVoid.ScJhAqd_1%3BHRSCAF=2/
drwxr-xr-x 3 guptapa pi     5 Oct  5 15:51 ./

gff looks like:

[Linux at waterman ScJhAqd_1%3BHRSCAF=2]$ head ScJhAqd_1%3BHRSCAF=2.gff
##gff-version 3
ScJhAqd_1%3BHRSCAF%3D2 . contig 1 2578 . . . ID=ScJhAqd_1%3BHRSCAF%3D2;Name=ScJhAqd_1%3BHRSCAF%3D2;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1782 2024 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:0;Name=Mlong585_29391-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 1782 2024 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:0;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:0;Name=Mlong585_29391-RA;Target=Mlong585_29391-RA 132 212;Gap=M81;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1785 2578 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2477 2578 112 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:1;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;Target=Mlong585_37101-RA 28 61;Gap=M34;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 1785 2042 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:2;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:1;Name=Mlong585_37101-RA;Target=Mlong585_37101-RA 154 239;Gap=M86;
ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1806 2578 128 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2471 2578 132 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:3;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;Target=Mlong585_11451-RA 117 152;Gap=M36;
ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 2299 2379 89 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:4;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:2;Name=Mlong585_11451-RA;Target=Mlong585_11451-RA 153 179;Gap=M27;

Regards,
Parul

On Oct 8, 2018, at 11:34 AM, Carson Hinton Holt <carson.holt at genetics.utah.edu> wrote:

GFF3 should have the assembly fasta at the bottom. That is part of the format. Please familiarize yourself with GFF3 here —> https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
Particularly look at the different kinds of expected features (example gene/mRNA/exon/CDS gene models vs match/match_part evidence alignments).
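
To check which kind of features a GFF3 file actually contains, one can count the values in the third (type) column. The sketch below is illustrative only. It splits on any whitespace so that it also works on the space-flattened lines quoted in this thread (real GFF3 is tab-separated), and it stops at the `##FASTA` directive that introduces the embedded sequence. The function names are hypothetical.

```python
from collections import Counter

# Feature types that make up a gene model, as named in the message above.
GENE_MODEL_TYPES = {"gene", "mRNA", "exon", "CDS"}


def feature_type_counts(gff_lines):
    """Count GFF3 feature types (column 3), ignoring the FASTA section."""
    counts = Counter()
    for line in gff_lines:
        line = line.strip()
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence, not features
        if not line or line.startswith("#"):
            continue
        cols = line.split(None, 8)  # first 8 columns, attributes kept whole
        if len(cols) >= 3:
            counts[cols[2]] += 1
    return counts


def has_gene_models(counts):
    """True if at least one gene-model feature type is present."""
    return any(counts[t] > 0 for t in GENE_MODEL_TYPES)


# Three lines from the GFF quoted earlier in this thread.
example = [
    "##gff-version 3",
    "ScJhAqd_1%3BHRSCAF%3D2 . contig 1 2578 . . . ID=ScJhAqd_1%3BHRSCAF%3D2;Name=ScJhAqd_1%3BHRSCAF%3D2;",
    "ScJhAqd_1%3BHRSCAF%3D2 blastx protein_match 1782 2024 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hit:0;Name=Mlong585_29391-RA;",
    "ScJhAqd_1%3BHRSCAF%3D2 blastx match_part 1782 2024 175 - . ID=ScJhAqd_1%3BHRSCAF%3D2:hsp:0;Parent=ScJhAqd_1%3BHRSCAF%3D2:hit:0;Name=Mlong585_29391-RA;Target=Mlong585_29391-RA 132 212;Gap=M81;",
]
counts = feature_type_counts(example)
print(dict(counts), has_gene_models(counts))
```

A file that contains only contig, protein_match and match_part lines holds evidence alignments but no gene models.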

Also you need to familiarize yourself with the MAKER documentation, and perhaps follow one of the step by step tutorials in the MAKER wiki (http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Main_Page). The 2014 tutorial has a video you can follow along with. Output files are described in the documentation and the wiki. Particularly look at the necessary gff3_merge and fasta_merge scripts described in the wiki with multiple examples. Individual contigs will have results like so —>
contig-dpp-500-500.gff
contig-dpp-500-500.maker.proteins.fasta
contig-dpp-500-500.maker.transcripts.fasta

The merge scripts will collect all the individual contig results into merged files. Example datasets for all of the wiki tutorials are included in the …/maker/data directory as well as the .../maker/MWAS/data/ directory (you can use them to follow along with the wiki pages).
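
Before running the merge scripts, it can help to see which of the three per-contig files are present in a contig's output directory. This is a rough sketch, assuming the files are named after the directory itself, as in the contig-dpp-500-500 example and the directory listing quoted in this thread. The helper names are invented for illustration.

```python
from pathlib import Path

# Suffixes taken from the per-contig file names listed above.
SUFFIXES = (".gff", ".maker.proteins.fasta", ".maker.transcripts.fasta")


def expected_outputs(contig_dir):
    """Return the three per-contig paths, assuming files share the directory's name."""
    d = Path(contig_dir)
    return [d / (d.name + suffix) for suffix in SUFFIXES]


def missing_outputs(contig_dir):
    """Return the names of expected files that do not exist on disk."""
    return [p.name for p in expected_outputs(contig_dir) if not p.is_file()]


# Hypothetical directory used only to show the derived file names.
for path in expected_outputs("some_datastore/contig-dpp-500-500/"):
    print(path.name)
```

If only the .gff is present for a contig, fasta_merge has nothing to collect for it.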

If you follow the tutorial steps for training SNAP on a new genome and you get empty training files, then the issue is the evidence training sets you gave (example from the e-mail list archive):
https://groups.google.com/forum/#!searchin/maker-devel/maker2zff%7Csort:date/maker-devel/TculOM5oxl4/UWENIGN7EQAJ

You can also browse through the archive for more info on training SNAP and Augustus.

—Carson

On Oct 8, 2018, at 10:12 AM, Gupta, Parul <Parul.Gupta at oregonstate.edu> wrote:

Hi Carson,
As per your suggestion, I turned on est2genome=1 and protein2genome=1, but similar results are generated. The GFF of each scaffold has FASTA (transcript) sequence at the end, instead of separate transcripts.fasta and proteins.fasta files being generated. I don’t know how to use such GFFs for further processing, such as training SNAP (for gene prediction). I need your suggestion.
Is there an option to provide trained data from Augustus (generated by standalone Augustus rather than from MAKER) instead of an Augustus species in maker_opts.ctl?

Thanks,
Parul

On Oct 4, 2018, at 6:43 PM, Gupta, Parul <Parul.Gupta at oregonstate.edu> wrote:

Thank you Carson.

On Oct 4, 2018, at 3:11 PM, Carson Holt <carsonhh at gmail.com> wrote:

You must turn on at least one prediction method. It can be est2genome=1, protein2genome=1, or a species file to run SNAP/Augustus. The first two options are for building models to train with.

If you don’t provide a prediction method, MAKER will align evidence, but you won’t get any gene models.
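
As a rough illustration, a control file can be checked for an enabled prediction method before starting a run. The sketch below assumes the usual `key=value #comment` layout of maker_opts.ctl and looks only at est2genome, protein2genome, snaphmm and augustus_species; that list is an assumption and not exhaustive, and the function names are made up.

```python
def parse_ctl(text):
    """Parse 'key=value #comment' lines into a dict of strings."""
    opts = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        opts[key.strip()] = value.strip()
    return opts


def enabled_prediction_methods(opts):
    """List the prediction-related options that are switched on or filled in."""
    enabled = []
    for flag in ("est2genome", "protein2genome"):
        if opts.get(flag) == "1":
            enabled.append(flag)
    for key in ("snaphmm", "augustus_species"):
        if opts.get(key):  # non-empty value means a file or species was given
            enabled.append(key)
    return enabled


# Hypothetical control-file excerpt, for illustration only.
sample = """\
est2genome=1 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0
snaphmm= #SNAP HMM file
augustus_species=
"""
print(enabled_prediction_methods(parse_ctl(sample)))
```

An empty list would correspond to the situation described above: evidence gets aligned, but no gene models are produced.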

Example:
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018#Training_ab_initio_Gene_Predictors

—Carson

On Oct 1, 2018, at 1:05 PM, Gupta, Parul <Parul.Gupta at oregonstate.edu> wrote:

Hi Carson,
I am a new user of the MAKER pipeline and want to get gene predictions for a new plant genome. I used the following options in the maker_opts.ctl file for the first round:
genome=masked_genome.fasta
est=transcripts.fasta (from same species for which genome fasta is provided)
altest=transcripts.fasta (from alternative organism)
protein=proteins.fasta

The output files are only GFF (no FASTA); however, the GFF for each scaffold has FASTA sequences at the bottom. Is that the correct output?
In order to train SNAP, I used gff3_merge to concatenate all GFFs listed in datastore_index.log into all.gff (which also has FASTA sequences). Then I ran maker2zff on all.gff, and it generated zero-size files (genome.ann and genome.dna). I am wondering whether I made a mistake or did not provide all of the input files. For repeat masking, I used RepeatMasker separately from the MAKER pipeline. My datastore_index.log file shows many "RETRY" and "FAILED" scaffolds.
FYI, I subscribed to the "maker-devel" Google group, but the "new topic" button is greyed out.

Your suggestions?

Thanks in advance.

Parul