[maker-devel] maker problem

Gupta, Parul Parul.Gupta at oregonstate.edu
Mon Oct 8 14:40:06 MDT 2018


 I used Augustus to generate a training set (separately from MAKER) based on transcripts (FASTA). How can I use that Augustus-generated training data (hints in GFF3 format) in MAKER for gene prediction? I can see only the Augustus species option in maker_opts.ctl. Which option do I need to set in maker_opts.ctl to supply the Augustus-generated hints file? I have augustus.gff as the predicted hints.

est2genome doesn't work with est_gff. You must provide fasta of assembled transcripts. You can revert back to the GFF3 if you want after training.

I used est_fasta, not est_gff.

Find a contig with protein2genome results in the GFF3

Yes, I can see protein2genome results in the GFF3:

ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome protein_match 31566 32621 1426 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 31566 31775 1426 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532540;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA;Target=Mlong585_07911-RA 82 154;Gap=M14 I3 M56;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 31872 32621 1426 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532541;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;Name=Mlong585_07911-RA;Target=Mlong585_07911-RA 155 409;Gap=M126 I5 M124;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome protein_match 33816 35829 1394 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446674;Name=Mlong585_12901-RA;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 34916 35829 1394 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532542;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446674;Name=Mlong585_12901-RA;Target=Mlong585_12901-RA 41 343;Gap=M27 D1 M276 F2;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 33816 34182 1394 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532543;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446674;Name=Mlong585_12901-RA;Target=Mlong585_12901-RA 344 466;Gap=R2 M123;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome protein_match 49636 51466 1091 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446675;Name=Mlong585_07901-RA;
ScJhAqd_2184%3BHRSCAF%3D3164 protein2genome match_part 51354 51466 1091 - . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532544;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446675;Name=Mlong585_07901-RA;Target=Mlong585_07901-RA 1 36;Gap=M20 D1 M16 F2;

and est2genome results in the GFF3 as well:

ScJhAqd_2184%3BHRSCAF%3D3164 est2genome expressed_sequence_match 48887305 48890708 16239 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547163;Name=Sh_Salba_v2_61181;
ScJhAqd_2184%3BHRSCAF%3D3164 est2genome match_part 48887305 48889881 16239 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871792;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547163;Name=Sh_Salba_v2_61181;Target=Sh_Salba_v2_61181 1 2590 +;Gap=M285 D1 M288 I10 M5 I4 M1998;
ScJhAqd_2184%3BHRSCAF%3D3164 est2genome match_part 48889982 48890708 16239 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871793;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547163;Name=Sh_Salba_v2_61181;Target=Sh_Salba_v2_61181 2591 3317 +;Gap=M727;
ScJhAqd_2184%3BHRSCAF%3D3164 est2genome expressed_sequence_match 48887305 48890708 16412 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547164;Name=Sh_Salba_v2_61182;
ScJhAqd_2184%3BHRSCAF%3D3164 est2genome match_part 48887305 48889881 16412 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871794;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547164;Name=Sh_Salba_v2_61182;Target=Sh_Salba_v2_61182 1 2590 +;Gap=M285 D1 M288 I10 M5 I4 M1998;
ScJhAqd_2184%3BHRSCAF%3D3164 est2genome match_part 48889949 48890708 16412 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1871795;Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547164;Name=Sh_Salba_v2_61182;Target=Sh_Salba_v2_61182 2591 3350 +;Gap=M760;
ScJhAqd_2184%3BHRSCAF%3D3164 est2genome expressed_sequence_match 48895479 48899036 9582 + . ID=ScJhAqd_2184%3BHRSCAF%3D3164:hit:547165;Name=Sh_Salba_v2_108280;
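
Records like the ones above can be pulled apart programmatically when a browser is not at hand. Below is a minimal Python sketch, assuming the real file is tab-delimited (the archive shows spaces where the tabs were). The function name parse_gff3_line is made up for illustration and is not part of MAKER. It splits the nine GFF3 columns, decodes the percent-encoded contig name (%3B is ";" and %3D is "="), and turns column 9 into a dictionary.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict, or return None if it is not one."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None  # blank line, comment, or directive
    cols = line.split("\t")
    if len(cols) != 9:
        return None  # not a well-formed feature line
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    attributes = {}
    # Split on ';' first, then decode, so an encoded ';' (%3B) inside a value survives.
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = unquote(value)
    return {
        "seqid": unquote(seqid),
        "source": source,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "score": score,
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }


# One of the protein2genome match_part records from this message, re-joined with tabs.
example = "\t".join([
    "ScJhAqd_2184%3BHRSCAF%3D3164", "protein2genome", "match_part",
    "31566", "31775", "1426", "+", ".",
    "ID=ScJhAqd_2184%3BHRSCAF%3D3164:hsp:1532540;"
    "Parent=ScJhAqd_2184%3BHRSCAF%3D3164:hit:446673;"
    "Name=Mlong585_07911-RA;Target=Mlong585_07911-RA 82 154;Gap=M14 I3 M56;",
])
record = parse_gff3_line(example)
print(record["seqid"], record["source"], record["attributes"]["Target"])
```

Splitting on tabs rather than on whitespace matters here because the Target and Gap values contain spaces.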

Thanks,
Parul

On Oct 8, 2018, at 3:11 PM, Carson Holt <carsonhh at gmail.com> wrote:


We ran BUSCO and found no problem with the genome assembly. I used RepeatMasker (separately from the MAKER pipeline) to mask repeats using a custom-generated library (de novo repeats plus repeat libraries from other species). The masked genome was used as input in maker_opts.ctl.

Let MAKER run the masking if possible. Also, BUSCO can be used to train Augustus, which can then become the gene predictor in MAKER.


Transcripts-
We have RNA-Seq data, assembled using Velvet/Oases, from the same species as the sequenced genome. I globally aligned the transcripts to the assembled genome using GMAP, which gave ~99% mapping. The GFF3 generated by GMAP was also checked in a genome browser. Those transcripts were used as the est input in maker_opts.ctl. These assembled transcripts may be redundant.

est2genome doesn't work with est_gff. You must provide fasta of assembled transcripts. You can revert back to the GFF3 if you want after training.


Proteins-
I used protein sequences (FASTA) downloaded from UniProt for 5 closely related species, plus one set from an in-house sequenced genome (already published). The protein sequences from all 6 organisms were concatenated into one file and used as protein evidence in maker_opts.ctl.

Look at the contigs in a browser. Find a contig with protein2genome results in the GFF3 (i.e. the source column is marked protein2genome in the GFF3), and look at it specifically. If you don’t find any, then the issue is either your pre-masking or the evidence proteins you gave. I’d recommend using UniProt/Swiss-Prot, which contains a broad set of curated and conserved proteins.
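
To pick a contig to inspect, it can help to tally the source column (column 2) per contig first. The following is a rough Python sketch, not a MAKER utility: the function names count_sources and contigs_with_source are invented for this example, and it assumes a tab-delimited GFF3 whose optional sequence section starts with a ##FASTA directive.

```python
from collections import Counter, defaultdict


def count_sources(gff3_lines):
    """Return {contig: Counter({source: number_of_features})} for a GFF3 stream."""
    counts = defaultdict(Counter)
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence, not features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and other directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 3:
            continue  # not a feature line
        counts[cols[0]][cols[1]] += 1
    return counts


def contigs_with_source(counts, source):
    """List the contigs that carry at least one feature from the given source."""
    return sorted(c for c, per_source in counts.items() if per_source[source] > 0)


# Tiny in-memory demo; with a real file, pass the open file handle instead.
demo = [
    "##gff-version 3\n",
    "contigA\tprotein2genome\tprotein_match\t100\t900\t1426\t+\t.\tID=a1\n",
    "contigA\test2genome\texpressed_sequence_match\t50\t950\t16239\t+\t.\tID=a2\n",
    "contigB\trepeatmasker\tmatch\t10\t60\t200\t+\t.\tID=b1\n",
]
counts = count_sources(demo)
print(contigs_with_source(counts, "protein2genome"))
```

An empty list for protein2genome across the whole file would point toward the masking or protein-evidence issue described above.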


altest=transcripts.fasta (from in-house sequenced genome (already published))

These will be ignored until you have a trained HMM (this type of alignment can only be used as hints to the trained predictor).

—Carson
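
One reading of the advice in this message is a two-pass workflow: a first pass with transcripts supplied as FASTA and est2genome switched on to produce models for training, then a second pass with est2genome switched off and the trained Augustus model named in augustus_species. The Python sketch below edits the text of a maker_opts.ctl accordingly. It is an illustration under assumptions: set_ctl_option is a made-up helper, "my_species" is a placeholder for whatever name the Augustus training used, and the control-file lines are assumed to have the usual "key=value #comment" shape.

```python
def set_ctl_option(ctl_text, key, value):
    """Return ctl_text with 'key=...' set to value, keeping any trailing #comment."""
    out = []
    found = False
    for line in ctl_text.splitlines():
        if line.startswith(key + "="):
            _, _, rest = line.partition("=")
            _, hash_mark, comment = rest.partition("#")
            new_line = f"{key}={value}"
            if hash_mark:
                new_line += " #" + comment
            out.append(new_line)
            found = True
        else:
            out.append(line)
    if not found:
        raise KeyError(f"option {key!r} not found in control file text")
    return "\n".join(out) + "\n"


# Abbreviated stand-in for a maker_opts.ctl; a real file has many more options.
first_pass = (
    "est=transcripts.fasta #assembled transcripts in FASTA format\n"
    "est_gff= #aligned transcripts in GFF3 format\n"
    "est2genome=1 #infer gene predictions directly from transcripts\n"
    "augustus_species= #Augustus gene prediction species model\n"
)

# Second pass: stop inferring models from transcripts, use the trained predictor.
second_pass = set_ctl_option(first_pass, "est2genome", "0")
second_pass = set_ctl_option(second_pass, "augustus_species", "my_species")  # placeholder name
print(second_pass)
```

Matching on the key followed by "=" keeps est, est_gff, and est2genome from being confused with one another.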

