[maker-devel] how to input a masked assembly for annotation into Maker
Valerie Soza
vsoza at uw.edu
Fri Jun 1 13:36:10 MDT 2018
Hi Maker community
I have done repeatmasking and gene prediction using the Maker pipeline on an entire genome assembly that has some scaffolds mapped to chromosomes and others that are unmapped, for a total of 11,985 scaffolds (Annotation A). I used the standard build protocol #2 as outlined by Carson at https://groups.google.com/forum/#!searchin/maker-devel/Provide$20maker_gff$20%7Csort:date/maker-devel/7nU0EaSe2ww/Hb8ARa0WBAAJ. This gave me a total of 23,559 predicted genes (21,960 genes from the default build + 1,599 genes that were rescued with Pfam domains) for the entire assembly.
Now, I want to only use scaffolds that have been mapped and ordered along chromosomes to create a pseudochromosal sequence for each chromosome that stitches together all the ordered scaffolds along a chromosome, each scaffold separated by a stretch of 100 Ns, for synteny analyses. I then want to get annotations for each of these pseudochromosal sequences. I am trying to see if I can re-do the annotations by Maker on these pseudochromosal sequences using the masked assembly produced by Maker above. I have extracted the masked sequence for ordered scaffolds for each chromosome and have used these masked sequences to create pseudochromosomal scaffolds, resulting in 13 scaffolds representing the 13 chromosomes. I tried using these masked sequences (13 scaffolds) as input for Maker to create a standard build (Annotation B), but am getting less genes predicted for these scaffolds than what I got from my entire assembly (Annotation A) above.
For all ordered scaffolds across chromosomes, I got 21,419 genes from the standard build annotation on the entire assembly (Annotation A). However, using the masked pseudochromosomal scaffolds (Annotation B), I am getting less genes predicted for the same set of scaffolds: 20,830 genes (18,465 from default build + 2,365 genes that were rescued with Pfam domains).
I am wondering if I have a setting wrong for my maker_opts.ctl files for the default and standard build runs on the masked sequences, see attached below, particularly in the repeat masking or re-annotation part of the file.
I also looked at the default build annotations in Jbrowse and compared Annotation A to Annotation B, and they looked similar except that there were transcripts from my altest= file that were not showing up in Annotation B, but were present in Annotation A; therefore, Maker did not predict some of these genes in Annotation B, but did in Annotation A. So I think Maker was missing some things in Annotation B. Is this unexpected? If so, is someone willing to check my steps and control files below to see if I did something wrong?
Or, if someone has a better suggestion for extracting coding sequence from these pseudochromosal scaffolds that were created post-Maker annotation on the entire assembly, I would welcome it. iThanks a bunch.
Annotation A default build steps:
$ maker -base Rwill10 -fix_nucleotides
$ maker -base Rwill10 -fix_nucleotides
$ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc
11983 11983 312159
#should be 11985
$ maker -dsindex -base Rwill10 -fix_nucleotides
$ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc
11985 11985 312211
$ gff3_merge -d Rwill10_master_datastore_index.log
$ grep -cP '\tgene\t' ../Rwill10.maker.output/Rwill10.all.gff
21960
$ fasta_merge -d Rwill10_master_datastore_index.log
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationA.default.log
Type: application/octet-stream
Size: 4650 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180601/de59aa1c/attachment-0008.obj>
-------------- next part --------------
Annotation A standard build steps:
$ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta
#genes in Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta and Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv are not in Rwill10.all.gff file
#IDs in .tsv file are called "processed-gene" from .fasta file,
#but in .gff file, I think these are called "abinit-gene"
#best thing would be to replace "processed" with "abinit" in tsv file and then grep .gff file with these IDs to create pred_gff
$ sed s/\-processed\-/\-abinit\-/g Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed
#extract list of IDs only to grep for
cut -f 1 Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs
$ sort Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
$ wc Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
1599 1599 102958 Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
#used create_pred_gff.sh script to grep IDs from Rwill10.all.gff file and create Rwill10.PfamA.abinit.gff
$ maker -base Rwill10standard2 -fix_nucleotides
$ maker -base Rwill10standard2 -fix_nucleotides
$ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
11975 11975 311953
#should be 11985
$ maker -dsindex -base Rwill10standard2 -fix_nucleotides
$ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
11985 11985 312211
$ gff3_merge -d Rwill10standard2_master_datastore_index.log
$ grep -cP '\tgene\t' Rwill10standard2.all.gff
23559
$ fasta_merge -d Rwill10standard2_master_datastore_index.log
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationA.standard.log
Type: application/octet-stream
Size: 4529 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180601/de59aa1c/attachment-0009.obj>
-------------- next part --------------
Annotation B default build steps:
$ find Rwill10.maker.output -name 'query.masked.fasta' | sort -V -t "/" -k 5 | xargs cat > Rwill10.maker.assembly_masked.sorted.fasta
#Use perl script remove_seq_breaks.pl to remove newline characters from sequences in genome fasta file so that only 1 line of sequence follows fasta header
$ perl ../remove_seq_breaks.pl Rwill10.maker.assembly_masked.sorted.fasta > Rwill10.maker.assembly_masked.sorted.fasta.woseqbreaks
#use script to extract ordered scaffolds for each chromosome
$ ./extract_scaffolds_synteny.sh
#use script to create pseudochromosomal sequence for each chromosome
$ ./create_pseudo_chromosome_allLGs.sh
#concatenate these into one fasta file
cat *_ordered.masked.fasta2.pseudochromo > R.will10.masked.pseudochromos.fasta
$ maker -base Rwill10.pseudochromos -fix_nucleotides
$ maker -base Rwill10.pseudochromos -fix_nucleotides
$ grep FINISHED Rwill10.pseudochromos_master_datastore_index.log | sort | uniq | cut -f 1 | wc
13 13 312
$ gff3_merge -d Rwill10.pseudochromos_master_datastore_index.log
$ grep -cP '\tgene\t' Rwill10.pseudochromos.all.gff
18465
$ fasta_merge -d Rwill10.pseudochromos_master_datastore_index.log
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationB.default.log
Type: application/octet-stream
Size: 4604 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180601/de59aa1c/attachment-0010.obj>
-------------- next part --------------
Annotation B standard build steps:
$ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta
$ sed s/\-processed\-/\-abinit\-/g Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed
$ cut -f 1 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs
$ sort Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
$ wc Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
2365 2365 136750 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
#used create_pred_gff.sh script to grep IDs from Rwill10.pseudochromos.all.gff file and create Rwill10.pseudochromos.PfamA.abinit.gff
$ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides
$ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides
$ grep FINISHED Rwill10.pseudochromo.standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
13 13 312
$ gff3_merge -d Rwill10.pseudochromo.standard2_master_datastore_index.log
$ grep -cP '\tgene\t' Rwill10.pseudochromo.standard2.all.gff
20830
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationB.standard.log
Type: application/octet-stream
Size: 4558 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180601/de59aa1c/attachment-0011.obj>
-------------- next part --------------
-Valerie
Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/
More information about the maker-devel
mailing list