[maker-devel] how to input a masked assembly for annotation into Maker

Fri Jun 1 13:36:10 MDT 2018

Hi Maker community

I have done repeatmasking and gene prediction using the Maker pipeline on an entire genome assembly that has some scaffolds mapped to chromosomes and others that are unmapped, for a total of 11,985 scaffolds (Annotation A). I used the standard build protocol #2 as outlined by Carson at https://groups.google.com/forum/#!searchin/maker-devel/Provide$20maker_gff$20%7Csort:date/maker-devel/7nU0EaSe2ww/Hb8ARa0WBAAJ. This gave me a total of 23,559 predicted genes (21,960 genes from the default build + 1,599 genes that were rescued with Pfam domains) for the entire assembly.

Now, I want to only use scaffolds that have been mapped and ordered along chromosomes to create a pseudochromosal sequence for each chromosome that stitches together all the ordered scaffolds along a chromosome, each scaffold separated by a stretch of 100 Ns, for synteny analyses. I then want to get annotations for each of these pseudochromosal sequences. I am trying to see if I can re-do the annotations by Maker on these pseudochromosal sequences using the masked assembly produced by Maker above. I have extracted the masked sequence for ordered scaffolds for each chromosome and have used these masked sequences to create pseudochromosomal scaffolds, resulting in 13 scaffolds representing the 13 chromosomes. I tried using these masked sequences (13 scaffolds) as input for Maker to create a standard build (Annotation B), but am getting less genes predicted for these scaffolds than what I got from my entire assembly (Annotation A) above.

For all ordered scaffolds across chromosomes, I got 21,419 genes from the standard build annotation on the entire assembly (Annotation A). However, using the masked pseudochromosomal scaffolds (Annotation B), I am getting less genes predicted for the same set of scaffolds: 20,830 genes (18,465 from default build + 2,365 genes that were rescued with Pfam domains). 

I am wondering if I have a setting wrong for my maker_opts.ctl files for the default and standard build runs on the masked sequences, see attached below, particularly in the repeat masking or re-annotation part of the file.
I also looked at the default build annotations in Jbrowse and compared Annotation A to Annotation B, and they looked similar except that there were transcripts from my altest= file that were not showing up in Annotation B, but were present in Annotation A; therefore, Maker did not predict some of these genes in Annotation B, but did in Annotation A. So I think Maker was missing some things in Annotation B. Is this unexpected? If so, is someone willing to check my steps and control files below to see if I did something wrong? 

Or, if someone has a better suggestion for extracting coding sequence from these pseudochromosal scaffolds that were created post-Maker annotation on the entire assembly, I would welcome it. iThanks a bunch.

Annotation A default build steps:

$ maker -base Rwill10 -fix_nucleotides
$ maker -base Rwill10 -fix_nucleotides

$ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc
  11983   11983  312159
#should be 11985

$ maker -dsindex -base Rwill10 -fix_nucleotides

$ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc
  11985   11985  312211

$ gff3_merge -d Rwill10_master_datastore_index.log

$ grep -cP '\tgene\t' ../Rwill10.maker.output/Rwill10.all.gff
21960

$ fasta_merge -d Rwill10_master_datastore_index.log

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationA.default.log
Type: application/octet-stream
Size: 4650 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180601/de59aa1c/attachment-0008.obj>
-------------- next part --------------

Annotation A standard build steps:

$ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta

#genes in Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta and Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv are not in Rwill10.all.gff file
#IDs in .tsv file are called "processed-gene" from .fasta file, 
#but in .gff file, I think these are called "abinit-gene"
#best thing would be to replace "processed" with "abinit" in tsv file and then grep .gff file with these IDs to create pred_gff
$ sed s/\-processed\-/\-abinit\-/g Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed

#extract list of IDs only to grep for
cut -f 1 Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs

$ sort Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs  

$ wc Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs 
  1599   1599 102958 Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs

#used create_pred_gff.sh script to grep IDs from Rwill10.all.gff file and create Rwill10.PfamA.abinit.gff

$ maker -base Rwill10standard2 -fix_nucleotides
$ maker -base Rwill10standard2 -fix_nucleotides

$ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
  11975   11975  311953
#should be 11985

$ maker -dsindex -base Rwill10standard2 -fix_nucleotides

$ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
  11985   11985  312211

$ gff3_merge -d Rwill10standard2_master_datastore_index.log

$ grep -cP '\tgene\t' Rwill10standard2.all.gff
23559

$ fasta_merge -d Rwill10standard2_master_datastore_index.log

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationA.standard.log
Type: application/octet-stream
Size: 4529 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180601/de59aa1c/attachment-0009.obj>
-------------- next part --------------

Annotation B default build steps:

$ find Rwill10.maker.output -name 'query.masked.fasta' | sort -V -t "/" -k 5 | xargs cat > Rwill10.maker.assembly_masked.sorted.fasta

#Use perl script remove_seq_breaks.pl to remove newline characters from sequences in genome fasta file so that only 1 line of sequence follows fasta header
$ perl ../remove_seq_breaks.pl Rwill10.maker.assembly_masked.sorted.fasta > Rwill10.maker.assembly_masked.sorted.fasta.woseqbreaks 

#use script to extract ordered scaffolds for each chromosome
$ ./extract_scaffolds_synteny.sh

#use script to create pseudochromosomal sequence for each chromosome
$ ./create_pseudo_chromosome_allLGs.sh

#concatenate these into one fasta file
cat *_ordered.masked.fasta2.pseudochromo > R.will10.masked.pseudochromos.fasta

$ maker -base Rwill10.pseudochromos -fix_nucleotides
$ maker -base Rwill10.pseudochromos -fix_nucleotides

$ grep FINISHED Rwill10.pseudochromos_master_datastore_index.log | sort | uniq | cut -f 1 | wc
     13      13     312

$ gff3_merge -d Rwill10.pseudochromos_master_datastore_index.log

$ grep -cP '\tgene\t' Rwill10.pseudochromos.all.gff
18465

$ fasta_merge -d Rwill10.pseudochromos_master_datastore_index.log

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationB.default.log
Type: application/octet-stream
Size: 4604 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180601/de59aa1c/attachment-0010.obj>
-------------- next part --------------

Annotation B standard build steps:

$ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta

$ sed s/\-processed\-/\-abinit\-/g Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed

$ cut -f 1 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs

$ sort Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs  

$ wc Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs 
  2365   2365 136750 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs

#used create_pred_gff.sh script to grep IDs from Rwill10.pseudochromos.all.gff file and create Rwill10.pseudochromos.PfamA.abinit.gff

$ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides
$ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides

$ grep FINISHED Rwill10.pseudochromo.standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
     13      13     312

$ gff3_merge -d Rwill10.pseudochromo.standard2_master_datastore_index.log

$ grep -cP '\tgene\t' Rwill10.pseudochromo.standard2.all.gff
20830

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationB.standard.log
Type: application/octet-stream
Size: 4558 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20180601/de59aa1c/attachment-0011.obj>
-------------- next part --------------

-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/