[maker-devel] how to input a masked assembly for annotation into Maker

Mon Jun 11 10:46:13 MDT 2018

Predictors will call partial models near the edge of contigs, but if you merger them with 100 N gaps, they will be more likely to jump the gap (attempt to merge models on each side while putting the gap in the intron - even if that is not correct), or they will just call nothing around the gap (i.e. they will behave different around gaps than at the edge of contigs).

—Carson

> On Jun 1, 2018, at 1:36 PM, Valerie Soza <vsoza at uw.edu> wrote:
> 
> Hi Maker community
> 
> I have done repeatmasking and gene prediction using the Maker pipeline on an entire genome assembly that has some scaffolds mapped to chromosomes and others that are unmapped, for a total of 11,985 scaffolds (Annotation A). I used the standard build protocol #2 as outlined by Carson at https://groups.google.com/forum/#!searchin/maker-devel/Provide$20maker_gff$20%7Csort:date/maker-devel/7nU0EaSe2ww/Hb8ARa0WBAAJ. This gave me a total of 23,559 predicted genes (21,960 genes from the default build + 1,599 genes that were rescued with Pfam domains) for the entire assembly.
> 
> Now, I want to only use scaffolds that have been mapped and ordered along chromosomes to create a pseudochromosal sequence for each chromosome that stitches together all the ordered scaffolds along a chromosome, each scaffold separated by a stretch of 100 Ns, for synteny analyses. I then want to get annotations for each of these pseudochromosal sequences. I am trying to see if I can re-do the annotations by Maker on these pseudochromosal sequences using the masked assembly produced by Maker above. I have extracted the masked sequence for ordered scaffolds for each chromosome and have used these masked sequences to create pseudochromosomal scaffolds, resulting in 13 scaffolds representing the 13 chromosomes. I tried using these masked sequences (13 scaffolds) as input for Maker to create a standard build (Annotation B), but am getting less genes predicted for these scaffolds than what I got from my entire assembly (Annotation A) above.
> 
> For all ordered scaffolds across chromosomes, I got 21,419 genes from the standard build annotation on the entire assembly (Annotation A). However, using the masked pseudochromosomal scaffolds (Annotation B), I am getting less genes predicted for the same set of scaffolds: 20,830 genes (18,465 from default build + 2,365 genes that were rescued with Pfam domains). 
> 
> I am wondering if I have a setting wrong for my maker_opts.ctl files for the default and standard build runs on the masked sequences, see attached below, particularly in the repeat masking or re-annotation part of the file.
> I also looked at the default build annotations in Jbrowse and compared Annotation A to Annotation B, and they looked similar except that there were transcripts from my altest= file that were not showing up in Annotation B, but were present in Annotation A; therefore, Maker did not predict some of these genes in Annotation B, but did in Annotation A. So I think Maker was missing some things in Annotation B. Is this unexpected? If so, is someone willing to check my steps and control files below to see if I did something wrong? 
> 
> Or, if someone has a better suggestion for extracting coding sequence from these pseudochromosal scaffolds that were created post-Maker annotation on the entire assembly, I would welcome it. iThanks a bunch.
> 
> 
> Annotation A default build steps:
> 
> $ maker -base Rwill10 -fix_nucleotides
> $ maker -base Rwill10 -fix_nucleotides
> 
> $ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc
>  11983   11983  312159
> #should be 11985
> 
> $ maker -dsindex -base Rwill10 -fix_nucleotides
> 
> $ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc
>  11985   11985  312211
> 
> $ gff3_merge -d Rwill10_master_datastore_index.log
> 
> $ grep -cP '\tgene\t' ../Rwill10.maker.output/Rwill10.all.gff
> 21960
> 
> $ fasta_merge -d Rwill10_master_datastore_index.log
> 
> <maker_opts.log.AnnotationA.default.log>
> 
> 
> Annotation A standard build steps:
> 
> $ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta
> 
> #genes in Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta and Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv are not in Rwill10.all.gff file
> #IDs in .tsv file are called "processed-gene" from .fasta file, 
> #but in .gff file, I think these are called "abinit-gene"
> #best thing would be to replace "processed" with "abinit" in tsv file and then grep .gff file with these IDs to create pred_gff
> $ sed s/\-processed\-/\-abinit\-/g Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed
> 
> #extract list of IDs only to grep for
> cut -f 1 Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs
> 
> $ sort Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs  
> 
> $ wc Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs 
>  1599   1599 102958 Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
> 
> #used create_pred_gff.sh script to grep IDs from Rwill10.all.gff file and create Rwill10.PfamA.abinit.gff
> 
> $ maker -base Rwill10standard2 -fix_nucleotides
> $ maker -base Rwill10standard2 -fix_nucleotides
> 
> $ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
>  11975   11975  311953
> #should be 11985
> 
> $ maker -dsindex -base Rwill10standard2 -fix_nucleotides
> 
> $ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
>  11985   11985  312211
> 
> $ gff3_merge -d Rwill10standard2_master_datastore_index.log
> 
> $ grep -cP '\tgene\t' Rwill10standard2.all.gff
> 23559
> 
> $ fasta_merge -d Rwill10standard2_master_datastore_index.log
> 
> <maker_opts.log.AnnotationA.standard.log>
> 
> 
> Annotation B default build steps:
> 
> $ find Rwill10.maker.output -name 'query.masked.fasta' | sort -V -t "/" -k 5 | xargs cat > Rwill10.maker.assembly_masked.sorted.fasta
> 
> #Use perl script remove_seq_breaks.pl to remove newline characters from sequences in genome fasta file so that only 1 line of sequence follows fasta header
> $ perl ../remove_seq_breaks.pl Rwill10.maker.assembly_masked.sorted.fasta > Rwill10.maker.assembly_masked.sorted.fasta.woseqbreaks 
> 
> #use script to extract ordered scaffolds for each chromosome
> $ ./extract_scaffolds_synteny.sh
> 
> #use script to create pseudochromosomal sequence for each chromosome
> $ ./create_pseudo_chromosome_allLGs.sh
> 
> #concatenate these into one fasta file
> cat *_ordered.masked.fasta2.pseudochromo > R.will10.masked.pseudochromos.fasta
> 
> $ maker -base Rwill10.pseudochromos -fix_nucleotides
> $ maker -base Rwill10.pseudochromos -fix_nucleotides
> 
> $ grep FINISHED Rwill10.pseudochromos_master_datastore_index.log | sort | uniq | cut -f 1 | wc
>     13      13     312
> 
> $ gff3_merge -d Rwill10.pseudochromos_master_datastore_index.log
> 
> $ grep -cP '\tgene\t' Rwill10.pseudochromos.all.gff
> 18465
> 
> $ fasta_merge -d Rwill10.pseudochromos_master_datastore_index.log
> 
> <maker_opts.log.AnnotationB.default.log>
> 
> 
> Annotation B standard build steps:
> 
> $ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta
> 
> $ sed s/\-processed\-/\-abinit\-/g Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed
> 
> $ cut -f 1 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs
> 
> $ sort Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs  
> 
> $ wc Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs 
>  2365   2365 136750 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
> 
> #used create_pred_gff.sh script to grep IDs from Rwill10.pseudochromos.all.gff file and create Rwill10.pseudochromos.PfamA.abinit.gff
> 
> $ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides
> $ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides
> 
> $ grep FINISHED Rwill10.pseudochromo.standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
>     13      13     312
> 
> $ gff3_merge -d Rwill10.pseudochromo.standard2_master_datastore_index.log
> 
> $ grep -cP '\tgene\t' Rwill10.pseudochromo.standard2.all.gff
> 20830
> 
> <maker_opts.log.AnnotationB.standard.log>
> 
> 
> -Valerie
> 
> Valerie Soza, Ph.D.
> c/o Hall Lab
> Department of Biology
> University of Washington
> Johnson Hall 202A
> Box 351800
> Seattle, WA 98195-1800
> 206-543-6740
> http://staff.washington.edu/vsoza/
> 
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org