From vsoza at uw.edu Fri Jun 1 14:36:10 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Fri, 1 Jun 2018 12:36:10 -0700
Subject: [maker-devel] how to input a masked assembly for annotation into Maker
Message-ID: <88100FC0-DEDD-4D4D-80D2-12A4CB930557@uw.edu>

Hi Maker community,

I have done repeat masking and gene prediction using the Maker pipeline on an entire genome assembly that has some scaffolds mapped to chromosomes and others that are unmapped, for a total of 11,985 scaffolds (Annotation A). I used the standard build protocol #2 as outlined by Carson at https://groups.google.com/forum/#!searchin/maker-devel/Provide$20maker_gff$20%7Csort:date/maker-devel/7nU0EaSe2ww/Hb8ARa0WBAAJ. This gave me a total of 23,559 predicted genes (21,960 genes from the default build + 1,599 genes that were rescued with Pfam domains) for the entire assembly.

Now, I want to use only the scaffolds that have been mapped and ordered along chromosomes to create a pseudochromosomal sequence for each chromosome, stitching together all the ordered scaffolds along a chromosome with a stretch of 100 Ns between scaffolds, for synteny analyses. I then want to get annotations for each of these pseudochromosomal sequences. I am trying to see if I can re-do the annotations by Maker on these pseudochromosomal sequences using the masked assembly produced by Maker above. I have extracted the masked sequence for the ordered scaffolds of each chromosome and have used these masked sequences to create pseudochromosomal scaffolds, resulting in 13 scaffolds representing the 13 chromosomes. I tried using these masked sequences (13 scaffolds) as input for Maker to create a standard build (Annotation B), but I am getting fewer genes predicted for these scaffolds than what I got from my entire assembly (Annotation A) above.

For all ordered scaffolds across chromosomes, I got 21,419 genes from the standard build annotation on the entire assembly (Annotation A). However, using the masked pseudochromosomal scaffolds (Annotation B), I am getting fewer genes predicted for the same set of scaffolds: 20,830 genes (18,465 from the default build + 2,365 genes that were rescued with Pfam domains).

I am wondering if I have a setting wrong in my maker_opts.ctl files for the default and standard build runs on the masked sequences (see attached below), particularly in the repeat masking or re-annotation part of the file.
I also looked at the default build annotations in JBrowse and compared Annotation A to Annotation B. They looked similar, except that some transcripts from my altest= file were present in Annotation A but did not show up in Annotation B; Maker therefore did not predict some of these genes in Annotation B, although it did in Annotation A. So I think Maker was missing some things in Annotation B. Is this unexpected? If so, is someone willing to check my steps and control files below to see if I did something wrong?

Or, if someone has a better suggestion for extracting coding sequence from these pseudochromosomal scaffolds that were created post-Maker annotation on the entire assembly, I would welcome it. Thanks a bunch.


Annotation A default build steps:

$ maker -base Rwill10 -fix_nucleotides
$ maker -base Rwill10 -fix_nucleotides

$ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc
11983 11983 312159
#should be 11985

$ maker -dsindex -base Rwill10 -fix_nucleotides

$ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc
11985 11985 312211

$ gff3_merge -d Rwill10_master_datastore_index.log

$ grep -cP '\tgene\t' ../Rwill10.maker.output/Rwill10.all.gff
21960

$ fasta_merge -d Rwill10_master_datastore_index.log

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationA.default.log
Type: application/octet-stream
Size: 4650 bytes
Desc: not available
URL:
-------------- next part --------------

Annotation A standard build steps:

$ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta

#genes in Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta and Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv are not in Rwill10.all.gff file
#IDs in .tsv file are called "processed-gene" from .fasta file,
#but in .gff file, I think these are called "abinit-gene"
#best thing would be to replace "processed" with "abinit" in tsv file and then grep .gff file with these IDs to create pred_gff
$ sed s/\-processed\-/\-abinit\-/g Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed

#extract list of IDs only to grep for
$ cut -f 1 Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs

$ sort Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs

$ wc Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
1599 1599 102958 Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs

#used create_pred_gff.sh script to grep IDs from Rwill10.all.gff file and create Rwill10.PfamA.abinit.gff

$ maker -base Rwill10standard2 -fix_nucleotides
$ maker -base Rwill10standard2 -fix_nucleotides

$ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
11975 11975 311953
#should be 11985

$ maker -dsindex -base Rwill10standard2 -fix_nucleotides

$ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
11985 11985 312211

$ gff3_merge -d Rwill10standard2_master_datastore_index.log

$ grep -cP '\tgene\t' Rwill10standard2.all.gff
23559

$ fasta_merge -d Rwill10standard2_master_datastore_index.log

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationA.standard.log
Type: application/octet-stream
Size: 4529 bytes
Desc: not available
URL:
-------------- next part --------------

Annotation B default build steps:

$ find Rwill10.maker.output -name 'query.masked.fasta' | sort -V -t "/" -k 5 | xargs cat > Rwill10.maker.assembly_masked.sorted.fasta

#Use perl script remove_seq_breaks.pl to remove newline characters from sequences in genome fasta file so that only 1 line of sequence follows fasta header
$ perl ../remove_seq_breaks.pl Rwill10.maker.assembly_masked.sorted.fasta > Rwill10.maker.assembly_masked.sorted.fasta.woseqbreaks

#use script to extract ordered scaffolds for each chromosome
$ ./extract_scaffolds_synteny.sh

#use script to create pseudochromosomal sequence for each chromosome
$ ./create_pseudo_chromosome_allLGs.sh

#concatenate these into one fasta file
$ cat *_ordered.masked.fasta2.pseudochromo > R.will10.masked.pseudochromos.fasta

$ maker -base Rwill10.pseudochromos -fix_nucleotides
$ maker -base Rwill10.pseudochromos -fix_nucleotides

$ grep FINISHED Rwill10.pseudochromos_master_datastore_index.log | sort | uniq | cut -f 1 | wc
13 13 312

$ gff3_merge -d Rwill10.pseudochromos_master_datastore_index.log

$ grep -cP '\tgene\t' Rwill10.pseudochromos.all.gff
18465

$ fasta_merge -d Rwill10.pseudochromos_master_datastore_index.log

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationB.default.log
Type: application/octet-stream
Size: 4604 bytes
Desc: not available
URL:
-------------- next part --------------

Annotation B standard build steps:

$ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta

$ sed s/\-processed\-/\-abinit\-/g Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed

$ cut -f 1 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs

$ sort Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs

$ wc Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
2365 2365 136750 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs

#used create_pred_gff.sh script to grep IDs from Rwill10.pseudochromos.all.gff file and create Rwill10.pseudochromos.PfamA.abinit.gff

$ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides
$ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides

$ grep FINISHED Rwill10.pseudochromo.standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
13 13 312

$ gff3_merge -d Rwill10.pseudochromo.standard2_master_datastore_index.log

$ grep -cP '\tgene\t' Rwill10.pseudochromo.standard2.all.gff
20830

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationB.standard.log
Type: application/octet-stream
Size: 4558 bytes
Desc: not available
URL:
-------------- next part --------------

-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/

From carsonhh at gmail.com Fri Jun 1 17:01:13 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 1 Jun 2018 16:01:13 -0600
Subject: [maker-devel] Building MAKER with specific perl version
In-Reply-To:
References:
Message-ID: <78BA2892-5E3A-4DE5-AA44-18A9BCEF8071@gmail.com>

You probably need to run './Build realclean' to clean up your previous build and settings that you likely ran with the system perl. You may even need to clean out the .../maker/perl directory. You can also try updating your Module::Build installation.

--Carson

> On May 31, 2018, at 8:30 AM, Ksenia Lavrichenko wrote:
>
> Hi,
>
> I have been banging my head for a while now, trying to install MAKER with my specific perl.
>
> I found this old thread: https://groups.google.com/forum/#!msg/maker-devel/hScqdJW0FsU/3KT_UF7k9XMJ
>
> However, this does not work for me. I make sure bin/* and Build are deleted before I run $myperl Build.PL.
>
> I see my perl in the shebang of Build; however, after ./Build install, all scripts in bin have "#! /usr/bin/perl", which produces a version error when I try to run maker -h.
>
> Any tips on what I need to adjust in Build.PL?
>
> Many thanks,
> Ksenia
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Mon Jun 11 11:46:13 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Jun 2018 10:46:13 -0600
Subject: [maker-devel] how to input a masked assembly for annotation into Maker
In-Reply-To: <88100FC0-DEDD-4D4D-80D2-12A4CB930557@uw.edu>
References: <88100FC0-DEDD-4D4D-80D2-12A4CB930557@uw.edu>
Message-ID: <9987ECCD-E7B6-45CD-9CBB-0B13B7608E4C@gmail.com>

Predictors will call partial models near the edge of contigs, but if you merge the contigs with 100 N gaps, the predictors will be more likely to jump the gap (attempt to merge models on each side while putting the gap in the intron, even if that is not correct), or they will just call nothing around the gap (i.e. they will behave differently around gaps than at the edge of contigs).

--Carson

From flopezo84 at gmail.com Sat Jun 9 15:06:48 2018
From: flopezo84 at gmail.com (Federico López)
Date: Sat, 9 Jun 2018 16:06:48 -0400
Subject: [maker-devel] Filtering gene models based on eAED scores
Message-ID:

Hello,

I'm using MAKER's "quality_filter.pl" with the default option (AED<1). However, I have noticed cases in which models have low AED scores and high eAED scores (1.00), so presumably the good AED scores are the result of spurious evidence alignments. Is there a way to filter models based on eAED scores too?

Thank you.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From kissaj at miamioh.edu Mon Jun 11 12:56:46 2018
From: kissaj at miamioh.edu (Andor J Kiss)
Date: Mon, 11 Jun 2018 13:56:46 -0400
Subject: [maker-devel] largest genome annotated?
Message-ID: <1528739806.4677.97.camel@miamioh.edu>

What's the largest genome that's been annotated with Maker2?

Thanks,
--
________________________________________________________________________________________________________________________
Andor J Kiss, PhD
Director - Center for Bioinformatics & Functional Genomics
086 Pearson Hall - Miami University
700 East High Street, Oxford
Ohio 45056
USA

eMAIL: KissAJ at MiamiOH.edu
Telephone: +1 (513) 529-4280
Fax: +1 (513) 529-2431
Ring ID: andorjkiss

URL (CBFG): http://miamioh.edu/cas/academics/centers/cbfg/
URL (CBFG Services): https://www.scienceexchange.com/labs/center-for-bioinformatics-functional-genomics
URL (Research): http://openwetware.org/wiki/User:Andor_J_Kiss

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Mon Jun 11 13:05:07 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Jun 2018 12:05:07 -0600
Subject: [maker-devel] largest genome annotated?
In-Reply-To: <1528739806.4677.97.camel@miamioh.edu>
References: <1528739806.4677.97.camel@miamioh.edu>
Message-ID: <34BF91FD-439F-466D-B1EA-3FEDB9EA0F86@gmail.com>

The Sugar Pine. It's 31 gigabases in length and the largest genome ever sequenced. The Loblolly Pine (second largest genome ever sequenced at just over 20 gigabases) was also annotated using MAKER.

--Carson

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Mon Jun 11 13:13:28 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Jun 2018 12:13:28 -0600
Subject: [maker-devel] largest genome annotated?
In-Reply-To: <34BF91FD-439F-466D-B1EA-3FEDB9EA0F86@gmail.com>
References: <1528739806.4677.97.camel@miamioh.edu> <34BF91FD-439F-466D-B1EA-3FEDB9EA0F86@gmail.com>
Message-ID: <45218C8A-70A4-4146-9962-EA0E8F506265@gmail.com>

Correction. The recently published axolotl genome just barely edged out the Sugar Pine in size a couple of months ago. They did not really annotate it, though; instead they independently assembled the transcriptome from mRNA-seq and aligned those transcripts where they could.

--Carson

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jennifer.anderson at ebc.uu.se Tue Jun 12 10:59:31 2018
From: jennifer.anderson at ebc.uu.se (Jennifer Anderson)
Date: Tue, 12 Jun 2018 17:59:31 +0200
Subject: [maker-devel] Merge warning = 1
Message-ID: <44468745-0D50-48D4-B941-1864EA067A7D@ebc.uu.se>

Hello,

I am working on the ab initio annotation of a fungal genome, taking advantage of cDNA and protein evidence from other species. I am looking at my gff file following my first full attempt (hmm files from snap trained 2x, and GenemarkES).

I have not found information on what "merge_warning=1" means at the end of the ID line, as in the example below. Out of 6834 ID lines that include AED scores in this gff file, 5441 contain this warning. I would appreciate it if someone could point me in the right direction.

000030F|arrow maker gene 9255 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43;Name=genemark-000030F|arrow-processed-gene-0.43
000030F|arrow maker mRNA 9255 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1;Parent=genemark-000030F|arrow-processed-gene-0.43;Name=genemark-000030F|arrow-processed-gene-0.43-mRNA-1;_AED=0.06;_eAED=0.06;_QI=0|0|0|1|1|1|2|0|220;_merge_warning=1
000030F|arrow maker exon 9838 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:2;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
000030F|arrow maker exon 9255 9762 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:1;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
000030F|arrow maker CDS 9838 9992 . - 0 ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:cds;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
000030F|arrow maker CDS 9255 9762 . - 1 ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:cds;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1

Best,

Jenni

E-mailing Uppsala University means that we will process your personal data. For more information on how this is performed, please read here: http://www.uu.se/om-uu/dataskydd-personuppgifter/

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Tue Jun 12 11:03:37 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 12 Jun 2018 10:03:37 -0600
Subject: [maker-devel] Merge warning = 1
In-Reply-To: <44468745-0D50-48D4-B941-1864EA067A7D@ebc.uu.se>
References: <44468745-0D50-48D4-B941-1864EA067A7D@ebc.uu.se>
Message-ID:

It's an internal debugging-related value used in the beta. It has no meaning for the general user and will eventually disappear.

--Carson

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From steinj at cshl.edu Tue Jun 12 13:08:19 2018
From: steinj at cshl.edu (Stein, Joshua)
Date: Tue, 12 Jun 2018 18:08:19 +0000
Subject: [maker-devel] Transcript & protein fasta sequence id/name collisions
Message-ID: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu>

Dear Carson and maker-devel group,

In our recent MAKER run, some of the transcript and protein IDs in the fasta files (e.g. *.all.maker.transcripts.fasta) correspond to the "Name=" field of the GFF. The problem is that these names are not unique; for example, the transcript ID "mRNA_4" occurs 45 times, which makes it difficult to determine the corresponding gene IDs. My colleague, Kapeel, who is copied, believes this happens when fgenesh results are passed to MAKER as a GFF using the "pred_gff=" parameter.

How do we prevent this from happening in the future (e.g. tell MAKER to use the "ID=" field for transcript/protein fasta IDs)?
Do you have any tips on how to triage the current situation (i.e. figure out which fasta corresponds to which gene)? I suppose it is possible to match up based on the quality metrics in the definition line, assuming these are unique.

Thanks,
Josh


Joshua Stein, PhD
Manager, Sci. Informatics III
Cold Spring Harbor Laboratory
steinj at cshl.edu
http://ware.cshl.org/

From carsonhh at gmail.com Tue Jun 12 15:19:19 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 12 Jun 2018 14:19:19 -0600
Subject: [maker-devel] Transcript & protein fasta sequence id/name collisions
In-Reply-To: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu>
References: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu>
Message-ID: <91A6DB6E-B41E-4A6A-81D0-E95CBFCE07C7@gmail.com>

The ID= tag in GFF3 is not meant to be an identifier (despite being called ID). Instead it's a tag used to determine inheritance when reassembling a feature. It will be treated as such by all GMOD tools, while Name will be treated as the user-facing identifier. To further complicate things, 'Name' is not required to be unique (some groups like to give multi-copy genes the same name and then have multiple locations in the GFF3; you will see this all the time in NCBI-derived GFF3, for example). I would not be surprised if fgenesh does not produce "best-practice" GFF3. You may need to slightly alter it before using it.

On the MAKER end, at the very least it would be nice for MAKER to complain that the names in the input file are not unique (even though non-unique names are allowed in GFF3). If you give GFF3 to pred_gff then MAKER should build its own unique names for things, but for model_gff it will keep the name you give it.
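
The "slightly alter it" step could look something like the following. This is only a minimal sketch, not something posted in the thread and not part of MAKER: the function name uniquify_names and the naming scheme (sequence ID prefix plus a numeric suffix on collisions) are assumptions made for illustration. It rewrites only the Name= attribute and leaves ID= and Parent= untouched, so if the IDs in the fgenesh file also collide, that would need to be fixed separately.

```python
def uniquify_names(gff_lines):
    """Yield GFF3 lines in which every Name= value is unique within the file.

    Hypothetical scheme: prefix each Name with the sequence ID (column 1),
    then append _2, _3, ... if that prefixed name has already been used.
    ID= and Parent= attributes are deliberately left unchanged.
    """
    used = set()
    for line in gff_lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        # Pass comments, directives and malformed lines through untouched.
        if line.startswith("#") or len(cols) != 9:
            yield line
            continue
        attrs = cols[8].split(";")
        for i, attr in enumerate(attrs):
            if attr.startswith("Name="):
                base = f"{cols[0]}_{attr[len('Name='):]}"
                name, n = base, 1
                while name in used:
                    n += 1
                    name = f"{base}_{n}"
                used.add(name)
                attrs[i] = "Name=" + name
        cols[8] = ";".join(attrs)
        yield "\t".join(cols)
```

To apply it, open the GFF3, pass the file object to uniquify_names, and write each yielded line (plus a newline) to a new file before handing that file to MAKER. Given the pred_gff versus model_gff behaviour described above, this probably matters most for files passed as model_gff.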
--Carson

From steinj at cshl.edu Tue Jun 12 15:31:13 2018
From: steinj at cshl.edu (Stein, Joshua)
Date: Tue, 12 Jun 2018 20:31:13 +0000
Subject: [maker-devel] Transcript & protein fasta sequence id/name collisions
In-Reply-To: <91A6DB6E-B41E-4A6A-81D0-E95CBFCE07C7@gmail.com>
References: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu> <91A6DB6E-B41E-4A6A-81D0-E95CBFCE07C7@gmail.com>
Message-ID:

Hi Carson,

Thanks for identifying the problem. I see that the input fgenesh GFF3 did not use unique identifiers, so we will attack the problem there.

Best,
Josh

Joshua Stein, PhD Manager, Sci.
Informatics III
Cold Spring Harbor Laboratory
steinj at cshl.edu
http://ware.cshl.org/

From carsonhh at gmail.com Wed Jun 13 12:46:12 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Jun 2018 11:46:12 -0600
Subject: [maker-devel] Filtering gene models based on eAED scores
In-Reply-To:
References:
Message-ID: <3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com>

The eAED score also takes protein reading frame into account, and it can infer support for exons when both introns are validated (i.e. it can be lower than AED in some cases). In your case, an eAED of 1 with an AED of less than 1 means that your evidence support comes from an overlapping protein that is never in the same reading frame as the gene model. So the positive evidence support may be suspect, or it may be real and the model is poor because of the assembly, gaps, etc.

To use eAED instead in the quality_filter.pl script, you would have to manually edit the script and replace '_AED' with '_eAED'. Using eAED instead will greatly reduce sensitivity on lower quality assemblies (places where the predictors make the best model they can rather than the correct model, because the assembly won't allow for the correct model even though there is evidence of a gene locus). So make sure to always view suspect regions in a browser first.

--Carson

> On Jun 9, 2018, at 2:06 PM, Federico López wrote:
>
> Hello,
>
> I'm using MAKER's "quality_filter.pl" with the default option (AED<1). However, I have noticed cases in which models have low AED scores and high eAED scores (1.00), so presumably the good AED scores are the result of spurious evidence alignments. Is there a way to filter models based on eAED scores too?
>
> Thank you.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ss2489 at cornell.edu Wed Jun 13 14:34:27 2018
From: ss2489 at cornell.edu (Surya Saha)
Date: Wed, 13 Jun 2018 15:34:27 -0400
Subject: [maker-devel] Filtering gene models based on eAED scores
In-Reply-To: <3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com>
References: <3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com>
Message-ID:

Hi Carson,

We have been using AED as a primary metric for evaluating predictions in our group but it sounds like we should be using both eAED and AED. Is there a detailed explanation of how exactly eAED and AED are computed besides Table 2 in the Cantarel 2008 paper? Thanks

-Surya

--
Surya Saha
Sol Genomics Network
Boyce Thompson Institute, Ithaca, NY, USA
https://citrusgreening.org/
http://www.linkedin.com/in/suryasaha
https://twitter.com/SahaSurya

From carsonhh at gmail.com Wed Jun 13 14:57:46 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Jun 2018 13:57:46 -0600
Subject: [maker-devel] Filtering gene models based on eAED scores
In-Reply-To:
References: <3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com>
Message-ID:

AED is documented in the 2011 MAKER2 paper, but eAED (extended AED) is not currently documented in a publication and is not used by any of the scripts that come with MAKER (it's just there for reference right now). Basically AED is calculated with evidence overlap, but eAED will not count protein overlap unless it occurs in the same codon reading frame as the model (so evidence may count for a stretch, then stop counting for a few codons, then count again if there is an insertion in the alignment). Also eAED will infer support for exons if both introns are validated by evidence and the region in between is all ORF (this allows joint intron support to infer support for an internal exon).

99% of the time AED and eAED are the same, but eAED can be useful in identifying edge cases. Much of the time if AED and eAED are very different, it's because there is a single base pair insertion or deletion in the assembly.
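
To pull out such edge cases for inspection in a browser, the two scores can be compared directly in the merged GFF3. This is only a rough sketch, not a script shipped with MAKER: the function name aed_gap, the default margin of 0.5 and the file name in the usage comment are hypothetical choices, and it assumes the _AED= and _eAED= attributes sit on mRNA lines in column 9.

```shell
#!/usr/bin/env bash
# Sketch: list mRNA features whose _eAED exceeds _AED by at least a chosen margin.
# aed_gap is an illustrative helper; the 0.5 default margin is arbitrary.
aed_gap() {
    # $1 = GFF3 file, $2 = minimum (eAED - AED) difference (default 0.5)
    awk -F'\t' -v min="${2:-0.5}" '
        $0 !~ /^#/ && $3 == "mRNA" {
            id = ""; aed = ""; eaed = ""
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                if (attrs[i] ~ /^ID=/)    id   = substr(attrs[i], 4)
                if (attrs[i] ~ /^_AED=/)  aed  = substr(attrs[i], 6)
                if (attrs[i] ~ /^_eAED=/) eaed = substr(attrs[i], 7)
            }
            # print only when both scores are present and far enough apart
            if (aed != "" && eaed != "" && eaed - aed >= min) {
                print id "\t" aed "\t" eaed
            }
        }
    ' "$1"
}

# Usage (hypothetical file name):
#   aed_gap genome.all.gff 0.5
```

The output columns are the mRNA ID, its AED and its eAED; the listed models are candidates to look at by eye, not models to discard automatically.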
The predictors still find the locus as best they can, but protein evidence and alignments will be out of sync with the reading frame on one of the exons. BLAST can't really handle single bp INDELs in its alignments, but Exonerate can do mid-alignment reading frame shifts to capture the assembly INDEL (and eAED is an attempt to use the extra Exonerate info in the score).

--Carson

From gdolby at asu.edu Fri Jun 15 11:29:16 2018
From: gdolby at asu.edu (Greer Dolby)
Date: Fri, 15 Jun 2018 09:29:16 -0700
Subject: [maker-devel] Best annotation set failure (auto_annotator.pm line 1774)
Message-ID: <7A8554F9-3456-4D53-B5A8-80FA794F42EE@asu.edu>

Hello,

I'm de novo annotating a second-generation genome and I have a handful of scaffolds that keep failing while choosing the best annotation set (errors below). To my knowledge there isn't anything wrong with the scaffolds themselves, and the others have run fine. Looking at the VOID folder for the failed contigs for different runs, they appear to fail at different chunks but always with these two errors. I'm running 30 cores with MPICH2 on a private server. Any suggestions? Thanks!
Best,
Greer

^^^^^^^^^^^^^^^^^^^^^^^^^^EXAMPLE 1
...processing 8 of 12
total clusters:44 now processing 0
...processing 0 of 3
...processing 1 of 3
...processing 2 of 3
total clusters:44 now processing 0
...processing 0 of 4
...processing 1 of 4
...processing 9 of 12
...processing 2 of 4
...processing 3 of 4
total clusters:44 now processing 0
...processing 10 of 12
...processing 0 of 67
...processing 1 of 67
ERROR: Chunk failed at level:6, tier_type:0
...processing 2 of 67
FAILED CONTIG:ScCC6lQ_38465;HRSCAF=64658

^^^^^^^^^^^^^^^^^^^^^^^^^^EXAMPLE 2
...processing 9 of 298
...processing 8 of 81
...processing 11 of 202
...processing 13 of 20
...processing 10 of 298
...processing 9 of 81
...processing 10 of 81
...processing 18 of 123
...processing 14 of 20
...processing 17 of 54
...processing 18 of 54
...processing 37 of 164
...processing 20 of 254
Died at /home/jcornel3/tools/maker/bin/../lib/maker/auto_annotator.pm line 1774.
--> rank=17, hostname=omega
ERROR: Failed while choosing best annotation set
ERROR: Chunk failed at level:4, tier_type:4
FAILED CONTIG:ScCC6lQ_16796;HRSCAF=38896

_________________________________
Greer Dolby, PhD
Postdoctoral Research Scholar
SoLS, Arizona State U.
office: LSE 313, 480.965.7456
Kusumi Lab

From kapeelc at gmail.com Fri Jun 22 14:41:58 2018
From: kapeelc at gmail.com (Kapeel Chougule)
Date: Fri, 22 Jun 2018 15:41:58 -0400
Subject: [maker-devel] map_forward=1 not mapping reference ID's to output correctly
Message-ID:

Hi,

I am trying to update community annotation in the light of new evidence data but my MAKER runs are not keeping all the genes from the community annotation.
Community annotation feature count:
2 1 bicolor
239969 CDS
266301 exon
51066 five_prime_UTR
34129 gene
47121 mRNA
53708 three_prime_UTR

MAKER gene count->
awk '$3=="gene"{print}' maker_output.all.gff | grep "Sobic*" | wc -l
21105

In the attached maker_opts.ctl file, I set keep_preds=1 and map_forward=1, which should keep all the community gene models even if they don't have evidence support. This was explained here: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Updating_annotations_in_light_of_new_data . So I am not sure why not all of the community gene models are mapped in the MAKER output.

Thanks
Kapeel

--
Kapeel Chougule
Computational Scientist Developer II
One Bungtown Road
Cold Spring Harbor, NY 11724
http://www.warelab.org/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4990 bytes
Desc: not available
URL:

From monica.poelchau at ars.usda.gov Fri Jun 22 15:04:28 2018
From: monica.poelchau at ars.usda.gov (Poelchau, Monica)
Date: Fri, 22 Jun 2018 20:04:28 +0000
Subject: [maker-devel] [CAUTION: Suspicious Link] map_forward=1 not mapping reference ID's to output correctly
In-Reply-To:
References:
Message-ID:

Hi Kapeel,

If you just want your community annotations to replace models in an existing gene set, we have a tool for this: https://github.com/NAL-i5K/GFF3toolkit

You'd need to run gff3_QC on your annotation files first to make sure your annotations are okay, then use gff3_merge to merge your community annotations with your existing gene set (in gff3 format).

If you end up trying this out - we're actively developing the GFF3toolkit, so feel free to post an issue if you notice any problems.
Hth,
Monica
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From andremmachado25 at gmail.com Tue Jun 26 10:36:24 2018
From: andremmachado25 at gmail.com (André Machado)
Date: Tue, 26 Jun 2018 16:36:24 +0100
Subject: [maker-devel] Maker Error : Thread 1 terminated abnormally..
Message-ID:

Hi,

First of all, thanks for your efforts on the Maker pipeline. It's a tremendous help for people who work with genomes.

For the last 4 days I have been struggling with an error, still without a solution. I found this old thread: https://groups.google.com/forum/#!msg/maker-devel/X2-76BH9gvg/rU4kLJ3B6tsJ

It seems quite similar, but it doesn't point to a specific solution.

I have run Maker with the test data and everything ran OK; Maker finished the entire process without errors. Now I'm trying to apply my own data on an MPI cluster, but this error occurs frequently:

Thread 1 terminated abnormally: ../dna.maker.output/mpi_blastdb/dna%2Efa.mpi.1/dna%2Efa.mpi.1.0
--> rank=8, hostname=compute-0-1.local, at ../Analysis/Geno/maker/bin/maker line 1451 thread 1.
--> rank=8, hostname=compute-0-1.local
deleted:0 hits
deleted:0 hits
preparing ab-inits
deleted:0 hits
deleted:0 hits
FATAL: Thread terminated, causing all processes to fail
--> rank=8, hostname=compute-0-1.local
deleted:0 hits

Basically I'm trying to run Maker with dna.fa, rna.fa, prot.fa and my_custom_lib_of_repeats.fa to produce raw gene models, which will be used to train SNAP.

I have already tried several command lines and all gave me the same error. The only difference between tests was where the error occurred: sometimes on compute-0-1.local, other times on compute-0-4.local or on another node.

mpiexec -n 63 --hostfile Host maker 1>1.log 2>2.err
mpiexec --hostfile Host maker 1>1.log 2>2.err
mpiexec -mca btl ^openib -n 63 --hostfile Host maker 1>1.log 2>2.err
nohup mpiexec -mca btl ^openib -n 63 --hostfile Host maker -a 1>1.log 2>2.err

The log file as well as the option files are provided below.

Many thanks in advance,
André
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2.log
Type: text/x-log
Size: 38654 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_exe.ctl
Type: application/octet-stream
Size: 1223 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4547 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_bopts.ctl
Type: application/octet-stream
Size: 1412 bytes
Desc: not available
URL:

From vsoza at uw.edu Fri Jun 1 13:36:10 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Fri, 1 Jun 2018 12:36:10 -0700
Subject: [maker-devel] how to input a masked assembly for annotation into Maker
Message-ID: <88100FC0-DEDD-4D4D-80D2-12A4CB930557@uw.edu>

Hi Maker community

I have done repeatmasking and gene prediction using the Maker pipeline on an entire genome assembly that has some scaffolds mapped to chromosomes and others that are unmapped, for a total of 11,985 scaffolds (Annotation A). I used the standard build protocol #2 as outlined by Carson at https://groups.google.com/forum/#!searchin/maker-devel/Provide$20maker_gff$20%7Csort:date/maker-devel/7nU0EaSe2ww/Hb8ARa0WBAAJ. This gave me a total of 23,559 predicted genes (21,960 genes from the default build + 1,599 genes that were rescued with Pfam domains) for the entire assembly.

Now, I want to use only the scaffolds that have been mapped and ordered along chromosomes to create a pseudochromosomal sequence for each chromosome that stitches together all the ordered scaffolds along a chromosome, each scaffold separated by a stretch of 100 Ns, for synteny analyses. I then want to get annotations for each of these pseudochromosomal sequences.
I am trying to see if I can re-do the annotations by Maker on these pseudochromosomal sequences using the masked assembly produced by Maker above. I have extracted the masked sequence for ordered scaffolds for each chromosome and have used these masked sequences to create pseudochromosomal scaffolds, resulting in 13 scaffolds representing the 13 chromosomes. I tried using these masked sequences (13 scaffolds) as input for Maker to create a standard build (Annotation B), but am getting fewer genes predicted for these scaffolds than what I got from my entire assembly (Annotation A) above.

For all ordered scaffolds across chromosomes, I got 21,419 genes from the standard build annotation on the entire assembly (Annotation A). However, using the masked pseudochromosomal scaffolds (Annotation B), I am getting fewer genes predicted for the same set of scaffolds: 20,830 genes (18,465 from default build + 2,365 genes that were rescued with Pfam domains).

I am wondering if I have a setting wrong in my maker_opts.ctl files for the default and standard build runs on the masked sequences (see attached below), particularly in the repeat masking or re-annotation part of the file.
I also looked at the default build annotations in JBrowse and compared Annotation A to Annotation B, and they looked similar except that there were transcripts from my altest= file that were not showing up in Annotation B, but were present in Annotation A; therefore, Maker did not predict some of these genes in Annotation B, but did in Annotation A. So I think Maker was missing some things in Annotation B. Is this unexpected? If so, is someone willing to check my steps and control files below to see if I did something wrong?

Or, if someone has a better suggestion for extracting coding sequence from these pseudochromosomal scaffolds that were created post-Maker annotation on the entire assembly, I would welcome it. Thanks a bunch.
Annotation A default build steps:
$ maker -base Rwill10 -fix_nucleotides
$ maker -base Rwill10 -fix_nucleotides
$ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc
11983 11983 312159
#should be 11985
$ maker -dsindex -base Rwill10 -fix_nucleotides
$ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc
11985 11985 312211
$ gff3_merge -d Rwill10_master_datastore_index.log
$ grep -cP '\tgene\t' ../Rwill10.maker.output/Rwill10.all.gff
21960
$ fasta_merge -d Rwill10_master_datastore_index.log
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationA.default.log
Type: application/octet-stream
Size: 4650 bytes
Desc: not available
URL:
-------------- next part --------------
Annotation A standard build steps:
$ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta
#genes in Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta and Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv are not in Rwill10.all.gff file
#IDs in .tsv file are called "processed-gene" from .fasta file,
#but in .gff file, I think these are called "abinit-gene"
#best thing would be to replace "processed" with "abinit" in tsv file and then grep .gff file with these IDs to create pred_gff
$ sed s/\-processed\-/\-abinit\-/g Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed
#extract list of IDs only to grep for
$ cut -f 1 Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs
$ sort Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
$ wc Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
1599 1599 102958 Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
#used create_pred_gff.sh script to grep IDs from Rwill10.all.gff file and create Rwill10.PfamA.abinit.gff
$ maker -base Rwill10standard2 -fix_nucleotides
$ maker -base Rwill10standard2 -fix_nucleotides
$ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
11975 11975 311953
#should be 11985
$ maker -dsindex -base Rwill10standard2 -fix_nucleotides
$ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
11985 11985 312211
$ gff3_merge -d Rwill10standard2_master_datastore_index.log
$ grep -cP '\tgene\t' Rwill10standard2.all.gff
23559
$ fasta_merge -d Rwill10standard2_master_datastore_index.log
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationA.standard.log
Type: application/octet-stream
Size: 4529 bytes
Desc: not available
URL:
-------------- next part --------------
Annotation B default build steps:
$ find Rwill10.maker.output -name 'query.masked.fasta' | sort -V -t "/" -k 5 | xargs cat > Rwill10.maker.assembly_masked.sorted.fasta
#Use perl script remove_seq_breaks.pl to remove newline characters from sequences in genome fasta file so that only 1 line of sequence follows fasta header
$ perl ../remove_seq_breaks.pl Rwill10.maker.assembly_masked.sorted.fasta > Rwill10.maker.assembly_masked.sorted.fasta.woseqbreaks
#use script to extract ordered scaffolds for each chromosome
$ ./extract_scaffolds_synteny.sh
#use script to create pseudochromosomal sequence for each chromosome
$ ./create_pseudo_chromosome_allLGs.sh
#concatenate these into one fasta file
$ cat *_ordered.masked.fasta2.pseudochromo > R.will10.masked.pseudochromos.fasta
$ maker -base Rwill10.pseudochromos -fix_nucleotides
$ maker -base Rwill10.pseudochromos -fix_nucleotides
$ grep FINISHED Rwill10.pseudochromos_master_datastore_index.log | sort | uniq | cut -f 1 | wc
13 13 312
$ gff3_merge -d Rwill10.pseudochromos_master_datastore_index.log
$ grep -cP '\tgene\t' Rwill10.pseudochromos.all.gff
18465
$ fasta_merge -d Rwill10.pseudochromos_master_datastore_index.log
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationB.default.log
Type: application/octet-stream
Size: 4604 bytes
Desc: not available
URL:
-------------- next part --------------
Annotation B standard build steps:
$ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta
$ sed s/\-processed\-/\-abinit\-/g Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed
$ cut -f 1 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs
$ sort Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
$ wc Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
2365 2365 136750 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
#used create_pred_gff.sh script to grep IDs from Rwill10.pseudochromos.all.gff file and create Rwill10.pseudochromos.PfamA.abinit.gff
$ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides
$ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides
$ grep FINISHED Rwill10.pseudochromo.standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
13 13 312
$ gff3_merge -d Rwill10.pseudochromo.standard2_master_datastore_index.log
$ grep -cP '\tgene\t' Rwill10.pseudochromo.standard2.all.gff
20830
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationB.standard.log
Type: application/octet-stream
Size: 4558 bytes
Desc: not available
URL:
-------------- next part --------------
-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/

From carsonhh at gmail.com Fri Jun 1 16:01:13 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 1 Jun 2018 16:01:13 -0600
Subject: [maker-devel] Building MAKER with specific perl version
In-Reply-To:
References:
Message-ID: <78BA2892-5E3A-4DE5-AA44-18A9BCEF8071@gmail.com>

You probably need to run './Build realclean' to clean up your previous build and settings that you likely ran with the system perl. You may even need to clean out the .../maker/perl directory. You can also try updating your Module::Build installation.

--Carson

> On May 31, 2018, at 8:30 AM, Ksenia Lavrichenko wrote:
>
> Hi,
>
> I have been banging my head for a while now, trying to install MAKER with my specific perl.
>
> I found this old thread: https://groups.google.com/forum/#!msg/maker-devel/hScqdJW0FsU/3KT_UF7k9XMJ
>
> However, this does not work for me. I make sure bin/* and Build are deleted before I run $myperl Build.PL.
>
> I see my perl in shebang of Build however after ./Build install all scripts in bin have "#! /usr/bin/perl" which produces a version error when I try to run maker -h.
>
> Any tips of what do I need to adjust in Build.PL?
>
> Many thanks,
> Ksenia
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Mon Jun 11 10:46:13 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Jun 2018 10:46:13 -0600
Subject: [maker-devel] how to input a masked assembly for annotation into Maker
In-Reply-To: <88100FC0-DEDD-4D4D-80D2-12A4CB930557@uw.edu>
References: <88100FC0-DEDD-4D4D-80D2-12A4CB930557@uw.edu>
Message-ID: <9987ECCD-E7B6-45CD-9CBB-0B13B7608E4C@gmail.com>

Predictors will call partial models near the edge of contigs, but if you merge the contigs with 100 N gaps, the predictors will be more likely to jump the gap (attempt to merge models on each side while putting the gap in the intron, even if that is not correct), or they will just call nothing around the gap (i.e. they will behave differently around gaps than at the edge of contigs).

--Carson

From flopezo84 at gmail.com Sat Jun 9 14:06:48 2018
From: flopezo84 at gmail.com (Federico López)
Date: Sat, 9 Jun 2018 16:06:48 -0400
Subject: [maker-devel] Filtering gene models based on eAED scores
Message-ID:

Hello,

I'm using MAKER's "quality_filter.pl" with the default option (AED<1). However, I have noticed cases in which models have low AED scores and high eAED scores (1.00), so presumably the good AED scores are the result of spurious evidence alignments. Is there a way to filter models based on eAED scores too?

Thank you.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From kissaj at miamioh.edu Mon Jun 11 11:56:46 2018
From: kissaj at miamioh.edu (Andor J Kiss)
Date: Mon, 11 Jun 2018 13:56:46 -0400
Subject: [maker-devel] largest genome annotated?
Message-ID: <1528739806.4677.97.camel@miamioh.edu>

What's the largest genome that's been annotated with Maker2?

Thanks,
--
________________________________________________________________________________________________________________________
Andor J Kiss, PhD
Director - Center for Bioinformatics & Functional Genomics
086 Pearson Hall - Miami University
700 East High Street, Oxford
Ohio 45056
USA

eMAIL: KissAJ at MiamiOH.edu
Telephone: +1 (513) 529-4280
Fax: +1 (513) 529-2431
Ring ID: andorjkiss

URL (CBFG): http://miamioh.edu/cas/academics/centers/cbfg/
URL (CBFG Services): https://www.scienceexchange.com/labs/center-for-bioinformatics-functional-genomics
URL (Research): http://openwetware.org/wiki/User:Andor_J_Kiss
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Mon Jun 11 12:05:07 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Jun 2018 12:05:07 -0600
Subject: [maker-devel] largest genome annotated?
In-Reply-To: <1528739806.4677.97.camel@miamioh.edu>
References: <1528739806.4677.97.camel@miamioh.edu>
Message-ID: <34BF91FD-439F-466D-B1EA-3FEDB9EA0F86@gmail.com>

The Sugar Pine. It's 31 gigabases in length and the largest genome ever sequenced. The Loblolly Pine (second largest genome ever sequenced at just over 20 gigabases) was also annotated using MAKER.

--Carson

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Mon Jun 11 12:13:28 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Jun 2018 12:13:28 -0600
Subject: [maker-devel] largest genome annotated?
In-Reply-To: <34BF91FD-439F-466D-B1EA-3FEDB9EA0F86@gmail.com>
References: <1528739806.4677.97.camel@miamioh.edu> <34BF91FD-439F-466D-B1EA-3FEDB9EA0F86@gmail.com>
Message-ID: <45218C8A-70A4-4146-9962-EA0E8F506265@gmail.com>

Correction. The recently published axolotl just barely edged out the Sugar Pine in size a couple of months ago. They did not really annotate it though, instead they independently assembled the transcriptome from mRNA-seq and aligned those where they could.

--Carson

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jennifer.anderson at ebc.uu.se Tue Jun 12 09:59:31 2018
From: jennifer.anderson at ebc.uu.se (Jennifer Anderson)
Date: Tue, 12 Jun 2018 17:59:31 +0200
Subject: [maker-devel] Merge warning = 1
Message-ID: <44468745-0D50-48D4-B941-1864EA067A7D@ebc.uu.se>

Hello,

I am working on the ab initio annotation of a fungal genome, taking advantage of cDNA and protein evidence from other species. I am looking at my gff file following my first full attempt (hmm files from snap trained 2x, and GenemarkES).

I have not found information on what "merge_warning=1" at the end of the ID line means, as in the example below. Out of 6834 ID lines that include AED scores in this gff file, 5441 contain this warning. I would appreciate it if someone could point me in the right direction.

000030F|arrow maker gene 9255 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43;Name=genemark-000030F|arrow-processed-gene-0.43
000030F|arrow maker mRNA 9255 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1;Parent=genemark-000030F|arrow-processed-gene-0.43;Name=genemark-000030F|arrow-processed-gene-0.43-mRNA-1;_AED=0.06;_eAED=0.06;_QI=0|0|0|1|1|1|2|0|220;_merge_warning=1
000030F|arrow maker exon 9838 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:2;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
000030F|arrow maker exon 9255 9762 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:1;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
000030F|arrow maker CDS 9838 9992 . - 0 ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:cds;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
000030F|arrow maker CDS 9255 9762 . - 1 ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:cds;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1

Best,

Jenni

E-mailing Uppsala University means that we will process your personal data. For more information on how this is performed, please read here: http://www.uu.se/om-uu/dataskydd-personuppgifter/
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Tue Jun 12 10:03:37 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 12 Jun 2018 10:03:37 -0600
Subject: [maker-devel] Merge warning = 1
In-Reply-To: <44468745-0D50-48D4-B941-1864EA067A7D@ebc.uu.se>
References: <44468745-0D50-48D4-B941-1864EA067A7D@ebc.uu.se>
Message-ID:

It's an internal debugging-related value used in the beta. It has no meaning for the general user and will eventually disappear.

--Carson

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From steinj at cshl.edu Tue Jun 12 12:08:19 2018
From: steinj at cshl.edu (Stein, Joshua)
Date: Tue, 12 Jun 2018 18:08:19 +0000
Subject: [maker-devel] Transcript & protein fasta sequence id/name collisions
Message-ID: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu>

Dear Carson and maker-devel group,

In our recent MAKER run, some of the transcript and protein IDs in the fasta files (e.g. *.all.maker.transcripts.fasta) correspond to the "Name=" field of the GFF. The problem is that these names are not unique, so for example the transcript ID "mRNA_4" occurs 45 times, thus making it difficult to determine their corresponding gene ids.
My colleague, Kapeel, who is copied, believes this happens when fgenesh results are passed to MAKER as a GFF using the "pred_gff=" parameter.

How do we prevent this from happening in the future (e.g. tell MAKER to use the "ID=" field for transcript/protein fasta IDs)?
Do you have any tips on how to triage the current situation (i.e. figure out which fasta corresponds to which gene)? I suppose it is possible to match up based on the quality metrics in the definition line, assuming these are unique.

Thanks,
Josh

Joshua Stein, PhD
Manager, Sci. Informatics III
Cold Spring Harbor Laboratory
steinj at cshl.edu
http://ware.cshl.org/

From carsonhh at gmail.com Tue Jun 12 14:19:19 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 12 Jun 2018 14:19:19 -0600
Subject: [maker-devel] Transcript & protein fasta sequence id/name collisions
In-Reply-To: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu>
References: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu>
Message-ID: <91A6DB6E-B41E-4A6A-81D0-E95CBFCE07C7@gmail.com>

The ID= tag in GFF3 is not meant to be an identifier (despite being called ID). Instead it's a tag used to determine inheritance when reassembling a feature. It will be treated as such by all GMOD tools, while Name will be treated as the user-facing identifier. To further complicate things, 'Name' is not required to be unique (some groups like to name multi-copy genes the same and then have multiple locations in the GFF3; you will see this all the time in NCBI-derived GFF3, for example). I would not be surprised if fgenesh does not produce "best-practice" GFF3. You may need to slightly alter it before using it.

On the MAKER end, at the very least it would be nice for MAKER to complain that the names in the input file are not unique (even though non-unique names are allowed in GFF3). If you give GFF3 to pred_gff then MAKER should build its own unique names for things, but for model_gff it will keep the name you give it.
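
[Illustration, not part of the original message.] One way to "slightly alter" a GFF3 file before passing it to pred_gff or model_gff might look like the Python sketch below. It is not a MAKER tool, and it rests on an assumption that should be checked against the real file: that the fgenesh names (such as mRNA_4) are unique within one sequence and only collide across sequences. Under that assumption, prefixing every ID, Name, and Parent value with the sequence name from column 1 makes them unique while keeping the parent/child links consistent. The function names and file names are made up for this example.

```python
def prefix_attributes(attr_field, seqid, keys=("ID", "Name", "Parent")):
    """Prefix the values of selected GFF3 attributes with the sequence name."""
    out = []
    for pair in attr_field.strip().split(";"):
        if not pair:
            continue  # skip empty fields left by a trailing ';'
        key, sep, value = pair.partition("=")
        if sep and key in keys:
            # Parent may hold a comma-separated list of IDs; prefix each one.
            value = ",".join(f"{seqid}_{v}" for v in value.split(","))
        out.append(f"{key}{sep}{value}")
    return ";".join(out)


def uniquify_gff3_line(line):
    """Return one GFF3 line with ID/Name/Parent prefixed by column 1.

    Comment lines, blank lines, and anything that is not a 9-column
    feature line (e.g. an embedded FASTA section) pass through unchanged.
    """
    if line.startswith("#") or not line.strip():
        return line
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return line
    cols[8] = prefix_attributes(cols[8], cols[0])
    return "\t".join(cols) + "\n"


def uniquify_gff3(in_path, out_path):
    """Rewrite a whole GFF3 file (hypothetical helper for this sketch)."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(uniquify_gff3_line(line))
```

A call such as uniquify_gff3("fgenesh.gff3", "fgenesh.unique.gff3") (file names invented here) would write the renamed copy. If the names also repeat within a single sequence, the prefix alone is not enough and a counter or coordinate-based suffix would be needed instead.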
--Carson

From steinj at cshl.edu Tue Jun 12 14:31:13 2018
From: steinj at cshl.edu (Stein, Joshua)
Date: Tue, 12 Jun 2018 20:31:13 +0000
Subject: [maker-devel] Transcript & protein fasta sequence id/name collisions
In-Reply-To: <91A6DB6E-B41E-4A6A-81D0-E95CBFCE07C7@gmail.com>
References: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu> <91A6DB6E-B41E-4A6A-81D0-E95CBFCE07C7@gmail.com>
Message-ID:

Hi Carson,

Thanks for identifying the problem. I see that the input fgenesh GFF3 did not use unique identifiers, so we will attack the problem there.

Best,
Josh

Joshua Stein, PhD
Manager, Sci.
Informatics III
Cold Spring Harbor Laboratory
steinj at cshl.edu
http://ware.cshl.org/

From carsonhh at gmail.com Wed Jun 13 11:46:12 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Jun 2018 11:46:12 -0600
Subject: [maker-devel] Filtering gene models based on eAED scores
In-Reply-To:
References:
Message-ID: <3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com>

The eAED score also takes protein reading frame into account, and it can infer support for exons when both introns are validated (i.e. it can be lower than AED in some cases). In your case, an eAED of 1 with an AED less than 1 means that your evidence support is from an overlapping protein, but the protein is never in the same reading frame as the gene model. So the positive evidence support may be suspect, or it may be real and the model is poor because of the assembly, gaps, etc. To use eAED instead in the quality_filter.pl script, you would have to manually edit the script and replace '_AED' with '_eAED'. Using eAED instead will greatly drop sensitivity on lower quality assemblies (places where the predictors make the best model they can, and not the correct model, because the assembly won't allow for the correct model even though there is evidence of a gene locus). So make sure to always view suspect regions in a browser first.

--Carson

-------------- next part --------------
An HTML attachment was scrubbed...
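
[Illustration, not part of the original message.] Before editing quality_filter.pl, it may help to list the models where the two scores disagree so they can be inspected in a browser. The Python sketch below is one possible way to do that; it does not reproduce what quality_filter.pl does. It only reads the _AED and _eAED attributes that MAKER writes on mRNA lines (as in the GFF3 excerpt earlier in this thread). The function names and the default gap of 0.5 are assumptions chosen for this example.

```python
def parse_attributes(field):
    """Turn a GFF3 column-9 string into a dict of key -> value."""
    attrs = {}
    for pair in field.strip().split(";"):
        key, sep, value = pair.partition("=")
        if sep:
            attrs[key] = value
    return attrs


def discordant_models(gff_lines, min_gap=0.5):
    """Yield (mRNA ID, AED, eAED) where eAED exceeds AED by at least min_gap.

    min_gap is an arbitrary illustrative threshold, not a MAKER default.
    """
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue  # only transcript lines carry the scores
        attrs = parse_attributes(cols[8])
        if "_AED" not in attrs or "_eAED" not in attrs:
            continue
        aed, eaed = float(attrs["_AED"]), float(attrs["_eAED"])
        if eaed - aed >= min_gap:
            yield attrs.get("ID", ""), aed, eaed
```

Passing an open file handle for a merged GFF3 (for example the output of gff3_merge) to discordant_models() would yield the transcripts to look at first, such as a model with a low AED but an eAED of 1.00.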
URL: From ss2489 at cornell.edu Wed Jun 13 13:34:27 2018 From: ss2489 at cornell.edu (Surya Saha) Date: Wed, 13 Jun 2018 15:34:27 -0400 Subject: [maker-devel] Filtering gene models based on eAED scores In-Reply-To: <3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com> References: <3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com> Message-ID: Hi Carson, We have been using AED as a primary metric for evaluating predictions in our group but it sounds like we should be using both eAED and AED. Is there a detailed explanation of how exactly eAED and AED are computed besides Table 2 in the Cantarel 2008 paper? Thanks -Surya On Wed, Jun 13, 2018 at 2:03 PM Carson Holt wrote: > The eAED score also take protein reading frame into account and it can > infers support for exons when both introns are validated (i.e. can be lower > than AED in some cases). For your case where eAED is 1 but AED less than 1 > means that you evidence support is from an overlapping protein, but it is > never in the same reading frame as the gene model. So the positive evidence > support may be suspect, or it may be real and the model is poor because of > the assembly, gaps, etc. To use eAED instead in the quality_filter.pl > script, you would have to to manually edit the script and replace ?_AED' > with ?_eAED?. Using eAED instead will greatly drop sensitivity on lower > quality assemblies (places where the predictors make the best model they > can and not the correct model because the assembly won?t allow for the > correct model but there is evidence that there is a gene locus). So make > sure to always view suspect regions in browser first. > > ?Carson > > > > On Jun 9, 2018, at 2:06 PM, Federico L?pez wrote: > > Hello, > > I'm using MAKER's "quality_filter.pl" with the default option (AED<1). > However, I have noticed cases in which models have low AED scores and high > eAED scores (1.00), so presumably the good AED scores are the result of > spurious evidence alignments. 
Is there a way to filter models based on eAED > scores too? > > Thank you. > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > -- Surya Saha Sol Genomics Network Boyce Thompson Institute, Ithaca, NY, USA https://citrusgreening.org/ http://www.linkedin.com/in/suryasaha https://twitter.com/SahaSurya -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Wed Jun 13 13:57:46 2018 From: carsonhh at gmail.com (Carson Holt) Date: Wed, 13 Jun 2018 13:57:46 -0600 Subject: [maker-devel] Filtering gene models based on eAED scores In-Reply-To: References: <3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com> Message-ID: AED is documented in the 2011 MAKER2 paper, but eAED (extended AED) is not currently documented in a publication and is not used by any of the scripts that come with MAKER (it?s just there for reference right now). Basically AED is calculated with evidence overlap, but eAED will not count protein overlap unless it occurs in the same codon reading frame as the model (so evidence may count for a stretch, then stop counting for a few codons, then count again if there is an insertion in the alignment). Also eAED will infer support for exons if both introns are validated by evidence and the region in between is all ORF (this allows joint intron support to infer support for an internal exon). 99% of the time AED and eAED are the same, but eAED can be useful in identifying edge cases. Much of the time if AED and eAED are very different, it?s because there is a single base pair insertion or deletion in the assembly. 
The predictors still find the locus the best they can, but protein evidence and alignments will be out of sync with the reading frame on one of the exons. BLAST can't really handle single bp INDELs in its alignments, but Exonerate can do mid-alignment reading frame shifts to capture the assembly INDEL (and eAED is an attempt to use the extra Exonerate info in the score).

--Carson

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From gdolby at asu.edu Fri Jun 15 10:29:16 2018
From: gdolby at asu.edu (Greer Dolby)
Date: Fri, 15 Jun 2018 09:29:16 -0700
Subject: [maker-devel] Best annotation set failure (auto_annotator.pm line 1774)
Message-ID: <7A8554F9-3456-4D53-B5A8-80FA794F42EE@asu.edu>

Hello,

I'm de novo annotating a second-generation genome and I have a handful of scaffolds that keep failing while choosing the best annotation set (errors below). To my knowledge there isn't anything wrong with the scaffolds themselves, and the others have run fine. Looking at the VOID folder for the failed contigs for different runs, they appear to fail at different chunks but always with these two errors. I'm running 30 cores with MPICH2 on a private server. Any suggestions?

Thanks!
Best,
Greer

^^^^^^^^^^^^^^^^^^^^^^^^^^EXAMPLE 1
...processing 8 of 12
total clusters:44 now processing 0
...processing 0 of 3
...processing 1 of 3
...processing 2 of 3
total clusters:44 now processing 0
...processing 0 of 4
...processing 1 of 4
...processing 9 of 12
...processing 2 of 4
...processing 3 of 4
total clusters:44 now processing 0
...processing 10 of 12
...processing 0 of 67
...processing 1 of 67
ERROR: Chunk failed at level:6, tier_type:0
...processing 2 of 67
FAILED CONTIG:ScCC6lQ_38465;HRSCAF=64658

^^^^^^^^^^^^^^^^^^^^^^^^^^EXAMPLE 2
...processing 9 of 298
...processing 8 of 81
...processing 11 of 202
...processing 13 of 20
...processing 10 of 298
...processing 9 of 81
...processing 10 of 81
...processing 18 of 123
...processing 14 of 20
...processing 17 of 54
...processing 18 of 54
...processing 37 of 164
...processing 20 of 254
Died at /home/jcornel3/tools/maker/bin/../lib/maker/auto_annotator.pm line 1774.
--> rank=17, hostname=omega
ERROR: Failed while choosing best annotation set
ERROR: Chunk failed at level:4, tier_type:4
FAILED CONTIG:ScCC6lQ_16796;HRSCAF=38896

_________________________________
Greer Dolby, PhD
Postdoctoral Research Scholar
SoLS, Arizona State U.
office: LSE 313, 480.965.7456
website | twitter
Kusumi Lab
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From kapeelc at gmail.com Fri Jun 22 13:41:58 2018
From: kapeelc at gmail.com (Kapeel Chougule)
Date: Fri, 22 Jun 2018 15:41:58 -0400
Subject: [maker-devel] map_forward=1 not mapping reference ID's to output correctly
Message-ID:

Hi,

I am trying to update community annotation in the light of new evidence data but my MAKER runs are not keeping all the genes from the community annotation.
Community annotation feature count:
      2
      1 bicolor
 239969 CDS
 266301 exon
  51066 five_prime_UTR
  34129 gene
  47121 mRNA
  53708 three_prime_UTR

MAKER gene count->
awk '$3=="gene"{print}' maker_output.all.gff | grep "Sobic*" | wc -l
21105

In the attached maker_opts.ctl file, I set keep_preds=1 and map_forward=1, which should keep all the community gene models even if they don't have evidence support. This was explained here: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Updating_annotations_in_light_of_new_data . So I am not sure why not all of the community gene models are mapped in the MAKER output.

Thanks
Kapeel

--
Kapeel Chougule
Computational Scientist Developer II
One Bungtown Road
Cold Spring Harbor, NY 11724
http://www.warelab.org/
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4990 bytes
Desc: not available
URL:

From monica.poelchau at ars.usda.gov Fri Jun 22 14:04:28 2018
From: monica.poelchau at ars.usda.gov (Poelchau, Monica)
Date: Fri, 22 Jun 2018 20:04:28 +0000
Subject: [maker-devel] [CAUTION: Suspicious Link] map_forward=1 not mapping reference ID's to output correctly
In-Reply-To:
References:
Message-ID:

Hi Kapeel,

If you just want your community annotations to replace models in an existing gene set, we have a tool for this: https://github.com/NAL-i5K/GFF3toolkit

You'd need to run gff3_QC on your annotation files first to make sure your annotations are okay, then use gff3_merge to merge your community annotations with your existing gene set (in gff3 format).

If you end up trying this out: we're actively developing the GFF3toolkit, so feel free to post an issue if you notice any problems.
Hth,
Monica

This electronic message contains information generated by the USDA solely for the intended recipients. Any unauthorized interception of this message or the use or disclosure of the information it contains may violate the law and subject the violator to civil or criminal penalties. If you believe you have received this message in error, please notify the sender and delete the email immediately.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From andremmachado25 at gmail.com Tue Jun 26 09:36:24 2018
From: andremmachado25 at gmail.com (André Machado)
Date: Tue, 26 Jun 2018 16:36:24 +0100
Subject: [maker-devel] Maker Error : Thread 1 terminated abnormally..
Message-ID:

Hi,

First of all, thanks for your efforts on the Maker pipeline. It's a tremendous help for people who work with genomes.

For the last 4 days I have been racking my brain over an error, but I am still without a solution.

I found this old thread: https://groups.google.com/forum/#!msg/maker-devel/X2-76BH9gvg/rU4kLJ3B6tsJ

It seems to be quite similar, but it doesn't point to a specific solution.

I have run Maker with the test data and everything ran OK; Maker finished the entire process without errors. Recently I have been trying to apply it to my own data on an MPI cluster, but this error frequently occurs:

Thread 1 terminated abnormally: ../dna.maker.output/mpi_blastdb/dna%2Efa.mpi.1/dna%2Efa.mpi.1.0
--> rank=8, hostname=compute-0-1.local, at ../Analysis/Geno/maker/bin/maker line 1451 thread 1.
--> rank=8, hostname=compute-0-1.local
deleted:0 hits
deleted:0 hits
preparing ab-inits
deleted:0 hits
deleted:0 hits
FATAL: Thread terminated, causing all processes to fail
--> rank=8, hostname=compute-0-1.local
deleted:0 hits

Basically I'm trying to run Maker with dna.fa, rna.fa, prot.fa and my_custom_lib_of_repeats.fa to produce raw gene models which will be used to train SNAP. I have already tried several command lines and all of them gave me the same error. The only difference between the tests was the location of the error: sometimes it happened on compute-0-1.local, other times on compute-0-4.local or on another node.

mpiexec -n 63 --hostfile Host maker 1>1.log 2>2.err
mpiexec --hostfile Host maker 1>1.log 2>2.err
mpiexec -mca btl ^openib -n 63 --hostfile Host maker 1>1.log 2>2.err
nohup mpiexec -mca btl ^openib -n 63 --hostfile Host maker -a 1>1.log 2>2.err

The log file as well as the option files are provided below.

Many thanks in advance,
André
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2.log
Type: text/x-log
Size: 38654 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_exe.ctl
Type: application/octet-stream
Size: 1223 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4547 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_bopts.ctl
Type: application/octet-stream
Size: 1412 bytes
Desc: not available
URL:

From vsoza at uw.edu Fri Jun 1 13:36:10 2018
From: vsoza at uw.edu (Valerie Soza)
Date: Fri, 1 Jun 2018 12:36:10 -0700
Subject: [maker-devel] how to input a masked assembly for annotation into Maker
Message-ID: <88100FC0-DEDD-4D4D-80D2-12A4CB930557@uw.edu>

Hi Maker community

I have done repeatmasking and gene prediction using the Maker pipeline on an entire genome assembly that has some scaffolds mapped to chromosomes and others that are unmapped, for a total of 11,985 scaffolds (Annotation A). I used the standard build protocol #2 as outlined by Carson at https://groups.google.com/forum/#!searchin/maker-devel/Provide$20maker_gff$20%7Csort:date/maker-devel/7nU0EaSe2ww/Hb8ARa0WBAAJ. This gave me a total of 23,559 predicted genes (21,960 genes from the default build + 1,599 genes that were rescued with Pfam domains) for the entire assembly.

Now, I want to use only the scaffolds that have been mapped and ordered along chromosomes to create a pseudochromosomal sequence for each chromosome that stitches together all the ordered scaffolds along a chromosome, each scaffold separated by a stretch of 100 Ns, for synteny analyses. I then want to get annotations for each of these pseudochromosomal sequences.
I am trying to see if I can re-do the annotations by Maker on these pseudochromosomal sequences using the masked assembly produced by Maker above. I have extracted the masked sequence for ordered scaffolds for each chromosome and have used these masked sequences to create pseudochromosomal scaffolds, resulting in 13 scaffolds representing the 13 chromosomes. I tried using these masked sequences (13 scaffolds) as input for Maker to create a standard build (Annotation B), but am getting fewer genes predicted for these scaffolds than what I got from my entire assembly (Annotation A) above.

For all ordered scaffolds across chromosomes, I got 21,419 genes from the standard build annotation on the entire assembly (Annotation A). However, using the masked pseudochromosomal scaffolds (Annotation B), I am getting fewer genes predicted for the same set of scaffolds: 20,830 genes (18,465 from default build + 2,365 genes that were rescued with Pfam domains).

I am wondering if I have a setting wrong in my maker_opts.ctl files for the default and standard build runs on the masked sequences (see attached below), particularly in the repeat masking or re-annotation part of the file.
I also looked at the default build annotations in JBrowse and compared Annotation A to Annotation B, and they looked similar except that there were transcripts from my altest= file that were not showing up in Annotation B but were present in Annotation A; therefore, Maker did not predict some of these genes in Annotation B, but did in Annotation A. So I think Maker was missing some things in Annotation B. Is this unexpected? If so, is someone willing to check my steps and control files below to see if I did something wrong?

Or, if someone has a better suggestion for extracting coding sequence from these pseudochromosomal scaffolds that were created post-Maker annotation on the entire assembly, I would welcome it. Thanks a bunch.
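A minimal sketch of the stitching step described above (ordered scaffolds joined with 100 Ns between them), assuming the ordered, masked scaffolds for one chromosome arrive as a FASTA stream on stdin. The function names are hypothetical, and this is not the poster's create_pseudo_chromosome_allLGs.sh script, whose contents are not shown in the thread:

```shell
# Hypothetical helper: join the records of an ordered FASTA stream (stdin)
# into one pseudochromosome named $1, with 100 Ns between scaffolds.
stitch_scaffolds() {
    awk -v name="$1" -v gap_len=100 '
        BEGIN {
            gap = ""
            for (i = 0; i < gap_len; i++) gap = gap "N"
            printf(">%s\n", name)
        }
        /^>/ {                            # each header starts a new scaffold
            if (seen) printf("%s", gap)   # spacer before every scaffold but the first
            seen = 1
            next
        }
        { printf("%s", $0) }              # append sequence, dropping line breaks
        END { printf("\n") }
    '
}

# Hypothetical helper: print each scaffold name and its 0-based start offset
# within the stitched sequence, for the same input and gap length.
scaffold_offsets() {
    awk -v gap_len=100 '
        BEGIN { offset = 0 }
        /^>/ {
            if (name != "") { print name "\t" offset; offset += len + gap_len }
            name = substr($1, 2); len = 0
            next
        }
        { len += length($0) }
        END { if (name != "") print name "\t" offset }
    '
}
```

For instance, "stitch_scaffolds chr1 < chr1_ordered.masked.fasta" would write a single record (the file name is a placeholder). The offsets table records where each scaffold starts in the stitched sequence, which is the bookkeeping one would need in order to relate positions on the original scaffolds to positions on the pseudochromosome.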

Annotation A default build steps:

$ maker -base Rwill10 -fix_nucleotides
$ maker -base Rwill10 -fix_nucleotides

$ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc
11983 11983 312159
#should be 11985

$ maker -dsindex -base Rwill10 -fix_nucleotides

$ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc
11985 11985 312211

$ gff3_merge -d Rwill10_master_datastore_index.log

$ grep -cP '\tgene\t' ../Rwill10.maker.output/Rwill10.all.gff
21960

$ fasta_merge -d Rwill10_master_datastore_index.log

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationA.default.log
Type: application/octet-stream
Size: 4650 bytes
Desc: not available
URL:
-------------- next part --------------

Annotation A standard build steps:

$ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta

#genes in Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta and Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv are not in Rwill10.all.gff file
#IDs in .tsv file are called "processed-gene" from .fasta file,
#but in .gff file, I think these are called "abinit-gene"
#best thing would be to replace "processed" with "abinit" in tsv file and then grep .gff file with these IDs to create pred_gff
$ sed s/\-processed\-/\-abinit\-/g Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed

#extract list of IDs only to grep for
cut -f 1 Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs

$ sort Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs

$ wc Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
1599 1599 102958 Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs

#used create_pred_gff.sh script to grep IDs from Rwill10.all.gff file and create Rwill10.PfamA.abinit.gff

$ maker -base Rwill10standard2 -fix_nucleotides
$ maker -base Rwill10standard2 -fix_nucleotides

$ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
11975 11975 311953
#should be 11985

$ maker -dsindex -base Rwill10standard2 -fix_nucleotides

$ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
11985 11985 312211

$ gff3_merge -d Rwill10standard2_master_datastore_index.log

$ grep -cP '\tgene\t' Rwill10standard2.all.gff
23559

$ fasta_merge -d Rwill10standard2_master_datastore_index.log

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationA.standard.log
Type: application/octet-stream
Size: 4529 bytes
Desc: not available
URL:
-------------- next part --------------

Annotation B default build steps:

$ find Rwill10.maker.output -name 'query.masked.fasta' | sort -V -t "/" -k 5 | xargs cat > Rwill10.maker.assembly_masked.sorted.fasta

#Use perl script remove_seq_breaks.pl to remove newline characters from sequences in genome fasta file so that only 1 line of sequence follows fasta header
$ perl ../remove_seq_breaks.pl Rwill10.maker.assembly_masked.sorted.fasta > Rwill10.maker.assembly_masked.sorted.fasta.woseqbreaks

#use script to extract ordered scaffolds for each chromosome
$ ./extract_scaffolds_synteny.sh

#use script to create pseudochromosomal sequence for each chromosome
$ ./create_pseudo_chromosome_allLGs.sh

#concatenate these into one fasta file
cat *_ordered.masked.fasta2.pseudochromo > R.will10.masked.pseudochromos.fasta

$ maker -base Rwill10.pseudochromos -fix_nucleotides
$ maker -base Rwill10.pseudochromos -fix_nucleotides

$ grep FINISHED Rwill10.pseudochromos_master_datastore_index.log | sort | uniq | cut -f 1 | wc
13 13 312

$ gff3_merge -d Rwill10.pseudochromos_master_datastore_index.log

$ grep -cP '\tgene\t' Rwill10.pseudochromos.all.gff
18465

$ fasta_merge -d Rwill10.pseudochromos_master_datastore_index.log

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationB.default.log
Type: application/octet-stream
Size: 4604 bytes
Desc: not available
URL:
-------------- next part --------------

Annotation B standard build steps:

$ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta

$ sed s/\-processed\-/\-abinit\-/g Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed

$ cut -f 1 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs

$ sort Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs

$ wc Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs
2365 2365 136750 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs

#used create_pred_gff.sh script to grep IDs from Rwill10.pseudochromos.all.gff file and create Rwill10.pseudochromos.PfamA.abinit.gff

$ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides
$ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides

$ grep FINISHED Rwill10.pseudochromo.standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc
13 13 312

$ gff3_merge -d Rwill10.pseudochromo.standard2_master_datastore_index.log

$ grep -cP '\tgene\t' Rwill10.pseudochromo.standard2.all.gff
20830

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.log.AnnotationB.standard.log
Type: application/octet-stream
Size: 4558 bytes
Desc: not available
URL:
-------------- next part --------------

-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/

From carsonhh at gmail.com Fri Jun 1 16:01:13 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 1 Jun 2018 16:01:13 -0600
Subject: [maker-devel] Building MAKER with specific perl version
In-Reply-To:
References:
Message-ID: <78BA2892-5E3A-4DE5-AA44-18A9BCEF8071@gmail.com>

You probably need to run './Build realclean' to clean up your previous build and settings that you likely ran with the system perl. You may even need to clean out the .../maker/perl directory. You can also try updating your Module::Build installation.

--Carson

> On May 31, 2018, at 8:30 AM, Ksenia Lavrichenko wrote:
>
> Hi,
>
> I have been banging my head for a while now, trying to install MAKER with my specific perl.
>
> I found this old thread: https://groups.google.com/forum/#!msg/maker-devel/hScqdJW0FsU/3KT_UF7k9XMJ
>
> However, this does not work for me. I make sure bin/* and Build are deleted before I run $myperl Build.PL.
>
> I see my perl in the shebang of Build; however, after ./Build install all scripts in bin have "#! /usr/bin/perl", which produces a version error when I try to run maker -h.
>
> Any tips on what I need to adjust in Build.PL?
>
> Many thanks,
> Ksenia
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Mon Jun 11 10:46:13 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Jun 2018 10:46:13 -0600
Subject: [maker-devel] how to input a masked assembly for annotation into Maker
In-Reply-To: <88100FC0-DEDD-4D4D-80D2-12A4CB930557@uw.edu>
References: <88100FC0-DEDD-4D4D-80D2-12A4CB930557@uw.edu>
Message-ID: <9987ECCD-E7B6-45CD-9CBB-0B13B7608E4C@gmail.com>

Predictors will call partial models near the edge of contigs, but if you merge them with 100 N gaps, they will be more likely to jump the gap (attempt to merge models on each side while putting the gap in the intron, even if that is not correct), or they will just call nothing around the gap (i.e. they will behave differently around gaps than at the edge of contigs).

--Carson

From flopezo84 at gmail.com Sat Jun 9 14:06:48 2018
From: flopezo84 at gmail.com (Federico López)
Date: Sat, 9 Jun 2018 16:06:48 -0400
Subject: [maker-devel] Filtering gene models based on eAED scores
Message-ID:

Hello,

I'm using MAKER's "quality_filter.pl" with the default option (AED<1). However, I have noticed cases in which models have low AED scores and high eAED scores (1.00), so presumably the good AED scores are the result of spurious evidence alignments. Is there a way to filter models based on eAED scores too?

Thank you.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From kissaj at miamioh.edu Mon Jun 11 11:56:46 2018
From: kissaj at miamioh.edu (Andor J Kiss)
Date: Mon, 11 Jun 2018 13:56:46 -0400
Subject: [maker-devel] largest genome annotated?
Message-ID: <1528739806.4677.97.camel@miamioh.edu>

What's the largest genome that's been annotated with Maker2?

Thanks,
--
________________________________________________________________________________________________________________________
Andor J Kiss, PhD
Director - Center for Bioinformatics & Functional Genomics
086 Pearson Hall - Miami University
700 East High Street, Oxford
Ohio 45056
USA

eMAIL: KissAJ at MiamiOH.edu
Telephone: +1 (513) 529-4280
Fax: +1 (513) 529-2431
Ring ID: andorjkiss

URL (CBFG): http://miamioh.edu/cas/academics/centers/cbfg/
URL (CBFG Services): https://www.scienceexchange.com/labs/center-for-bioinformatics-functional-genomics
URL (Research): http://openwetware.org/wiki/User:Andor_J_Kiss
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From carsonhh at gmail.com Mon Jun 11 12:05:07 2018 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 11 Jun 2018 12:05:07 -0600 Subject: [maker-devel] largest genome annotated? In-Reply-To: <1528739806.4677.97.camel@miamioh.edu> References: <1528739806.4677.97.camel@miamioh.edu> Message-ID: <34BF91FD-439F-466D-B1EA-3FEDB9EA0F86@gmail.com> The Sugar Pine. It's 31 gigabases in length and the largest genome ever sequenced. The Loblolly Pine (second largest genome ever sequenced at just over 20 gigabases) was also annotated using MAKER. ?Carson > On Jun 11, 2018, at 11:56 AM, Andor J Kiss wrote: > > What's the largest genome that's been annotated with Maker2? > > Thanks, > -- > ________________________________________________________________________________________________________________________ > Andor J Kiss, PhD > Director - Center for Bioinformatics & Functional Genomics > 086 Pearson Hall - Miami University > 700 East High Street, Oxford > Ohio 45056 > USA > > eMAIL: KissAJ at MiamiOH.edu > Telephone: +1 (513) 529-4280 > Fax: +1 (513) 529-2431 > Ring ID: andorjkiss > > URL (CBFG): http://miamioh.edu/cas/academics/centers/cbfg/ > URL (CBFG Services): https://www.scienceexchange.com/labs/center-for-bioinformatics-functional-genomics > URL (Research): http://openwetware.org/wiki/User:Andor_J_Kiss > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Mon Jun 11 12:13:28 2018 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 11 Jun 2018 12:13:28 -0600 Subject: [maker-devel] largest genome annotated? 
In-Reply-To: <34BF91FD-439F-466D-B1EA-3FEDB9EA0F86@gmail.com> References: <1528739806.4677.97.camel@miamioh.edu> <34BF91FD-439F-466D-B1EA-3FEDB9EA0F86@gmail.com> Message-ID: <45218C8A-70A4-4146-9962-EA0E8F506265@gmail.com> Correction. The recently published axolotl just barely edged out the Sugar Pine in size a couple of months ago. They did not really annotate it though, instead they independently assembled the transcriptome from mRNA-seq and aligned those where they could. ?Carson > On Jun 11, 2018, at 12:05 PM, Carson Holt wrote: > > The Sugar Pine. It's 31 gigabases in length and the largest genome ever sequenced. The Loblolly Pine (second largest genome ever sequenced at just over 20 gigabases) was also annotated using MAKER. > > ?Carson > > > >> On Jun 11, 2018, at 11:56 AM, Andor J Kiss > wrote: >> >> What's the largest genome that's been annotated with Maker2? >> >> Thanks, >> -- >> ________________________________________________________________________________________________________________________ >> Andor J Kiss, PhD >> Director - Center for Bioinformatics & Functional Genomics >> 086 Pearson Hall - Miami University >> 700 East High Street, Oxford >> Ohio 45056 >> USA >> >> eMAIL: KissAJ at MiamiOH.edu >> Telephone: +1 (513) 529-4280 >> Fax: +1 (513) 529-2431 >> Ring ID: andorjkiss >> >> URL (CBFG): http://miamioh.edu/cas/academics/centers/cbfg/ >> URL (CBFG Services): https://www.scienceexchange.com/labs/center-for-bioinformatics-functional-genomics >> URL (Research): http://openwetware.org/wiki/User:Andor_J_Kiss >> >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jennifer.anderson at ebc.uu.se Tue Jun 12 09:59:31 2018 From: jennifer.anderson at ebc.uu.se (Jennifer Anderson) Date: Tue, 12 Jun 2018 17:59:31 +0200 Subject: [maker-devel] Merge warning = 1 Message-ID: <44468745-0D50-48D4-B941-1864EA067A7D@ebc.uu.se> Hello, I am working on the ab initiio annotation of a fungal genome, taking advantage of cDNA and protein evidence from other species. I am looking at my gff file following my first full attempt (hmm files from snap trained 2x, and GenemarkES). I have not found information on what it means ?merge_warning=1? at the end of the ID line, as in the example below. Out of 6834 ID lines that include AED scores in this gff file, 5441 contain this warning. I would appreciate it if someone could point me in the right direction. 000030F|arrow maker gene 9255 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43;Name=genemark-000030F|arrow-processed-gene-0.43 000030F|arrow maker mRNA 9255 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1;Parent=genemark-000030F|arrow-processed-gene-0.43;Name=genemark-000030F|arrow-processed-gene-0.43-mRNA-1;_AED=0.06;_eAED=0.06;_QI=0|0|0|1|1|1|2|0|220;_merge_warning=1 000030F|arrow maker exon 9838 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:2;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1 000030F|arrow maker exon 9255 9762 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:1;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1 000030F|arrow maker CDS 9838 9992 . - 0 ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:cds;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1 000030F|arrow maker CDS 9255 9762 . - 1 ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:cds;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1 Best, Jenni N?r du har kontakt med oss p? Uppsala universitet med e-post s? inneb?r det att vi behandlar dina personuppgifter. 
F?r att l?sa mer om hur vi g?r det kan du l?sa h?r: http://www.uu.se/om-uu/dataskydd-personuppgifter/ E-mailing Uppsala University means that we will process your personal data. For more information on how this is performed, please read here: http://www.uu.se/om-uu/dataskydd-personuppgifter/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Tue Jun 12 10:03:37 2018 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 12 Jun 2018 10:03:37 -0600 Subject: [maker-devel] Merge warning = 1 In-Reply-To: <44468745-0D50-48D4-B941-1864EA067A7D@ebc.uu.se> References: <44468745-0D50-48D4-B941-1864EA067A7D@ebc.uu.se> Message-ID: It?s an internal debugging related value used in the beta. It has no meaning for the general user and will eventually disappear. ?Carson > On Jun 12, 2018, at 9:59 AM, Jennifer Anderson wrote: > > Hello, > > I am working on the ab initiio annotation of a fungal genome, taking advantage of cDNA and protein evidence from other species. I am looking at my gff file following my first full attempt (hmm files from snap trained 2x, and GenemarkES). > > I have not found information on what it means ?merge_warning=1? at the end of the ID line, as in the example below. Out of 6834 ID lines that include AED scores in this gff file, 5441 contain this warning. I would appreciate it if someone could point me in the right direction. > > > 000030F|arrow maker gene > 9255 9992 > . - > . ID=genemark-000030F|arrow-processed-gene-0.43;Name=genemark-000030F|arrow-processed-gene-0.43 > 000030F|arrow > maker mRNA > 9255 9992 > . - > . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1;Parent=genemark-000030F|arrow-processed-gene-0.43;Name=genemark-000030F|arrow-processed-gene-0.43-mRNA-1;_AED=0.06;_eAED=0.06;_QI=0|0|0|1|1|1|2|0|220;_merge_warning=1 > 000030F|arrow maker exon > 9838 9992 > . - > . 
ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:2;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1 > 000030F|arrow maker exon > 9255 9762 > . - > . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:1;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1 > 000030F|arrow maker CDS > 9838 9992 > . - > 0 ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:cds;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1 > 000030F|arrow maker CDS > 9255 9762 > . - > 1 ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:cds;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1 > > Best, > > Jenni > > > > > > > > > > > > > > > > > > > > > > > > > N?r du har kontakt med oss p? Uppsala universitet med e-post s? inneb?r det att vi behandlar dina personuppgifter. F?r att l?sa mer om hur vi g?r det kan du l?sa h?r: http://www.uu.se/om-uu/dataskydd-personuppgifter/ > > E-mailing Uppsala University means that we will process your personal data. For more information on how this is performed, please read here: http://www.uu.se/om-uu/dataskydd-personuppgifter/ > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From steinj at cshl.edu Tue Jun 12 12:08:19 2018 From: steinj at cshl.edu (Stein, Joshua) Date: Tue, 12 Jun 2018 18:08:19 +0000 Subject: [maker-devel] Transcript & protein fasta sequence id/name collisions Message-ID: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu> Dear Carson and maker-devel group, In our recent MAKER run, some of the transcript and protein id?s in the fasta files (e.g. *.all.maker.transcripts.fasta) correspond to the "Name=? field of the GFF. The problem is that these names are not unique, so for example the transcript ID ?mRNA_4? occurs 45 times, thus making it difficult to determine their corresponding gene ids. 
My colleague, Kapeel, who is copied, believes this happens when fgenesh results are passed to MAKER as a GFF using the "pred_gff=" parameter.

How do we prevent this from happening in the future (e.g. tell MAKER to use the "ID=" field for transcript/protein fasta IDs)?
Do you have any tips on how to triage the current situation (i.e. figure out which fasta corresponds to which gene)? I suppose it is possible to match up based on the quality metrics in the definition line, assuming these are unique.

Thanks,
Josh

Joshua Stein, PhD
Manager, Sci. Informatics III
Cold Spring Harbor Laboratory
steinj at cshl.edu
http://ware.cshl.org/

From carsonhh at gmail.com Tue Jun 12 14:19:19 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 12 Jun 2018 14:19:19 -0600
Subject: [maker-devel] Transcript & protein fasta sequence id/name collisions
In-Reply-To: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu>
References: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu>
Message-ID: <91A6DB6E-B41E-4A6A-81D0-E95CBFCE07C7@gmail.com>

The ID= tag in GFF3 is not meant to be a user-facing identifier (despite being called ID). Instead it's a tag used to determine inheritance when reassembling a feature. It will be treated as such by all GMOD tools, while Name will be treated as the user-facing identifier. To further complicate things, 'Name' is not required to be unique (some groups like to give multi-copy genes the same name and then have multiple locations in the GFF3; you will see this all the time in NCBI-derived GFF3, for example). I would not be surprised if fgenesh does not produce "best-practice" GFF3. You may need to alter it slightly before using it.

On the MAKER end, at the very least it would be nice for MAKER to complain that the names in the input file are not unique (even though non-unique names are allowed in GFF3). If you give GFF3 to pred_gff then MAKER should build its own unique names for things, but for model_gff it will keep the name you give it.
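One way to "slightly alter" such a file is to make the Name values unique before handing it to MAKER. The following is only a minimal sketch under stated assumptions: the function name `uniquify_names`, the `.dupN` suffix, and the choice to touch only gene and mRNA features are all hypothetical and are not something MAKER or fgenesh provides. It also assumes a suffixed name does not collide with a name already in the file.

```python
from collections import Counter


def uniquify_names(gff_lines, feature_types=("gene", "mRNA")):
    """Return GFF3 lines in which repeated Name= values get a numeric suffix.

    Hypothetical helper: only features whose type is in `feature_types`
    are changed; comments, directives and other features pass through.
    """
    seen = Counter()
    out = []
    for line in gff_lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        # Leave comments, malformed lines and non-target features untouched.
        if line.startswith("#") or len(cols) != 9 or cols[2] not in feature_types:
            out.append(line)
            continue
        attrs = cols[8].split(";")
        for i, attr in enumerate(attrs):
            if attr.startswith("Name="):
                name = attr[len("Name="):]
                seen[name] += 1
                # The first occurrence keeps its name; later ones are suffixed.
                if seen[name] > 1:
                    attrs[i] = f"Name={name}.dup{seen[name]}"
        cols[8] = ";".join(attrs)
        out.append("\t".join(cols))
    return out
```

The ID= and Parent= attributes are deliberately left alone, since those are what tools use to reassemble features; only the user-facing Name changes.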
?Carson > On Jun 12, 2018, at 12:08 PM, Stein, Joshua wrote: > > Dear Carson and maker-devel group, > > In our recent MAKER run, some of the transcript and protein id?s in the fasta files (e.g. *.all.maker.transcripts.fasta) correspond to the "Name=? field of the GFF. The problem is that these names are not unique, so for example the transcript ID ?mRNA_4? occurs 45 times, thus making it difficult to determine their corresponding gene ids. My colleague, Kapeel, who is copied, believes this happens when fgenesh results are passed to MAKER as a GFF using the ?pred_gff=? parameter. > > How do we prevent this from happening in the future (e.g. tell MAKER to use the "ID=? field for transcript/protein fasta id?s)? > Do you have any tips on how to triage the current situation (i.e. figure out which fasta corresponds to which gene)? I suppose it is possible to match up based on the quality metrics in the definition line, assuming these are unique. > > Thanks, > Josh > > > Joshua Stein, PhD > Manager, Sci. Informatics III > Cold Spring Harbor Laboratory > steinj at cshl.edu > http://ware.cshl.org/ > > > From steinj at cshl.edu Tue Jun 12 14:31:13 2018 From: steinj at cshl.edu (Stein, Joshua) Date: Tue, 12 Jun 2018 20:31:13 +0000 Subject: [maker-devel] Transcript & protein fasta sequence id/name collisions In-Reply-To: <91A6DB6E-B41E-4A6A-81D0-E95CBFCE07C7@gmail.com> References: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu> <91A6DB6E-B41E-4A6A-81D0-E95CBFCE07C7@gmail.com> Message-ID: Hi Carson, Thanks for identifying the problem. I see that the input fgenesh GFF3 did not use unique identifiers, so we will attack the problem there. Best, Josh > On Jun 12, 2018, at 4:19 PM, Carson Holt wrote: > > The ID= tag in GFF3 is not meant to be an identifier (despite being called ID). Instead it's a tag used to determine inheritance when reassembling a feature. It will be treated as such by all GMOD tools while Name will be treated as the user facing identifier. 
To further complicate things, ?Name' is not required to be unique (some groups like to name multi-copy genes the same and then have multiple locations in the GFF3 - you will see this all the time in NCBI derived GFF3 for example). I would not be surprised if fgenesh does not produce "best-practice? GFF3. You may need to slightly alter it before using it. > > On the MAKER end, at the very least MAKER it would be nice for it to complain that the names in the input file are not unique (even though non-unique names are allowed in GFF3). If you give GFF3 to pred_gff then MAKER should build it?s own unique names for things, but for model_gff it will keep the name you give it. > > ?Carson > > >> On Jun 12, 2018, at 12:08 PM, Stein, Joshua wrote: >> >> Dear Carson and maker-devel group, >> >> In our recent MAKER run, some of the transcript and protein id?s in the fasta files (e.g. *.all.maker.transcripts.fasta) correspond to the "Name=? field of the GFF. The problem is that these names are not unique, so for example the transcript ID ?mRNA_4? occurs 45 times, thus making it difficult to determine their corresponding gene ids. My colleague, Kapeel, who is copied, believes this happens when fgenesh results are passed to MAKER as a GFF using the ?pred_gff=? parameter. >> >> How do we prevent this from happening in the future (e.g. tell MAKER to use the "ID=? field for transcript/protein fasta id?s)? >> Do you have any tips on how to triage the current situation (i.e. figure out which fasta corresponds to which gene)? I suppose it is possible to match up based on the quality metrics in the definition line, assuming these are unique. >> >> Thanks, >> Josh >> >> >> Joshua Stein, PhD >> Manager, Sci. Informatics III >> Cold Spring Harbor Laboratory >> steinj at cshl.edu >> http://ware.cshl.org/ >> >> >> > Joshua Stein, PhD Manager, Sci. 
Informatics III
Cold Spring Harbor Laboratory
steinj at cshl.edu
http://ware.cshl.org/

From carsonhh at gmail.com Wed Jun 13 11:46:12 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Jun 2018 11:46:12 -0600
Subject: [maker-devel] Filtering gene models based on eAED scores
In-Reply-To:
References:
Message-ID: <3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com>

The eAED score also takes protein reading frame into account, and it can infer support for exons when both introns are validated (i.e. it can be lower than AED in some cases). In your case, an eAED of 1 with an AED of less than 1 means that your evidence support is from an overlapping protein, but it is never in the same reading frame as the gene model. So the positive evidence support may be suspect, or it may be real and the model is poor because of the assembly, gaps, etc. To use eAED instead in the quality_filter.pl script, you would have to manually edit the script and replace '_AED' with '_eAED'. Using eAED instead will greatly reduce sensitivity on lower quality assemblies (places where the predictors make the best model they can rather than the correct model, because the assembly won't allow for the correct model even though there is evidence of a gene locus). So make sure to always view suspect regions in a browser first.

--Carson

> On Jun 9, 2018, at 2:06 PM, Federico López wrote:
>
> Hello,
>
> I'm using MAKER's "quality_filter.pl" with the default option (AED<1). However, I have noticed cases in which models have low AED scores and high eAED scores (1.00), so presumably the good AED scores are the result of spurious evidence alignments. Is there a way to filter models based on eAED scores too?
>
> Thank you.
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From ss2489 at cornell.edu Wed Jun 13 13:34:27 2018 From: ss2489 at cornell.edu (Surya Saha) Date: Wed, 13 Jun 2018 15:34:27 -0400 Subject: [maker-devel] Filtering gene models based on eAED scores In-Reply-To: <3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com> References: <3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com> Message-ID: Hi Carson, We have been using AED as a primary metric for evaluating predictions in our group but it sounds like we should be using both eAED and AED. Is there a detailed explanation of how exactly eAED and AED are computed besides Table 2 in the Cantarel 2008 paper? Thanks -Surya On Wed, Jun 13, 2018 at 2:03 PM Carson Holt wrote: > The eAED score also take protein reading frame into account and it can > infers support for exons when both introns are validated (i.e. can be lower > than AED in some cases). For your case where eAED is 1 but AED less than 1 > means that you evidence support is from an overlapping protein, but it is > never in the same reading frame as the gene model. So the positive evidence > support may be suspect, or it may be real and the model is poor because of > the assembly, gaps, etc. To use eAED instead in the quality_filter.pl > script, you would have to to manually edit the script and replace ?_AED' > with ?_eAED?. Using eAED instead will greatly drop sensitivity on lower > quality assemblies (places where the predictors make the best model they > can and not the correct model because the assembly won?t allow for the > correct model but there is evidence that there is a gene locus). So make > sure to always view suspect regions in browser first. > > ?Carson > > > > On Jun 9, 2018, at 2:06 PM, Federico L?pez wrote: > > Hello, > > I'm using MAKER's "quality_filter.pl" with the default option (AED<1). > However, I have noticed cases in which models have low AED scores and high > eAED scores (1.00), so presumably the good AED scores are the result of > spurious evidence alignments. 
Is there a way to filter models based on eAED
> scores too?
>
> Thank you.
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>

--
Surya Saha
Sol Genomics Network
Boyce Thompson Institute, Ithaca, NY, USA
https://citrusgreening.org/
http://www.linkedin.com/in/suryasaha
https://twitter.com/SahaSurya
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From carsonhh at gmail.com Wed Jun 13 13:57:46 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Jun 2018 13:57:46 -0600
Subject: [maker-devel] Filtering gene models based on eAED scores
In-Reply-To:
References: <3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com>
Message-ID:

AED is documented in the 2011 MAKER2 paper, but eAED (extended AED) is not currently documented in a publication and is not used by any of the scripts that come with MAKER (it's just there for reference right now). Basically, AED is calculated from evidence overlap, but eAED will not count protein overlap unless it occurs in the same codon reading frame as the model (so evidence may count for a stretch, then stop counting for a few codons, then count again if there is an insertion in the alignment). Also, eAED will infer support for an exon if both introns are validated by evidence and the region in between is all ORF (this allows joint intron support to infer support for an internal exon). 99% of the time AED and eAED are the same, but eAED can be useful in identifying edge cases. Much of the time, if AED and eAED are very different, it's because there is a single base pair insertion or deletion in the assembly. The predictors still find the locus as best they can, but protein evidence and alignments will be out of sync with the reading frame on one of the exons. BLAST can't really handle single bp INDELs in its alignments, but Exonerate can do mid-alignment reading frame shifts to capture the assembly INDEL (and eAED is an attempt to use the extra Exonerate info in the score).

--Carson

> On Jun 13, 2018, at 1:34 PM, Surya Saha wrote:
>
> Hi Carson,
>
> We have been using AED as a primary metric for evaluating predictions in our group but it sounds like we should be using both eAED and AED. Is there a detailed explanation of how exactly eAED and AED are computed besides Table 2 in the Cantarel 2008 paper? Thanks
>
> -Surya
>
> On Wed, Jun 13, 2018 at 2:03 PM Carson Holt wrote:
> The eAED score also takes protein reading frame into account, and it can infer support for exons when both introns are validated (i.e. it can be lower than AED in some cases). In your case, an eAED of 1 with an AED of less than 1 means that your evidence support is from an overlapping protein, but it is never in the same reading frame as the gene model. So the positive evidence support may be suspect, or it may be real and the model is poor because of the assembly, gaps, etc. To use eAED instead in the quality_filter.pl script, you would have to manually edit the script and replace '_AED' with '_eAED'. Using eAED instead will greatly reduce sensitivity on lower quality assemblies (places where the predictors make the best model they can rather than the correct model, because the assembly won't allow for the correct model even though there is evidence of a gene locus). So make sure to always view suspect regions in a browser first.
>
> --Carson
>
>
>
>> On Jun 9, 2018, at 2:06 PM, Federico López wrote:
>>
>> Hello,
>>
>> I'm using MAKER's "quality_filter.pl" with the default option (AED<1).
However, I have noticed cases in which models have low AED scores and high eAED scores (1.00), so presumably the good AED scores are the result of spurious evidence alignments. Is there a way to filter models based on eAED scores too? >> >> Thank you. >> _______________________________________________ >> maker-devel mailing list >> maker-devel at box290.bluehost.com >> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > _______________________________________________ > maker-devel mailing list > maker-devel at box290.bluehost.com > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org > > > -- > > Surya Saha > Sol Genomics Network > Boyce Thompson Institute, Ithaca, NY, USA > https://citrusgreening.org/ > http://www.linkedin.com/in/suryasaha > https://twitter.com/SahaSurya -------------- next part -------------- An HTML attachment was scrubbed... URL: From gdolby at asu.edu Fri Jun 15 10:29:16 2018 From: gdolby at asu.edu (Greer Dolby) Date: Fri, 15 Jun 2018 09:29:16 -0700 Subject: [maker-devel] Best annotation set failure (auto_annotator.pm line 1774) Message-ID: <7A8554F9-3456-4D53-B5A8-80FA794F42EE@asu.edu> Hello, I?m de novo annotating a second-generation genome and I have a handful of scaffolds that keep failing while choosing the best annotation set (errors below). To my knowledge there is anything wrong with the scaffolds themselves and the others have run fine. Looking at the VOID folder for the failed contigs for different runs, they appear to fail at different chunks but always with these two errors. I?m running 30 cores with MPICH2 on a private server. Any suggestions? Thanks! 
Best, Greer ^^^^^^^^^^^^^^^^^^^^^^^^^^EXAMPLE 1 ...processing 8 of 12 total clusters:44 now processing 0 ...processing 0 of 3 ...processing 1 of 3 ...processing 2 of 3 total clusters:44 now processing 0 ...processing 0 of 4 ...processing 1 of 4 ...processing 9 of 12 ...processing 2 of 4 ...processing 3 of 4 total clusters:44 now processing 0 ...processing 10 of 12 ...processing 0 of 67 ...processing 1 of 67 ERROR: Chunk failed at level:6, tier_type:0 ...processing 2 of 67 FAILED CONTIG:ScCC6lQ_38465;HRSCAF=64658 ^^^^^^^^^^^^^^^^^^^^^^^^^^EXAMPLE 2 ...processing 9 of 298 ...processing 8 of 81 ...processing 11 of 202 ...processing 13 of 20 ...processing 10 of 298 ...processing 9 of 81 ...processing 10 of 81 ...processing 18 of 123 ...processing 14 of 20 ...processing 17 of 54 ...processing 18 of 54 ...processing 37 of 164 ...processing 20 of 254 Died at /home/jcornel3/tools/maker/bin/../lib/maker/auto_annotator.pm line 1774. --> rank=17, hostname=omega ERROR: Failed while choosing best annotation set ERROR: Chunk failed at level:4, tier_type:4 FAILED CONTIG:ScCC6lQ_16796;HRSCAF=38896 _________________________________ Greer Dolby, PhD Postdoctoral Research Scholar SoLS, Arizona State U. office: LSE 313, 480.965.7456 website | twitter Kusumi Lab -------------- next part -------------- An HTML attachment was scrubbed... URL: From kapeelc at gmail.com Fri Jun 22 13:41:58 2018 From: kapeelc at gmail.com (Kapeel Chougule) Date: Fri, 22 Jun 2018 15:41:58 -0400 Subject: [maker-devel] map_forward=1 not mapping reference ID's to output correctly Message-ID: Hi, I am trying to update community annotation in the light of new evidence data but my MAKER runs are not keeping all the genes from the community annotation. 
Community annotation feature count:
2 1 bicolor
239969 CDS
266301 exon
51066 five_prime_UTR
34129 gene
47121 mRNA
53708 three_prime_UTR

MAKER gene count->
awk '$3=="gene"{print}' maker_output.all.gff | grep "Sobic*" | wc -l
21105

In the attached maker_opts.ctl file, I set keep_preds=1 and map_forward=1, which should keep all the community gene models even if they don't have evidence support. This was explained here: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Updating_annotations_in_light_of_new_data . So I am not sure why not all of the community gene models are mapped in the MAKER output.

Thanks
Kapeel

--
Kapeel Chougule
Computational Scientist Developer II
One Bungtown Road
Cold Spring Harbor, NY 11724
http://www.warelab.org/
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4991 bytes
Desc: not available
URL:

From monica.poelchau at ars.usda.gov Fri Jun 22 14:04:28 2018
From: monica.poelchau at ars.usda.gov (Poelchau, Monica)
Date: Fri, 22 Jun 2018 20:04:28 +0000
Subject: [maker-devel] [CAUTION: Suspicious Link] map_forward=1 not mapping reference ID's to output correctly
In-Reply-To:
References:
Message-ID:

Hi Kapeel,

If you just want your community annotations to replace models in an existing gene set, we have a tool for this: https://github.com/NAL-i5K/GFF3toolkit

You'd need to run gff3_QC on your annotation files first to make sure your annotations are okay, then use gff3_merge to merge your community annotations with your existing gene set (in gff3 format).

If you end up trying this out: we're actively developing the GFF3toolkit, so feel free to post an issue if you notice any problems.
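Before merging, it may also help to see exactly which community gene models did not make it into the MAKER output, rather than only comparing counts. The sketch below is a rough illustration, not part of MAKER or the GFF3toolkit: the function names `gene_names` and `missing_genes` are hypothetical, and it assumes that a carried-forward gene keeps its community name in the Name attribute (falling back to ID when no Name is present).

```python
def gene_names(gff_lines):
    """Collect the Name (or, failing that, ID) of every gene feature."""
    names = set()
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        label = attrs.get("Name") or attrs.get("ID")
        if label:
            names.add(label)
    return names


def missing_genes(community_lines, maker_lines):
    """Return community gene names that are absent from the MAKER output."""
    return sorted(gene_names(community_lines) - gene_names(maker_lines))
```

Each argument can be an open file handle, for example the community GFF3 and maker_output.all.gff, since the functions only iterate over lines. Looking at a few of the missing loci in a browser should show whether they were dropped, merged, or renamed.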
Hth, Monica From: maker-devel on behalf of Kapeel Chougule Date: Friday, June 22, 2018 at 13:53 To: "maker-devel at yandell-lab.org" Subject: [CAUTION: Suspicious Link][maker-devel] map_forward=1 not mapping reference ID's to output correctly PROCEED WITH CAUTION: This message triggered warnings of potentially malicious web content. Evaluate this email by considering whether you are expecting the message, along with inspection for suspicious links. Questions: Spam.Abuse at wdc.usda.gov Hi, I am trying to update community annotation in the light of new evidence data but my MAKER runs are not keeping all the genes from the community annotation. Community annotation feature count: 2 1 bicolor 239969 CDS 266301 exon 51066 five_prime_UTR 34129 gene 47121 mRNA 53708 three_prime_UTR MAKER gene count-> awk '$3=="gene"{print}' maker_output.all.gff | grep "Sobic*" | wc -l 21105 In the maker_opts.ctl file attached, I did make keep_preds=1 and map_forward=1 which keep all the community gene models even if they dont have evidence support. This was explained here: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Updating_annotations_in_light_of_new_data . So not sure why we dont have the all the community gene models mapped in the MAKER output Thanks Kapeel -- Kapeel Chougule Computational Scientist Developer II One Bungtown Road Cold Spring Harbor, NY 11724 http://www.warelab.org/ This electronic message contains information generated by the USDA solely for the intended recipients. Any unauthorized interception of this message or the use or disclosure of the information it contains may violate the law and subject the violator to civil or criminal penalties. If you believe you have received this message in error, please notify the sender and delete the email immediately. -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From andremmachado25 at gmail.com Tue Jun 26 09:36:24 2018
From: andremmachado25 at gmail.com (André Machado)
Date: Tue, 26 Jun 2018 16:36:24 +0100
Subject: [maker-devel] Maker Error : Thread 1 terminated abnormally..
Message-ID:

Hi,

First of all, thanks for your efforts on the MAKER pipeline. It's a tremendous help for people who work with genomes.

For the last 4 days I have been racking my brain over an error, but I am still without a solution.

I found this old thread: https://groups.google.com/forum/#!msg/maker-devel/X2-76BH9gvg/rU4kLJ3B6tsJ
It seems quite similar, but it doesn't point to a specific solution.

I have run MAKER with the test data and everything ran OK; MAKER finished the entire process without errors.

Recently, I have been trying to run my own data on an MPI cluster, but this error occurs frequently:

Thread 1 terminated abnormally: ../dna.maker.output/mpi_blastdb/dna%2Efa.mpi.1/dna%2Efa.mpi.1.0
--> rank=8, hostname=compute-0-1.local, at ../Analysis/Geno/maker/bin/maker line 1451 thread 1.
--> rank=8, hostname=compute-0-1.local
deleted:0 hits
deleted:0 hits
preparing ab-inits
deleted:0 hits
deleted:0 hits
FATAL: Thread terminated, causing all processes to fail
--> rank=8, hostname=compute-0-1.local
deleted:0 hits

Basically, I'm trying to run MAKER with dna.fa, rna.fa, prot.fa and my_custom_lib_of_repeats.fa to produce raw gene models, which will be used to train SNAP.

I have already tried several command lines and all gave me the same error. The only difference between tests was the location of the error: sometimes it happened on compute-0-1.local, other times on compute-0-4.local or on another node.

mpiexec -n 63 --hostfile Host maker 1>1.log 2>2.err
mpiexec --hostfile Host maker 1>1.log 2>2.err
mpiexec -mca btl ^openib -n 63 --hostfile Host maker 1>1.log 2>2.err
nohup mpiexec -mca btl ^openib -n 63 --hostfile Host maker -a 1>1.log 2>2.err

The log file as well as the option files are attached below.

Many thanks in advance,
André
-------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 2.log Type: text/x-log Size: 38655 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: maker_exe.ctl Type: application/octet-stream Size: 1224 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: maker_opts.ctl Type: application/octet-stream Size: 4548 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: maker_bopts.ctl Type: application/octet-stream Size: 1413 bytes Desc: not available URL: From vsoza at uw.edu Fri Jun 1 13:36:10 2018 From: vsoza at uw.edu (Valerie Soza) Date: Fri, 1 Jun 2018 12:36:10 -0700 Subject: [maker-devel] how to input a masked assembly for annotation into Maker Message-ID: <88100FC0-DEDD-4D4D-80D2-12A4CB930557@uw.edu> Hi Maker community I have done repeatmasking and gene prediction using the Maker pipeline on an entire genome assembly that has some scaffolds mapped to chromosomes and others that are unmapped, for a total of 11,985 scaffolds (Annotation A). I used the standard build protocol #2 as outlined by Carson at https://groups.google.com/forum/#!searchin/maker-devel/Provide$20maker_gff$20%7Csort:date/maker-devel/7nU0EaSe2ww/Hb8ARa0WBAAJ. This gave me a total of 23,559 predicted genes (21,960 genes from the default build + 1,599 genes that were rescued with Pfam domains) for the entire assembly. Now, I want to only use scaffolds that have been mapped and ordered along chromosomes to create a pseudochromosal sequence for each chromosome that stitches together all the ordered scaffolds along a chromosome, each scaffold separated by a stretch of 100 Ns, for synteny analyses. I then want to get annotations for each of these pseudochromosal sequences. 
I am trying to see if I can re-do the annotations by Maker on these pseudochromosomal sequences using the masked assembly produced by Maker above. I have extracted the masked sequence for ordered scaffolds for each chromosome and have used these masked sequences to create pseudochromosomal scaffolds, resulting in 13 scaffolds representing the 13 chromosomes. I tried using these masked sequences (13 scaffolds) as input for Maker to create a standard build (Annotation B), but am getting fewer genes predicted for these scaffolds than what I got from my entire assembly (Annotation A) above.

For all ordered scaffolds across chromosomes, I got 21,419 genes from the standard build annotation on the entire assembly (Annotation A). However, using the masked pseudochromosomal scaffolds (Annotation B), I am getting fewer genes predicted for the same set of scaffolds: 20,830 genes (18,465 from default build + 2,365 genes that were rescued with Pfam domains).

I am wondering if I have a setting wrong in my maker_opts.ctl files for the default and standard build runs on the masked sequences (see attached below), particularly in the repeat masking or re-annotation part of the file.
I also looked at the default build annotations in JBrowse and compared Annotation A to Annotation B. They looked similar, except that some transcripts from my altest= file were present in Annotation A but did not show up in Annotation B; that is, Maker predicted some of these genes in Annotation A but not in Annotation B. So I think Maker was missing some things in Annotation B. Is this unexpected? If so, is someone willing to check my steps and control files below to see if I did something wrong?

Or, if someone has a better suggestion for extracting coding sequence from these pseudochromosomal scaffolds that were created post-Maker annotation on the entire assembly, I would welcome it. Thanks a bunch.
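As an illustration of the stitching described above (ordered scaffolds joined by 100 Ns), the following Python sketch builds one pseudochromosome and records each scaffold's offset, so that coordinates from an existing annotation could in principle be shifted instead of re-predicted. This is not part of MAKER and was not in the original message: the function names `stitch` and `lift` and the toy sequences are hypothetical, and the sketch assumes every scaffold is placed in forward orientation (a reverse-complemented scaffold would need different arithmetic).

```python
GAP = "N" * 100  # spacer length described in the message above


def stitch(scaffolds):
    """Join ordered (name, sequence) pairs with 100-N spacers.

    Returns the stitched sequence and a dict giving each scaffold's
    0-based start offset within it.
    """
    parts, offsets, position = [], {}, 0
    for index, (name, seq) in enumerate(scaffolds):
        if index > 0:  # spacer goes between scaffolds, not before the first
            parts.append(GAP)
            position += len(GAP)
        offsets[name] = position
        parts.append(seq)
        position += len(seq)
    return "".join(parts), offsets


def lift(start, end, scaffold, offsets):
    """Shift 1-based, inclusive GFF3 coordinates onto the stitched sequence."""
    shift = offsets[scaffold]
    return start + shift, end + shift


# Toy data, not taken from the assembly discussed in this thread.
pseudo, offsets = stitch([("scafA", "ACGTACGTAC"), ("scafB", "GGGCCC")])
new_start, new_end = lift(2, 4, "scafB", offsets)
print(offsets["scafB"], new_start, new_end)
```

Because the offsets are plain additions, the same table could be applied to every feature line of a GFF3 file; whether that is preferable to re-running the pipeline depends on the analysis.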
Annotation A default build steps: $ maker -base Rwill10 -fix_nucleotides $ maker -base Rwill10 -fix_nucleotides $ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc 11983 11983 312159 #should be 11985 $ maker -dsindex -base Rwill10 -fix_nucleotides $ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc 11985 11985 312211 $ gff3_merge -d Rwill10_master_datastore_index.log $ grep -cP '\tgene\t' ../Rwill10.maker.output/Rwill10.all.gff 21960 $ fasta_merge -d Rwill10_master_datastore_index.log -------------- next part -------------- A non-text attachment was scrubbed... Name: maker_opts.log.AnnotationA.default.log Type: application/octet-stream Size: 4650 bytes Desc: not available URL: -------------- next part -------------- Annotation A standard build steps: $ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta #genes in Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta and Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv are not in Rwill10.all.gff file #IDs in .tsv file are called "processed-gene" from .fasta file, #but in .gff file, I think these are called "abinit-gene" #best thing would be to replace "processed" with "abinit" in tsv file and then grep .gff file with these IDs to create pred_gff $ sed s/\-processed\-/\-abinit\-/g Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed #extract list of IDs only to grep for cut -f 1 Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs $ sort Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs $ wc Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs 1599 1599 102958 
Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs #used create_pred_gff.sh script to grep IDs from Rwill10.all.gff file and create Rwill10.PfamA.abinit.gff $ maker -base Rwill10standard2 -fix_nucleotides $ maker -base Rwill10standard2 -fix_nucleotides $ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc 11975 11975 311953 #should be 11985 $ maker -dsindex -base Rwill10standard2 -fix_nucleotides $ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc 11985 11985 312211 $ gff3_merge -d Rwill10standard2_master_datastore_index.log $ grep -cP '\tgene\t' Rwill10standard2.all.gff 23559 $ fasta_merge -d Rwill10standard2_master_datastore_index.log -------------- next part -------------- A non-text attachment was scrubbed... Name: maker_opts.log.AnnotationA.standard.log Type: application/octet-stream Size: 4529 bytes Desc: not available URL: -------------- next part -------------- Annotation B default build steps: $ find Rwill10.maker.output -name 'query.masked.fasta' | sort -V -t "/" -k 5 | xargs cat > Rwill10.maker.assembly_masked.sorted.fasta #Use perl script remove_seq_breaks.pl to remove newline characters from sequences in genome fasta file so that only 1 line of sequence follows fasta header $ perl ../remove_seq_breaks.pl Rwill10.maker.assembly_masked.sorted.fasta > Rwill10.maker.assembly_masked.sorted.fasta.woseqbreaks #use script to extract ordered scaffolds for each chromosome $ ./extract_scaffolds_synteny.sh #use script to create pseudochromosomal sequence for each chromosome $ ./create_pseudo_chromosome_allLGs.sh #concatenate these into one fasta file cat *_ordered.masked.fasta2.pseudochromo > R.will10.masked.pseudochromos.fasta $ maker -base Rwill10.pseudochromos -fix_nucleotides $ maker -base Rwill10.pseudochromos -fix_nucleotides $ grep FINISHED Rwill10.pseudochromos_master_datastore_index.log | sort | uniq | cut -f 1 | wc 13 13 312 $ gff3_merge -d 
Rwill10.pseudochromos_master_datastore_index.log $ grep -cP '\tgene\t' Rwill10.pseudochromos.all.gff 18465 $ fasta_merge -d Rwill10.pseudochromos_master_datastore_index.log -------------- next part -------------- A non-text attachment was scrubbed... Name: maker_opts.log.AnnotationB.default.log Type: application/octet-stream Size: 4604 bytes Desc: not available URL: -------------- next part -------------- Annotation B standard build steps: $ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta $ sed s/\-processed\-/\-abinit\-/g Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed $ cut -f 1 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs $ sort Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs $ wc Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs 2365 2365 136750 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs #used create_pred_gff.sh script to grep IDs from Rwill10.pseudochromos.all.gff file and create Rwill10.pseudochromos.PfamA.abinit.gff $ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides $ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides $ grep FINISHED Rwill10.pseudochromo.standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc 13 13 312 $ gff3_merge -d Rwill10.pseudochromo.standard2_master_datastore_index.log $ grep -cP '\tgene\t' Rwill10.pseudochromo.standard2.all.gff 20830 -------------- next part -------------- A non-text attachment was scrubbed... 
Name: maker_opts.log.AnnotationB.standard.log
Type: application/octet-stream
Size: 4558 bytes
Desc: not available
URL:
-------------- next part --------------
-Valerie

Valerie Soza, Ph.D.
c/o Hall Lab
Department of Biology
University of Washington
Johnson Hall 202A
Box 351800
Seattle, WA 98195-1800
206-543-6740
http://staff.washington.edu/vsoza/

From carsonhh at gmail.com Fri Jun 1 16:01:13 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 1 Jun 2018 16:01:13 -0600
Subject: [maker-devel] Building MAKER with specific perl version
In-Reply-To:
References:
Message-ID: <78BA2892-5E3A-4DE5-AA44-18A9BCEF8071@gmail.com>

You probably need to run './Build realclean' to clean up your previous build and settings that you likely ran with the system perl. You may even need to clean out the .../maker/perl directory. You can also try updating your Module::Build installation.

--Carson

> On May 31, 2018, at 8:30 AM, Ksenia Lavrichenko wrote:
>
> Hi,
>
> I have been banging my head for a while now, trying to install MAKER with my specific perl.
>
> I found this old thread: https://groups.google.com/forum/#!msg/maker-devel/hScqdJW0FsU/3KT_UF7k9XMJ
>
> However, this does not work for me. I make sure bin/* and Build are deleted before I run $myperl Build.PL.
>
> I see my perl in shebang of Build however after ./Build install all scripts in bin have "#! /usr/bin/perl" which produces a version error when I try to run maker -h.
>
> Any tips of what do I need to adjust in Build.PL?
>
> Many thanks,
> Ksenia
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From carsonhh at gmail.com Mon Jun 11 10:46:13 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Jun 2018 10:46:13 -0600
Subject: [maker-devel] how to input a masked assembly for annotation into Maker
In-Reply-To: <88100FC0-DEDD-4D4D-80D2-12A4CB930557@uw.edu>
References: <88100FC0-DEDD-4D4D-80D2-12A4CB930557@uw.edu>
Message-ID: <9987ECCD-E7B6-45CD-9CBB-0B13B7608E4C@gmail.com>

Predictors will call partial models near the edge of contigs, but if you merge them with 100 N gaps, they will be more likely to jump the gap (attempt to merge models on each side while putting the gap in the intron - even if that is not correct), or they will just call nothing around the gap (i.e. they will behave differently around gaps than at the edge of contigs).

--Carson

> On Jun 1, 2018, at 1:36 PM, Valerie Soza wrote:
>
> Hi Maker community
>
> I have done repeatmasking and gene prediction using the Maker pipeline on an entire genome assembly that has some scaffolds mapped to chromosomes and others that are unmapped, for a total of 11,985 scaffolds (Annotation A). I used the standard build protocol #2 as outlined by Carson at https://groups.google.com/forum/#!searchin/maker-devel/Provide$20maker_gff$20%7Csort:date/maker-devel/7nU0EaSe2ww/Hb8ARa0WBAAJ. This gave me a total of 23,559 predicted genes (21,960 genes from the default build + 1,599 genes that were rescued with Pfam domains) for the entire assembly.
>
> Now, I want to only use scaffolds that have been mapped and ordered along chromosomes to create a pseudochromosal sequence for each chromosome that stitches together all the ordered scaffolds along a chromosome, each scaffold separated by a stretch of 100 Ns, for synteny analyses. I then want to get annotations for each of these pseudochromosal sequences. I am trying to see if I can re-do the annotations by Maker on these pseudochromosal sequences using the masked assembly produced by Maker above.
I have extracted the masked sequence for ordered scaffolds for each chromosome and have used these masked sequences to create pseudochromosomal scaffolds, resulting in 13 scaffolds representing the 13 chromosomes. I tried using these masked sequences (13 scaffolds) as input for Maker to create a standard build (Annotation B), but am getting less genes predicted for these scaffolds than what I got from my entire assembly (Annotation A) above. > > For all ordered scaffolds across chromosomes, I got 21,419 genes from the standard build annotation on the entire assembly (Annotation A). However, using the masked pseudochromosomal scaffolds (Annotation B), I am getting less genes predicted for the same set of scaffolds: 20,830 genes (18,465 from default build + 2,365 genes that were rescued with Pfam domains). > > I am wondering if I have a setting wrong for my maker_opts.ctl files for the default and standard build runs on the masked sequences, see attached below, particularly in the repeat masking or re-annotation part of the file. > I also looked at the default build annotations in Jbrowse and compared Annotation A to Annotation B, and they looked similar except that there were transcripts from my altest= file that were not showing up in Annotation B, but were present in Annotation A; therefore, Maker did not predict some of these genes in Annotation B, but did in Annotation A. So I think Maker was missing some things in Annotation B. Is this unexpected? If so, is someone willing to check my steps and control files below to see if I did something wrong? > > Or, if someone has a better suggestion for extracting coding sequence from these pseudochromosal scaffolds that were created post-Maker annotation on the entire assembly, I would welcome it. iThanks a bunch. 
> > > Annotation A default build steps: > > $ maker -base Rwill10 -fix_nucleotides > $ maker -base Rwill10 -fix_nucleotides > > $ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc > 11983 11983 312159 > #should be 11985 > > $ maker -dsindex -base Rwill10 -fix_nucleotides > > $ grep FINISHED Rwill10_master_datastore_index.log | sort | uniq | cut -f 1 | wc > 11985 11985 312211 > > $ gff3_merge -d Rwill10_master_datastore_index.log > > $ grep -cP '\tgene\t' ../Rwill10.maker.output/Rwill10.all.gff > 21960 > > $ fasta_merge -d Rwill10_master_datastore_index.log > > > > > Annotation A standard build steps: > > $ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta > > #genes in Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta and Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv are not in Rwill10.all.gff file > #IDs in .tsv file are called "processed-gene" from .fasta file, > #but in .gff file, I think these are called "abinit-gene" > #best thing would be to replace "processed" with "abinit" in tsv file and then grep .gff file with these IDs to create pred_gff > $ sed s/\-processed\-/\-abinit\-/g Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > > #extract list of IDs only to grep for > cut -f 1 Rwill10.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs > > $ sort Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs > > $ wc Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs > 1599 1599 102958 Rwill10.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs > > #used create_pred_gff.sh script to grep IDs from Rwill10.all.gff file and create Rwill10.PfamA.abinit.gff > > $ maker -base 
Rwill10standard2 -fix_nucleotides > $ maker -base Rwill10standard2 -fix_nucleotides > > $ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc > 11975 11975 311953 > #should be 11985 > > $ maker -dsindex -base Rwill10standard2 -fix_nucleotides > > $ grep FINISHED Rwill10standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc > 11985 11985 312211 > > $ gff3_merge -d Rwill10standard2_master_datastore_index.log > > $ grep -cP '\tgene\t' Rwill10standard2.all.gff > 23559 > > $ fasta_merge -d Rwill10standard2_master_datastore_index.log > > > > > Annotation B default build steps: > > $ find Rwill10.maker.output -name 'query.masked.fasta' | sort -V -t "/" -k 5 | xargs cat > Rwill10.maker.assembly_masked.sorted.fasta > > #Use perl script remove_seq_breaks.pl to remove newline characters from sequences in genome fasta file so that only 1 line of sequence follows fasta header > $ perl ../remove_seq_breaks.pl Rwill10.maker.assembly_masked.sorted.fasta > Rwill10.maker.assembly_masked.sorted.fasta.woseqbreaks > > #use script to extract ordered scaffolds for each chromosome > $ ./extract_scaffolds_synteny.sh > > #use script to create pseudochromosomal sequence for each chromosome > $ ./create_pseudo_chromosome_allLGs.sh > > #concatenate these into one fasta file > cat *_ordered.masked.fasta2.pseudochromo > R.will10.masked.pseudochromos.fasta > > $ maker -base Rwill10.pseudochromos -fix_nucleotides > $ maker -base Rwill10.pseudochromos -fix_nucleotides > > $ grep FINISHED Rwill10.pseudochromos_master_datastore_index.log | sort | uniq | cut -f 1 | wc > 13 13 312 > > $ gff3_merge -d Rwill10.pseudochromos_master_datastore_index.log > > $ grep -cP '\tgene\t' Rwill10.pseudochromos.all.gff > 18465 > > $ fasta_merge -d Rwill10.pseudochromos_master_datastore_index.log > > > > > Annotation B standard build steps: > > $ ./interproscan.sh -appl PfamA -iprlookup -goterms -f tsv -i 
Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta > > $ sed s/\-processed\-/\-abinit\-/g Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > > $ cut -f 1 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.fasta.tsv.IDsfixed > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs > > $ sort Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.IDs | uniq > Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs > > $ wc Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs > 2365 2365 136750 Rwill10.pseudochromos.all.maker.non_overlapping_ab_initio.proteins.PfamA.uniqIDs > > #used create_pred_gff.sh script to grep IDs from Rwill10.pseudochromos.all.gff file and create Rwill10.pseudochromos.PfamA.abinit.gff > > $ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides > $ maker -base Rwill10.pseudochromo.standard2 -fix_nucleotides > > $ grep FINISHED Rwill10.pseudochromo.standard2_master_datastore_index.log | sort | uniq | cut -f 1 | wc > 13 13 312 > > $ gff3_merge -d Rwill10.pseudochromo.standard2_master_datastore_index.log > > $ grep -cP '\tgene\t' Rwill10.pseudochromo.standard2.all.gff > 20830 > > > > > -Valerie > > Valerie Soza, Ph.D. 
> c/o Hall Lab
> Department of Biology
> University of Washington
> Johnson Hall 202A
> Box 351800
> Seattle, WA 98195-1800
> 206-543-6740
> http://staff.washington.edu/vsoza/

From flopezo84 at gmail.com Sat Jun 9 14:06:48 2018
From: flopezo84 at gmail.com (Federico López)
Date: Sat, 9 Jun 2018 16:06:48 -0400
Subject: [maker-devel] Filtering gene models based on eAED scores
Message-ID:

Hello,

I'm using MAKER's "quality_filter.pl" with the default option (AED<1). However, I have noticed cases in which models have low AED scores and high eAED scores (1.00), so presumably the good AED scores are the result of spurious evidence alignments. Is there a way to filter models based on eAED scores too?

Thank you.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From kissaj at miamioh.edu Mon Jun 11 11:56:46 2018
From: kissaj at miamioh.edu (Andor J Kiss)
Date: Mon, 11 Jun 2018 13:56:46 -0400
Subject: [maker-devel] largest genome annotated?
Message-ID: <1528739806.4677.97.camel@miamioh.edu>

What's the largest genome that's been annotated with Maker2?

Thanks,
--
________________________________________________________________________________________________________________________
Andor J Kiss, PhD
Director - Center for Bioinformatics & Functional Genomics
086 Pearson Hall - Miami University
700 East High Street, Oxford
Ohio 45056
USA

eMAIL: KissAJ at MiamiOH.edu
Telephone: +1 (513) 529-4280
Fax: +1 (513) 529-2431
Ring ID: andorjkiss

URL (CBFG): http://miamioh.edu/cas/academics/centers/cbfg/
URL (CBFG Services): https://www.scienceexchange.com/labs/center-for-bioinformatics-functional-genomics
URL (Research): http://openwetware.org/wiki/User:Andor_J_Kiss
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From carsonhh at gmail.com Mon Jun 11 12:05:07 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Jun 2018 12:05:07 -0600
Subject: [maker-devel] largest genome annotated?
In-Reply-To: <1528739806.4677.97.camel@miamioh.edu>
References: <1528739806.4677.97.camel@miamioh.edu>
Message-ID: <34BF91FD-439F-466D-B1EA-3FEDB9EA0F86@gmail.com>

The Sugar Pine. It's 31 gigabases in length and the largest genome ever sequenced. The Loblolly Pine (second largest genome ever sequenced at just over 20 gigabases) was also annotated using MAKER.

--Carson

> On Jun 11, 2018, at 11:56 AM, Andor J Kiss wrote:
>
> What's the largest genome that's been annotated with Maker2?
>
> Thanks,
> --
> ________________________________________________________________________________________________________________________
> Andor J Kiss, PhD
> Director - Center for Bioinformatics & Functional Genomics
> 086 Pearson Hall - Miami University
> 700 East High Street, Oxford
> Ohio 45056
> USA
>
> eMAIL: KissAJ at MiamiOH.edu
> Telephone: +1 (513) 529-4280
> Fax: +1 (513) 529-2431
> Ring ID: andorjkiss
>
> URL (CBFG): http://miamioh.edu/cas/academics/centers/cbfg/
> URL (CBFG Services): https://www.scienceexchange.com/labs/center-for-bioinformatics-functional-genomics
> URL (Research): http://openwetware.org/wiki/User:Andor_J_Kiss
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From carsonhh at gmail.com Mon Jun 11 12:13:28 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Jun 2018 12:13:28 -0600
Subject: [maker-devel] largest genome annotated?
In-Reply-To: <34BF91FD-439F-466D-B1EA-3FEDB9EA0F86@gmail.com>
References: <1528739806.4677.97.camel@miamioh.edu> <34BF91FD-439F-466D-B1EA-3FEDB9EA0F86@gmail.com>
Message-ID: <45218C8A-70A4-4146-9962-EA0E8F506265@gmail.com>

Correction. The recently published axolotl just barely edged out the Sugar Pine in size a couple of months ago. They did not really annotate it, though; instead they independently assembled the transcriptome from mRNA-seq and aligned those transcripts where they could.

--Carson

> On Jun 11, 2018, at 12:05 PM, Carson Holt wrote:
>
> The Sugar Pine. It's 31 gigabases in length and the largest genome ever sequenced. The Loblolly Pine (second largest genome ever sequenced at just over 20 gigabases) was also annotated using MAKER.
>
> --Carson
>
>> On Jun 11, 2018, at 11:56 AM, Andor J Kiss wrote:
>>
>> What's the largest genome that's been annotated with Maker2?
>>
>> Thanks,
>> --
>> ________________________________________________________________________________________________________________________
>> Andor J Kiss, PhD
>> Director - Center for Bioinformatics & Functional Genomics
>> 086 Pearson Hall - Miami University
>> 700 East High Street, Oxford
>> Ohio 45056
>> USA
>>
>> eMAIL: KissAJ at MiamiOH.edu
>> Telephone: +1 (513) 529-4280
>> Fax: +1 (513) 529-2431
>> Ring ID: andorjkiss
>>
>> URL (CBFG): http://miamioh.edu/cas/academics/centers/cbfg/
>> URL (CBFG Services): https://www.scienceexchange.com/labs/center-for-bioinformatics-functional-genomics
>> URL (Research): http://openwetware.org/wiki/User:Andor_J_Kiss
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From jennifer.anderson at ebc.uu.se Tue Jun 12 09:59:31 2018
From: jennifer.anderson at ebc.uu.se (Jennifer Anderson)
Date: Tue, 12 Jun 2018 17:59:31 +0200
Subject: [maker-devel] Merge warning = 1
Message-ID: <44468745-0D50-48D4-B941-1864EA067A7D@ebc.uu.se>

Hello,

I am working on the ab initio annotation of a fungal genome, taking advantage of cDNA and protein evidence from other species. I am looking at my gff file following my first full attempt (hmm files from SNAP trained 2x, and GeneMark-ES).

I have not found information on what "merge_warning=1" means at the end of the ID line, as in the example below. Out of 6834 ID lines that include AED scores in this gff file, 5441 contain this warning. I would appreciate it if someone could point me in the right direction.

000030F|arrow maker gene 9255 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43;Name=genemark-000030F|arrow-processed-gene-0.43
000030F|arrow maker mRNA 9255 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1;Parent=genemark-000030F|arrow-processed-gene-0.43;Name=genemark-000030F|arrow-processed-gene-0.43-mRNA-1;_AED=0.06;_eAED=0.06;_QI=0|0|0|1|1|1|2|0|220;_merge_warning=1
000030F|arrow maker exon 9838 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:2;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
000030F|arrow maker exon 9255 9762 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:1;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
000030F|arrow maker CDS 9838 9992 . - 0 ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:cds;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
000030F|arrow maker CDS 9255 9762 . - 1 ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:cds;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1

Best,

Jenni

E-mailing Uppsala University means that we will process your personal data. For more information on how this is performed, please read here: http://www.uu.se/om-uu/dataskydd-personuppgifter/
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From carsonhh at gmail.com Tue Jun 12 10:03:37 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 12 Jun 2018 10:03:37 -0600
Subject: [maker-devel] Merge warning = 1
In-Reply-To: <44468745-0D50-48D4-B941-1864EA067A7D@ebc.uu.se>
References: <44468745-0D50-48D4-B941-1864EA067A7D@ebc.uu.se>
Message-ID:

It's an internal debugging-related value used in the beta. It has no meaning for the general user and will eventually disappear.

--Carson

> On Jun 12, 2018, at 9:59 AM, Jennifer Anderson wrote:
>
> Hello,
>
> I am working on the ab initiio annotation of a fungal genome, taking advantage of cDNA and protein evidence from other species. I am looking at my gff file following my first full attempt (hmm files from snap trained 2x, and GenemarkES).
>
> I have not found information on what it means "merge_warning=1" at the end of the ID line, as in the example below. Out of 6834 ID lines that include AED scores in this gff file, 5441 contain this warning. I would appreciate it if someone could point me in the right direction.
>
> 000030F|arrow maker gene 9255 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43;Name=genemark-000030F|arrow-processed-gene-0.43
> 000030F|arrow maker mRNA 9255 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1;Parent=genemark-000030F|arrow-processed-gene-0.43;Name=genemark-000030F|arrow-processed-gene-0.43-mRNA-1;_AED=0.06;_eAED=0.06;_QI=0|0|0|1|1|1|2|0|220;_merge_warning=1
> 000030F|arrow maker exon 9838 9992 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:2;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
> 000030F|arrow maker exon 9255 9762 . - . ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:1;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
> 000030F|arrow maker CDS 9838 9992 . - 0 ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:cds;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
> 000030F|arrow maker CDS 9255 9762 . - 1 ID=genemark-000030F|arrow-processed-gene-0.43-mRNA-1:cds;Parent=genemark-000030F|arrow-processed-gene-0.43-mRNA-1
>
> Best,
>
> Jenni
>
> E-mailing Uppsala University means that we will process your personal data. For more information on how this is performed, please read here: http://www.uu.se/om-uu/dataskydd-personuppgifter/
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From steinj at cshl.edu Tue Jun 12 12:08:19 2018
From: steinj at cshl.edu (Stein, Joshua)
Date: Tue, 12 Jun 2018 18:08:19 +0000
Subject: [maker-devel] Transcript & protein fasta sequence id/name collisions
Message-ID: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu>

Dear Carson and maker-devel group,

In our recent MAKER run, some of the transcript and protein IDs in the fasta files (e.g. *.all.maker.transcripts.fasta) correspond to the "Name=" field of the GFF. The problem is that these names are not unique, so for example the transcript ID "mRNA_4" occurs 45 times, thus making it difficult to determine their corresponding gene IDs.
My colleague, Kapeel, who is copied, believes this happens when fgenesh results are passed to MAKER as a GFF using the "pred_gff=" parameter.

How do we prevent this from happening in the future (e.g. tell MAKER to use the "ID=" field for transcript/protein fasta IDs)?
Do you have any tips on how to triage the current situation (i.e. figure out which fasta corresponds to which gene)? I suppose it is possible to match up based on the quality metrics in the definition line, assuming these are unique.

Thanks,
Josh

Joshua Stein, PhD
Manager, Sci. Informatics III
Cold Spring Harbor Laboratory
steinj at cshl.edu
http://ware.cshl.org/

From carsonhh at gmail.com Tue Jun 12 14:19:19 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 12 Jun 2018 14:19:19 -0600
Subject: [maker-devel] Transcript & protein fasta sequence id/name collisions
In-Reply-To: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu>
References: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu>
Message-ID: <91A6DB6E-B41E-4A6A-81D0-E95CBFCE07C7@gmail.com>

The ID= tag in GFF3 is not meant to be an identifier (despite being called ID). Instead it's a tag used to determine inheritance when reassembling a feature. It will be treated as such by all GMOD tools, while Name will be treated as the user-facing identifier. To further complicate things, 'Name' is not required to be unique (some groups like to name multi-copy genes the same and then have multiple locations in the GFF3; you will see this all the time in NCBI-derived GFF3, for example). I would not be surprised if fgenesh does not produce "best-practice" GFF3. You may need to slightly alter it before using it.

On the MAKER end, at the very least it would be nice for MAKER to complain that the names in the input file are not unique (even though non-unique names are allowed in GFF3). If you give GFF3 to pred_gff then MAKER should build its own unique names for things, but for model_gff it will keep the name you give it.
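One way to "slightly alter" an input GFF3 so that repeated Name values become unique is sketched below in Python. This is only an assumption about what such a fix could look like, not something MAKER or fgenesh provides: the function names and the `.2`, `.3` suffix scheme are hypothetical, the example lines are made up, and the sketch does not check whether a suffixed name collides with a name already present in the file.

```python
from collections import Counter


def parse_attributes(column9):
    """Split a GFF3 attribute column into an ordered list of (key, value) pairs."""
    pairs = []
    for field in column9.strip().split(";"):
        if field:
            key, _, value = field.partition("=")
            pairs.append((key, value))
    return pairs


def uniquify_names(gff_lines):
    """Append a counter suffix to every repeated Name= value.

    The first occurrence keeps its name; later ones become name.2, name.3, ...
    Comment lines and lines without nine columns pass through unchanged.
    """
    seen = Counter()
    out = []
    for line in gff_lines:
        text = line.rstrip("\n")
        cols = text.split("\t")
        if text.startswith("#") or len(cols) != 9:
            out.append(text)
            continue
        fixed = []
        for key, value in parse_attributes(cols[8]):
            if key == "Name":
                seen[value] += 1
                if seen[value] > 1:
                    value = f"{value}.{seen[value]}"
            fixed.append(f"{key}={value}")
        cols[8] = ";".join(fixed)
        out.append("\t".join(cols))
    return out


# Made-up input in which the same Name appears three times.
example = [
    "##gff-version 3",
    "chr1\tfgenesh\tmRNA\t1\t100\t.\t+\t.\tID=m1;Name=mRNA_4",
    "chr2\tfgenesh\tmRNA\t5\t90\t.\t-\t.\tID=m2;Name=mRNA_4",
    "chr3\tfgenesh\tmRNA\t7\t80\t.\t+\t.\tID=m3;Name=mRNA_4",
]
fixed = uniquify_names(example)
for row in fixed:
    print(row)
```

Counting occurrences first (for example with `Counter` over all Name values) would also give the duplicate report that the reply above says would be nice to have.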
--Carson

From steinj at cshl.edu Tue Jun 12 14:31:13 2018
From: steinj at cshl.edu (Stein, Joshua)
Date: Tue, 12 Jun 2018 20:31:13 +0000
Subject: [maker-devel] Transcript & protein fasta sequence id/name collisions
In-Reply-To: <91A6DB6E-B41E-4A6A-81D0-E95CBFCE07C7@gmail.com>
References: <79E2909D-8CED-4E95-B332-4FED34F55D45@cshl.edu> <91A6DB6E-B41E-4A6A-81D0-E95CBFCE07C7@gmail.com>
Message-ID:

Hi Carson,

Thanks for identifying the problem. I see that the input fgenesh GFF3 did not use unique identifiers, so we will attack the problem there.

Best,
Josh

Joshua Stein, PhD
Manager, Sci.
Informatics III
Cold Spring Harbor Laboratory
steinj at cshl.edu
http://ware.cshl.org/

From carsonhh at gmail.com Wed Jun 13 11:46:12 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Jun 2018 11:46:12 -0600
Subject: [maker-devel] Filtering gene models based on eAED scores
In-Reply-To:
References:
Message-ID: <3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com>

The eAED score also takes protein reading frame into account, and it can infer support for exons when both flanking introns are validated (i.e. it can be lower than AED in some cases). In your case, an eAED of 1 with an AED of less than 1 means that your evidence support comes from an overlapping protein, but that it is never in the same reading frame as the gene model. So the positive evidence support may be suspect, or it may be real and the model is poor because of the assembly, gaps, etc.

To use eAED instead in the quality_filter.pl script, you would have to manually edit the script and replace '_AED' with '_eAED'. Using eAED instead will greatly reduce sensitivity on lower-quality assemblies (places where the predictors make the best model they can rather than the correct model, because the assembly won't allow for the correct model even though there is evidence of a gene locus). So make sure to always view suspect regions in a browser first.

--Carson

> On Jun 9, 2018, at 2:06 PM, Federico López wrote:
>
> Hello,
>
> I'm using MAKER's "quality_filter.pl" with the default option (AED<1). However, I have noticed cases in which models have low AED scores and high eAED scores (1.00), so presumably the good AED scores are the result of spurious evidence alignments. Is there a way to filter models based on eAED scores too?
>
> Thank you.
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
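The case described in this reply (AED below 1 while eAED is 1) can be listed from the annotation file so that the suspect loci are easy to find in a browser. The following is a minimal Python sketch under stated assumptions: it relies only on the '_AED' and '_eAED' tags named in the reply being present on mRNA lines of the merged GFF3; the function names, the sample records and the 0.5 gap threshold are illustrative choices, not part of MAKER or quality_filter.pl.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of tag -> value."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            tag, value = field.split("=", 1)
            attrs[tag] = value
    return attrs


def discordant_models(gff_lines, min_gap=0.5):
    """List (ID, AED, eAED) for mRNAs whose eAED exceeds AED by at least min_gap."""
    flagged = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # sequence section; no more features
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        if "_AED" not in attrs or "_eAED" not in attrs:
            continue  # not a scored model; skip it
        aed, eaed = float(attrs["_AED"]), float(attrs["_eAED"])
        if eaed - aed >= min_gap:
            flagged.append((attrs.get("ID", "?"), aed, eaed))
    return flagged


# Hypothetical records: the second model has protein support that is out of frame.
example = [
    "scaf1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=model-1;Parent=gene-1;_AED=0.10;_eAED=0.10\n",
    "scaf1\tmaker\tmRNA\t2000\t3500\t.\t-\t.\tID=model-2;Parent=gene-2;_AED=0.20;_eAED=1.00\n",
]
for model_id, aed, eaed in discordant_models(example):
    print(model_id, aed, eaed)
```

Flagging rather than deleting these models follows the advice above: a large gap may point to an indel in the assembly as much as to spurious evidence, so the list is meant as a starting point for manual review.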
From ss2489 at cornell.edu Wed Jun 13 13:34:27 2018
From: ss2489 at cornell.edu (Surya Saha)
Date: Wed, 13 Jun 2018 15:34:27 -0400
Subject: [maker-devel] Filtering gene models based on eAED scores
In-Reply-To: <3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com>
References: <3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com>
Message-ID:

Hi Carson,

We have been using AED as a primary metric for evaluating predictions in our group, but it sounds like we should be using both eAED and AED. Is there a detailed explanation of how exactly eAED and AED are computed besides Table 2 in the Cantarel 2008 paper? Thanks

-Surya

--
Surya Saha
Sol Genomics Network
Boyce Thompson Institute, Ithaca, NY, USA
https://citrusgreening.org/
http://www.linkedin.com/in/suryasaha
https://twitter.com/SahaSurya

From carsonhh at gmail.com Wed Jun 13 13:57:46 2018
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Jun 2018 13:57:46 -0600
Subject: [maker-devel] Filtering gene models based on eAED scores
In-Reply-To:
References: <3C47D6E0-1E36-4C5E-876B-7E343D7C9E1E@gmail.com>
Message-ID:

AED is documented in the 2011 MAKER2 paper, but eAED (extended AED) is not currently documented in a publication and is not used by any of the scripts that come with MAKER (it's just there for reference right now). Basically, AED is calculated from evidence overlap, but eAED will not count protein overlap unless it occurs in the same codon reading frame as the model (so evidence may count for a stretch, then stop counting for a few codons, then count again if there is an insertion in the alignment). Also, eAED will infer support for exons if both introns are validated by evidence and the region in between is all ORF (this allows joint intron support to infer support for an internal exon).

99% of the time AED and eAED are the same, but eAED can be useful in identifying edge cases. Much of the time, if AED and eAED are very different, it's because there is a single base pair insertion or deletion in the assembly.
The predictors still find the locus the best they can, but protein evidence and alignments will be out of sync with the reading frame on one of the exons. BLAST can't really handle single-bp indels in its alignments, but Exonerate can do mid-alignment reading frame shifts to capture the assembly indel (and eAED is an attempt to use the extra Exonerate information in the score).

--Carson

From gdolby at asu.edu Fri Jun 15 10:29:16 2018
From: gdolby at asu.edu (Greer Dolby)
Date: Fri, 15 Jun 2018 09:29:16 -0700
Subject: [maker-devel] Best annotation set failure (auto_annotator.pm line 1774)
Message-ID: <7A8554F9-3456-4D53-B5A8-80FA794F42EE@asu.edu>

Hello,

I'm de novo annotating a second-generation genome and I have a handful of scaffolds that keep failing while choosing the best annotation set (errors below). To my knowledge there is nothing wrong with the scaffolds themselves, and the others have run fine. Looking at the VOID folder for the failed contigs across different runs, they appear to fail at different chunks but always with these two errors. I'm running 30 cores with MPICH2 on a private server. Any suggestions? Thanks!
Best,
Greer

^^^^^^^^^^^^^^^^^^^^^^^^^^EXAMPLE 1
...processing 8 of 12
total clusters:44 now processing 0
...processing 0 of 3
...processing 1 of 3
...processing 2 of 3
total clusters:44 now processing 0
...processing 0 of 4
...processing 1 of 4
...processing 9 of 12
...processing 2 of 4
...processing 3 of 4
total clusters:44 now processing 0
...processing 10 of 12
...processing 0 of 67
...processing 1 of 67
ERROR: Chunk failed at level:6, tier_type:0
...processing 2 of 67
FAILED CONTIG:ScCC6lQ_38465;HRSCAF=64658

^^^^^^^^^^^^^^^^^^^^^^^^^^EXAMPLE 2
...processing 9 of 298
...processing 8 of 81
...processing 11 of 202
...processing 13 of 20
...processing 10 of 298
...processing 9 of 81
...processing 10 of 81
...processing 18 of 123
...processing 14 of 20
...processing 17 of 54
...processing 18 of 54
...processing 37 of 164
...processing 20 of 254
Died at /home/jcornel3/tools/maker/bin/../lib/maker/auto_annotator.pm line 1774.
--> rank=17, hostname=omega
ERROR: Failed while choosing best annotation set
ERROR: Chunk failed at level:4, tier_type:4
FAILED CONTIG:ScCC6lQ_16796;HRSCAF=38896

_________________________________
Greer Dolby, PhD
Postdoctoral Research Scholar
SoLS, Arizona State U.
office: LSE 313, 480.965.7456
website | twitter
Kusumi Lab

From kapeelc at gmail.com Fri Jun 22 13:41:58 2018
From: kapeelc at gmail.com (Kapeel Chougule)
Date: Fri, 22 Jun 2018 15:41:58 -0400
Subject: [maker-devel] map_forward=1 not mapping reference ID's to output correctly
Message-ID:

Hi,

I am trying to update community annotation in the light of new evidence data but my MAKER runs are not keeping all the genes from the community annotation.
Community annotation feature count:

     2
     1 bicolor
239969 CDS
266301 exon
 51066 five_prime_UTR
 34129 gene
 47121 mRNA
 53708 three_prime_UTR

MAKER gene count:

awk '$3=="gene"{print}' maker_output.all.gff | grep "Sobic*" | wc -l
21105

In the attached maker_opts.ctl file, I set keep_preds=1 and map_forward=1, which should keep all the community gene models even if they don't have evidence support. This was explained here: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Updating_annotations_in_light_of_new_data . So I am not sure why not all of the community gene models are mapped in the MAKER output.

Thanks
Kapeel

--
Kapeel Chougule
Computational Scientist Developer II
One Bungtown Road
Cold Spring Harbor, NY 11724
http://www.warelab.org/

-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4991 bytes
Desc: not available
URL:

From monica.poelchau at ars.usda.gov Fri Jun 22 14:04:28 2018
From: monica.poelchau at ars.usda.gov (Poelchau, Monica)
Date: Fri, 22 Jun 2018 20:04:28 +0000
Subject: [maker-devel] [CAUTION: Suspicious Link] map_forward=1 not mapping reference ID's to output correctly
In-Reply-To:
References:
Message-ID:

Hi Kapeel,

If you just want your community annotations to replace models in an existing gene set, we have a tool for this: https://github.com/NAL-i5K/GFF3toolkit

You'd need to run gff3_QC on your annotation files first to make sure your annotations are okay, then use gff3_merge to merge your community annotations with your existing gene set (in gff3 format).

If you end up trying this out - we're actively developing the GFF3toolkit, so feel free to post an issue if you notice any problems.
Hth,
Monica
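Before merging or rerunning, it may help to see exactly which community genes went missing rather than only comparing counts (34129 against 21105 in the thread above). The following is a minimal Python sketch, not a MAKER or GFF3toolkit function: it assumes the forwarded identifier shows up in either the ID or the Name tag of 'gene' features in the MAKER output, and the function names and Sobic-style identifiers in the sample records are made up for illustration.

```python
def gene_records(gff_lines):
    """Yield the attribute dict of every 'gene' feature in a GFF3 stream."""
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # sequence section; no more features
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == "gene":
            yield dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)


def missing_genes(reference_lines, maker_lines):
    """Return sorted reference gene IDs with no matching ID or Name in the MAKER output."""
    maker_labels = set()
    for attrs in gene_records(maker_lines):
        maker_labels.update(v for k, v in attrs.items() if k in ("ID", "Name"))
    missing = []
    for attrs in gene_records(reference_lines):
        labels = {attrs.get("ID"), attrs.get("Name")} - {None}
        if not labels & maker_labels:
            missing.append(attrs.get("ID", "?"))
    return sorted(missing)


# Hypothetical records: one reference gene is carried forward, one is not.
reference = [
    "##gff-version 3\n",
    "Chr01\tphytozome\tgene\t1000\t5000\t.\t+\t.\tID=Sobic.001G000100;Name=Sobic.001G000100\n",
    "Chr01\tphytozome\tgene\t9000\t12000\t.\t-\t.\tID=Sobic.001G000200;Name=Sobic.001G000200\n",
]
maker = [
    "##gff-version 3\n",
    "Chr01\tmaker\tgene\t1000\t5000\t.\t+\t.\tID=gene-0.1;Name=Sobic.001G000100\n",
]
print(missing_genes(reference, maker))
```

With real data the two lists would be replaced by open file handles for the community GFF3 and the merged MAKER GFF3. Looking at a few of the reported loci in a browser should show whether the models were dropped or were merged into a differently named gene.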
From andremmachado25 at gmail.com Tue Jun 26 09:36:24 2018
From: andremmachado25 at gmail.com (André Machado)
Date: Tue, 26 Jun 2018 16:36:24 +0100
Subject: [maker-devel] Maker Error : Thread 1 terminated abnormally..
Message-ID:

Hi,

First of all, thanks for your efforts on the MAKER pipeline. It is a tremendous help for people who work with genomes.

For the last 4 days I have been struggling with an error, and I am still without a solution. I found this old thread: https://groups.google.com/forum/#!msg/maker-devel/X2-76BH9gvg/rU4kLJ3B6tsJ
It seems to be quite similar, but it doesn't point to a specific solution.

I ran MAKER with the test data and everything ran OK; MAKER finished the entire process without errors. Recently I have been trying to apply my own data on an MPI cluster, but this error occurs frequently:

Thread 1 terminated abnormally: ../dna.maker.output/mpi_blastdb/dna%2Efa.mpi.1/dna%2Efa.mpi.1.0
--> rank=8, hostname=compute-0-1.local, at ../Analysis/Geno/maker/bin/maker line 1451 thread 1.
--> rank=8, hostname=compute-0-1.local
deleted:0 hits
deleted:0 hits
preparing ab-inits
deleted:0 hits
deleted:0 hits
FATAL: Thread terminated, causing all processes to fail
--> rank=8, hostname=compute-0-1.local
deleted:0 hits

Basically, I'm trying to run MAKER with dna.fa, rna.fa, prot.fa and my_custom_lib_of_repeats.fa to produce raw gene models, which will be used to train SNAP. I have already tried several command lines and all of them gave me the same error. The only difference between tests was the location of the error: sometimes it happened on compute-0-1.local, other times on compute-0-4.local or on another node.

mpiexec -n 63 --hostfile Host maker 1>1.log 2>2.err
mpiexec --hostfile Host maker 1>1.log 2>2.err
mpiexec -mca btl ^openib -n 63 --hostfile Host maker 1>1.log 2>2.err
nohup mpiexec -mca btl ^openib -n 63 --hostfile Host maker -a 1>1.log 2>2.err

The log file as well as the option files are provided below.

Many thanks in advance,
André
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2.log
Type: text/x-log
Size: 38655 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_exe.ctl
Type: application/octet-stream
Size: 1224 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4548 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_bopts.ctl
Type: application/octet-stream
Size: 1413 bytes
Desc: not available
URL: