From willett4 at email.unc.edu Fri Sep 1 10:22:34 2017
From: willett4 at email.unc.edu (Willett, Christopher S)
Date: Fri, 1 Sep 2017 15:22:34 +0000
Subject: [maker-devel] ERROR: Can't call method "start" on an undefined value at ../lib/maker/join.pm line 535
Message-ID:

Hi Everyone-

I was wondering if anyone had any ideas about this error, which is causing some of our contigs to fail but not others. Here is the error message:

"Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535."

This comes up in either the augustus or the snap portion of the analysis. See below for the compiled errors from 4 different contigs. They are from two different runs, in each of which I tried excluding a couple of different parts to see if I could isolate the problem. The errors look slightly different, but the same contigs fail in both runs.

We recently improved our genome and are trying to reannotate it. I am using a MAKER setup very similar to the one I used previously with an earlier version of the genome, where no contigs failed, but now 8 of the 12 large contigs (sizes ~18Mb) are failing. The MAKER version is 2.31.8.

If anyone has any thoughts on what the issue might be and how I could resolve it, I would appreciate it (or let me know if I should provide more information).

Thanks,

Best,

Chris Willett

error 48600

#--------- command -------------#
Widget::augustus:
/nas02/apps/maker-2.31.8/src/augustus-3.2.2/bin/augustus --species=Tigriopus_californicus --strand=forward --UTR=off --hintsfile=/tmp/maker_H_uEHb/0/19_0.1142953-1147539.Tigriopus_californicus.auto_annotator.xdef.augustus --extrinsicCfgFile=/nas02/apps/maker-2.31.8/src/augustus-3.2.2/config/extrinsic/extrinsic.MPE.cfg /tmp/maker_H_uEHb/0/19_0.1142953-1147539.Tigriopus_californicus.auto_annotator.augustus.fasta > /tmp/maker_H_uEHb/0/19_0.1142953-1147539.Tigriopus_californicus.auto_annotator.augustus
#-------------------------------#
deleted:0 genes
...processing 0 of 5
...processing 1 of 5
...processing 2 of 5
...processing 3 of 5
...processing 4 of 5
Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
--> rank=NA, hostname=c-195-51.kd.unc.edu
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Chromosome_3

ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Chromosome_3

error 48599

Widget::augustus:
/nas02/apps/maker-2.31.8/src/augustus-3.2.2/bin/augustus --species=Tigriopus_californicus --strand=forward --UTR=off --hintsfile=/tmp/maker_TQWBu_/0/49_0.7061374-7062552.Tigriopus_californicus.auto_annotator.xdef.augustus --extrinsicCfgFile=/nas02/apps/maker-2.31.8/src/augustus-3.2.2/config/extrinsic/extrinsic.MPE.cfg /tmp/maker_TQWBu_/0/49_0.7061374-7062552.Tigriopus_californicus.auto_annotator.augustus.fasta > /tmp/maker_TQWBu_/0/49_0.7061374-7062552.Tigriopus_californicus.auto_annotator.augustus
#-------------------------------#
deleted:0 genes
...processing 0 of 10
...processing 1 of 10
...processing 2 of 10
...processing 3 of 10
...processing 4 of 10
...processing 5 of 10
...processing 6 of 10
...processing 7 of 10
...processing 8 of 10
...processing 9 of 10
Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
--> rank=NA, hostname=c-195-51.kd.unc.edu
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Chromosome_11

ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Chromosome_11

error 48592

#--------- command -------------#
Widget::snap:
/nas02/apps/maker-2.31.8/src/maker/exe/snap/snap -plus /proj/willetlb/users/cwillett/MAKER_analyses/SD_full/PB11_12_eukscafs_sp3_sum_hm1.hmm -xdef /tmp/maker_WADb1j/0/38_0.1656654-1673307.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.xdef.snap /tmp/maker_WADb1j/0/38_0.1656654-1673307.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap.fasta > /tmp/maker_WADb1j/0/38_0.1656654-1673307.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap
#-------------------------------#
scoring....decoding.10.20.30.40.50.60.70.80.90.100 done
deleted:0 genes
...processing 0 of 5
...processing 1 of 5
...processing 2 of 5
...processing 3 of 5
...processing 4 of 5
Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
--> rank=NA, hostname=c-193-25.kd.unc.edu
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Chromosome_5

ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Chromosome_5

error 47069

#--------- command -------------#
Widget::snap:
/nas02/apps/maker-2.31.8/src/maker/exe/snap/snap -plus /proj/willetlb/users/cwillett/MAKER_analyses/SD_full/PB11_12_eukscafs_sp3_sum_hm1.hmm -xdef /tmp/maker__PSNX1/0/53_0.6154539-6169567.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.xdef.snap /tmp/maker__PSNX1/0/53_0.6154539-6169567.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap.fasta > /tmp/maker__PSNX1/0/53_0.6154539-6169567.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap
#-------------------------------#
scoring....decoding.10.20.30.40.50.60.70.80.90.100 done
deleted:0 genes
...processing 0 of 10
...processing 1 of 10
...processing 2 of 10
...processing 3 of 10
...processing 4 of 10
...processing 5 of 10
...processing 6 of 10
...processing 7 of 10
...processing 8 of 10
...processing 9 of 10
Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
--> rank=NA, hostname=c-183-35.kd.unc.edu
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Chromosome_12

ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Chromosome_12

Can't call method "start" on an undefined value at ../lib/maker/join.pm line 535.

From chzelin at gmail.com Tue Sep 5 08:59:09 2017
From: chzelin at gmail.com (zl c)
Date: Tue, 5 Sep 2017 09:59:09 -0400
Subject: [maker-devel] MSG: Can't get HSPs: data not collected.
Message-ID:

Hello,

I can run MAKER successfully on most sequences, but it fails on some long sequences.
The error is:

Widget::tblastx:
/usr/local/apps/blast/ncbi-blast-2.5.0+/bin/tblastx -db db.778415-832259.for_tblastx.fasta -query ...778415.832259.0 -num_alignments 10000 -num_descriptions 10000 -evalue 1e-10 -dbsize 1000 -searchsp 500000000 -num_threads 16 -lcase_masking -seg yes -soft_masking true -show_gis -out OUT.tblastx
#-------------------------------#

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Can't get HSPs: data not collected.
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/local/Perl/5.18.2/lib/perl5/site_perl/5.18.2/Bio/Root/Root.pm:486
STACK: Bio::Search::Hit::PhatHit::Base::hsps /spin1/home/linux/chenz11/program/maker/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm:552
STACK: Widget::tblastx::keepers /spin1/home/linux/chenz11/program/maker/bin/../lib/Widget/tblastx.pm:192
STACK: Widget::tblastx::parse /spin1/home/linux/chenz11/program/maker/bin/../lib/Widget/tblastx.pm:133
STACK: GI::tblastx /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:3251
STACK: GI::tblastx /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:3260
STACK: GI::reblast_merged_hits /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:471
STACK: GI::merge_resolve_hits /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:291
STACK: Process::MpiChunk::_go /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:2320
STACK: Process::MpiChunk::run /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:340
STACK: Process::MpiChunk::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:356
STACK: Process::MpiTiers::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiTiers.pm:287
STACK: Process::MpiTiers::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiTiers.pm:287
STACK: /home/chenz11/program/maker/bin/maker:695
-----------------------------------------------------------
--> rank=NA, hostname=cn3544
--> rank=NA, hostname=cn3544
--> rank=NA, hostname=cn3544
--> rank=NA, hostname=cn3544
ERROR: Failed while collecting tblastx reports
ERROR: Chunk failed at level:5, tier_type:3
FAILED CONTIG:tig00011625_arrow

ERROR: Chunk failed at level:4, tier_type:0
FAILED CONTIG:tig00011625_arrow

examining contents of the fasta file and run log

I've read a related thread on the Google group and checked my tblastx output. I found that there should be more than 1,000,000 HSPs, but only 1,000,000 were output, which leaves some alignments with no HSPs. Is there any setting that could solve the problem?

Thanks,
Zelin

--------------------------------------------
Zelin Chen [chzelin at gmail.com]

NIH/NHGRI
Building 50, Room 5531
50 SOUTH DR, MSC 8004
BETHESDA, MD 20892-8004

From qwzhang0601 at gmail.com Tue Sep 5 15:24:34 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Tue, 5 Sep 2017 16:24:34 -0400
Subject: [maker-devel] Some errors reported by Maker2
Message-ID:

Hello:

We are doing genome annotation for a new rodent species. We finished training the ab initio gene predictors successfully with the following settings: split_hit=40000, max_dna_len=1000000, and 99k mammalian Swiss protein sequences as evidence.

But when I used the trained model to do the genome annotation, I got the following kinds of errors (shown in red). I used the same parameters as for training, except that I added 340k rodent TrEMBL protein sequences as protein evidence (i.e., I use both the 99k mammalian Swiss protein sequences and the 340k rodent TrEMBL protein sequences).

I am doing the annotation on a cluster and started multiple MAKER instances in the same directory (I had tried to use MPI but ran into some problems).

Do you have any suggestions? Many thanks

#some kinds of errors
open3: fork failed: Cannot allocate memory at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
--> rank=NA, hostname=n520
ERROR: Failed while doing blastx of proteins
ERROR: Chunk failed at level:8, tier_type:3
FAILED CONTIG:Contig2

setting up GFF3 output and fasta chunks
doing repeat masking
Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
--> rank=NA, hostname=n513
ERROR: Failed while doing repeat masking
ERROR: Chunk failed at level:0, tier_type:1
FAILED CONTIG:Contig12378

Best
Quanwei

From carsonhh at gmail.com Tue Sep 5 15:56:01 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 5 Sep 2017 14:56:01 -0600
Subject: [maker-devel] ERROR: Can't call method "start" on an undefined value at ../lib/maker/join.pm line 535
In-Reply-To:
References:
Message-ID: <7DCB519E-9AFA-4D10-8046-72DE99C5E4FF@gmail.com>

Did you use GFF3 input to MAKER for any steps (for example, pred_gff or est_gff)?

--Carson

From carsonhh at gmail.com Tue Sep 5 16:48:56 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 5 Sep 2017 15:48:56 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To:
References:
Message-ID: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>

You ran out of memory. You probably set max_dna_len too high for the machines you are using. There is a note in the maker_opts.ctl file that tells you that this value affects memory usage.

So you can either set it lower or, if running under MPI, use fewer CPUs per node (how you do this depends on the MPI flavor, but some flavors let you do it by setting a lower process count combined with the round robin option).

--Carson

From carsonhh at gmail.com Tue Sep 5 17:04:00 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 5 Sep 2017 16:04:00 -0600
Subject: [maker-devel] MSG: Can't get HSPs: data not collected.
In-Reply-To:
References:
Message-ID: <846D5971-E6EE-40F3-AF26-3124AA370029@gmail.com>

The last time I saw this error, it was with blast-2.5.1+. I am not sure if the failure you are seeing is the same one. It was caused by a truncated BLAST report (so some results have no alignments). I have an open ticket with the BLAST development group. I never received confirmation that it is fixed, but you can try updating to 2.6 and see if that fixes it. If not, switch to legacy BLAST (not BLAST+) and see if it goes away.

--Carson

From qwzhang0601 at gmail.com Tue Sep 5 17:04:23 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Tue, 5 Sep 2017 18:04:23 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
References: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
Message-ID:

Dear Carson:

Thanks. I wonder whether a smaller "max_dna_len" will split longer scaffolds. I set max_dna_len to 1Mb because there are quite a few long scaffolds (e.g., the longest one is about 100Mb). Could you explain whether a smaller "max_dna_len" will decrease the quality of the annotation (e.g., split some genes within the same scaffold)?

Best
Quanwei

From carsonhh at gmail.com Tue Sep 5 17:08:28 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 5 Sep 2017 16:08:28 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To:
References: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
Message-ID: <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>

max_dna_len is the window size for keeping data in RAM. Smaller values do not split genes.
But values lower than 100kb can create issues (if a single gene model spans 3 or more windows, it creates a weird failure).

--Carson

From qwzhang0601 at gmail.com Wed Sep 6 10:51:54 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 6 Sep 2017 11:51:54 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
References: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com> <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
Message-ID:

Dear Carson:

(1) Thank you for your explanation. I will try setting max_dna_len to 400kb for our rodent species, which is a little higher than the value suggested for large vertebrate genomes (the MAKER manual says "300,000 is a good max_dna_len on large vertebrate genomes if memory is not a limiting factor").

(2) From reading some of your replies in the MAKER Google group, I noticed that setting depth_blast to a certain number can reduce memory use and save annotation time. So I changed the following parameters. But I wonder whether this will decrease the quality of the annotation. If it won't affect the quality, can I use an even smaller number (e.g., 20) to save more memory and time?
depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking

(3) I also have some concerns about speed, especially for the long scaffolds (around 100Mb). Which part of genome annotation is the most time-consuming (repeat masking, BLAST, or polishing)? In particular, I wonder whether the blastx of protein evidence will take the majority of the time. I have prepared 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences as protein evidence. I am considering whether I could save much time by using only the 99k mammalian Swiss protein sequences as evidence.

(4) For some reason, I cannot run MAKER through MPI on our cluster, so I can only start multiple MAKER instances. Is it possible to let multiple MAKER instances annotate the same long scaffold (i.e., start multiple instances for a single sequence, without splitting the long sequence into shorter ones)?

(5) Still about the speed issue. I read some of your comments about the "cpus" parameter in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html), and I understand that it indicates the number of CPUs for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?

**************** the bash file used to submit the maker job
#!/bin/bash
#$ -cwd
#$ -S /bin/bash
#$ -j y
#$ -N makerT2
#$ -l h_vmem=8g
#$ -pe smp 2
module load MAKER/2.31.9/perl.5.22.1
maker --q 2> maker_test.error

Many thanks

Best
Quanwei

2017-09-05 18:08 GMT-04:00 Carson Holt :

> max_dna_len is the window size for keeping data in RAM. Smaller values do not split genes. But values lower than 100kb can create issues (if a single gene model spans 3 or more windows, it creates a weird failure).
>
> --Carson
>
> On Sep 5, 2017, at 4:04 PM, Quanwei Zhang wrote:
>
> Dear Carson:
>
> Thanks. I wonder whether smaller "max_dna_len" will split longer scaffolds. I set max_dna_len as 1Mb, because there are quite many long scaffolds (e.g., the longest one is about 100Mb). Would you explain whether smaller "max_dna_len" will decrease the quality of annotation (e.g., split some genes in the same scaffold)?
>
> Best
> Quanwei
>
> 2017-09-05 17:48 GMT-04:00 Carson Holt :
>
>> You ran out of memory. You probably set max_dna_len too high for the machines you are using. There is a note in the maker_opts.ctl file that tells you that this value affects memory usage.
>>
>> So you can either set it lower, or if running under MPI, use fewer CPUs per node (how you do this is MPI flavor dependent, but some flavors let you do this by setting process count lower combined with the round robin option).
>>
>> --Carson
>>
>> On Sep 5, 2017, at 2:24 PM, Quanwei Zhang wrote:
>>
>> Hello:
>>
>> We are doing genome annotation for a new rodent species. We have finished the training of the ab initio gene predictors successfully by setting the following parameters (split_hit=40000, max_dna_len=1000000, and 99k mammalian Swiss protein sequences as evidences).
>>
>> But when I used the trained model to do the genome annotation, I got the following kinds of errors (shown in red). I used the same parameters as those for training, except for addition of 340k rodent TrEMBL protein sequences for protein evidences (i.e., I use both 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences).
>>
>> I am doing the annotation on a cluster and started multiple Maker in the same directory (I had tried to use MPI but met some problems).
>>
>> Do you have any suggestions? Many thanks
>> #some kinds of errors
>> open3: fork failed: Cannot allocate memory at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
>> --> rank=NA, hostname=n520
>> ERROR: Failed while doing blastx of proteins
>> ERROR: Chunk failed at level:8, tier_type:3
>> FAILED CONTIG:Contig2
>>
>> setting up GFF3 output and fasta chunks
>> doing repeat masking
>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
>> --> rank=NA, hostname=n513
>> ERROR: Failed while doing repeat masking
>> ERROR: Chunk failed at level:0, tier_type:1
>> FAILED CONTIG:Contig12378
>>
>> Best
>> Quanwei

From carsonhh at gmail.com Wed Sep 6 11:06:46 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 6 Sep 2017 10:06:46 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To:
References: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com> <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
Message-ID: <9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>

> (2) By reading some of your replies in the maker google group, and I noticed that it can reduce memory and save time for annotation if I set depth_blast to a certain number. So I changed the following parameters. But I wonder, whether it will decrease the quality of annotation? If it won't affect the quality, can I even use a smaller number (e.g., 20) to save more memory and time?
>
> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking

These values really only affect the final evidence kept in the GFF3 when you look at it in a browser. They have no effect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference.
By setting them lower, you are just letting MAKER know it can throw things away even sooner since you don't want them in the GFF3. This provides a minor improvement in memory use, but max_dna_len is the setting with the greatest effect.

> (3) I also have some concerns about the speed, especially for the long scaffolds (around 100Mb). I wonder which part is the most time consuming for genome annotation (repeat masking, blast, or polishing?). Particularly, I wonder whether the blastx of protein evidence will take majority of time. Now, I have prepared 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences as protein evidences. I am considering whether I can save much time if I only use the 99k mammalian Swiss protein sequences as evidences.

BLASTN (ESTs) -> fastest as it is searching nucleotide space
BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN
TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX

Also, doubling the dataset size doubles the runtime. Larger window sizes via max_dna_len will also increase runtimes.

> (4) For some reasons, I can not run maker though MPI on our cluster. So I can only start multiple maker. I wonder if it is possible to let multiple maker to annotate the same long scaffold (i.e., for a single sequence I start multiple maker, without splitting the long sequence into shorter ones).

Without MPI you won't be able to split up large contigs. At the very least you can try to run on a single node and set MPI to use all CPUs on that node. That is less difficult to set up than cross-node jobs via MPI.

> (5) Still about the speed issue. I read some of your comments about "cpus" parameters in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html). And I know it indicate the number of cpus for a single chunk.
> So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?

The cpus parameter only affects how many CPUs are given to the BLAST command line, so only the BLAST step will speed up. I recommend using MPI to get all steps to speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.

--Carson

From carsonhh at gmail.com Thu Sep 7 10:12:46 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 7 Sep 2017 09:12:46 -0600
Subject: [maker-devel] MSG: Can't get HSPs: data not collected.
In-Reply-To:
References: <846D5971-E6EE-40F3-AF26-3124AA370029@gmail.com>
Message-ID: <2B046506-1E32-4840-B3B6-6DABB4A5D4C2@gmail.com>

I'm glad it fixed it.

--Carson

> On Sep 6, 2017, at 8:27 PM, zl c wrote:
>
> Hi Carson,
>
> I try blast-2.6.0+ and it works. Thank you very much.
>
> Thanks
> Zelin Chen
>
> --------------------------------------------
> Zelin Chen [chzelin at gmail.com]
>
> NIH/NHGRI
> Building 50, Room 5531
> 50 SOUTH DR, MSC 8004
> BETHESDA, MD 20892-8004
>
> On Tue, Sep 5, 2017 at 6:04 PM, Carson Holt wrote:
> The last time I saw this error, it was with blast-2.5.1+. Not sure if the failure you are seeing is the same one. It was caused by a truncated BLAST report (so some results have no alignments). I have an open ticket with the BLAST development group. I never received confirmation that it is fixed, but you can try updating to 2.6 and see if that fixes it. If not, switch to legacy BLAST (not blast plus) and see if it goes away.
>
> --Carson
>
>> On Sep 5, 2017, at 7:59 AM, zl c wrote:
>>
>> Hello,
>>
>> I run maker for most sequences successfully but fail some long sequences.
>> The error is:
>>
>> Widget::tblastx:
>> /usr/local/apps/blast/ncbi-blast-2.5.0+/bin/tblastx -db db.778415-832259.for_tblastx.fasta -query ...778415.832259.0 -num_alignments 10000 -num_descriptions 10000 -evalue 1e-10 -dbsize 1000 -searchsp 500000000 -num_threads 16 -lcase_masking -seg yes -soft_masking true -show_gis -out OUT.tblastx
>> #-------------------------------#
>>
>> ------------- EXCEPTION: Bio::Root::Exception -------------
>> MSG: Can't get HSPs: data not collected.
>> STACK: Error::throw
>> STACK: Bio::Root::Root::throw /usr/local/Perl/5.18.2/lib/perl5/site_perl/5.18.2/Bio/Root/Root.pm:486
>> STACK: Bio::Search::Hit::PhatHit::Base::hsps /spin1/home/linux/chenz11/program/maker/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm:552
>> STACK: Widget::tblastx::keepers /spin1/home/linux/chenz11/program/maker/bin/../lib/Widget/tblastx.pm:192
>> STACK: Widget::tblastx::parse /spin1/home/linux/chenz11/program/maker/bin/../lib/Widget/tblastx.pm:133
>> STACK: GI::tblastx /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:3251
>> STACK: GI::tblastx /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:3260
>> STACK: GI::reblast_merged_hits /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:471
>> STACK: GI::merge_resolve_hits /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:291
>> STACK: Process::MpiChunk::_go /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:2320
>> STACK: Process::MpiChunk::run /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:340
>> STACK: Process::MpiChunk::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:356
>> STACK: Process::MpiTiers::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiTiers.pm:287
>> STACK: Process::MpiTiers::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiTiers.pm:287
>> STACK: /home/chenz11/program/maker/bin/maker:695
>> -----------------------------------------------------------
>> --> rank=NA, hostname=cn3544
>> --> rank=NA, hostname=cn3544
>> --> rank=NA, hostname=cn3544
>> --> rank=NA, hostname=cn3544
>> ERROR: Failed while collecting tblastx reports
>> ERROR: Chunk failed at level:5, tier_type:3
>> FAILED CONTIG:tig00011625_arrow
>>
>> ERROR: Chunk failed at level:4, tier_type:0
>> FAILED CONTIG:tig00011625_arrow
>>
>> examining contents of the fasta file and run log
>>
>> I've read a related thread on the Google group and checked my tblastx output. I found that the number of HSPs should be larger than 1,000,000, but only 1,000,000 were output, which leaves some alignments with no HSPs. Is there any setting that could solve the problem?
>>
>> Thanks,
>> Zelin
>>
>> --------------------------------------------
>> Zelin Chen [chzelin at gmail.com]
>>
>> NIH/NHGRI
>> Building 50, Room 5531
>> 50 SOUTH DR, MSC 8004
>> BETHESDA, MD 20892-8004
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From qwzhang0601 at gmail.com Fri Sep 8 22:25:29 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Fri, 8 Sep 2017 23:25:29 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
References: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com> <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com> <9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
Message-ID:

Dear Carson:

I got the following error again. Is this still related to memory issues, or could there be other reasons for this error? This time, I got the error while training the SNAP model. Previously, even with max_dna_len=1Mb, I could train the model successfully. In the current training (where I get the following error), I have decreased max_dna_len to 300kb. I requested the same amount of memory as before.
The only difference is that I am now using both the mammalian repeat library and a species-specific repeat library, while previously I only used the mammalian repeat library. Will using both repeat libraries greatly increase the memory requirement (even when I decrease max_dna_len from 1Mb to 300kb)? I have also set the depth_blast parameters to 30 in the current training.

Thank you! Have a nice weekend!

#---------------------------------------------------------------------
Now starting the contig!!
SeqID: Contig10
Length: 18773588
#---------------------------------------------------------------------

setting up GFF3 output and fasta chunks
doing repeat masking
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
collecting blastx repeatmasking
processing all repeats
doing repeat masking
Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
--> rank=NA, hostname=n224
ERROR: Failed while doing repeat masking
ERROR: Chunk failed at level:0, tier_type:1
FAILED CONTIG:Contig10

ERROR: Chunk failed at level:2, tier_type:0
FAILED CONTIG:Contig10

Best
Quanwei

From xvazquezc at gmail.com Sun Sep 10 20:03:11 2017
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Mon, 11 Sep 2017 11:03:11 +1000
Subject: [maker-devel] augustus underpredicting
Message-ID:

Hi,

I have been annotating a fungal genome as usual, using BUSCO-trained Augustus (in addition to GeneMark and SNAP), but for some reason Augustus is predicting a mere 207 genes compared to 15-20k from the other two.
I've never had this problem. The genome has an unusually high repeat content, close to 50%; I am not sure whether that might pose a problem.
Has anybody come across a similar issue?
I also asked the BUSCO developers if they have any idea: https://gitlab.com/ezlab/busco/issues/49

Cheers,
Xabi

--
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From qwzhang0601 at gmail.com Mon Sep 11 11:19:50 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 11 Sep 2017 12:19:50 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To:
References: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com> <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com> <9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
Message-ID:

Dear Carson:

About the error in my previous email: I found that the contig was correctly annotated on the second try (RETRY), so please ignore my last email. But now, for a small number of scaffolds, I have problems processing the repeats (as shown below in red). I used both the Mammalia repeat library and a species-specific repeat library (generated with your pipeline, "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic"). There were no such problems when I only used the Mammalia repeat library. Do you have any ideas about this? What could be the reason? Or do you have any suggestions for how I could find the reason?
Many thanks

Here are some parameters I used

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe

max_dna_len=300000
split_hit=40000
depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking

Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
--> rank=NA, hostname=n409
ERROR: Failed while processing all repeats
ERROR: Chunk failed at level:3, tier_type:1
FAILED CONTIG:Contig31

Best
Quanwei

From carsonhh at gmail.com Mon Sep 11 11:48:16 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 10:48:16 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To:
References: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com> <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com> <9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
Message-ID: <5C2477A3-CDBA-458A-95CA-E6DC912417B3@gmail.com>

It may be a memory issue or an IO issue. Some resource is being taxed and creating a non-responsive bottleneck. If you are running MAKER multiple times in the same directory, you may have to run fewer processes. Also, if you are running without MPI, run with MPI instead, as it will better manage the parallelization and use fewer resources than multiple individual processes.

--Carson

> On Sep 8, 2017, at 9:25 PM, Quanwei Zhang wrote:
>
> Dear Carson:
>
> I got the following error again. Is this still related to memory issues? I wonder whether there can be other reasons lead to this error? This time, I got this error during training of the SNAP model. Before, even I set max_dna_len=1Mb, I can train the model successfully. And in the current training (where I get the following error), I have decreased the max_dna_len to 300kb. I required the same amount memory as before.

From carsonhh at gmail.com Mon Sep 11 11:50:41 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 10:50:41 -0600
Subject: [maker-devel] augustus underpredicting
In-Reply-To:
References:
Message-ID:

BUSCO may be generating too few models. BUSCO also identifies classes of conserved short genes that may not represent enough training diversity for your organism. Try running MAKER in protein2genome or est2genome mode, and then train with those results.

--Carson

From carsonhh at gmail.com Mon Sep 11 12:07:12 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 11:07:12 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To:
References: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com> <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com> <9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
Message-ID: <92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>

I think the cause of the error may have been a little further upstream from what you pasted in the e-mail. One thing that may be happening is that you are taxing resources (like IO) by running MAKER multiple times or on too many CPUs. That can lead to failures because of truncated BLAST reports etc. In that case you can just retry, and that will get around those types of IO-derived errors. MAKER can generate a lot of IO, and if you are working on network-mounted locations (i.e. the storage being used is actually across the network), they can be less robust than local storage (under heavy load, NFS can falsely report success on read/write operations that actually failed). That is the reason we built the retry capabilities into MAKER. For contigs that continuously fail, you may need to set clean_try=1. That will cause failures to start from scratch (i.e. delete all old reports on failure rather than just those suspected of being truncated).
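
The retry behaviour described here is controlled from maker_opts.ctl. The fragment below is a minimal sketch, assuming the option names and comments match the control file generated by MAKER 2.31.x; only clean_try=1 comes from this thread, and the tries value is shown purely for illustration, so check both against your own maker_opts.ctl before relying on them.

```
#-----MAKER Behavior Options (fragment of maker_opts.ctl; names assumed from MAKER 2.31.x)
tries=2 #number of times to try a contig if there is a failure for some reason
clean_try=1 #remove all data from previous run before retrying, 1 = yes, 0 = no
```

With clean_try=1 a retried contig is recomputed from scratch, which costs extra runtime but avoids reusing reports that an overloaded filesystem may have truncated.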
--Carson

> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang wrote:
>
> Dear Carson:
>
> About the error in my previous email: I found the contig was correctly annotated on the second RETRY, so please ignore my last email. But now, for a small number of scaffolds, I have run into problems processing the repeats (as shown below). I used both the Mammalia repeat library and a species-specific repeat library (generated by your pipeline, http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic). There were no such problems when I used only the Mammalia repeat library. Do you have any ideas about this? What could be the reason? Or do you have any suggestions for how I could find the reason? Many thanks.
>
> Here are some parameters I used:
>
> #-----Repeat Masking (leave values blank to skip repeat masking)
> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
>
> max_dna_len=300000
> split_hit=40000
> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
> 33708 --> rank=NA, hostname=n409
> 33709 ERROR: Failed while processing all repeats
> 33710 ERROR: Chunk failed at level:3, tier_type:1
> 33711 FAILED CONTIG:Contig31
>
> Best
> Quanwei
>
> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang:
> Dear Carson:
>
> I got the following error again. Is this still related to memory issues? I wonder whether there could be other reasons that lead to this error. This time, I got the error during training of the SNAP model. Before, even when I set max_dna_len=1Mb, I could train the model successfully.
> And in the current training (where I get the following error), I have decreased max_dna_len to 300kb. I requested the same amount of memory as before. The only difference is that I am using both the mammalian repeat library and a species-specific repeat library, while previously I used only the mammalian repeat library. Will using both repeat libraries greatly increase the memory requirement (even when I decrease max_dna_len from 1Mb to 300kb)? I have also set depth_blast to 30 in the current training.
>
> Thank you! Have a nice weekend!
>
> #---------------------------------------------------------------------
> Now starting the contig!!
> SeqID: Contig10
> Length: 18773588
> #---------------------------------------------------------------------
>
> setting up GFF3 output and fasta chunks
> doing repeat masking
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> collecting blastx repeatmasking
> processing all repeats
> doing repeat masking
> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
> --> rank=NA, hostname=n224
> ERROR: Failed while doing repeat masking
> ERROR: Chunk failed at level:0, tier_type:1
> FAILED CONTIG:Contig10
>
> ERROR: Chunk failed at level:2, tier_type:0
> FAILED CONTIG:Contig10
>
> Best
> Quanwei
>
> 2017-09-06 12:06 GMT-04:00 Carson Holt:
>
>> (2) From reading some of your replies in the maker google group, I noticed that setting depth_blast to a certain number can reduce memory use and save time for annotation. So I changed the following parameters. But I wonder whether it will decrease the quality of the annotation? If it won't affect the quality, can I use an even smaller number (e.g., 20) to save more memory and time?
>>
>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>
> These values really only affect the final evidence kept in the GFF3 when you look at it in a browser. They have no effect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. By setting it lower, you are just letting MAKER know it can throw things away even sooner since you don't want them in the GFF3. It provides a minor improvement for memory use, but max_dna_len is the big one that has the greatest effect.
>
>> (3) I also have some concerns about the speed, especially for the long scaffolds (around 100Mb). I wonder which part is the most time-consuming in genome annotation (repeat masking, BLAST, or polishing?). In particular, I wonder whether the blastx of protein evidence will take the majority of the time. I have prepared 99k mammalian Swiss-Prot protein sequences and 340k rodent TrEMBL protein sequences as protein evidence. I am considering whether I can save much time if I use only the 99k mammalian Swiss-Prot sequences as evidence.
>
> BLASTN (ESTs) -> fastest, as it is searching nucleotide space
> BLASTX (proteins) -> must search 6 reading frames, so will be at least 6 times slower than BLASTN
> TBLASTX (alt-ESTs) -> must search 12 reading frames, so will be at least 12 times slower than BLASTN and twice as slow as BLASTX
>
> Also, doubling the dataset size doubles the runtime. Larger window sizes via max_dna_len will also increase runtimes.
>
>> (4) For some reason, I cannot run MAKER through MPI on our cluster. So I can only start multiple MAKER instances.
>> I wonder if it is possible to let multiple MAKER instances annotate the same long scaffold (i.e., for a single sequence I start multiple MAKER instances, without splitting the long sequence into shorter ones).
>
> Without MPI you won't be able to split up large contigs. At the very least you can try to run on a single node and set MPI to use all CPUs on that node. It's less difficult to set up compared to cross-node jobs via MPI.
>
>> (5) Still about the speed issue. I read some of your comments about the "cpus" parameter in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html). And I know it indicates the number of CPUs for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?
>
> The cpus parameter only affects how many CPUs are given to the BLAST command line, so only the BLAST step will speed up. I recommend using MPI to get all steps to speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
>
> --Carson

From qwzhang0601 at gmail.com Mon Sep 11 12:12:29 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 11 Sep 2017 13:12:29 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
References: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com> <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com> <9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com> <92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>

Dear Carson:

I only run 5 MAKER instances in each directory (and set cpus=2). If it is related to a memory issue or an IO issue, I am not sure why the much longer scaffolds (longer than the failed ones) were all annotated successfully, but the relatively shorter ones failed.
I have set "tries=5" (#number of times to try a contig if there is a failure for some reason). I will try "clean_try=1" and test on the failed scaffolds individually with larger memory to see whether they can be annotated. Thank you! Best Quanwei 2017-09-11 13:07 GMT-04:00 Carson Holt : > I think the cause of the error may have been a little further upstream > from what you pasted in the e-mail. One thing that may be happening is that > you are taxing resources (like IO) if running MAKER multiple times or on > too many CPUs. That can lead to failures because of truncated BLAST reports > etc. In which case you can just retry and that will get around those types > of IO derived errors. MAKER can generate a lot of IO, and if you are > working on network mounted locations (i.e. the storage being used is > actually across the network), then they can be lest robust than local > storage (when under heavy load NFS can falsely report success on read/write > operations that actually failed). It?s the reason we built in the retry > capabilities of MAKER. > > For contigs that continuously fail, you may need to set clean_try=1. That > will cause failures to start from scratch (i.e. delete all old reports on > failure rather than just those suspected of being truncated). > > ?Carson > > > > On Sep 11, 2017, at 10:19 AM, Quanwei Zhang wrote: > > Dear Carson: > > About the error in my above email, I found the contig was correctly > annotated at the second time RETRY. So please ignore my last email. But > now, for a few number of scaffolds, I met problems to process the repeats > (as shown below in red). I used both Mammalia repeat library and species > specific repeat library (which is generated by your pipeline " > http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/ > Repeat_Library_Construction--Basic"). There were no such problems when I > only used Mammalia repeat library. Do you have any ideas about this? What > could be the reason? 
Or do you have any suggestions for me to find the > reason? Many thanks > > Here are some parameters I used > > #-----Repeat Masking (leave values blank to skip repeat masking) > model_org=Mammalia #select a model organism for RepBase masking in > RepeatMasker > rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific > repeat library in fasta format for Repe > > max_dna_len=300000 > split_hit=40000 > depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff) > depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff) > depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff) > bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking > > > Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm > line 188. > 33708 --> rank=NA, hostname=n409 > 33709 ERROR: Failed while processing all repeats > 33710 ERROR: Chunk failed at level:3, tier_type:1 > 33711 FAILED CONTIG:Contig31 > > > Best > Quanwei > > 2017-09-08 23:25 GMT-04:00 Quanwei Zhang : > >> Dear Carson: >> >> I got the following error again. Is this still related to memory issues? >> I wonder whether there can be other reasons lead to this error? This time, >> I got this error during training of the SNAP model. Before, even I set >> max_dna_len=1Mb, I can train the model successfully. And in the current >> training (where I get the following error), I have decreased the >> max_dna_len to 300kb. I required the same amount memory as before. The only >> difference is that I am using both mammalian repeat library and species >> specific repeat library, while previously I only use the mammalian repeat >> library. Will it greatly increases the requirement of memory to use both >> repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? I >> have also set the depth_blast as 30 in current training. >> >> Thank you! Have a nice weekend! >> >> >> >> #--------------------------------------------------------------------- >> Now starting the contig!! 
>> SeqID: Contig10 >> Length: 18773588 >> #--------------------------------------------------------------------- >> >> >> setting up GFF3 output and fasta chunks >> doing repeat masking >> doing blastx repeats >> doing blastx repeats >> doing blastx repeats >> doing blastx repeats >> doing blastx repeats >> doing blastx repeats >> doing blastx repeats >> doing blastx repeats >> doing blastx repeats >> doing blastx repeats >> collecting blastx repeatmasking >> processing all repeats >> doing repeat masking >> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm >> line 1050. >> --> rank=NA, hostname=n224 >> ERROR: Failed while doing repeat masking >> ERROR: Chunk failed at level:0, tier_type:1 >> FAILED CONTIG:Contig10 >> >> ERROR: Chunk failed at level:2, tier_type:0 >> FAILED CONTIG:Contig10 >> >> Best >> Quanwei >> >> 2017-09-06 12:06 GMT-04:00 Carson Holt : >> >>> >>> (2) By reading some of your replies in the maker google group, and I >>> noticed that it can reduce memory and save time for annotation if I set >>> depth_blast to a certain number. So I changed the following parameters. But >>> I wonder, whether it will decrease the quality of annotation? If it won't >>> affect the quality, can I even use a smaller number (e.g., 20) to save more >>> memory and time? >>> >>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff) >>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff) >>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff) >>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking >>> >>> >>> This values really only affects the final evidence kept in the GFF3 when >>> you look at it in a browser. It has not affect on the annotation. This is >>> because internally MAKER already collapses evidence down to the 10 best >>> non-redundant features per evidence set per locus. The rest are put in the >>> GFF3 just for reference. 
by setting it lower, you are just letting MAKER >>> know it can through things away even sooner since you don?t want them in >>> the GFF3. It provides a minor improvement for memory use, but >>> max_dna_length is the big one that has the greatest effect. >>> >>> >>> (3) I also have some concerns about the speed, especially for the long >>> scaffolds (around 100Mb). I wonder which part is the most time consuming >>> for genome annotation (repeat masking, blast, or polishing?). >>> Particularly, I wonder whether the blastx of protein evidence will take >>> majority of time. Now, I have prepared 99k mammalian Swiss protein >>> sequences and 340k rodent TrEMBL protein sequences as protein evidences. I >>> am considering whether I can save much time if I only use the 99k mammalian >>> Swiss protein sequences as evidences. >>> >>> >>> BLASTN (ESTs) -> fastest as it is searching nucleotide space >>> BLASTX (proteins) -> must search 6 reading frames so will be at least 6 >>> times slower than BLASTN >>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least >>> 12 times slower than BLASTN and twice as slow as BLASTX >>> >>> Also double the dataset size, double the runtime. Larger window sizes >>> via max_dna_length will also increase runtimes. >>> >>> >>> (4) For some reasons, I can not run maker though MPI on our cluster. So >>> I can only start multiple maker. I wonder if it is possible to let multiple >>> maker to annotate the same long scaffold (i.e., for a single sequence I >>> start multiple maker, without splitting the long sequence into shorter >>> ones). >>> >>> >>> Without MPI you won?t be able to split up large contigs. At the very >>> least you can try and run on a single node and set MPI to use all CPUs on >>> that node. It?s less difficult to set up compared to cross node jobs via >>> MPI. >>> >>> >>> (5) Still about the speed issue. I read some of your comments about >>> "cpus" parameters in the maker_opts file (http://gmod.827538.n3.nabble. 
>>> com/open3-fork-failed-Cannot-allocate-memory-td4025117.html). And I >>> know it indicate the number of cpus for a single chunk. So if I set >>> "cpus=2" in the maker_opts file, then I can use the following command to >>> submit the job, right? >>> >>> >>> The cpu parameter only affects how many CPUs are given to the blast >>> command line. So only the BLASt step will speed up, so I recommend using >>> MPI to get all steps to speed up. Even if you are only running on a single >>> node, you can give all CPUs to the mpiexec command. >>> >>> >>> ?Carson >>> >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From carsonhh at gmail.com Mon Sep 11 12:14:11 2017 From: carsonhh at gmail.com (Carson Holt) Date: Mon, 11 Sep 2017 11:14:11 -0600 Subject: [maker-devel] Some errors reported by Maker2 In-Reply-To: References: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com> <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com> <9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com> <92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com> Message-ID: <488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com> It could be either. Please use MPI instead of starting multiple instances. It will greatly reduce both IO and RAM usage. ?Carson > On Sep 11, 2017, at 11:12 AM, Quanwei Zhang wrote: > > Dear Carson: > > I only run 5 Maker instances in each directory (and set cpus=2). If it is related to memory issue or an IO issue, I am not sure why the much longer scaffolds (than the failed ones) were all annotated successfully, but the relatively shorter ones failed. > > I have set "tries=5" (#number of times to try a contig if there is a failure for some reason). I will try "clean_try=1" and test on the failed scaffolds individually with larger memory to see whether they can be annotated. > > Thank you! > > Best > Quanwei > > 2017-09-11 13:07 GMT-04:00 Carson Holt >: > I think the cause of the error may have been a little further upstream from what you pasted in the e-mail. 
One thing that may be happening is that you are taxing resources (like IO) if running MAKER multiple times or on too many CPUs. That can lead to failures because of truncated BLAST reports etc. In which case you can just retry and that will get around those types of IO derived errors. MAKER can generate a lot of IO, and if you are working on network mounted locations (i.e. the storage being used is actually across the network), then they can be lest robust than local storage (when under heavy load NFS can falsely report success on read/write operations that actually failed). It?s the reason we built in the retry capabilities of MAKER. > > For contigs that continuously fail, you may need to set clean_try=1. That will cause failures to start from scratch (i.e. delete all old reports on failure rather than just those suspected of being truncated). > > ?Carson > > > >> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang > wrote: >> >> Dear Carson: >> >> About the error in my above email, I found the contig was correctly annotated at the second time RETRY. So please ignore my last email. But now, for a few number of scaffolds, I met problems to process the repeats (as shown below in red). I used both Mammalia repeat library and species specific repeat library (which is generated by your pipeline "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic "). There were no such problems when I only used Mammalia repeat library. Do you have any ideas about this? What could be the reason? Or do you have any suggestions for me to find the reason? 
Many thanks >> >> Here are some parameters I used >> >> #-----Repeat Masking (leave values blank to skip repeat masking) >> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker >> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe >> >> max_dna_len=300000 >> split_hit=40000 >> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff) >> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff) >> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff) >> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking >> >> >> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188. >> 33708 --> rank=NA, hostname=n409 >> 33709 ERROR: Failed while processing all repeats >> 33710 ERROR: Chunk failed at level:3, tier_type:1 >> 33711 FAILED CONTIG:Contig31 >> >> >> Best >> Quanwei >> >> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang >: >> Dear Carson: >> >> I got the following error again. Is this still related to memory issues? I wonder whether there can be other reasons lead to this error? This time, I got this error during training of the SNAP model. Before, even I set max_dna_len=1Mb, I can train the model successfully. And in the current training (where I get the following error), I have decreased the max_dna_len to 300kb. I required the same amount memory as before. The only difference is that I am using both mammalian repeat library and species specific repeat library, while previously I only use the mammalian repeat library. Will it greatly increases the requirement of memory to use both repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? I have also set the depth_blast as 30 in current training. >> >> Thank you! Have a nice weekend! >> >> >> >> #--------------------------------------------------------------------- >> Now starting the contig!! 
>> SeqID: Contig10 >> Length: 18773588 >> #--------------------------------------------------------------------- >> >> >> setting up GFF3 output and fasta chunks >> doing repeat masking >> doing blastx repeats >> doing blastx repeats >> doing blastx repeats >> doing blastx repeats >> doing blastx repeats >> doing blastx repeats >> doing blastx repeats >> doing blastx repeats >> doing blastx repeats >> doing blastx repeats >> collecting blastx repeatmasking >> processing all repeats >> doing repeat masking >> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050. >> --> rank=NA, hostname=n224 >> ERROR: Failed while doing repeat masking >> ERROR: Chunk failed at level:0, tier_type:1 >> FAILED CONTIG:Contig10 >> >> ERROR: Chunk failed at level:2, tier_type:0 >> FAILED CONTIG:Contig10 >> >> Best >> Quanwei >> >> 2017-09-06 12:06 GMT-04:00 Carson Holt >: >> >>> (2) By reading some of your replies in the maker google group, and I noticed that it can reduce memory and save time for annotation if I set depth_blast to a certain number. So I changed the following parameters. But I wonder, whether it will decrease the quality of annotation? If it won't affect the quality, can I even use a smaller number (e.g., 20) to save more memory and time? >>> >>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff) >>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff) >>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff) >>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking >> >> This values really only affects the final evidence kept in the GFF3 when you look at it in a browser. It has not affect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. 
by setting it lower, you are just letting MAKER know it can through things away even sooner since you don?t want them in the GFF3. It provides a minor improvement for memory use, but max_dna_length is the big one that has the greatest effect. >> >> >>> (3) I also have some concerns about the speed, especially for the long scaffolds (around 100Mb). I wonder which part is the most time consuming for genome annotation (repeat masking, blast, or polishing?). Particularly, I wonder whether the blastx of protein evidence will take majority of time. Now, I have prepared 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences as protein evidences. I am considering whether I can save much time if I only use the 99k mammalian Swiss protein sequences as evidences. >> >> BLASTN (ESTs) -> fastest as it is searching nucleotide space >> BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN >> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX >> >> Also double the dataset size, double the runtime. Larger window sizes via max_dna_length will also increase runtimes. >> >> >>> (4) For some reasons, I can not run maker though MPI on our cluster. So I can only start multiple maker. I wonder if it is possible to let multiple maker to annotate the same long scaffold (i.e., for a single sequence I start multiple maker, without splitting the long sequence into shorter ones). >> >> Without MPI you won?t be able to split up large contigs. At the very least you can try and run on a single node and set MPI to use all CPUs on that node. It?s less difficult to set up compared to cross node jobs via MPI. >> >> >>> (5) Still about the speed issue. I read some of your comments about "cpus" parameters in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html ). 
And I know it indicate the number of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right? >> >> The cpu parameter only affects how many CPUs are given to the blast command line. So only the BLASt step will speed up, so I recommend using MPI to get all steps to speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command. >> >> >> ?Carson >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From qwzhang0601 at gmail.com Mon Sep 11 12:16:49 2017 From: qwzhang0601 at gmail.com (Quanwei Zhang) Date: Mon, 11 Sep 2017 13:16:49 -0400 Subject: [maker-devel] Some errors reported by Maker2 In-Reply-To: <488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com> References: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com> <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com> <9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com> <92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com> <488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com> Message-ID: Dear Carson: I met some problems to use MPI. I will give it another try. Thank you! Best Quanwei 2017-09-11 13:14 GMT-04:00 Carson Holt : > It could be either. Please use MPI instead of starting multiple instances. > It will greatly reduce both IO and RAM usage. > > ?Carson > > > > On Sep 11, 2017, at 11:12 AM, Quanwei Zhang wrote: > > Dear Carson: > > I only run 5 Maker instances in each directory (and set cpus=2). If it is > related to memory issue or an IO issue, I am not sure why the much longer > scaffolds (than the failed ones) were all annotated successfully, but the > relatively shorter ones failed. > > I have set "tries=5" (#number of times to try a contig if there is a > failure for some reason). I will try "clean_try=1" and test on the failed > scaffolds individually with larger memory to see whether they can be > annotated. > > Thank you! 
> > Best > Quanwei > > 2017-09-11 13:07 GMT-04:00 Carson Holt : > >> I think the cause of the error may have been a little further upstream >> from what you pasted in the e-mail. One thing that may be happening is that >> you are taxing resources (like IO) if running MAKER multiple times or on >> too many CPUs. That can lead to failures because of truncated BLAST reports >> etc. In which case you can just retry and that will get around those types >> of IO derived errors. MAKER can generate a lot of IO, and if you are >> working on network mounted locations (i.e. the storage being used is >> actually across the network), then they can be lest robust than local >> storage (when under heavy load NFS can falsely report success on read/write >> operations that actually failed). It?s the reason we built in the retry >> capabilities of MAKER. >> >> For contigs that continuously fail, you may need to set clean_try=1. That >> will cause failures to start from scratch (i.e. delete all old reports on >> failure rather than just those suspected of being truncated). >> >> ?Carson >> >> >> >> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang >> wrote: >> >> Dear Carson: >> >> About the error in my above email, I found the contig was correctly >> annotated at the second time RETRY. So please ignore my last email. But >> now, for a few number of scaffolds, I met problems to process the repeats >> (as shown below in red). I used both Mammalia repeat library and species >> specific repeat library (which is generated by your pipeline " >> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Rep >> eat_Library_Construction--Basic"). There were no such problems when I >> only used Mammalia repeat library. Do you have any ideas about this? What >> could be the reason? Or do you have any suggestions for me to find the >> reason? 
>> Many thanks
>>
>> Here are some parameters I used
>>
>> #-----Repeat Masking (leave values blank to skip repeat masking)
>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
>>
>> max_dna_len=300000
>> split_hit=40000
>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>
>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>> --> rank=NA, hostname=n409
>> ERROR: Failed while processing all repeats
>> ERROR: Chunk failed at level:3, tier_type:1
>> FAILED CONTIG:Contig31
>>
>> Best
>> Quanwei
>>
>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang:
>>
>>> Dear Carson:
>>>
>>> I got the following error again. Is this still related to memory issues? I wonder whether other causes could lead to this error. This time, I got the error during training of the SNAP model. Before, even with max_dna_len=1Mb, I could train the model successfully. In the current training (where I get the following error), I have decreased max_dna_len to 300kb. I requested the same amount of memory as before. The only difference is that I am using both the mammalian repeat library and a species specific repeat library, while previously I only used the mammalian repeat library. Will using both repeat libraries greatly increase the memory requirement (even when I decrease max_dna_len from 1Mb to 300kb)? I have also set depth_blast to 30 in the current training.
>>>
>>> Thank you! Have a nice weekend!
>>>
>>> #---------------------------------------------------------------------
>>> Now starting the contig!!
>>> SeqID: Contig10
>>> Length: 18773588
>>> #---------------------------------------------------------------------
>>>
>>> setting up GFF3 output and fasta chunks
>>> doing repeat masking
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> collecting blastx repeatmasking
>>> processing all repeats
>>> doing repeat masking
>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
>>> --> rank=NA, hostname=n224
>>> ERROR: Failed while doing repeat masking
>>> ERROR: Chunk failed at level:0, tier_type:1
>>> FAILED CONTIG:Contig10
>>>
>>> ERROR: Chunk failed at level:2, tier_type:0
>>> FAILED CONTIG:Contig10
>>>
>>> Best
>>> Quanwei
>>>
>>> 2017-09-06 12:06 GMT-04:00 Carson Holt:
>>>
>>>> (2) Reading some of your replies in the maker google group, I noticed that it can reduce memory and save time for annotation if I set depth_blast to a certain number. So I changed the following parameters. But I wonder whether it will decrease the quality of annotation. If it won't affect the quality, can I even use a smaller number (e.g., 20) to save more memory and time?
>>>>
>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>
>>>> These values really only affect the final evidence kept in the GFF3 when you look at it in a browser. They have no effect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. By setting it lower, you are just letting MAKER know it can throw things away even sooner since you don't want them in the GFF3. It provides a minor improvement for memory use, but max_dna_len is the big one that has the greatest effect.
>>>>
>>>> (3) I also have some concerns about the speed, especially for the long scaffolds (around 100Mb). I wonder which part is the most time consuming for genome annotation (repeat masking, blast, or polishing?). In particular, I wonder whether the blastx of protein evidence will take the majority of the time. I have prepared 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences as protein evidence. I am considering whether I can save much time if I only use the 99k mammalian Swiss protein sequences as evidence.
>>>>
>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>> BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN
>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>
>>>> Also, doubling the dataset size doubles the runtime. Larger window sizes via max_dna_len will also increase runtimes.
>>>>
>>>> (4) For some reason, I cannot run maker through MPI on our cluster, so I can only start multiple maker instances. I wonder if it is possible to let multiple maker instances annotate the same long scaffold (i.e., for a single sequence I start multiple maker instances, without splitting the long sequence into shorter ones).
>>>>
>>>> Without MPI you won't be able to split up large contigs. At the very least you can try to run on a single node and set MPI to use all CPUs on that node. It's less difficult to set up compared to cross node jobs via MPI.
>>>>
>>>> (5) Still about the speed issue.
>>>> I read some of your comments about the "cpus" parameter in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html). I know it indicates the number of CPUs for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?
>>>>
>>>> The cpus parameter only affects how many CPUs are given to the BLAST command line. So only the BLAST step will speed up, and I recommend using MPI to get all steps to speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
>>>>
>>>> --Carson

From carsonhh at gmail.com Mon Sep 11 12:18:14 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 11:18:14 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To:
References: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com> <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com> <9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com> <92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com> <488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
Message-ID: <3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>

If you are just using a single machine (and not cross machine MPI), use MPICH3 -> https://www.mpich.org

It's easy to install yourself, and tends to be very robust to failure.

--Carson

> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang wrote:
>
> Dear Carson:
>
> I ran into some problems using MPI. I will give it another try.
> Thank you!
>
> Best
> Quanwei
>
> 2017-09-11 13:14 GMT-04:00 Carson Holt:
> It could be either. Please use MPI instead of starting multiple instances. It will greatly reduce both IO and RAM usage.
>
> --Carson
>
>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang wrote:
>>
>> Dear Carson:
>>
>> I only run 5 Maker instances in each directory (and set cpus=2). If it is a memory or IO issue, I am not sure why the much longer scaffolds (than the failed ones) were all annotated successfully, while the relatively shorter ones failed.
>>
>> I have set "tries=5" (#number of times to try a contig if there is a failure for some reason). I will try "clean_try=1" and test the failed scaffolds individually with more memory to see whether they can be annotated.
>>
>> Thank you!
>>
>> Best
>> Quanwei
>>
>> 2017-09-11 13:07 GMT-04:00 Carson Holt:
>> I think the cause of the error may have been a little further upstream from what you pasted in the e-mail. One thing that may be happening is that you are taxing resources (like IO) if running MAKER multiple times or on too many CPUs. That can lead to failures because of truncated BLAST reports etc. In that case you can just retry, and that will get around those types of IO derived errors. MAKER can generate a lot of IO, and if you are working on network mounted locations (i.e. the storage being used is actually across the network), then they can be less robust than local storage (when under heavy load NFS can falsely report success on read/write operations that actually failed). It's the reason we built in the retry capabilities of MAKER.
>>
>> For contigs that continuously fail, you may need to set clean_try=1. That will cause failures to start from scratch (i.e. delete all old reports on failure rather than just those suspected of being truncated).
>>
>> --Carson
>>
>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang wrote:
>>>
>>> Dear Carson:
>>>
>>> About the error in my above email, I found the contig was correctly annotated on the second retry. So please ignore my last email. But now, for a small number of scaffolds, I ran into problems processing the repeats (as shown below in red). I used both the Mammalia repeat library and a species specific repeat library (which was generated by your pipeline "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic"). There were no such problems when I only used the Mammalia repeat library. Do you have any ideas about this? What could be the reason? Or do you have any suggestions for how I can find the reason? Many thanks
>>>
>>> Here are some parameters I used
>>>
>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
>>>
>>> max_dna_len=300000
>>> split_hit=40000
>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>
>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>>> --> rank=NA, hostname=n409
>>> ERROR: Failed while processing all repeats
>>> ERROR: Chunk failed at level:3, tier_type:1
>>> FAILED CONTIG:Contig31
>>>
>>> Best
>>> Quanwei

From qwzhang0601 at gmail.com Mon Sep 11 12:27:22 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 11 Sep 2017 13:27:22 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
References: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com> <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com> <9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com> <92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com> <488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com> <3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
Message-ID:

Dear Carson:

Would you please explain what you mean by "a single machine"? I am running maker2 on our high performance cluster. The cluster has more than 1,620-core compute nodes with 128 GB RAM each.
Univa Grid Engine is used as the scheduler. Can I use MPICH3?

Thanks

Best
Quanwei

From carsonhh at gmail.com Mon Sep 11 12:46:39 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 11:46:39 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To:
References: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com> <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com> <9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com> <92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com> <488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com> <3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
Message-ID:

Each node is a single machine. Because you currently run without MPI, each MAKER job you submit runs on a single machine. So you are either running multiple times on the same node, or you submitted 5 separate batch jobs, in which case you may have a single maker process on each of 5 nodes.

MPI can parallelize on the same node or across nodes. If you request 10 nodes, then it can communicate across nodes to run the job on all hardware. Or you can run MPI on a single node and ask for all CPUs on that node. In that case it will split up work within a single node and use all resources just on that node.

So if you can't get MPI to work across nodes, you can just submit a job that goes to a single node and ask for all CPUs on that node (multinode jobs may be hard to configure, but single node jobs are very easy). Just set the -n parameter of mpiexec to the CPU count of that node, and it will parallelize within the node.

Example command for a 20 CPU node -> mpiexec -n 20 maker

--Carson
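The single-node launch described above can be sketched as a small wrapper script. This is only a sketch under stated assumptions: it assumes MAKER was built with MPI support, that `mpiexec` (for example from MPICH) and `maker` are both on the `PATH`, and that the job has the node to itself. The `maker_mpi_cmd` helper and the `NCPU` and `DRY_RUN` variables are hypothetical names introduced here for illustration; they are not part of MAKER or MPICH.

```shell
#!/bin/sh
# Sketch: launch MAKER under MPI on a single node, one MPI process per CPU.

# Hypothetical helper: print the launch line for a given process count.
maker_mpi_cmd() {
    printf 'mpiexec -n %s maker\n' "$1"
}

# Number of MPI processes: NCPU if set, otherwise the online CPU count.
# getconf _NPROCESSORS_ONLN is common on Linux but not guaranteed everywhere.
ncpu="${NCPU:-$(getconf _NPROCESSORS_ONLN)}"

if [ "${DRY_RUN:-1}" = "1" ]; then
    # Default: only show what would be run.
    maker_mpi_cmd "$ncpu"
else
    # Run from the directory that holds the maker_*.ctl control files.
    mpiexec -n "$ncpu" maker
fi
```

On a shared cluster it is probably safer to set `NCPU` to the number of slots the scheduler actually granted rather than the node's total CPU count; how to request a whole node is site-specific and is not shown here.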
>>> >>> I have set "tries=5" (#number of times to try a contig if there is a failure for some reason). I will try "clean_try=1" and test on the failed scaffolds individually with larger memory to see whether they can be annotated. >>> >>> Thank you! >>> >>> Best >>> Quanwei >>> >>> 2017-09-11 13:07 GMT-04:00 Carson Holt >: >>> I think the cause of the error may have been a little further upstream from what you pasted in the e-mail. One thing that may be happening is that you are taxing resources (like IO) if running MAKER multiple times or on too many CPUs. That can lead to failures because of truncated BLAST reports etc. In which case you can just retry and that will get around those types of IO derived errors. MAKER can generate a lot of IO, and if you are working on network mounted locations (i.e. the storage being used is actually across the network), then they can be lest robust than local storage (when under heavy load NFS can falsely report success on read/write operations that actually failed). It?s the reason we built in the retry capabilities of MAKER. >>> >>> For contigs that continuously fail, you may need to set clean_try=1. That will cause failures to start from scratch (i.e. delete all old reports on failure rather than just those suspected of being truncated). >>> >>> ?Carson >>> >>> >>> >>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang > wrote: >>>> >>>> Dear Carson: >>>> >>>> About the error in my above email, I found the contig was correctly annotated at the second time RETRY. So please ignore my last email. But now, for a few number of scaffolds, I met problems to process the repeats (as shown below in red). I used both Mammalia repeat library and species specific repeat library (which is generated by your pipeline "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic "). There were no such problems when I only used Mammalia repeat library. Do you have any ideas about this? What could be the reason? 
Or do you have any suggestions for me to find the reason? Many thanks >>>> >>>> Here are some parameters I used >>>> >>>> #-----Repeat Masking (leave values blank to skip repeat masking) >>>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker >>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe >>>> >>>> max_dna_len=300000 >>>> split_hit=40000 >>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff) >>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff) >>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff) >>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking >>>> >>>> >>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188. >>>> 33708 --> rank=NA, hostname=n409 >>>> 33709 ERROR: Failed while processing all repeats >>>> 33710 ERROR: Chunk failed at level:3, tier_type:1 >>>> 33711 FAILED CONTIG:Contig31 >>>> >>>> >>>> Best >>>> Quanwei >>>> >>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang >: >>>> Dear Carson: >>>> >>>> I got the following error again. Is this still related to memory issues? I wonder whether there can be other reasons lead to this error? This time, I got this error during training of the SNAP model. Before, even I set max_dna_len=1Mb, I can train the model successfully. And in the current training (where I get the following error), I have decreased the max_dna_len to 300kb. I required the same amount memory as before. The only difference is that I am using both mammalian repeat library and species specific repeat library, while previously I only use the mammalian repeat library. Will it greatly increases the requirement of memory to use both repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? I have also set the depth_blast as 30 in current training. >>>> >>>> Thank you! Have a nice weekend! 
>>>> >>>> >>>> >>>> #--------------------------------------------------------------------- >>>> Now starting the contig!! >>>> SeqID: Contig10 >>>> Length: 18773588 >>>> #--------------------------------------------------------------------- >>>> >>>> >>>> setting up GFF3 output and fasta chunks >>>> doing repeat masking >>>> doing blastx repeats >>>> doing blastx repeats >>>> doing blastx repeats >>>> doing blastx repeats >>>> doing blastx repeats >>>> doing blastx repeats >>>> doing blastx repeats >>>> doing blastx repeats >>>> doing blastx repeats >>>> doing blastx repeats >>>> collecting blastx repeatmasking >>>> processing all repeats >>>> doing repeat masking >>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050. >>>> --> rank=NA, hostname=n224 >>>> ERROR: Failed while doing repeat masking >>>> ERROR: Chunk failed at level:0, tier_type:1 >>>> FAILED CONTIG:Contig10 >>>> >>>> ERROR: Chunk failed at level:2, tier_type:0 >>>> FAILED CONTIG:Contig10 >>>> >>>> Best >>>> Quanwei >>>> >>>> 2017-09-06 12:06 GMT-04:00 Carson Holt >: >>>> >>>>> (2) By reading some of your replies in the maker google group, and I noticed that it can reduce memory and save time for annotation if I set depth_blast to a certain number. So I changed the following parameters. But I wonder, whether it will decrease the quality of annotation? If it won't affect the quality, can I even use a smaller number (e.g., 20) to save more memory and time? >>>>> >>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff) >>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff) >>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff) >>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking >>>> >>>> This values really only affects the final evidence kept in the GFF3 when you look at it in a browser. It has not affect on the annotation. 
This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. by setting it lower, you are just letting MAKER know it can through things away even sooner since you don?t want them in the GFF3. It provides a minor improvement for memory use, but max_dna_length is the big one that has the greatest effect. >>>> >>>> >>>>> (3) I also have some concerns about the speed, especially for the long scaffolds (around 100Mb). I wonder which part is the most time consuming for genome annotation (repeat masking, blast, or polishing?). Particularly, I wonder whether the blastx of protein evidence will take majority of time. Now, I have prepared 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences as protein evidences. I am considering whether I can save much time if I only use the 99k mammalian Swiss protein sequences as evidences. >>>> >>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space >>>> BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN >>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX >>>> >>>> Also double the dataset size, double the runtime. Larger window sizes via max_dna_length will also increase runtimes. >>>> >>>> >>>>> (4) For some reasons, I can not run maker though MPI on our cluster. So I can only start multiple maker. I wonder if it is possible to let multiple maker to annotate the same long scaffold (i.e., for a single sequence I start multiple maker, without splitting the long sequence into shorter ones). >>>> >>>> Without MPI you won?t be able to split up large contigs. At the very least you can try and run on a single node and set MPI to use all CPUs on that node. It?s less difficult to set up compared to cross node jobs via MPI. 
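The single-node setup described above comes down to one launch line of the form `mpiexec -n <CPUs> maker`, which is the form given elsewhere in this thread. Below is a minimal Python sketch that builds that line from the CPU count the node reports. The helper name `mpiexec_command` is hypothetical, and using `os.cpu_count()` assumes the job owns the whole node; on a shared node, use the count the scheduler actually granted.

```python
import os

def mpiexec_command(n_cpus, program="maker"):
    """Build the single-node launch line: mpiexec -n <n_cpus> <program>."""
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    return ["mpiexec", "-n", str(n_cpus), program]

if __name__ == "__main__":
    # Assumption: every CPU the node reports is available to this job.
    n = os.cpu_count() or 1
    print(" ".join(mpiexec_command(n)))
```

The list form can be passed directly to `subprocess.run` inside a batch script, which avoids shell quoting problems.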
>>>>
>>>>> (5) Still about the speed issue. I read some of your comments about the "cpus" parameter in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html). I understand that it indicates the number of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?
>>>>
>>>> The cpus parameter only affects how many CPUs are given to the BLAST command line, so only the BLAST step will speed up. I recommend using MPI to get all steps to speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
>>>>
>>>> --Carson

From qwzhang0601 at gmail.com Mon Sep 11 13:33:51 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 11 Sep 2017 14:33:51 -0400
Subject: [maker-devel] Some errors reported by Maker2
References: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com> <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com> <9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com> <92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com> <488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com> <3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>

Dear Carson:

I see. Thank you. I will try it.

Best
Quanwei

2017-09-11 13:46 GMT-04:00 Carson Holt :

> Each node is a single machine. Because you currently run without MPI, each
> MAKER job you submit runs on a single machine. So you are either running
> multiple times on the same node, or you submitted 5 separate batch jobs, in
> which case you may have a single maker process on each of 5 nodes.
>
> MPI can parallelize on the same node or across nodes. If you request 10
> nodes, then it can communicate across nodes to run the job on all hardware.
> Or you can run MPI on a single node and ask for all CPUs on that node. In
> that case it will split up work within a single node and use all resources
> just on that node. So if you can't get MPI to work across nodes, you can
> just submit a job that goes to a single node and ask for all CPUs on that
> node (multinode jobs may be hard to configure, but single node jobs are
> very easy). Just set the -n parameter of mpiexec to the CPU count of that
> node, and it will parallelize within the node.
>
> Example command for a 20 CPU node --> mpiexec -n 20 maker
>
> --Carson
>
> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang wrote:
>
> Dear Carson:
>
> Would you please explain what you mean by "a single machine"? I am
> running maker2 on our high performance cluster. The cluster has more than
> 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine is used
> as the scheduler. Can I use MPICH3?
>
> Thanks
>
> Best
> Quanwei
>
> 2017-09-11 13:18 GMT-04:00 Carson Holt :
>
>> If you are just using a single machine (and not cross machine MPI), use
>> MPICH3 --> https://www.mpich.org
>>
>> It's easy to install yourself, and tends to be very robust to failure.
>>
>> --Carson
>>
>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang wrote:
>>
>> Dear Carson:
>>
>> I ran into some problems using MPI. I will give it another try.
>> Thank you!
>>
>> Best
>> Quanwei
>>
>> 2017-09-11 13:14 GMT-04:00 Carson Holt :
>>
>>> It could be either. Please use MPI instead of starting multiple
>>> instances. It will greatly reduce both IO and RAM usage.
>>>
>>> --Carson
>>>
>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang wrote:
>>>
>>> Dear Carson:
>>>
>>> I only run 5 Maker instances in each directory (and set cpus=2). If it
>>> is related to a memory issue or an IO issue, I am not sure why the much
>>> longer scaffolds (than the failed ones) were all annotated successfully,
>>> but the relatively shorter ones failed.
>>>
>>> I have set "tries=5" (#number of times to try a contig if there is a
>>> failure for some reason). I will try "clean_try=1" and test the failed
>>> scaffolds individually with more memory to see whether they can be
>>> annotated.
>>>
>>> Thank you!
>>>
>>> Best
>>> Quanwei
>>>
>>> 2017-09-11 13:07 GMT-04:00 Carson Holt :
>>>
>>>> I think the cause of the error may have been a little further upstream
>>>> from what you pasted in the e-mail. One thing that may be happening is that
>>>> you are taxing resources (like IO) if running MAKER multiple times or on
>>>> too many CPUs. That can lead to failures because of truncated BLAST reports
>>>> etc. In which case you can just retry and that will get around those types
>>>> of IO derived errors. MAKER can generate a lot of IO, and if you are
>>>> working on network mounted locations (i.e. the storage being used is
>>>> actually across the network), then they can be less robust than local
>>>> storage (when under heavy load NFS can falsely report success on read/write
>>>> operations that actually failed). It's the reason we built in the retry
>>>> capabilities of MAKER.
>>>>
>>>> For contigs that continuously fail, you may need to set clean_try=1.
>>>> That will cause failures to start from scratch (i.e. delete all old reports
>>>> on failure rather than just those suspected of being truncated).
>>>>
>>>> --Carson
>>>>
>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang wrote:
>>>>
>>>> Dear Carson:
>>>>
>>>> About the error in my above email, I found the contig was correctly
>>>> annotated on the second RETRY. So please ignore my last email. But
>>>> now, for a small number of scaffolds, I have problems processing the repeats
>>>> (as shown below). I used both the Mammalia repeat library and a species
>>>> specific repeat library (which was generated by your pipeline
>>>> "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic").
>>>> There were no such problems when I
>>>> only used the Mammalia repeat library. Do you have any ideas about this?
What >>>> could be the reason? Or do you have any suggestions for me to find the >>>> reason? Many thanks >>>> >>>> Here are some parameters I used >>>> >>>> #-----Repeat Masking (leave values blank to skip repeat masking) >>>> model_org=Mammalia #select a model organism for RepBase masking in >>>> RepeatMasker >>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism >>>> specific repeat library in fasta format for Repe >>>> >>>> max_dna_len=300000 >>>> split_hit=40000 >>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff) >>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff) >>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff) >>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking >>>> >>>> >>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm >>>> line 188. >>>> 33708 --> rank=NA, hostname=n409 >>>> 33709 ERROR: Failed while processing all repeats >>>> 33710 ERROR: Chunk failed at level:3, tier_type:1 >>>> 33711 FAILED CONTIG:Contig31 >>>> >>>> >>>> Best >>>> Quanwei >>>> >>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang : >>>> >>>>> Dear Carson: >>>>> >>>>> I got the following error again. Is this still related to memory >>>>> issues? I wonder whether there can be other reasons lead to this error? >>>>> This time, I got this error during training of the SNAP model. Before, even >>>>> I set max_dna_len=1Mb, I can train the model successfully. And in the >>>>> current training (where I get the following error), I have decreased the >>>>> max_dna_len to 300kb. I required the same amount memory as before. The only >>>>> difference is that I am using both mammalian repeat library and species >>>>> specific repeat library, while previously I only use the mammalian repeat >>>>> library. Will it greatly increases the requirement of memory to use both >>>>> repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? 
I >>>>> have also set the depth_blast as 30 in current training. >>>>> >>>>> Thank you! Have a nice weekend! >>>>> >>>>> >>>>> >>>>> #--------------------------------------------------------------------- >>>>> Now starting the contig!! >>>>> SeqID: Contig10 >>>>> Length: 18773588 >>>>> #--------------------------------------------------------------------- >>>>> >>>>> >>>>> setting up GFF3 output and fasta chunks >>>>> doing repeat masking >>>>> doing blastx repeats >>>>> doing blastx repeats >>>>> doing blastx repeats >>>>> doing blastx repeats >>>>> doing blastx repeats >>>>> doing blastx repeats >>>>> doing blastx repeats >>>>> doing blastx repeats >>>>> doing blastx repeats >>>>> doing blastx repeats >>>>> collecting blastx repeatmasking >>>>> processing all repeats >>>>> doing repeat masking >>>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm >>>>> line 1050. >>>>> --> rank=NA, hostname=n224 >>>>> ERROR: Failed while doing repeat masking >>>>> ERROR: Chunk failed at level:0, tier_type:1 >>>>> FAILED CONTIG:Contig10 >>>>> >>>>> ERROR: Chunk failed at level:2, tier_type:0 >>>>> FAILED CONTIG:Contig10 >>>>> >>>>> Best >>>>> Quanwei >>>>> >>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt : >>>>> >>>>>> >>>>>> (2) By reading some of your replies in the maker google group, and I >>>>>> noticed that it can reduce memory and save time for annotation if I set >>>>>> depth_blast to a certain number. So I changed the following parameters. But >>>>>> I wonder, whether it will decrease the quality of annotation? If it won't >>>>>> affect the quality, can I even use a smaller number (e.g., 20) to save more >>>>>> memory and time? 
>>>>>> >>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff) >>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff) >>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff) >>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking >>>>>> >>>>>> >>>>>> This values really only affects the final evidence kept in the GFF3 >>>>>> when you look at it in a browser. It has not affect on the annotation. This >>>>>> is because internally MAKER already collapses evidence down to the 10 best >>>>>> non-redundant features per evidence set per locus. The rest are put in the >>>>>> GFF3 just for reference. by setting it lower, you are just letting MAKER >>>>>> know it can through things away even sooner since you don?t want them in >>>>>> the GFF3. It provides a minor improvement for memory use, but >>>>>> max_dna_length is the big one that has the greatest effect. >>>>>> >>>>>> >>>>>> (3) I also have some concerns about the speed, especially for the >>>>>> long scaffolds (around 100Mb). I wonder which part is the most time >>>>>> consuming for genome annotation (repeat masking, blast, or polishing?). >>>>>> Particularly, I wonder whether the blastx of protein evidence will take >>>>>> majority of time. Now, I have prepared 99k mammalian Swiss protein >>>>>> sequences and 340k rodent TrEMBL protein sequences as protein evidences. I >>>>>> am considering whether I can save much time if I only use the 99k mammalian >>>>>> Swiss protein sequences as evidences. >>>>>> >>>>>> >>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space >>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at least >>>>>> 6 times slower than BLASTN >>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at >>>>>> least 12 times slower than BLASTN and twice as slow as BLASTX >>>>>> >>>>>> Also double the dataset size, double the runtime. Larger window sizes >>>>>> via max_dna_length will also increase runtimes. 
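The rules of thumb above can be turned into a quick back-of-envelope estimate. This sketch takes the relative costs from the reply (BLASTX at least 6 times and TBLASTX at least 12 times the cost of BLASTN) and the "double the dataset, double the runtime" rule. It assumes runtime scales linearly with sequence count and that the two protein sets have similar average sequence length, so treat the result as a rough ratio and not as a prediction. The function name is hypothetical.

```python
# Relative cost per search type, taken from the reply above.
# These are lower bounds relative to BLASTN, not measurements.
RELATIVE_COST = {"blastn": 1, "blastx": 6, "tblastx": 12}

def relative_runtime(program, n_sequences, baseline_sequences):
    """Runtime relative to BLASTN on a dataset of baseline_sequences."""
    return RELATIVE_COST[program] * n_sequences / baseline_sequences

if __name__ == "__main__":
    swissprot = 99_000   # mammalian Swiss-Prot proteins mentioned in the thread
    trembl = 340_000     # rodent TrEMBL proteins mentioned in the thread
    both = relative_runtime("blastx", swissprot + trembl, swissprot)
    only_swissprot = relative_runtime("blastx", swissprot, swissprot)
    # Ratio of BLASTX cost with both protein sets versus Swiss-Prot alone.
    print(round(both / only_swissprot, 2))
```

Under these assumptions, dropping the TrEMBL set would cut the protein BLASTX work to a bit less than a quarter of what both sets together require.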
>>>>>> >>>>>> >>>>>> (4) For some reasons, I can not run maker though MPI on our cluster. >>>>>> So I can only start multiple maker. I wonder if it is possible to let >>>>>> multiple maker to annotate the same long scaffold (i.e., for a single >>>>>> sequence I start multiple maker, without splitting the long sequence into >>>>>> shorter ones). >>>>>> >>>>>> >>>>>> Without MPI you won?t be able to split up large contigs. At the very >>>>>> least you can try and run on a single node and set MPI to use all CPUs on >>>>>> that node. It?s less difficult to set up compared to cross node jobs via >>>>>> MPI. >>>>>> >>>>>> >>>>>> (5) Still about the speed issue. I read some of your comments about >>>>>> "cpus" parameters in the maker_opts file ( >>>>>> http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-a >>>>>> llocate-memory-td4025117.html). And I know it indicate the number of >>>>>> cpus for a single chunk. So if I set "cpus=2" in the maker_opts file, then >>>>>> I can use the following command to submit the job, right? >>>>>> >>>>>> >>>>>> The cpu parameter only affects how many CPUs are given to the blast >>>>>> command line. So only the BLASt step will speed up, so I recommend using >>>>>> MPI to get all steps to speed up. Even if you are only running on a single >>>>>> node, you can give all CPUs to the mpiexec command. >>>>>> >>>>>> >>>>>> ?Carson >>>>>> >>>>> >>>>> >>>> >>>> >>> >>> >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From qwzhang0601 at gmail.com Wed Sep 13 09:51:32 2017 From: qwzhang0601 at gmail.com (Quanwei Zhang) Date: Wed, 13 Sep 2017 10:51:32 -0400 Subject: [maker-devel] Repeats annotation Message-ID: Dear Carson: We have generated species specific repeat library following your pipeline ( http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic). And did genome annotation by maker2 by using both species specific repeat library and mammalian repeat library. 
Now we want to compare the repeat content among different species. So I want to generate species specific repeat libraries for the other species as well, and mask each genome with both its species specific repeat library and the mammalian repeat library. But I found that I can only provide either the species specific repeat library or the mammalian repeat library to RepeatMasker (not both). I wonder whether I can run maker2 on those genomes for repeat masking only.

BTW, running RepeatMasker gives a summary report (as below). I wonder whether maker2 has any script to analyze repeat elements (or whether there are other tools to process the output of maker2). Many thanks

file name: test_scaffold31.fasta
sequences:            1
total length:    863590 bp  (858757 bp excl N/X-runs)
GC level:         37.02 %
bases masked:    301634 bp ( 34.93 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:              134        14362 bp    1.66 %
      Alu/B1         28         2183 bp    0.25 %
      MIRs           21         2860 bp    0.33 %

LINEs:              188       129104 bp   14.95 %
      LINE1         168       124633 bp   14.43 %
      LINE2          16         4266 bp    0.49 %
      L3/CR1          4          205 bp    0.02 %
      RTE             0            0 bp    0.00 %

LTR elements:       127       101129 bp   11.71 %
      ERVL           10         3057 bp    0.35 %
      ERVL-MaLRs     22         6902 bp    0.80 %
      ERV_classI     66        80258 bp    9.29 %
      ERV_classII    29        10912 bp    1.26 %

DNA elements:        27         4402 bp    0.51 %
      hAT-Charlie    13         1836 bp    0.21 %
      TcMar-Tigger    8         1651 bp    0.19 %

Unclassified:         4         1590 bp    0.18 %

Total interspersed repeats:   250587 bp   29.02 %

Small RNA:            9          616 bp    0.07 %
Satellites:          66        40820 bp    4.73 %
Simple repeats:     159         7235 bp    0.84 %
Low complexity:      50         2766 bp    0.32 %
==================================================
* most repeats fragmented by insertions or deletions have been counted as one element

The query species was assumed to be mammalia
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
run with rmblastn version 2.2.27+
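One way to compare summary reports like the one above across species is to parse them into a small table. The following Python sketch is not part of MAKER or RepeatMasker; the function names are hypothetical, and it assumes each data row has the layout shown above (class name, element count, length in bp, percentage). Rows without an element count, such as "Total interspersed repeats" and "bases masked", are skipped.

```python
import re

# One data row: class name, element count, occupied length (bp), percentage.
ROW = re.compile(
    r"^(?P<name>.+?):?\s+(?P<count>\d+)\s+(?P<bp>\d+)\s+bp\s+(?P<pct>[\d.]+)\s*%$"
)

def parse_summary(text):
    """Map each repeat class to (count, bp, percent) from a summary report."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows[match["name"]] = (
                int(match["count"]),
                int(match["bp"]),
                float(match["pct"]),
            )
    return rows

def compare(a, b):
    """Yield (class, percent_in_a, percent_in_b) for classes found in both."""
    for name in a:
        if name in b:
            yield name, a[name][2], b[name][2]

if __name__ == "__main__":
    # Two rows copied from reports in this thread, used as sample input.
    mammalian = parse_summary("LINEs: 188 129104 bp 14.95 %")
    species = parse_summary("LINEs: 251 380142 bp 44.02 %")
    for name, pct_a, pct_b in compare(mammalian, species):
        print(f"{name}\t{pct_a}\t{pct_b}")
```

Reading each species' report file into `parse_summary` and joining on the class name gives a per-class comparison without editing the reports by hand.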
From qwzhang0601 at gmail.com Wed Sep 13 09:32:34 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 13 Sep 2017 10:32:34 -0400
Subject: [maker-devel] Some errors reported by Maker2
References: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com> <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com> <9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com> <92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com> <488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com> <3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>

Dear Carson:

I did more tests on one of the contigs (length 863kb) that failed during repeat masking. It only fails when I add the species specific repeat library; it is annotated successfully when only the mammalian repeat library is used. For the test I picked only this contig and ran maker with 64G of memory. So I think the failure should not be a memory or IO problem, because even contigs of length 98Mb can be annotated with 32G of memory.

I also ran RepeatMasker on this contig with the mammalian and the species specific repeat library separately. With the mammalian repeat library about 35% was masked as repeats, while with the species specific repeat library it is 65% (as shown below). I wonder whether the high level of repeats can lead to the failure of this contig. Do you have any ideas about this?
Thanks

file name: test_scaffold31.fasta
sequences:            1
total length:    863590 bp  (858757 bp excl N/X-runs)
GC level:         37.02 %
bases masked:    562909 bp ( 65.18 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:              113        16134 bp    1.87 %
      ALUs           71        12479 bp    1.45 %
      MIRs            1          133 bp    0.02 %

LINEs:              251       380142 bp   44.02 %
      LINE1         211       210623 bp   24.39 %
      LINE2           1           86 bp    0.01 %
      L3/CR1          0            0 bp    0.00 %

LTR elements:       246       101221 bp   11.72 %
      ERVL            5         1037 bp    0.12 %
      ERVL-MaLRs     18         2744 bp    0.32 %
      ERV_classI    201        90942 bp   10.53 %
      ERV_classII    18         5964 bp    0.69 %

DNA elements:        39        14177 bp    1.64 %
      hAT-Charlie     7         3864 bp    0.45 %
      TcMar-Tigger    7         1706 bp    0.20 %

Unclassified:       196        45831 bp    5.31 %

Total interspersed repeats:   557505 bp   64.56 %

Small RNA:            3          823 bp    0.10 %
Satellites:           2          237 bp    0.03 %
Simple repeats:      94         4472 bp    0.52 %
Low complexity:      18          766 bp    0.09 %
==================================================
* most repeats fragmented by insertions or deletions have been counted as one element

The query species was assumed to be homo
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
run with rmblastn version 2.2.27+
The query was compared to classified sequences in ".../consensi.fa.classifiednoProtFinal"

Best
Quanwei

2017-09-11 14:33 GMT-04:00 Quanwei Zhang :

> Dear Carson:
>
> I see. Thank you. I will try it.
>
> Best
> Quanwei
>
> 2017-09-11 13:46 GMT-04:00 Carson Holt :
>
>> Each node is a single machine. Because you currently run without MPI,
>> each MAKER job you submit runs on a single machine. So you are either
>> running multiple times on the same node, or you submitted 5 separate batch
>> jobs, in which case you may have a single maker process on each of 5 nodes.
>>
>> MPI can parallelize on the same node or across nodes. If you request 10
>> nodes, then it can communicate across nodes to run the job on all hardware.
>> Or you can run MPI on a single node and ask for all CPUs on that node. In
>> that case it will split up work within a single node and use all resources
>> just on that node. So if you can't get MPI to work across nodes, you can
>> just submit a job that goes to a single node and ask for all CPUs on that
>> node (multinode jobs may be hard to configure, but single node jobs are
>> very easy). Just set the -n parameter of mpiexec to the CPU count of that
>> node, and it will parallelize within the node.
>>
>> Example command for a 20 CPU node --> mpiexec -n 20 maker
>>
>> --Carson

From mathog at caltech.edu Wed Sep 13 13:01:11 2017
From: mathog at caltech.edu (mathog)
Date: Wed, 13 Sep 2017 11:01:11 -0700
Subject: [maker-devel] OpenMPI issues, no response in two attempts to subscribe to list

Greetings,

I'm trying to run maker 2.31.9 with OpenMPI on a CentOS 6.7 system. It just won't start. OpenMPI works fine with a small test program; it just doesn't work with maker. It fails in exactly the same way on a second CentOS system with minor software differences (CentOS 6.9 and perl 5.20 compiled without thread support; the perl on the first machine had thread support).

The gory details were already posted in a CentOS forum, so rather than repeat it all here, this is a link to that thread:

https://www.centos.org/forums/viewtopic.php?f=14&t=64099

maker was unpacked from maker-2.31.9.tgz a second time (after moving the original aside), after adding "module add openmpi-x86_64" to my .bash_profile and logging in cleanly. It was rebuilt. The build messages were identical to the previous ones, and when a run was attempted it failed in exactly the same way.

I also tried to subscribe to the list here

https://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

once yesterday and once today, but no email ever came back. Hopefully this message gets through!
Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From carsonhh at gmail.com Wed Sep 13 13:23:11 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Sep 2017 12:23:11 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To:
References: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
 <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
 <9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
 <92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
 <488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
 <3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
Message-ID: <6B4E0F47-7186-4B12-97A1-EB63D2EB7079@gmail.com>

These are the 3 errors you have shown in your e-mails:

open3: fork failed: Cannot allocate memory at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.

The first two are memory related; the second occurs because MAKER cannot kill a lock-maintainer thread that it was never able to start for lack of memory.

The third one is IO related. It comes from a truncated file, and according to the e-mail you sent the contig succeeded on the second try.

IO errors are quite common with NFS (network mounted file systems). It's one of the most frequent issues submitted to the devel list. MAKER can hit IO limits long before it hits CPU limits. One of the most frequent causes of these issues is that the user set TMP= in the control files to a manual location that is not suitable for high IO (note TMP= defaults to /tmp). The location should always be a true locally mounted disk. Sometimes the temporary location is virtual (not really local disk, but network mounted disk or an in-memory location). With the former you will get frequent IO failures, and with the latter you will also get out of memory issues.
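As a rough illustration of the check described above, here is a minimal Python sketch (not part of MAKER) that reports whether a directory such as /tmp sits on a local disk, a network mount, or an in-memory filesystem. It assumes a Linux node that exposes /proc/mounts; the function names and the two filesystem-type sets are made up for this example and are certainly not exhaustive.

```python
import os

# Illustrative, non-exhaustive sets of filesystem types (assumptions).
NETWORK_FS = {"nfs", "nfs4", "cifs", "lustre", "gpfs", "beegfs", "ceph", "glusterfs"}
MEMORY_FS = {"tmpfs", "ramfs"}


def parse_mounts(text):
    """Return (mount_point, fs_type) pairs from /proc/mounts-style text."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            mounts.append((fields[1], fields[2]))
    return mounts


def fs_type_for(path, mounts):
    """Return the fs type of the longest mount point that contains path."""
    path = os.path.abspath(path)
    best = ("", "unknown")
    for mount_point, fs_type in mounts:
        prefix = mount_point.rstrip("/") + "/"
        # Later entries win ties, since they shadow earlier mounts.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best[0]):
            best = (mount_point, fs_type)
    return best[1]


def classify(fs_type):
    """Map a filesystem type to 'network', 'memory', or 'local'."""
    if fs_type in NETWORK_FS:
        return "network"
    if fs_type in MEMORY_FS:
        return "memory"
    return "local"


def check_tmp(path="/tmp", mounts_file="/proc/mounts"):
    """Classify the filesystem backing path on the current node."""
    with open(mounts_file) as handle:
        mounts = parse_mounts(handle.read())
    return classify(fs_type_for(path, mounts))
```

Calling `check_tmp("/tmp")` from an interactive job on the compute node (or passing whatever directory TMP= points to) gives a quick first answer; anything other than "local" is worth a closer look with the cluster administrators.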
Note that when you supply more data files you will also use more memory (to hold analysis results). According to your e-mail the last error you got was 'Can't kill a non-numeric process ID'. Correct? So getting the error with two input files but not when you supply a single input file further suggests you are running low on RAM.

Some things to check:

1. Make sure TMP= is not being set to a network mounted location.
2. Make sure your temporary directory is not a virtual in-memory directory on the node being used.
3. If nodes are shared, you may run out of memory because of other users or because you failed to request enough RAM during job submission.

Finally, try running interactively so you can see what the memory and directory locations look like on the node you get assigned for the job (check space and mount points: is /tmp, or wherever you set TMP=, in fact a local disk?). Also run with MPI rather than starting multiple MAKER instances. It uses resources better.

Thanks,
Carson

> On Sep 13, 2017, at 8:32 AM, Quanwei Zhang wrote:
>
> Dear Carson:
>
> I did more tests on one of the contigs (with length 863kb) that failed when doing repeat masking. I found it only fail when I added the species specific repeat library, and it can be successfully annotated when only considering mammalian repeat library. When I did the test I only picked the this contig and run maker with 64G memory. So I think the failure should not be the problem with memory or IO, because even the contigs with length 98Mb can be annotated with memory 32G.
>
> I also run RepeatMasker on this contig with mammalian and species specific repeat library, separately. I found when I use mammalian repeat library, about 35% was masked as repeats, while it is 65% when I use species specific repeat library (as shown below in blue). I wonder whether the high level of repeats can lead to the failure of this contig. Do you have any ideas about this.
Thanks > > > > file name: test_scaffold31.fasta > sequences: 1 > total length: 863590 bp (858757 bp excl N/X-runs) > GC level: 37.02 % > bases masked: 562909 bp ( 65.18 %) > ================================================== > number of length percentage > elements* occupied of sequence > -------------------------------------------------- > SINEs: 113 16134 bp 1.87 % > ALUs 71 12479 bp 1.45 % > MIRs 1 133 bp 0.02 % > > LINEs: 251 380142 bp 44.02 % > LINE1 211 210623 bp 24.39 % > LINE2 1 86 bp 0.01 % > L3/CR1 0 0 bp 0.00 % > > LTR elements: 246 101221 bp 11.72 % > ERVL 5 1037 bp 0.12 % > ERVL-MaLRs 18 2744 bp 0.32 % > ERV_classI 201 90942 bp 10.53 % > ERV_classII 18 5964 bp 0.69 % > > DNA elements: 39 14177 bp 1.64 % > hAT-Charlie 7 3864 bp 0.45 % > TcMar-Tigger 7 1706 bp 0.20 % > > Unclassified: 196 45831 bp 5.31 % > > Total interspersed repeats: 557505 bp 64.56 % > > > Small RNA: 3 823 bp 0.10 % > > Satellites: 2 237 bp 0.03 % > Simple repeats: 94 4472 bp 0.52 % > Low complexity: 18 766 bp 0.09 % > ================================================== > > * most repeats fragmented by insertions or deletions > have been counted as one element > > > The query species was assumed to be homo > RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127 > > run with rmblastn version 2.2.27+ > The query was compared to classified sequences in ".../consensi.fa.classifiednoProtFinal" > > > Best > Quanwei > > 2017-09-11 14:33 GMT-04:00 Quanwei Zhang >: > Dear Carson: > > I see. Thank you. I will try it. > > Best > Quanwei > > 2017-09-11 13:46 GMT-04:00 Carson Holt >: > Each node is a single machine. Because you currently run without MPI, each MAKER job you submit runs on a single machine. So you are either running multiple times on the same node, or you submitted 5 separate batch jobs in which case you may have a single maker process on each of 5 nodes. > > MPI can parallelize on the same node or across nodes. 
If you request 10 nodes, then it can communicate across nodes to run the job on all hardware. Or you can run MPI on a single node and ask for all CPUs on that node. In that case it will split up work within a single node and use all resources just on that node. So if you can?t get MPI to work across nodes, you can just submit a job that goes to a single node and ask for all CPUs on that node (multinode jobs may be hard to configure, but single node jobs are very easy). Just set the -n parameter of mpiexec to the CPU count of that node, and it will parallelize within the node. > > Example command for a 20 CPU node ?> mpiexec -n 20 maker > > ?Carson > > > > > >> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang > wrote: >> >> Dear Carson: >> >> Would you please explain what do you mean by "a single machine"? I am running maker2 on our high performance cluster. The cluster has more than 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine was used as the scheduler. Can I use MPICH3? >> >> Thanks >> >> Best >> Quanwei >> >> 2017-09-11 13:18 GMT-04:00 Carson Holt >: >> If you are just using a single machine (and not cross machine MPI), use MPICH3 ?> https://www.mpich.org >> >> It?s easy to install yourself, and tends to be very robust to failure. >> >> ?Carson >> >> >> >>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang > wrote: >>> >>> Dear Carson: >>> >>> I met some problems to use MPI. I will give it another try. >>> Thank you! >>> >>> Best >>> Quanwei >>> >>> 2017-09-11 13:14 GMT-04:00 Carson Holt >: >>> It could be either. Please use MPI instead of starting multiple instances. It will greatly reduce both IO and RAM usage. >>> >>> ?Carson >>> >>> >>> >>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang > wrote: >>>> >>>> Dear Carson: >>>> >>>> I only run 5 Maker instances in each directory (and set cpus=2). 
If it is related to memory issue or an IO issue, I am not sure why the much longer scaffolds (than the failed ones) were all annotated successfully, but the relatively shorter ones failed. >>>> >>>> I have set "tries=5" (#number of times to try a contig if there is a failure for some reason). I will try "clean_try=1" and test on the failed scaffolds individually with larger memory to see whether they can be annotated. >>>> >>>> Thank you! >>>> >>>> Best >>>> Quanwei >>>> >>>> 2017-09-11 13:07 GMT-04:00 Carson Holt >: >>>> I think the cause of the error may have been a little further upstream from what you pasted in the e-mail. One thing that may be happening is that you are taxing resources (like IO) if running MAKER multiple times or on too many CPUs. That can lead to failures because of truncated BLAST reports etc. In which case you can just retry and that will get around those types of IO derived errors. MAKER can generate a lot of IO, and if you are working on network mounted locations (i.e. the storage being used is actually across the network), then they can be lest robust than local storage (when under heavy load NFS can falsely report success on read/write operations that actually failed). It?s the reason we built in the retry capabilities of MAKER. >>>> >>>> For contigs that continuously fail, you may need to set clean_try=1. That will cause failures to start from scratch (i.e. delete all old reports on failure rather than just those suspected of being truncated). >>>> >>>> ?Carson >>>> >>>> >>>> >>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang > wrote: >>>>> >>>>> Dear Carson: >>>>> >>>>> About the error in my above email, I found the contig was correctly annotated at the second time RETRY. So please ignore my last email. But now, for a few number of scaffolds, I met problems to process the repeats (as shown below in red). 
I used both Mammalia repeat library and species specific repeat library (which is generated by your pipeline "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic "). There were no such problems when I only used Mammalia repeat library. Do you have any ideas about this? What could be the reason? Or do you have any suggestions for me to find the reason? Many thanks >>>>> >>>>> Here are some parameters I used >>>>> >>>>> #-----Repeat Masking (leave values blank to skip repeat masking) >>>>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker >>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe >>>>> >>>>> max_dna_len=300000 >>>>> split_hit=40000 >>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff) >>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff) >>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff) >>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking >>>>> >>>>> >>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188. >>>>> 33708 --> rank=NA, hostname=n409 >>>>> 33709 ERROR: Failed while processing all repeats >>>>> 33710 ERROR: Chunk failed at level:3, tier_type:1 >>>>> 33711 FAILED CONTIG:Contig31 >>>>> >>>>> >>>>> Best >>>>> Quanwei >>>>> >>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang >: >>>>> Dear Carson: >>>>> >>>>> I got the following error again. Is this still related to memory issues? I wonder whether there can be other reasons lead to this error? This time, I got this error during training of the SNAP model. Before, even I set max_dna_len=1Mb, I can train the model successfully. And in the current training (where I get the following error), I have decreased the max_dna_len to 300kb. I required the same amount memory as before. 
The only difference is that I am using both mammalian repeat library and species specific repeat library, while previously I only use the mammalian repeat library. Will it greatly increases the requirement of memory to use both repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? I have also set the depth_blast as 30 in current training. >>>>> >>>>> Thank you! Have a nice weekend! >>>>> >>>>> >>>>> >>>>> #--------------------------------------------------------------------- >>>>> Now starting the contig!! >>>>> SeqID: Contig10 >>>>> Length: 18773588 >>>>> #--------------------------------------------------------------------- >>>>> >>>>> >>>>> setting up GFF3 output and fasta chunks >>>>> doing repeat masking >>>>> doing blastx repeats >>>>> doing blastx repeats >>>>> doing blastx repeats >>>>> doing blastx repeats >>>>> doing blastx repeats >>>>> doing blastx repeats >>>>> doing blastx repeats >>>>> doing blastx repeats >>>>> doing blastx repeats >>>>> doing blastx repeats >>>>> collecting blastx repeatmasking >>>>> processing all repeats >>>>> doing repeat masking >>>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050. >>>>> --> rank=NA, hostname=n224 >>>>> ERROR: Failed while doing repeat masking >>>>> ERROR: Chunk failed at level:0, tier_type:1 >>>>> FAILED CONTIG:Contig10 >>>>> >>>>> ERROR: Chunk failed at level:2, tier_type:0 >>>>> FAILED CONTIG:Contig10 >>>>> >>>>> Best >>>>> Quanwei >>>>> >>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt >: >>>>> >>>>>> (2) By reading some of your replies in the maker google group, and I noticed that it can reduce memory and save time for annotation if I set depth_blast to a certain number. So I changed the following parameters. But I wonder, whether it will decrease the quality of annotation? If it won't affect the quality, can I even use a smaller number (e.g., 20) to save more memory and time? 
>>>>>> >>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff) >>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff) >>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff) >>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking >>>>> >>>>> This values really only affects the final evidence kept in the GFF3 when you look at it in a browser. It has not affect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. by setting it lower, you are just letting MAKER know it can through things away even sooner since you don?t want them in the GFF3. It provides a minor improvement for memory use, but max_dna_length is the big one that has the greatest effect. >>>>> >>>>> >>>>>> (3) I also have some concerns about the speed, especially for the long scaffolds (around 100Mb). I wonder which part is the most time consuming for genome annotation (repeat masking, blast, or polishing?). Particularly, I wonder whether the blastx of protein evidence will take majority of time. Now, I have prepared 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences as protein evidences. I am considering whether I can save much time if I only use the 99k mammalian Swiss protein sequences as evidences. >>>>> >>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space >>>>> BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN >>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX >>>>> >>>>> Also double the dataset size, double the runtime. Larger window sizes via max_dna_length will also increase runtimes. >>>>> >>>>> >>>>>> (4) For some reasons, I can not run maker though MPI on our cluster. So I can only start multiple maker. 
>>>>>> I wonder if it is possible to let multiple maker to annotate the same long scaffold (i.e., for a single sequence I start multiple maker, without splitting the long sequence into shorter ones).
>>>>>
>>>>> Without MPI you won't be able to split up large contigs. At the very least you can try and run on a single node and set MPI to use all CPUs on that node. It's less difficult to set up compared to cross node jobs via MPI.
>>>>>
>>>>>> (5) Still about the speed issue. I read some of your comments about "cpus" parameters in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html). And I know it indicate the number of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?
>>>>>
>>>>> The cpu parameter only affects how many CPUs are given to the blast command line. So only the BLAST step will speed up, so I recommend using MPI to get all steps to speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
>>>>>
>>>>> --Carson

From carsonhh at gmail.com Wed Sep 13 13:26:08 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Sep 2017 12:26:08 -0600
Subject: [maker-devel] Repeats annotation
In-Reply-To:
References:
Message-ID: <40F80C42-836A-41FF-9C9F-1F45C5816283@gmail.com>

I don't know of any tool to analyze the repeat info. MAKER really only focuses on getting the masking done for the gene prediction, and while it does keep the repeats as features in the GFF3, it does not do any kind of analysis. You would have to do that outside of MAKER.
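For anyone who wants a starting point for that outside analysis, the following is a minimal Python sketch (not a MAKER script) that tallies repeat features from a GFF3 file by repeat class. It rests on assumptions that should be checked against your own output: that repeat features carry `repeatmasker` or `repeatrunner` in the source column with type `match`, and that the `Name` attribute looks like `species:...|genus:CLASS`. The function names are hypothetical, and overlapping features are not merged, so the totals will not match a RepeatMasker summary table exactly.

```python
from collections import defaultdict
from urllib.parse import unquote

# Assumed source-column values for repeat features; verify in your GFF3.
REPEAT_SOURCES = {"repeatmasker", "repeatrunner"}


def repeat_class(attributes):
    """Pull a repeat class out of the Name attribute (assumed layout)."""
    for field in attributes.split(";"):
        if field.startswith("Name="):
            name = unquote(field[len("Name="):])
            if "genus:" in name:
                return name.split("genus:", 1)[1]
            return name
    return "unknown"


def summarize_repeats(lines):
    """Return {repeat_class: (feature_count, total_bp)} for repeat matches."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more features
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        if cols[1] not in REPEAT_SOURCES or cols[2] != "match":
            continue
        start, end = int(cols[3]), int(cols[4])
        entry = totals[repeat_class(cols[8])]
        entry[0] += 1
        entry[1] += end - start + 1  # GFF3 coordinates are 1-based, inclusive
    return {key: tuple(value) for key, value in totals.items()}
```

Opening the GFF3 with `open(path)` and passing the file handle to `summarize_repeats` is enough, since the function only iterates over lines.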
--Carson

> On Sep 13, 2017, at 8:51 AM, Quanwei Zhang wrote:
>
> Dear Carson:
>
> We have generated species specific repeat library following your pipeline (http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic). And did genome annotation by maker2 by using both species specific repeat library and mammalian repeat library.
>
> Now, we want to do some comparison about the repeat contexts among different species. So I want to generate species specific for other species and also use both their species specific repeat library and mammalian repeat library. But I found, I can only provide either the species specific repeat library or mammalian repeat library to RepeatMasker (not for both). I wonder whether I can run maker2 on those genome but only for repeat masking.
>
> BTW, by running RepeatMasker we can get a summary report (as below), I wonder whether there is any script from maker2 to analyze repeats element (or other tools to process the output of maker2).
>
> Many thanks
>
> file name: test_scaffold31.fasta
> sequences:            1
> total length:    863590 bp  (858757 bp excl N/X-runs)
> GC level:         37.02 %
> bases masked:    301634 bp ( 34.93 %)
> ==================================================
>                number of      length   percentage
>                elements*    occupied  of sequence
> --------------------------------------------------
> SINEs:              134        14362 bp    1.66 %
>       Alu/B1         28         2183 bp    0.25 %
>       MIRs           21         2860 bp    0.33 %
>
> LINEs:              188       129104 bp   14.95 %
>       LINE1         168       124633 bp   14.43 %
>       LINE2          16         4266 bp    0.49 %
>       L3/CR1          4          205 bp    0.02 %
>       RTE             0            0 bp    0.00 %
>
> LTR elements:       127       101129 bp   11.71 %
>       ERVL           10         3057 bp    0.35 %
>       ERVL-MaLRs     22         6902 bp    0.80 %
>       ERV_classI     66        80258 bp    9.29 %
>       ERV_classII    29        10912 bp    1.26 %
>
> DNA elements:        27         4402 bp    0.51 %
>       hAT-Charlie    13         1836 bp    0.21 %
>       TcMar-Tigger    8         1651 bp    0.19 %
>
> Unclassified:         4         1590 bp    0.18 %
>
> Total interspersed repeats:   250587 bp   29.02 %
>
>
> Small RNA:            9          616 bp    0.07 %
>
> Satellites:          66        40820 bp    4.73 %
> Simple repeats:     159         7235 bp    0.84 %
> Low complexity:      50         2766 bp    0.32 %
> ==================================================
>
> * most repeats fragmented by insertions or deletions
>   have been counted as one element
>
> The query species was assumed to be mammalia
> RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
>
> run with rmblastn version 2.2.27+

From carsonhh at gmail.com Wed Sep 13 13:41:24 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Sep 2017 12:41:24 -0600
Subject: [maker-devel] OpenMPI issues, no response in two attempts to subscribe to list
In-Reply-To:
References:
Message-ID:

Hi David,

First thing. MAKER binds shared C libraries using Perl, so you have to tell MAKER where to find the needed files before you install it. Then it compiles the bindings and saves them for MAKER to use.
If you have two MPI installations, MAKER may have been set up to use one of them while you are trying to call it with the other one. That would break the compiled bindings. Also make sure you did the following (info from the .../maker/INSTALL instructions file):

"make sure to set LD_PRELOAD to the location of libmpi.so before even trying to install MAKER. It must also be set before running MAKER (or any program that binds OpenMPI's shared libraries), so it's best just to add it to your ~/.bash_profile. (i.e. export LD_PRELOAD=/usr/local/openmpi/lib/libmpi.so)."

Remember to replace '/usr/local/openmpi/lib/libmpi.so' with the actual location of the file.

Second, once you can get maker to start under OpenMPI, you may get freezes or failures part way into a run, because the OpenFabrics libraries use registered memory in a weird way that can cause system calls in a program to fail with a snowballing error effect. Adding this to the mpiexec options can stop this from occurring:

'-mca btl ^openib'

That option has the side effect of disabling infiniband and using the ethernet adapter instead. However, if you need to use the infiniband adapter, you can use this flag instead:

'--mca btl vader,tcp,self --mca btl_tcp_if_include ib0'

That command will use IP over infiniband rather than native infiniband, which has the same effect of disabling the OpenFabrics libraries.

Thanks,
Carson

> On Sep 13, 2017, at 12:01 PM, mathog wrote:
>
> Greetings,
>
> I'm trying to run maker 2.31.9 with OpenMPI on a Centos 6.7 system. It just won't start. OpenMPI works fine with a small test program, it just doesn't work with maker. It fails in exactly the same way on a second Centos system with minor software differences (Centos 6.9 and perl 5.20 compiled without thread support, the perl on the first machine had thread support.)
> The gory details were posted already in a Centos forum so rather than repeat it all here, this is a link to that thread:
>
> https://www.centos.org/forums/viewtopic.php?f=14&t=64099
>
> maker was unpacked from the maker-2.31.9.tgz a second time (after moving the original) after setting up the "module add openmpi-x86_64" to my .bash_profile
> and logging in cleanly. It was rebuilt. The build messages were identical to the previous ones and when a run was attempted it also failed in exactly the same way.
>
> I also tried to subscribe to the list here
>
> https://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
> once yesterday, and once today, but no email ever came back. Hopefully this message gets through!
>
> Regards,
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From qwzhang0601 at gmail.com Wed Sep 13 14:42:01 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 13 Sep 2017 15:42:01 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <6B4E0F47-7186-4B12-97A1-EB63D2EB7079@gmail.com>
References: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
 <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
 <9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
 <92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
 <488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
 <3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
 <6B4E0F47-7186-4B12-97A1-EB63D2EB7079@gmail.com>
Message-ID:

Dear Carson:

Thank you for your explanation. Sorry for not describing my problem clearly. The first two errors all went away after I changed the parameters you suggested (e.g., max_dna_len, depth_blast). Now I only get the following error, for two contigs among thousands of contigs.
One of the two failed contigs has length 863 kb, and I have done more tests on this contig individually. When running RepeatMasker on this contig, 65% was masked when using the species specific repeat library, while only 35% was masked when using the mammalian repeat library. Since longer contigs (even 98Mb) can all be annotated, I doubt that this much shorter one would fail because of IO.

I did not set "TMP", and I am running on a high performance cluster. I am not sure whether the temporary directory is a virtual (in-memory) location or not. I will check it later.

Many thanks

Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
33708 --> rank=NA, hostname=n409
33709 ERROR: Failed while processing all repeats
33710 ERROR: Chunk failed at level:3, tier_type:1
33711 FAILED CONTIG:Contig31

Best
Quanwei

2017-09-13 14:23 GMT-04:00 Carson Holt :

> These are the 3 errors you have shown in your e-mails:
>
> open3: fork failed: Cannot allocate memory at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>
> The first two are memory related with the second being because it cannot
> kill a lock maintainer thread that it was not able to start because of lack
> of memory.
>
> The third one is IO related. It is a truncated file that succeeded on the
> second try according to the e-mail you sent.
>
> IO errors are quite common with NFS (network mounted file systems). It's
> one of the most frequent issues submitted to the devel list. MAKER can hit
> IO limits long before it hits CPU limits. One of the most frequent causes
> of these issues is that the user set TMP= in the control files to a manual
> location that is not suitable for high IO (note TMP= defaults to /tmp). The
> location should always be a true locally mounted disk.
Sometimes this is a > virtual location (not really local disk but network mounted disk or an in > memory location). With the former you will get frequent IO failures and > with the latter you will also get out of memory issues. > > Note that when you supply more data files you will also use more memory > (to hold analysis results). According to your e-mail the last error you got > was 'Can't kill a non-numeric process ID?. Correct? So getting the error > with two input files but not when you supply a single input file further > suggests you are running low on RAM. > > 1. Some things to check. Make sure TMP= is not being set to a network > mounted location. > 2. Make sure your temporary directory is not a virtual in memory directory > on the node being used. > 3. If nodes are shared, you may run out of memory because of other users > or because you failed to request enough RAM during job submission. > > Finally, try running interactively so you can see what the memory and > directory locations look like on the node you get assigned for the job > (check space and mount points. Is /tmp or whereever you set TMP= in fact a > local disk?). Also run with MPI rather than starting multiple MAKER > instances. It uses resources better. > > Thanks, > Carson > > > > > > > On Sep 13, 2017, at 8:32 AM, Quanwei Zhang wrote: > > Dear Carson: > > I did more tests on one of the contigs (with length 863kb) that failed > when doing repeat masking. I found it only fail when I added the species > specific repeat library, and it can be successfully annotated when only > considering mammalian repeat library. When I did the test I only picked the > this contig and run maker with 64G memory. So I think the failure should > not be the problem with memory or IO, because even the contigs with length > 98Mb can be annotated with memory 32G. > > I also run RepeatMasker on this contig with mammalian and species specific > repeat library, separately. 
I found when I use mammalian repeat library, > about 35% was masked as repeats, while it is 65% when I use species > specific repeat library (as shown below in blue). I wonder whether the high > level of repeats can lead to the failure of this contig. Do you have any > ideas about this. Thanks > > > > file name: test_scaffold31.fasta > sequences: 1 > total length: 863590 bp (858757 bp excl N/X-runs) > GC level: 37.02 % > bases masked: 562909 bp ( 65.18 %) > ================================================== > number of length percentage > elements* occupied of sequence > -------------------------------------------------- > SINEs: 113 16134 bp 1.87 % > ALUs 71 12479 bp 1.45 % > MIRs 1 133 bp 0.02 % > > LINEs: 251 380142 bp 44.02 % > LINE1 211 210623 bp 24.39 % > LINE2 1 86 bp 0.01 % > L3/CR1 0 0 bp 0.00 % > > LTR elements: 246 101221 bp 11.72 % > ERVL 5 1037 bp 0.12 % > ERVL-MaLRs 18 2744 bp 0.32 % > ERV_classI 201 90942 bp 10.53 % > ERV_classII 18 5964 bp 0.69 % > > DNA elements: 39 14177 bp 1.64 % > hAT-Charlie 7 3864 bp 0.45 % > TcMar-Tigger 7 1706 bp 0.20 % > > Unclassified: 196 45831 bp 5.31 % > > Total interspersed repeats: 557505 bp 64.56 % > > > Small RNA: 3 823 bp 0.10 % > > Satellites: 2 237 bp 0.03 % > Simple repeats: 94 4472 bp 0.52 % > Low complexity: 18 766 bp 0.09 % > ================================================== > > * most repeats fragmented by insertions or deletions > have been counted as one element > > > The query species was assumed to be homo > RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127 > > run with rmblastn version 2.2.27+ > The query was compared to classified sequences in ".../consensi.fa.classifiednoProtFinal" > > > > Best > Quanwei > > 2017-09-11 14:33 GMT-04:00 Quanwei Zhang : > >> Dear Carson: >> >> I see. Thank you. I will try it. >> >> Best >> Quanwei >> >> 2017-09-11 13:46 GMT-04:00 Carson Holt : >> >>> Each node is a single machine. 
Because you currently run without MPI, >>> each MAKER job you submit runs on a single machine. So you are either >>> running multiple times on the same node, or you submitted 5 separate batch >>> jobs in which case you may have a single maker process on each of 5 nodes. >>> >>> MPI can parallelize on the same node or across nodes. If you request 10 >>> nodes, then it can communicate across nodes to run the job on all hardware. >>> Or you can run MPI on a single node and ask for all CPUs on that node. In >>> that case it will split up work within a single node and use all resources >>> just on that node. So if you can?t get MPI to work across nodes, you can >>> just submit a job that goes to a single node and ask for all CPUs on that >>> node (multinode jobs may be hard to configure, but single node jobs are >>> very easy). Just set the -n parameter of mpiexec to the CPU count of that >>> node, and it will parallelize within the node. >>> >>> Example command for a 20 CPU node ?> mpiexec -n 20 maker >>> >>> ?Carson >>> >>> >>> >>> >>> >>> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang >>> wrote: >>> >>> Dear Carson: >>> >>> Would you please explain what do you mean by "a single machine"? I am >>> running maker2 on our high performance cluster. The cluster has more than >>> 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine was used >>> as the scheduler. Can I use MPICH3? >>> >>> Thanks >>> >>> Best >>> Quanwei >>> >>> 2017-09-11 13:18 GMT-04:00 Carson Holt : >>> >>>> If you are just using a single machine (and not cross machine MPI), use >>>> MPICH3 ?> https://www.mpich.org >>>> >>>> It?s easy to install yourself, and tends to be very robust to failure. >>>> >>>> ?Carson >>>> >>>> >>>> >>>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang >>>> wrote: >>>> >>>> Dear Carson: >>>> >>>> I met some problems to use MPI. I will give it another try. >>>> Thank you! 
>>>> >>>> Best >>>> Quanwei >>>> >>>> 2017-09-11 13:14 GMT-04:00 Carson Holt : >>>> >>>>> It could be either. Please use MPI instead of starting multiple >>>>> instances. It will greatly reduce both IO and RAM usage. >>>>> >>>>> ?Carson >>>>> >>>>> >>>>> >>>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang >>>>> wrote: >>>>> >>>>> Dear Carson: >>>>> >>>>> I only run 5 Maker instances in each directory (and set cpus=2). If it >>>>> is related to memory issue or an IO issue, I am not sure why the much >>>>> longer scaffolds (than the failed ones) were all annotated successfully, >>>>> but the relatively shorter ones failed. >>>>> >>>>> I have set "tries=5" (#number of times to try a contig if there is a >>>>> failure for some reason). I will try "clean_try=1" and test on the failed >>>>> scaffolds individually with larger memory to see whether they can be >>>>> annotated. >>>>> >>>>> Thank you! >>>>> >>>>> Best >>>>> Quanwei >>>>> >>>>> 2017-09-11 13:07 GMT-04:00 Carson Holt : >>>>> >>>>>> I think the cause of the error may have been a little further >>>>>> upstream from what you pasted in the e-mail. One thing that may be >>>>>> happening is that you are taxing resources (like IO) if running MAKER >>>>>> multiple times or on too many CPUs. That can lead to failures because of >>>>>> truncated BLAST reports etc. In which case you can just retry and that will >>>>>> get around those types of IO derived errors. MAKER can generate a lot of >>>>>> IO, and if you are working on network mounted locations (i.e. the storage >>>>>> being used is actually across the network), then they can be lest robust >>>>>> than local storage (when under heavy load NFS can falsely report success on >>>>>> read/write operations that actually failed). It?s the reason we built in >>>>>> the retry capabilities of MAKER. >>>>>> >>>>>> For contigs that continuously fail, you may need to set clean_try=1. >>>>>> That will cause failures to start from scratch (i.e. 
delete all old reports on failure rather than just those suspected of being truncated).
>>>>>>
>>>>>> --Carson
>>>>>>
>>>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang wrote:
>>>>>>
>>>>>> Dear Carson:
>>>>>>
>>>>>> About the error in my previous email, I found the contig was correctly annotated on the second RETRY. So please ignore my last email. But now, for a small number of scaffolds, I have problems processing the repeats (as shown below in red). I used both the Mammalia repeat library and a species-specific repeat library (generated by your pipeline "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic"). There were no such problems when I only used the Mammalia repeat library. Do you have any ideas about this? What could be the reason? Or do you have any suggestions for how to find the reason? Many thanks
>>>>>>
>>>>>> Here are some parameters I used
>>>>>>
>>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
>>>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
>>>>>>
>>>>>> max_dna_len=300000
>>>>>> split_hit=40000
>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>>
>>>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>>>>>> 33708 --> rank=NA, hostname=n409
>>>>>> 33709 ERROR: Failed while processing all repeats
>>>>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>>>>> 33711 FAILED CONTIG:Contig31
>>>>>>
>>>>>> Best
>>>>>> Quanwei
>>>>>>
>>>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang:
>>>>>>
>>>>>>> Dear Carson:
>>>>>>>
>>>>>>> I got the following error again. Is this still related to memory issues? I wonder whether other causes could lead to this error. This time, I got this error during training of the SNAP model. Before, even when I set max_dna_len=1Mb, I could train the model successfully. And in the current training (where I get the following error), I have decreased max_dna_len to 300kb. I requested the same amount of memory as before. The only difference is that I am using both the mammalian repeat library and the species-specific repeat library, while previously I only used the mammalian repeat library. Will using both repeat libraries greatly increase the memory requirement (even when I decrease max_dna_len from 1Mb to 300kb)? I have also set depth_blast to 30 in the current training.
>>>>>>>
>>>>>>> Thank you! Have a nice weekend!
>>>>>>>
>>>>>>> #---------------------------------------------------------------------
>>>>>>> Now starting the contig!!
>>>>>>> SeqID: Contig10
>>>>>>> Length: 18773588
>>>>>>> #---------------------------------------------------------------------
>>>>>>>
>>>>>>> setting up GFF3 output and fasta chunks
>>>>>>> doing repeat masking
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> collecting blastx repeatmasking
>>>>>>> processing all repeats
>>>>>>> doing repeat masking
>>>>>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
>>>>>>> --> rank=NA, hostname=n224
>>>>>>> ERROR: Failed while doing repeat masking
>>>>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>>>>> FAILED CONTIG:Contig10
>>>>>>>
>>>>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>>>>> FAILED CONTIG:Contig10
>>>>>>>
>>>>>>> Best
>>>>>>> Quanwei
>>>>>>>
>>>>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt:
>>>>>>>
>>>>>>>> (2) By reading some of your replies in the maker google group, I noticed that it can reduce memory and save time for annotation if I set depth_blast to a certain number. So I changed the following parameters. But I wonder whether it will decrease the quality of annotation? If it won't affect the quality, can I even use a smaller number (e.g., 20) to save more memory and time?
>>>>>>>>
>>>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>>>>
>>>>>>>> These values really only affect the final evidence kept in the GFF3 when you look at it in a browser.
It has no effect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. By setting it lower, you are just letting MAKER know it can throw things away even sooner since you don't want them in the GFF3. It provides a minor improvement for memory use, but max_dna_len is the big one that has the greatest effect.
>>>>>>>>
>>>>>>>> (3) I also have some concerns about the speed, especially for the long scaffolds (around 100Mb). I wonder which part is the most time consuming for genome annotation (repeat masking, blast, or polishing?). In particular, I wonder whether the blastx of protein evidence will take the majority of the time. I have prepared 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences as protein evidence. I am considering whether I can save much time if I only use the 99k mammalian Swiss protein sequences as evidence.
>>>>>>>>
>>>>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN
>>>>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>>>>>
>>>>>>>> Also double the dataset size, double the runtime. Larger window sizes via max_dna_len will also increase runtimes.
>>>>>>>>
>>>>>>>> (4) For some reason, I cannot run maker through MPI on our cluster. So I can only start multiple maker instances. I wonder if it is possible to let multiple maker instances annotate the same long scaffold (i.e., for a single sequence I start multiple maker instances, without splitting the long sequence into shorter ones).
>>>>>>>>
>>>>>>>> Without MPI you won't be able to split up large contigs. At the very least you can try to run on a single node and set MPI to use all CPUs on that node. It's less difficult to set up compared to cross node jobs via MPI.
>>>>>>>>
>>>>>>>> (5) Still about the speed issue. I read some of your comments about the "cpus" parameter in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html). And I know it indicates the number of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?
>>>>>>>>
>>>>>>>> The cpus parameter only affects how many CPUs are given to the blast command line. So only the BLAST step will speed up, so I recommend using MPI to get all steps to speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
>>>>>>>>
>>>>>>>> --Carson

From carsonhh at gmail.com Wed Sep 13 15:21:14 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Sep 2017 14:21:14 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To:
References: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com> <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com> <9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com> <92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com> <488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com> <3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com> <6B4E0F47-7186-4B12-97A1-EB63D2EB7079@gmail.com>
Message-ID: <55597FCD-B589-4FFB-A8FA-37659A48BF58@gmail.com>

One final thought.
If you are using rmblast as part of the RepeatMasker installation, it may be suffering from a bug that some blast versions suffer from that can sometimes lead to truncation of a blast report (example of a separate error related to blast report truncation here) -> https://groups.google.com/forum/#!topic/maker-devel/96KGgiQMcxQ

As a result there is a special update to rmblast -> http://www.repeatmasker.org/RMBlast.html

So if you are not using the update try it, but if you are using the update and it is giving the error, roll back to rmblast 2.2.28 (i.e. the update may be the cause or the cure of RepeatMasker errors).

--Carson

> On Sep 13, 2017, at 1:42 PM, Quanwei Zhang wrote:
>
> Dear Carson:
>
> Thank you for your explanation. Sorry for not describing my problem clearly. The first two errors were all gone after I changed the parameters you suggested (e.g., max_dna_len, depth_blast). Now I only get the following error for two contigs among thousands of contigs. One of the two failed contigs has length 863k, and I have done more tests on this contig individually. By running RepeatMasker on this contig, 65% was masked when using the species-specific repeat library, while it is only 35% when using the mammalian repeat library. Since longer contigs (even 98Mb) can all be annotated, I doubt that this much shorter one would fail due to IO.
>
> I did not set "TMP", and I am running on a high performance cluster. I am not sure whether it is a virtual memory or not. I will check it later. Many thanks
>
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
> 33708 --> rank=NA, hostname=n409
> 33709 ERROR: Failed while processing all repeats
> 33710 ERROR: Chunk failed at level:3, tier_type:1
> 33711 FAILED CONTIG:Contig31
>
> Best
> Quanwei
>
> 2017-09-13 14:23 GMT-04:00 Carson Holt:
> These are the 3 errors you have shown in your e-mails ->
> open3: fork failed: Cannot allocate memory at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>
> The first two are memory related, with the second occurring because it cannot kill a lock maintainer thread that it was not able to start because of lack of memory.
>
> The third one is IO related. It is a truncated file that succeeded on the second try according to the e-mail you sent.
>
> IO errors are quite common with NFS (network mounted file systems). It's one of the most frequent issues submitted to the devel list. MAKER can hit IO limits long before it hits CPU limits. One of the most frequent causes of these issues is that the user set TMP= in the control files to a manual location that is not suitable for high IO (note TMP= defaults to /tmp). The location should always be a true locally mounted disk. Sometimes this is a virtual location (not really local disk but network mounted disk or an in memory location). With the former you will get frequent IO failures and with the latter you will also get out of memory issues.
>
> Note that when you supply more data files you will also use more memory (to hold analysis results). According to your e-mail the last error you got was 'Can't kill a non-numeric process ID'. Correct? So getting the error with two input files but not when you supply a single input file further suggests you are running low on RAM.
>
> 1. Some things to check. Make sure TMP= is not being set to a network mounted location.
> 2.
Make sure your temporary directory is not a virtual in memory directory on the node being used.
> 3. If nodes are shared, you may run out of memory because of other users or because you failed to request enough RAM during job submission.
>
> Finally, try running interactively so you can see what the memory and directory locations look like on the node you get assigned for the job (check space and mount points. Is /tmp or wherever you set TMP= in fact a local disk?). Also run with MPI rather than starting multiple MAKER instances. It uses resources better.
>
> Thanks,
> Carson
>
>> On Sep 13, 2017, at 8:32 AM, Quanwei Zhang wrote:
>>
>> Dear Carson:
>>
>> I did more tests on one of the contigs (with length 863kb) that failed when doing repeat masking. I found it only fails when I add the species-specific repeat library, and it can be successfully annotated when only the mammalian repeat library is used. For the test I picked only this contig and ran maker with 64G memory. So I think the failure should not be a problem with memory or IO, because even the contigs with length 98Mb can be annotated with 32G memory.
>>
>> I also ran RepeatMasker on this contig with the mammalian and species-specific repeat libraries separately. With the mammalian repeat library, about 35% was masked as repeats, while it is 65% with the species-specific repeat library (as shown below in blue). I wonder whether the high level of repeats can lead to the failure of this contig. Do you have any ideas about this?
>> Thanks
>>
>> file name: test_scaffold31.fasta
>> sequences:            1
>> total length:    863590 bp  (858757 bp excl N/X-runs)
>> GC level:         37.02 %
>> bases masked:    562909 bp ( 65.18 %)
>> ==================================================
>>                number of      length   percentage
>>                elements*    occupied  of sequence
>> --------------------------------------------------
>> SINEs:              113        16134 bp    1.87 %
>>       ALUs           71        12479 bp    1.45 %
>>       MIRs            1          133 bp    0.02 %
>>
>> LINEs:              251       380142 bp   44.02 %
>>       LINE1         211       210623 bp   24.39 %
>>       LINE2           1           86 bp    0.01 %
>>       L3/CR1          0            0 bp    0.00 %
>>
>> LTR elements:       246       101221 bp   11.72 %
>>       ERVL            5         1037 bp    0.12 %
>>       ERVL-MaLRs     18         2744 bp    0.32 %
>>       ERV_classI    201        90942 bp   10.53 %
>>       ERV_classII    18         5964 bp    0.69 %
>>
>> DNA elements:        39        14177 bp    1.64 %
>>       hAT-Charlie     7         3864 bp    0.45 %
>>       TcMar-Tigger    7         1706 bp    0.20 %
>>
>> Unclassified:       196        45831 bp    5.31 %
>>
>> Total interspersed repeats:   557505 bp   64.56 %
>>
>> Small RNA:            3          823 bp    0.10 %
>>
>> Satellites:           2          237 bp    0.03 %
>> Simple repeats:      94         4472 bp    0.52 %
>> Low complexity:      18          766 bp    0.09 %
>> ==================================================
>>
>> * most repeats fragmented by insertions or deletions
>>   have been counted as one element
>>
>> The query species was assumed to be homo
>> RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
>>
>> run with rmblastn version 2.2.27+
>> The query was compared to classified sequences in ".../consensi.fa.classifiednoProtFinal"
>>
>> Best
>> Quanwei
>>
>> 2017-09-11 14:33 GMT-04:00 Quanwei Zhang:
>> Dear Carson:
>>
>> I see. Thank you. I will try it.
>>
>> Best
>> Quanwei
>>
>> 2017-09-11 13:46 GMT-04:00 Carson Holt:
>> Each node is a single machine. Because you currently run without MPI, each MAKER job you submit runs on a single machine. So you are either running multiple times on the same node, or you submitted 5 separate batch jobs, in which case you may have a single maker process on each of 5 nodes.
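
[Editorial sketch] The single-node MPI launch that Carson recommends in this thread (`mpiexec -n <CPU count> maker`) could be scripted roughly as below. This is only an illustration, not part of MAKER: the helper name `build_mpiexec_command` and its `maker_args` parameter are hypothetical, and it assumes `mpiexec` and `maker` are on the PATH of the node running the job. It only builds and prints the command so it can be pasted into a job script.

```python
import os
import shlex


def build_mpiexec_command(n_procs=None, maker_args=()):
    """Build the argv list for a single-node `mpiexec -n N maker` launch.

    `build_mpiexec_command` and `maker_args` are hypothetical names used
    for illustration only; they are not part of MAKER or MPICH.
    """
    if n_procs is None:
        # os.cpu_count() reports the CPUs visible on this machine, which can
        # be more than the scheduler actually granted to the job.
        n_procs = os.cpu_count() or 1
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    # Extra arguments are passed through to maker unchanged.
    return ["mpiexec", "-n", str(n_procs), "maker", *maker_args]


if __name__ == "__main__":
    # Mirrors the 20-CPU example from the thread.
    print(shlex.join(build_mpiexec_command(20)))  # prints: mpiexec -n 20 maker
```

On a shared cluster it is probably safer to pass the CPU count requested from the scheduler explicitly rather than rely on the detected value.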
From qwzhang0601 at gmail.com Wed Sep 13 15:26:11 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 13 Sep 2017 16:26:11 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <55597FCD-B589-4FFB-A8FA-37659A48BF58@gmail.com>
References: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com> <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com> <9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com> <92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com> <488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com> <3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com> <6B4E0F47-7186-4B12-97A1-EB63D2EB7079@gmail.com> <55597FCD-B589-4FFB-A8FA-37659A48BF58@gmail.com>
Message-ID:

Dear Carson:

I will take a look and try it. Thank you.

Best
Quanwei
>>>>>>
>>>>>> --Carson
>>>>>>
>>>>>>
>>>>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang wrote:
>>>>>>
>>>>>> Dear Carson:
>>>>>>
>>>>>> I only run 5 MAKER instances in each directory (and set cpus=2). If
>>>>>> it is related to a memory issue or an IO issue, I am not sure why the much
>>>>>> longer scaffolds (than the failed ones) were all annotated successfully,
>>>>>> but the relatively shorter ones failed.
>>>>>>
>>>>>> I have set "tries=5" (#number of times to try a contig if there is a
>>>>>> failure for some reason). I will try "clean_try=1" and test on the failed
>>>>>> scaffolds individually with larger memory to see whether they can be
>>>>>> annotated.
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> Best
>>>>>> Quanwei
>>>>>>
>>>>>> 2017-09-11 13:07 GMT-04:00 Carson Holt:
>>>>>>
>>>>>>> I think the cause of the error may have been a little further
>>>>>>> upstream from what you pasted in the e-mail. One thing that may be
>>>>>>> happening is that you are taxing resources (like IO) if running MAKER
>>>>>>> multiple times or on too many CPUs. That can lead to failures because of
>>>>>>> truncated BLAST reports etc. In that case you can just retry and that will
>>>>>>> get around those types of IO derived errors. MAKER can generate a lot of
>>>>>>> IO, and if you are working on network mounted locations (i.e. the storage
>>>>>>> being used is actually across the network), then they can be less robust
>>>>>>> than local storage (when under heavy load NFS can falsely report success on
>>>>>>> read/write operations that actually failed). It's the reason we built in
>>>>>>> the retry capabilities of MAKER.
>>>>>>>
>>>>>>> For contigs that continuously fail, you may need to set clean_try=1.
>>>>>>> That will cause failures to start from scratch (i.e. delete all old reports
>>>>>>> on failure rather than just those suspected of being truncated).
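To rerun failed scaffolds one at a time with clean_try=1, as discussed above, their names first have to be pulled out of the log. The following is only a rough sketch. It assumes the "FAILED CONTIG:<name>" line format seen in the logs pasted in this thread, and the helper name failed_contigs is made up for illustration (it is not part of MAKER).

```python
import re

# Matches the "FAILED CONTIG:<name>" lines shown in the logs in this thread.
# A leading line number (e.g. "33711 FAILED CONTIG:Contig31") is tolerated.
FAILED_RE = re.compile(r"FAILED CONTIG:\s*(\S+)")


def failed_contigs(log_text):
    """Return unique failed contig names in the order they first appear."""
    seen = []
    for match in FAILED_RE.finditer(log_text):
        name = match.group(1)
        if name not in seen:
            seen.append(name)
    return seen


if __name__ == "__main__":
    sample = (
        "ERROR: Failed while doing repeat masking\n"
        "ERROR: Chunk failed at level:0, tier_type:1\n"
        "FAILED CONTIG:Contig10\n"
        "ERROR: Chunk failed at level:2, tier_type:0\n"
        "FAILED CONTIG:Contig10\n"
        "33711 FAILED CONTIG:Contig31\n"
    )
    for name in failed_contigs(sample):
        print(name)
```

A contig is often reported more than once (once per failed tier), which is why the names are de-duplicated before being used to build a per-contig rerun.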
>>>>>>> >>>>>>> ?Carson >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang >>>>>>> wrote: >>>>>>> >>>>>>> Dear Carson: >>>>>>> >>>>>>> About the error in my above email, I found the contig was correctly >>>>>>> annotated at the second time RETRY. So please ignore my last email. But >>>>>>> now, for a few number of scaffolds, I met problems to process the repeats >>>>>>> (as shown below in red). I used both Mammalia repeat library and species >>>>>>> specific repeat library (which is generated by your pipeline " >>>>>>> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Rep >>>>>>> eat_Library_Construction--Basic"). There were no such problems when >>>>>>> I only used Mammalia repeat library. Do you have any ideas about this? What >>>>>>> could be the reason? Or do you have any suggestions for me to find the >>>>>>> reason? Many thanks >>>>>>> >>>>>>> Here are some parameters I used >>>>>>> >>>>>>> #-----Repeat Masking (leave values blank to skip repeat masking) >>>>>>> model_org=Mammalia #select a model organism for RepBase masking in >>>>>>> RepeatMasker >>>>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism >>>>>>> specific repeat library in fasta format for Repe >>>>>>> >>>>>>> max_dna_len=300000 >>>>>>> split_hit=40000 >>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff) >>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff) >>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff) >>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking >>>>>>> >>>>>>> >>>>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm >>>>>>> line 188. 
>>>>>>> 33708 --> rank=NA, hostname=n409 >>>>>>> 33709 ERROR: Failed while processing all repeats >>>>>>> 33710 ERROR: Chunk failed at level:3, tier_type:1 >>>>>>> 33711 FAILED CONTIG:Contig31 >>>>>>> >>>>>>> >>>>>>> Best >>>>>>> Quanwei >>>>>>> >>>>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang : >>>>>>> >>>>>>>> Dear Carson: >>>>>>>> >>>>>>>> I got the following error again. Is this still related to memory >>>>>>>> issues? I wonder whether there can be other reasons lead to this error? >>>>>>>> This time, I got this error during training of the SNAP model. Before, even >>>>>>>> I set max_dna_len=1Mb, I can train the model successfully. And in the >>>>>>>> current training (where I get the following error), I have decreased the >>>>>>>> max_dna_len to 300kb. I required the same amount memory as before. The only >>>>>>>> difference is that I am using both mammalian repeat library and species >>>>>>>> specific repeat library, while previously I only use the mammalian repeat >>>>>>>> library. Will it greatly increases the requirement of memory to use both >>>>>>>> repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? I >>>>>>>> have also set the depth_blast as 30 in current training. >>>>>>>> >>>>>>>> Thank you! Have a nice weekend! >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> #----------------------------------------------------------- >>>>>>>> ---------- >>>>>>>> Now starting the contig!! 
>>>>>>>> SeqID: Contig10 >>>>>>>> Length: 18773588 >>>>>>>> #----------------------------------------------------------- >>>>>>>> ---------- >>>>>>>> >>>>>>>> >>>>>>>> setting up GFF3 output and fasta chunks >>>>>>>> doing repeat masking >>>>>>>> doing blastx repeats >>>>>>>> doing blastx repeats >>>>>>>> doing blastx repeats >>>>>>>> doing blastx repeats >>>>>>>> doing blastx repeats >>>>>>>> doing blastx repeats >>>>>>>> doing blastx repeats >>>>>>>> doing blastx repeats >>>>>>>> doing blastx repeats >>>>>>>> doing blastx repeats >>>>>>>> collecting blastx repeatmasking >>>>>>>> processing all repeats >>>>>>>> doing repeat masking >>>>>>>> Can't kill a non-numeric process ID at >>>>>>>> /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line >>>>>>>> 1050. >>>>>>>> --> rank=NA, hostname=n224 >>>>>>>> ERROR: Failed while doing repeat masking >>>>>>>> ERROR: Chunk failed at level:0, tier_type:1 >>>>>>>> FAILED CONTIG:Contig10 >>>>>>>> >>>>>>>> ERROR: Chunk failed at level:2, tier_type:0 >>>>>>>> FAILED CONTIG:Contig10 >>>>>>>> >>>>>>>> Best >>>>>>>> Quanwei >>>>>>>> >>>>>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt : >>>>>>>> >>>>>>>>> >>>>>>>>> (2) By reading some of your replies in the maker google group, and >>>>>>>>> I noticed that it can reduce memory and save time for annotation if I set >>>>>>>>> depth_blast to a certain number. So I changed the following parameters. But >>>>>>>>> I wonder, whether it will decrease the quality of annotation? If it won't >>>>>>>>> affect the quality, can I even use a smaller number (e.g., 20) to save more >>>>>>>>> memory and time? 
>>>>>>>>>
>>>>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> These values really only affect the final evidence kept in the
>>>>>>>>> GFF3 when you look at it in a browser. They have no effect on the annotation.
>>>>>>>>> This is because internally MAKER already collapses evidence down to the 10
>>>>>>>>> best non-redundant features per evidence set per locus. The rest are put in
>>>>>>>>> the GFF3 just for reference. By setting it lower, you are just letting
>>>>>>>>> MAKER know it can throw things away even sooner since you don't want them
>>>>>>>>> in the GFF3. It provides a minor improvement for memory use, but
>>>>>>>>> max_dna_len is the big one that has the greatest effect.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> (3) I also have some concerns about the speed, especially for the
>>>>>>>>> long scaffolds (around 100Mb). I wonder which part is the most time
>>>>>>>>> consuming for genome annotation (repeat masking, blast, or polishing?).
>>>>>>>>> Particularly, I wonder whether the blastx of protein evidence will take
>>>>>>>>> the majority of the time. Now, I have prepared 99k mammalian Swiss-Prot protein
>>>>>>>>> sequences and 340k rodent TrEMBL protein sequences as protein evidence. I
>>>>>>>>> am considering whether I can save much time if I only use the 99k mammalian
>>>>>>>>> Swiss-Prot protein sequences as evidence.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>>>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at
>>>>>>>>> least 6 times slower than BLASTN
>>>>>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at
>>>>>>>>> least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>>>>>>
>>>>>>>>> Also double the dataset size, double the runtime. Larger window
>>>>>>>>> sizes via max_dna_len will also increase runtimes.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> (4) For some reason, I cannot run maker through MPI on our
>>>>>>>>> cluster. So I can only start multiple maker instances. I wonder if it is
>>>>>>>>> possible to let multiple maker instances annotate the same long scaffold
>>>>>>>>> (i.e., for a single sequence I start multiple maker instances, without
>>>>>>>>> splitting the long sequence into shorter ones).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Without MPI you won't be able to split up large contigs. At the
>>>>>>>>> very least you can try to run on a single node and set MPI to use all CPUs
>>>>>>>>> on that node. It's less difficult to set up compared to cross node jobs via
>>>>>>>>> MPI.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> (5) Still about the speed issue. I read some of your comments
>>>>>>>>> about the "cpus" parameter in the maker_opts file (
>>>>>>>>> http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html).
>>>>>>>>> And I know it indicates the number
>>>>>>>>> of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file,
>>>>>>>>> then I can use the following command to submit the job, right?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The cpus parameter only affects how many CPUs are given to the
>>>>>>>>> blast command line. So only the BLAST step will speed up, and I recommend
>>>>>>>>> using MPI to get all steps to speed up. Even if you are only running on a
>>>>>>>>> single node, you can give all CPUs to the mpiexec command.
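Going back to question (3), the rules of thumb in the reply can be turned into a quick back-of-the-envelope estimate. This is only a sketch under simplifying assumptions. It treats the "at least 6x / 12x" lower bounds as exact, assumes runtime is strictly proportional to the number of database sequences, and ignores differences in sequence length. The function names are invented for illustration, and the result covers only the BLAST stage, not the whole MAKER run.

```python
# Relative cost per search with BLASTN as the baseline, taken from the
# lower bounds quoted in the reply above (assumed exact for this estimate).
RELATIVE_COST = {"blastn": 1, "blastx": 6, "tblastx": 12}


def relative_runtime(program, n_sequences):
    """Unitless cost: program factor times dataset size (linear scaling assumed)."""
    return RELATIVE_COST[program] * n_sequences


def fraction_of_runtime(program, n_subset, n_full):
    """Share of the full-dataset runtime expected when only a subset is searched."""
    return relative_runtime(program, n_subset) / relative_runtime(program, n_full)


if __name__ == "__main__":
    swissprot = 99_000  # mammalian Swiss-Prot sequences from question (3)
    trembl = 340_000    # rodent TrEMBL sequences from question (3)
    share = fraction_of_runtime("blastx", swissprot, swissprot + trembl)
    print(f"BLASTX with Swiss-Prot only: about {share:.0%} of the combined runtime")
```

Under these assumptions, dropping the TrEMBL set would cut the BLASTX stage to a bit under a quarter of its combined-dataset runtime. Whether that matters overall depends on how much of the total wall time BLASTX takes on a given genome.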
>>>>>>>>> >>>>>>>>> >>>>>>>>> ?Carson >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>> >>>> >>> >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From xvazquezc at gmail.com Sun Sep 17 20:12:56 2017 From: xvazquezc at gmail.com (=?UTF-8?Q?Xabier_V=C3=A1zquez=2DCampos?=) Date: Mon, 18 Sep 2017 11:12:56 +1000 Subject: [maker-devel] augustus underpredicting In-Reply-To: References: Message-ID: I did it that way and AUGUSTUS is predicting a more reasonable number of genes, about 12500 in Maker, but about 19000 in the model assessment step. In comparison, SNAP gives 16000 and GeneMark 19000. I haven't found any reference about but, would it be a good idea to train Augustus over the masked genome instead? Thanks, On 12 September 2017 at 02:50, Carson Holt wrote: > BUSCO may be generating too few models. BUSCO also identifies classes of > conserved short genes that may not represent enough training diversity for > your organism. Try running MAKER in protein2genome or est2genome mode, and > then train with those results. > > ?Carson > > > On Sep 10, 2017, at 7:03 PM, Xabier V?zquez-Campos > wrote: > > Hi, > I have been annotating a fungal genome as usual, using Busco-trained > Augustus (in addition to GeneMark and SNAP), but for some reason, Augustus > is predicting a mere 207 genes compared to 15-20k from the other two. > I've never had this problem. The genome has an unusual repeat content > close to 50%, not sure if that might suppose a problem. > Has anybody come up with any similar issue? 
> I also asked to Busco developers if they have any idea > https://gitlab.com/ezlab/busco/issues/49 > Cheers, > Xabi > > -- > Xabier V?zquez-Campos, *PhD* > *Research Associate* > NSW Systems Biology Initiative > School of Biotechnology and Biomolecular Sciences > The University of New South Wales > Sydney NSW 2052 AUSTRALIA > > > -- Xabier V?zquez-Campos, *PhD* *Research Associate* NSW Systems Biology Initiative School of Biotechnology and Biomolecular Sciences The University of New South Wales Sydney NSW 2052 AUSTRALIA -------------- next part -------------- An HTML attachment was scrubbed... URL: From qwzhang0601 at gmail.com Mon Sep 18 22:07:25 2017 From: qwzhang0601 at gmail.com (Quanwei Zhang) Date: Mon, 18 Sep 2017 23:07:25 -0400 Subject: [maker-devel] Question about "maker-", "augustus_masked", "snap_masked" gene model Message-ID: Hello: Would you please explain what is the difference between "maker-...-agustus..." and "augustus_masked..." gene models? I know "augustus_masked..." gene models are raw august predictions, while "maker-...-agustus..." are hit derived gene models. But by default, maker2 reports gene models with evidence support (protein sequences or transcripts). Then why some gene models are hit derived while other models (with evidence support) are raw augustus prediction (even there are protein sequences or transcript evidence)? BTW, is it true that generally the "maker-...-agustus..." gene models are more reliable than the "augustus_masked..." gene models? Thanks Best Quanwei -------------- next part -------------- An HTML attachment was scrubbed... URL: From qwzhang0601 at gmail.com Mon Sep 18 23:14:38 2017 From: qwzhang0601 at gmail.com (Quanwei Zhang) Date: Tue, 19 Sep 2017 00:14:38 -0400 Subject: [maker-devel] about min_protein Message-ID: Hello: I am working on a rodent species and get 28k annotated genes, I wonder whether you have any suggestions about the "min_protein" parameter? 
I did not change the parameter in my current annotation. I get several very short predicted proteins (even those with only 1 amino acid).

min_protein=0 #require at least this many amino acids in predicted proteins

Thanks

Best
Quanwei

From qwzhang0601 at gmail.com Tue Sep 19 07:47:00 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Tue, 19 Sep 2017 08:47:00 -0400
Subject: [maker-devel] about min_protein
In-Reply-To:
References:
Message-ID:

Thank you Daniel. I wonder whether there is a suggested value for the "min_protein" parameter (e.g., 20 amino acids, 50 amino acids?) that people often use. I am studying a rodent species.

Thank you.

Best
Quanwei

2017-09-19 8:29 GMT-04:00 Daniel Ence:

> Hi Quanwei,
>
> Increasing the "min_protein" parameter should get rid of those very short
> predicted proteins.
>
> > On Sep 19, 2017, at 12:14 AM, Quanwei Zhang wrote:
> >
> > Hello:
> >
> > I am working on a rodent species and get 28k annotated genes, I wonder
> > whether you have any suggestions about the "min_protein" parameter?
> >
> > I did not change the parameter in my current annotation. I get several
> > very short predicted proteins (even those with only 1 amino acid).
> >
> > min_protein=0 #require at least this many amino acids in predicted proteins
> >
> > Thanks
> >
> > Best
> > Quanwei
> > _______________________________________________
> > maker-devel mailing list
> > maker-devel at box290.bluehost.com
> > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From dandence at gmail.com Tue Sep 19 07:29:35 2017
From: dandence at gmail.com (Daniel Ence)
Date: Tue, 19 Sep 2017 08:29:35 -0400
Subject: [maker-devel] about min_protein
In-Reply-To:
References:
Message-ID:

Hi Quanwei,

Increasing the "min_protein" parameter should get rid of those very short predicted proteins.
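Before picking a min_protein value, it can help to see how many predicted proteins each candidate cutoff would remove. The sketch below is one possible way to do that from a protein FASTA. The cutoffs 20, 30 and 50 are just the example values floated in this thread, the function names are made up, and stripping a trailing '*' (stop symbol) is an assumption about how the FASTA was written.

```python
def protein_lengths(fasta_text):
    """Map each FASTA record ID to its sequence length (trailing '*' not counted)."""
    lengths = {}
    current = None
    for raw in fasta_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line.rstrip("*"))
    return lengths


def count_below(lengths, min_protein):
    """Number of proteins with fewer than min_protein amino acids."""
    return sum(1 for n in lengths.values() if n < min_protein)


if __name__ == "__main__":
    # Tiny made-up example; in practice read the protein FASTA from disk.
    demo = ">geneA\nM\n>geneB\n" + "A" * 25 + "\n>geneC\n" + "A" * 60 + "*\n"
    lengths = protein_lengths(demo)
    for cutoff in (20, 30, 50):
        print(f"min_protein={cutoff}: {count_below(lengths, cutoff)} proteins would be dropped")
```

The comparison is strict (n < min_protein) because the control file comment reads "require at least this many amino acids".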
> On Sep 19, 2017, at 12:14 AM, Quanwei Zhang wrote:
>
> Hello:
>
> I am working on a rodent species and get 28k annotated genes, I wonder whether you have any suggestions about the "min_protein" parameter?
>
> I did not change the parameter in my current annotation. I get several very short predicted proteins (even those with only 1 amino acid).
>
> min_protein=0 #require at least this many amino acids in predicted proteins
>
> Thanks
>
> Best
> Quanwei

From tuanduonganh at gmail.com Tue Sep 19 12:23:39 2017
From: tuanduonganh at gmail.com (Tuan Duong Anh)
Date: Tue, 19 Sep 2017 19:23:39 +0200
Subject: [maker-devel] MAKER3 beta - EVM under predicting
Message-ID:

Dear MAKER-devel group

I have been testing out the MAKER3 beta version and found that EVM always returns far fewer models. Has anyone experienced this before? I do expect that EVM will return fewer models compared to the others, but not to this extent (only 20% of the expected gene models). Any suggestion would be much appreciated.

## Number of models obtained by each gene predictor:
HLIG.all.maker.augustus_masked.proteins.fasta:11224
HLIG.all.maker.evm.proteins.fasta:1974
HLIG.all.maker.genemark.proteins.fasta:11352
HLIG.all.maker.proteins.fasta:13672
HLIG.all.maker.snap_masked.proteins.fasta:13404

## maker_evm.ctl
#-----Transcript weights
evmtrans=10 #default weight for source unspecified est/alt_est alignments
evmtrans:blastn=0 #weight for blastn sourced alignments
evmtrans:est2genome=10 #weight for est2genome sourced alignments
evmtrans:tblastx=0 #weight for tblastx sourced alignments
evmtrans:cdna2genome=7 #weight for cdna2genome sourced alignments

#-----Protein weights
evmprot=10 #default weight for source unspecified protein alignments
evmprot:blastx=2 #weight for blastx sourced alignments
evmprot:protein2genome=10 #weight for protein2genome sourced alignments

#-----Abinitio Prediction weights
evmab=10 #default weight for source unspecified ab initio predictions
evmab:snap=7 #weight for snap sourced predictions
evmab:augustus=10 #weight for augustus sourced predictions
evmab:fgenesh=10 #weight for fgenesh sourced predictions
evmab:genemark=10 #weight for genemark sourced predictions

Regards,
Tuan

From carsonhh at gmail.com Tue Sep 19 16:34:40 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 Sep 2017 15:34:40 -0600
Subject: [maker-devel] augustus underpredicting
In-Reply-To:
References:
Message-ID: <3B1C5B03-58F9-4427-815A-1D725C740A04@gmail.com>

Gene predictors tend to overpredict, so I would not take the high numbers given by SNAP and GeneMark as true counts. You will probably end up with something like 7-10k in the final results. But now that Augustus is giving a higher count, you should be good to start running MAKER.
?Carson > On Sep 17, 2017, at 7:12 PM, Xabier V?zquez-Campos wrote: > > I did it that way and AUGUSTUS is predicting a more reasonable number of genes, about 12500 in Maker, but about 19000 in the model assessment step. > In comparison, SNAP gives 16000 and GeneMark 19000. > > I haven't found any reference about but, would it be a good idea to train Augustus over the masked genome instead? > Thanks, > > > > On 12 September 2017 at 02:50, Carson Holt > wrote: > BUSCO may be generating too few models. BUSCO also identifies classes of conserved short genes that may not represent enough training diversity for your organism. Try running MAKER in protein2genome or est2genome mode, and then train with those results. > > ?Carson > > >> On Sep 10, 2017, at 7:03 PM, Xabier V?zquez-Campos > wrote: >> >> Hi, >> I have been annotating a fungal genome as usual, using Busco-trained Augustus (in addition to GeneMark and SNAP), but for some reason, Augustus is predicting a mere 207 genes compared to 15-20k from the other two. >> I've never had this problem. The genome has an unusual repeat content close to 50%, not sure if that might suppose a problem. >> Has anybody come up with any similar issue? >> I also asked to Busco developers if they have any idea https://gitlab.com/ezlab/busco/issues/49 >> Cheers, >> Xabi >> >> -- >> Xabier V?zquez-Campos, PhD >> Research Associate >> NSW Systems Biology Initiative >> School of Biotechnology and Biomolecular Sciences >> The University of New South Wales >> Sydney NSW 2052 AUSTRALIA > > > > > -- > Xabier V?zquez-Campos, PhD > Research Associate > NSW Systems Biology Initiative > School of Biotechnology and Biomolecular Sciences > The University of New South Wales > Sydney NSW 2052 AUSTRALIA -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From carsonhh at gmail.com Tue Sep 19 16:40:27 2017 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 19 Sep 2017 15:40:27 -0600 Subject: [maker-devel] Question about "maker-", "augustus_masked", "snap_masked" gene model In-Reply-To: References: Message-ID: <56CC4BEB-083E-4DE6-99F3-CB34A1735AB4@gmail.com> MAKER uses all derived models as a pool of alternate models for a given locus. The one that best matches the aligned evidence is then selected using the AED calculation described in the MAKER2 publication. Overall hint based models tend to perform better than the raw models because they get extra info about observed intron/exon structure from alignments. There is also a discussion of this in the MAKER2 paper. ?Carson > On Sep 18, 2017, at 9:07 PM, Quanwei Zhang wrote: > > Hello: > > Would you please explain what is the difference between "maker-...-agustus..." and "augustus_masked..." gene models? > > I know "augustus_masked..." gene models are raw august predictions, while "maker-...-agustus..." are hit derived gene models. But by default, maker2 reports gene models with evidence support (protein sequences or transcripts). Then why some gene models are hit derived while other models (with evidence support) are raw augustus prediction (even there are protein sequences or transcript evidence)? > > BTW, is it true that generally the "maker-...-agustus..." gene models are more reliable than the "augustus_masked..." gene models? > > Thanks > > Best > Quanwei From carsonhh at gmail.com Tue Sep 19 16:41:40 2017 From: carsonhh at gmail.com (Carson Holt) Date: Tue, 19 Sep 2017 15:41:40 -0600 Subject: [maker-devel] about min_protein In-Reply-To: References: Message-ID: The value is arbitrary, but some submission databases like NCBI will flag entries under ~20-30 amino acids as errors if you try and submit them (I can?t remember the exact number). ?Carson > On Sep 19, 2017, at 6:47 AM, Quanwei Zhang wrote: > > Thank you Daniel. 
I wonder whether there is a suggested value for the "min_protein" parameter (e.g., 20 amino acids, 50 amino acids?) that people often use. I am studying a rodent species.
>
> Thank you.
>
> Best
> Quanwei
>
> 2017-09-19 8:29 GMT-04:00 Daniel Ence:
> Hi Quanwei,
>
> Increasing the "min_protein" parameter should get rid of those very short predicted proteins.
>
> > On Sep 19, 2017, at 12:14 AM, Quanwei Zhang wrote:
> >
> > Hello:
> >
> > I am working on a rodent species and get 28k annotated genes, I wonder whether you have any suggestions about the "min_protein" parameter?
> >
> > I did not change the parameter in my current annotation. I get several very short predicted proteins (even those with only 1 amino acid).
> >
> > min_protein=0 #require at least this many amino acids in predicted proteins
> >
> > Thanks
> >
> > Best
> > Quanwei

From carsonhh at gmail.com Tue Sep 19 16:47:42 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 Sep 2017 15:47:42 -0600
Subject: [maker-devel] MAKER3 beta - EVM under predicting
In-Reply-To:
References:
Message-ID: <12FE3318-F0DE-485B-B43A-25A4A6EC9390@gmail.com>

If ab initio predictors and evidence alignments aren't in high concordance, then EVM won't produce results. This often indicates minor sequencing errors in the assembly (this is very common in draft assemblies).
Ab initio predictors will slightly alter splicing and extend introns/exons to make a model work around these variations, but doing this does not always agree well with the alignment, so EVM produces nothing. In these cases it is often better just to train the predictor as well as you can, and then take the standard MAKER results.

--Carson

> On Sep 19, 2017, at 11:23 AM, Tuan Duong Anh wrote:
>
> Dear MAKER-devel group
>
> I have been testing out the MAKER3 beta version and found that EVM always returns far fewer models. Has anyone experienced this before? I do expect that EVM will return fewer models compared to the others, but not to this extent (only 20% of the expected gene models). Any suggestion would be much appreciated.
>
> ## Number of models obtained by each gene predictor:
> HLIG.all.maker.augustus_masked.proteins.fasta:11224
> HLIG.all.maker.evm.proteins.fasta:1974
> HLIG.all.maker.genemark.proteins.fasta:11352
> HLIG.all.maker.proteins.fasta:13672
> HLIG.all.maker.snap_masked.proteins.fasta:13404
>
> ## maker_evm.ctl
> #-----Transcript weights
> evmtrans=10 #default weight for source unspecified est/alt_est alignments
> evmtrans:blastn=0 #weight for blastn sourced alignments
> evmtrans:est2genome=10 #weight for est2genome sourced alignments
> evmtrans:tblastx=0 #weight for tblastx sourced alignments
> evmtrans:cdna2genome=7 #weight for cdna2genome sourced alignments
>
> #-----Protein weights
> evmprot=10 #default weight for source unspecified protein alignments
> evmprot:blastx=2 #weight for blastx sourced alignments
> evmprot:protein2genome=10 #weight for protein2genome sourced alignments
>
> #-----Abinitio Prediction weights
> evmab=10 #default weight for source unspecified ab initio predictions
> evmab:snap=7 #weight for snap sourced predictions
> evmab:augustus=10 #weight for augustus sourced predictions
> evmab:fgenesh=10 #weight for fgenesh sourced predictions
> evmab:genemark=10 #weight for genemark sourced predictions
>
> Regards,
>
> Tuan
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From xvazquezc at gmail.com  Tue Sep 19 19:02:04 2017
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Wed, 20 Sep 2017 10:02:04 +1000
Subject: [maker-devel] augustus underpredicting
In-Reply-To: <3B1C5B03-58F9-4427-815A-1D725C740A04@gmail.com>
References: <3B1C5B03-58F9-4427-815A-1D725C740A04@gmail.com>
Message-ID:

Thanks Carson.

Last quick question. After the first run (before using the gene predictors) I ran fasta_merge to get an idea of the numbers I should be looking for.
In summary, I got 14000 genes, only using Swissprot and a close, highly curated reference genome to avoid any "fake" or partial proteins from draft annotations, plus assembled RNA-seq from my genome.
How should I consider this as a guide (if I can do so)? Is this a number I should be aiming for as a minimum number of genes? Maximum? Something around that?

PS my genome (fungus) is 80+ Mbp and just 150 contigs, so I expect very few possible fragments due to assembly (seq errors aside)

On 20 September 2017 at 07:34, Carson Holt wrote:

> Gene predictors tend to overpredict, so I would not take the high numbers given by SNAP and GeneMark as true counts. You will probably end up with something like 7-10k in the final results. But now that Augustus is giving a higher count, you should be good to start running MAKER.
>
> --Carson
>
> On Sep 17, 2017, at 7:12 PM, Xabier Vázquez-Campos wrote:
>
> I did it that way and AUGUSTUS is predicting a more reasonable number of genes, about 12500 in Maker, but about 19000 in the model assessment step.
> In comparison, SNAP gives 16000 and GeneMark 19000.
>
> I haven't found any reference about it, but would it be a good idea to train Augustus over the masked genome instead?
> Thanks,
>
> On 12 September 2017 at 02:50, Carson Holt wrote:
>
>> BUSCO may be generating too few models. BUSCO also identifies classes of conserved short genes that may not represent enough training diversity for your organism. Try running MAKER in protein2genome or est2genome mode, and then train with those results.
>>
>> --Carson
>>
>> On Sep 10, 2017, at 7:03 PM, Xabier Vázquez-Campos wrote:
>>
>> Hi,
>> I have been annotating a fungal genome as usual, using Busco-trained Augustus (in addition to GeneMark and SNAP), but for some reason, Augustus is predicting a mere 207 genes compared to 15-20k from the other two.
>> I've never had this problem. The genome has an unusual repeat content close to 50%, not sure if that might be a problem.
>> Has anybody come across any similar issue?
>> I also asked the Busco developers if they have any idea
>> https://gitlab.com/ezlab/busco/issues/49
>> Cheers,
>> Xabi
>>
>> --
>> Xabier Vázquez-Campos, *PhD*
>> *Research Associate*
>> NSW Systems Biology Initiative
>> School of Biotechnology and Biomolecular Sciences
>> The University of New South Wales
>> Sydney NSW 2052 AUSTRALIA
>
> --
> Xabier Vázquez-Campos, *PhD*
> *Research Associate*
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA

--
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA
From himanimalhotra89 at gmail.com  Tue Sep 19 23:56:55 2017
From: himanimalhotra89 at gmail.com (himani malhotra)
Date: Wed, 20 Sep 2017 10:26:55 +0530
Subject: [maker-devel] Fwd: maker error
In-Reply-To:
References:
Message-ID:

---------- Forwarded message ----------
From: himani malhotra
Date: Wed, Sep 20, 2017 at 10:24 AM
Subject: maker error
To: maker-devel-request at box290.bluehost.com

Hello,
I am using MAKER for gene prediction. I am getting an error in the Repbase installation. I am sending you the error as well; please help me. I have installed Repbase manually and unpacked its libraries in the RepeatMasker Library, but I am still getting the error.
Please help me.

Thanks
Himani

-------------- next part --------------
A non-text attachment was scrubbed...
Name: makererror.png
Type: image/png
Size: 212522 bytes
Desc: not available

From munholl at uwindsor.ca  Wed Sep 20 09:53:04 2017
From: munholl at uwindsor.ca (Seth Munholland)
Date: Wed, 20 Sep 2017 10:53:04 -0400
Subject: [maker-devel] Fwd: maker error
In-Reply-To:
References:
Message-ID:

Hello,

When this happened to me it was faulty pathing on my part when I configured RepeatMasker (which I also installed manually).

Seth Munholland, B.Sc., Ph.D. Candidate
Department of Biological Sciences
Rm. 304 Biology Building
University of Windsor
401 Sunset Ave. N9B 3P4
T: (519) 253-3000 Ext: 4755

On Wed, Sep 20, 2017 at 12:56 AM, himani malhotra <himanimalhotra89 at gmail.com> wrote:

> ---------- Forwarded message ----------
> From: himani malhotra
> Date: Wed, Sep 20, 2017 at 10:24 AM
> Subject: maker error
> To: maker-devel-request at box290.bluehost.com
>
> Hello,
> I am using MAKER for gene prediction. I am getting an error in the Repbase installation. I am sending you the error as well; please help me. I have installed Repbase manually and unpacked its libraries in the RepeatMasker Library, but I am still getting the error.
> Please help me.
>
> Thanks
> Himani

From Jimmy.Cross at uea.ac.uk  Wed Sep 20 09:02:53 2017
From: Jimmy.Cross at uea.ac.uk (James Cross (ITCS - Staff))
Date: Wed, 20 Sep 2017 14:02:53 +0000
Subject: [maker-devel] Maker MPI across nodes
Message-ID:

Hi Maker Developers,

We are trying to run Maker with OpenMPI on our HPC cluster across two nodes (each node containing 28 cores, so 56 cores in total). While Maker seems to be running correctly, it is going slower when split across two nodes (56 cores) than when run on a single node (28 cores). We are trying to reduce the time Maker takes to complete its run. Do you know of any reason why Maker might slow down when split across two nodes?

Our cluster OS is CentOS 6.7 and the HPC scheduler used is LSF. We are running Open MPI on a Mellanox InfiniBand network.

The genome data we wish to annotate comprises 1948 scaffolds with an average length of 324890bp (longest scaffold 6948830bp).

The command we are using in batch mode is: mpiexec -mca btl ^openib -n 56 maker

Any help or advice you could give would be greatly appreciated.

Best Wishes
Jimmy
----------------------------------------------------------------------
Mr James Cross
HPC Systems Developer
University of East Anglia
Norwich Research Park
ITCS
Norwich, Norfolk
NR4 7TJ

Information Services
From patrick.tranvan at unil.ch  Thu Sep 21 04:26:52 2017
From: patrick.tranvan at unil.ch (Patrick Tran Van)
Date: Thu, 21 Sep 2017 09:26:52 +0000
Subject: [maker-devel] Advice on my pipeline
In-Reply-To: <58E904BF-9AB8-4AC7-B10B-C902F414E03D@gmail.com>
References: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch> <1498470630221.84642@unil.ch> <696C51C6-5606-4ECB-A8B8-9C077182FFFA@gmail.com> <1498908228256.16549@unil.ch>, <58E904BF-9AB8-4AC7-B10B-C902F414E03D@gmail.com>
Message-ID: <1505986013492.52354@unil.ch>

Hi Carson,

I have a doubt about round 2. In a previous reply you said:

"Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can't be run in the same directory)."

Does it mean that I don't need to modify the section:

#-----Re-annotation Using MAKER Derived GFF3

?

If I leave everything at the defaults, such as:

altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no

It will not look again for repeats and protein + transcriptome alignments?

Patrick Tran Van
Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206

________________________________
From: Carson Holt
Sent: Monday, July 3, 2017 10:50 PM
To: Patrick Tran Van
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Advice on my pipeline

maker2zff is just for SNAP training and not for gene filtering (please do not use it for filtering, it does not do what you think). So the final annotation set after maker with correct_est_fusion is 16,850.
To decide which set is better, look at them in a browser (gene counts are not useful for gauging results). A well annotated genome will have evidence clusters that closely match the final models. A poorly annotated genome will have evidence clusters that are split or merged by the models.

The correct_est_fusion option does two things. It trims long overlapping UTR fragments, and it stops evidence clusters from being merged on BLASTP evidence alone (so gene predictors will get unmerged hint regions if clusters are split). You may also find that using jaccard_clip with Trinity has reduced sensitivity for the transcript data (you may lose things that were there before, but you now have better specificity, i.e. fewer false positives). Make sure you provide protein data from at least two related species to help maintain the sensitivity lost from the transcript data. You can also add rejected gene models back in after the fact by using iprscan to identify unsupported models with identifiable protein domains --> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4286374/

Thanks,
Carson

On Jul 1, 2017, at 5:21 AM, Patrick Tran Van wrote:

So I have assembled my transcriptome with Trinity using the jaccard clip option, and I have run maker with and without correct_est_fusion. I have then used SNAP to train/filter it with:

maker2zff specie.all.gff

Here are my results:

Number of genes after maker -> number of genes after maker2zff
- Without correct_est_fusion: 21621 -> 13875
- With correct_est_fusion: 16850 -> 9098

1) If I understand correctly how correct_est_fusion works, since it prevents gene merging, shouldn't it be the opposite? Normally I should find more genes with correct_est_fusion, right?

2) I think I should find something like 13000-14000 genes for my species. Should I go with the "without correct_est_fusion" run for the 2nd iteration of maker?
Thanks for your help

Patrick Tran Van
Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206

________________________________
From: Carson Holt
Sent: Monday, June 26, 2017 11:38 PM
To: Patrick Tran Van
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Advice on my pipeline

Sorry, the option is --> correct_est_fusion

It is in the maker_opts.ctl file.

I would use both SNAP and Augustus on a few large contigs, then review the results manually. If one of them is not behaving well, then drop it. If both behave well (i.e. correlate well with evidence alignments), then keep them both.

--Carson

On Jun 26, 2017, at 3:48 AM, Patrick Tran Van wrote:

Thanks for your answer.

1) Do you think that adding Augustus training in addition to SNAP at steps 3 and 5 will add more confidence (instead of adding Augustus only for the final round)? I ask because I am using autoAug for this and it takes a while to compute.

2) I don't see this option: 'avoid_est_fusion=1'. I have tried to add it but I got this error:

WARNING: Invalid option 'avoid_est_fusion' in control file maker_opts.ctl

(I am using v 2.31.8)

Patrick Tran Van
Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206

________________________________
From: Carson Holt
Sent: Monday, June 5, 2017 8:29 PM
To: Patrick Tran Van
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Advice on my pipeline

Your plan sounds good. A couple of related notes. Insect genomes tend to have high gene density, so gene merging will be the primary difficulty. You can avoid merging of mRNA-seq evidence by using options like jaccard_clip in Trinity. Then use avoid_est_fusion=1 inside of MAKER.

Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input.
MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can't be run in the same directory).

--Carson

On Jun 2, 2017, at 3:56 AM, Patrick Tran Van wrote:

Hello,

This is my first time running Maker for an insect genome annotation. I have found various resources and tried to build a consensus from them. I am looking for your thoughts and advice about my pipeline: whether I can improve something or am doing anything useless.

What I have:
- RNA evidence: transcriptome
- Protein evidence: swissprot/uniprot + busco protein set of insect
- Cegma and busco results of my genome

1) Train SNAP with CEGMA
2) Run (run A) maker with repeat masking with transcript, protein, the new SNAP file (from step 1) and augustus file (from busco).
3) Create SNAP model from run A.
4) Run (run B) with the new SNAP (done at step 3) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_A.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).
5) Create SNAP model from run B.
6) Run (run C) with the new SNAP (done at step 5) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_B.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).
7) Create SNAP model from run C AND create Augustus gene model from run C
8) Run (run D) with the new SNAP (done at step 7) + AUGUSTUS file (step 7) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_C.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1). + Use keep_preds=1

Does it seem coherent?
Cheers,

Patrick Tran Van
Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206

From carsonhh at gmail.com  Fri Sep 22 12:57:56 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 22 Sep 2017 11:57:56 -0600
Subject: [maker-devel] augustus underpredicting
In-Reply-To:
References: <3B1C5B03-58F9-4427-815A-1D725C740A04@gmail.com>
Message-ID: <06E8D6C3-B278-4820-B309-5CF61186FDCB@gmail.com>

I don't think you can use the protein2genome option to estimate gene count. It will turn any alignment that matches at least 50% into a gene model. So you can get a lot of partial models, which will inflate the gene count. It's good enough for training but not so much for annotation.

--Carson

> On Sep 19, 2017, at 6:02 PM, Xabier Vázquez-Campos wrote:
>
> Thanks Carson.
>
> Last quick question. After the first run (before using the gene predictors) I ran fasta_merge to get an idea of the numbers I should be looking for.
> In summary, I got 14000 genes, only using Swissprot and a close, highly curated reference genome to avoid any "fake" or partial proteins from draft annotations, plus assembled RNA-seq from my genome.
> How should I consider this as a guide (if I can do so)? Is this a number I should be aiming for as a minimum number of genes? Maximum? Something around that?
>
> PS my genome (fungus) is 80+ Mbp and just 150 contigs, so I expect very few possible fragments due to assembly (seq errors aside)
>
> On 20 September 2017 at 07:34, Carson Holt wrote:
> Gene predictors tend to overpredict, so I would not take the high numbers given by SNAP and GeneMark as true counts.
> You will probably end up with something like 7-10k in the final results. But now that Augustus is giving a higher count, you should be good to start running MAKER.
>
> --Carson
>
>> On Sep 17, 2017, at 7:12 PM, Xabier Vázquez-Campos wrote:
>>
>> I did it that way and AUGUSTUS is predicting a more reasonable number of genes, about 12500 in Maker, but about 19000 in the model assessment step.
>> In comparison, SNAP gives 16000 and GeneMark 19000.
>>
>> I haven't found any reference about it, but would it be a good idea to train Augustus over the masked genome instead?
>> Thanks,
>>
>> On 12 September 2017 at 02:50, Carson Holt wrote:
>> BUSCO may be generating too few models. BUSCO also identifies classes of conserved short genes that may not represent enough training diversity for your organism. Try running MAKER in protein2genome or est2genome mode, and then train with those results.
>>
>> --Carson
>>
>>> On Sep 10, 2017, at 7:03 PM, Xabier Vázquez-Campos wrote:
>>>
>>> Hi,
>>> I have been annotating a fungal genome as usual, using Busco-trained Augustus (in addition to GeneMark and SNAP), but for some reason, Augustus is predicting a mere 207 genes compared to 15-20k from the other two.
>>> I've never had this problem. The genome has an unusual repeat content close to 50%, not sure if that might be a problem.
>>> Has anybody come across any similar issue?
>>> I also asked the Busco developers if they have any idea https://gitlab.com/ezlab/busco/issues/49
>>> Cheers,
>>> Xabi
>>>
>>> --
>>> Xabier Vázquez-Campos, PhD
>>> Research Associate
>>> NSW Systems Biology Initiative
>>> School of Biotechnology and Biomolecular Sciences
>>> The University of New South Wales
>>> Sydney NSW 2052 AUSTRALIA
>>
>> --
>> Xabier Vázquez-Campos, PhD
>> Research Associate
>> NSW Systems Biology Initiative
>> School of Biotechnology and Biomolecular Sciences
>> The University of New South Wales
>> Sydney NSW 2052 AUSTRALIA
>
> --
> Xabier Vázquez-Campos, PhD
> Research Associate
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA

From carsonhh at gmail.com  Fri Sep 22 14:47:36 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 22 Sep 2017 13:47:36 -0600
Subject: [maker-devel] Fwd: maker error
In-Reply-To:
References:
Message-ID: <5196E0C2-9FDC-4B6A-9D14-CA8514E002EF@gmail.com>

You have a couple of errors at the start indicating that you may have an issue with the perl forks module as well as with the RepeatMasker installation. I'd recommend redoing both installations. Also, the screenshot you show is not the failure; it is MAKER giving up after failing 2 times. To capture the actual failure, set the try count to 3, then rerun and see what comes up in STDERR. Redirect STDERR to a file using '&>'.
Example: maker &> err.log

Thanks,
Carson

On Sep 19, 2017, at 10:56 PM, himani malhotra wrote:

> ---------- Forwarded message ----------
> From: himani malhotra
> Date: Wed, Sep 20, 2017 at 10:24 AM
> Subject: maker error
> To: maker-devel-request at box290.bluehost.com
>
> Hello,
> I am using MAKER for gene prediction. I am getting an error in the Repbase installation. I am sending you the error as well; please help me. I have installed Repbase manually and unpacked its libraries in the RepeatMasker Library, but I am still getting the error.
> Please help me.
>
> Thanks
> Himani

From carsonhh at gmail.com  Fri Sep 22 14:59:17 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 22 Sep 2017 13:59:17 -0600
Subject: [maker-devel] Maker MPI across nodes
In-Reply-To:
References:
Message-ID:

The '-mca btl ^openib' flag has the side effect of bypassing InfiniBand and using Ethernet. But if the alternate communicators are too slow, you can switch back to indirect InfiniBand by using '--mca btl vader,tcp,self --mca btl_tcp_if_include ib0'. That option will force IP over InfiniBand instead of direct InfiniBand.

The OpenFabrics libraries used by InfiniBand have a known issue because of how they use registered memory (they generate seg faults whenever a program makes system calls, i.e. MAKER calling BLAST). So you can't use direct InfiniBand with MAKER; use the IP-over-InfiniBand options above instead.

Also, if it stays slow, it likely means you are hitting IO limits. If that is the case, make sure you are not setting TMP= to a network-mounted disk location. Whatever temp space exists on your cluster needs to be real, locally mounted disk on each node and not network-mounted disk.
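
Putting the flags above together, the launch line might be sketched as follows. This is only an illustration: the 56 ranks are taken from the original question, and ib0 is assumed to be the name of the IP-over-InfiniBand interface on the nodes (check locally, e.g. with `ip addr`). The variable names are hypothetical.

```shell
# Sketch only: assemble the Open MPI launch line that uses IP over InfiniBand.
# NP mirrors the 56 ranks from the original question; ib0 is an assumed
# interface name and may differ on your cluster.
NP=56
MCA_OPTS="--mca btl vader,tcp,self --mca btl_tcp_if_include ib0"
CMD="mpiexec $MCA_OPTS -n $NP maker"

# Print the command for inspection before submitting it to the scheduler.
echo "$CMD"

# Uncomment to actually launch once the printed line looks right:
# $CMD
```

Printing the command first makes it easy to paste into an LSF job script and to confirm that TMP is not pointing at network storage before the run starts.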
--Carson

> On Sep 20, 2017, at 8:02 AM, James Cross (ITCS - Staff) wrote:
>
> Hi Maker Developers,
>
> We are trying to run Maker with OpenMPI on our HPC cluster across two nodes (each node containing 28 cores, so 56 cores in total). While Maker seems to be running correctly, it is going slower when split across two nodes (56 cores) than when run on a single node (28 cores). We are trying to reduce the time Maker takes to complete its run. Do you know of any reason why Maker might slow down when split across two nodes?
>
> Our cluster OS is CentOS 6.7 and the HPC scheduler used is LSF. We are running Open MPI on a Mellanox InfiniBand network.
>
> The genome data we wish to annotate comprises 1948 scaffolds with an average length of 324890bp (longest scaffold 6948830bp).
>
> The command we are using in batch mode is: mpiexec -mca btl ^openib -n 56 maker
>
> Any help or advice you could give would be greatly appreciated.
>
> Best Wishes
> Jimmy
> ----------------------------------------------------------------------
> Mr James Cross
> HPC Systems Developer
> University of East Anglia
> Norwich Research Park
> ITCS
> Norwich, Norfolk
> NR4 7TJ
>
> Information Services

From carson.holt at genetics.utah.edu  Fri Sep 22 15:04:10 2017
From: carson.holt at genetics.utah.edu (Carson Hinton Holt)
Date: Fri, 22 Sep 2017 20:04:10 +0000
Subject: [maker-devel] MAKER
In-Reply-To:
References: <80e44920c4d64d6e82d18e5535656a32@JKI-EXPF03.braunschweig.bba.intern> <3BE2DF2C-9443-4314-9524-F0109AAE42E8@genetics.utah.edu> <9867D99C-E4AD-4673-BB0B-804A1551F9F3@genetics.utah.edu>
Message-ID:

MAKER won't produce est2genome results for est_gff.
This is partially because est2genome results are only used for training gene predictors. So you are essentially just getting protein2genome results from your runs. Once you get a gene predictor trained you will see a difference, as it will use the intron/exon structure of alignments as hints to improve gene predictor performance.

--Carson

> On Sep 21, 2017, at 1:57 AM, Keilwagen, Jens wrote:
>
> Hi Carson,
>
> I have tried the proposed options for a small example (yeast).
>
> I had
> - proteins (fasta) from another yeast and
> - transcript annotation (gff) from cufflinks and StringTie
>
> I'd like to compare the maker results for
> - proteins and StringTie
> Vs.
> - proteins and cufflinks
>
> I used the default options, except:
> genome=
>
> protein=
> est_gff=
>
> est2genome=1
> protein2genome=1
>
> (An example is attached.)
>
> Then I ran maker:
>
> maker -RM_off -c 24
> find . -type f -name '*.gff' -exec cat {} + | grep maker > filtered-maker-prediction.gff
>
> (The run seems to be okay. There were no FAILED, ... in the log. Cf. attachment)
>
> Each maker run was started in a separate subdirectory.
> However, I realized that both maker runs yielded almost the same result (just one minor edit). This made me curious.
> As far as I understood the files, I received the (filtered?) exonerate predictions for the proteins (from the other yeast). Is this correct? Why did I not receive any predictions (purely) based on the RNA-seq data? Did I do something wrong?
>
> I'm looking forward to your reply.
>
> Best regards, Jens
>
>> -----Original Message-----
>> From: Carson Hinton Holt [mailto:carson.holt at genetics.utah.edu]
>> Sent: Tuesday, 19 September 2017 23:37
>> To: Keilwagen, Jens
>> Subject: Re: MAKER
>>
>> MAKER cannot use the BAM directly, but you can use something like stringtie or trinity to assemble a transcript fasta that can be given to the est= option.
>>
>> Ab initio gene prediction is only enabled if you specify an hmm or species file to use. If all you want is homology based annotation, you can try the est2genome and protein2genome options. Note the final models may be partial if the alignments do not cover the gene end to end.
>>
>> --Carson
>>
>>> On Sep 18, 2017, at 4:02 AM, Keilwagen, Jens wrote:
>>>
>>> Hi Carson,
>>>
>>> thanks a lot for your last email that .
>>>
>>> I was asked to do homology-based gene prediction using RNA-seq and Maker was proposed as one option.
>>> Hence I'd like to ask how to do that in the best possible way.
>>> I have mapped RNA-seq data (SAM/BAM) and a fasta of proteins from a related species. How can I integrate the RNA-seq data?
>>>
>>> Is it possible to deactivate ab-initio gene prediction by Augustus or SNAP?
>>>
>>> Thanks a lot in advance.
>>>
>>> Best regards, Jens
>>>
>>>> -----Original Message-----
>>>> From: Carson Holt [mailto:carson.holt at genetics.utah.edu]
>>>> Sent: Thursday, 18 February 2016 19:03
>>>> To: Keilwagen, Jens
>>>> Cc: Mark Yandell
>>>> Subject: Re: MAKER
>>>>
>>>> GeMoMa sounds like an interesting tool. If it produces GFF3, you could give the GFF3 results to the pred_gff= option in MAKER (comma separated lists accepted). The GFF3 file of predictions must be in the same coordinate space as the assembly being annotated (genome= option). Whatever you give to pred_gff will be treated as raw predictions by MAKER and will only be accepted as a final model if there are evidence alignments (protein/EST) that support the model, and if there are multiple alternate models at the same locus, only the model that is best supported by the protein/transcript evidence is kept.
>>>>
>>>> You can also set the keep_preds=1 option when using pred_gff. This will cause even raw predictions with no evidence support to be maintained. In the event of multiple models with no evidence support, the model best matching the consensus of alternate models will be maintained.
>>>>
>>>> Alternatively you can use the model_gff= option (comma separated list ok) to input the GFF3 file. model_gff features are given higher confidence than pred_gff. At least one model will always be kept regardless of evidence support (same rules as pred_gff selection for which model to keep when there are multiple). But model_gff will also affect how evidence clusters are determined compared to pred_gff (model_gff features are allowed to merge bridging evidence clusters). MAKER will also go to extra lengths to pull forward existing names and other data in the GFF3 for model_gff features.
>>>>
>>>> If you do not have GFF3 files in the right coordinate space, but do have protein fasta or transcript fasta for the GeMoMa predictions, you can supply these to the protein= and est= options in MAKER together with est2genome=1 or protein2genome=1. This will cause MAKER to place the models using exonerate. You would probably also need to add est_forward=1 to the control files to have MAKER try to derive model names from the names of the evidence alignments they were derived from if you go this route.
>>>>
>>>> You can also try treating the GFF3 predictions as hints to traditional ab initio gene finders like SNAP or Augustus by giving them to the est_gff= or protein_gff= options (i.e. make GeMoMa predictions inform the behavior of predictors like SNAP and Augustus). Might be interesting. You would have to alter the results to be match/match_part GFF3 features to give them to the est_gff or protein_gff options.
>>>>
>>>> Let me know if you have any more questions, and I'll do my best to help.
>>>>
>>>> Thanks,
>>>> Carson
>>>>
>>>>> On Feb 18, 2016, at 10:22 AM, Mark Yandell wrote:
>>>>>
>>>>> Mark Yandell
>>>>> Professor of Human Genetics
>>>>> H.A. & Edna Benning Presidential Endowed Chair
>>>>> Co-director USTAR Center for Genetic Discovery
>>>>> Eccles Institute of Human Genetics
>>>>> University of Utah
>>>>> 15 North 2030 East, Room 2100
>>>>> Salt Lake City, UT 84112-5330
>>>>> ph:801-587-7707
>>>>>
>>>>> On 2/18/16, 8:34 AM, "Keilwagen, Jens" wrote:
>>>>>
>>>>>> Dear Prof. Yandell,
>>>>>>
>>>>>> we have published a homology-based gene prediction program today:
>>>>>> https://nar.oxfordjournals.org/content/early/2016/02/17/nar.gkw092
>>>>>> and I'd like to ask how we can use MAKER to combine predictions of GeMoMa using different reference organisms, i.e. we try to predict the genes of a target organism (e.g. wheat) using the annotated genes of other reference organisms (e.g. grasses). GeMoMa returns for each reference organism a GFF with the predicted gene models in the target organism.
>>>>>>
>>>>>> It would be great if you or someone from your team could give us some hints or point us to the correct paragraph in the documentation.
>>>>>>
>>>>>> Thanks a lot and best regards, Jens
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> Dr.
Jens Keilwagen >>>>>> >>>>>> Julius K?hn-Institut (JKI) - Federal Research Centre for >> Cultivated >>>>>> Plants >>>>>> Institute for Biosafety in Plant Biotechnology >>>>>> >>>>>> Erwin-Baur-Stra?e 27 >>>>>> 06484 Quedlinburg >>>>>> Germany >>>>>> >>>>>> Phone: ++49 (0)3946 47 510 >>>>>> EMail: jens.keilwagen at jki.bund.de >>>>>> >>>>>> >>>>> >>> > > From eennadi at gmail.com Fri Sep 22 14:27:37 2017 From: eennadi at gmail.com (Emmanuel Nnadi) Date: Fri, 22 Sep 2017 20:27:37 +0100 Subject: [maker-devel] Maker not installing In-Reply-To: References: <8576F599-08C0-455A-B0D7-8A1110C6F18F@mail.ufl.edu> <113031DF-E733-4EEF-BDB7-405C15F0CA24@mail.ufl.edu> <546610AC-0B7B-4D02-BBF3-E847B95D7F0D@mail.ufl.edu> <8EAFE412-9EF7-4DB7-85A3-632BAC3372FD@gmail.com> <7440971C-8A18-4A07-91B6-AD16D17F0766@gmail.com> <426E63C1-8C82-4809-B2A1-EF7A909E6712@gmail.com> Message-ID: Hello all, Please how can I determine the following in maker: 1. The total number of chromosomes 2. The size of my genome Thanks Nnadi Nnaemeka Emmanuel Department of Microbiology, Faculty of Natural and Applied Science, Plateau State University, Bokkos, Plateau State, Nigeria. Publications: https://www.researchgate.net/profile/Emmanuel_Nnadi/publications On Fri, Sep 1, 2017 at 10:52 PM, Emmanuel Nnadi wrote: > Ok, thanks. > Nnadi Nnaemeka Emmanuel > Department of Microbiology, > Faculty of Natural and Applied Science, > Plateau State University, Bokkos, Plateau State, Nigeria. > Publications: https://www.researchgate.net/profile/Emmanuel_Nnadi/ > publications > > > > On Sep 1, 2017 10:50 PM, "Carson Holt" wrote: > >> It would need to be a new run. You won't be able to use the updated >> contig names with the old run. 
>> >> --Carson >> >> Sent from my iPhone >> >> On Sep 1, 2017, at 3:41 PM, Emmanuel Nnadi wrote: >> >> Hi carson >> Thanks for the tip >> perl -ane 's/1_S7_R1_001_\(paired\)_trimmed_\(paired\)_//g; print' >> genome.fasta >> >> It worked well however, when i ran it, it removed 1_S7_R1_001_\(paired\)_ >> trimmed_\(paired\)_, >> >> I have ran maker with 1_S7_R1_001_\(paired\)_trimmed_\(paired\)_, >> >> 1. How can I effect the change when maker has produced some files from >> the the old sequence? >> >> I have spent more than 24 hours running maker and it has produced some >> folders already. >> >> How can I make this change? >> >> Thanks >> >> >> >> >> Nnadi Nnaemeka Emmanuel >> Department of Microbiology, >> Faculty of Natural and Applied Science, >> Plateau State University, Bokkos, Plateau State, Nigeria. >> Publications: https://www.researchgate.net/ >> profile/Emmanuel_Nnadi/publications >> >> On Fri, Sep 1, 2017 at 4:54 PM, Carson Holt wrote: >> >>> BLAST which is used by MAKER can not handle really long contig names. >>> MAKER tries to get around this by adding a secondary tag to the fasta >>> header when long names are detected. Even then it would be better to change >>> the IDs of your contigs to avoid downstream failures. >>> >>> I would recommend removing '1_S7_R1_001_(paired)_trimmed_(paired)_? >>> from each contig name. >>> >>> Example command to do that ?> >>> perl -ane 's/1_S7_R1_001_\(paired\)_trimmed_\(paired\)_//g; print' >>> genome.fasta >>> >>> ?Carson >>> >>> >>> On Aug 30, 2017, at 3:54 PM, Emmanuel Nnadi wrote: >>> >>> Hi Carson >>> Thanks for your response its been helpful >>> >>> Please bear with me as I work through this >>> >>> 1. Please how do I generate EST for my novel sequences? >>> 2. I am currently running maker without EST and protein sequences is it >>> wrong? Can it predict properly? >>> 3. 
One error in the contig just returned this value
>>> FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence
>>> identifier which is too long ( max id length = 50 )
>>> at /usr/local/bin/RepeatMasker line 1464.
>>> FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence
>>> identifier which is too long ( max id length = 50 )
>>> at /usr/local/bin/RepeatMasker line 1464.
>>> FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence
>>> identifier which is too long ( max id length = 50 )
>>> at /usr/local/bin/RepeatMasker line 1464.
>>> ERROR: RepeatMasker failed
>>> --> rank=NA, hostname=emmannaemekas-MacBook-Pro.local
>>> ERROR: Failed while doing repeat masking
>>> ERROR: Chunk failed at level:0, tier_type:1
>>> FAILED CONTIG:1_S7_R1_001_(paired)_trimmed_(paired)_contig_2
>>>
>>> ERROR: Chunk failed at level:2, tier_type:0
>>> FAILED CONTIG:1_S7_R1_001_(paired)_trimmed_(paired)_contig_2
>>>
>>> examining contents of the fasta file and run log
>>>
>>>
>>> Nnadi Nnaemeka Emmanuel
>>> Department of Microbiology,
>>> Faculty of Natural and Applied Science,
>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>> Publications: https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
>>>
>>> On Wed, Aug 30, 2017 at 4:12 PM, Carson Holt wrote:
>>>
>>>> You can query valid species names using the queryTaxonomyDatabase.pl
>>>> script that comes with RepeatMasker. Try not to be too specific. In general
>>>> you should use the genus rather than the species for example (or even use
>>>> all of RepBase).
>>>>
>>>> Example -->
>>>> perl .../RepeatMasker/util/queryTaxonomyDatabase.pl -species "drosophila"
>>>>
>>>> --Carson
>>>>
>>>>
>>>>
>>>> On Aug 30, 2017, at 9:05 AM, Emmanuel Nnadi wrote:
>>>>
>>>> Hi Carson,
>>>>
>>>> Thanks
>>>> I was able to start using maker.
>>>>
>>>> However I am working with a plant Genome novel. I had set the
>>>> repeatmasking to
>>>> 1.
dcotrep, a name from the RepBase release, but MAKER returned it
>>>> as not known to RepeatMasker
>>>>
>>>> How can I use specific known genomes for repeat masking?
>>>> Thanks
>>>>
>>>> Nnadi Nnaemeka Emmanuel
>>>> Department of Microbiology,
>>>> Faculty of Natural and Applied Science,
>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>> Publications: https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
>>>>
>>>>
>>>>
>>>> On Aug 29, 2017 4:26 PM, "Carson Holt" wrote:
>>>>
>>>>> MAKER will read the genome= option from the maker_opts.ctl file in
>>>>> your current directory or the maker_opts.ctl you specified on the command
>>>>> line. The error means you have left the value empty. Perhaps you did not
>>>>> save the changes you made or you did not specify the location of
>>>>> the maker_opts.ctl file to use.
>>>>>
>>>>> You can check the contents of the file using cat. Example -->
>>>>> cat maker_opts.ctl
>>>>>
>>>>> --Carson
>>>>>
>>>>>
>>>>>
>>>>> On Aug 29, 2017, at 5:11 AM, Emmanuel Nnadi wrote:
>>>>>
>>>>> Hi Carson,
>>>>> Thanks a lot for yesterday. I was able to resolve the issue of running
>>>>> maker and I followed the commands in the tutorial.
>>>>> I however encountered another problem
>>>>>
>>>>> when I ran the command nano -c maker_opts.ctl
>>>>>
>>>>> It gave the following *1_S7_assembly.fa I specified the name of the
>>>>> genome but when I ran maker in another tab it gave *
>>>>>
>>>>> #-----Genome (these are always required)
>>>>> genome=*1_S7_assembly.fa* #genome sequence (fasta file or fasta embeded in GFF3 file)
>>>>> organism_type=eukaryotic #eukaryotic or prokaryotic.
Default is eukaryotic
>>>>>
>>>>> #-----Re-annotation Using MAKER Derived GFF3
>>>>> maker_gff= #MAKER derived GFF3 file
>>>>> est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
>>>>> altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
>>>>> protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
>>>>> rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
>>>>> model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
>>>>> pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
>>>>> other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no
>>>>>
>>>>> #-----EST Evidence (for best results provide a file for at least one)
>>>>> est= #set of ESTs or assembled mRNA-seq in fasta format
>>>>> altest= #EST/cDNA sequence file in fasta format from an alternate organism
>>>>> est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
>>>>> altest_gff= #aligned ESTs from a closly relate species in GFF3 format
>>>>>
>>>>> #-----Protein Homology Evidence (for best results provide a file for at least one)
>>>>> protein= #protein sequence file in fasta format (i.e. from mutiple oransisms)
>>>>> protein_gff= #aligned protein homology evidence from an external GFF3 file
>>>>>
>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>> model_org=all #select a model organism for RepBase masking in RepeatMasker
>>>>> rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
>>>>> repeat_protein=/Users/emmannaemeka/Desktop/Gpm/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner
>>>>> rm_gff= #pre-identified repeat elements from an external GFF3 file
>>>>> prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
>>>>> softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e.
>>>>> seg and dust filtering)
>>>>>
>>>>>
>>>>> *I ran the maker command in another tab and it returned the following*
>>>>> STATUS: Parsing control files...
>>>>> ERROR: You have failed to provide a value for 'genome' in the control
>>>>> files.
>>>>>
>>>>> --> rank=NA, hostname=emmannamekasMBP
>>>>>
>>>>>
>>>>> Questions
>>>>> 1. After specifying the genome location, do I need to run maker in the same
>>>>> tab or open another bash tab?
>>>>> 2. My genome is novel and I do not have proteins. How do I generate
>>>>> protein FASTA and ESTs for the de novo sequence?
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>> Nnadi Nnaemeka Emmanuel
>>>>> Department of Microbiology,
>>>>> Faculty of Natural and Applied Science,
>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>> Publications: https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
>>>>>
>>>>> On Mon, Aug 28, 2017 at 6:47 PM, Carson Holt wrote:
>>>>>
>>>>>> Here is a class on how to use MAKER taught a couple of years back -->
>>>>>> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014
>>>>>>
>>>>>> There is also a linked video as well as an amazon image of the class
>>>>>> material where you can run the image in the cloud and follow along.
>>>>>>
>>>>>> Thanks,
>>>>>> Carson
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Aug 28, 2017, at 11:43 AM, Emmanuel Nnadi wrote:
>>>>>>
>>>>>> Hi Carson,
>>>>>> Thanks a lot
>>>>>>
>>>>>> I ran the command maker -h and it returned the following
>>>>>>
>>>>>> The last thing I wish to ask you: how can I load my genome file and
>>>>>> begin annotation?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> emmannamekasMBP:maker emmannaemeka$ maker -h
>>>>>>
>>>>>> MAKER version 2.31.9
>>>>>>
>>>>>> Usage:
>>>>>>
>>>>>>      maker [options]
>>>>>>
>>>>>>
>>>>>> Description:
>>>>>>
>>>>>>      MAKER is a program that produces gene annotations in GFF3 format using
>>>>>>      evidence such as EST alignments and protein homology.
MAKER can be used to
>>>>>>      produce gene annotations for new genomes as well as update annotations
>>>>>>      from existing genome databases.
>>>>>>
>>>>>>      The three input arguments are control files that specify how MAKER should
>>>>>>      behave. All options for MAKER should be set in the control files, but a
>>>>>>      few can also be set on the command line. Command line options provide a
>>>>>>      convenient machanism to override commonly altered control file values.
>>>>>>      MAKER will automatically search for the control files in the current
>>>>>>      working directory if they are not specified on the command line.
>>>>>>
>>>>>>      Input files listed in the control options files must be in fasta format
>>>>>>      unless otherwise specified. Please see MAKER documentation to learn more
>>>>>>      about control file configuration. MAKER will automatically try and
>>>>>>      locate the user control files in the current working directory if these
>>>>>>      arguments are not supplied when initializing MAKER.
>>>>>>
>>>>>>      It is important to note that MAKER does not try and recalculated data that
>>>>>>      it has already calculated. For example, if you run an analysis twice on
>>>>>>      the same dataset you will notice that MAKER does not rerun any of the
>>>>>>      BLAST analyses, but instead uses the blast analyses stored from the
>>>>>>      previous run. To force MAKER to rerun all analyses, use the -f flag.
>>>>>>
>>>>>>      MAKER also supports parallelization via MPI on computer clusters. Just
>>>>>>      launch MAKER via mpiexec (i.e. mpiexec -n 40 maker). MPI support must be
>>>>>>      configured during the MAKER installation process for this to work though
>>>>>>
>>>>>>
>>>>>> Options:
>>>>>>
>>>>>>      -genome|g           Overrides the genome file path in the control files
>>>>>>
>>>>>>      -RM_off|R           Turns all repeat masking options off.
>>>>>>
>>>>>>      -datastore/         Forcably turn on/off MAKER's two deep directory
>>>>>>       nodatastore        structure for output. Always on by default.
>>>>>>
>>>>>>      -old_struct         Use the old directory styles (MAKER 2.26 and lower)
>>>>>>
>>>>>>      -base               Set the base name MAKER uses to save output files.
>>>>>>                          MAKER uses the input genome file name by default.
>>>>>>
>>>>>>      -tries|t            Run contigs up to the specified number of tries.
>>>>>>
>>>>>>      -cpus|c             Tells how many cpus to use for BLAST analysis.
>>>>>>                          Note: this is for BLAST and not for MPI!
>>>>>>
>>>>>>      -force|f            Forces MAKER to delete old files before running again.
>>>>>>                          This will require all blast analyses to be rerun.
>>>>>>
>>>>>>      -again|a            recaculate all annotations and output files even if no
>>>>>>                          settings have changed. Does not delete old analyses.
>>>>>>
>>>>>>      -quiet|q            Regular quiet. Only a handlful of status messages.
>>>>>>
>>>>>>      -qq                 Even more quiet. There are no status messages.
>>>>>>
>>>>>>      -dsindex            Quickly generate datastore index file. Note that this
>>>>>>                          will not check if run settings have changed on contigs
>>>>>>
>>>>>>      -nolock             Turn off file locks. May be usful on some file systems,
>>>>>>                          but can cause race conditions if running in parallel.
>>>>>>
>>>>>>      -TMP                Specify temporary directory to use.
>>>>>>
>>>>>>      -CTL                Generate empty control files in the current directory.
>>>>>>
>>>>>>      -OPTS               Generates just the maker_opts.ctl file.
>>>>>>
>>>>>>      -BOPTS              Generates just the maker_bopts.ctl file.
>>>>>>
>>>>>>      -EXE                Generates just the maker_exe.ctl file.
>>>>>>
>>>>>>      -MWAS