From willett4 at email.unc.edu  Fri Sep  1 10:22:34 2017
From: willett4 at email.unc.edu (Willett, Christopher S)
Date: Fri, 1 Sep 2017 15:22:34 +0000
Subject: [maker-devel] ERROR: Can't call method "start" on an undefined
 value at ../lib/maker/join.pm line 535
Message-ID: <FC5A69D8-3FE8-45F7-B902-2847E8DC802A@ad.unc.edu>

Hi Everyone-

I was wondering if anyone has any ideas about an error I am seeing that causes some of our contigs to fail but not others. Here is the error message:

"Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535."

It comes up in either the augustus or the snap portion of the analysis. Below are the errors from 4 different contigs, compiled from two different runs in which I excluded a couple of different parts to see if I could isolate the problem. The errors look slightly different, but the same contigs fail in both runs.

We recently improved our genome and are trying to reannotate it. I am using a MAKER setup very similar to the one I used previously with an earlier version of the genome, when no contigs failed, but now 8 of the 12 large contigs are failing (sizes ~18Mb). The MAKER version is 2.31.8.

If anyone has any thoughts on what the issue might be and how I could resolve it I would appreciate it (or if I should provide more information).

Thanks,

Best,

Chris Willett


error 48600

#--------- command -------------#
Widget::augustus:
/nas02/apps/maker-2.31.8/src/augustus-3.2.2/bin/augustus --species=Tigriopus_californicus --strand=forward --UTR=off --hintsfile=/tmp/maker_H_uEHb/0/19_0.1142953-1147539.Tigriopus_californicus.auto_annotator.xdef.augustus --extrinsicCfgFile=/nas02/apps/maker-2.31.8/src/augustus-3.2.2/config/extrinsic/extrinsic.MPE.cfg /tmp/maker_H_uEHb/0/19_0.1142953-1147539.Tigriopus_californicus.auto_annotator.augustus.fasta > /tmp/maker_H_uEHb/0/19_0.1142953-1147539.Tigriopus_californicus.auto_annotator.augustus
#-------------------------------#
deleted:0 genes
 ...processing 0 of 5
 ...processing 1 of 5
 ...processing 2 of 5
 ...processing 3 of 5
 ...processing 4 of 5
Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
--> rank=NA, hostname=c-195-51.kd.unc.edu
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Chromosome_3

ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Chromosome_3

error 48599

Widget::augustus:
/nas02/apps/maker-2.31.8/src/augustus-3.2.2/bin/augustus --species=Tigriopus_californicus --strand=forward --UTR=off --hintsfile=/tmp/maker_TQWBu_/0/49_0.7061374-7062552.Tigriopus_californicus.auto_annotator.xdef.augustus --extrinsicCfgFile=/nas02/apps/maker-2.31.8/src/augustus-3.2.2/config/extrinsic/extrinsic.MPE.cfg /tmp/maker_TQWBu_/0/49_0.7061374-7062552.Tigriopus_californicus.auto_annotator.augustus.fasta > /tmp/maker_TQWBu_/0/49_0.7061374-7062552.Tigriopus_californicus.auto_annotator.augustus
#-------------------------------#
deleted:0 genes
 ...processing 0 of 10
 ...processing 1 of 10
 ...processing 2 of 10
 ...processing 3 of 10
 ...processing 4 of 10
 ...processing 5 of 10
 ...processing 6 of 10
 ...processing 7 of 10
 ...processing 8 of 10
 ...processing 9 of 10
Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
--> rank=NA, hostname=c-195-51.kd.unc.edu
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Chromosome_11

ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Chromosome_11

error 48592

#--------- command -------------#
Widget::snap:
/nas02/apps/maker-2.31.8/src/maker/exe/snap/snap -plus /proj/willetlb/users/cwillett/MAKER_analyses/SD_full/PB11_12_eukscafs_sp3_sum_hm1.hmm -xdef /tmp/maker_WADb1j/0/38_0.1656654-1673307.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.xdef.snap  /tmp/maker_WADb1j/0/38_0.1656654-1673307.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap.fasta > /tmp/maker_WADb1j/0/38_0.1656654-1673307.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap
#-------------------------------#
scoring....decoding.10.20.30.40.50.60.70.80.90.100 done
deleted:0 genes
 ...processing 0 of 5
 ...processing 1 of 5
 ...processing 2 of 5
 ...processing 3 of 5
 ...processing 4 of 5
Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
--> rank=NA, hostname=c-193-25.kd.unc.edu
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Chromosome_5

ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Chromosome_5

error 47069

#--------- command -------------#
Widget::snap:
/nas02/apps/maker-2.31.8/src/maker/exe/snap/snap -plus /proj/willetlb/users/cwillett/MAKER_analyses/SD_full/PB11_12_eukscafs_sp3_sum_hm1.hmm -xdef /tmp/maker__PSNX1/0/53_0.6154539-6169567.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.xdef.snap  /tmp/maker__PSNX1/0/53_0.6154539-6169567.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap.fasta > /tmp/maker__PSNX1/0/53_0.6154539-6169567.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap
#-------------------------------#
scoring....decoding.10.20.30.40.50.60.70.80.90.100 done
deleted:0 genes
 ...processing 0 of 10
 ...processing 1 of 10
 ...processing 2 of 10
 ...processing 3 of 10
 ...processing 4 of 10
 ...processing 5 of 10
 ...processing 6 of 10
 ...processing 7 of 10
 ...processing 8 of 10
 ...processing 9 of 10
Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
--> rank=NA, hostname=c-183-35.kd.unc.edu
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Chromosome_12

ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Chromosome_12
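
When comparing runs like the ones above, it can help to tally which contigs failed and at which step. The following is a minimal sketch, not part of MAKER: the function name failed_contigs and both regular expressions are assumptions modeled only on the "ERROR: Failed while ..." and "FAILED CONTIG:..." lines quoted in this message, and other MAKER versions may word these lines differently.

```python
import re
from collections import defaultdict

# Patterns mirror the log lines quoted above (assumption: the wording is stable).
STEP_RE = re.compile(r"^ERROR: Failed while (.+)$")
CONTIG_RE = re.compile(r"^FAILED CONTIG:(\S+)")


def failed_contigs(log_lines):
    """Map each failed contig to the set of 'Failed while ...' steps seen before it."""
    failures = defaultdict(set)
    last_step = "unknown step"
    for raw in log_lines:
        line = raw.strip()
        step = STEP_RE.match(line)
        if step:
            # Remember the most recent failing step description.
            last_step = step.group(1)
            continue
        contig = CONTIG_RE.match(line)
        if contig:
            failures[contig.group(1)].add(last_step)
    return dict(failures)


sample = """\
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Chromosome_3

ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Chromosome_3
""".splitlines()

for name, steps in sorted(failed_contigs(sample).items()):
    print(name, sorted(steps))
```

Running this over the concatenated logs of both runs would show at a glance whether the same contigs fail at the same step in each.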


Can't call method "start" on an undefined value at ../lib/maker/join.pm line 535.
 

From chzelin at gmail.com  Tue Sep  5 08:59:09 2017
From: chzelin at gmail.com (zl c)
Date: Tue, 5 Sep 2017 09:59:09 -0400
Subject: [maker-devel] MSG: Can't get HSPs: data not collected.
Message-ID: <CAO_vRvZyMrfhL=nN79xnJeoXwFbEqTPwF8siioHJ_DFQTUpL2Q@mail.gmail.com>

Hello,

MAKER runs successfully for most of my sequences but fails on some long ones.
The error is:

Widget::tblastx:
/usr/local/apps/blast/ncbi-blast-2.5.0+/bin/tblastx -db db.778415-832259.for_tblastx.fasta -query ...778415.832259.0 -num_alignments 10000 -num_descriptions 10000 -evalue 1e-10 -dbsize 1000 -searchsp 500000000 -num_threads 16 -lcase_masking -seg yes -soft_masking true -show_gis -out   OUT.tblastx
#-------------------------------#

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Can't get HSPs: data not collected.
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/local/Perl/5.18.2/lib/perl5/site_perl/5.18.2/Bio/Root/Root.pm:486
STACK: Bio::Search::Hit::PhatHit::Base::hsps /spin1/home/linux/chenz11/program/maker/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm:552
STACK: Widget::tblastx::keepers /spin1/home/linux/chenz11/program/maker/bin/../lib/Widget/tblastx.pm:192
STACK: Widget::tblastx::parse /spin1/home/linux/chenz11/program/maker/bin/../lib/Widget/tblastx.pm:133
STACK: GI::tblastx /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:3251
STACK: GI::tblastx /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:3260
STACK: GI::reblast_merged_hits /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:471
STACK: GI::merge_resolve_hits /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:291
STACK: Process::MpiChunk::_go /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:2320
STACK: Process::MpiChunk::run /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:340
STACK: Process::MpiChunk::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:356
STACK: Process::MpiTiers::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiTiers.pm:287
STACK: Process::MpiTiers::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiTiers.pm:287
STACK: /home/chenz11/program/maker/bin/maker:695
-----------------------------------------------------------
--> rank=NA, hostname=cn3544
--> rank=NA, hostname=cn3544
--> rank=NA, hostname=cn3544
--> rank=NA, hostname=cn3544
ERROR: Failed while collecting tblastx reports
ERROR: Chunk failed at level:5, tier_type:3
FAILED CONTIG:tig00011625_arrow

ERROR: Chunk failed at level:4, tier_type:0
FAILED CONTIG:tig00011625_arrow


examining contents of the fasta file and run log

I've read a related thread on the Google group and checked my tblastx
output. The number of HSPs should be larger than 1,000,000, but only
1,000,000 were written, which leaves some alignments with no HSPs. Is there
any setting that could solve the problem?

Thanks,
Zelin

--------------------------------------------
Zelin Chen [chzelin at gmail.com]


NIH/NHGRI
Building 50, Room 5531
50 SOUTH DR, MSC 8004
BETHESDA, MD 20892-8004

From qwzhang0601 at gmail.com  Tue Sep  5 15:24:34 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Tue, 5 Sep 2017 16:24:34 -0400
Subject: [maker-devel] Some errors reported by Maker2
Message-ID: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>

Hello:

We are doing genome annotation for a new rodent species. We successfully
finished training the ab initio gene predictors with the following
settings: split_hit=40000, max_dna_len=1000000, and 99k mammalian
Swiss-Prot protein sequences as evidence.

But when I used the trained models to do the genome annotation, I got the
kinds of errors listed below. I used the same parameters as for training,
except that I added 340k rodent TrEMBL protein sequences as protein
evidence (i.e., I use both the 99k mammalian Swiss-Prot protein sequences
and the 340k rodent TrEMBL protein sequences).

I am doing the annotation on a cluster and started multiple MAKER
instances in the same directory (I tried to use MPI but ran into some
problems).

Do you have any suggestions? Many thanks
#some kinds of errors
open3: fork failed: Cannot allocate memory at
/gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
--> rank=NA, hostname=n520
ERROR: Failed while doing blastx of proteins
ERROR: Chunk failed at level:8, tier_type:3
FAILED CONTIG:Contig2


setting up GFF3 output and fasta chunks
doing repeat masking
Can't kill a non-numeric process ID at
/gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
--> rank=NA, hostname=n513
ERROR: Failed while doing repeat masking
ERROR: Chunk failed at level:0, tier_type:1
FAILED CONTIG:Contig12378


Best
Quanwei

From carsonhh at gmail.com  Tue Sep  5 15:56:01 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 5 Sep 2017 14:56:01 -0600
Subject: [maker-devel] ERROR: Can't call method "start" on an undefined
 value at ../lib/maker/join.pm line 535
In-Reply-To: <FC5A69D8-3FE8-45F7-B902-2847E8DC802A@ad.unc.edu>
References: <FC5A69D8-3FE8-45F7-B902-2847E8DC802A@ad.unc.edu>
Message-ID: <7DCB519E-9AFA-4D10-8046-72DE99C5E4FF@gmail.com>

Did you use GFF3 input to MAKER for any steps (for example, pred_gff or est_gff)?

--Carson



From carsonhh at gmail.com  Tue Sep  5 16:48:56 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 5 Sep 2017 15:48:56 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
Message-ID: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>

You ran out of memory. You probably set max_dna_len too high for the machines you are using. There is a note in the maker_opts.ctl file that tells you that this value affects memory usage.

So you can either set it lower, or if running under MPI, use fewer CPUs per node (how you do this is MPI flavor dependent, but some flavors let you do this by setting process count lower combined with the round robin option).

--Carson
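
For anyone applying the first suggestion across several run directories, here is a minimal sketch of lowering max_dna_len in the text of a maker_opts.ctl file while keeping the trailing comment on the line. The helper name set_ctl_option is hypothetical and not part of MAKER, the sketch assumes the usual key=value #comment layout of the control file, and the comments in the example text are illustrative rather than copied from a real file.

```python
import re


def set_ctl_option(ctl_text: str, key: str, value: str) -> str:
    """Return ctl_text with `key=...` set to `value`, preserving any trailing #comment."""
    pattern = re.compile(
        rf"^({re.escape(key)}=)[^#\n]*?([ \t]*(?:#.*)?)$",
        re.MULTILINE,
    )
    # A function is used as the replacement so `value` is inserted literally.
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}{value}{m.group(2)}", ctl_text
    )
    if count == 0:
        raise KeyError(f"{key} not found in control file text")
    return new_text


# Illustrative control-file text (comments shortened for the example).
example = (
    "max_dna_len=1000000 #chunk length (affects memory usage)\n"
    "split_hit=40000 #expected max intron size\n"
)
print(set_ctl_option(example, "max_dna_len", "300000"))
```

The function only edits text in memory; reading and writing the actual file is left to the caller so the original can be backed up first.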




From carsonhh at gmail.com  Tue Sep  5 17:04:00 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 5 Sep 2017 16:04:00 -0600
Subject: [maker-devel] MSG: Can't get HSPs: data not collected.
In-Reply-To: <CAO_vRvZyMrfhL=nN79xnJeoXwFbEqTPwF8siioHJ_DFQTUpL2Q@mail.gmail.com>
References: <CAO_vRvZyMrfhL=nN79xnJeoXwFbEqTPwF8siioHJ_DFQTUpL2Q@mail.gmail.com>
Message-ID: <846D5971-E6EE-40F3-AF26-3124AA370029@gmail.com>

The last time I saw this error, it was with blast-2.5.1+. Not sure if the failure you are seeing is the same one. It was caused by a truncated BLAST report (so some results have no alignments). I have an open ticket with the BLAST development group. I never received confirmation that it is fixed, but you can try updating to 2.6 and see if that fixes it.  If not, switch to legacy BLAST (not blast plus) and see if it goes away.

--Carson




From qwzhang0601 at gmail.com  Tue Sep  5 17:04:23 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Tue, 5 Sep 2017 18:04:23 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
Message-ID: <CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>

Dear Carson:

Thanks. I wonder whether a smaller "max_dna_len" will split longer scaffolds.
I set max_dna_len to 1Mb because there are quite a few long scaffolds
(e.g., the longest one is about 100Mb). Could you explain whether a smaller
"max_dna_len" will decrease the quality of the annotation (e.g., by
splitting some genes within the same scaffold)?


Best
Quanwei


From carsonhh at gmail.com  Tue Sep  5 17:08:28 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 5 Sep 2017 16:08:28 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
Message-ID: <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>

max_dna_len is the window size for keeping data in RAM. Smaller values do not split genes. But values lower than 100kb can create issues (if a single gene model spans 3 or more windows, it creates a weird failure).

--Carson
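
To make the "3 or more windows" condition concrete, here is a minimal arithmetic sketch. It assumes windows are consecutive, non-overlapping blocks of max_dna_len bases starting at position 1 of the contig; MAKER's real chunking may differ, so treat this as an illustration only. The function name windows_spanned and the gene coordinates are hypothetical.

```python
def windows_spanned(start: int, end: int, window: int) -> int:
    """Count fixed-size windows touched by a feature with 1-based inclusive coordinates."""
    if window <= 0 or start < 1 or end < start:
        raise ValueError("need window > 0 and 1 <= start <= end")
    first = (start - 1) // window  # index of the window containing `start`
    last = (end - 1) // window     # index of the window containing `end`
    return last - first + 1


# Hypothetical gene model covering positions 150,001 to 260,000 (110 kb long).
for window in (50_000, 100_000, 300_000):
    print(window, windows_spanned(150_001, 260_000, window))
```

Under this assumption, any gene model longer than twice the window size touches at least three windows wherever it starts, which is why long genes are the ones at risk when the window is small.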




From qwzhang0601 at gmail.com  Wed Sep  6 10:51:54 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 6 Sep 2017 11:51:54 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
Message-ID: <CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>

Dear Carson:

(1) Thank you for your explanation. I will try setting max_dna_len to 400kb
for our rodent species, which is a little higher than the value suggested
for large vertebrate genomes (the maker manual says
"300,000 is a good max_dna_len on large vertebrate genomes if memory is not
a limiting factor").

(2) By reading some of your replies in the maker google group, I
noticed that setting depth_blast to a certain number can reduce memory use
and save annotation time. So I changed the following parameters. But I
wonder whether this will decrease the quality of the annotation. If it
won't affect the quality, can I use an even smaller number (e.g., 20) to
save more memory and time?

depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking

(3) I also have some concerns about the speed, especially for the long
scaffolds (around 100Mb). I wonder which part of genome annotation is the
most time consuming (repeat masking, blast, or polishing?).
In particular, I wonder whether the blastx of protein evidence will take
the majority of the time. I have prepared 99k mammalian Swiss-Prot protein
sequences and 340k rodent TrEMBL protein sequences as protein evidence. I
am considering whether I can save much time if I use only the 99k mammalian
Swiss-Prot protein sequences as evidence.

(4) For some reason, I cannot run maker through MPI on our cluster. So I
can only start multiple maker processes. I wonder whether it is possible to
let multiple maker processes annotate the same long scaffold (i.e., start
multiple maker processes for a single sequence, without splitting the long
sequence into shorter ones).

(5) Still on the speed issue: I read some of your comments about the "cpus"
parameter in the maker_opts file (
http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html).
I understand that it indicates the number of cpus for a single chunk. So if
I set "cpus=2" in the maker_opts file, I can use the following script to
submit the job, right?

**************** the bash file used to submit the maker job
#!/bin/bash

#$ -cwd
#$ -S /bin/bash
#$ -j y
#$ -N makerT2
#$ -l h_vmem=8g
#$ -pe smp 2

module load MAKER/2.31.9/perl.5.22.1

maker --q 2> maker_test.error


Many thanks

Best
Quanwei


2017-09-05 18:08 GMT-04:00 Carson Holt <carsonhh at gmail.com>:

> max_dna_len is the window size for keeping data in RAM. Smaller values do
> not split genes. But values lower than 100kb can create issues (if a single
> gene model spans 3 or more windows, it creates a weird failure).
>
> --Carson
>
>
>
>
> On Sep 5, 2017, at 4:04 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>
> Dear Carson:
>
> Thanks. I wonder whether smaller "max_dna_len" will split longer
> scaffolds. I set max_dna_len as 1Mb, because there are quite many long
> scaffolds (e.g., the longest one is about 100Mb). Would you explain whether
> smaller "max_dna_len" will decrease the quality of annotation (e.g., split
> some genes in the same scaffold)?
>
>
> Best
> Quanwei
>
> 2017-09-05 17:48 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>
>> You ran out of memory. You probably set max_dna_len too high for the
>> machines you are using. There is a note in the maker_opts.ctl file that
>> tells you that this value affects memory usage.
>>
>> So you can either set it lower, or if running under MPI, use fewer CPUs
>> per node (how you do this is MPI flavor dependent, but some flavors let you
>> do this by setting process count lower combined with the round robin
>> option).
>>
>> --Carson
>>
>>
>>
>> On Sep 5, 2017, at 2:24 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>
>> Hello:
>>
>> We are doing genome annotation for a new rodent species. We have finished
>> the training of the ab initio gene predictors successful by setting the
>> following parameters (split_hit=40000, max_dna_len=1000000, and 99k
>> mammalian Swiss protein sequences as evidences.
>>
>> But when I used the trained model to do the genome annotation, I got the
>> following kinds of errors (shown in red). I used the same parameters as
>> those for training, except for addition of 340k rodent TrEMBL protein
>> sequences for protein evidences (i.e., I use both 99k mammalian Swiss
>> protein sequences and 340k rodent TrEMBL protein sequences).
>>
>> I am doing the annotation on a cluster and started multiple Maker in the
>> same directory (I had tried to use MPI but met some problems).
>>
>> Do you have any suggestions? Many thanks
>> #some kinds of errors
>> open3: fork failed: Cannot allocate memory at
>> /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
>> --> rank=NA, hostname=n520
>> ERROR: Failed while doing blastx of proteins
>> ERROR: Chunk failed at level:8, tier_type:3
>> FAILED CONTIG:Contig2
>>
>>
>> setting up GFF3 output and fasta chunks
>> doing repeat masking
>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm
>> line 1050.
>> --> rank=NA, hostname=n513
>> ERROR: Failed while doing repeat masking
>> ERROR: Chunk failed at level:0, tier_type:1
>> FAILED CONTIG:Contig12378
>>
>>
>> Best
>> Quanwei
>>
>>
>>
>
>

From carsonhh at gmail.com  Wed Sep  6 11:06:46 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 6 Sep 2017 10:06:46 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
Message-ID: <9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>


> (2) By reading some of your replies in the maker google group, and I noticed that it can reduce memory and save time for annotation if I set depth_blast to a certain number. So I changed the following parameters. But I wonder, whether it will decrease the quality of annotation? If it won't affect the quality, can I even use a smaller number (e.g., 20) to save more memory and time?
> 
> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking

These values really only affect the final evidence kept in the GFF3 when you look at it in a browser. They have no effect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. By setting it lower, you are just letting MAKER know it can throw things away even sooner since you don't want them in the GFF3. It provides a minor improvement in memory use, but max_dna_len is the big one that has the greatest effect.
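
As a rough illustration of the window size trade-off (a sketch, not MAKER's actual chunking code): if a sequence is cut into windows of at most max_dna_len bases, the number of windows is the length divided by the window size, rounded up. The function name window_count is made up for this example, and it ignores any adjustment MAKER may apply internally.

```shell
#!/bin/bash
# window_count LENGTH MAX_DNA_LEN
# Hypothetical helper: number of windows needed to cover LENGTH bases
# when each window holds at most MAX_DNA_LEN bases (ceiling division).
window_count() {
    local length=$1 window=$2
    echo $(( (length + window - 1) / window ))
}

window_count 100000000 1000000   # 100Mb scaffold, 1Mb windows
window_count 100000000 300000    # same scaffold, 300kb windows
```

For a 100Mb scaffold this prints 100 and 334. More windows means each one is smaller, so less data has to be kept in RAM at once.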


> (3) I also have some concerns about the speed, especially for the long scaffolds (around 100Mb). I wonder which part is the most time consuming for genome annotation (repeat masking, blast, or polishing?).  Particularly, I wonder whether the blastx of protein evidence will take majority of time. Now, I have prepared 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences as protein evidences. I am considering whether I can save much time if I only use the 99k mammalian Swiss protein sequences as evidences.

BLASTN (ESTs) -> fastest as it is searching nucleotide space
BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN
TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX

Also, if you double the dataset size, you double the runtime. Larger window sizes via max_dna_len will also increase runtimes.
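
The rules of thumb above can be turned into a quick back-of-the-envelope estimate. This is only a sketch: the function name relative_cost is made up, the factors 1, 6 and 12 are the lower bounds quoted above, and using the number of sequences (in thousands) as a stand-in for dataset size is an assumption. The result is a unitless score for comparing setups, not a time prediction.

```shell
#!/bin/bash
# relative_cost PROGRAM DATASET_SIZE
#   PROGRAM       blastn | blastx | tblastx
#   DATASET_SIZE  size of the evidence set in any consistent unit
#                 (here: thousands of sequences, an assumption)
# Prints a unitless cost: reading-frame factor times dataset size.
relative_cost() {
    local program=$1 size=$2 frames
    case "$program" in
        blastn)  frames=1  ;;   # nucleotide space only
        blastx)  frames=6  ;;   # 6 reading frames
        tblastx) frames=12 ;;   # 12 reading frames
        *) echo "unknown program: $program" >&2; return 1 ;;
    esac
    echo $(( frames * size ))
}

relative_cost blastx 99     # 99k Swiss-Prot proteins only
relative_cost blastx 439    # 99k Swiss-Prot + 340k TrEMBL proteins
```

Under these assumptions, adding the 340k TrEMBL proteins raises the blastx estimate from 594 to 2634, roughly 4.4 times the work.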


> (4) For some reasons, I can not run maker though MPI on our cluster. So I can only start multiple maker. I wonder if it is possible to let multiple maker to annotate the same long scaffold (i.e., for a single sequence I start multiple maker, without splitting the long sequence into shorter ones).

Without MPI you won't be able to split up large contigs. At the very least you can try to run on a single node and set MPI to use all CPUs on that node. It's less difficult to set up than cross-node jobs via MPI.


> (5) Still about the speed issue. I read some of your comments about "cpus" parameters in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html). And I know it indicate the number of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?

The cpus parameter only affects how many CPUs are given to the blast command line, so only the BLAST step will speed up. I recommend using MPI to get all steps to speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
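
A minimal sketch of what a single-node MPI submission could look like, adapted from the Grid Engine script earlier in this thread. Assumptions: MAKER was built with MPI support, an MPI launcher named mpiexec is on the PATH (it may need its own module), the parallel environment is called smp, and 8 slots fit on one node. NSLOTS is the variable Grid Engine sets to the number of granted slots. The helper maker_mpi_cmd and the RUN_MAKER switch are made up here so that the script only prints the command unless asked to run it.

```shell
#!/bin/bash
#$ -cwd
#$ -S /bin/bash
#$ -j y
#$ -N makerMPI
#$ -l h_vmem=8g
# Assumption: PE name and slot count are site-specific; adjust both.
#$ -pe smp 8

# Print the MPI command line for the granted slots.
# Falls back to 1 process when NSLOTS is not set (outside the scheduler).
maker_mpi_cmd() {
    echo "mpiexec -n ${NSLOTS:-1} maker"
}

if [ "${RUN_MAKER:-0}" = "1" ]; then
    # Module name copied from the original script; yours may differ.
    module load MAKER/2.31.9/perl.5.22.1
    mpiexec -n "${NSLOTS:-1}" maker 2> maker_test.error
else
    # Dry run: only show what would be executed.
    maker_mpi_cmd
fi
```

With MPI handling the parallelization, cpus in maker_opts would normally stay at 1 so the BLAST threads do not oversubscribe the slots. On many Grid Engine setups h_vmem is counted per slot, so the total memory request grows with the slot count.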


--Carson

From carsonhh at gmail.com  Thu Sep  7 10:12:46 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 7 Sep 2017 09:12:46 -0600
Subject: [maker-devel] MSG: Can't get HSPs: data not collected.
In-Reply-To: <CAO_vRvbQLQfsRUfXMC1f8=U9L2kJ=PnFSuePjwHpRJE39zdV1Q@mail.gmail.com>
References: <CAO_vRvZyMrfhL=nN79xnJeoXwFbEqTPwF8siioHJ_DFQTUpL2Q@mail.gmail.com>
	<846D5971-E6EE-40F3-AF26-3124AA370029@gmail.com>
	<CAO_vRvbQLQfsRUfXMC1f8=U9L2kJ=PnFSuePjwHpRJE39zdV1Q@mail.gmail.com>
Message-ID: <2B046506-1E32-4840-B3B6-6DABB4A5D4C2@gmail.com>

I'm glad it fixed it.

--Carson

> On Sep 6, 2017, at 8:27 PM, zl c <chzelin at gmail.com> wrote:
> 
> Hi Carson,
> 
> I tried blast-2.6.0+ and it works. Thank you very much.
> 
> Thanks
> Zelin Chen
> 
> --------------------------------------------
> Zelin Chen [chzelin at gmail.com]
> 
> NIH/NHGRI
> Building 50, Room 5531
> 50 SOUTH DR, MSC 8004 
> BETHESDA, MD 20892-8004
> 
> On Tue, Sep 5, 2017 at 6:04 PM, Carson Holt <carsonhh at gmail.com> wrote:
> The last time I saw this error, it was with blast-2.5.1+. Not sure if the failure you are seeing is the same one. It was caused by a truncated BLAST report (so some results have no alignments). I have an open ticket with the BLAST development group. I never received confirmation that it is fixed, but you can try updating to 2.6 and see if that fixes it. If not, switch to legacy BLAST (not blast plus) and see if it goes away.
> 
> --Carson
> 
> 
>> On Sep 5, 2017, at 7:59 AM, zl c <chzelin at gmail.com> wrote:
>> 
>> Hello,
>> 
>> I run maker for most sequences successfully but fail some long sequences. The error is: 
>> 
>> Widget::tblastx:
>> /usr/local/apps/blast/ncbi-blast-2.5.0+/bin/tblastx -db db.778415-832259.for_tblastx.fasta -query ...778415.832259.0 -num_alignments 10000 -num_descriptions 10000 -evalue 1e-10 -dbsize 1000 -searchsp 500000000 -num_threads 16 -lcase_masking -seg yes -soft_masking true -show_gis -out   OUT.tblastx
>> #-------------------------------#
>> 
>> ------------- EXCEPTION: Bio::Root::Exception -------------
>> MSG: Can't get HSPs: data not collected.
>> STACK: Error::throw
>> STACK: Bio::Root::Root::throw /usr/local/Perl/5.18.2/lib/perl5/site_perl/5.18.2/Bio/Root/Root.pm:486
>> STACK: Bio::Search::Hit::PhatHit::Base::hsps /spin1/home/linux/chenz11/program/maker/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm:552
>> STACK: Widget::tblastx::keepers /spin1/home/linux/chenz11/program/maker/bin/../lib/Widget/tblastx.pm:192
>> STACK: Widget::tblastx::parse /spin1/home/linux/chenz11/program/maker/bin/../lib/Widget/tblastx.pm:133
>> STACK: GI::tblastx /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:3251
>> STACK: GI::tblastx /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:3260
>> STACK: GI::reblast_merged_hits /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:471
>> STACK: GI::merge_resolve_hits /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:291
>> STACK: Process::MpiChunk::_go /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:2320
>> STACK: Process::MpiChunk::run /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:340
>> STACK: Process::MpiChunk::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:356
>> STACK: Process::MpiTiers::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiTiers.pm:287
>> STACK: Process::MpiTiers::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiTiers.pm:287
>> STACK: /home/chenz11/program/maker/bin/maker:695
>> -----------------------------------------------------------
>> --> rank=NA, hostname=cn3544
>> --> rank=NA, hostname=cn3544
>> --> rank=NA, hostname=cn3544
>> --> rank=NA, hostname=cn3544
>> ERROR: Failed while collecting tblastx reports
>> ERROR: Chunk failed at level:5, tier_type:3
>> FAILED CONTIG:tig00011625_arrow
>> 
>> ERROR: Chunk failed at level:4, tier_type:0
>> FAILED CONTIG:tig00011625_arrow
>> 
>> examining contents of the fasta file and run log
>> 
>> I've read a related thread on the google group and checked my tblastx output. I found that the number of HSPs should be larger than 1,000,000, but only 1,000,000 are output, which leaves some alignments with no HSPs. Is there any setting that could solve the problem?
>>  
>> Thanks,
>> Zelin
>> 
>> --------------------------------------------
>> Zelin Chen [chzelin at gmail.com]
>> 
>> 
>> NIH/NHGRI
>> Building 50, Room 5531
>> 50 SOUTH DR, MSC 8004 
>> BETHESDA, MD 20892-8004
> 
> 


From qwzhang0601 at gmail.com  Fri Sep  8 22:25:29 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Fri, 8 Sep 2017 23:25:29 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
Message-ID: <CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>

Dear Carson:

I got the following error again. Is this still related to memory issues? I
wonder whether other causes could lead to this error. This time, I got the
error during training of the SNAP model. Before, even with
max_dna_len=1Mb, I could train the model successfully. In the current
training (where I get the following error), I have decreased max_dna_len
to 300kb. I requested the same amount of memory as before. The only
difference is that I am now using both the mammalian repeat library and a
species-specific repeat library, while previously I used only the mammalian
repeat library. Will using both repeat libraries greatly increase the
memory requirement (even when I decrease max_dna_len from 1Mb to 300kb)? I
have also set depth_blast to 30 in the current training.

Thank you! Have a nice weekend!


#---------------------------------------------------------------------
Now starting the contig!!
SeqID: Contig10
Length: 18773588
#---------------------------------------------------------------------


setting up GFF3 output and fasta chunks
doing repeat masking
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
collecting blastx repeatmasking
processing all repeats
doing repeat masking
Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
--> rank=NA, hostname=n224
ERROR: Failed while doing repeat masking
ERROR: Chunk failed at level:0, tier_type:1
FAILED CONTIG:Contig10

ERROR: Chunk failed at level:2, tier_type:0
FAILED CONTIG:Contig10

Best
Quanwei


From xvazquezc at gmail.com  Sun Sep 10 20:03:11 2017
From: xvazquezc at gmail.com (=?UTF-8?Q?Xabier_V=C3=A1zquez=2DCampos?=)
Date: Mon, 11 Sep 2017 11:03:11 +1000
Subject: [maker-devel] augustus underpredicting
Message-ID: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>

Hi,
I have been annotating a fungal genome as usual, using Busco-trained
Augustus (in addition to GeneMark and SNAP), but for some reason, Augustus
is predicting a mere 207 genes compared to 15-20k from the other two.
I've never had this problem. The genome has an unusually high repeat
content, close to 50%; I'm not sure whether that might be a problem.
Has anybody come across a similar issue?
I also asked the Busco developers if they have any idea:
https://gitlab.com/ezlab/busco/issues/49
Cheers,
Xabi

-- 
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From qwzhang0601 at gmail.com  Mon Sep 11 11:19:50 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 11 Sep 2017 12:19:50 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
Message-ID: <CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>

Dear Carson:

About the error in my previous email: I found the contig was correctly
annotated on the second try (RETRY). So please ignore my last email. But
now, for a small number of scaffolds, I have problems processing the
repeats (as shown below in red). I used both the Mammalia repeat library and
a species-specific repeat library (generated with your pipeline "
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic").
There were no such problems when I used only the Mammalia repeat library. Do
you have any ideas about this? What could be the reason? Or do you have any
suggestions for how I can find the cause? Many thanks

Here are some parameters I used

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe

max_dna_len=300000
split_hit=40000
depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking


Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
--> rank=NA, hostname=n409
ERROR: Failed while processing all repeats
ERROR: Chunk failed at level:3, tier_type:1
FAILED CONTIG:Contig31


Best
Quanwei


From carsonhh at gmail.com  Mon Sep 11 11:48:16 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 10:48:16 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
Message-ID: <5C2477A3-CDBA-458A-95CA-E6DC912417B3@gmail.com>

It may be a memory issue or an IO issue. Some resource is being taxed and creating a non-responsive bottleneck. If you are running MAKER multiple times in the same directory, you may have to run fewer processes. Also, if you are running without MPI, run with MPI instead, as it will better manage the parallelization and use fewer resources than multiple individual processes.
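
To decide what to rerun after lowering the process count, it can help to list the contigs that did not finish. A minimal sketch, assuming the master datastore index log in the MAKER output directory is tab-separated, with the contig ID in column 1 and a status (such as STARTED, FAILED, RETRY, FINISHED) in column 3. Please check your own file, since that layout is an assumption here. The function name unfinished_contigs is made up for this example.

```shell
#!/bin/bash
# unfinished_contigs INDEX_LOG
# Prints "contig<TAB>last status" for every contig whose most recent
# entry in the index log is anything other than FINISHED.
unfinished_contigs() {
    awk -F'\t' '
        { last[$1] = $3 }    # later lines overwrite earlier statuses
        END {
            for (id in last)
                if (last[id] != "FINISHED")
                    print id "\t" last[id]
        }
    ' "$1" | sort
}
```

Usage would be along the lines of `unfinished_contigs path/to/your_master_datastore_index.log`, with the path replaced by your own. A contig that failed once and then finished on retry is not listed, because only its last status counts.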

--Carson


> On Sep 8, 2017, at 9:25 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> I got the following error again. Is this still related to memory issues? I wonder whether there can be other reasons lead to this error? This time, I got this error during training of the SNAP model. Before, even I set  max_dna_len=1Mb, I can train the model successfully.  And in the current training (where I get the following error),  I have decreased the max_dna_len to 300kb. I required the same amount memory as before. The only difference is that I am using both mammalian repeat library and species specific repeat library, while previously I only use the mammalian repeat library. Will it greatly increases the requirement of memory to use both repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? I have also set the depth_blast as 30 in current training.
> 
> Thank you! Have a nice weekend! 
> 
> 
> 
> #---------------------------------------------------------------------
> Now starting the contig!!
> SeqID: Contig10
> Length: 18773588
> #---------------------------------------------------------------------
> 
> 
> setting up GFF3 output and fasta chunks
> doing repeat masking
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> collecting blastx repeatmasking
> processing all repeats
> doing repeat masking
> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
> --> rank=NA, hostname=n224
> ERROR: Failed while doing repeat masking
> ERROR: Chunk failed at level:0, tier_type:1
> FAILED CONTIG:Contig10
> 
> ERROR: Chunk failed at level:2, tier_type:0
> FAILED CONTIG:Contig10
> 
> Best
> Quanwei
> 
> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
> 
>> (2) From reading some of your replies in the MAKER Google group, I noticed that setting depth_blast to a certain number can reduce memory use and save annotation time. So I changed the following parameters. But I wonder whether it will decrease the quality of the annotation. If it won't affect the quality, can I use an even smaller number (e.g., 20) to save more memory and time?
>> 
>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
> 
> These values really only affect the final evidence kept in the GFF3 when you look at it in a browser. They have no effect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. By setting it lower, you are just letting MAKER know it can throw things away even sooner, since you don't want them in the GFF3. It provides a minor improvement in memory use, but max_dna_len is the setting that has the greatest effect.
> 
> 
>> (3) I also have some concerns about speed, especially for the long scaffolds (around 100Mb). I wonder which part of genome annotation is the most time consuming (repeat masking, BLAST, or polishing?). In particular, I wonder whether the blastx of protein evidence will take the majority of the time. I have prepared 99k mammalian Swiss-Prot protein sequences and 340k rodent TrEMBL protein sequences as protein evidence. I am considering whether I can save much time if I use only the 99k mammalian Swiss-Prot sequences as evidence.
> 
> BLASTN (ESTs) -> fastest as it is searching nucleotide space
> BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN
> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX
> 
> Also, doubling the dataset size doubles the runtime. Larger window sizes via max_dna_len will also increase runtimes.
> 
> 
>> (4) For some reason, I cannot run MAKER through MPI on our cluster, so I can only start multiple MAKER instances. I wonder if it is possible to let multiple MAKER instances annotate the same long scaffold (i.e., start multiple MAKER instances on a single sequence, without splitting the long sequence into shorter ones).
> 
> Without MPI you won't be able to split up large contigs. At the very least you can try to run on a single node and set MPI to use all CPUs on that node. It's less difficult to set up than cross-node jobs via MPI.
> 
> 
>> (5) Still about the speed issue. I read some of your comments about the "cpus" parameter in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html). I understand that it indicates the number of CPUs for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?
> 
> The cpus parameter only affects how many CPUs are given to the BLAST command line, so only the BLAST step will speed up. I recommend using MPI to get all steps to speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
> 
> 
> --Carson
> 
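
The runtime rule of thumb quoted above (BLASTN fastest, BLASTX at least 6 times slower, TBLASTX at least 12 times slower, runtime growing with dataset size) can be turned into a rough back-of-the-envelope estimate. The sketch below is only an illustration: the function name and the strictly linear scaling are assumptions, and the factors are the lower bounds given in the thread, not measurements.

```python
# Relative per-search cost quoted in the thread, with BLASTN as the baseline.
# These are "at least" factors, so treat the results as lower bounds.
RELATIVE_COST = {"blastn": 1, "blastx": 6, "tblastx": 12}


def relative_runtime(program, n_sequences, baseline_sequences):
    """Estimate runtime relative to BLASTN on `baseline_sequences` sequences.

    Assumes runtime scales linearly with the number of evidence sequences
    ("double the dataset size, double the runtime").
    """
    return RELATIVE_COST[program] * n_sequences / baseline_sequences


# Dataset sizes mentioned in the question.
swissprot = 99_000
trembl = 340_000

full = relative_runtime("blastx", swissprot + trembl, swissprot)
reduced = relative_runtime("blastx", swissprot, swissprot)
speedup = full / reduced  # 439/99, roughly 4.4

print(f"BLASTX on both sets is ~{speedup:.1f}x the cost of Swiss-Prot alone")
```

Under these assumptions, dropping the 340k TrEMBL sequences would cut the protein BLASTX work to a bit under a quarter of the original; other steps (repeat masking, polishing) are not modelled here.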


From carsonhh at gmail.com  Mon Sep 11 11:50:41 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 10:50:41 -0600
Subject: [maker-devel] augustus underpredicting
In-Reply-To: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>
References: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>
Message-ID: <E0AFDA01-3142-4A14-A4F0-E621E5BD4C2F@gmail.com>

BUSCO may be generating too few models. BUSCO also identifies classes of conserved short genes that may not represent enough training diversity for your organism. Try running MAKER in protein2genome or est2genome mode, and then train with those results.

--Carson


> On Sep 10, 2017, at 7:03 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com> wrote:
> 
> Hi,
> I have been annotating a fungal genome as usual, using BUSCO-trained Augustus (in addition to GeneMark and SNAP), but for some reason Augustus is predicting a mere 207 genes, compared to 15-20k from the other two.
> I've never had this problem. The genome has an unusual repeat content, close to 50%; I am not sure whether that might pose a problem.
> Has anybody come across a similar issue?
> I also asked the BUSCO developers if they have any ideas: https://gitlab.com/ezlab/busco/issues/49
> Cheers,
> Xabi
> 
> -- 
> Xabier Vázquez-Campos, PhD
> Research Associate
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA


From carsonhh at gmail.com  Mon Sep 11 12:07:12 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 11:07:12 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
Message-ID: <92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>

I think the cause of the error may have been a little further upstream from what you pasted in the e-mail. One thing that may be happening is that you are taxing resources (like IO) by running MAKER multiple times or on too many CPUs. That can lead to failures because of truncated BLAST reports, etc. In that case you can just retry, and that will get around those types of IO-derived errors. MAKER can generate a lot of IO, and if you are working on network-mounted locations (i.e. the storage being used is actually across the network), they can be less robust than local storage (under heavy load, NFS can falsely report success on read/write operations that actually failed). That is the reason we built the retry capabilities into MAKER.

For contigs that continuously fail, you may need to set clean_try=1. That will cause failures to start from scratch (i.e. delete all old reports on failure rather than just those suspected of being truncated).

--Carson
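
Since the real cause may sit a few lines upstream of the FAILED CONTIG message, it can help to pull out the lines that precede each failure in a captured MAKER log. A minimal sketch, assuming the run's output was saved as text and uses the "FAILED CONTIG:<id>" format shown in this thread; the helper name and the context size are hypothetical, not part of MAKER:

```python
from collections import deque

# Sample taken from the log pasted in this thread.
SAMPLE_LOG = """doing repeat masking
Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
--> rank=NA, hostname=n224
ERROR: Failed while doing repeat masking
ERROR: Chunk failed at level:0, tier_type:1
FAILED CONTIG:Contig10

ERROR: Chunk failed at level:2, tier_type:0
FAILED CONTIG:Contig10
""".splitlines()


def failed_contig_context(log_lines, context=5):
    """Map each failed contig ID to the lines just before its first failure."""
    recent = deque(maxlen=context)  # rolling window of the latest lines
    failures = {}
    for raw in log_lines:
        line = raw.rstrip("\n")
        if line.startswith("FAILED CONTIG:"):
            contig = line.split(":", 1)[1].strip()
            # Keep only the context of the first failure for each contig.
            failures.setdefault(contig, list(recent))
        recent.append(line)
    return failures


failures = failed_contig_context(SAMPLE_LOG, context=5)
for contig, lines in failures.items():
    print(contig)
    for line in lines:
        print("    " + line)
```

For a real run, pass the lines of the saved log file instead of SAMPLE_LOG and raise `context` if the first error is further upstream.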


> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> About the error in my previous email, I found the contig was correctly annotated on the second retry, so please ignore my last email. But now, for a small number of scaffolds, I am having problems processing the repeats (as shown below). I used both the Mammalia repeat library and a species-specific repeat library (generated by your pipeline "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic"). There were no such problems when I used only the Mammalia repeat library. Do you have any ideas about this? What could be the reason? Or do you have any suggestions for how I can find the reason? Many thanks
> 
> Here are some parameters I used
> 
> #-----Repeat Masking (leave values blank to skip repeat masking)
> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
> 
> max_dna_len=300000
> split_hit=40000
> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
> 
> 
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
> --> rank=NA, hostname=n409
> ERROR: Failed while processing all repeats
> ERROR: Chunk failed at level:3, tier_type:1
> FAILED CONTIG:Contig31
> 
> 
> Best
> Quanwei
> 
> 


From qwzhang0601 at gmail.com  Mon Sep 11 12:12:29 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 11 Sep 2017 13:12:29 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
Message-ID: <CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>

Dear Carson:

I only run 5 MAKER instances in each directory (and set cpus=2). If it is
related to a memory issue or an IO issue, I am not sure why the much longer
scaffolds (longer than the failed ones) were all annotated successfully,
while the relatively shorter ones failed.

I have set "tries=5" (#number of times to try a contig if there is a
failure for some reason). I will try "clean_try=1" and test the failed
scaffolds individually with more memory to see whether they can be
annotated.

Thank you!

Best
Quanwei
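
For anyone scripting these retry settings across several run directories, the change can be sketched as a small edit of the control-file text. This is an illustration only: it assumes the "key=value #comment" layout seen in the maker_opts excerpts in this thread, and the helper name is hypothetical rather than part of MAKER.

```python
SAMPLE_OPTS = """max_dna_len=300000
tries=5 #number of times to try a contig if there is a failure for some reason
"""


def set_ctl_option(ctl_text, key, value):
    """Set key=value in control-file text, keeping any trailing #comment.

    If the key is not present, the setting is appended at the end.
    """
    out, found = [], False
    for line in ctl_text.splitlines():
        setting, hash_sep, comment = line.partition("#")
        name, eq_sep, _old_value = setting.partition("=")
        if eq_sep and name.strip() == key:
            # Rewrite the value and put the original comment back.
            line = f"{key}={value}" + (f" #{comment}" if hash_sep else "")
            found = True
        out.append(line)
    if not found:
        out.append(f"{key}={value}")
    return "\n".join(out) + "\n"


# Turn on clean retries, as suggested for contigs that keep failing.
updated = set_ctl_option(SAMPLE_OPTS, "clean_try", "1")
print(updated)
```

Only option names that appear in this thread are used; check them against the maker_opts file generated by your own MAKER version before relying on this.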


From carsonhh at gmail.com  Mon Sep 11 12:14:11 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 11:14:11 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
Message-ID: <488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>

It could be either. Please use MPI instead of starting multiple instances. It will greatly reduce both IO and RAM usage.

--Carson
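
As a concrete illustration of replacing several independent instances with one MPI job, the sketch below builds a single-node command line. It assumes MAKER was built with MPI support and that mpiexec is the MPI launcher on the system; the helper name is hypothetical, and the process count of 10 simply mirrors the 5 instances with cpus=2 mentioned in the quoted message.

```python
import os
import shlex


def mpi_maker_command(n_procs=None):
    """Build an mpiexec command that starts MAKER with n_procs processes.

    With n_procs=None, use every CPU the OS reports for this node.
    """
    if n_procs is None:
        n_procs = os.cpu_count() or 1
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    return ["mpiexec", "-n", str(n_procs), "maker"]


cmd = mpi_maker_command(10)
print(shlex.join(cmd))  # the command to place in a job script
```

The list form can be passed to subprocess.run from the run directory; whether cross-node launches work depends on the cluster's MPI setup, which this sketch does not address.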


> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> I only run 5 MAKER instances in each directory (and set cpus=2). If it is related to a memory issue or an IO issue, I am not sure why the much longer scaffolds (longer than the failed ones) were all annotated successfully, while the relatively shorter ones failed.
> 
> I have set "tries=5" (#number of times to try a contig if there is a failure for some reason). I will try "clean_try=1" and test the failed scaffolds individually with more memory to see whether they can be annotated.
> 
> Thank you!
> 
> Best
> Quanwei
> 
> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>>:
> I think the cause of the error may have been a little further upstream from what you pasted in the e-mail. One thing that may be happening is that you are taxing resources (like IO) if running MAKER multiple times or on too many CPUs. That can lead to failures because of truncated BLAST reports etc. In which case you can just retry and that will get around those types of IO derived errors. MAKER can generate a lot of IO, and if you are working on network mounted locations (i.e. the storage being used is actually across the network), then they can be lest robust than local storage (when under heavy load NFS can falsely report success on read/write operations that actually failed). It?s the reason we built in the retry capabilities of MAKER.
> 
> For contigs that continuously fail, you may need to set clean_try=1. That will cause failures to start from scratch (i.e. delete all old reports on failure rather than just those suspected of being truncated).
> 
> ?Carson
> 
> 
> 
>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com <mailto:qwzhang0601 at gmail.com>> wrote:
>> 
>> Dear Carson:
>> 
>> About the error in my above email, I found the contig was correctly annotated at the second time RETRY. So please ignore my last email. But now, for a few number of scaffolds, I met problems to process the repeats (as shown below in red). I used both Mammalia repeat library and species specific repeat library (which is generated by your pipeline "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic <http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic>"). There were no such problems when I only used Mammalia repeat library. Do you have any ideas about this? What could be the reason? Or do you have any suggestions for me to find the reason? Many thanks  
>> 
>> Here are some parameters I used
>> 
>> #-----Repeat Masking (leave values blank to skip repeat masking)
>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
>> 
>> max_dna_len=300000
>> split_hit=40000
>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>> 
>> 
>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>> 33708 --> rank=NA, hostname=n409
>> 33709 ERROR: Failed while processing all repeats
>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>> 33711 FAILED CONTIG:Contig31
>> 
>> 
>> Best
>> Quanwei
>> 
>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com <mailto:qwzhang0601 at gmail.com>>:
>> Dear Carson:
>> 
>> I got the following error again. Is this still related to memory issues? I wonder whether there can be other reasons lead to this error? This time, I got this error during training of the SNAP model. Before, even I set  max_dna_len=1Mb, I can train the model successfully.  And in the current training (where I get the following error),  I have decreased the max_dna_len to 300kb. I required the same amount memory as before. The only difference is that I am using both mammalian repeat library and species specific repeat library, while previously I only use the mammalian repeat library. Will it greatly increases the requirement of memory to use both repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? I have also set the depth_blast as 30 in current training.
>> 
>> Thank you! Have a nice weekend! 
>> 
>> 
>> 
>> #---------------------------------------------------------------------
>> Now starting the contig!!
>> SeqID: Contig10
>> Length: 18773588
>> #---------------------------------------------------------------------
>> 
>> 
>> setting up GFF3 output and fasta chunks
>> doing repeat masking
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> collecting blastx repeatmasking
>> processing all repeats
>> doing repeat masking
>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
>> --> rank=NA, hostname=n224
>> ERROR: Failed while doing repeat masking
>> ERROR: Chunk failed at level:0, tier_type:1
>> FAILED CONTIG:Contig10
>> 
>> ERROR: Chunk failed at level:2, tier_type:0
>> FAILED CONTIG:Contig10
>> 
>> Best
>> Quanwei
>> 
>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>> 
>>> (2) From reading some of your replies in the maker google group, I noticed that setting depth_blast to a certain number can reduce memory use and save annotation time, so I changed the following parameters. Will this decrease the quality of the annotation? If not, can I use an even smaller number (e.g., 20) to save more memory and time?
>>> 
>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>> 
>> These values really only affect the final evidence kept in the GFF3 when you look at it in a browser. They have no effect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. By setting the values lower, you are just letting MAKER know it can throw things away even sooner since you don't want them in the GFF3. It provides a minor improvement for memory use, but max_dna_len is the big one that has the greatest effect.
>> 
>> 
>>> (3) I also have some concerns about speed, especially for the long scaffolds (around 100Mb). Which part of genome annotation is the most time-consuming (repeat masking, blast, or polishing)? In particular, I wonder whether the blastx of protein evidence will take the majority of the time. I have prepared 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences as protein evidence. I am considering whether I could save much time by using only the 99k mammalian Swiss protein sequences as evidence.
>> 
>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>> BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN
>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX
>> 
>> Also, doubling the dataset size doubles the runtime. Larger window sizes via max_dna_len will also increase runtimes.
>> 
>> 
>>> (4) For some reason, I cannot run maker through MPI on our cluster, so I can only start multiple maker instances. Is it possible to have multiple maker instances annotate the same long scaffold (i.e., start multiple maker instances for a single sequence, without splitting the long sequence into shorter ones)?
>> 
>> Without MPI you won't be able to split up large contigs. At the very least you can try to run on a single node and set MPI to use all CPUs on that node. It's less difficult to set up than cross-node jobs via MPI.
>> 
>> 
>>> (5) Still about the speed issue. I read some of your comments about the "cpus" parameter in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html), and I understand that it indicates the number of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?
>> 
>> The cpus parameter only affects how many CPUs are given to the blast command line, so only the BLAST step will speed up. I recommend using MPI to get all steps to speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
>> 
>> 
>> --Carson
>> 
>> 
> 
> 
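
As a rough illustration of the rules of thumb quoted above (BLASTX at least 6 times slower than BLASTN, TBLASTX at least 12 times, runtime scaling linearly with dataset size), the sketch below estimates how much BLASTX work is added by the 340k TrEMBL sequences. It assumes sequence count is a fair proxy for dataset size, which the thread does not state, and the function and constant names are hypothetical.

```python
# Relative per-search cost factors taken from the rules of thumb in the thread.
# These are lower bounds ("at least"), not measurements.
COST_FACTOR = {"blastn": 1, "blastx": 6, "tblastx": 12}


def relative_cost(program: str, n_sequences: int) -> int:
    """Return a unitless cost: frame factor times dataset size.

    Assumes runtime scales linearly with the number of sequences
    ("double the dataset size, double the runtime").
    """
    return COST_FACTOR[program] * n_sequences


small_set = relative_cost("blastx", 99_000)
both_sets = relative_cost("blastx", 99_000 + 340_000)

# Ratio of BLASTX work with both protein sets versus the 99k set alone.
ratio = both_sets / small_set
print(f"BLASTX work is about {ratio:.1f}x larger with both protein sets")
```

Under these assumptions, keeping only the 99k sequences would cut the BLASTX portion of a run to under a quarter of its size with both sets; the other pipeline steps are unaffected by this estimate.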

From qwzhang0601 at gmail.com  Mon Sep 11 12:16:49 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 11 Sep 2017 13:16:49 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
Message-ID: <CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>

Dear Carson:

I ran into some problems using MPI. I will give it another try.
Thank you!

Best
Quanwei

2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:

> It could be either. Please use MPI instead of starting multiple instances.
> It will greatly reduce both IO and RAM usage.
>
> --Carson
>
>
>
> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>
> Dear Carson:
>
> I am only running 5 MAKER instances in each directory (with cpus=2). If
> this is a memory or IO issue, I am not sure why the much longer scaffolds
> were all annotated successfully while the relatively shorter ones failed.
>
> I have set "tries=5" (#number of times to try a contig if there is a
> failure for some reason). I will try "clean_try=1" and test the failed
> scaffolds individually with more memory to see whether they can be
> annotated.
>
> Thank you!
>
> Best
> Quanwei
>
> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>
>> I think the cause of the error may have been a little further upstream
>> from what you pasted in the e-mail. One thing that may be happening is that
>> you are taxing resources (like IO) by running MAKER multiple times or on
>> too many CPUs. That can lead to failures because of truncated BLAST reports,
>> etc. In that case you can just retry, and that will get around those types
>> of IO-derived errors. MAKER can generate a lot of IO, and if you are
>> working on network-mounted locations (i.e. the storage being used is
>> actually across the network), they can be less robust than local
>> storage (under heavy load, NFS can falsely report success on read/write
>> operations that actually failed). That is the reason we built the retry
>> capabilities into MAKER.
>>
>> For contigs that continuously fail, you may need to set clean_try=1. That
>> will cause failures to start from scratch (i.e. delete all old reports on
>> failure rather than just those suspected of being truncated).
>>
>> --Carson
>>
>>
>>
>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>> wrote:
>>
>> Dear Carson:
>>
>> Regarding the error in my previous email, I found that the contig was
>> annotated correctly on the second retry, so please ignore that email. But
>> now, for a few scaffolds, I have problems processing the repeats
>> (as shown below). I used both the Mammalia repeat library and a
>> species-specific repeat library (generated with your pipeline at
>> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic).
>> There were no such problems when I used only the Mammalia repeat library.
>> Do you have any idea what the reason could be, or any suggestions for
>> how I can track it down? Many thanks.
>>
>> Here are some parameters I used
>>
>> #-----Repeat Masking (leave values blank to skip repeat masking)
>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
>>
>> max_dna_len=300000
>> split_hit=40000
>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>
>>
>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>> 33708 --> rank=NA, hostname=n409
>> 33709 ERROR: Failed while processing all repeats
>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>> 33711 FAILED CONTIG:Contig31
>>
>>
>> Best
>> Quanwei
>>
>>
>
>
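
The retry advice quoted above is easier to act on with a list of the contigs that actually failed. A minimal sketch, assuming MAKER's log was captured to a text file and that failures appear as `FAILED CONTIG:<name>` lines, as in the excerpts in this thread; the function name and the sample log are hypothetical.

```python
import re
from collections import Counter

# Matches lines such as "FAILED CONTIG:Contig31", optionally preceded by
# a line number or other prefix, as seen in the log excerpts in this thread.
FAILED_RE = re.compile(r"FAILED CONTIG:\s*(\S+)")


def failed_contigs(log_text: str) -> Counter:
    """Count how many times each contig is reported as failed."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample_log = """\
ERROR: Failed while doing repeat masking
ERROR: Chunk failed at level:0, tier_type:1
FAILED CONTIG:Contig10

ERROR: Chunk failed at level:2, tier_type:0
FAILED CONTIG:Contig10
33711 FAILED CONTIG:Contig31
"""

for contig, n in sorted(failed_contigs(sample_log).items()):
    print(f"{contig}\tfailed {n} time(s)")
```

Contigs that keep showing up across retries are the candidates for a rerun with clean_try=1, as suggested in the quoted reply.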

From carsonhh at gmail.com  Mon Sep 11 12:18:14 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 11:18:14 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
Message-ID: <3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>

If you are just using a single machine (and not cross-machine MPI), use MPICH3 -> https://www.mpich.org

It's easy to install yourself, and tends to be very robust to failure.

--Carson
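
For a single node, the launch usually amounts to wrapping `maker` in `mpiexec`. The sketch below only builds and prints that command line. It assumes MAKER was compiled with MPI support and that `mpiexec` (for example from MPICH) and `maker` are on the PATH; the helper name is hypothetical, and the process count should match what the scheduler actually granted.

```python
import os
import shlex


def build_maker_mpi_command(n_procs=None):
    """Return the argv list for running MAKER under mpiexec on one node.

    n_procs defaults to the number of CPUs visible to this process;
    on a shared cluster, pass the slot count granted by the scheduler.
    """
    if n_procs is None:
        n_procs = os.cpu_count() or 1
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    # 'mpiexec -n <N> maker' starts N cooperating MAKER processes that
    # read the maker_*.ctl control files in the current directory.
    return ["mpiexec", "-n", str(n_procs), "maker"]


cmd = build_maker_mpi_command(8)
print(shlex.join(cmd))
# To launch for real: subprocess.run(cmd, check=True)
```

When MPI handles the parallelism, the `cpus` option in maker_opts is normally left at 1 (as far as I recall, the comment in the default control file says so), so that BLAST threads do not oversubscribe the node.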




From qwzhang0601 at gmail.com  Mon Sep 11 12:27:22 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 11 Sep 2017 13:27:22 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
Message-ID: <CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>

Dear Carson:

Could you please explain what you mean by "a single machine"? I am
running maker2 on our high-performance cluster. The cluster has more than
1,620-core compute nodes with 128 GB RAM each, and Univa Grid Engine is used
as the scheduler. Can I use MPICH3 there?

Thanks

Best
Quanwei

2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:

> If you are just using a single machine (and not cross-machine MPI), use
> MPICH3 -> https://www.mpich.org
>
> It's easy to install yourself, and tends to be very robust to failure.
>
> --Carson
>
>
>
> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>
> Dear Carson:
>
> I ran into some problems using MPI. I will give it another try.
> Thank you!
>
> Best
> Quanwei
>
> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>
>> It could be either. Please use MPI instead of starting multiple
>> instances. It will greatly reduce both IO and RAM usage.
>>
>> --Carson
>>
>>
>>
>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>> wrote:
>>
>> Dear Carson:
>>
>> I am only running 5 MAKER instances in each directory (with cpus=2). If
>> this is a memory or IO issue, I am not sure why the much longer scaffolds
>> were all annotated successfully while the relatively shorter ones failed.
>>
>> I have set "tries=5" (#number of times to try a contig if there is a
>> failure for some reason). I will try "clean_try=1" and test the failed
>> scaffolds individually with more memory to see whether they can be
>> annotated.
>>
>> Thank you!
>>
>> Best
>> Quanwei
>>
>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>
>>> I think the cause of the error may have been a little further upstream
>>> from what you pasted in the e-mail. One thing that may be happening is that
>>> you are taxing resources (like IO) by running MAKER multiple times or on
>>> too many CPUs. That can lead to failures because of truncated BLAST reports,
>>> etc. In that case you can just retry, and that will get around those types
>>> of IO-derived errors. MAKER can generate a lot of IO, and if you are
>>> working on network-mounted locations (i.e. the storage being used is
>>> actually across the network), they can be less robust than local
>>> storage (under heavy load, NFS can falsely report success on read/write
>>> operations that actually failed). That is the reason we built the retry
>>> capabilities into MAKER.
>>>
>>> For contigs that continuously fail, you may need to set clean_try=1.
>>> That will cause failures to start from scratch (i.e. delete all old reports
>>> on failure rather than just those suspected of being truncated).
>>>
>>> --Carson
>>>
>>>
>>>
>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>> wrote:
>>>
>>> Dear Carson:
>>>
>>> Regarding the error in my previous email, I found that the contig was
>>> annotated correctly on the second retry, so please ignore that email. But
>>> now, for a few scaffolds, I have problems processing the repeats
>>> (as shown below). I used both the Mammalia repeat library and a
>>> species-specific repeat library (generated with your pipeline at
>>> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic).
>>> There were no such problems when I used only the Mammalia repeat library.
>>> Do you have any idea what the reason could be, or any suggestions for
>>> how I can track it down? Many thanks.
>>>
>>> Here are some parameters I used
>>>
>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
>>>
>>> max_dna_len=300000
>>> split_hit=40000
>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>
>>>
>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>>> 33708 --> rank=NA, hostname=n409
>>> 33709 ERROR: Failed while processing all repeats
>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>> 33711 FAILED CONTIG:Contig31
>>>
>>>
>>> Best
>>> Quanwei
>>>
>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>>
>>>> Dear Carson:
>>>>
>>>> I got the following error again. Is this still related to memory
>>>> issues, or could something else cause it? This time the error occurred
>>>> while training the SNAP model. Previously, even with max_dna_len set to
>>>> 1Mb, I could train the model successfully. In the current training run
>>>> (where I get the error below), I decreased max_dna_len to 300kb and
>>>> requested the same amount of memory as before. The only difference is
>>>> that I am now using both the mammalian repeat library and a
>>>> species-specific repeat library, whereas previously I used only the
>>>> mammalian one. Does using both repeat libraries greatly increase the
>>>> memory requirement (even after decreasing max_dna_len from 1Mb to
>>>> 300kb)? I have also set the depth_blast parameters to 30 for the
>>>> current training.
>>>>
>>>> Thank you! Have a nice weekend!
>>>>
>>>>
>>>>
>>>> #---------------------------------------------------------------------
>>>> Now starting the contig!!
>>>> SeqID: Contig10
>>>> Length: 18773588
>>>> #---------------------------------------------------------------------
>>>>
>>>>
>>>> setting up GFF3 output and fasta chunks
>>>> doing repeat masking
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> collecting blastx repeatmasking
>>>> processing all repeats
>>>> doing repeat masking
>>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
>>>> --> rank=NA, hostname=n224
>>>> ERROR: Failed while doing repeat masking
>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>> FAILED CONTIG:Contig10
>>>>
>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>> FAILED CONTIG:Contig10
>>>>
>>>> Best
>>>> Quanwei
>>>>
>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>
>>>>>
>>>>> (2) From some of your replies in the maker google group, I
>>>>> noticed that setting depth_blast* to a fixed number can reduce memory use
>>>>> and annotation time, so I changed the following parameters. Will this
>>>>> decrease the quality of the annotation? If not, can I use an even smaller
>>>>> number (e.g., 20) to save more memory and time?
>>>>>
>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>
>>>>>
>>>>> These values really only affect the final evidence kept in the GFF3
>>>>> when you look at it in a browser. They have no effect on the annotation.
>>>>> This is because internally MAKER already collapses evidence down to the
>>>>> 10 best non-redundant features per evidence set per locus. The rest are
>>>>> put in the GFF3 just for reference. By setting the values lower, you are
>>>>> just letting MAKER know it can throw things away even sooner since you
>>>>> don't want them in the GFF3. It provides a minor improvement for memory
>>>>> use, but max_dna_len is the big one that has the greatest effect.
>>>>>
>>>>>
>>>>> (3) I also have some concerns about speed, especially for the long
>>>>> scaffolds (around 100Mb). Which part of genome annotation is the most
>>>>> time-consuming (repeat masking, blast, or polishing)?
>>>>> In particular, I wonder whether the blastx of protein evidence takes the
>>>>> majority of the time. I have prepared 99k mammalian Swiss-Prot protein
>>>>> sequences and 340k rodent TrEMBL protein sequences as protein evidence. I
>>>>> am considering whether I could save much time by using only the 99k
>>>>> mammalian Swiss-Prot sequences as evidence.
>>>>>
>>>>>
>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at least
>>>>> 6 times slower than BLASTN
>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at
>>>>> least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>>
>>>>> Also, doubling the dataset size doubles the runtime. Larger window sizes
>>>>> via max_dna_len will also increase runtimes.
>>>>>
>>>>>
>>>>> (4) For some reason, I cannot run maker through MPI on our cluster,
>>>>> so I can only start multiple maker instances. Is it possible to let
>>>>> multiple maker instances annotate the same long scaffold (i.e., start
>>>>> several instances on a single sequence without splitting it into
>>>>> shorter ones)?
>>>>>
>>>>>
>>>>> Without MPI you won't be able to split up large contigs. At the very
>>>>> least you can try to run on a single node and set MPI to use all CPUs on
>>>>> that node. It's less difficult to set up than cross-node jobs via
>>>>> MPI.
>>>>>
>>>>>
>>>>> (5) Still on the speed issue: I read some of your comments about the
>>>>> "cpus" parameter in the maker_opts file (
>>>>> http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html),
>>>>> and I understand that it sets the number of
>>>>> cpus for a single chunk. So if I set "cpus=2" in the maker_opts file, then
>>>>> I can use the following command to submit the job, right?
>>>>>
>>>>>
>>>>> The cpus parameter only affects how many CPUs are given to the blast
>>>>> command line, so only the BLAST step will speed up. I recommend using
>>>>> MPI to get all steps to speed up. Even if you are only running on a single
>>>>> node, you can give all CPUs to the mpiexec command.
>>>>>
>>>>>
>>>>> --Carson
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>

From carsonhh at gmail.com  Mon Sep 11 12:46:39 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 11:46:39 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
Message-ID: <A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>

Each node is a single machine. Because you currently run without MPI, each MAKER job you submit runs on a single machine. So you are either running multiple times on the same node, or you submitted 5 separate batch jobs in which case you may have a single maker process on each of 5 nodes.

MPI can parallelize on the same node or across nodes. If you request 10 nodes, then it can communicate across nodes to run the job on all hardware. Or you can run MPI on a single node and ask for all CPUs on that node. In that case it will split up work within a single node and use all resources just on that node. So if you can't get MPI to work across nodes, you can just submit a job that goes to a single node and ask for all CPUs on that node (multinode jobs may be hard to configure, but single node jobs are very easy). Just set the -n parameter of mpiexec to the CPU count of that node, and it will parallelize within the node.

Example command for a 20 CPU node ->  mpiexec -n 20 maker

--Carson
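
A minimal sketch of the single-node launch described above, written as a small Python wrapper. The thread itself only gives the bare `mpiexec -n 20 maker` command; everything else here is an assumption. In particular it assumes the job runs inside a Grid Engine parallel environment, where the scheduler exports the granted slot count as `NSLOTS`, and the function names are hypothetical.

```python
import os
import subprocess


def mpi_maker_command(n_procs, maker_args=()):
    """Build the argument list for a single-node MPI run of MAKER."""
    return ["mpiexec", "-n", str(n_procs), "maker", *maker_args]


def slots_on_this_node(environ=None):
    """Pick the process count: NSLOTS if the scheduler set it, else the local CPU count."""
    env = os.environ if environ is None else environ
    value = env.get("NSLOTS", "")
    if value.isdigit() and int(value) > 0:
        return int(value)
    # Fallback (assumption): use every CPU the OS reports on this machine.
    return os.cpu_count() or 1


def launch(environ=None, dry_run=True):
    """Print the command; run it only when dry_run is False."""
    cmd = mpi_maker_command(slots_on_this_node(environ))
    print("command:", " ".join(cmd))
    if not dry_run:
        # check=True raises CalledProcessError if mpiexec exits non-zero.
        subprocess.run(cmd, check=True)
    return cmd
```

Calling `launch(dry_run=False)` from the directory holding the MAKER control files would start the run; the default dry run only prints the command so it can be checked first.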


> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson: 
> 
> Would you please explain what you mean by "a single machine"? I am running maker2 on our high performance cluster. The cluster has more than 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine is used as the scheduler. Can I use MPICH3?
> 
> Thanks
> 
> Best
> Quanwei
> 
> 2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
> If you are just using a single machine (and not cross machine MPI), use MPICH3 -> https://www.mpich.org
> 
> It's easy to install yourself, and tends to be very robust to failure.
> 
> --Carson
> 
> 
> 
>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>> 
>> Dear Carson:
>> 
>> I ran into some problems using MPI. I will give it another try.
>> Thank you!
>> 
>> Best
>> Quanwei
>> 
>> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>> It could be either. Please use MPI instead of starting multiple instances. It will greatly reduce both IO and RAM usage.
>> 
>> --Carson
>> 
>> 
>> 
>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>> 
>>> Dear Carson:
>>> 
>>> I only run 5 Maker instances in each directory (with cpus=2). If this is a memory or IO issue, I am not sure why the much longer scaffolds were all annotated successfully while the relatively shorter ones failed.
>>> 
>>> I have set "tries=5" (#number of times to try a contig if there is a failure for some reason). I will try "clean_try=1" and test the failed scaffolds individually with more memory to see whether they can be annotated.
>>> 
>>> Thank you!
>>> 
>>> Best
>>> Quanwei
>>> 
>>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>> I think the cause of the error may have been a little further upstream from what you pasted in the e-mail. One thing that may be happening is that you are taxing resources (like IO) if running MAKER multiple times or on too many CPUs. That can lead to failures because of truncated BLAST reports etc. In which case you can just retry and that will get around those types of IO-derived errors. MAKER can generate a lot of IO, and if you are working on network-mounted locations (i.e. the storage being used is actually across the network), then they can be less robust than local storage (when under heavy load NFS can falsely report success on read/write operations that actually failed). It's the reason we built in the retry capabilities of MAKER.
>>> 
>>> For contigs that continuously fail, you may need to set clean_try=1. That will cause failures to start from scratch (i.e. delete all old reports on failure rather than just those suspected of being truncated).
>>> 
>>> --Carson
>>> 
>>> 
>>> 

From qwzhang0601 at gmail.com  Mon Sep 11 13:33:51 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 11 Sep 2017 14:33:51 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
	<A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
Message-ID: <CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>

Dear Carson:

I see. Thank you. I will try it.

Best
Quanwei


From qwzhang0601 at gmail.com  Wed Sep 13 09:51:32 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 13 Sep 2017 10:51:32 -0400
Subject: [maker-devel] Repeats annotation
Message-ID: <CAOW6FS+YBobDCpEAZ=8YYoV+LKrTV7qJhmKYktm15pAh84i3Kw@mail.gmail.com>

Dear Carson:

We have generated a species-specific repeat library following your pipeline (
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic)
and annotated the genome with maker2 using both the species-specific repeat
library and the mammalian repeat library.

Now we want to compare the repeat content among different species. So I want
to generate species-specific repeat libraries for the other species and again
use both their species-specific repeat library and the mammalian repeat
library. But I found that I can only provide either the species-specific
repeat library or the mammalian repeat library to RepeatMasker (not both).
I wonder whether I can run maker2 on those genomes for repeat masking only.

BTW, running RepeatMasker gives a summary report (as below). I wonder whether
maker2 provides any script to analyze repeat elements (or whether there are
other tools to process the output of maker2).

Many thanks


file name: test_scaffold31.fasta
sequences:             1
total length:     863590 bp  (858757 bp excl N/X-runs)
GC level:         37.02 %
bases masked:     301634 bp ( 34.93 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:               134        14362 bp    1.66 %
      Alu/B1          28         2183 bp    0.25 %
      MIRs            21         2860 bp    0.33 %

LINEs:               188       129104 bp   14.95 %
      LINE1          168       124633 bp   14.43 %
      LINE2           16         4266 bp    0.49 %
      L3/CR1           4          205 bp    0.02 %
      RTE              0            0 bp    0.00 %

LTR elements:        127       101129 bp   11.71 %
      ERVL            10         3057 bp    0.35 %
      ERVL-MaLRs      22         6902 bp    0.80 %
      ERV_classI      66        80258 bp    9.29 %
      ERV_classII     29        10912 bp    1.26 %

DNA elements:         27         4402 bp    0.51 %
      hAT-Charlie     13         1836 bp    0.21 %
      TcMar-Tigger     8         1651 bp    0.19 %

Unclassified:          4         1590 bp    0.18 %

Total interspersed repeats:    250587 bp   29.02 %


Small RNA:             9          616 bp    0.07 %

Satellites:           66        40820 bp    4.73 %
Simple repeats:      159         7235 bp    0.84 %
Low complexity:       50         2766 bp    0.32 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element


The query species was assumed to be mammalia
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127

run with rmblastn version 2.2.27+
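
For comparing summaries like the one above across species, a small parser
may be enough. The following is only a minimal sketch in Python; it is not
a MAKER or RepeatMasker script. The function name parse_repeatmasker_summary
and the regular expressions are assumptions based solely on the table layout
shown in this message, so other RepeatMasker versions may need adjustments.

```python
import re

# Matches rows such as "SINEs:   134   14362 bp   1.66 %" or
# "      Alu/B1   28   2183 bp   0.25 %" (count, length, percentage).
# This pattern is an assumption derived from the table layout above.
ROW = re.compile(
    r"^\s*([A-Za-z][\w/ \-]*?):?\s+(\d+)\s+(\d+) bp\s+([\d.]+) %\s*$"
)


def parse_repeatmasker_summary(text):
    """Return total length, masked bases and per-class rows of a summary."""
    summary = {"classes": {}}

    m = re.search(r"total length:\s+(\d+) bp", text)
    if m:
        summary["total_length"] = int(m.group(1))

    m = re.search(r"bases masked:\s+(\d+) bp \(\s*([\d.]+) %\)", text)
    if m:
        summary["bases_masked"] = int(m.group(1))
        summary["percent_masked"] = float(m.group(2))

    # Rows with only one number (e.g. "Total interspersed repeats")
    # do not match ROW and are skipped on purpose.
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            name, count, length, pct = m.groups()
            summary["classes"][name.strip()] = (
                int(count), int(length), float(pct)
            )
    return summary
```

Calling the function on the text of two summary files and comparing the
"percent_masked" values and the "classes" dictionaries gives a simple
per-class comparison between libraries or species.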

From qwzhang0601 at gmail.com  Wed Sep 13 09:32:34 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 13 Sep 2017 10:32:34 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
	<A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
	<CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>
Message-ID: <CAOW6FSJcdMTuLbN8-O-Vb7OxEXvU=F3Nwz7Rhx6qzXJ30ubV1w@mail.gmail.com>

Dear Carson:

I did more tests on one of the contigs (length 863 kb) that failed during
repeat masking. It fails only when I add the species-specific repeat
library; it is annotated successfully when I use only the mammalian repeat
library. For the test I picked just this contig and ran maker with 64G of
memory. So I think the failure should not be a memory or IO problem, because
even contigs of length 98Mb can be annotated with 32G of memory.

I also ran RepeatMasker on this contig with the mammalian and the
species-specific repeat library separately. With the mammalian repeat
library about 35% was masked as repeats, while with the species-specific
repeat library it was 65% (as shown below). I wonder whether the high
level of repeats can lead to the failure of this contig. Do you have any
ideas about this? Thanks


file name: test_scaffold31.fasta
sequences:             1
total length:     863590 bp  (858757 bp excl N/X-runs)
GC level:         37.02 %
bases masked:     562909 bp ( 65.18 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:              113        16134 bp    1.87 %
      ALUs           71        12479 bp    1.45 %
      MIRs            1          133 bp    0.02 %

LINEs:              251       380142 bp   44.02 %
      LINE1         211       210623 bp   24.39 %
      LINE2           1           86 bp    0.01 %
      L3/CR1          0            0 bp    0.00 %

LTR elements:       246       101221 bp   11.72 %
      ERVL            5         1037 bp    0.12 %
      ERVL-MaLRs     18         2744 bp    0.32 %
      ERV_classI    201        90942 bp   10.53 %
      ERV_classII    18         5964 bp    0.69 %

DNA elements:        39        14177 bp    1.64 %
     hAT-Charlie      7         3864 bp    0.45 %
     TcMar-Tigger     7         1706 bp    0.20 %

Unclassified:       196        45831 bp    5.31 %

Total interspersed repeats:   557505 bp   64.56 %


Small RNA:            3          823 bp    0.10 %

Satellites:           2          237 bp    0.03 %
Simple repeats:      94         4472 bp    0.52 %
Low complexity:      18          766 bp    0.09 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element


The query species was assumed to be homo
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127

run with rmblastn version 2.2.27+
The query was compared to classified sequences in
".../consensi.fa.classifiednoProtFinal"


Best
Quanwei

2017-09-11 14:33 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:

> Dear Carson:
>
> I see. Thank you. I will try it.
>
> Best
> Quanwei
>
> 2017-09-11 13:46 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>
>> Each node is a single machine. Because you currently run without MPI,
>> each MAKER job you submit runs on a single machine. So you are either
>> running multiple times on the same node, or you submitted 5 separate batch
>> jobs in which case you may have a single maker process on each of 5 nodes.
>>
>> MPI can parallelize on the same node or across nodes. If you request 10
>> nodes, then it can communicate across nodes to run the job on all hardware.
>> Or you can run MPI on a single node and ask for all CPUs on that node. In
>> that case it will split up work within a single node and use all resources
>> just on that node. So if you can't get MPI to work across nodes, you can
>> just submit a job that goes to a single node and ask for all CPUs on that
>> node (multinode jobs may be hard to configure, but single node jobs are
>> very easy). Just set the -n parameter of mpiexec to the CPU count of that
>> node, and it will parallelize within the node.
>>
>> Example command for a 20 CPU node -->  mpiexec -n 20 maker
>>
>> --Carson
>>
>>
>>
>>
>>
>> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>> wrote:
>>
>> Dear Carson:
>>
>> Would you please explain what do you mean by "a single machine"? I am
>> running maker2 on our high performance cluster. The cluster has more than
>> 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine was used
>> as the scheduler. Can I use MPICH3?
>>
>> Thanks
>>
>> Best
>> Quanwei
>>
>> 2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>
>>> If you are just using a single machine (and not cross machine MPI), use
>>> MPICH3 --> https://www.mpich.org
>>>
>>> It's easy to install yourself, and tends to be very robust to failure.
>>>
>>> --Carson
>>>
>>>
>>>
>>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>> wrote:
>>>
>>> Dear Carson:
>>>
>>> I met some problems to use MPI. I will give it another try.
>>> Thank you!
>>>
>>> Best
>>> Quanwei
>>>
>>> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>
>>>> It could be either. Please use MPI instead of starting multiple
>>>> instances. It will greatly reduce both IO and RAM usage.
>>>>
>>>> --Carson
>>>>
>>>>
>>>>
>>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>> wrote:
>>>>
>>>> Dear Carson:
>>>>
>>>> I only run 5 Maker instances in each directory (and set cpus=2). If it
>>>> is related to memory issue or an IO issue, I am not sure why the much
>>>> longer scaffolds (than the failed ones) were all annotated successfully,
>>>> but the relatively shorter ones failed.
>>>>
>>>> I have set "tries=5" (#number of times to try a contig if there is a
>>>> failure for some reason). I will try "clean_try=1" and test on the failed
>>>> scaffolds individually with larger memory to see whether they can be
>>>> annotated.
>>>>
>>>> Thank you!
>>>>
>>>> Best
>>>> Quanwei
>>>>
>>>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>
>>>>> I think the cause of the error may have been a little further upstream
>>>>> from what you pasted in the e-mail. One thing that may be happening is that
>>>>> you are taxing resources (like IO) if running MAKER multiple times or on
>>>>> too many CPUs. That can lead to failures because of truncated BLAST reports
>>>>> etc. In which case you can just retry and that will get around those types
>>>>> of IO derived errors. MAKER can generate a lot of IO, and if you are
>>>>> working on network mounted locations (i.e. the storage being used is
>>>>> actually across the network), then they can be less robust than local
>>>>> storage (when under heavy load NFS can falsely report success on read/write
>>>>> operations that actually failed). It's the reason we built in the retry
>>>>> capabilities of MAKER.
>>>>>
>>>>> For contigs that continuously fail, you may need to set clean_try=1.
>>>>> That will cause failures to start from scratch (i.e. delete all old reports
>>>>> on failure rather than just those suspected of being truncated).
>>>>>
>>>>> --Carson
>>>>>
>>>>>
>>>>>
>>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>>> wrote:
>>>>>
>>>>> Dear Carson:
>>>>>
>>>>> About the error in my above email, I found the contig was correctly
>>>>> annotated at the second time RETRY. So please ignore my last email. But
>>>>> now, for a few number of scaffolds, I met problems to process the repeats
>>>>> (as shown below in red). I used both Mammalia repeat library and species
>>>>> specific repeat library (which is generated by your pipeline "
>>>>> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic").
>>>>> There were no such problems when I
>>>>> only used Mammalia repeat library. Do you have any ideas about this? What
>>>>> could be the reason? Or do you have any suggestions for me to find the
>>>>> reason? Many thanks
>>>>>
>>>>> Here are some parameters I used
>>>>>
>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>> model_org=Mammalia #select a model organism for RepBase masking in
>>>>> RepeatMasker
>>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism
>>>>> specific repeat library in fasta format for Repe
>>>>>
>>>>> max_dna_len=300000
>>>>> split_hit=40000
>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>
>>>>>
>>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm
>>>>> line 188.
>>>>> 33708 --> rank=NA, hostname=n409
>>>>> 33709 ERROR: Failed while processing all repeats
>>>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>>>> 33711 FAILED CONTIG:Contig31
>>>>>
>>>>>
>>>>> Best
>>>>> Quanwei
>>>>>
>>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>>>>
>>>>>> Dear Carson:
>>>>>>
>>>>>> I got the following error again. Is this still related to memory
>>>>>> issues? I wonder whether there can be other reasons lead to this error?
>>>>>> This time, I got this error during training of the SNAP model. Before, even
>>>>>> I set  max_dna_len=1Mb, I can train the model successfully.  And in the
>>>>>> current training (where I get the following error),  I have decreased the
>>>>>> max_dna_len to 300kb. I required the same amount memory as before. The only
>>>>>> difference is that I am using both mammalian repeat library and species
>>>>>> specific repeat library, while previously I only use the mammalian repeat
>>>>>> library. Will it greatly increases the requirement of memory to use both
>>>>>> repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? I
>>>>>> have also set the depth_blast as 30 in current training.
>>>>>>
>>>>>> Thank you! Have a nice weekend!
>>>>>>
>>>>>>
>>>>>>
>>>>>> #---------------------------------------------------------------------
>>>>>> Now starting the contig!!
>>>>>> SeqID: Contig10
>>>>>> Length: 18773588
>>>>>> #---------------------------------------------------------------------
>>>>>>
>>>>>>
>>>>>> setting up GFF3 output and fasta chunks
>>>>>> doing repeat masking
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> collecting blastx repeatmasking
>>>>>> processing all repeats
>>>>>> doing repeat masking
>>>>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm
>>>>>> line 1050.
>>>>>> --> rank=NA, hostname=n224
>>>>>> ERROR: Failed while doing repeat masking
>>>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>>>> FAILED CONTIG:Contig10
>>>>>>
>>>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>>>> FAILED CONTIG:Contig10
>>>>>>
>>>>>> Best
>>>>>> Quanwei
>>>>>>
>>>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>>
>>>>>>>
>>>>>>> (2) By reading some of your replies in the maker google group, and I
>>>>>>> noticed that it can reduce memory and save time for annotation if I set
>>>>>>> depth_blast to a certain number. So I changed the following parameters. But
>>>>>>> I wonder, whether it will decrease the quality of annotation? If it won't
>>>>>>> affect the quality, can I even use a smaller number (e.g., 20) to save more
>>>>>>> memory and time?
>>>>>>>
>>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>>>
>>>>>>>
>>>>>>> These values really only affect the final evidence kept in the GFF3
>>>>>>> when you look at it in a browser. They have no effect on the annotation.
>>>>>>> This is because internally MAKER already collapses evidence down to the
>>>>>>> 10 best non-redundant features per evidence set per locus. The rest are
>>>>>>> put in the GFF3 just for reference. By setting the values lower, you are
>>>>>>> just letting MAKER know it can throw things away even sooner since you
>>>>>>> don't want them in the GFF3. It provides a minor improvement for memory
>>>>>>> use, but max_dna_len is the big one that has the greatest effect.
>>>>>>>
>>>>>>>
>>>>>>> (3) I also have some concerns about the speed, especially for the
>>>>>>> long scaffolds (around 100Mb). I wonder which part is the most time
>>>>>>> consuming for genome annotation (repeat masking, blast, or polishing?).
>>>>>>> Particularly, I wonder whether the blastx of protein evidence will take
>>>>>>> majority of time. Now, I have prepared 99k mammalian Swiss protein
>>>>>>> sequences and 340k rodent TrEMBL protein sequences as protein evidences. I
>>>>>>> am considering whether I can save much time if I only use the 99k mammalian
>>>>>>> Swiss protein sequences as evidences.
>>>>>>>
>>>>>>>
>>>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at
>>>>>>> least 6 times slower than BLASTN
>>>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at
>>>>>>> least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>>>>
>>>>>>> Also, doubling the dataset size doubles the runtime. Larger window
>>>>>>> sizes via max_dna_len will also increase runtimes.
>>>>>>>
>>>>>>>
>>>>>>> (4) For some reason, I cannot run maker through MPI on our cluster,
>>>>>>> so I can only start multiple maker instances. I wonder whether multiple
>>>>>>> maker instances can annotate the same long scaffold (i.e., for a single
>>>>>>> sequence I start multiple maker instances, without splitting the long
>>>>>>> sequence into shorter ones).
>>>>>>>
>>>>>>>
>>>>>>> Without MPI you won't be able to split up large contigs. At the very
>>>>>>> least you can try and run on a single node and set MPI to use all CPUs on
>>>>>>> that node. It's less difficult to set up compared to cross node jobs via
>>>>>>> MPI.
>>>>>>>
>>>>>>>
>>>>>>> (5) Still about the speed issue. I read some of your comments about
>>>>>>> "cpus" parameters in the maker_opts file (
>>>>>>> http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html).
>>>>>>> And I know it indicates the number of cpus for a single chunk. So if I
>>>>>>> set "cpus=2" in the maker_opts file, then I can use the following
>>>>>>> command to submit the job, right?
>>>>>>>
>>>>>>>
>>>>>>> The cpus parameter only affects how many CPUs are given to the BLAST
>>>>>>> command line, so only the BLAST step will speed up. I recommend using
>>>>>>> MPI to get all steps to speed up. Even if you are only running on a single
>>>>>>> node, you can give all CPUs to the mpiexec command.
>>>>>>>
>>>>>>>
>>>>>>> --Carson
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>

From mathog at caltech.edu  Wed Sep 13 13:01:11 2017
From: mathog at caltech.edu (mathog)
Date: Wed, 13 Sep 2017 11:01:11 -0700
Subject: [maker-devel] OpenMPI issues,
 no response in two attempts to subscribe to list
Message-ID: <a1a5313e90b4f2043f9910f2624986eb@saf.bio.caltech.edu>

Greetings,

I'm trying to run maker 2.31.9 with OpenMPI on a Centos 6.7 system.  It 
just won't start.  OpenMPI works fine with a small test program, it just 
doesn't work with maker.  It fails in exactly the same way on a second 
Centos system with minor software differences (Centos 6.9 and perl 5.20 
compiled without thread support, the perl on the first machine had 
thread support.) The gory details were posted already in a Centos forum 
so rather than repeat it all here, this is a link to that thread:

    https://www.centos.org/forums/viewtopic.php?f=14&t=64099

maker was unpacked from the maker-2.31.9.tgz a second time (after moving 
the original) after adding "module add openmpi-x86_64" to my 
.bash_profile and logging in cleanly.  It was rebuilt.  The build 
messages were identical to the previous ones and when a run was 
attempted it also failed in exactly the same way.

I also tried to subscribe to the list here

   
https://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

once yesterday, and once today, but no email ever came back.  Hopefully 
this message gets through!

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From carsonhh at gmail.com  Wed Sep 13 13:23:11 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Sep 2017 12:23:11 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSJcdMTuLbN8-O-Vb7OxEXvU=F3Nwz7Rhx6qzXJ30ubV1w@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
	<A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
	<CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>
	<CAOW6FSJcdMTuLbN8-O-Vb7OxEXvU=F3Nwz7Rhx6qzXJ30ubV1w@mail.gmail.com>
Message-ID: <6B4E0F47-7186-4B12-97A1-EB63D2EB7079@gmail.com>

These are the 3 errors you have shown in your e-mails -->
open3: fork failed: Cannot allocate memory at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.

The first two are memory related. The second occurs because MAKER cannot kill a lock maintainer thread that it was never able to start for lack of memory.

The third one is IO related. It is a truncated file that succeeded on the second try according to the e-mail you sent.
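
When going through a long MAKER log, the three messages can be counted by likely cause. This is only a rough sketch in Python; the SIGNATURES table and the triage function are hypothetical helpers, and the cause assigned to each message is just the diagnosis given in this e-mail for this thread, not a general rule.

```python
from collections import Counter

# Substring of the error line -> likely cause. The mapping reflects only
# the diagnosis in this e-mail; it is an assumption for any other setup.
SIGNATURES = {
    "open3: fork failed: Cannot allocate memory": "memory",
    "Can't kill a non-numeric process ID": "memory",
    "Bio/Search/Hit/PhatHit/Base.pm line 188": "io (truncated report)",
}


def triage(log_text):
    """Count log lines per likely cause, using the first matching signature."""
    counts = Counter()
    for line in log_text.splitlines():
        for needle, cause in SIGNATURES.items():
            if needle in line:
                counts[cause] += 1
                break
    return counts
```

Many "memory" hits point to the RAM request or the temporary directory, while isolated "io" hits that disappear on retry are usually handled by tries= and clean_try=1.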


IO errors are quite common with NFS (network mounted file systems). It's one of the most frequent issues submitted to the devel list. MAKER can hit IO limits long before it hits CPU limits. One of the most frequent causes of these issues is that the user set TMP= in the control files to a manual location that is not suitable for high IO (note TMP= defaults to /tmp). The location should always be a true locally mounted disk. Sometimes the temporary location is virtual (not really local disk, but a network mounted disk or an in-memory location). With the former you will get frequent IO failures, and with the latter you will also get out of memory issues.

Note that when you supply more data files you will also use more memory (to hold analysis results). According to your e-mail the last error you got was 'Can't kill a non-numeric process ID'. Correct? So getting the error with two input files but not when you supply a single input file further suggests you are running low on RAM.

Some things to check:
1. Make sure TMP= is not being set to a network mounted location.
2. Make sure your temporary directory is not a virtual in-memory directory on the node being used.
3. If nodes are shared, you may run out of memory because of other users or because you failed to request enough RAM during job submission.

Finally, try running interactively so you can see what the memory and directory locations look like on the node you get assigned for the job (check space and mount points. Is /tmp or wherever you set TMP= in fact a local disk?). Also run with MPI rather than starting multiple MAKER instances. It uses resources better.
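
One way to check the mount point question on a Linux node is to look up the temporary directory in /proc/mounts. The following is a minimal sketch in Python and not part of MAKER. The helper names and the lists of network and in-memory filesystem types are assumptions and are certainly incomplete for some clusters.

```python
import posixpath

# Filesystem types treated as network or in-memory storage.
# Both sets are illustrative assumptions, not an exhaustive list.
NETWORK_FS = {"nfs", "nfs4", "cifs", "lustre", "gpfs", "beegfs", "glusterfs"}
MEMORY_FS = {"tmpfs", "ramfs"}


def parse_mounts(mounts_text):
    """Return (mount_point, fstype) pairs from /proc/mounts-style text."""
    entries = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            entries.append((fields[1], fields[2]))
    return entries


def fstype_for(path, entries):
    """Filesystem type of the longest mount point that contains path."""
    path = posixpath.normpath(path) + "/"
    best_mount, best_type = "", None
    for mount_point, fstype in entries:
        prefix = mount_point.rstrip("/") + "/"
        if path.startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fstype
    return best_type


def classify(fstype):
    """Map a filesystem type to 'network', 'memory' or 'local'."""
    if fstype in NETWORK_FS:
        return "network"
    if fstype in MEMORY_FS:
        return "memory"
    return "local"
```

On the compute node itself this could be used as classify(fstype_for(os.path.realpath("/tmp"), parse_mounts(open("/proc/mounts").read()))), replacing /tmp with whatever TMP= is set to. Anything other than "local" would be worth a closer look.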

Thanks,
Carson


> On Sep 13, 2017, at 8:32 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> I did more tests on one of the contigs (length 863 kb) that failed during repeat masking. It fails only when I add the species-specific repeat library; it is annotated successfully when I use only the mammalian repeat library. For the test I picked just this contig and ran maker with 64G of memory. So I think the failure should not be a memory or IO problem, because even contigs of length 98Mb can be annotated with 32G of memory.
> 
> I also ran RepeatMasker on this contig with the mammalian and the species-specific repeat library separately. With the mammalian repeat library about 35% was masked as repeats, while with the species-specific repeat library it was 65% (as shown below). I wonder whether the high level of repeats can lead to the failure of this contig. Do you have any ideas about this? Thanks
> 
> 
> 
> file name: test_scaffold31.fasta    
> sequences:             1
> total length:     863590 bp  (858757 bp excl N/X-runs)
> GC level:         37.02 %
> bases masked:     562909 bp ( 65.18 %)
> ==================================================
>                number of      length   percentage
>                elements*    occupied  of sequence
> --------------------------------------------------
> SINEs:              113        16134 bp    1.87 %
>       ALUs           71        12479 bp    1.45 %
>       MIRs            1          133 bp    0.02 %
> 
> LINEs:              251       380142 bp   44.02 %
>       LINE1         211       210623 bp   24.39 %
>       LINE2           1           86 bp    0.01 %
>       L3/CR1          0            0 bp    0.00 %
> 
> LTR elements:       246       101221 bp   11.72 %
>       ERVL            5         1037 bp    0.12 %
>       ERVL-MaLRs     18         2744 bp    0.32 %
>       ERV_classI    201        90942 bp   10.53 %
>       ERV_classII    18         5964 bp    0.69 %
> 
> DNA elements:        39        14177 bp    1.64 %
>      hAT-Charlie      7         3864 bp    0.45 %
>      TcMar-Tigger     7         1706 bp    0.20 %
> 
> Unclassified:       196        45831 bp    5.31 %
> 
> Total interspersed repeats:   557505 bp   64.56 %
> 
> 
> Small RNA:            3          823 bp    0.10 %
> 
> Satellites:           2          237 bp    0.03 %
> Simple repeats:      94         4472 bp    0.52 %
> Low complexity:      18          766 bp    0.09 %
> ==================================================
> 
> * most repeats fragmented by insertions or deletions
>   have been counted as one element
>                                                       
> 
> The query species was assumed to be homo          
> RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
>         
> run with rmblastn version 2.2.27+
> The query was compared to classified sequences in ".../consensi.fa.classifiednoProtFinal"  
> 
> 
> Best
> Quanwei
> 
> 2017-09-11 14:33 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
> Dear Carson:
> 
> I see. Thank you. I will try it.
> 
> Best
> Quanwei
> 
> 2017-09-11 13:46 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
> Each node is a single machine. Because you currently run without MPI, each MAKER job you submit runs on a single machine. So you are either running multiple times on the same node, or you submitted 5 separate batch jobs in which case you may have a single maker process on each of 5 nodes.
> 
> MPI can parallelize on the same node or across nodes. If you request 10 nodes, then it can communicate across nodes to run the job on all hardware. Or you can run MPI on a single node and ask for all CPUs on that node. In that case it will split up work within a single node and use all resources just on that node. So if you can't get MPI to work across nodes, you can just submit a job that goes to a single node and ask for all CPUs on that node (multinode jobs may be hard to configure, but single node jobs are very easy). Just set the -n parameter of mpiexec to the CPU count of that node, and it will parallelize within the node.
> 
> Example command for a 20 CPU node -->  mpiexec -n 20 maker
> 
> --Carson
> 
> 
> 
> 
> 
>> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>> 
>> Dear Carson: 
>> 
>> Would you please explain what do you mean by "a single machine"? I am running maker2 on our high performance cluster. The cluster has more than 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine was used as the scheduler. Can I use MPICH3?
>> 
>> Thanks
>> 
>> Best
>> Quanwei
>> 
>> 2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>> If you are just using a single machine (and not cross machine MPI), use MPICH3 --> https://www.mpich.org
>> 
>> It's easy to install yourself, and tends to be very robust to failure.
>> 
>> --Carson
>> 
>> 
>> 
>>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>> 
>>> Dear Carson:
>>> 
>>> I met some problems to use MPI. I will give it another try.
>>> Thank you!
>>> 
>>> Best
>>> Quanwei
>>> 
>>> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>> It could be either. Please use MPI instead of starting multiple instances. It will greatly reduce both IO and RAM usage.
>>> 
>>> --Carson
>>> 
>>> 
>>> 
>>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>>> 
>>>> Dear Carson:
>>>> 
>>>> I only run 5 Maker instances in each directory (and set cpus=2). If it is related to memory issue or an IO issue, I am not sure why the much longer scaffolds (than the failed ones) were all annotated successfully, but the relatively shorter ones failed.  
>>>> 
>>>> I have set "tries=5" (#number of times to try a contig if there is a failure for some reason). I will try "clean_try=1" and test on the failed scaffolds individually with larger memory to see whether they can be annotated. 
>>>> 
>>>> Thank you!
>>>> 
>>>> Best
>>>> Quanwei
>>>> 
>>>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>> I think the cause of the error may have been a little further upstream from what you pasted in the e-mail. One thing that may be happening is that you are taxing resources (like IO) if running MAKER multiple times or on too many CPUs. That can lead to failures because of truncated BLAST reports etc. In which case you can just retry and that will get around those types of IO derived errors. MAKER can generate a lot of IO, and if you are working on network mounted locations (i.e. the storage being used is actually across the network), then they can be less robust than local storage (when under heavy load NFS can falsely report success on read/write operations that actually failed). It's the reason we built in the retry capabilities of MAKER.
>>>> 
>>>> For contigs that continuously fail, you may need to set clean_try=1. That will cause failures to start from scratch (i.e. delete all old reports on failure rather than just those suspected of being truncated).
>>>> 
>>>> --Carson
>>>> 
>>>> 
>>>> 
>>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com <mailto:qwzhang0601 at gmail.com>> wrote:
>>>>> 
>>>>> Dear Carson:
>>>>> 
>>>>> About the error in my previous email: I found that the contig was annotated correctly on the second retry, so please ignore that email. But now, for a small number of scaffolds, I am having problems processing the repeats (as shown below). I used both the Mammalia repeat library and a species-specific repeat library (generated by your pipeline "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic"). There were no such problems when I used only the Mammalia repeat library. Do you have any ideas about what the reason could be, or any suggestions for how I could find it? Many thanks
>>>>> 
>>>>> Here are some parameters I used
>>>>> 
>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
>>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
>>>>> 
>>>>> max_dna_len=300000
>>>>> split_hit=40000
>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>> 
>>>>> 
>>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>>>>> 33708 --> rank=NA, hostname=n409
>>>>> 33709 ERROR: Failed while processing all repeats
>>>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>>>> 33711 FAILED CONTIG:Contig31
>>>>> 
>>>>> 
>>>>> Best
>>>>> Quanwei
>>>>> 
>>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com <mailto:qwzhang0601 at gmail.com>>:
>>>>> Dear Carson:
>>>>> 
>>>>> I got the following error again. Is this still related to memory issues, or could something else lead to this error? This time I got the error while training the SNAP model. Previously, even with max_dna_len=1Mb, I could train the model successfully. In the current training (where I get the following error), I have decreased max_dna_len to 300kb and requested the same amount of memory as before. The only difference is that I am now using both the mammalian repeat library and the species-specific repeat library, whereas previously I used only the mammalian repeat library. Does using both repeat libraries greatly increase the memory requirement (even when I decrease max_dna_len from 1Mb to 300kb)? I have also set depth_blast to 30 in the current training.
>>>>> 
>>>>> Thank you! Have a nice weekend! 
>>>>> 
>>>>> 
>>>>> 
>>>>> #---------------------------------------------------------------------
>>>>> Now starting the contig!!
>>>>> SeqID: Contig10
>>>>> Length: 18773588
>>>>> #---------------------------------------------------------------------
>>>>> 
>>>>> 
>>>>> setting up GFF3 output and fasta chunks
>>>>> doing repeat masking
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> collecting blastx repeatmasking
>>>>> processing all repeats
>>>>> doing repeat masking
>>>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
>>>>> --> rank=NA, hostname=n224
>>>>> ERROR: Failed while doing repeat masking
>>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>>> FAILED CONTIG:Contig10
>>>>> 
>>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>>> FAILED CONTIG:Contig10
>>>>> 
>>>>> Best
>>>>> Quanwei
>>>>> 
>>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>>:
>>>>> 
>>>>>> (2) Reading some of your replies in the MAKER Google group, I noticed that setting depth_blast to a certain number can reduce memory use and save annotation time, so I changed the following parameters. Will this decrease the quality of the annotation? If not, can I use an even smaller number (e.g., 20) to save more memory and time?
>>>>>> 
>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>> 
>>>>> These values really only affect the final evidence kept in the GFF3 when you look at it in a browser. They have no effect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. By setting it lower, you are just letting MAKER know it can throw things away even sooner since you don't want them in the GFF3. It provides a minor improvement for memory use, but max_dna_len is the big one that has the greatest effect.
>>>>> 
>>>>> 
>>>>>> (3) I also have some concerns about speed, especially for the long scaffolds (around 100Mb). Which part of genome annotation is the most time-consuming (repeat masking, BLAST, or polishing)? In particular, I wonder whether the blastx of protein evidence takes the majority of the time. I have prepared 99k mammalian Swiss-Prot protein sequences and 340k rodent TrEMBL protein sequences as protein evidence. I am considering whether I can save much time by using only the 99k mammalian Swiss-Prot sequences as evidence.
>>>>> 
>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN
>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>> 
>>>>> Also, doubling the dataset size doubles the runtime. Larger window sizes via max_dna_len will also increase runtimes.
>>>>> 
>>>>> 
>>>>>> (4) For some reason, I cannot run MAKER through MPI on our cluster, so I can only start multiple MAKER instances. Is it possible to have multiple MAKER instances annotate the same long scaffold (i.e., start multiple instances for a single sequence, without splitting the long sequence into shorter ones)?
>>>>> 
>>>>> Without MPI you won't be able to split up large contigs. At the very least you can try to run on a single node and set MPI to use all CPUs on that node. It's less difficult to set up compared to cross node jobs via MPI.
>>>>> 
>>>>> 
>>>>>> (5) Still about the speed issue. I read some of your comments about the "cpus" parameter in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html), and I understand that it indicates the number of CPUs for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?
>>>>> 
>>>>> The cpus parameter only affects how many CPUs are given to the BLAST command line, so only the BLAST step will speed up. I recommend using MPI to get all steps to speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
>>>>> 
>>>>> 
>>>>> --Carson
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 
> 


From carsonhh at gmail.com  Wed Sep 13 13:26:08 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Sep 2017 12:26:08 -0600
Subject: [maker-devel] Repeats annotation
In-Reply-To: <CAOW6FS+YBobDCpEAZ=8YYoV+LKrTV7qJhmKYktm15pAh84i3Kw@mail.gmail.com>
References: <CAOW6FS+YBobDCpEAZ=8YYoV+LKrTV7qJhmKYktm15pAh84i3Kw@mail.gmail.com>
Message-ID: <40F80C42-836A-41FF-9C9F-1F45C5816283@gmail.com>

I don't know of any tool to analyze the repeat info. MAKER really only focuses on getting the masking done for the gene prediction, and while it does keep the repeats as features in the GFF3, it does not do any kind of analysis. You would have to do that outside of MAKER.
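
As a rough illustration of what such an outside analysis could look like, here is a minimal Python sketch that totals the masked bases per contig from the repeat features in a GFF3 file. It rests on assumptions that should be checked against your own output: that repeat evidence carries "repeatmasker" or "repeatrunner" in the source column, and that the top-level repeat features have type "match". The function names are made up for this example and are not part of MAKER.

```python
from collections import defaultdict

# Assumption: repeat evidence is labelled with these values in the GFF3
# source column (column 2). Adjust the set if your file uses other names.
REPEAT_SOURCES = {"repeatmasker", "repeatrunner"}


def merged_length(intervals):
    """Total bases covered by 1-based inclusive intervals, counting overlaps once."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end + 1:
            # Close the previous merged block (if any) and start a new one.
            if current_end is not None:
                total += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start + 1
    return total


def masked_bases_per_contig(gff3_lines, sources=REPEAT_SOURCES, feature_type="match"):
    """Return {seqid: masked bases} computed from top-level repeat features."""
    intervals = defaultdict(list)
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section; no features after this point
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a well-formed feature line
        seqid, source, ftype, start, end = cols[:5]
        if source in sources and ftype == feature_type:
            intervals[seqid].append((int(start), int(end)))
    return {seqid: merged_length(ivs) for seqid, ivs in intervals.items()}
```

To use it, open the GFF3 file and pass the file handle as gff3_lines. Dividing each total by the contig length gives a masked fraction that can be compared across species. Overlapping hits from the two repeat sources are merged before counting, so a base masked by both is counted once.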

--Carson


> On Sep 13, 2017, at 8:51 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> We have generated a species-specific repeat library following your pipeline (http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic) and annotated the genome with MAKER2 using both the species-specific repeat library and the mammalian repeat library.
> 
> Now we want to compare the repeat content among different species. So I want to generate species-specific repeat libraries for other species and use both their species-specific repeat library and the mammalian repeat library. But I found that I can only provide either the species-specific repeat library or the mammalian repeat library to RepeatMasker (not both). Can I run MAKER2 on those genomes for repeat masking only?
> 
> BTW, by running RepeatMasker we can get a summary report (as below). Is there any script in MAKER2 for analyzing repeat elements (or other tools to process the output of MAKER2)?
> 
> Many thanks
> 
> 
> file name: test_scaffold31.fasta    
> sequences:             1
> total length:     863590 bp  (858757 bp excl N/X-runs)
> GC level:         37.02 %
> bases masked:     301634 bp ( 34.93 %)
> ==================================================
>                number of      length   percentage
>                elements*    occupied  of sequence
> --------------------------------------------------
> SINEs:               134        14362 bp    1.66 %
>       Alu/B1          28         2183 bp    0.25 %
>       MIRs            21         2860 bp    0.33 %
> 
> LINEs:               188       129104 bp   14.95 %
>       LINE1          168       124633 bp   14.43 %
>       LINE2           16         4266 bp    0.49 %
>       L3/CR1           4          205 bp    0.02 %
>       RTE              0            0 bp    0.00 %
> 
> LTR elements:        127       101129 bp   11.71 %
>       ERVL            10         3057 bp    0.35 %
>       ERVL-MaLRs      22         6902 bp    0.80 %
>       ERV_classI      66        80258 bp    9.29 %
>       ERV_classII     29        10912 bp    1.26 %
> 
> DNA elements:         27         4402 bp    0.51 %
>       hAT-Charlie     13         1836 bp    0.21 %
>       TcMar-Tigger     8         1651 bp    0.19 %
> 
> Unclassified:          4         1590 bp    0.18 %
> 
> Total interspersed repeats:    250587 bp   29.02 %
> 
> 
> Small RNA:             9          616 bp    0.07 %
> 
> Satellites:           66        40820 bp    4.73 %
> Simple repeats:      159         7235 bp    0.84 %
> Low complexity:       50         2766 bp    0.32 %
> ==================================================
> 
> * most repeats fragmented by insertions or deletions
>   have been counted as one element
>                                                       
> 
> The query species was assumed to be mammalia      
> RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
>         
> run with rmblastn version 2.2.27+ 


From carsonhh at gmail.com  Wed Sep 13 13:41:24 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Sep 2017 12:41:24 -0600
Subject: [maker-devel] OpenMPI issues,
 no response in two attempts to subscribe to list
In-Reply-To: <a1a5313e90b4f2043f9910f2624986eb@saf.bio.caltech.edu>
References: <a1a5313e90b4f2043f9910f2624986eb@saf.bio.caltech.edu>
Message-ID: <BA16E294-BE01-47DC-8113-C018C38480FC@gmail.com>

Hi David,

First thing. MAKER binds shared C libraries using Perl, so you have to tell MAKER where to find the needed files before you install it. Then it compiles the bindings and saves them for MAKER to use. If you have two MPI installations, you may have MAKER set up to use one of them while you are trying to call it with the other one. That would break the compiled bindings.

Also make sure you did the following (info from the .../maker/INSTALL instructions file) ->

"make sure to set LD_PRELOAD to the location of libmpi.so before even trying to install MAKER. It must also be set before running MAKER (or any program that binds OpenMPI's shared libraries), so it's best just to add it to your ~/.bash_profile. (i.e. export LD_PRELOAD=/usr/local/openmpi/lib/libmpi.so)."

Remember to replace '/usr/local/openmpi/lib/libmpi.so' with the actual location of the file.

Second, once you can get MAKER to start under OpenMPI, you may get freezes or failures part way into a run because OpenFabrics libraries use registered memory in a weird way that can cause system calls in a program to fail with a snowballing error effect. Adding this to the mpiexec options can stop this from occurring -> '-mca btl ^openib'

That option has the side effect of disabling infiniband and using the ethernet adapter instead. However if you need to use the infiniband adapter, you can use this flag instead '--mca btl vader,tcp,self --mca btl_tcp_if_include ib0'

That command will use IP over infiniband rather than native infiniband, which has the same effect of disabling the OpenFabrics libraries.
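
To make the two alternatives concrete, here is a minimal Python sketch of a job launcher that assembles the mpiexec command line and the LD_PRELOAD setting described above. The function names and the transport labels ("no_openib", "ip_over_ib") are invented for this example. The mpiexec options are exactly the ones quoted in this message, and the libmpi.so path is a placeholder to be replaced with the real location.

```python
import os


def build_maker_mpi_command(n_procs, transport="no_openib"):
    """Return the mpiexec argument list for a MAKER run under OpenMPI.

    transport="no_openib"  -> '-mca btl ^openib' (disables infiniband)
    transport="ip_over_ib" -> IP over infiniband on interface ib0
    Both labels are hypothetical names used only in this sketch.
    """
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    cmd = ["mpiexec", "-n", str(n_procs)]
    if transport == "no_openib":
        cmd += ["-mca", "btl", "^openib"]
    elif transport == "ip_over_ib":
        cmd += ["--mca", "btl", "vader,tcp,self",
                "--mca", "btl_tcp_if_include", "ib0"]
    else:
        raise ValueError("unknown transport: %r" % transport)
    cmd.append("maker")
    return cmd


def maker_mpi_environment(libmpi_path, base_env=None):
    """Return a copy of the environment with LD_PRELOAD pointing at libmpi.so."""
    env = dict(os.environ if base_env is None else base_env)
    env["LD_PRELOAD"] = libmpi_path
    return env
```

The command and environment can then be handed to subprocess.run(cmd, env=env, check=True) from the directory holding the MAKER control files. Passing an argument list rather than a shell string keeps the '^' in '^openib' from being interpreted by the shell. Note that LD_PRELOAD also has to be set when MAKER is installed, which this sketch does not cover.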

Thanks,
Carson


> On Sep 13, 2017, at 12:01 PM, mathog <mathog at caltech.edu> wrote:
> 
> Greetings,
> 
> I'm trying to run maker 2.31.9 with OpenMPI on a Centos 6.7 system.  It just won't start.  OpenMPI works fine with a small test program, it just doesn't work with maker.  It fails in exactly the same way on a second Centos system with minor software differences (Centos 6.9 and perl 5.20 compiled without thread support, the perl on the first machine had thread support.) The gory details were posted already in a Centos forum so rather than repeat it all here, this is a link to that thread:
> 
>   https://www.centos.org/forums/viewtopic.php?f=14&t=64099
> 
> maker was unpacked from the maker-2.31.9.tgz a second time (after moving the original) after setting up the "module add openmpi-x86_64" to my .bash_profile
> and logging in cleanly.  It was rebuilt.  The build messages were identical to the previous ones and when a run was attempted it also failed in exactly the same way.
> 
> I also tried to subscribe to the list here
> 
>  https://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
> 
> once yesterday, and once today, but no email ever came back.  Hopefully this message gets through!
> 
> Regards,
> 
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> 
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org


From qwzhang0601 at gmail.com  Wed Sep 13 14:42:01 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 13 Sep 2017 15:42:01 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <6B4E0F47-7186-4B12-97A1-EB63D2EB7079@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
	<A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
	<CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>
	<CAOW6FSJcdMTuLbN8-O-Vb7OxEXvU=F3Nwz7Rhx6qzXJ30ubV1w@mail.gmail.com>
	<6B4E0F47-7186-4B12-97A1-EB63D2EB7079@gmail.com>
Message-ID: <CAOW6FS+JddppLEgbNjS3ipBUc67tbC1FeUfi9=8TzN+NB35X8g@mail.gmail.com>

Dear Carson:

Thank you for your explanation. Sorry for not describing my problem
clearly. The first two errors went away after I changed the parameters
you suggested (e.g., max_dna_len, depth_blast). Now I only get the
following error for two contigs among thousands of contigs. One of the two
failed contigs has length 863k, and I have done more tests on this contig
individually. When I ran RepeatMasker on this contig, 65% was masked with
the species-specific repeat library, compared with only 35% with the
mammalian repeat library. Since longer contigs (even 98Mb) can all be
annotated, I am not sure why this much shorter one would fail due to IO.

I did not set "TMP", and I am running on a high performance cluster. I am
not sure whether the temporary directory there is a virtual (in-memory)
location or not. I will check it later. Many thanks
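
One possible way to do that check on a Linux compute node is sketched below in Python. It finds the mount that holds a given directory from /proc/mounts-style text and labels it as network-mounted, in-memory, or other. The filesystem type lists are assumptions and are not exhaustive, and the function names are made up for this example.

```python
import posixpath

# Assumption: a non-exhaustive list of filesystem types as they appear in
# the third column of /proc/mounts. Extend it for your site's storage.
NETWORK_FS = {"nfs", "nfs4", "cifs", "smbfs", "lustre", "gpfs", "beegfs",
              "ceph", "fuse.glusterfs"}
MEMORY_FS = {"tmpfs", "ramfs"}


def parse_mounts(mounts_text):
    """Return (mount_point, fstype) pairs from /proc/mounts-style text."""
    mounts = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            mounts.append((fields[1], fields[2]))
    return mounts


def filesystem_for(path, mounts):
    """Return (mount_point, fstype) of the deepest mount containing an absolute path."""
    path = posixpath.normpath(path)
    best = None
    for mount_point, fstype in mounts:
        mp = posixpath.normpath(mount_point)
        inside = mp == "/" or path == mp or path.startswith(mp + "/")
        # '>=' lets a later entry for the same mount point win, since later
        # mounts shadow earlier ones.
        if inside and (best is None or len(mp) >= len(best[0])):
            best = (mp, fstype)
    return best


def classify(fstype):
    """Label a filesystem type as 'network', 'memory', or 'other'."""
    if fstype in NETWORK_FS:
        return "network"
    if fstype in MEMORY_FS:
        return "memory"
    return "other"
```

On the node assigned to the job, read /proc/mounts into a string, pass it to parse_mounts, and call filesystem_for with /tmp (or whatever TMP= is set to in the control files). A "network" or "memory" result would match the two problem cases described in the message quoted below. "other" only means the type is not in either list, so it still deserves a look.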

Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
33708 --> rank=NA, hostname=n409
33709 ERROR: Failed while processing all repeats
33710 ERROR: Chunk failed at level:3, tier_type:1
33711 FAILED CONTIG:Contig31

Best
Quanwei

2017-09-13 14:23 GMT-04:00 Carson Holt <carsonhh at gmail.com>:

> These are the 3 errors you have shown in your e-mails ->
> open3: fork failed: Cannot allocate memory at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>
> The first two are memory related with the second being because it cannot
> kill a lock maintainer thread that it was not able to start because of lack
> of memory.
>
> The third one is IO related. It is a truncated file that succeeded on the
> second try according to the e-mail you sent.
>
>
> IO errors are quite common with NFS (network mounted file systems). It's
> one of the most frequent issues submitted to the devel list. MAKER can hit
> IO limits long before it hits CPU limits. One of the most frequent causes
> of these issues is that the user set TMP= in the control files to a manual
> location that is not suitable for high IO (note TMP= defaults to /tmp). The
> location should always be a true locally mounted disk. Sometimes this is a
> virtual location (not really local disk but network mounted disk or an in
> memory location). With the former you will get frequent IO failures and
> with the latter you will also get out of memory issues.
>
> Note that when you supply more data files you will also use more memory
> (to hold analysis results). According to your e-mail the last error you got
> was 'Can't kill a non-numeric process ID'. Correct? So getting the error
> with two input files but not when you supply a single input file further
> suggests you are running low on RAM.
>
> 1. Some things to check. Make sure TMP= is not being set to a network
> mounted location.
> 2. Make sure your temporary directory is not a virtual in memory directory
> on the node being used.
> 3. If nodes are shared, you may run out of memory because of other users
> or because you failed to request enough RAM during job submission.
>
> Finally, try running interactively so you can see what the memory and
> directory locations look like on the node you get assigned for the job
> (check space and mount points. Is /tmp or wherever you set TMP= in fact a
> local disk?). Also run with MPI rather than starting multiple MAKER
> instances. It uses resources better.
>
> Thanks,
> Carson
>
>
>
>
>
>
> On Sep 13, 2017, at 8:32 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>
> Dear Carson:
>
> I did more tests on one of the contigs (length 863kb) that failed
> during repeat masking. I found that it only fails when I add the species
> specific repeat library; it is annotated successfully when only the
> mammalian repeat library is used. For this test I picked only this
> contig and ran MAKER with 64G of memory. So I think the failure should
> not be a memory or IO problem, because even the contigs with length
> 98Mb can be annotated with 32G of memory.
>
> I also ran RepeatMasker on this contig with the mammalian and species
> specific repeat libraries separately. With the mammalian repeat library,
> about 35% was masked as repeats, while with the species specific repeat
> library it was 65% (as shown below). I wonder whether the high
> level of repeats can lead to the failure of this contig. Do you have any
> ideas about this? Thanks
>
>
>
> file name: test_scaffold31.fasta
> sequences:             1
> total length:     863590 bp  (858757 bp excl N/X-runs)
> GC level:         37.02 %
> bases masked:     562909 bp ( 65.18 %)
> ==================================================
>                number of      length   percentage
>                elements*    occupied  of sequence
> --------------------------------------------------
> SINEs:              113        16134 bp    1.87 %
>       ALUs           71        12479 bp    1.45 %
>       MIRs            1          133 bp    0.02 %
>
> LINEs:              251       380142 bp   44.02 %
>       LINE1         211       210623 bp   24.39 %
>       LINE2           1           86 bp    0.01 %
>       L3/CR1          0            0 bp    0.00 %
>
> LTR elements:       246       101221 bp   11.72 %
>       ERVL            5         1037 bp    0.12 %
>       ERVL-MaLRs     18         2744 bp    0.32 %
>       ERV_classI    201        90942 bp   10.53 %
>       ERV_classII    18         5964 bp    0.69 %
>
> DNA elements:        39        14177 bp    1.64 %
>      hAT-Charlie      7         3864 bp    0.45 %
>      TcMar-Tigger     7         1706 bp    0.20 %
>
> Unclassified:       196        45831 bp    5.31 %
>
> Total interspersed repeats:   557505 bp   64.56 %
>
>
> Small RNA:            3          823 bp    0.10 %
>
> Satellites:           2          237 bp    0.03 %
> Simple repeats:      94         4472 bp    0.52 %
> Low complexity:      18          766 bp    0.09 %
> ==================================================
>
> * most repeats fragmented by insertions or deletions
>   have been counted as one element
>
>
> The query species was assumed to be homo
> RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
>
> run with rmblastn version 2.2.27+
> The query was compared to classified sequences in ".../consensi.fa.classifiednoProtFinal"
>
>
>
> Best
> Quanwei
>
> 2017-09-11 14:33 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>
>> Dear Carson:
>>
>> I see. Thank you. I will try it.
>>
>> Best
>> Quanwei
>>
>> 2017-09-11 13:46 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>
>>> Each node is a single machine. Because you currently run without MPI,
>>> each MAKER job you submit runs on a single machine. So you are either
>>> running multiple times on the same node, or you submitted 5 separate batch
>>> jobs in which case you may have a single maker process on each of 5 nodes.
>>>
>>> MPI can parallelize on the same node or across nodes. If you request 10
>>> nodes, then it can communicate across nodes to run the job on all hardware.
>>> Or you can run MPI on a single node and ask for all CPUs on that node. In
>>> that case it will split up work within a single node and use all resources
>>> just on that node. So if you can't get MPI to work across nodes, you can
>>> just submit a job that goes to a single node and ask for all CPUs on that
>>> node (multinode jobs may be hard to configure, but single node jobs are
>>> very easy). Just set the -n parameter of mpiexec to the CPU count of that
>>> node, and it will parallelize within the node.
>>>
>>> Example command for a 20 CPU node ->  mpiexec -n 20 maker
>>>
>>> --Carson
>>>
>>>
>>>
>>>
>>>
>>> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>> wrote:
>>>
>>> Dear Carson:
>>>
>>> Would you please explain what you mean by "a single machine"? I am
>>> running maker2 on our high performance cluster. The cluster has more than
>>> 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine was used
>>> as the scheduler. Can I use MPICH3?
>>>
>>> Thanks
>>>
>>> Best
>>> Quanwei
>>>
>>> 2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>
>>>> If you are just using a single machine (and not cross machine MPI), use
>>>> MPICH3 -> https://www.mpich.org
>>>>
>>>> It's easy to install yourself, and tends to be very robust to failure.
>>>>
>>>> --Carson
>>>>
>>>>
>>>>
>>>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>> wrote:
>>>>
>>>> Dear Carson:
>>>>
>>>> I ran into some problems using MPI. I will give it another try.
>>>> Thank you!
>>>>
>>>> Best
>>>> Quanwei
>>>>
>>>> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>
>>>>> It could be either. Please use MPI instead of starting multiple
>>>>> instances. It will greatly reduce both IO and RAM usage.
>>>>>
>>>>> --Carson
>>>>>
>>>>>
>>>>>
>>>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>>> wrote:
>>>>>
>>>>> Dear Carson:
>>>>>
>>>>> I only run 5 MAKER instances in each directory (with cpus=2). If this
>>>>> is a memory or IO issue, I am not sure why the much longer scaffolds
>>>>> were all annotated successfully while the relatively shorter ones
>>>>> failed.
>>>>>
>>>>> I have set "tries=5" (#number of times to try a contig if there is a
>>>>> failure for some reason). I will try "clean_try=1" and test the failed
>>>>> scaffolds individually with more memory to see whether they can be
>>>>> annotated.
>>>>>
>>>>> Thank you!
>>>>>
>>>>> Best
>>>>> Quanwei
>>>>>
>>>>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>
>>>>>> I think the cause of the error may have been a little further
>>>>>> upstream from what you pasted in the e-mail. One thing that may be
>>>>>> happening is that you are taxing resources (like IO) if running MAKER
>>>>>> multiple times or on too many CPUs. That can lead to failures because of
>>>>>> truncated BLAST reports etc. In which case you can just retry and that will
>>>>>> get around those types of IO derived errors. MAKER can generate a lot of
>>>>>> IO, and if you are working on network mounted locations (i.e. the storage
>>>>>> being used is actually across the network), then they can be less robust
>>>>>> than local storage (when under heavy load NFS can falsely report success on
>>>>>> read/write operations that actually failed). It's the reason we built in
>>>>>> the retry capabilities of MAKER.
>>>>>>
>>>>>> For contigs that continuously fail, you may need to set clean_try=1.
>>>>>> That will cause failures to start from scratch (i.e. delete all old reports
>>>>>> on failure rather than just those suspected of being truncated).
>>>>>>
>>>>>> --Carson
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Dear Carson:
>>>>>>
>>>>>> About the error in my previous email: I found that the contig was
>>>>>> annotated correctly on the second retry, so please ignore that email.
>>>>>> But now, for a small number of scaffolds, I am having problems
>>>>>> processing the repeats (as shown below). I used both the Mammalia
>>>>>> repeat library and a species-specific repeat library (generated by
>>>>>> your pipeline
>>>>>> "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic").
>>>>>> There were no such problems when I used only the Mammalia repeat
>>>>>> library. Do you have any ideas about what the reason could be, or any
>>>>>> suggestions for how I could find it? Many thanks
>>>>>>
>>>>>> Here are some parameters I used
>>>>>>
>>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
>>>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
>>>>>>
>>>>>> max_dna_len=300000
>>>>>> split_hit=40000
>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>>
>>>>>>
>>>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>>>>>> 33708 --> rank=NA, hostname=n409
>>>>>> 33709 ERROR: Failed while processing all repeats
>>>>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>>>>> 33711 FAILED CONTIG:Contig31
>>>>>>
>>>>>>
>>>>>> Best
>>>>>> Quanwei
>>>>>>
>>>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>>>>>
>>>>>>> Dear Carson:
>>>>>>>
>>>>>>> I got the following error again. Is this still related to memory
>>>>>>> issues, or could something else lead to this error? This time I got
>>>>>>> the error while training the SNAP model. Previously, even with
>>>>>>> max_dna_len=1Mb, I could train the model successfully. In the current
>>>>>>> training (where I get the following error), I have decreased
>>>>>>> max_dna_len to 300kb and requested the same amount of memory as
>>>>>>> before. The only difference is that I am now using both the mammalian
>>>>>>> repeat library and the species-specific repeat library, whereas
>>>>>>> previously I used only the mammalian repeat library. Does using both
>>>>>>> repeat libraries greatly increase the memory requirement (even when I
>>>>>>> decrease max_dna_len from 1Mb to 300kb)? I have also set depth_blast
>>>>>>> to 30 in the current training.
>>>>>>>
>>>>>>> Thank you! Have a nice weekend!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> #---------------------------------------------------------------------
>>>>>>> Now starting the contig!!
>>>>>>> SeqID: Contig10
>>>>>>> Length: 18773588
>>>>>>> #---------------------------------------------------------------------
>>>>>>>
>>>>>>>
>>>>>>> setting up GFF3 output and fasta chunks
>>>>>>> doing repeat masking
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> collecting blastx repeatmasking
>>>>>>> processing all repeats
>>>>>>> doing repeat masking
>>>>>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
>>>>>>> --> rank=NA, hostname=n224
>>>>>>> ERROR: Failed while doing repeat masking
>>>>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>>>>> FAILED CONTIG:Contig10
>>>>>>>
>>>>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>>>>> FAILED CONTIG:Contig10
>>>>>>>
>>>>>>> Best
>>>>>>> Quanwei
>>>>>>>
>>>>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>>>
>>>>>>>>
>>>>>>>> (2) By reading some of your replies in the maker google group, and
>>>>>>>> I noticed that it can reduce memory and save time for annotation if I set
>>>>>>>> depth_blast to a certain number. So I changed the following parameters. But
>>>>>>>> I wonder, whether it will decrease the quality of annotation? If it won't
>>>>>>>> affect the quality, can I even use a smaller number (e.g., 20) to save more
>>>>>>>> memory and time?
>>>>>>>>
>>>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>>>>
>>>>>>>>
>>>>>>>> These values really only affect the final evidence kept in the GFF3
>>>>>>>> when you look at it in a browser. They have no effect on the annotation. This
>>>>>>>> is because internally MAKER already collapses evidence down to the 10 best
>>>>>>>> non-redundant features per evidence set per locus. The rest are put in the
>>>>>>>> GFF3 just for reference. By setting them lower, you are just letting MAKER
>>>>>>>> know it can throw things away even sooner since you don't want them in
>>>>>>>> the GFF3. It provides a minor improvement for memory use, but
>>>>>>>> max_dna_len is the big one that has the greatest effect.
>>>>>>>>
>>>>>>>>
>>>>>>>> (3) I also have some concerns about the speed, especially for the
>>>>>>>> long scaffolds (around 100Mb). I wonder which part is the most time
>>>>>>>> consuming for genome annotation (repeat masking, blast, or polishing?).
>>>>>>>> Particularly, I wonder whether the blastx of protein evidence will take
>>>>>>>> majority of time. Now, I have prepared 99k mammalian Swiss protein
>>>>>>>> sequences and 340k rodent TrEMBL protein sequences as protein evidences. I
>>>>>>>> am considering whether I can save much time if I only use the 99k mammalian
>>>>>>>> Swiss protein sequences as evidences.
>>>>>>>>
>>>>>>>>
>>>>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at
>>>>>>>> least 6 times slower than BLASTN
>>>>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at
>>>>>>>> least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>>>>>
>>>>>>>> Also, doubling the dataset size doubles the runtime. Larger window
>>>>>>>> sizes via max_dna_len will also increase runtimes.
>>>>>>>>
>>>>>>>>
>>>>>>>> (4) For some reasons, I can not run maker though MPI on our
>>>>>>>> cluster. So I can only start multiple maker. I wonder if it is possible to
>>>>>>>> let multiple maker to annotate the same long scaffold (i.e., for a single
>>>>>>>> sequence I start multiple maker, without splitting the long sequence into
>>>>>>>> shorter ones).
>>>>>>>>
>>>>>>>>
>>>>>>>> Without MPI you won't be able to split up large contigs. At the
>>>>>>>> very least you can try to run on a single node and set MPI to use all CPUs
>>>>>>>> on that node. It's less difficult to set up compared to cross-node jobs via
>>>>>>>> MPI.
>>>>>>>>
>>>>>>>>
>>>>>>>> (5) Still about the speed issue. I read some of your comments about
>>>>>>>> "cpus" parameters in the maker_opts file (
>>>>>>>> http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html).
>>>>>>>> And I know it indicates the number
>>>>>>>> of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file,
>>>>>>>> then I can use the following command to submit the job, right?
>>>>>>>>
>>>>>>>>
>>>>>>>> The cpus parameter only affects how many CPUs are given to the BLAST
>>>>>>>> command line, so only the BLAST step will speed up. I recommend using
>>>>>>>> MPI to get all steps to speed up. Even if you are only running on a single
>>>>>>>> node, you can give all CPUs to the mpiexec command.
>>>>>>>>
>>>>>>>>
>>>>>>>> --Carson
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
>

From carsonhh at gmail.com  Wed Sep 13 15:21:14 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Sep 2017 14:21:14 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FS+JddppLEgbNjS3ipBUc67tbC1FeUfi9=8TzN+NB35X8g@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
	<A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
	<CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>
	<CAOW6FSJcdMTuLbN8-O-Vb7OxEXvU=F3Nwz7Rhx6qzXJ30ubV1w@mail.gmail.com>
	<6B4E0F47-7186-4B12-97A1-EB63D2EB7079@gmail.com>
	<CAOW6FS+JddppLEgbNjS3ipBUc67tbC1FeUfi9=8TzN+NB35X8g@mail.gmail.com>
Message-ID: <55597FCD-B589-4FFB-A8FA-37659A48BF58@gmail.com>

One final thought. If you are using rmblast as part of the RepeatMasker installation, it may be suffering from a bug, present in some BLAST versions, that can sometimes truncate a BLAST report (example of a separate error related to BLAST report truncation here) -> https://groups.google.com/forum/#!topic/maker-devel/96KGgiQMcxQ

As a result there is a special update to rmblast -> http://www.repeatmasker.org/RMBlast.html

So if you are not using the update, try it; but if you are using the update and it is giving the error, roll back to rmblast 2.2.28 (i.e. the update may be the cause or the cure of RepeatMasker errors).

--Carson
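
To see which rmblast you are actually running, something like the following may help. It is only a minimal sketch: it assumes the binary is named rmblastn, that it is on your PATH (RepeatMasker may instead be configured with its own path), and that it accepts the usual BLAST+ -version flag. The function names are made up for illustration.

```python
import re
import shutil
import subprocess


def parse_blast_version(text):
    """Return (major, minor, patch) from text such as 'rmblastn: 2.2.27+', or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in match.groups()) if match else None


def installed_rmblastn_version(binary="rmblastn"):
    """Ask the binary on PATH for its version; None if it is missing or unparseable."""
    path = shutil.which(binary)
    if path is None:
        return None
    # Assumption: the binary supports the standard BLAST+ '-version' flag.
    result = subprocess.run([path, "-version"], capture_output=True, text=True)
    return parse_blast_version(result.stdout)
```

Calling installed_rmblastn_version() on the compute node, and comparing the tuple with (2, 2, 28), would show whether you are on the older release, the rollback target, or something newer.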


> On Sep 13, 2017, at 1:42 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> Thank you for your explanation. Sorry for not describing my problem clearly. The first two errors were all gone after I changed the parameters you suggested (e.g., max_dna_len, depth_blast). Now I only get the following error for two contigs among thousands of contigs. One of the two failed contigs has length 863kb, and I have done more tests on this contig individually. When I ran RepeatMasker on this contig, 65% was masked using the species-specific repeat library, while only 35% was masked using the mammalian repeat library. Since longer contigs (even 98Mb) can all be annotated, I am not sure why this much shorter one would fail because of IO.
> 
> I did not set "TMP", and I am running on a high performance cluster. I am not sure whether the temporary directory is a virtual in-memory location or not. I will check it later. Many thanks
> 
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
> 33708 --> rank=NA, hostname=n409
> 33709 ERROR: Failed while processing all repeats
> 33710 ERROR: Chunk failed at level:3, tier_type:1
> 33711 FAILED CONTIG:Contig31
> 
> Best
> Quanwei
> 
> 2017-09-13 14:23 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
> These are the 3 errors you have shown in your e-mails ->
> open3: fork failed: Cannot allocate memory at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
> 
> The first two are memory related. The second occurs because MAKER cannot kill a lock-maintainer thread that it was never able to start for lack of memory.
> 
> The third one is IO related. It is a truncated file that succeeded on the second try according to the e-mail you sent.
> 
> 
> IO errors are quite common with NFS (network mounted file systems). It's one of the most frequent issues submitted to the devel list. MAKER can hit IO limits long before it hits CPU limits. One of the most frequent causes of these issues is that the user set TMP= in the control files to a manual location that is not suitable for high IO (note TMP= defaults to /tmp). The location should always be a true locally mounted disk. Sometimes it is a virtual location (not really local disk, but network mounted disk or an in-memory location). With the former you will get frequent IO failures, and with the latter you will also get out-of-memory issues.
> 
> Note that when you supply more data files you will also use more memory (to hold analysis results). According to your e-mail the last error you got was 'Can't kill a non-numeric process ID'. Correct? So getting the error with two input files but not when you supply a single input file further suggests you are running low on RAM.
> 
> Some things to check:
> 1. Make sure TMP= is not being set to a network mounted location.
> 2. Make sure your temporary directory is not a virtual in-memory directory on the node being used.
> 3. If nodes are shared, you may run out of memory because of other users or because you failed to request enough RAM during job submission.
> 
> Finally, try running interactively so you can see what the memory and directory locations look like on the node you get assigned for the job (check space and mount points. Is /tmp, or wherever you set TMP=, in fact a local disk?). Also run with MPI rather than starting multiple MAKER instances. It uses resources better.
> 
> Thanks,
> Carson
> 
> 
> 
> 
> 
> 
>> On Sep 13, 2017, at 8:32 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>> 
>> Dear Carson:
>> 
>> I did more tests on one of the contigs (with length 863kb) that failed when doing repeat masking. I found it only fails when I add the species-specific repeat library, and it can be successfully annotated when only the mammalian repeat library is used. For this test I picked only this contig and ran MAKER with 64G of memory. So I think the failure should not be a problem with memory or IO, because even the contigs with length 98Mb can be annotated with 32G of memory.
>> 
>> I also ran RepeatMasker on this contig with the mammalian and species-specific repeat libraries separately. With the mammalian repeat library about 35% was masked as repeats, while with the species-specific repeat library it was 65% (as shown below). I wonder whether the high level of repeats can lead to the failure of this contig. Do you have any ideas about this? Thanks
>> 
>> 
>> 
>> file name: test_scaffold31.fasta    
>> sequences:             1
>> total length:     863590 bp  (858757 bp excl N/X-runs)
>> GC level:         37.02 %
>> bases masked:     562909 bp ( 65.18 %)
>> ==================================================
>>                number of      length   percentage
>>                elements*    occupied  of sequence
>> --------------------------------------------------
>> SINEs:              113        16134 bp    1.87 %
>>       ALUs           71        12479 bp    1.45 %
>>       MIRs            1          133 bp    0.02 %
>> 
>> LINEs:              251       380142 bp   44.02 %
>>       LINE1         211       210623 bp   24.39 %
>>       LINE2           1           86 bp    0.01 %
>>       L3/CR1          0            0 bp    0.00 %
>> 
>> LTR elements:       246       101221 bp   11.72 %
>>       ERVL            5         1037 bp    0.12 %
>>       ERVL-MaLRs     18         2744 bp    0.32 %
>>       ERV_classI    201        90942 bp   10.53 %
>>       ERV_classII    18         5964 bp    0.69 %
>> 
>> DNA elements:        39        14177 bp    1.64 %
>>      hAT-Charlie      7         3864 bp    0.45 %
>>      TcMar-Tigger     7         1706 bp    0.20 %
>> 
>> Unclassified:       196        45831 bp    5.31 %
>> 
>> Total interspersed repeats:   557505 bp   64.56 %
>> 
>> 
>> Small RNA:            3          823 bp    0.10 %
>> 
>> Satellites:           2          237 bp    0.03 %
>> Simple repeats:      94         4472 bp    0.52 %
>> Low complexity:      18          766 bp    0.09 %
>> ==================================================
>> 
>> * most repeats fragmented by insertions or deletions
>>   have been counted as one element
>>                                                       
>> 
>> The query species was assumed to be homo          
>> RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
>>         
>> run with rmblastn version 2.2.27+
>> The query was compared to classified sequences in ".../consensi.fa.classifiednoProtFinal"  
>> 
>> 
>> Best
>> Quanwei
>> 
>> 2017-09-11 14:33 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>> Dear Carson:
>> 
>> I see. Thank you. I will try it.
>> 
>> Best
>> Quanwei
>> 
>> 2017-09-11 13:46 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>> Each node is a single machine. Because you currently run without MPI, each MAKER job you submit runs on a single machine. So you are either running multiple times on the same node, or you submitted 5 separate batch jobs in which case you may have a single maker process on each of 5 nodes.
>> 
>> MPI can parallelize on the same node or across nodes. If you request 10 nodes, then it can communicate across nodes to run the job on all hardware. Or you can run MPI on a single node and ask for all CPUs on that node. In that case it will split up work within a single node and use all resources just on that node. So if you can't get MPI to work across nodes, you can just submit a job that goes to a single node and ask for all CPUs on that node (multinode jobs may be hard to configure, but single node jobs are very easy). Just set the -n parameter of mpiexec to the CPU count of that node, and it will parallelize within the node.
>> 
>> Example command for a 20 CPU node ->  mpiexec -n 20 maker
>> 
>> --Carson
>> 
>> 
>> 
>> 
>> 
>>> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>> 
>>> Dear Carson: 
>>> 
>>> Would you please explain what do you mean by "a single machine"? I am running maker2 on our high performance cluster. The cluster has more than 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine was used as the scheduler. Can I use MPICH3?
>>> 
>>> Thanks
>>> 
>>> Best
>>> Quanwei
>>> 
>>> 2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>> If you are just using a single machine (and not cross-machine MPI), use MPICH3 -> https://www.mpich.org
>>> 
>>> It's easy to install yourself, and tends to be very robust to failure.
>>> 
>>> --Carson
>>> 
>>> 
>>> 
>>>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>>> 
>>>> Dear Carson:
>>>> 
>>>> I had some problems using MPI. I will give it another try.
>>>> Thank you!
>>>> 
>>>> Best
>>>> Quanwei
>>>> 
>>>> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>> It could be either. Please use MPI instead of starting multiple instances. It will greatly reduce both IO and RAM usage.
>>>> 
>>>> --Carson
>>>> 
>>>> 
>>>> 
>>>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>>>> 
>>>>> Dear Carson:
>>>>> 
>>>>> I only run 5 Maker instances in each directory (and set cpus=2). If it is related to memory issue or an IO issue, I am not sure why the much longer scaffolds (than the failed ones) were all annotated successfully, but the relatively shorter ones failed.  
>>>>> 
>>>>> I have set "tries=5" (#number of times to try a contig if there is a failure for some reason). I will try "clean_try=1" and test on the failed scaffolds individually with larger memory to see whether they can be annotated. 
>>>>> 
>>>>> Thank you!
>>>>> 
>>>>> Best
>>>>> Quanwei
>>>>> 
>>>>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>> I think the cause of the error may have been a little further upstream from what you pasted in the e-mail. One thing that may be happening is that you are taxing resources (like IO) if running MAKER multiple times or on too many CPUs. That can lead to failures because of truncated BLAST reports etc. In that case you can just retry, and that will get around those types of IO-derived errors. MAKER can generate a lot of IO, and if you are working on network mounted locations (i.e. the storage being used is actually across the network), they can be less robust than local storage (under heavy load NFS can falsely report success on read/write operations that actually failed). It's the reason we built the retry capabilities into MAKER.
>>>>> 
>>>>> For contigs that continuously fail, you may need to set clean_try=1. That will cause failures to start from scratch (i.e. delete all old reports on failure rather than just those suspected of being truncated).
>>>>> 
>>>>> --Carson
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>>>>> 
>>>>>> Dear Carson:
>>>>>> 
>>>>>> About the error in my above email, I found the contig was correctly annotated on the second RETRY. So please ignore my last email. But now, for a small number of scaffolds, I have problems processing the repeats (as shown below). I used both the Mammalia repeat library and a species-specific repeat library (generated by your pipeline "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic"). There were no such problems when I only used the Mammalia repeat library. Do you have any ideas about this? What could be the reason? Or do you have any suggestions for how I can find the reason? Many thanks
>>>>>> 
>>>>>> Here are some parameters I used
>>>>>> 
>>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
>>>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
>>>>>> 
>>>>>> max_dna_len=300000
>>>>>> split_hit=40000
>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>> 
>>>>>> 
>>>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>>>>>> 33708 --> rank=NA, hostname=n409
>>>>>> 33709 ERROR: Failed while processing all repeats
>>>>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>>>>> 33711 FAILED CONTIG:Contig31
>>>>>> 
>>>>>> 
>>>>>> Best
>>>>>> Quanwei
>>>>>> 
>>>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>>>>> Dear Carson:
>>>>>> 
>>>>>> I got the following error again. Is this still related to memory issues? I wonder whether other reasons could lead to this error. This time, I got this error during training of the SNAP model. Before, even when I set max_dna_len=1Mb, I could train the model successfully. In the current training (where I get the following error), I have decreased max_dna_len to 300kb. I requested the same amount of memory as before. The only difference is that I am using both the mammalian repeat library and the species-specific repeat library, while previously I only used the mammalian repeat library. Will using both repeat libraries greatly increase the memory requirement (even when I decrease max_dna_len from 1Mb to 300kb)? I have also set depth_blast to 30 in the current training.
>>>>>> 
>>>>>> Thank you! Have a nice weekend! 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> #---------------------------------------------------------------------
>>>>>> Now starting the contig!!
>>>>>> SeqID: Contig10
>>>>>> Length: 18773588
>>>>>> #---------------------------------------------------------------------
>>>>>> 
>>>>>> 
>>>>>> setting up GFF3 output and fasta chunks
>>>>>> doing repeat masking
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> collecting blastx repeatmasking
>>>>>> processing all repeats
>>>>>> doing repeat masking
>>>>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
>>>>>> --> rank=NA, hostname=n224
>>>>>> ERROR: Failed while doing repeat masking
>>>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>>>> FAILED CONTIG:Contig10
>>>>>> 
>>>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>>>> FAILED CONTIG:Contig10
>>>>>> 
>>>>>> Best
>>>>>> Quanwei
>>>>>> 
>>>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>> 
>>>>>>> (2) By reading some of your replies in the maker google group, and I noticed that it can reduce memory and save time for annotation if I set depth_blast to a certain number. So I changed the following parameters. But I wonder, whether it will decrease the quality of annotation? If it won't affect the quality, can I even use a smaller number (e.g., 20) to save more memory and time?
>>>>>>> 
>>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>> 
>>>>>> These values really only affect the final evidence kept in the GFF3 when you look at it in a browser. They have no effect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. By setting them lower, you are just letting MAKER know it can throw things away even sooner since you don't want them in the GFF3. It provides a minor improvement for memory use, but max_dna_len is the big one that has the greatest effect.
>>>>>> 
>>>>>> 
>>>>>>> (3) I also have some concerns about the speed, especially for the long scaffolds (around 100Mb). I wonder which part is the most time consuming for genome annotation (repeat masking, blast, or polishing?).  Particularly, I wonder whether the blastx of protein evidence will take majority of time. Now, I have prepared 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences as protein evidences. I am considering whether I can save much time if I only use the 99k mammalian Swiss protein sequences as evidences.
>>>>>> 
>>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN
>>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>>> 
>>>>>> Also, doubling the dataset size doubles the runtime. Larger window sizes via max_dna_len will also increase runtimes.
>>>>>> 
>>>>>> 
>>>>>>> (4) For some reasons, I can not run maker though MPI on our cluster. So I can only start multiple maker. I wonder if it is possible to let multiple maker to annotate the same long scaffold (i.e., for a single sequence I start multiple maker, without splitting the long sequence into shorter ones).
>>>>>> 
>>>>>> Without MPI you won't be able to split up large contigs. At the very least you can try to run on a single node and set MPI to use all CPUs on that node. It's less difficult to set up compared to cross-node jobs via MPI.
>>>>>> 
>>>>>> 
>>>>>>> (5) Still about the speed issue. I read some of your comments about "cpus" parameters in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html). And I know it indicates the number of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?
>>>>>> 
>>>>>> The cpus parameter only affects how many CPUs are given to the BLAST command line, so only the BLAST step will speed up. I recommend using MPI to get all steps to speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
>>>>>> 
>>>>>> 
>>>>>> --Carson
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
> 
> 


From qwzhang0601 at gmail.com  Wed Sep 13 15:26:11 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 13 Sep 2017 16:26:11 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <55597FCD-B589-4FFB-A8FA-37659A48BF58@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
	<A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
	<CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>
	<CAOW6FSJcdMTuLbN8-O-Vb7OxEXvU=F3Nwz7Rhx6qzXJ30ubV1w@mail.gmail.com>
	<6B4E0F47-7186-4B12-97A1-EB63D2EB7079@gmail.com>
	<CAOW6FS+JddppLEgbNjS3ipBUc67tbC1FeUfi9=8TzN+NB35X8g@mail.gmail.com>
	<55597FCD-B589-4FFB-A8FA-37659A48BF58@gmail.com>
Message-ID: <CAOW6FSKU9Tn6HN3fZAnXquVU0OrdsxZuHB8GCG76BNQAZ_kdKg@mail.gmail.com>

Dear Carson:

I will take a look and try it. Thank you.

Best
Quanwei

2017-09-13 16:21 GMT-04:00 Carson Holt <carsonhh at gmail.com>:

> One final thought. If you are using rmblast as part of the RepeatMasker
> installation, it may be suffering from a bug, present in some BLAST versions,
> that can sometimes truncate a BLAST report (example of a
> separate error related to BLAST report truncation here) ->
> https://groups.google.com/forum/#!topic/maker-devel/96KGgiQMcxQ
>
> As a result there is a special update to rmblast ->
> http://www.repeatmasker.org/RMBlast.html
>
> So if you are not using the update, try it; but if you are using the update
> and it is giving the error, roll back to rmblast 2.2.28 (i.e. the update
> may be the cause or the cure of RepeatMasker errors).
>
> --Carson
>
>
>
> On Sep 13, 2017, at 1:42 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>
> Dear Carson:
>
> Thank you for your explanation. Sorry for not describing my problem
> clearly. The first two errors were all gone after I changed the parameters
> you suggested (e.g., max_dna_len, depth_blast). Now I only get the
> following error for two contigs among thousands of contigs. One of the two
> failed contigs has length 863kb, and I have done more tests on this contig
> individually. When I ran RepeatMasker on this contig, 65% was masked
> using the species-specific repeat library, while only 35% was masked using
> the mammalian repeat library. Since longer contigs (even 98Mb) can all be
> annotated, I am not sure why this much shorter one would fail because of IO.
>
> I did not set "TMP", and I am running on a high performance cluster. I am
> not sure whether the temporary directory is a virtual in-memory location or
> not. I will check it later. Many thanks
>
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm
> line 188.
> 33708 --> rank=NA, hostname=n409
> 33709 ERROR: Failed while processing all repeats
> 33710 ERROR: Chunk failed at level:3, tier_type:1
> 33711 FAILED CONTIG:Contig31
>
> Best
> Quanwei
>
> 2017-09-13 14:23 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>
>> These are the 3 errors you have shown in your e-mails ->
>> open3: fork failed: Cannot allocate memory at
>> /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm
>> line 1050.
>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm
>> line 188.
>>
>> The first two are memory related. The second occurs because MAKER cannot
>> kill a lock-maintainer thread that it was never able to start for lack
>> of memory.
>>
>> The third one is IO related. It is a truncated file that succeeded on the
>> second try according to the e-mail you sent.
>>
>>
>> IO errors are quite common with NFS (network mounted file systems). It's
>> one of the most frequent issues submitted to the devel list. MAKER can hit
>> IO limits long before it hits CPU limits. One of the most frequent causes
>> of these issues is that the user set TMP= in the control files to a manual
>> location that is not suitable for high IO (note TMP= defaults to /tmp). The
>> location should always be a true locally mounted disk. Sometimes it is a
>> virtual location (not really local disk, but network mounted disk or an
>> in-memory location). With the former you will get frequent IO failures, and
>> with the latter you will also get out-of-memory issues.
>>
>> Note that when you supply more data files you will also use more memory
>> (to hold analysis results). According to your e-mail the last error you got
>> was 'Can't kill a non-numeric process ID'. Correct? So getting the error
>> with two input files but not when you supply a single input file further
>> suggests you are running low on RAM.
>>
>> Some things to check:
>> 1. Make sure TMP= is not being set to a network mounted location.
>> 2. Make sure your temporary directory is not a virtual in-memory
>> directory on the node being used.
>> 3. If nodes are shared, you may run out of memory because of other users
>> or because you failed to request enough RAM during job submission.
>>
>> Finally, try running interactively so you can see what the memory and
>> directory locations look like on the node you get assigned for the job
>> (check space and mount points. Is /tmp, or wherever you set TMP=, in fact a
>> local disk?). Also run with MPI rather than starting multiple MAKER
>> instances. It uses resources better.
>>
>> Thanks,
>> Carson
>>
>>
>>
>>
>>
>>
>> On Sep 13, 2017, at 8:32 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>
>> Dear Carson:
>>
>> I did more tests on one of the contigs (length 863kb) that failed
>> during repeat masking. I found it only fails when I add the species
>> specific repeat library; it is annotated successfully when I use only
>> the mammalian repeat library. For the test I picked only this contig
>> and ran maker with 64G memory. So I think the failure should
>> not be a memory or IO problem, because even the contigs with length
>> 98Mb can be annotated with 32G memory.
>>
>> I also ran RepeatMasker on this contig with the mammalian and the species
>> specific repeat library, separately. With the mammalian repeat
>> library, about 35% was masked as repeats, while it is 65% with the
>> species specific repeat library (as shown below in blue). I wonder whether
>> the high level of repeats can lead to the failure of this contig. Do you
>> have any ideas about this? Thanks
>>
>>
>>
>> file name: test_scaffold31.fasta
>> sequences:             1
>> total length:     863590 bp  (858757 bp excl N/X-runs)
>> GC level:         37.02 %
>> bases masked:     562909 bp ( 65.18 %)
>> ==================================================
>>                number of      length   percentage
>>                elements*    occupied  of sequence
>> --------------------------------------------------
>> SINEs:              113        16134 bp    1.87 %
>>       ALUs           71        12479 bp    1.45 %
>>       MIRs            1          133 bp    0.02 %
>>
>> LINEs:              251       380142 bp   44.02 %
>>       LINE1         211       210623 bp   24.39 %
>>       LINE2           1           86 bp    0.01 %
>>       L3/CR1          0            0 bp    0.00 %
>>
>> LTR elements:       246       101221 bp   11.72 %
>>       ERVL            5         1037 bp    0.12 %
>>       ERVL-MaLRs     18         2744 bp    0.32 %
>>       ERV_classI    201        90942 bp   10.53 %
>>       ERV_classII    18         5964 bp    0.69 %
>>
>> DNA elements:        39        14177 bp    1.64 %
>>      hAT-Charlie      7         3864 bp    0.45 %
>>      TcMar-Tigger     7         1706 bp    0.20 %
>>
>> Unclassified:       196        45831 bp    5.31 %
>>
>> Total interspersed repeats:   557505 bp   64.56 %
>>
>>
>> Small RNA:            3          823 bp    0.10 %
>>
>> Satellites:           2          237 bp    0.03 %
>> Simple repeats:      94         4472 bp    0.52 %
>> Low complexity:      18          766 bp    0.09 %
>> ==================================================
>>
>> * most repeats fragmented by insertions or deletions
>>   have been counted as one element
>>
>>
>> The query species was assumed to be homo
>> RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
>>
>> run with rmblastn version 2.2.27+
>> The query was compared to classified sequences in
>> ".../consensi.fa.classifiednoProtFinal"
>>
>>
>> Best
>> Quanwei
>>
>> 2017-09-11 14:33 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>
>>> Dear Carson:
>>>
>>> I see. Thank you. I will try it.
>>>
>>> Best
>>> Quanwei
>>>
>>> 2017-09-11 13:46 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>
>>>> Each node is a single machine. Because you currently run without MPI,
>>>> each MAKER job you submit runs on a single machine. So you are either
>>>> running multiple times on the same node, or you submitted 5 separate batch
>>>> jobs in which case you may have a single maker process on each of 5 nodes.
>>>>
>>>> MPI can parallelize on the same node or across nodes. If you request 10
>>>> nodes, then it can communicate across nodes to run the job on all hardware.
>>>> Or you can run MPI on a single node and ask for all CPUs on that node. In
>>>> that case it will split up work within a single node and use all resources
>>>> just on that node. So if you can't get MPI to work across nodes, you can
>>>> just submit a job that goes to a single node and ask for all CPUs on that
>>>> node (multinode jobs may be hard to configure, but single node jobs are
>>>> very easy). Just set the -n parameter of mpiexec to the CPU count of that
>>>> node, and it will parallelize within the node.
>>>>
>>>> Example command for a 20 CPU node ->  mpiexec -n 20 maker
>>>>
>>>> --Carson
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>> wrote:
>>>>
>>>> Dear Carson:
>>>>
>>>> Would you please explain what do you mean by "a single machine"? I am
>>>> running maker2 on our high performance cluster. The cluster has more than
>>>> 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine was used
>>>> as the scheduler. Can I use MPICH3?
>>>>
>>>> Thanks
>>>>
>>>> Best
>>>> Quanwei
>>>>
>>>> 2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>
>>>>> If you are just using a single machine (and not cross machine MPI),
>>>>> use MPICH3 -> https://www.mpich.org
>>>>>
>>>>> It's easy to install yourself, and tends to be very robust to failure.
>>>>>
>>>>> --Carson
>>>>>
>>>>>
>>>>>
>>>>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>>> wrote:
>>>>>
>>>>> Dear Carson:
>>>>>
>>>>> I ran into some problems using MPI. I will give it another try.
>>>>> Thank you!
>>>>>
>>>>> Best
>>>>> Quanwei
>>>>>
>>>>> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>
>>>>>> It could be either. Please use MPI instead of starting multiple
>>>>>> instances. It will greatly reduce both IO and RAM usage.
>>>>>>
>>>>>> --Carson
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Dear Carson:
>>>>>>
>>>>>> I only run 5 Maker instances in each directory (and set cpus=2). If
>>>>>> it is related to memory issue or an IO issue, I am not sure why the much
>>>>>> longer scaffolds (than the failed ones) were all annotated successfully,
>>>>>> but the relatively shorter ones failed.
>>>>>>
>>>>>> I have set "tries=5" (#number of times to try a contig if there is a
>>>>>> failure for some reason). I will try "clean_try=1" and test on the failed
>>>>>> scaffolds individually with larger memory to see whether they can be
>>>>>> annotated.
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> Best
>>>>>> Quanwei
>>>>>>
>>>>>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>>
>>>>>>> I think the cause of the error may have been a little further
>>>>>>> upstream from what you pasted in the e-mail. One thing that may be
>>>>>>> happening is that you are taxing resources (like IO) if running MAKER
>>>>>>> multiple times or on too many CPUs. That can lead to failures because of
>>>>>>> truncated BLAST reports etc. In which case you can just retry and that will
>>>>>>> get around those types of IO derived errors. MAKER can generate a lot of
>>>>>>> IO, and if you are working on network mounted locations (i.e. the storage
>>>>>>> being used is actually across the network), then they can be less robust
>>>>>>> than local storage (when under heavy load NFS can falsely report success on
>>>>>>> read/write operations that actually failed). It's the reason we built in
>>>>>>> the retry capabilities of MAKER.
>>>>>>>
>>>>>>> For contigs that continuously fail, you may need to set clean_try=1.
>>>>>>> That will cause failures to start from scratch (i.e. delete all old reports
>>>>>>> on failure rather than just those suspected of being truncated).
>>>>>>>
>>>>>>> --Carson
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Dear Carson:
>>>>>>>
>>>>>>> About the error in my above email, I found the contig was correctly
>>>>>>> annotated on the second RETRY. So please ignore my last email. But
>>>>>>> now, for a few scaffolds, I have problems processing the repeats
>>>>>>> (as shown below in red). I used both the Mammalia repeat library and a
>>>>>>> species specific repeat library (generated by your pipeline
>>>>>>> "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic").
>>>>>>> There were no such problems when I only used the Mammalia repeat
>>>>>>> library. Do you have any ideas about this? What
>>>>>>> could be the reason? Or do you have any suggestions for how to find the
>>>>>>> reason? Many thanks
>>>>>>>
>>>>>>> Here are some parameters I used
>>>>>>>
>>>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>>>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
>>>>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
>>>>>>>
>>>>>>> max_dna_len=300000
>>>>>>> split_hit=40000
>>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>>>
>>>>>>>
>>>>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>>>>>>> --> rank=NA, hostname=n409
>>>>>>> ERROR: Failed while processing all repeats
>>>>>>> ERROR: Chunk failed at level:3, tier_type:1
>>>>>>> FAILED CONTIG:Contig31
>>>>>>>
>>>>>>>
>>>>>>> Best
>>>>>>> Quanwei
>>>>>>>
>>>>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>>>>>>
>>>>>>>> Dear Carson:
>>>>>>>>
>>>>>>>> I got the following error again. Is this still related to memory
>>>>>>>> issues? Could there be other causes for this error?
>>>>>>>> This time, I got this error during training of the SNAP model. Before, even
>>>>>>>> with max_dna_len=1Mb, I could train the model successfully. And in the
>>>>>>>> current training (where I get the following error), I have decreased
>>>>>>>> max_dna_len to 300kb. I requested the same amount of memory as before. The only
>>>>>>>> difference is that I am using both the mammalian repeat library and a species
>>>>>>>> specific repeat library, while previously I only used the mammalian repeat
>>>>>>>> library. Does using both repeat libraries greatly increase the memory
>>>>>>>> requirement (even when I decrease max_dna_len from 1Mb to 300kb)? I
>>>>>>>> have also set depth_blast to 30 in the current training.
>>>>>>>>
>>>>>>>> Thank you! Have a nice weekend!
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> #---------------------------------------------------------------------
>>>>>>>> Now starting the contig!!
>>>>>>>> SeqID: Contig10
>>>>>>>> Length: 18773588
>>>>>>>> #---------------------------------------------------------------------
>>>>>>>>
>>>>>>>>
>>>>>>>> setting up GFF3 output and fasta chunks
>>>>>>>> doing repeat masking
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> collecting blastx repeatmasking
>>>>>>>> processing all repeats
>>>>>>>> doing repeat masking
>>>>>>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
>>>>>>>> --> rank=NA, hostname=n224
>>>>>>>> ERROR: Failed while doing repeat masking
>>>>>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>>>>>> FAILED CONTIG:Contig10
>>>>>>>>
>>>>>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>>>>>> FAILED CONTIG:Contig10
>>>>>>>>
>>>>>>>> Best
>>>>>>>> Quanwei
>>>>>>>>
>>>>>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> (2) By reading some of your replies in the maker google group, and
>>>>>>>>> I noticed that it can reduce memory and save time for annotation if I set
>>>>>>>>> depth_blast to a certain number. So I changed the following parameters. But
>>>>>>>>> I wonder, whether it will decrease the quality of annotation? If it won't
>>>>>>>>> affect the quality, can I even use a smaller number (e.g., 20) to save more
>>>>>>>>> memory and time?
>>>>>>>>>
>>>>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element
>>>>>>>>> masking
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> These values really only affect the final evidence kept in the
>>>>>>>>> GFF3 when you look at it in a browser. They have no effect on the annotation.
>>>>>>>>> This is because internally MAKER already collapses evidence down to the 10
>>>>>>>>> best non-redundant features per evidence set per locus. The rest are put in
>>>>>>>>> the GFF3 just for reference. By setting it lower, you are just letting
>>>>>>>>> MAKER know it can throw things away even sooner since you don't want them
>>>>>>>>> in the GFF3. It provides a minor improvement for memory use, but
>>>>>>>>> max_dna_len is the big one that has the greatest effect.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> (3) I also have some concerns about the speed, especially for the
>>>>>>>>> long scaffolds (around 100Mb). I wonder which part is the most time
>>>>>>>>> consuming for genome annotation (repeat masking, blast, or polishing?).
>>>>>>>>> Particularly, I wonder whether the blastx of protein evidence will take
>>>>>>>>> majority of time. Now, I have prepared 99k mammalian Swiss protein
>>>>>>>>> sequences and 340k rodent TrEMBL protein sequences as protein evidences. I
>>>>>>>>> am considering whether I can save much time if I only use the 99k mammalian
>>>>>>>>> Swiss protein sequences as evidences.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>>>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at
>>>>>>>>> least 6 times slower than BLASTN
>>>>>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at
>>>>>>>>> least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>>>>>>
>>>>>>>>> Also double the dataset size, double the runtime. Larger window
>>>>>>>>> sizes via max_dna_len will also increase runtimes.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> (4) For some reason, I cannot run maker through MPI on our
>>>>>>>>> cluster. So I can only start multiple maker instances. I wonder if it is
>>>>>>>>> possible to let multiple maker instances annotate the same long scaffold
>>>>>>>>> (i.e., for a single sequence I start multiple maker instances, without
>>>>>>>>> splitting the long sequence into shorter ones).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Without MPI you won't be able to split up large contigs. At the
>>>>>>>>> very least you can try and run on a single node and set MPI to use all CPUs
>>>>>>>>> on that node. It's less difficult to set up compared to cross node jobs via
>>>>>>>>> MPI.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> (5) Still about the speed issue. I read some of your comments
>>>>>>>>> about the "cpus" parameter in the maker_opts file (
>>>>>>>>> http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html).
>>>>>>>>> And I know it indicates the number
>>>>>>>>> of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file,
>>>>>>>>> then I can use the following command to submit the job, right?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The cpus parameter only affects how many CPUs are given to the
>>>>>>>>> blast command line. So only the BLAST step will speed up, and I recommend
>>>>>>>>> using MPI to get all steps to speed up. Even if you are only running on a
>>>>>>>>> single node, you can give all CPUs to the mpiexec command.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --Carson
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://box290.bluehost.com/pipermail/maker-devel_yandell-lab.org/attachments/20170913/42eb2d53/attachment.html>
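
A minimal Python sketch of the check suggested in the thread above: is the
directory used for TMP= on a true local disk, on a network mount, or on an
in-memory filesystem? It is not part of MAKER. It assumes a Linux node where
/proc/mounts is readable. The function names and the two lists of filesystem
types are illustrative assumptions and are not exhaustive, so "local" here
only means "not in either list".

```python
import posixpath

# Illustrative, non-exhaustive lists (assumptions, not taken from MAKER).
NETWORK_FS = {"nfs", "nfs4", "cifs", "smb3", "lustre", "gpfs", "beegfs",
              "glusterfs", "ceph"}
MEMORY_FS = {"tmpfs", "ramfs"}


def parse_mounts(text):
    """Return (mount_point, fs_type) pairs from /proc/mounts-style text."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            mounts.append((fields[1], fields[2]))
    return mounts


def classify_path(path, mounts):
    """Classify an absolute path as 'network', 'memory', 'local' or 'unknown'."""
    path = posixpath.normpath(path)
    best_mount, best_type = "", None
    for mount_point, fs_type in mounts:
        prefix = mount_point.rstrip("/") + "/"
        # The longest mount point that contains the path is the one in effect.
        if (path + "/").startswith(prefix) and len(mount_point) > len(best_mount):
            best_mount, best_type = mount_point, fs_type
    if best_type is None:
        return "unknown"
    if best_type in NETWORK_FS:
        return "network"
    if best_type in MEMORY_FS:
        return "memory"
    return "local"


def classify_tmp(path="/tmp", mounts_file="/proc/mounts"):
    """Classify a temporary directory on the current (Linux) node."""
    with open(mounts_file) as handle:
        return classify_path(path, parse_mounts(handle.read()))
```

Calling classify_tmp("/tmp") from an interactive job on the compute node
(rather than on the login node) gives the answer for the node where MAKER
actually runs.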

From xvazquezc at gmail.com  Sun Sep 17 20:12:56 2017
From: xvazquezc at gmail.com (=?UTF-8?Q?Xabier_V=C3=A1zquez=2DCampos?=)
Date: Mon, 18 Sep 2017 11:12:56 +1000
Subject: [maker-devel] augustus underpredicting
In-Reply-To: <E0AFDA01-3142-4A14-A4F0-E621E5BD4C2F@gmail.com>
References: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>
	<E0AFDA01-3142-4A14-A4F0-E621E5BD4C2F@gmail.com>
Message-ID: <CAL0hg4HWHjmEJ680+c4XiajWaE5WKqfbVK4oeNYVDWPuz_Oh+Q@mail.gmail.com>

I did it that way and AUGUSTUS is predicting a more reasonable number of
genes, about 12500 in Maker, but about 19000 in the model assessment step.
In comparison, SNAP gives 16000 and GeneMark 19000.

I haven't found any reference about it, but would it be a good idea to train
Augustus on the masked genome instead?
Thanks,


On 12 September 2017 at 02:50, Carson Holt <carsonhh at gmail.com> wrote:

> BUSCO may be generating too few models. BUSCO also identifies classes of
> conserved short genes that may not represent enough training diversity for
> your organism. Try running MAKER in protein2genome or est2genome mode, and
> then train with those results.
>
> --Carson
>
>
> On Sep 10, 2017, at 7:03 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com>
> wrote:
>
> Hi,
> I have been annotating a fungal genome as usual, using Busco-trained
> Augustus (in addition to GeneMark and SNAP), but for some reason, Augustus
> is predicting a mere 207 genes compared to 15-20k from the other two.
> I've never had this problem. The genome has an unusual repeat content
> close to 50%, not sure if that might pose a problem.
> Has anybody come across a similar issue?
> I also asked the Busco developers if they have any idea
> https://gitlab.com/ezlab/busco/issues/49
> Cheers,
> Xabi
>
> --
> Xabier Vázquez-Campos, *PhD*
> *Research Associate*
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA
>
>
>


-- 
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From qwzhang0601 at gmail.com  Mon Sep 18 22:07:25 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 18 Sep 2017 23:07:25 -0400
Subject: [maker-devel] Question about "maker-", "augustus_masked",
 "snap_masked" gene model
Message-ID: <CAOW6FS+MkBOsHct0Ph+mdHD-VTeyN2RbHS2_1Neye6qZtUDWrA@mail.gmail.com>

Hello:

Would you please explain the difference between
"maker-...-augustus..." and "augustus_masked..." gene models?

I know "augustus_masked..." gene models are raw augustus predictions, while
"maker-...-augustus..." are hit derived gene models. But by default, maker2
reports gene models with evidence support (protein sequences or
transcripts). Then why are some gene models hit derived while other models
(with evidence support) are raw augustus predictions (even though there is
protein or transcript evidence)?

BTW, is it true that generally the "maker-...-augustus..." gene models are
more reliable than the "augustus_masked..." gene models?

Thanks

Best
Quanwei

From qwzhang0601 at gmail.com  Mon Sep 18 23:14:38 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Tue, 19 Sep 2017 00:14:38 -0400
Subject: [maker-devel] about min_protein
Message-ID: <CAOW6FSK5PcBbqnOt2s-ObUoEGFhOpXNYhcj=wxzGLWRqJpjusg@mail.gmail.com>

Hello:

I am working on a rodent species and get 28k annotated genes, I wonder
whether you have any suggestions about the "min_protein" parameter?

I did not change the parameter in my current annotation. I get several very
short predicted proteins (even those with only 1 amino acid).

min_protein=0 #require at least this many amino acids in predicted proteins

Thanks

Best
Quanwei

From qwzhang0601 at gmail.com  Tue Sep 19 07:47:00 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Tue, 19 Sep 2017 08:47:00 -0400
Subject: [maker-devel] about min_protein
In-Reply-To: <CADE62ED-68F2-4E48-9013-C9BF66E56CAD@gmail.com>
References: <CAOW6FSK5PcBbqnOt2s-ObUoEGFhOpXNYhcj=wxzGLWRqJpjusg@mail.gmail.com>
	<CADE62ED-68F2-4E48-9013-C9BF66E56CAD@gmail.com>
Message-ID: <CAOW6FSLxd=oEHcpq73yGjDQjsdTbU=AE89u6HT3m8md=uiceSQ@mail.gmail.com>

Thank you Daniel. I wonder whether there is a suggested value for the
"min_protein" parameter (e.g., 20 amino acids, 50 amino acids?) that people
often use. I am studying a rodent species.

Thank you.

Best
Quanwei

2017-09-19 8:29 GMT-04:00 Daniel Ence <dandence at gmail.com>:

> Hi Quanwei,
>
> Increasing the "min_protein" parameter should get rid of those very short
> predicted proteins.
>
>
>
> > On Sep 19, 2017, at 12:14 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
> wrote:
> >
> > Hello:
> >
> > I am working on a rodent species and get 28k annotated genes, I wonder
> whether you have any suggestions about the "min_protein" parameter?
> >
> > I did not change the parameter in my current annotation. I get several
> very short predicted proteins (even those with only 1 amino acid).
> >
> > min_protein=0 #require at least this many amino acids in predicted
> proteins
> >
> > Thanks
> >
> > Best
> > Quanwei
> > _______________________________________________
> > maker-devel mailing list
> > maker-devel at box290.bluehost.com
> > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
>

From dandence at gmail.com  Tue Sep 19 07:29:35 2017
From: dandence at gmail.com (Daniel Ence)
Date: Tue, 19 Sep 2017 08:29:35 -0400
Subject: [maker-devel] about min_protein
In-Reply-To: <CAOW6FSK5PcBbqnOt2s-ObUoEGFhOpXNYhcj=wxzGLWRqJpjusg@mail.gmail.com>
References: <CAOW6FSK5PcBbqnOt2s-ObUoEGFhOpXNYhcj=wxzGLWRqJpjusg@mail.gmail.com>
Message-ID: <CADE62ED-68F2-4E48-9013-C9BF66E56CAD@gmail.com>

Hi Quanwei, 

Increasing the "min_protein" parameter should get rid of those very short predicted proteins.


> On Sep 19, 2017, at 12:14 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Hello:
> 
> I am working on a rodent species and get 28k annotated genes, I wonder whether you have any suggestions about the "min_protein" parameter? 
> 
> I did not change the parameter in my current annotation. I get several very short predicted proteins (even those with only 1 amino acid). 
>  
> min_protein=0 #require at least this many amino acids in predicted proteins
> 
> Thanks
> 
> Best
> Quanwei


From tuanduonganh at gmail.com  Tue Sep 19 12:23:39 2017
From: tuanduonganh at gmail.com (Tuan Duong Anh)
Date: Tue, 19 Sep 2017 19:23:39 +0200
Subject: [maker-devel] MAKER3 beta - EVM under predicting
Message-ID: <CAPpYYpyY_4kz4L10qMPU9RuRgmW+viFU8wVwOUaYV=WZnV=wHQ@mail.gmail.com>

Dear MAKER-devel group

I have been testing the MAKER3 beta version and found that EVM always
returns far fewer models. Has anyone experienced this before? I
do expect EVM to return fewer models than the other predictors, but not
to this extent (only 20% of the expected gene models). Any suggestions would
be much appreciated.

## Number of models obtained by each gene predictor:
HLIG.all.maker.augustus_masked.proteins.fasta:11224
HLIG.all.maker.evm.proteins.fasta:1974
HLIG.all.maker.genemark.proteins.fasta:11352
HLIG.all.maker.proteins.fasta:13672
HLIG.all.maker.snap_masked.proteins.fasta:13404

## maker_evm.ctl
#-----Transcript weights
evmtrans=10 #default weight for source unspecified est/alt_est alignments
evmtrans:blastn=0 #weight for blastn sourced alignments
evmtrans:est2genome=10 #weight for est2genome sourced alignments
evmtrans:tblastx=0 #weight for tblastx sourced alignments
evmtrans:cdna2genome=7 #weight for cdna2genome sourced alignments

#-----Protein weights
evmprot=10 #default weight for source unspecified protein alignments
evmprot:blastx=2 #weight for blastx sourced alignments
evmprot:protein2genome=10 #weight for protein2genome sourced alignments

#-----Abinitio Prediction weights
evmab=10 #default weight for source unspecified ab initio predictions
evmab:snap=7 #weight for snap sourced predictions
evmab:augustus=10 #weight for augustus sourced predictions
evmab:fgenesh=10 #weight for fgenesh sourced predictions
evmab:genemark=10 #weight for genemark sourced predictions


Regards,

Tuan

From carsonhh at gmail.com  Tue Sep 19 16:34:40 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 Sep 2017 15:34:40 -0600
Subject: [maker-devel] augustus underpredicting
In-Reply-To: <CAL0hg4HWHjmEJ680+c4XiajWaE5WKqfbVK4oeNYVDWPuz_Oh+Q@mail.gmail.com>
References: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>
	<E0AFDA01-3142-4A14-A4F0-E621E5BD4C2F@gmail.com>
	<CAL0hg4HWHjmEJ680+c4XiajWaE5WKqfbVK4oeNYVDWPuz_Oh+Q@mail.gmail.com>
Message-ID: <3B1C5B03-58F9-4427-815A-1D725C740A04@gmail.com>

Gene predictors tend to over predict, so I would not take the high numbers given by SNAP and GeneMark as true counts. You will probably end up with something like 7-10k in the final results. But now that Augustus is giving a higher count, you should be good to start running MAKER.

--Carson


> On Sep 17, 2017, at 7:12 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com> wrote:
> 
> I did it that way and AUGUSTUS is predicting a more reasonable number of genes, about 12500 in Maker, but about 19000 in the model assessment step.
> In comparison, SNAP gives 16000 and GeneMark 19000.
> 
> I haven't found any reference about it, but would it be a good idea to train Augustus on the masked genome instead?
> Thanks,
> 
> 
> 
> On 12 September 2017 at 02:50, Carson Holt <carsonhh at gmail.com> wrote:
> BUSCO may be generating too few models. BUSCO also identifies classes of conserved short genes that may not represent enough training diversity for your organism. Try running MAKER in protein2genome or est2genome mode, and then train with those results.
> 
> --Carson
> 
> 
>> On Sep 10, 2017, at 7:03 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com> wrote:
>> 
>> Hi,
>> I have been annotating a fungal genome as usual, using Busco-trained Augustus (in addition to GeneMark and SNAP), but for some reason, Augustus is predicting a mere 207 genes compared to 15-20k from the other two.
>> I've never had this problem. The genome has an unusual repeat content close to 50%, not sure if that might pose a problem.
>> Has anybody come across a similar issue?
>> I also asked the Busco developers if they have any idea https://gitlab.com/ezlab/busco/issues/49
>> Cheers,
>> Xabi
>> 
>> -- 
>> Xabier Vázquez-Campos, PhD
>> Research Associate
>> NSW Systems Biology Initiative
>> School of Biotechnology and Biomolecular Sciences
>> The University of New South Wales
>> Sydney NSW 2052 AUSTRALIA
> 
> 
> 
> 
> -- 
> Xabier Vázquez-Campos, PhD
> Research Associate
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA


From carsonhh at gmail.com  Tue Sep 19 16:40:27 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 Sep 2017 15:40:27 -0600
Subject: [maker-devel] Question about "maker-", "augustus_masked",
 "snap_masked" gene model
In-Reply-To: <CAOW6FS+MkBOsHct0Ph+mdHD-VTeyN2RbHS2_1Neye6qZtUDWrA@mail.gmail.com>
References: <CAOW6FS+MkBOsHct0Ph+mdHD-VTeyN2RbHS2_1Neye6qZtUDWrA@mail.gmail.com>
Message-ID: <56CC4BEB-083E-4DE6-99F3-CB34A1735AB4@gmail.com>

MAKER uses all derived models as a pool of alternate models for a given locus.  The one that best matches the aligned evidence is then selected using the AED calculation described in the MAKER2 publication. Overall hint based models tend to perform better than the raw models because they get extra info about observed intron/exon structure from alignments. There is also a discussion of this in the MAKER2 paper.

--Carson
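
A simplified Python sketch of the selection step described above, assuming
the nucleotide-level definition AED = 1 - (SN + SP) / 2, where SN is the
fraction of evidence bases covered by the model and SP is the fraction of
model bases covered by evidence. It is only an illustration, not MAKER's
actual implementation. The function names and the inclusive (start, end)
interval representation are assumptions made for this example.

```python
def covered(intervals):
    """Set of positions covered by inclusive (start, end) intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def aed(model_exons, evidence_blocks):
    """Annotation edit distance: 0.0 is perfect agreement, 1.0 is none."""
    model = covered(model_exons)
    evidence = covered(evidence_blocks)
    if not model or not evidence:
        return 1.0
    shared = len(model & evidence)
    sensitivity = shared / len(evidence)  # evidence bases the model recovers
    specificity = shared / len(model)     # model bases backed by evidence
    return 1.0 - (sensitivity + specificity) / 2.0


def best_model(candidates, evidence_blocks):
    """Name of the candidate model with the lowest AED at a locus.

    candidates maps a model name to its list of exon intervals.
    """
    return min(candidates, key=lambda name: aed(candidates[name], evidence_blocks))
```

With this sketch, a hint-based model whose exons match the aligned evidence
more closely than the raw prediction gets the lower AED and is the one
returned by best_model.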


> On Sep 18, 2017, at 9:07 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Hello:
> 
> Would you please explain the difference between "maker-...-augustus..." and "augustus_masked..." gene models?
> 
> I know "augustus_masked..." gene models are raw augustus predictions, while "maker-...-augustus..." are hit derived gene models. But by default, maker2 reports gene models with evidence support (protein sequences or transcripts). Then why are some gene models hit derived while other models (with evidence support) are raw augustus predictions (even though there is protein or transcript evidence)?
> 
> BTW, is it true that generally the "maker-...-augustus..." gene models are more reliable than the "augustus_masked..." gene models?
> 
> Thanks
> 
> Best
> Quanwei


From carsonhh at gmail.com  Tue Sep 19 16:41:40 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 Sep 2017 15:41:40 -0600
Subject: [maker-devel] about min_protein
In-Reply-To: <CAOW6FSLxd=oEHcpq73yGjDQjsdTbU=AE89u6HT3m8md=uiceSQ@mail.gmail.com>
References: <CAOW6FSK5PcBbqnOt2s-ObUoEGFhOpXNYhcj=wxzGLWRqJpjusg@mail.gmail.com>
	<CADE62ED-68F2-4E48-9013-C9BF66E56CAD@gmail.com>
	<CAOW6FSLxd=oEHcpq73yGjDQjsdTbU=AE89u6HT3m8md=uiceSQ@mail.gmail.com>
Message-ID: <FFA05628-32ED-4036-9FDC-E6C7BC4EAE4C@gmail.com>

The value is arbitrary, but some submission databases like NCBI will flag entries under ~20-30 amino acids as errors if you try to submit them (I can't remember the exact number).

--Carson
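
Before settling on a min_protein value, it can help to see how many predicted
proteins a given cutoff would affect. The following is a rough Python sketch,
not a MAKER tool: it lists the proteins in a FASTA file that are shorter than
a chosen threshold. The function names are hypothetical, and the example file
name in the usage note is a placeholder for your own proteins FASTA.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def short_proteins(lines, min_protein):
    """Headers of proteins with fewer than min_protein residues.

    A trailing '*' (stop codon symbol) is not counted as a residue.
    """
    return [header for header, seq in read_fasta(lines)
            if len(seq.rstrip("*")) < min_protein]
```

Usage, assuming a file named my_genome.all.maker.proteins.fasta:
open the file and pass the handle, e.g. short_proteins(handle, 30), then
compare the length of the returned list for a few candidate cutoffs.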


> On Sep 19, 2017, at 6:47 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Thank you Daniel. I wonder whether there is a suggested value for the "min_protein" parameter (e.g., 20 amino acids, 50 amino acids?) that people often use. I am studying a rodent species.
> 
> Thank you.
> 
> Best
> Quanwei
> 
> 2017-09-19 8:29 GMT-04:00 Daniel Ence <dandence at gmail.com>:
> Hi Quanwei,
> 
> Increasing the "min_protein" parameter should get rid of those very short predicted proteins.
> 
> 
> 
> > On Sep 19, 2017, at 12:14 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> >
> > Hello:
> >
> > I am working on a rodent species and get 28k annotated genes, I wonder whether you have any suggestions about the "min_protein" parameter?
> >
> > I did not change the parameter in my current annotation. I get several very short predicted proteins (even those with only 1 amino acid).
> >
> > min_protein=0 #require at least this many amino acids in predicted proteins
> >
> > Thanks
> >
> > Best
> > Quanwei
> > _______________________________________________
> > maker-devel mailing list
> > maker-devel at box290.bluehost.com <mailto:maker-devel at box290.bluehost.com>
> > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org <http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org>
> 
> 


From carsonhh at gmail.com  Tue Sep 19 16:47:42 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 Sep 2017 15:47:42 -0600
Subject: [maker-devel] MAKER3 beta - EVM under predicting
In-Reply-To: <CAPpYYpyY_4kz4L10qMPU9RuRgmW+viFU8wVwOUaYV=WZnV=wHQ@mail.gmail.com>
References: <CAPpYYpyY_4kz4L10qMPU9RuRgmW+viFU8wVwOUaYV=WZnV=wHQ@mail.gmail.com>
Message-ID: <12FE3318-F0DE-485B-B43A-25A4A6EC9390@gmail.com>

If ab initio predictors and evidence alignments aren't in high concordance, then EVM won't produce results. This often indicates minor sequencing errors in the assembly (this is very common in draft assemblies). Ab initio predictors will slightly alter splicing and extend introns/exons to make a model work around these variations, but doing this does not always concord well with the alignment, so EVM produces nothing. In these cases it is often better just to train the predictor as well as you can, and then take the standard MAKER results.

--Carson


> On Sep 19, 2017, at 11:23 AM, Tuan Duong Anh <tuanduonganh at gmail.com> wrote:
> 
> Dear MAKER-devel group
> 
> I have been testing the MAKER3 beta version and found that EVM always returns far fewer models. Has anyone experienced this before? I do expect EVM to return fewer models than the other predictors, but not to this extent (only 20% of the expected gene models). Any suggestions would be much appreciated.
> 
> ## Number of models obtained by each gene predictor:
> HLIG.all.maker.augustus_masked.proteins.fasta:11224
> HLIG.all.maker.evm.proteins.fasta:1974
> HLIG.all.maker.genemark.proteins.fasta:11352
> HLIG.all.maker.proteins.fasta:13672
> HLIG.all.maker.snap_masked.proteins.fasta:13404
> 
> ## maker_evm.ctl
> #-----Transcript weights
> evmtrans=10 #default weight for source unspecified est/alt_est alignments
> evmtrans:blastn=0 #weight for blastn sourced alignments
> evmtrans:est2genome=10 #weight for est2genome sourced alignments
> evmtrans:tblastx=0 #weight for tblastx sourced alignments
> evmtrans:cdna2genome=7 #weight for cdna2genome sourced alignments
> 
> #-----Protein weights
> evmprot=10 #default weight for source unspecified protein alignments
> evmprot:blastx=2 #weight for blastx sourced alignments
> evmprot:protein2genome=10 #weight for protein2genome sourced alignments
> 
> #-----Abinitio Prediction weights
> evmab=10 #default weight for source unspecified ab initio predictions
> evmab:snap=7 #weight for snap sourced predictions
> evmab:augustus=10 #weight for augustus sourced predictions
> evmab:fgenesh=10 #weight for fgenesh sourced predictions
> evmab:genemark=10 #weight for genemark sourced predictions
> 
> 
> Regards,
> Tuan
> 
> 


From xvazquezc at gmail.com  Tue Sep 19 19:02:04 2017
From: xvazquezc at gmail.com (=?UTF-8?Q?Xabier_V=C3=A1zquez=2DCampos?=)
Date: Wed, 20 Sep 2017 10:02:04 +1000
Subject: [maker-devel] augustus underpredicting
In-Reply-To: <3B1C5B03-58F9-4427-815A-1D725C740A04@gmail.com>
References: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>
	<E0AFDA01-3142-4A14-A4F0-E621E5BD4C2F@gmail.com>
	<CAL0hg4HWHjmEJ680+c4XiajWaE5WKqfbVK4oeNYVDWPuz_Oh+Q@mail.gmail.com>
	<3B1C5B03-58F9-4427-815A-1D725C740A04@gmail.com>
Message-ID: <CAL0hg4FDD8JMXCauxuSjqsY0z=H92utgqMfmoH+s_Los6_c39g@mail.gmail.com>

Thanks Carson.

Last quick question. After the first run (before using the gene predictors)
I ran fasta_merge to get an idea of the numbers I should be looking for.
In summary, I got 14000 genes, only using Swissprot and a close highly
curated reference genome to avoid any "fake" protein or partial proteins
from draft annotations, plus assembled RNA-seq from my genome.
How should I consider this as a guide? (if I can do so) ... Is this a
number I should be aiming as a minimum number of genes? maximum? something
around that?

PS my genome (fungus) is 80+ Mbp and just 150 contigs so I expect very few
possible fragments due to assembly (seq errors aside)

On 20 September 2017 at 07:34, Carson Holt <carsonhh at gmail.com> wrote:

> Gene predictors tend to over predict, so I would not take the high numbers
> given by SNAP and GeneMark as true counts. You will probably end up with
> something like 7-10k in the final results. But now Augustus is giving a
> higher count, you should be good to start running MAKER.
>
> --Carson
>
>
>
>
> On Sep 17, 2017, at 7:12 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com>
> wrote:
>
> I did it that way and AUGUSTUS is predicting a more reasonable number of
> genes, about 12500 in Maker, but about 19000 in the model assessment step.
> In comparison, SNAP gives 16000 and GeneMark 19000.
>
> I haven't found any reference about it, but would it be a good idea to train
> Augustus over the masked genome instead?
> Thanks,
>
>
>
> On 12 September 2017 at 02:50, Carson Holt <carsonhh at gmail.com> wrote:
>
>> BUSCO may be generating too few models. BUSCO also identifies classes of
>> conserved short genes that may not represent enough training diversity for
>> your organism. Try running MAKER in protein2genome or est2genome mode, and
>> then train with those results.
>>
>> --Carson
>>
>>
>> On Sep 10, 2017, at 7:03 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com>
>> wrote:
>>
>> Hi,
>> I have been annotating a fungal genome as usual, using Busco-trained
>> Augustus (in addition to GeneMark and SNAP), but for some reason, Augustus
>> is predicting a mere 207 genes compared to 15-20k from the other two.
>> I've never had this problem. The genome has an unusual repeat content
>> close to 50%, not sure if that might pose a problem.
>> Has anybody come across a similar issue?
>> I also asked the BUSCO developers if they have any idea
>> https://gitlab.com/ezlab/busco/issues/49
>> Cheers,
>> Xabi
>>
>> --
>> Xabier Vázquez-Campos, *PhD*
>> *Research Associate*
>> NSW Systems Biology Initiative
>> School of Biotechnology and Biomolecular Sciences
>> The University of New South Wales
>> Sydney NSW 2052 AUSTRALIA
>>
>>
>>
>
>
> --
> Xabier Vázquez-Campos, *PhD*
> *Research Associate*
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA
>
>
>


-- 
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From himanimalhotra89 at gmail.com  Tue Sep 19 23:56:55 2017
From: himanimalhotra89 at gmail.com (himani malhotra)
Date: Wed, 20 Sep 2017 10:26:55 +0530
Subject: [maker-devel] Fwd: maker error
In-Reply-To: <CACOCz95arkZmq+F_cVjdcCH1GZap6UYvSsTY3oLJZY1+gJsY1w@mail.gmail.com>
References: <CACOCz95arkZmq+F_cVjdcCH1GZap6UYvSsTY3oLJZY1+gJsY1w@mail.gmail.com>
Message-ID: <CACOCz94NFJoo2GmohrCnvNGi3YDtK8TV5dX-0En7XMHRfLiSdA@mail.gmail.com>

---------- Forwarded message ----------
From: himani malhotra <himanimalhotra89 at gmail.com>
Date: Wed, Sep 20, 2017 at 10:24 AM
Subject: maker error
To: maker-devel-request at box290.bluehost.com


hello
I am using MAKER for gene prediction. I am getting an error in the Repbase
installation. I am sending you the error as well; please help me. I have
installed Repbase manually and unpacked its libraries in the RepeatMasker
Library, but I am still getting the error.
Please help me.


Thanks

Himani
-------------- next part --------------
A non-text attachment was scrubbed...
Name: makererror.png
Type: image/png
Size: 212522 bytes
Desc: not available
URL: <http://box290.bluehost.com/pipermail/maker-devel_yandell-lab.org/attachments/20170920/b8709d7b/attachment.png>

From munholl at uwindsor.ca  Wed Sep 20 09:53:04 2017
From: munholl at uwindsor.ca (Seth Munholland)
Date: Wed, 20 Sep 2017 10:53:04 -0400
Subject: [maker-devel] Fwd: maker error
In-Reply-To: <CACOCz94NFJoo2GmohrCnvNGi3YDtK8TV5dX-0En7XMHRfLiSdA@mail.gmail.com>
References: <CACOCz95arkZmq+F_cVjdcCH1GZap6UYvSsTY3oLJZY1+gJsY1w@mail.gmail.com>
	<CACOCz94NFJoo2GmohrCnvNGi3YDtK8TV5dX-0En7XMHRfLiSdA@mail.gmail.com>
Message-ID: <CAL=sJwrjccQC0GdDa3Km1TojWMdN1aYoujntVsjdMjJ9ha2YUw@mail.gmail.com>

Hello,

When this happened to me, it was a faulty path on my part when I
configured RepeatMasker (which I also installed manually).

Seth Munholland, B.Sc., Ph.D. Candidate
Department of Biological Sciences
Rm. 304 Biology Building
University of Windsor
401 Sunset Ave. N9B 3P4
T: (519) 253-3000 Ext: 4755

On Wed, Sep 20, 2017 at 12:56 AM, himani malhotra <
himanimalhotra89 at gmail.com> wrote:

>
> ---------- Forwarded message ----------
> From: himani malhotra <himanimalhotra89 at gmail.com>
> Date: Wed, Sep 20, 2017 at 10:24 AM
> Subject: maker error
> To: maker-devel-request at box290.bluehost.com
>
>
> hello
> I am using MAKER for gene prediction. I am getting an error in the Repbase
> installation. I am sending you the error as well; please help me. I have
> installed Repbase manually and unpacked its libraries in the RepeatMasker
> Library, but I am still getting the error.
> Please help me.
>
>
>
> Thanks
>
> Himani
>
>
>
>

From Jimmy.Cross at uea.ac.uk  Wed Sep 20 09:02:53 2017
From: Jimmy.Cross at uea.ac.uk (James Cross (ITCS - Staff))
Date: Wed, 20 Sep 2017 14:02:53 +0000
Subject: [maker-devel] Maker MPI across nodes
Message-ID: <DBXPR04MB413DD222F9446CDC493A376BD610@DBXPR04MB413.eurprd04.prod.outlook.com>

Hi Maker Developers,

We are trying to run Maker with OpenMPI on our HPC cluster across two nodes (each node containing 28 cores, so 56 cores in total). While Maker seems to be running correctly, it's going slower when split across two nodes (56 cores) than when run on a single node (28 cores). We are trying to reduce the time Maker takes to complete its run. Do you know of any reason why Maker might slow down when split across two nodes?

Our cluster OS is CentOS 6.7 and the HPC scheduler used is LSF. We are running Open MPI on a Mellanox InfiniBand network.

The genome data we wish to annotate is comprised of 1948 scaffolds with an average length of 324890bp (longest scaffold 6948830bp).

The command in batch mode we are using is: mpiexec -mca btl ^openib -n 56 maker

Any help or advice you could give would be greatly appreciated.

Best Wishes
Jimmy
----------------------------------------------------------------------
Mr  James Cross
HPC Systems Developer
University of East Anglia
Norwich Research Park
ITCS
Norwich, Norfolk
NR4 7TJ

Information Services


From patrick.tranvan at unil.ch  Thu Sep 21 04:26:52 2017
From: patrick.tranvan at unil.ch (Patrick Tran Van)
Date: Thu, 21 Sep 2017 09:26:52 +0000
Subject: [maker-devel] Advice on my pipeline
In-Reply-To: <58E904BF-9AB8-4AC7-B10B-C902F414E03D@gmail.com>
References: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch>
	<A39F4213-70A8-4E07-AB13-00C427E4F244@gmail.com>
	<1498470630221.84642@unil.ch>
	<696C51C6-5606-4ECB-A8B8-9C077182FFFA@gmail.com>
	<1498908228256.16549@unil.ch>,
	<58E904BF-9AB8-4AC7-B10B-C902F414E03D@gmail.com>
Message-ID: <1505986013492.52354@unil.ch>

Hi Carson,


I have a question about round 2. In a previous reply you said:


" Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can't be run in the same directory). "


Does it mean that I don't need to modify the section:


#-----Re-annotation Using MAKER Derived GFF3


?


If I leave everything at the defaults, such as:


altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no


will it not search again for repeats and protein + transcriptome alignments?

Patrick Tran Van

Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206

________________________________
From: Carson Holt <carsonhh at gmail.com>
Sent: Monday, July 3, 2017 10:50 PM
To: Patrick Tran Van
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Advice on my pipeline

maker2zff is just for SNAP training and not for gene filtering (please do not use it for filtering, it does not do what you think).

So the final annotation set after maker with correct_est_fusion is 16,850. To decide which set is better, look at them in a browser (gene counts are not useful for gauging results). A well annotated genome will have evidence clusters that closely match the final models. A poorly annotated genome will have evidence clusters that are split or merged by the models.

The correct_est_fusion option does two things. It trims long overlapping UTR fragments, and it stops evidence clusters from being merged on BLASTP evidence alone (so gene predictors will get unmerged hint regions if clusters are split).

You may also find that using jaccard_clip with Trinity has reduced sensitivity for the transcript data (you may lose things that were there before, but now have better specificity, i.e. fewer false positives). Make sure you provided protein data from at least two related species to help maintain sensitivity lost from the transcript data. You can also add rejected gene models back in after the fact by using iprscan to identify unsupported models with identifiable protein domains --> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4286374/

Thanks,
Carson


On Jul 1, 2017, at 5:21 AM, Patrick Tran Van <Patrick.TranVan at unil.ch<mailto:Patrick.TranVan at unil.ch>> wrote:

So I have assembled my transcriptome with Trinity using the jaccard clip option and I have run maker with and without correct_est_fusion.

I have then used SNAP to train/filter it with:

maker2zff  specie.all.gff

Here are my results:

Number of genes after maker -> Number of genes after maker2zff

- Without correct_est_fusion: 21621 -> 13875
- With correct_est_fusion: 16850 -> 9098

1) If I understand correctly how correct_est_fusion works, since it prevents gene merging, shouldn't it be the inverse?
Normally I should find more genes with correct_est_fusion, right?

2) I think I should find something like 13000-14000 genes for my species. Should I go with the "Without correct_est_fusion" set for the 2nd iteration of maker?

 Thanks for your help


Patrick Tran Van

Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206

________________________________
From: Carson Holt <carsonhh at gmail.com<mailto:carsonhh at gmail.com>>
Sent: Monday, June 26, 2017 11:38 PM
To: Patrick Tran Van
Cc: maker-devel at yandell-lab.org<mailto:maker-devel at yandell-lab.org>
Subject: Re: [maker-devel] Advice on my pipeline

Sorry, the option is --> correct_est_fusion

It is in the maker_opts.ctl file.

I would use both SNAP and Augustus on a few large contigs then review the results manually. If one of them is not behaving well, then drop it. If both behave well (i.e. correlate well with evidence alignments) then keep them both.

--Carson


On Jun 26, 2017, at 3:48 AM, Patrick Tran Van <Patrick.TranVan at unil.ch<mailto:Patrick.TranVan at unil.ch>> wrote:

Thanks for your answer.

1) Do you think that adding Augustus training in addition to SNAP at steps 3 and 5 will add more confidence (instead of adding Augustus only for the final round)?
I ask because I am using autoAug for this and it takes a while to compute.

2) I don't see this option : 'avoid_est_fusion=1' . I have tried to add it but I got this error:

WARNING: Invalid option 'avoid_est_fusion' in control file maker_opts.ctl

(I am using v 2.31.8 )


Patrick Tran Van

Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206

________________________________
From: Carson Holt <carsonhh at gmail.com<mailto:carsonhh at gmail.com>>
Sent: Monday, June 5, 2017 8:29 PM
To: Patrick Tran Van
Cc: maker-devel at yandell-lab.org<mailto:maker-devel at yandell-lab.org>
Subject: Re: [maker-devel] Advice on my pipeline

Your plan sounds good. A couple of related notes.

Insect genomes tend to have high gene density, so gene merging will be the primary difficulty. You can avoid merging of mRNA-seq evidence by using options like jaccard_clip in Trinity. Then use avoid_est_fusion=1 inside of MAKER.

Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can't be run in the same directory).

--Carson


On Jun 2, 2017, at 3:56 AM, Patrick Tran Van <Patrick.TranVan at unil.ch<mailto:Patrick.TranVan at unil.ch>> wrote:

Hello,

This is my first time running Maker for an insect genome annotation.

I have found various resources and tried to build a consensus from them. I am looking for your thoughts and advice about my pipeline: whether I can improve something, or whether I am doing anything useless:


What I have:
- RNA evidence: transcriptome
- Protein evidence: swissprot/uniprot + busco protein set of insect
- Cegma and busco results of my genome


1) Train SNAP with CEGMA

2) Run (run A) maker with repeat masking with transcript, protein, the new SNAP file (from step 1) and augustus file (from busco).

3) Create SNAP model from run A.

4) Run (run B ) with the new SNAP (done at step 3) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_A.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).

5) Create SNAP model from run B.

6) Run (run C) with the new SNAP (done at step 5) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_B.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).

7)  Create SNAP model from run C AND Create Augustus gene model from run C

8) Run (run D) with the new SNAP (done at step 7) + AUGUSTUS file (step 7) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_C.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1). + Use keep_preds=1


Does it seem coherent?

Cheers,

Patrick Tran Van

Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206




From carsonhh at gmail.com  Fri Sep 22 12:57:56 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 22 Sep 2017 11:57:56 -0600
Subject: [maker-devel] augustus underpredicting
In-Reply-To: <CAL0hg4FDD8JMXCauxuSjqsY0z=H92utgqMfmoH+s_Los6_c39g@mail.gmail.com>
References: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>
	<E0AFDA01-3142-4A14-A4F0-E621E5BD4C2F@gmail.com>
	<CAL0hg4HWHjmEJ680+c4XiajWaE5WKqfbVK4oeNYVDWPuz_Oh+Q@mail.gmail.com>
	<3B1C5B03-58F9-4427-815A-1D725C740A04@gmail.com>
	<CAL0hg4FDD8JMXCauxuSjqsY0z=H92utgqMfmoH+s_Los6_c39g@mail.gmail.com>
Message-ID: <06E8D6C3-B278-4820-B309-5CF61186FDCB@gmail.com>

I don't think you can use the protein2genome option to estimate gene count. It will turn any alignment that matches at least 50% into a gene model. So you can get a lot of partial models, which will inflate the gene count. It's good enough for training but not so much for annotation.

--Carson


> On Sep 19, 2017, at 6:02 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com> wrote:
> 
> Thanks Carson.
> 
> Last quick question. After the first run (before using the gene predictors) I ran fasta_merge to get an idea of the numbers I should be looking for.
> In summary, I got 14000 genes, only using Swissprot and a close highly curated reference genome to avoid any "fake" protein or partial proteins from draft annotations, plus assembled RNA-seq from my genome. 
> How should I consider this as a guide? (if I can do so) ... Is this a number I should be aiming as a minimum number of genes? maximum? something around that?
> 
> PS my genome (fungus) is 80+ Mbp and just 150 contigs so I expect very few possible fragments due to assembly (seq errors aside)
> 
> On 20 September 2017 at 07:34, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
> Gene predictors tend to over predict, so I would not take the high numbers given by SNAP and GeneMark as true counts. You will probably end up with something like 7-10k in the final results. But now Augustus is giving a higher count, you should be good to start running MAKER.
> 
> --Carson
> 
> 
> 
> 
>> On Sep 17, 2017, at 7:12 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com <mailto:xvazquezc at gmail.com>> wrote:
>> 
>> I did it that way and AUGUSTUS is predicting a more reasonable number of genes, about 12500 in Maker, but about 19000 in the model assessment step.
>> In comparison, SNAP gives 16000 and GeneMark 19000.
>> 
>> I haven't found any reference about it, but would it be a good idea to train Augustus over the masked genome instead?
>> Thanks,
>> 
>> 
>> 
>> On 12 September 2017 at 02:50, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>> BUSCO may be generating too few models. BUSCO also identifies classes of conserved short genes that may not represent enough training diversity for your organism. Try running MAKER in protein2genome or est2genome mode, and then train with those results.
>> 
>> --Carson
>> 
>> 
>>> On Sep 10, 2017, at 7:03 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com <mailto:xvazquezc at gmail.com>> wrote:
>>> 
>>> Hi,
>>> I have been annotating a fungal genome as usual, using Busco-trained Augustus (in addition to GeneMark and SNAP), but for some reason, Augustus is predicting a mere 207 genes compared to 15-20k from the other two.
>>> I've never had this problem. The genome has an unusual repeat content close to 50%, not sure if that might pose a problem.
>>> Has anybody come across a similar issue?
>>> I also asked the BUSCO developers if they have any idea https://gitlab.com/ezlab/busco/issues/49 <https://gitlab.com/ezlab/busco/issues/49>
>>> Cheers,
>>> Xabi
>>> 
>>> -- 
>>> Xabier Vázquez-Campos, PhD
>>> Research Associate
>>> NSW Systems Biology Initiative
>>> School of Biotechnology and Biomolecular Sciences
>>> The University of New South Wales
>>> Sydney NSW 2052 AUSTRALIA
>> 
>> 
>> 
>> 
>> -- 
>> Xabier Vázquez-Campos, PhD
>> Research Associate
>> NSW Systems Biology Initiative
>> School of Biotechnology and Biomolecular Sciences
>> The University of New South Wales
>> Sydney NSW 2052 AUSTRALIA
> 
> 
> 
> 
> -- 
> Xabier Vázquez-Campos, PhD
> Research Associate
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA


From carsonhh at gmail.com  Fri Sep 22 14:47:36 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 22 Sep 2017 13:47:36 -0600
Subject: [maker-devel] Fwd: maker error
In-Reply-To: <CACOCz94NFJoo2GmohrCnvNGi3YDtK8TV5dX-0En7XMHRfLiSdA@mail.gmail.com>
References: <CACOCz95arkZmq+F_cVjdcCH1GZap6UYvSsTY3oLJZY1+gJsY1w@mail.gmail.com>
	<CACOCz94NFJoo2GmohrCnvNGi3YDtK8TV5dX-0En7XMHRfLiSdA@mail.gmail.com>
Message-ID: <5196E0C2-9FDC-4B6A-9D14-CA8514E002EF@gmail.com>

You have a couple of errors at the start indicating that you may have an issue with the perl forks module as well as the RepeatMasker installation. I'd recommend redoing both installations. Also, the screenshot you show is not the failure; it is MAKER giving up after failing 2 times. To capture the actual failure, set the try count to 3, then rerun and see what comes up in STDERR. Redirect STDERR to a file using "&>".
Example:
maker &> err.log

Thanks,
Carson


On Sep 19, 2017, at 10:56 PM, himani malhotra <himanimalhotra89 at gmail.com <mailto:himanimalhotra89 at gmail.com>> wrote:

> 
> ---------- Forwarded message ----------
> From: himani malhotra <himanimalhotra89 at gmail.com <mailto:himanimalhotra89 at gmail.com>>
> Date: Wed, Sep 20, 2017 at 10:24 AM
> Subject: maker error
> To: maker-devel-request at box290.bluehost.com <mailto:maker-devel-request at box290.bluehost.com>
> 
> 
> hello
> I am using MAKER for gene prediction. I am getting an error in the Repbase installation. I am sending you the error as well; please help me. I have installed Repbase manually and unpacked its libraries in the RepeatMasker Library, but I am still getting the error.
> Please help me.
> 
> 
> 
> Thanks 
> 
> Himani
> 
> <makererror.png>

From carsonhh at gmail.com  Fri Sep 22 14:59:17 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 22 Sep 2017 13:59:17 -0600
Subject: [maker-devel] Maker MPI across nodes
In-Reply-To: <DBXPR04MB413DD222F9446CDC493A376BD610@DBXPR04MB413.eurprd04.prod.outlook.com>
References: <DBXPR04MB413DD222F9446CDC493A376BD610@DBXPR04MB413.eurprd04.prod.outlook.com>
Message-ID: <BD2A6E4D-280B-4B38-AA1C-05C03503848C@gmail.com>

The "-mca btl ^openib" flag has the side effect of bypassing InfiniBand and using Ethernet. If the alternate communicators are too slow, you can switch back to indirect InfiniBand by using '--mca btl vader,tcp,self --mca btl_tcp_if_include ib0'. That option forces IP over InfiniBand instead of direct InfiniBand. The OpenFabrics libraries used by InfiniBand have a known issue because of how they use registered memory (they generate seg faults whenever a program makes system calls, i.e. MAKER calling BLAST), so you can't use direct InfiniBand with MAKER. Try this instead -->  '--mca btl vader,tcp,self --mca btl_tcp_if_include ib0'

Also, if it stays slow, it likely means you are hitting IO limits. If that is the case, make sure you are not setting TMP= to a network-mounted disk location, and that whatever temp space exists on your cluster is real, per-node, locally mounted disk and not network-mounted disk.

--Carson
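
Put together with the process count from the original question, the suggested flags give a command line like the one printed below. This is a dry-run sketch only: the helper name maker_mpi_cmd is hypothetical, and ib0 is assumed to be the name of the IP-over-InfiniBand interface on the nodes (check with the cluster admins):

```shell
# Print (not run) an mpiexec command line that uses IP over InfiniBand.
# Hypothetical helper; usage: maker_mpi_cmd <nprocs> [ipoib-interface]
maker_mpi_cmd() {
    nprocs="$1"
    iface="${2:-ib0}"   # assumption: the IPoIB interface is called ib0
    echo "mpiexec --mca btl vader,tcp,self --mca btl_tcp_if_include ${iface} -n ${nprocs} maker"
}

# 2 nodes x 28 cores, as in the original question
maker_mpi_cmd 56
```

Printing the command first makes it easy to paste into the LSF batch script once the interface name is confirmed.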


> On Sep 20, 2017, at 8:02 AM, James Cross (ITCS - Staff) <Jimmy.Cross at uea.ac.uk> wrote:
> 
> Hi Maker Developers,
>  
> We are trying to run Maker with OpenMPI on our HPC cluster across two nodes (each node containing 28 cores, so 56 cores in total). While Maker seems to be running correctly, it's going slower when split across two nodes (56 cores) than when run on a single node (28 cores). We are trying to reduce the time Maker takes to complete its run. Do you know of any reason why Maker might slow down when split across two nodes?
>  
> Our cluster OS is CentOS 6.7 and the HPC scheduler used is LSF. We are running Open MPI on a Mellanox InfiniBand network.
>  
> The genome data we wish to annotate is comprised of 1948 scaffolds with an average length of 324890bp (longest scaffold 6948830bp). 
>  
> The command in batch mode we are using is: mpiexec -mca btl ^openib -n 56 maker
>  
> Any help or advice you could give would be greatly appreciated.
>  
> Best Wishes
> Jimmy
> ----------------------------------------------------------------------
> Mr  James Cross
> HPC Systems Developer
> University of East Anglia
> Norwich Research Park
> ITCS
> Norwich, Norfolk
> NR4 7TJ
>  
> Information Services
>  

From carson.holt at genetics.utah.edu  Fri Sep 22 15:04:10 2017
From: carson.holt at genetics.utah.edu (Carson Hinton Holt)
Date: Fri, 22 Sep 2017 20:04:10 +0000
Subject: [maker-devel] MAKER
In-Reply-To: <f9b68e0588dd4ce6a65dcd63cb4f7b4e@JKI-EXPF03.braunschweig.bba.intern>
References: <80e44920c4d64d6e82d18e5535656a32@JKI-EXPF03.braunschweig.bba.intern>
	<D2EB4BE8.3069C%myandell@genetics.utah.edu>
	<3BE2DF2C-9443-4314-9524-F0109AAE42E8@genetics.utah.edu>
	<ddad51a1aab5452b816be741d2969d81@JKI-EXPF03.braunschweig.bba.intern>
	<9867D99C-E4AD-4673-BB0B-804A1551F9F3@genetics.utah.edu>
	<f9b68e0588dd4ce6a65dcd63cb4f7b4e@JKI-EXPF03.braunschweig.bba.intern>
Message-ID: <B11ADE34-41A4-4EA7-A0AC-D4DFD649991D@genetics.utah.edu>

MAKER won't produce est2genome results for est_gff. This is partially because est2genome results are only used for training gene predictors. So you are essentially just getting protein2genome results from your runs. Once you get a gene predictor trained you will see a difference, as it will use the intron/exon structure of alignments as hints to improve gene predictor performance.

--Carson
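
One way to check which evidence tracks actually ended up in a run's output is to tally the source column (column 2) of the merged GFF3. The following is only a minimal sketch: the function name `tally_sources` and the example feature lines are made up for illustration, and the source labels that appear in a real run may differ from the ones shown here.

```python
from collections import Counter


def tally_sources(gff_lines):
    """Count GFF3 feature lines per value of column 2 (the source column)."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no feature lines after this
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9:  # a well-formed GFF3 feature line has 9 columns
            counts[cols[1]] += 1
    return counts


# Hypothetical example lines, not taken from a real run.
example = [
    "##gff-version 3\n",
    "chrI\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1\n",
    "chrI\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=g1-mRNA-1;Parent=g1\n",
    "chrI\tprotein2genome\tprotein_match\t100\t900\t.\t+\t.\tID=p1\n",
]

for source, n in sorted(tally_sources(example).items()):
    print(source, n)
```

For a real file the same function can be fed an open file handle, e.g. `tally_sources(open("merged.gff"))`, where `merged.gff` is a placeholder name. If two runs show the same sources and counts, that would be consistent with the transcript GFF not contributing models of its own.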


> On Sep 21, 2017, at 1:57 AM, Keilwagen, Jens <jens.keilwagen at julius-kuehn.de> wrote:
> 
> Hi Carson,
> 
> I have tried the proposed options for a small example (yeast).
> 
> I had 
> - proteins (fasta) from another yeast and 
> - transcript annotation (gff) from cufflinks and StringTie
> 
> I'd like to compare the maker results for 
> - proteins and StringTie
> Vs.
> - proteins and cufflinks
> 
> I used the default options, except:
> genome=<genome fasta>
> 
> protein=<protein fasta>
> est_gff=<transcript gff>
> 
> est2genome=1
> protein2genome=1
> 
> (An example is attached.)
> 
> Then I ran maker:
> 
> maker -RM_off -c 24
> find . -type f -name *.gff -exec cat {} + | grep maker > filtered-maker-prediction.gff
> 
> (The run seems to be okay. There were no FAILED, ... in the log. Cf. attachment)
> 
> Each maker run was started in a separate subdirectory.
> However, I realized that both maker runs yielded almost the same result (just one minor edit). This made me curious. 
> As far as I understood the files, I received the (filtered?) exonerate predictions for the proteins (from the other yeast). Is this correct? Why did I not receive any predictions (purely) based on the RNA-seq data? Did I do something wrong?
> 
> I'm looking forward to your reply.
> 
> Best regards, Jens
> 
> 
>> -----Original Message-----
>> From: Carson Hinton Holt [mailto:carson.holt at genetics.utah.edu]
>> Sent: Tuesday, 19 September 2017 23:37
>> To: Keilwagen, Jens
>> Subject: Re: MAKER
>> 
>> MAKER cannot use the BAM directly, but you can use something like
>> stringtie or trinity to assemble a transcript fasta that can be given
>> to the est= option.
>> 
>> Ab initio gene prediction is only enabled if you specify an hmm or
>> species file to use.  If all you want is homology based annotation, you
>> can try the est2genome and protein2genome options. Note the final
>> models may be partial if the alignments do not cover the gene end to
>> end.
>> 
>> --Carson
>> 
>> 
>> 
>>> On Sep 18, 2017, at 4:02 AM, Keilwagen, Jens <jens.keilwagen at julius-
>> kuehn.de> wrote:
>>> 
>>> Hi Carson,
>>> 
>>> thanks a lot for your last email.
>>> 
>>> I was asked to do homology-based gene prediction using RNA-seq and
>> Maker was proposed as one option.
>>> Hence I'd like to ask how to do that in the best possible way.
>>> I have mapped RNA-seq data (SAM/BAM) and a fasta of proteins from a
>> related species. How can I integrate the RNA-seq data?
>>> 
>>> Is it possible to deactivate ab-initio gene prediction by Augustus or
>> SNAP?
>>> 
>>> Thanks a lot in advance.
>>> 
>>> Best regards, Jens
>>> 
>>>> -----Original Message-----
>>>> From: Carson Holt [mailto:carson.holt at genetics.utah.edu]
>>>> Sent: Thursday, 18 February 2016 19:03
>>>> To: Keilwagen, Jens
>>>> Cc: Mark Yandell
>>>> Subject: Re: MAKER
>>>> 
>>>> GeMoMa sounds like an interesting tool.  If it produces GFF3, you
>>>> could give the GFF3 results to the pred_gff= option in MAKER (comma
>>>> separated lists accepted). The GFF3 file of predictions must be in
>>>> the same coordinate space as the assembly being annotated (genome=
>> option).
>>>> Whatever you give to pred_gff will be treated as raw predictions
>> by
>>>> MAKER and will only be accepted as a final model if there are
>>>> evidence alignments (protein/EST) that support the model, and if
>>>> there are multiple alternate models at the same locus, only the
>> model
>>>> that is best supported by the protein/transcript evidence is kept.
>>>> 
>>>> You can also set the keep_preds=1 option when using pred_gff. This
>>>> will cause even raw predictions with no evidence support to be
>> maintained.
>>>> In the event of multiple models with no evidence support, the model
>>>> best matching the consensus of alternate models will be maintained.
>>>> 
>>>> Alternatively you can use the model_gff= options (comma separated
>>>> list
>>>> ok) to input the GFF3 file.  model_gff features are given higher
>>>> confidence than pred_gff. At least one model will always be kept
>>>> regardless of evidence support (same rules as pred_gff selection for
>>>> which model to keep when there are multiple). But model_gff will
>> also
>>>> affect how evidence clusters are determined compared to pred_gff
>>>> (model_gff features are allowed to merge bridging evidence
>> clusters).
>>>> MAKER will also go to extra lengths to pull forward existing names
>>>> and other data in the GFF3 for model_gff features.
>>>> 
>>>> If you do not have GFF3 files in the right coordinate space, but do
>>>> have protein fasta or transcript fasta for the GeMoMa predictions,
>>>> you can supply these to the protein= and est= options in
>> MAKER
>>>> together with est2genome=1 or protein2genome=1. This will cause
>> MAKER
>>>> to place the models using exonerate. You would probably also need to
>>>> add est_forward=1 to the control files to have MAKER try and derive
>>>> model names from the name of evidence alignments they were derived
>>>> from if you go this route.
>>>> 
>>>> You can also try treating the GFF3 predictions as hints to
>>>> traditional ab initio gene finders like SNAP or Augustus by giving
>>>> them to the est_gff= or protein_gff= options (i.e. make GeMoMa
>>>> predictions inform the behavior of predictors like SNAP and
>>>> Augustus). Might be interesting. You would have to alter results to
>>>> be match/match_part
>>>> GFF3 features to give them to the est_gff or protein_gff options.
>>>> 
>>>> Let me know if you have any more questions, and I'll do my best to
>>>> help.
>>>> 
>>>> Thanks,
>>>> Carson
>>>> 
>>>> 
>>>> 
>>>>> On Feb 18, 2016, at 10:22 AM, Mark Yandell
>>>> <myandell at genetics.utah.edu> wrote:
>>>>> 
>>>>> 
>>>>> Mark Yandell
>>>>> Professor of Human Genetics
>>>>> H.A. & Edna Benning Presidential Endowed Chair Co-director USTAR
>>>>> Center for Genetic Discovery Eccles Institute of Human Genetics
>>>>> University of Utah
>>>>> 15 North 2030 East, Room 2100
>>>>> Salt Lake City, UT 84112-5330
>>>>> ph:801-587-7707
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On 2/18/16, 8:34 AM, "Keilwagen, Jens" <jens.keilwagen at jki.bund.de>
>>>> wrote:
>>>>> 
>>>>>> Dear Prof. Yandell,
>>>>>> 
>>>>>> we have published a homology-based gene prediction program today:
>>>>>> https://nar.oxfordjournals.org/content/early/2016/02/17/nar.gkw092
>>>>>> and I'd like to ask how we can use MAKER to combine predictions of
>>>>>> GeMoMa using different reference organisms, i.e. we try to predict
>>>>>> the genes of an target organism (e.g. wheat) using the annotated
>>>>>> genes of other reference organisms (e.g. grasses). GeMoMa returns
>>>> for
>>>>>> each reference organism a GFF with the predicted gene models in
>> the
>>>> target organism.
>>>>>> 
>>>>>> It would be great if you or someone from your team could give us
>>>> some
>>>>>> hints or point us to correct paragraph in the documentation.
>>>>>> 
>>>>>> Thanks a lot and best regards, Jens
>>>>>> 
>>>>>> ---
>>>>>> 
>>>>>> Dr. Jens Keilwagen
>>>>>> 
>>>>>> Julius Kühn-Institut (JKI) - Federal Research Centre for
>> Cultivated
>>>>>> Plants
>>>>>> 	Institute for Biosafety in Plant Biotechnology
>>>>>> 
>>>>>> Erwin-Baur-Straße 27
>>>>>> 06484 Quedlinburg
>>>>>> Germany
>>>>>> 
>>>>>> Phone: ++49 (0)3946 47 510
>>>>>> EMail: jens.keilwagen at jki.bund.de
>>>>>> 
>>>>>> 
>>>>> 
>>> 
> 
> <maker_opts.ctl><slurm-278767.out>


From eennadi at gmail.com  Fri Sep 22 14:27:37 2017
From: eennadi at gmail.com (Emmanuel Nnadi)
Date: Fri, 22 Sep 2017 20:27:37 +0100
Subject: [maker-devel] Maker not installing
In-Reply-To: <CAOXM=CUPj9EGj_LNEhBpOw0oCiQKoVw8JuPv8-WASo2paSXkug@mail.gmail.com>
References: <CAOXM=CXp_YU=4kaNCFPVnoWKTzAP5we5XTFX6uFTWJSY_11fng@mail.gmail.com>
	<8576F599-08C0-455A-B0D7-8A1110C6F18F@mail.ufl.edu>
	<CAOXM=CWWENFHJUhYPawM5+F4571HYOyp3oWD1xju9H9LABq9VQ@mail.gmail.com>
	<113031DF-E733-4EEF-BDB7-405C15F0CA24@mail.ufl.edu>
	<CAOXM=CVrYeNV9i6TxYDn1B-dCA-OYh+VqsEmc-PsA2rP14UZ9A@mail.gmail.com>
	<546610AC-0B7B-4D02-BBF3-E847B95D7F0D@mail.ufl.edu>
	<C3F7BAD9-C134-4EC9-9792-F2C27A78FBFD@gmail.com>
	<CAOXM=CXeT-zmFubptPnWsAygnv8o+EQnCWD60iLf5eFFBkb6rg@mail.gmail.com>
	<8EAFE412-9EF7-4DB7-85A3-632BAC3372FD@gmail.com>
	<CAOXM=CVbJonkz=O5=N_NmB1L+524D6Uk9Fjh-ZBoJmu+9AsMFQ@mail.gmail.com>
	<C9D0377D-1C89-4910-A6BE-FA7DD3D8CDCB@gmail.com>
	<CAOXM=CX7yZTTbBALhjBzz0qtMxQGxBNVO0UEnkis3aDvVf=GLA@mail.gmail.com>
	<C06B6277-1CFA-4B8D-8E80-D08910EBD77C@gmail.com>
	<CAOXM=CXSBG+h1DYoOBrFHYr_JXZX6YCFyo415HRGHozMcmHicg@mail.gmail.com>
	<EC4577AD-A3C5-45CD-ADE7-5D19ED089833@gmail.com>
	<CAOXM=CVby_i3vfF_H46RTZWoqBLMCBGNqJoywxG_2KQJPc5Ttw@mail.gmail.com>
	<7440971C-8A18-4A07-91B6-AD16D17F0766@gmail.com>
	<CAOXM=CXUBcvbqZdT2nviYnFyC3X4Xbj3oREVt650n0c01HESyg@mail.gmail.com>
	<426E63C1-8C82-4809-B2A1-EF7A909E6712@gmail.com>
	<CAOXM=CUPj9EGj_LNEhBpOw0oCiQKoVw8JuPv8-WASo2paSXkug@mail.gmail.com>
Message-ID: <CAOXM=CUxE-MSWsjZwMYeZ-aOu9Jc9bP6k1uzDLrEyfyqwR-B_Q@mail.gmail.com>

Hello all,
Please how can I determine the following in maker:
1. The total number of chromosomes
2. The size of my genome


Thanks

Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:
https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Fri, Sep 1, 2017 at 10:52 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:

> Ok, thanks.
> Nnadi Nnaemeka Emmanuel
> Department of Microbiology,
> Faculty of Natural and Applied Science,
> Plateau State University, Bokkos, Plateau State, Nigeria.
> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/
> publications
>
>
>
> On Sep 1, 2017 10:50 PM, "Carson Holt" <carsonhh at gmail.com> wrote:
>
>> It would need to be a new run. You won't be able to use the updated
>> contig names with the old run.
>>
>> --Carson
>>
>> Sent from my iPhone
>>
>> On Sep 1, 2017, at 3:41 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>>
>> Hi carson
>> Thanks for the tip
>> perl -ane 's/1_S7_R1_001_\(paired\)_trimmed_\(paired\)_//g; print'
>> genome.fasta
>>
>> It worked well however, when i ran it, it removed 1_S7_R1_001_\(paired\)_
>> trimmed_\(paired\)_,
>>
>> I have ran maker with 1_S7_R1_001_\(paired\)_trimmed_\(paired\)_,
>>
>> 1. How can I effect the change when maker has produced some files from
>> the the old sequence?
>>
>> I have spent more than 24 hours running maker and it has produced some
>> folders already.
>>
>> How can I make this change?
>>
>> Thanks
>>
>>
>>
>>
>> Nnadi Nnaemeka Emmanuel
>> Department of Microbiology,
>> Faculty of Natural and Applied Science,
>> Plateau State University, Bokkos, Plateau State, Nigeria.
>> Publications:  https://www.researchgate.net/
>> profile/Emmanuel_Nnadi/publications
>>
>> On Fri, Sep 1, 2017 at 4:54 PM, Carson Holt <carsonhh at gmail.com> wrote:
>>
>>> BLAST which is used by MAKER can not handle really long contig names.
>>> MAKER tries to get around this by adding a secondary tag to the fasta
>>> header when long names are detected. Even then it would be better to change
>>> the IDs of your contigs to avoid downstream failures.
>>>
>>> I would recommend removing '1_S7_R1_001_(paired)_trimmed_(paired)_'
>>> from each contig name.
>>>
>>> Example command to do that -->
>>> perl -ane 's/1_S7_R1_001_\(paired\)_trimmed_\(paired\)_//g; print'
>>> genome.fasta
>>>
>>> --Carson
>>>
>>>
>>> On Aug 30, 2017, at 3:54 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>>>
>>> Hi Carson
>>> Thanks for your response, it's been helpful
>>>
>>> Please bear with me as I work through this
>>>
>>> 1. Please how do I generate EST for my novel sequences?
>>> 2. I am currently running maker without EST and protein sequences is it
>>> wrong? Can it predict properly?
>>> 3. One error in the contig just returned this value
>>> FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence
>>> identifier which is too long ( max id length = 50 )
>>>  at /usr/local/bin/RepeatMasker line 1464.
>>> FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence
>>> identifier which is too long ( max id length = 50 )
>>>  at /usr/local/bin/RepeatMasker line 1464.
>>> FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence
>>> identifier which is too long ( max id length = 50 )
>>>  at /usr/local/bin/RepeatMasker line 1464.
>>> ERROR: RepeatMasker failed
>>> --> rank=NA, hostname=emmannaemekas-MacBook-Pro.local
>>> ERROR: Failed while doing repeat masking
>>> ERROR: Chunk failed at level:0, tier_type:1
>>> FAILED CONTIG:1_S7_R1_001_(paired)_trimmed_(paired)_contig_2
>>>
>>> ERROR: Chunk failed at level:2, tier_type:0
>>> FAILED CONTIG:1_S7_R1_001_(paired)_trimmed_(paired)_contig_2
>>>
>>> examining contents of the fasta file and run log
>>>
>>>
>>> Nnadi Nnaemeka Emmanuel
>>> Department of Microbiology,
>>> Faculty of Natural and Applied Science,
>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>> Publications:  https://www.researchgate.net/
>>> profile/Emmanuel_Nnadi/publications
>>>
>>> On Wed, Aug 30, 2017 at 4:12 PM, Carson Holt <carsonhh at gmail.com> wrote:
>>>
>>>> You can query valid species names using the queryTaxonomyDatabase.pl
>>>> script that comes with RepeatMasker. Try not to be too specific. In general
>>>> you should use the genus rather than the species for example (or even use
>>>> all of RepBase).
>>>>
>>>> Example -->
>>>> perl .../RepeatMasker/util/queryTaxonomyDatabase.pl -species "drosophila"
>>>>
>>>> --Carson
>>>>
>>>>
>>>>
>>>> On Aug 30, 2017, at 9:05 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>>>>
>>>> Hi Carson,
>>>>
>>>>  Thanks
>>>> I was able to start using maker.
>>>>
>>>> However, I am working with a novel plant genome. I had set the
>>>> repeat masking to
>>>> 1. Dcotrep, a name from the RepBase release, but MAKER reported it
>>>> as not known to RepeatMasker.
>>>>
>>>> How can I use specific known genomes for repeat masking
>>>> Thanks
>>>>
>>>> Nnadi Nnaemeka Emmanuel
>>>> Department of Microbiology,
>>>> Faculty of Natural and Applied Science,
>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>> Publications:  https://www.researchgate.net/
>>>> profile/Emmanuel_Nnadi/publications
>>>>
>>>>
>>>>
>>>> On Aug 29, 2017 4:26 PM, "Carson Holt" <carsonhh at gmail.com> wrote:
>>>>
>>>>> MAKER will read the genome= options from the maker_opts.ctl file in
>>>>> your current directory or the maker_opts.ctl you specified on the command
>>>>> line. The error means you have left the value empty. Perhaps you did not
>>>>> save the changes you made or you did not specify the location of
>>>>> the maker_opts.ctl file to use.
>>>>>
>>>>> You can check the contents of the file using cat. Example -->
>>>>> cat maker_opts.ctl
>>>>>
>>>>> --Carson
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Aug 29, 2017, at 5:11 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>>>>>
>>>>> Hi Carson,
>>>>> Thanks a lot for yesterday. I was able to resolve the issue of running
>>>>> maker and i followed the commands in the tutorial.
>>>>> I however encountered another problem
>>>>>
>>>>> when I ran the command nano -c maker_opts.ctl
>>>>>
>>>>> It gave the following *1_S7_assembly.fa I specified the name of the
>>>>> genome but when I ran maker in another tab it gave *
>>>>>
>>>>> #-----Genome (these are always required)
>>>>> genome=*1_S7_assembly.fa* #genome sequence (fasta file or fasta
>>>>> embeded in GFF3 file)
>>>>> organism_type=eukaryotic #eukaryotic or prokaryotic. Default is
>>>>> eukaryotic
>>>>>
>>>>> #-----Re-annotation Using MAKER Derived GFF3
>>>>> maker_gff= #MAKER derived GFF3 file
>>>>> est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
>>>>> altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 =
>>>>> no
>>>>> protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
>>>>> rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
>>>>> model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
>>>>> pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
>>>>> other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no
>>>>>
>>>>> #-----EST Evidence (for best results provide a file for at least one)
>>>>> est= #set of ESTs or assembled mRNA-seq in fasta format
>>>>> altest= #EST/cDNA sequence file in fasta format from an alternate
>>>>> organism
>>>>> est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
>>>>> altest_gff= #aligned ESTs from a closly relate species in GFF3 format
>>>>>
>>>>> #-----Protein Homology Evidence (for best results provide a file for
>>>>> at least one)
>>>>> protein=  #protein sequence file in fasta format (i.e. from mutiple
>>>>> oransisms)
>>>>> protein_gff=  #aligned protein homology evidence from an external GFF3
>>>>> file
>>>>>
>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>> model_org=all #select a model organism for RepBase masking in
>>>>> RepeatMasker
>>>>> rmlib= #provide an organism specific repeat library in fasta format
>>>>> for RepeatMasker
>>>>> repeat_protein=/Users/emmannaemeka/Desktop/Gpm/maker/data/te_proteins.fasta
>>>>> #provide a fasta file of transposable element proteins for RepeatRunner
>>>>> rm_gff= #pre-identified repeat elements from an external GFF3 file
>>>>> prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change
>>>>> this), 1 = yes, 0 = no
>>>>> softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e.
>>>>> seg and dust filtering)
>>>>>
>>>>>
>>>>> *I ran maker command on another tab and it returned the following*
>>>>> STATUS: Parsing control files...
>>>>> ERROR: You have failed to provide a value for 'genome' in the control
>>>>> files.
>>>>>
>>>>> --> rank=NA, hostname=emmannamekasMBP
>>>>>
>>>>>
>>>>> Questions
>>>>> 1. Specifying the genome location, do I need to run maker on the same
>>>>> tab or open another bash tab?
>>>>> 2. My genome is novel and do not have proteins, how do I generate
>>>>> protein fasta for the de novo sequence and EST?
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>> Nnadi Nnaemeka Emmanuel
>>>>> Department of Microbiology,
>>>>> Faculty of Natural and Applied Science,
>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>> Publications:  https://www.researchgate.net/
>>>>> profile/Emmanuel_Nnadi/publications
>>>>>
>>>>> On Mon, Aug 28, 2017 at 6:47 PM, Carson Holt <carsonhh at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Here is a class on how to use MAKER taught a couple of years back -->
>>>>>> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014
>>>>>>
>>>>>> There is also a linked video as well as an amazon image of the class
>>>>>> material where you can run the image in the cloud and follow along.
>>>>>>
>>>>>> Thanks,
>>>>>> Carson
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Aug 28, 2017, at 11:43 AM, Emmanuel Nnadi <eennadi at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hi Carson,
>>>>>> Thanks a lot
>>>>>>
>>>>>> I ran this command maker -h it returned the following
>>>>>>
>>>>>> The last thing I wish to ask you: how can I load my genome file and
>>>>>> begin annotation?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> emmannamekasMBP:maker emmannaemeka$ maker -h
>>>>>>
>>>>>> MAKER version 2.31.9
>>>>>>
>>>>>> Usage:
>>>>>>
>>>>>>      maker [options] <maker_opts> <maker_bopts> <maker_exe>
>>>>>>
>>>>>>
>>>>>> Description:
>>>>>>
>>>>>>      MAKER is a program that produces gene annotations in GFF3 format
>>>>>> using
>>>>>>      evidence such as EST alignments and protein homology. MAKER can
>>>>>> be used to
>>>>>>      produce gene annotations for new genomes as well as update
>>>>>> annotations
>>>>>>      from existing genome databases.
>>>>>>
>>>>>>      The three input arguments are control files that specify how
>>>>>> MAKER should
>>>>>>      behave. All options for MAKER should be set in the control
>>>>>> files, but a
>>>>>>      few can also be set on the command line. Command line options
>>>>>> provide a
>>>>>>      convenient machanism to override commonly altered control file
>>>>>> values.
>>>>>>      MAKER will automatically search for the control files in the
>>>>>> current
>>>>>>      working directory if they are not specified on the command line.
>>>>>>
>>>>>>      Input files listed in the control options files must be in fasta
>>>>>> format
>>>>>>      unless otherwise specified. Please see MAKER documentation to
>>>>>> learn more
>>>>>>      about control file  configuration.  MAKER will automatically try
>>>>>> and
>>>>>>      locate the user control files in the current working directory
>>>>>> if these
>>>>>>      arguments are not supplied when initializing MAKER.
>>>>>>
>>>>>>      It is important to note that MAKER does not try and recalculated
>>>>>> data that
>>>>>>      it has already calculated.  For example, if you run an analysis
>>>>>> twice on
>>>>>>      the same dataset you will notice that MAKER does not rerun any
>>>>>> of the
>>>>>>      BLAST analyses, but instead uses the blast analyses stored from
>>>>>> the
>>>>>>      previous run. To force MAKER to rerun all analyses, use the -f
>>>>>> flag.
>>>>>>
>>>>>>      MAKER also supports parallelization via MPI on computer
>>>>>> clusters. Just
>>>>>>      launch MAKER via mpiexec (i.e. mpiexec -n 40 maker). MPI support
>>>>>> must be
>>>>>>      configured during the MAKER installation process for this to
>>>>>> work though
>>>>>>
>>>>>>
>>>>>> Options:
>>>>>>
>>>>>>      -genome|g <file>    Overrides the genome file path in the
>>>>>> control files
>>>>>>
>>>>>>      -RM_off|R           Turns all repeat masking options off.
>>>>>>
>>>>>>      -datastore/         Forcably turn on/off MAKER's two deep
>>>>>> directory
>>>>>>       nodatastore        structure for output.  Always on by default.
>>>>>>
>>>>>>      -old_struct         Use the old directory styles (MAKER 2.26 and
>>>>>> lower)
>>>>>>
>>>>>>      -base    <string>   Set the base name MAKER uses to save output
>>>>>> files.
>>>>>>                          MAKER uses the input genome file name by
>>>>>> default.
>>>>>>
>>>>>>      -tries|t <integer>  Run contigs up to the specified number of
>>>>>> tries.
>>>>>>
>>>>>>      -cpus|c  <integer>  Tells how many cpus to use for BLAST
>>>>>> analysis.
>>>>>>                          Note: this is for BLAST and not for MPI!
>>>>>>
>>>>>>      -force|f            Forces MAKER to delete old files before
>>>>>> running again.
>>>>>> This will require all blast analyses to be rerun.
>>>>>>
>>>>>>      -again|a            recaculate all annotations and output files
>>>>>> even if no
>>>>>> settings have changed. Does not delete old analyses.
>>>>>>
>>>>>>      -quiet|q            Regular quiet. Only a handlful of status
>>>>>> messages.
>>>>>>
>>>>>>      -qq                 Even more quiet. There are no status
>>>>>> messages.
>>>>>>
>>>>>>      -dsindex            Quickly generate datastore index file. Note
>>>>>> that this
>>>>>>                          will not check if run settings have changed
>>>>>> on contigs
>>>>>>
>>>>>>      -nolock             Turn off file locks. May be usful on some
>>>>>> file systems,
>>>>>>                          but can cause race conditions if running in
>>>>>> parallel.
>>>>>>
>>>>>>      -TMP                Specify temporary directory to use.
>>>>>>
>>>>>>      -CTL                Generate empty control files in the current
>>>>>> directory.
>>>>>>
>>>>>>      -OPTS               Generates just the maker_opts.ctl file.
>>>>>>
>>>>>>      -BOPTS              Generates just the maker_bopts.ctl file.
>>>>>>
>>>>>>      -EXE                Generates just the maker_exe.ctl file.
>>>>>>
>>>>>>      -MWAS    <option>   Easy way to control mwas_server for
>>>>>> web-based GUI
>>>>>>
>>>>>>                               options:  STOP
>>>>>>                                         START
>>>>>>                                         RESTART
>>>>>>
>>>>>>      -version            Prints the MAKER version.
>>>>>>
>>>>>>      -help|?             Prints this usage statement.
>>>>>>
>>>>>>
>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>> Department of Microbiology,
>>>>>> Faculty of Natural and Applied Science,
>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>> Publications:  https://www.researchgate.net/
>>>>>> profile/Emmanuel_Nnadi/publications
>>>>>>
>>>>>> On Mon, Aug 28, 2017 at 6:36 PM, Carson Holt <carsonhh at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Path needs to be a list of directories to search (you specified an
>>>>>>> executable location).
>>>>>>>
>>>>>>> So not this --> /Users/emmannaemeka/Desktop/Gpm/maker/bin/maker
>>>>>>>
>>>>>>> Instead it needs to be this --> /Users/emmannaemeka/Desktop/Gpm/maker/bin
>>>>>>>
>>>>>>> --Carson
>>>>>>>
>>>>>>>
>>>>>>> On Aug 28, 2017, at 11:32 AM, Emmanuel Nnadi <eennadi at gmail.com>
>>>>>>>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> I tried to export PATH
>>>>>>>
>>>>>>> running
>>>>>>> echo $PATH in the maker directory this returned
>>>>>>>
>>>>>>> /usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:/Users/emmannaemeka/Desktop/Gpm/maker/bin/maker
>>>>>>>
>>>>>>> 1. Does it mean that PATH has been exported?
>>>>>>>
>>>>>>>
>>>>>>> secondly,
>>>>>>>
>>>>>>> I tried to run
>>>>>>> the command maker -h, which maker, maker -CTL
>>>>>>>
>>>>>>> nothing returned.
>>>>>>>
>>>>>>> 2. how do i start up maker?
>>>>>>> 3. Do I need to be in maker directory to start maker?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>>> Department of Microbiology,
>>>>>>> Faculty of Natural and Applied Science,
>>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>>> Publications:  https://www.researchgate.net/
>>>>>>> profile/Emmanuel_Nnadi/publications
>>>>>>>
>>>>>>> On Mon, Aug 28, 2017 at 4:49 PM, Carson Holt <carsonhh at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> After install the executables will be in the .../maker/bin directory.
>>>>>>>> Example (if you did the install in your home directory) --> ~/maker/bin/maker
>>>>>>>>
>>>>>>>> You need to add the .../maker/bin directory to your PATH for it to be
>>>>>>>> found just by typing 'maker'
>>>>>>>>
>>>>>>>> Explanation of the Linux PATH --> http://www.linfo.org/path_env_var.html
>>>>>>>>
>>>>>>>> --Carson
>>>>>>>>
>>>>>>>>
>>>>>>>> On Aug 28, 2017, at 8:07 AM, Ence,daniel <d.ence at ufl.edu> wrote:
>>>>>>>>
>>>>>>>> Sorry I should have typed 'maker -CTL'. If that doesn't work, what
>>>>>>>> is the result of 'which maker'?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Aug 28, 2017, at 10:00 AM, Emmanuel Nnadi <eennadi at gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Daniel
>>>>>>>> The reply is
>>>>>>>> emmannamekasMBP:maker emmannaemeka$ MAKER -ctl
>>>>>>>> -bash: MAKER: command not found
>>>>>>>>
>>>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>>>> Department of Microbiology,
>>>>>>>> Faculty of Natural and Applied Science,
>>>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>>>> Publications:  https://www.researchgate.net/
>>>>>>>> profile/Emmanuel_Nnadi/publications
>>>>>>>>
>>>>>>>> On Mon, Aug 28, 2017 at 2:57 PM, Ence,daniel <d.ence at ufl.edu>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi, It looks like MAKER installed ok. What is the command that you
>>>>>>>>> used to try to run MAKER? Can you show the result of running 'MAKER -ctl'?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Daniel Ence
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Aug 28, 2017, at 9:24 AM, Emmanuel Nnadi <eennadi at gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Ence,
>>>>>>>>> Thanks for your reply,
>>>>>>>>>
>>>>>>>>> This is the step and error received
>>>>>>>>>
>>>>>>>>> emmannamekasMBP:src emmannaemeka$ ./build install
>>>>>>>>> Installing MAKER...
>>>>>>>>> Building MAKER
>>>>>>>>> Skip /Users/emmannaemeka/desktop/Gpm/maker/src/../perl/config-darwin-thread-multi-2level-5.018002 (unchanged)
>>>>>>>>>
>>>>>>>>> The build status is
>>>>>>>>> ==============================================================================
>>>>>>>>> STATUS MAKER v2.31.9
>>>>>>>>> ==============================================================================
>>>>>>>>> PERL Dependencies:  VERIFIED
>>>>>>>>> External Programs:  VERIFIED
>>>>>>>>> External C Libraries:   VERIFIED
>>>>>>>>> MPI SUPPORT:        DISABLED
>>>>>>>>> MWAS Web Interface: DISABLED
>>>>>>>>> MAKER PACKAGE:      CONFIGURATION OK
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>>>>> Department of Microbiology,
>>>>>>>>> Faculty of Natural and Applied Science,
>>>>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>>>>> Publications:  https://www.researchgate.net/
>>>>>>>>> profile/Emmanuel_Nnadi/publications
>>>>>>>>>
>>>>>>>>> On Mon, Aug 28, 2017 at 2:00 PM, Ence,daniel <d.ence at ufl.edu>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Emmanuel, In order for anyone to help you, you need to post to
>>>>>>>>>> the mailing list the command and output (including errors) of the step that
>>>>>>>>>> didn't work.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Daniel Ence
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Aug 27, 2017, at 10:16 AM, Emmanuel Nnadi <eennadi at gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hello all,
>>>>>>>>>>
>>>>>>>>>> I downloaded Maker and tried to install it. I succeeded in
>>>>>>>>>> installing all prerequisites however running maker ./build install, it
>>>>>>>>>> showed that maker installed.
>>>>>>>>>>
>>>>>>>>>> However trying to run maker it wouldn't run.
>>>>>>>>>>
>>>>>>>>>> Please how do I install maker to run on local computer?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>>>>>> Department of Microbiology,
>>>>>>>>>> Faculty of Natural and Applied Science,
>>>>>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>>>>>> Publications:  https://www.researchgate.net/
>>>>>>>>>> profile/Emmanuel_Nnadi/publications
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
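
The contig renaming discussed in the quoted thread above (removing the long read-set prefix from each FASTA header) can also be sketched in Python. This is an illustrative alternative to the quoted perl one-liner, not a replacement for it: the function name `strip_prefix` is hypothetical, and unlike the one-liner it only touches header lines (those starting with `>`).

```python
PREFIX = "1_S7_R1_001_(paired)_trimmed_(paired)_"


def strip_prefix(lines, prefix=PREFIX):
    """Yield FASTA lines with `prefix` removed from header lines only."""
    for line in lines:
        if line.startswith(">"):
            # Remove the first occurrence of the prefix in the header.
            yield line.replace(prefix, "", 1)
        else:
            yield line  # sequence lines are passed through unchanged


# Small made-up record using the contig name from the error message.
example = [
    ">1_S7_R1_001_(paired)_trimmed_(paired)_contig_2\n",
    "ACGTACGT\n",
]
renamed = list(strip_prefix(example))
print("".join(renamed), end="")
```

To apply it to a file, the output has to be written to a new FASTA (for example by iterating over `open("genome.fasta")` and writing each yielded line to a second file whose name you choose), and, as noted in the thread, the renamed assembly then needs a fresh MAKER run.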

From carsonhh at gmail.com  Fri Sep 22 15:06:06 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 22 Sep 2017 14:06:06 -0600
Subject: [maker-devel] Maker not installing
In-Reply-To: <CAOXM=CUxE-MSWsjZwMYeZ-aOu9Jc9bP6k1uzDLrEyfyqwR-B_Q@mail.gmail.com>
References: <CAOXM=CXp_YU=4kaNCFPVnoWKTzAP5we5XTFX6uFTWJSY_11fng@mail.gmail.com>
	<8576F599-08C0-455A-B0D7-8A1110C6F18F@mail.ufl.edu>
	<CAOXM=CWWENFHJUhYPawM5+F4571HYOyp3oWD1xju9H9LABq9VQ@mail.gmail.com>
	<113031DF-E733-4EEF-BDB7-405C15F0CA24@mail.ufl.edu>
	<CAOXM=CVrYeNV9i6TxYDn1B-dCA-OYh+VqsEmc-PsA2rP14UZ9A@mail.gmail.com>
	<546610AC-0B7B-4D02-BBF3-E847B95D7F0D@mail.ufl.edu>
	<C3F7BAD9-C134-4EC9-9792-F2C27A78FBFD@gmail.com>
	<CAOXM=CXeT-zmFubptPnWsAygnv8o+EQnCWD60iLf5eFFBkb6rg@mail.gmail.com>
	<8EAFE412-9EF7-4DB7-85A3-632BAC3372FD@gmail.com>
	<CAOXM=CVbJonkz=O5=N_NmB1L+524D6Uk9Fjh-ZBoJmu+9AsMFQ@mail.gmail.com>
	<C9D0377D-1C89-4910-A6BE-FA7DD3D8CDCB@gmail.com>
	<CAOXM=CX7yZTTbBALhjBzz0qtMxQGxBNVO0UEnkis3aDvVf=GLA@mail.gmail.com>
	<C06B6277-1CFA-4B8D-8E80-D08910EBD77C@gmail.com>
	<CAOXM=CXSBG+h1DYoOBrFHYr_JXZX6YCFyo415HRGHozMcmHicg@mail.gmail.com>
	<EC4577AD-A3C5-45CD-ADE7-5D19ED089833@gmail.com>
	<CAOXM=CVby_i3vfF_H46RTZWoqBLMCBGNqJoywxG_2KQJPc5Ttw@mail.gmail.com>
	<7440971C-8A18-4A07-91B6-AD16D17F0766@gmail.com>
	<CAOXM=CXUBcvbqZdT2nviYnFyC3X4Xbj3oREVt650n0c01HESyg@mail.gmail.com>
	<426E63C1-8C82-4809-B2A1-EF7A909E6712@gmail.com>
	<CAOXM=CUPj9EGj_LNEhBpOw0oCiQKoVw8JuPv8-WASo2paSXkug@mail.gmail.com>
	<CAOXM=CUxE-MSWsjZwMYeZ-aOu9Jc9bP6k1uzDLrEyfyqwR-B_Q@mail.gmail.com>
Message-ID: <78A8137E-42B4-4766-9E4A-1B3C2F4FC578@gmail.com>

MAKER can't give you those details. All MAKER does is try and identify gene models against the assembly you provide.

--Carson
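
For readers with the same question: both numbers can be read from the assembly FASTA itself rather than from MAKER. Below is a minimal Python sketch, assuming a plain FASTA file. The function name `fasta_summary` and the file name `genome.fasta` are illustrative and not part of MAKER. It counts sequences (contigs or scaffolds), which matches the chromosome count only for a chromosome-level assembly.

```python
def fasta_summary(lines):
    """Return (number_of_sequences, total_bases) for FASTA-formatted lines."""
    n_seqs = 0
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue                 # skip blank lines
        if line.startswith(">"):
            n_seqs += 1              # each header starts a new sequence
        else:
            total += len(line)       # sequence lines add to the assembly size
    return n_seqs, total


# Small in-memory demonstration (made-up sequences).
demo = [">contig_1", "ACGTAC", "GT", ">contig_2", "GGCC"]
print(fasta_summary(demo))  # (2, 12)

# For a real file (hypothetical name), pass the open handle instead:
#   with open("genome.fasta") as handle:
#       print(fasta_summary(handle))
```

The total length is the assembly size, which may differ from the true genome size if the assembly is incomplete or contains redundant contigs.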

> On Sep 22, 2017, at 1:27 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
> 
> Hello all,
> Please how can I determine the following in maker:
> 1. The total number of chromosomes
> 2. The size of my genome
> 
> 
> Thanks
> 
> Nnadi Nnaemeka Emmanuel
> Department of Microbiology,
> Faculty of Natural and Applied Science,
> Plateau State University, Bokkos, Plateau State, Nigeria.
> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications <https://www.researchgate.net/profile/Emmanuel_Nnadi/publications>
> On Fri, Sep 1, 2017 at 10:52 PM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
> Ok, thanks. 
> 
>    
> 
> On Sep 1, 2017 10:50 PM, "Carson Holt" <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
> It would need to be a new run. You won't be able to use the updated contig names with the old run. 
> 
> --Carson
> 
> Sent from my iPhone
> 
> On Sep 1, 2017, at 3:41 PM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
> 
>> Hi carson
>> Thanks for the tip
>> perl -ane 's/1_S7_R1_001_\(paired\)_trimmed_\(paired\)_//g; print' genome.fasta
>> 
>> It worked well: when I ran it, it removed 1_S7_R1_001_(paired)_trimmed_(paired)_ from the contig names.
>> 
>> However, I have already run maker with 1_S7_R1_001_(paired)_trimmed_(paired)_ in the contig names.
>> 
>> 1. How can I apply the change when maker has already produced some files from the old sequence names?
>> 
>> I have spent more than 24 hours running maker and it has produced some folders already.
>> 
>> How can I make this change?
>> 
>> Thanks
>> 
>> 
>> 
>> 
>> On Fri, Sep 1, 2017 at 4:54 PM, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>> BLAST, which is used by MAKER, cannot handle really long contig names. MAKER tries to get around this by adding a secondary tag to the fasta header when long names are detected. Even then it would be better to change the IDs of your contigs to avoid downstream failures.
>> 
>> I would recommend removing '1_S7_R1_001_(paired)_trimmed_(paired)_' from each contig name.
>> 
>> Example command to do that -->
>> perl -ane 's/1_S7_R1_001_\(paired\)_trimmed_\(paired\)_//g; print' genome.fasta
>> 
>> --Carson
>> 
>> 
>>> On Aug 30, 2017, at 3:54 PM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
>>> 
>>> Hi Carson
>>> Thanks for your response; it has been helpful.
>>> 
>>> Please bear with me as I work through this
>>> 
>>> 1. Please how do I generate EST for my novel sequences?
>>> 2. I am currently running maker without EST and protein sequences is it wrong? Can it predict properly?
>>> 3. One error in the contig just returned this value
>>> FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 )
>>>  at /usr/local/bin/RepeatMasker line 1464.
>>> FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 )
>>>  at /usr/local/bin/RepeatMasker line 1464.
>>> FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 )
>>>  at /usr/local/bin/RepeatMasker line 1464.
>>> ERROR: RepeatMasker failed
>>> --> rank=NA, hostname=emmannaemekas-MacBook-Pro.local
>>> ERROR: Failed while doing repeat masking
>>> ERROR: Chunk failed at level:0, tier_type:1
>>> FAILED CONTIG:1_S7_R1_001_(paired)_trimmed_(paired)_contig_2
>>> 
>>> ERROR: Chunk failed at level:2, tier_type:0
>>> FAILED CONTIG:1_S7_R1_001_(paired)_trimmed_(paired)_contig_2
>>> 
>>> examining contents of the fasta file and run log
>>> 
>>> 
>>> On Wed, Aug 30, 2017 at 4:12 PM, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>>> You can query valid species names using the queryTaxonomyDatabase.pl script that comes with RepeatMasker. Try not to be too specific. In general you should use the genus rather than the species for example (or even use all of RepBase).
>>> 
>>> Example -->
>>> perl .../RepeatMasker/util/queryTaxonomyDatabase.pl -species "drosophila"
>>> 
>>> --Carson
>>> 
>>> 
>>> 
>>>> On Aug 30, 2017, at 9:05 AM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
>>>> 
>>>> Hi Carson,
>>>> 
>>>>  Thanks
>>>> I was able to start using maker.
>>>> 
>>>> However, I am working with a novel plant genome. I set the repeat masking to
>>>> Dcotrep, a name from the RepBase release, but maker reported that it is not known to RepeatMasker.
>>>> 
>>>> How can I use specific known genomes for repeat masking?
>>>> Thanks
>>>> 
>>>> 
>>>>    
>>>> 
>>>> On Aug 29, 2017 4:26 PM, "Carson Holt" <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>>>> MAKER will read the genome= options from the maker_opts.ctl file in your current directory or the maker_opts.ctl you specified on the command line. The error means you have left the value empty. Perhaps you did not save the changes you made or you did not specify the location of the maker_opts.ctl file to use.
>>>> 
>>>> You can check the contents of the file using cat. Example --> cat maker_opts.ctl
>>>> 
>>>> --Carson
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On Aug 29, 2017, at 5:11 AM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
>>>>> 
>>>>> Hi Carson,
>>>>> Thanks a lot for yesterday. I was able to resolve the issue of running maker and i followed the commands in the tutorial.
>>>>> I however encountered another problem
>>>>> 
>>>>> I edited the control file with the command nano -c maker_opts.ctl
>>>>> 
>>>>> and specified the name of the genome, 1_S7_assembly.fa. The file now contains the following:
>>>>> 
>>>>> #-----Genome (these are always required)
>>>>> genome=1_S7_assembly.fa #genome sequence (fasta file or fasta embeded in GFF3 file)
>>>>> organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic
>>>>> 
>>>>> #-----Re-annotation Using MAKER Derived GFF3
>>>>> maker_gff= #MAKER derived GFF3 file
>>>>> est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
>>>>> altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
>>>>> protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
>>>>> rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
>>>>> model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
>>>>> pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
>>>>> other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no
>>>>> 
>>>>> #-----EST Evidence (for best results provide a file for at least one)
>>>>> est= #set of ESTs or assembled mRNA-seq in fasta format
>>>>> altest= #EST/cDNA sequence file in fasta format from an alternate organism
>>>>> est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
>>>>> altest_gff= #aligned ESTs from a closly relate species in GFF3 format
>>>>> 
>>>>> #-----Protein Homology Evidence (for best results provide a file for at least one)
>>>>> protein=  #protein sequence file in fasta format (i.e. from mutiple oransisms)
>>>>> protein_gff=  #aligned protein homology evidence from an external GFF3 file
>>>>> 
>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>> model_org=all #select a model organism for RepBase masking in RepeatMasker
>>>>> rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
>>>>> repeat_protein=/Users/emmannaemeka/Desktop/Gpm/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner
>>>>> rm_gff= #pre-identified repeat elements from an external GFF3 file
>>>>> prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
>>>>> softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
>>>>> 
>>>>> 
>>>>> I ran maker command on another tab and it returned the following
>>>>> STATUS: Parsing control files...
>>>>> ERROR: You have failed to provide a value for 'genome' in the control files.
>>>>> 
>>>>> --> rank=NA, hostname=emmannamekasMBP
>>>>> 
>>>>> 
>>>>> Questions
>>>>> 1. Specifying the genome location, do I need to run maker on the same tab or open another bash tab?
>>>>> 2. My genome is novel and I do not have proteins. How do I generate protein FASTA and ESTs for the de novo sequence?
>>>>> 
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> On Mon, Aug 28, 2017 at 6:47 PM, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>>>>> Here is a class on how to use MAKER taught a couple of years back --> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014
>>>>> 
>>>>> There is also a linked video as well as an amazon image of the class material where you can run the image in the cloud and follow along.
>>>>> 
>>>>> Thanks,
>>>>> Carson
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Aug 28, 2017, at 11:43 AM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
>>>>>> 
>>>>>> Hi Carson,
>>>>>> Thanks a lot 
>>>>>> 
>>>>>> I ran this command maker -h it returned the following
>>>>>> 
>>>>>> The last thing I wish to ask you: how can I load my genome file and begin annotation?
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>> emmannamekasMBP:maker emmannaemeka$ maker -h
>>>>>> 
>>>>>> MAKER version 2.31.9
>>>>>> 
>>>>>> Usage:
>>>>>> 
>>>>>>      maker [options] <maker_opts> <maker_bopts> <maker_exe>
>>>>>> 
>>>>>> 
>>>>>> Description:
>>>>>> 
>>>>>>      MAKER is a program that produces gene annotations in GFF3 format using
>>>>>>      evidence such as EST alignments and protein homology. MAKER can be used to
>>>>>>      produce gene annotations for new genomes as well as update annotations
>>>>>>      from existing genome databases.
>>>>>> 
>>>>>>      The three input arguments are control files that specify how MAKER should
>>>>>>      behave. All options for MAKER should be set in the control files, but a
>>>>>>      few can also be set on the command line. Command line options provide a
>>>>>>      convenient machanism to override commonly altered control file values.
>>>>>>      MAKER will automatically search for the control files in the current
>>>>>>      working directory if they are not specified on the command line.
>>>>>> 
>>>>>>      Input files listed in the control options files must be in fasta format
>>>>>>      unless otherwise specified. Please see MAKER documentation to learn more
>>>>>>      about control file  configuration.  MAKER will automatically try and
>>>>>>      locate the user control files in the current working directory if these
>>>>>>      arguments are not supplied when initializing MAKER.
>>>>>> 
>>>>>>      It is important to note that MAKER does not try and recalculated data that
>>>>>>      it has already calculated.  For example, if you run an analysis twice on
>>>>>>      the same dataset you will notice that MAKER does not rerun any of the
>>>>>>      BLAST analyses, but instead uses the blast analyses stored from the
>>>>>>      previous run. To force MAKER to rerun all analyses, use the -f flag.
>>>>>> 
>>>>>>      MAKER also supports parallelization via MPI on computer clusters. Just
>>>>>>      launch MAKER via mpiexec (i.e. mpiexec -n 40 maker). MPI support must be
>>>>>>      configured during the MAKER installation process for this to work though
>>>>>>      
>>>>>> 
>>>>>> Options:
>>>>>> 
>>>>>>      -genome|g <file>    Overrides the genome file path in the control files
>>>>>> 
>>>>>>      -RM_off|R           Turns all repeat masking options off.
>>>>>> 
>>>>>>      -datastore/         Forcably turn on/off MAKER's two deep directory
>>>>>>       nodatastore        structure for output.  Always on by default.
>>>>>> 
>>>>>>      -old_struct         Use the old directory styles (MAKER 2.26 and lower)
>>>>>> 
>>>>>>      -base    <string>   Set the base name MAKER uses to save output files.
>>>>>>                          MAKER uses the input genome file name by default.
>>>>>> 
>>>>>>      -tries|t <integer>  Run contigs up to the specified number of tries.
>>>>>> 
>>>>>>      -cpus|c  <integer>  Tells how many cpus to use for BLAST analysis.
>>>>>>                          Note: this is for BLAST and not for MPI!
>>>>>> 
>>>>>>      -force|f            Forces MAKER to delete old files before running again.
>>>>>> 			 This will require all blast analyses to be rerun.
>>>>>> 
>>>>>>      -again|a            recaculate all annotations and output files even if no
>>>>>> 			 settings have changed. Does not delete old analyses.
>>>>>> 
>>>>>>      -quiet|q            Regular quiet. Only a handlful of status messages.
>>>>>> 
>>>>>>      -qq                 Even more quiet. There are no status messages.
>>>>>> 
>>>>>>      -dsindex            Quickly generate datastore index file. Note that this
>>>>>>                          will not check if run settings have changed on contigs
>>>>>> 
>>>>>>      -nolock             Turn off file locks. May be usful on some file systems,
>>>>>>                          but can cause race conditions if running in parallel.
>>>>>> 
>>>>>>      -TMP                Specify temporary directory to use.
>>>>>> 
>>>>>>      -CTL                Generate empty control files in the current directory.
>>>>>> 
>>>>>>      -OPTS               Generates just the maker_opts.ctl file.
>>>>>> 
>>>>>>      -BOPTS              Generates just the maker_bopts.ctl file.
>>>>>> 
>>>>>>      -EXE                Generates just the maker_exe.ctl file.
>>>>>> 
>>>>>>      -MWAS    <option>   Easy way to control mwas_server for web-based GUI
>>>>>> 
>>>>>>                               options:  STOP
>>>>>>                                         START
>>>>>>                                         RESTART
>>>>>> 
>>>>>>      -version            Prints the MAKER version.
>>>>>> 
>>>>>>      -help|?             Prints this usage statement.
>>>>>> 
>>>>>> 
>>>>>> On Mon, Aug 28, 2017 at 6:36 PM, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>>>>>> Path needs to be a list of directories to search (you specified an executable location).
>>>>>> 
>>>>>> So not this --> /Users/emmannaemeka/Desktop/Gpm/maker/bin/maker
>>>>>> 
>>>>>> Instead it needs to be this --> /Users/emmannaemeka/Desktop/Gpm/maker/bin
>>>>>> 
>>>>>> --Carson
>>>>>> 
>>>>>> 
>>>>>>> On Aug 28, 2017, at 11:32 AM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>>
>>>>>>> wrote:
>>>>>>> 
>>>>>>> Thanks 
>>>>>>> 
>>>>>>> I tried to export PATH
>>>>>>> 
>>>>>>> running 
>>>>>>> echo $PATH in the maker directory this returned
>>>>>>> 
>>>>>>> /usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:/Users/emmannaemeka/Desktop/Gpm/maker/bin/maker
>>>>>>> 
>>>>>>> 1. Does it mean that PATH has been exported?
>>>>>>> 
>>>>>>> 
>>>>>>> secondly,
>>>>>>> 
>>>>>>> I tried to run
>>>>>>> the command maker -h, which maker, maker -CTL
>>>>>>> 
>>>>>>> nothing returned.
>>>>>>> 
>>>>>>> 2. how do i start up maker?
>>>>>>> 3. Do I need to be in maker directory to start maker?
>>>>>>> 
>>>>>>> Thanks
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon, Aug 28, 2017 at 4:49 PM, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>>>>>>> After install the executables will be in the .../maker/bin directory. Example (if you did the install in your home directory) --> ~/maker/bin/maker
>>>>>>> 
>>>>>>> You need to add the .../maker/bin directory to your PATH for it to be found just by typing 'maker'
>>>>>>> 
>>>>>>> Explanation of the Linux PATH --> http://www.linfo.org/path_env_var.html
>>>>>>> 
>>>>>>> --Carson
>>>>>>> 
>>>>>>> 
>>>>>>>> On Aug 28, 2017, at 8:07 AM, Ence,daniel <d.ence at ufl.edu <mailto:d.ence at ufl.edu>> wrote:
>>>>>>>> 
>>>>>>>> Sorry, I should have typed "maker -CTL". If that doesn't work, what is the result of "which maker"?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Aug 28, 2017, at 10:00 AM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Daniel
>>>>>>>>> The reply is 
>>>>>>>>> emmannamekasMBP:maker emmannaemeka$ MAKER -ctl
>>>>>>>>> -bash: MAKER: command not found
>>>>>>>>> 
>>>>>>>>> On Mon, Aug 28, 2017 at 2:57 PM, Ence,daniel <d.ence at ufl.edu <mailto:d.ence at ufl.edu>> wrote:
>>>>>>>>> Hi, It looks like MAKER installed ok. What is the command that you used to try to run MAKER? Can you show the result of running "MAKER -ctl"?
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Daniel Ence
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Aug 28, 2017, at 9:24 AM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi Ence,
>>>>>>>>>> Thanks for your reply,
>>>>>>>>>> 
>>>>>>>>>> This is the step and error received
>>>>>>>>>> emmannamekasMBP:src emmannaemeka$ ./build install
>>>>>>>>>> Installing MAKER...
>>>>>>>>>> Building MAKER
>>>>>>>>>> Skip /Users/emmannaemeka/desktop/Gpm/maker/src/../perl/config-darwin-thread-multi-2level-5.018002 (unchanged)
>>>>>>>>>> 
>>>>>>>>>> The build status is
>>>>>>>>>> 
>>>>>>>>>> =============================================================================
>>>>>>>>>> STATUS MAKER v2.31.9
>>>>>>>>>> ==============================================================================
>>>>>>>>>> PERL Dependencies:  VERIFIED
>>>>>>>>>> External Programs:  VERIFIED
>>>>>>>>>> External C Libraries:   VERIFIED
>>>>>>>>>> MPI SUPPORT:        DISABLED
>>>>>>>>>> MWAS Web Interface: DISABLED
>>>>>>>>>> MAKER PACKAGE:      CONFIGURATION OK
>>>>>>>>>> 
>>>>>>>>>> On Mon, Aug 28, 2017 at 2:00 PM, Ence,daniel <d.ence at ufl.edu <mailto:d.ence at ufl.edu>> wrote:
>>>>>>>>>> Hi Emmanuel, In order for anyone to help you, you need to post to the mailing list the command and output (including errors) of the step that didn't work.
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Daniel Ence
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Aug 27, 2017, at 10:16 AM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hello all,
>>>>>>>>>>> 
>>>>>>>>>>> I downloaded Maker and tried to install it. I succeeded in installing all prerequisites however running maker ./build install, it showed that maker installed.
>>>>>>>>>>> 
>>>>>>>>>>> However trying to run maker it wouldn't run.
>>>>>>>>>>> 
>>>>>>>>>>> Please how do I install maker to run on local computer?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>    


From carsonhh at gmail.com  Fri Sep 22 15:08:36 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 22 Sep 2017 14:08:36 -0600
Subject: [maker-devel] Advice on my pipeline
In-Reply-To: <1505986013492.52354@unil.ch>
References: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch>
	<A39F4213-70A8-4E07-AB13-00C427E4F244@gmail.com>
	<1498470630221.84642@unil.ch>
	<696C51C6-5606-4ECB-A8B8-9C077182FFFA@gmail.com>
	<1498908228256.16549@unil.ch>
	<58E904BF-9AB8-4AC7-B10B-C902F414E03D@gmail.com>
	<1505986013492.52354@unil.ch>
Message-ID: <651D4267-0FD7-4A92-B778-8976B47353BB@gmail.com>

The GFF3 passthrough options are there to help users get old data into MAKER when they have lost access to the original files. For iterative runs of the pipeline, it is more effective just to rerun in place so MAKER can access the raw alignment reports. The raw reports contain more detail than what is stored in the GFF3, and that detail is lost when the GFF3 is used as input.

--Carson
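
As a rough illustration of the point above, the sketch below reads a control file and reports any GFF3 passthrough settings that are switched on, so that an in-place rerun can be checked for leftover maker_gff or *_pass values. It assumes the 'key=value #comment' layout of the maker_opts.ctl excerpt quoted elsewhere in this thread, with 0 as the default for the *_pass options. The helper names `read_ctl` and `passthrough_in_use` are hypothetical and not part of MAKER.

```python
def read_ctl(lines):
    """Parse MAKER-style 'key=value #comment' lines into a dict of strings."""
    opts = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()   # drop the trailing comment
        if "=" not in line:
            continue                            # section headers and blanks
        key, value = line.split("=", 1)
        opts[key.strip()] = value.strip()
    return opts


def passthrough_in_use(opts):
    """Return passthrough settings that differ from the assumed defaults."""
    changed = {}
    if opts.get("maker_gff", ""):
        changed["maker_gff"] = opts["maker_gff"]
    for key, value in opts.items():
        if key.endswith("_pass") and value != "0":
            changed[key] = value
    return changed


# Lines copied from the default section shown in the thread.
sample = [
    "#-----Re-annotation Using MAKER Derived GFF3",
    "maker_gff= #MAKER derived GFF3 file",
    "est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no",
    "rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no",
]
print(passthrough_in_use(read_ctl(sample)))  # {}
```

An empty result means the passthrough section is at its defaults, which is the state described above for rerunning in the same directory.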


> On Sep 21, 2017, at 3:26 AM, Patrick Tran Van <Patrick.TranVan at unil.ch> wrote:
> 
> Hi Carson,
> 
> I have a question about round 2. In a previous reply you said:
> 
> "Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can't be run in the same directory)."
>  
> Does this mean that I don't need to modify the section:
> 
> #-----Re-annotation Using MAKER Derived GFF3
> 
> ?
> 
> If I leave everything at the defaults, such as:
> 
> altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
> protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
> rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no 
> 
> 
> Will MAKER then not search again for repeats and for protein + transcriptome alignments?
> 
> Patrick Tran Van
> 
> Groups Chapuisat, Robinson-Rechavi & Schwander
> Department of Ecology and Evolution
> University of Lausanne
> Le Biophore
> CH-1015 Lausanne
> Switzerland
> Office 3206
> 
> From: Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>>
> Sent: Monday, July 3, 2017 10:50 PM
> To: Patrick Tran Van
> Cc: maker-devel at yandell-lab.org <mailto:maker-devel at yandell-lab.org>
> Subject: Re: [maker-devel] Advice on my pipeline
>  
> maker2zff is just for SNAP training and not for gene filtering (please do not use it for filtering, it does not do what you think).
> 
> So the final annotation set after maker with correct_est_fusion is 16,850. To decide which set is better, look at them in a browser (gene counts are not useful for gauging results). A well annotated genome will have evidence clusters that closely match the final models. A poorly annotated genome will have evidence clusters that are split or merged by the models.
> 
> The correct_est_fusion option does two things. It trims long overlapping UTR fragments, and it stops evidence clusters from being merged on BLASTP evidence alone (so gene predictors will get unmerged hint regions if clusters are split).
> 
> You may also find that using jaccard_clip with Trinity has reduced sensitivity for the transcript data (you may lose things that were there before, but now have better specificity, i.e. fewer false positives). Make sure you provide protein data from at least two related species to help maintain the sensitivity lost from the transcript data. You can also add rejected gene models back in after the fact by using iprscan to identify unsupported models with identifiable protein domains --> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4286374/
> 
> Thanks,
> Carson
> 
> 
> 
> 
>> On Jul 1, 2017, at 5:21 AM, Patrick Tran Van <Patrick.TranVan at unil.ch <mailto:Patrick.TranVan at unil.ch>> wrote:
>> 
>> So I have assembled my transcriptome with Trinity using the jaccard clip option and I have run maker with and without corrected_est_fusion.
>> 
>> I then used SNAP to train/filter it with:
>> 
>> maker2zff  specie.all.gff
>> 
>> Here are my results:
>> 
>> Number of gene after maker -> Number of gene after maker2zff
>> 
>> - Without corrected_est_fusion: 21621 -> 13875
>> - With corrected_est_fusion: 16850 -> 9098
>> 
>> 1) If I understand correctly how corrected_est_fusion works, it prevents gene merging, so shouldn't the result be the other way around?
>> Normally I should find more genes with corrected_est_fusion, right?
>> 
>> 2) I think I should find something like 13000-14000 genes for my species. Should I go with the "Without corrected_est_fusion" set for the 2nd iteration of maker?
>> 
>>  Thanks for your help 
>> 
>> 
>> 
>> 
>> From: Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>>
>> Sent: Monday, June 26, 2017 11:38 PM
>> To: Patrick Tran Van
>> Cc: maker-devel at yandell-lab.org <mailto:maker-devel at yandell-lab.org>
>> Subject: Re: [maker-devel] Advice on my pipeline
>>  
>> Sorry, the option is --> correct_est_fusion
>> 
>> It is in the maker_opts.ctl file.
>> 
>> I would use both SNAP and Augustus on a few large contigs then review the results manually. If one of them is not behaving well, then drop it. If both behave well (i.e. correlate well with evidence alignments) then keep them both.
>> 
>> --Carson
>> 
>> 
>> 
>>> On Jun 26, 2017, at 3:48 AM, Patrick Tran Van <Patrick.TranVan at unil.ch <mailto:Patrick.TranVan at unil.ch>> wrote:
>>> 
>>> Thanks for your answer.
>>> 
>>> 1) Do you think that adding an Augustus training in addition to SNAP at steps 3 and 5 will add more confidence (instead of adding Augustus only for the final round)?
>>> I ask because I am using autoAug for this and it takes a while to compute.
>>> 
>>> 2) I don't see this option : 'avoid_est_fusion=1' . I have tried to add it but I got this error:
>>> 
>>> WARNING: Invalid option 'avoid_est_fusion' in control file maker_opts.ctl
>>> 
>>> (I am using v 2.31.8 )
>>> 
>>> 
>>> 
>>> From: Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>>
>>> Sent: Monday, June 5, 2017 8:29 PM
>>> To: Patrick Tran Van
>>> Cc: maker-devel at yandell-lab.org <mailto:maker-devel at yandell-lab.org>
>>> Subject: Re: [maker-devel] Advice on my pipeline
>>>  
>>> Your plan sounds good. A couple of related notes.
>>> 
>>> Insect genomes tend to have high gene density, so gene merging will be the primary difficulty. You can avoid merging of mRNA-seq evidence by using options like jaccard_clip in Trinity. Then use avoid_est_fusion=1 inside of MAKER.
>>> 
>>> Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can't be run in the same directory).
>>> 
>>> --Carson
>>> 
>>> 
>>>> On Jun 2, 2017, at 3:56 AM, Patrick Tran Van <Patrick.TranVan at unil.ch <mailto:Patrick.TranVan at unil.ch>> wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> This is my first time running Maker for an insect genome annotation. 
>>>> 
>>>> I have found various resources and tried to build a consensus from them. I am looking for your thoughts and advice on my pipeline: whether I can improve something or am doing anything unnecessary.
>>>> 
>>>> 
>>>> What I have:
>>>> - RNA evidence: transcriptome
>>>> - Protein evidence: swissprot/uniprot + busco protein set of insect
>>>> - Cegma and busco results of my genome
>>>> 
>>>> 
>>>> 1) Train SNAP with CEGMA
>>>> 
>>>> 2) Run (run A) maker with repeat masking with transcript, protein, the new SNAP file (from step 1) and augustus file (from busco).
>>>> 
>>>> 3) Create SNAP model from run A.
>>>> 
>>>> 4) Run (run B ) with the new SNAP (done at step 3) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_A.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).
>>>> 
>>>> 5) Create SNAP model from run B.
>>>> 
>>>> 6) Run (run C) with the new SNAP (done at step 5) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_B.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).
>>>> 
>>>> 7)  Create SNAP model from run C AND Create Augustus gene model from run C
>>>> 
>>>> 8) Run (run D) with the new SNAP (done at step 7) + AUGUSTUS file (step 7) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_C.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1). + Use keep_preds=1
>>>> 
>>>> 
>>>> 
>>>> Does it seem coherent?
>>>> 
>>>> Cheers,
>>>> 
>>>> Patrick Tran Van
>>>> 
>>>> Groups Chapuisat, Robinson-Rechavi & Schwander
>>>> Department of Ecology and Evolution
>>>> University of Lausanne
>>>> Le Biophore
>>>> CH-1015 Lausanne
>>>> Switzerland
>>>> Office 3206
>>>> 
>>>> _______________________________________________
>>>> maker-devel mailing list
>>>> maker-devel at box290.bluehost.com <mailto:maker-devel at box290.bluehost.com>
>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org <http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org>
>>> 
>>> 
>> 
>> 
> 
> 


From carson.holt at genetics.utah.edu  Fri Sep 22 15:19:22 2017
From: carson.holt at genetics.utah.edu (Carson Hinton Holt)
Date: Fri, 22 Sep 2017 20:19:22 +0000
Subject: [maker-devel] MAKER
In-Reply-To: <1f033c3dcf414407a888bbef8201d469@JKI-EXPF03.braunschweig.bba.intern>
References: <80e44920c4d64d6e82d18e5535656a32@JKI-EXPF03.braunschweig.bba.intern>
	<D2EB4BE8.3069C%myandell@genetics.utah.edu>
	<3BE2DF2C-9443-4314-9524-F0109AAE42E8@genetics.utah.edu>
	<ddad51a1aab5452b816be741d2969d81@JKI-EXPF03.braunschweig.bba.intern>
	<9867D99C-E4AD-4673-BB0B-804A1551F9F3@genetics.utah.edu>
	<f9b68e0588dd4ce6a65dcd63cb4f7b4e@JKI-EXPF03.braunschweig.bba.intern>
	<B11ADE34-41A4-4EA7-A0AC-D4DFD649991D@genetics.utah.edu>
	<1f033c3dcf414407a888bbef8201d469@JKI-EXPF03.braunschweig.bba.intern>
Message-ID: <ADB216BF-2828-4906-A32F-58CC3989102F@genetics.utah.edu>

All est2genome and protein2genome do is take exonerate alignments of the fasta inputs and translate the longest ORF to get a rough base model that can be used to train a gene predictor. That is why we have it in the documentation that once the predictor is trained they should be turned off.

Once you get the gene predictor trained, MAKER will feed hints to the gene predictor derived from alignments and input GFF3. These hints greatly improve the performance of the gene predictors. MAKER will also use the alignments to filter out predictions that do not match the evidence alignments.
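
As a rough illustration of the switch described above (training round with est2genome/protein2genome on, later rounds with them off and a trained predictor supplied), here is a minimal Python sketch that rewrites option lines in the text of a maker_opts.ctl file. It assumes the usual `key=value #comment` layout of the control file; the helper name `set_ctl_options` and the HMM file name `my_species.hmm` are made up for the example and are not part of MAKER.

```python
import re


def set_ctl_options(ctl_text, updates):
    """Return ctl_text with each `key=value` line rewritten from `updates`.

    Trailing `#comment` text on a line is preserved. Keys that do not
    occur in the file are ignored rather than appended.
    """
    out = []
    for line in ctl_text.splitlines():
        # key, current value (may be empty), optional trailing comment
        m = re.match(r"^(\w+)=([^#]*?)(\s*#.*)?$", line)
        if m and m.group(1) in updates:
            comment = m.group(3) or ""
            line = f"{m.group(1)}={updates[m.group(1)]}{comment}"
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical excerpt of a control file used for the training round.
training_round = (
    "est2genome=1 #infer gene predictions directly from ESTs, 1 = yes, 0 = no\n"
    "protein2genome=1 #infer predictions from protein homology, 1 = yes, 0 = no\n"
    "snaphmm= #SNAP HMM file\n"
)

# Later round: turn the direct inference off and point at the trained HMM.
prediction_round = set_ctl_options(
    training_round,
    {"est2genome": "0", "protein2genome": "0", "snaphmm": "my_species.hmm"},
)
print(prediction_round)
```

Editing the control file by hand works just as well; the sketch is only meant to make explicit which options change between rounds.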

--Carson


> On Sep 22, 2017, at 2:15 PM, Keilwagen, Jens <jens.keilwagen at julius-kuehn.de> wrote:
> 
> Hi Carson,
> 
> Thanks a lot for the information.
> 
> Just to be sure that I understand you right: It is impossible to obtain MAKER results based on RNA-seq and homology that differ from purely homology-based MAKER results?
> 
> Could you confirm that?
> 
> Thanks a lot and best regards, Jens
> 
>> -----Original Message-----
>> From: Carson Hinton Holt [mailto:carson.holt at genetics.utah.edu]
>> Sent: Friday, 22 September 2017 22:04
>> To: Keilwagen, Jens
>> Cc: Maker Mailing List
>> Subject: Re: MAKER
>> 
>> MAKER won't produce est2genome results for est_gff. This is partially
>> because est2genome results are only used for training gene predictors.
>> So you are essentially just getting protein2genome results from your
>> runs. Once you get a gene predictor trained you will see a difference,
>> as it will use the intron/exon structure of alignments as hints to
>> improve gene predictor performance.
>> 
>> --Carson
>> 
>> 
>>> On Sep 21, 2017, at 1:57 AM, Keilwagen, Jens <jens.keilwagen at julius-kuehn.de> wrote:
>>> 
>>> Hi Carson,
>>> 
>>> I have tried the proposed options for a small example (yeast).
>>> 
>>> I had
>>> - proteins (fasta) from another yeast and
>>> - transcript annotation (gff) from cufflinks and StringTie
>>> 
>>> I'd like to compare the maker results for
>>> - proteins and StringTie
>>> Vs.
>>> - proteins and cufflinks
>>> 
>>> I used the default options, except:
>>> genome=<genome fasta>
>>> 
>>> protein=<protein fasta>
>>> est_gff=<transcript gff>
>>> 
>>> est2genome=1
>>> protein2genome=1
>>> 
>>> (An example is attached.)
>>> 
>>> Then I ran maker:
>>> 
>>> maker -RM_off -c 24
>>> find . -type f -name '*.gff' -exec cat {} + | grep maker > filtered-maker-prediction.gff
>>> 
>>> (The run seems to be okay. There were no FAILED, ... in the log. Cf.
>>> attachment)
>>> 
>>> Each maker run was started in a separate subdirectory.
>>> However, I realized that both maker runs yielded almost the same
>>> result (just one minor edit). This made me curious.
>>> As far as I understood the files, I received the (filtered?)
>>> exonerate predictions for the proteins (from the other yeast). Is this
>>> correct? Why did I not receive any predictions (purely) based on the
>>> RNA-seq data? Did I do something wrong?
>>> 
>>> I'm looking forward to your reply.
>>> 
>>> Best regards, Jens
>>> 
>>> 
>>>> -----Original Message-----
>>>> From: Carson Hinton Holt [mailto:carson.holt at genetics.utah.edu]
>>>> Sent: Tuesday, 19 September 2017 23:37
>>>> To: Keilwagen, Jens
>>>> Subject: Re: MAKER
>>>> 
>>>> MAKER cannot use the BAM directly, but you can use something like
>>>> stringtie or trinity to assemble a transcript fasta that can be given
>>>> to the est= option.
>>>> 
>>>> Ab initio gene prediction is only enabled if you specify an hmm or
>>>> species file to use.  If all you want is homology based annotation,
>>>> you can try the est2genome and protein2genome options. Note the final
>>>> models may be partial if the alignments do not cover the gene end to
>>>> end.
>>>> 
>>>> --Carson
>>>> 
>>>> 
>>>> 
>>>>> On Sep 18, 2017, at 4:02 AM, Keilwagen, Jens <jens.keilwagen at julius-kuehn.de> wrote:
>>>>> 
>>>>> Hi Carson,
>>>>> 
>>>>> thanks a lot for your last email that .
>>>>> 
>>>>> I was asked to do homology-based gene prediction using RNA-seq and
>>>>> Maker was proposed as one option.
>>>>> Hence I'd like to ask how to do that in the best possible way.
>>>>> I have mapped RNA-seq data (SAM/BAM) and a fasta of proteins from a
>>>>> related species. How can I integrate the RNA-seq data?
>>>>> 
>>>>> Is it possible to deactivate ab-initio gene prediction by Augustus
>>>>> or SNAP?
>>>>> 
>>>>> Thanks a lot in advance.
>>>>> 
>>>>> Best regards, Jens
>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Carson Holt [mailto:carson.holt at genetics.utah.edu]
>>>>>> Sent: Thursday, 18 February 2016 19:03
>>>>>> To: Keilwagen, Jens
>>>>>> Cc: Mark Yandell
>>>>>> Subject: Re: MAKER
>>>>>> 
>>>>>> GeMoMa sounds like an interesting tool.  If it produces GFF3, you
>>>>>> could give the GFF3 results to the pred_gff= option in MAKER (comma
>>>>>> separated lists accepted). The GFF3 file of predictions must be in
>>>>>> the same coordinate space as the assembly being annotated (genome=
>>>>>> option). Whatever you give to pred_gff will be treated as raw
>>>>>> predictions by MAKER and will only be accepted as a final model if
>>>>>> there are evidence alignments (protein/EST) that support the model,
>>>>>> and if there are multiple alternate models at the same locus, only
>>>>>> the model that is best supported by the protein/transcript evidence
>>>>>> is kept.
>>>>>> 
>>>>>> You can also set the keep_preds=1 option when using pred_gff. This
>>>>>> will cause even raw predictions with no evidence support to be
>>>>>> maintained. In the event of multiple models with no evidence
>>>>>> support, the model best matching the consensus of alternate models
>>>>>> will be maintained.
>>>>>> 
>>>>>> Alternatively you can use the model_gff= option (comma separated
>>>>>> list ok) to input the GFF3 file.  model_gff features are given higher
>>>>>> confidence than pred_gff. At least one model will always be kept
>>>>>> regardless of evidence support (same rules as pred_gff selection
>>>>>> for which model to keep when there are multiple). But model_gff
>>>>>> will also affect how evidence clusters are determined compared to
>>>>>> pred_gff (model_gff features are allowed to merge bridging evidence
>>>>>> clusters). MAKER will also go to extra lengths to pull forward
>>>>>> existing names and other data in the GFF3 for model_gff features.
>>>>>> 
>>>>>> If you do not have GFF3 files in the right coordinate space, but do
>>>>>> have protein fasta or transcript fasta for the GeMoMa predictions,
>>>>>> you can supply these to the protein= and est= options in MAKER
>>>>>> together with est2genome=1 or protein2genome=1. This will cause MAKER
>>>>>> to place the models using exonerate. You would probably also need
>>>>>> to add est_forward=1 to the control files to have MAKER try and
>>>>>> derive model names from the name of evidence alignments they were
>>>>>> derived from if you go this route.
>>>>>> 
>>>>>> You can also try treating the GFF3 predictions as hints to
>>>>>> traditional ab initio gene finders like SNAP or Augustus by giving
>>>>>> them to the est_gff= or protein_gff= options (i.e. make GeMoMa
>>>>>> predictions inform the behavior of predictors like SNAP and
>>>>>> Augustus). Might be interesting. You would have to alter results to
>>>>>> be match/match_part GFF3 features to give them to the est_gff or
>>>>>> protein_gff options.
>>>>>> 
>>>>>> Let me know if you have any more questions, and I'll do my best to
>>>>>> help.
>>>>>> 
>>>>>> Thanks,
>>>>>> Carson
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Feb 18, 2016, at 10:22 AM, Mark Yandell
>>>>>> <myandell at genetics.utah.edu> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> Mark Yandell
>>>>>>> Professor of Human Genetics
>>>>>>> H.A. & Edna Benning Presidential Endowed Chair Co-director USTAR
>>>>>>> Center for Genetic Discovery Eccles Institute of Human Genetics
>>>>>>> University of Utah
>>>>>>> 15 North 2030 East, Room 2100
>>>>>>> Salt Lake City, UT 84112-5330
>>>>>>> ph:801-587-7707
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On 2/18/16, 8:34 AM, "Keilwagen, Jens"
>>>>>>> <jens.keilwagen at jki.bund.de>
>>>>>> wrote:
>>>>>>> 
>>>>>>>> Dear Prof. Yandell,
>>>>>>>> 
>>>>>>>> we have published a homology-based gene prediction program today:
>>>>>>>> https://nar.oxfordjournals.org/content/early/2016/02/17/nar.gkw092
>>>>>>>> and I'd like to ask how we can use MAKER to combine predictions
>>>>>>>> of GeMoMa using different reference organisms, i.e. we try to
>>>>>>>> predict the genes of a target organism (e.g. wheat) using the
>>>>>>>> annotated genes of other reference organisms (e.g. grasses).
>>>>>>>> GeMoMa returns for each reference organism a GFF with the
>>>>>>>> predicted gene models in the target organism.
>>>>>>>> 
>>>>>>>> It would be great if you or someone from your team could give us
>>>>>>>> some hints or point us to the correct paragraph in the documentation.
>>>>>>>> 
>>>>>>>> Thanks a lot and best regards, Jens
>>>>>>>> 
>>>>>>>> ---
>>>>>>>> 
>>>>>>>> Dr. Jens Keilwagen
>>>>>>>> 
>>>>>>>> Julius Kühn-Institut (JKI) - Federal Research Centre for Cultivated Plants
>>>>>>>> 	Institute for Biosafety in Plant Biotechnology
>>>>>>>> 
>>>>>>>> Erwin-Baur-Straße 27
>>>>>>>> 06484 Quedlinburg
>>>>>>>> Germany
>>>>>>>> 
>>>>>>>> Phone: ++49 (0)3946 47 510
>>>>>>>> EMail: jens.keilwagen at jki.bund.de
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>> 
>>> <maker_opts.ctl><slurm-278767.out>
> 


From jens.keilwagen at julius-kuehn.de  Fri Sep 22 15:15:23 2017
From: jens.keilwagen at julius-kuehn.de (Keilwagen, Jens)
Date: Fri, 22 Sep 2017 20:15:23 +0000
Subject: [maker-devel] MAKER
In-Reply-To: <B11ADE34-41A4-4EA7-A0AC-D4DFD649991D@genetics.utah.edu>
References: <80e44920c4d64d6e82d18e5535656a32@JKI-EXPF03.braunschweig.bba.intern>
	<D2EB4BE8.3069C%myandell@genetics.utah.edu>
	<3BE2DF2C-9443-4314-9524-F0109AAE42E8@genetics.utah.edu>
	<ddad51a1aab5452b816be741d2969d81@JKI-EXPF03.braunschweig.bba.intern>
	<9867D99C-E4AD-4673-BB0B-804A1551F9F3@genetics.utah.edu>
	<f9b68e0588dd4ce6a65dcd63cb4f7b4e@JKI-EXPF03.braunschweig.bba.intern>
	<B11ADE34-41A4-4EA7-A0AC-D4DFD649991D@genetics.utah.edu>
Message-ID: <1f033c3dcf414407a888bbef8201d469@JKI-EXPF03.braunschweig.bba.intern>

Hi Carson,

Thanks a lot for the information.

Just to be sure that I understand you right: It is impossible to obtain MAKER results based on RNA-seq and homology that differ from purely homology-based MAKER results?

Could you confirm that?

Thanks a lot and best regards, Jens

> -----Original Message-----
> From: Carson Hinton Holt [mailto:carson.holt at genetics.utah.edu]
> Sent: Friday, 22 September 2017 22:04
> To: Keilwagen, Jens
> Cc: Maker Mailing List
> Subject: Re: MAKER
> 
> MAKER won't produce est2genome results for est_gff. This is partially
> because est2genome results are only used for training gene predictors.
> So you are essentially just getting protein2genome results from your
> runs. Once you get a gene predictor trained you will see a difference,
> as it will use the intron/exon structure of alignments as hints to
> improve gene predictor performance.
> 
> --Carson
> 
> 
> > On Sep 21, 2017, at 1:57 AM, Keilwagen, Jens <jens.keilwagen at julius-kuehn.de> wrote:
> >
> > Hi Carson,
> >
> > I have tried the proposed options for a small example (yeast).
> >
> > I had
> > - proteins (fasta) from another yeast and
> > - transcript annotation (gff) from cufflinks and StringTie
> >
> > I'd like to compare the maker results for
> > - proteins and StringTie
> > Vs.
> > - proteins and cufflinks
> >
> > I used the default options, except:
> > genome=<genome fasta>
> >
> > protein=<protein fasta>
> > est_gff=<transcript gff>
> >
> > est2genome=1
> > protein2genome=1
> >
> > (An example is attached.)
> >
> > Then I ran maker:
> >
> > maker -RM_off -c 24
> > find . -type f -name '*.gff' -exec cat {} + | grep maker > filtered-maker-prediction.gff
> >
> > (The run seems to be okay. There were no FAILED, ... in the log. Cf.
> > attachment)
> >
> > Each maker run was started in a separate subdirectory.
> > However, I realized that both maker runs yielded almost the same
> > result (just one minor edit). This made me curious.
> > As far as I understood the files, I received the (filtered?)
> > exonerate predictions for the proteins (from the other yeast). Is this
> > correct? Why did I not receive any predictions (purely) based on the
> > RNA-seq data? Did I do something wrong?
> >
> > I'm looking forward to your reply.
> >
> > Best regards, Jens
> >
> >
> >> -----Original Message-----
> >> From: Carson Hinton Holt [mailto:carson.holt at genetics.utah.edu]
> >> Sent: Tuesday, 19 September 2017 23:37
> >> To: Keilwagen, Jens
> >> Subject: Re: MAKER
> >>
> >> MAKER cannot use the BAM directly, but you can use something like
> >> stringtie or trinity to assemble a transcript fasta that can be given
> >> to the est= option.
> >>
> >> Ab initio gene prediction is only enabled if you specify an hmm or
> >> species file to use.  If all you want is homology based annotation,
> >> you can try the est2genome and protein2genome options. Note the final
> >> models may be partial if the alignments do not cover the gene end to
> >> end.
> >>
> >> --Carson
> >>
> >>
> >>
> >>> On Sep 18, 2017, at 4:02 AM, Keilwagen, Jens <jens.keilwagen at julius-kuehn.de> wrote:
> >>>
> >>> Hi Carson,
> >>>
> >>> thanks a lot for your last email that .
> >>>
> >>> I was asked to do homology-based gene prediction using RNA-seq and
> >>> Maker was proposed as one option.
> >>> Hence I'd like to ask how to do that in the best possible way.
> >>> I have mapped RNA-seq data (SAM/BAM) and a fasta of proteins from a
> >>> related species. How can I integrate the RNA-seq data?
> >>>
> >>> Is it possible to deactivate ab-initio gene prediction by Augustus
> >>> or SNAP?
> >>>
> >>> Thanks a lot in advance.
> >>>
> >>> Best regards, Jens
> >>>
> >>>> -----Original Message-----
> >>>> From: Carson Holt [mailto:carson.holt at genetics.utah.edu]
> >>>> Sent: Thursday, 18 February 2016 19:03
> >>>> To: Keilwagen, Jens
> >>>> Cc: Mark Yandell
> >>>> Subject: Re: MAKER
> >>>>
> >>>> GeMoMa sounds like an interesting tool.  If it produces GFF3, you
> >>>> could give the GFF3 results to the pred_gff= option in MAKER (comma
> >>>> separated lists accepted). The GFF3 file of predictions must be in
> >>>> the same coordinate space as the assembly being annotated (genome=
> >>>> option). Whatever you give to pred_gff will be treated as raw
> >>>> predictions by MAKER and will only be accepted as a final model if
> >>>> there are evidence alignments (protein/EST) that support the model,
> >>>> and if there are multiple alternate models at the same locus, only
> >>>> the model that is best supported by the protein/transcript evidence
> >>>> is kept.
> >>>>
> >>>> You can also set the keep_preds=1 option when using pred_gff. This
> >>>> will cause even raw predictions with no evidence support to be
> >>>> maintained. In the event of multiple models with no evidence
> >>>> support, the model best matching the consensus of alternate models
> >>>> will be maintained.
> >>>>
> >>>> Alternatively you can use the model_gff= option (comma separated
> >>>> list ok) to input the GFF3 file.  model_gff features are given higher
> >>>> confidence than pred_gff. At least one model will always be kept
> >>>> regardless of evidence support (same rules as pred_gff selection
> >>>> for which model to keep when there are multiple). But model_gff
> >>>> will also affect how evidence clusters are determined compared to
> >>>> pred_gff (model_gff features are allowed to merge bridging evidence
> >>>> clusters). MAKER will also go to extra lengths to pull forward
> >>>> existing names and other data in the GFF3 for model_gff features.
> >>>>
> >>>> If you do not have GFF3 files in the right coordinate space, but do
> >>>> have protein fasta or transcript fasta for the GeMoMa predictions,
> >>>> you can supply these to the protein= and est= options in MAKER
> >>>> together with est2genome=1 or protein2genome=1. This will cause MAKER
> >>>> to place the models using exonerate. You would probably also need
> >>>> to add est_forward=1 to the control files to have MAKER try and
> >>>> derive model names from the name of evidence alignments they were
> >>>> derived from if you go this route.
> >>>>
> >>>> You can also try treating the GFF3 predictions as hints to
> >>>> traditional ab initio gene finders like SNAP or Augustus by giving
> >>>> them to the est_gff= or protein_gff= options (i.e. make GeMoMa
> >>>> predictions inform the behavior of predictors like SNAP and
> >>>> Augustus). Might be interesting. You would have to alter results to
> >>>> be match/match_part GFF3 features to give them to the est_gff or
> >>>> protein_gff options.
> >>>>
> >>>> Let me know if you have any more questions, and I'll do my best to
> >>>> help.
> >>>>
> >>>> Thanks,
> >>>> Carson
> >>>>
> >>>>
> >>>>
> >>>>> On Feb 18, 2016, at 10:22 AM, Mark Yandell
> >>>> <myandell at genetics.utah.edu> wrote:
> >>>>>
> >>>>>
> >>>>> Mark Yandell
> >>>>> Professor of Human Genetics
> >>>>> H.A. & Edna Benning Presidential Endowed Chair Co-director USTAR
> >>>>> Center for Genetic Discovery Eccles Institute of Human Genetics
> >>>>> University of Utah
> >>>>> 15 North 2030 East, Room 2100
> >>>>> Salt Lake City, UT 84112-5330
> >>>>> ph:801-587-7707
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 2/18/16, 8:34 AM, "Keilwagen, Jens"
> >>>>> <jens.keilwagen at jki.bund.de>
> >>>> wrote:
> >>>>>
> >>>>>> Dear Prof. Yandell,
> >>>>>>
> >>>>>> we have published a homology-based gene prediction program today:
> >>>>>> https://nar.oxfordjournals.org/content/early/2016/02/17/nar.gkw092
> >>>>>> and I'd like to ask how we can use MAKER to combine predictions
> >>>>>> of GeMoMa using different reference organisms, i.e. we try to
> >>>>>> predict the genes of a target organism (e.g. wheat) using the
> >>>>>> annotated genes of other reference organisms (e.g. grasses).
> >>>>>> GeMoMa returns for each reference organism a GFF with the
> >>>>>> predicted gene models in the target organism.
> >>>>>>
> >>>>>> It would be great if you or someone from your team could give us
> >>>>>> some hints or point us to the correct paragraph in the documentation.
> >>>>>>
> >>>>>> Thanks a lot and best regards, Jens
> >>>>>>
> >>>>>> ---
> >>>>>>
> >>>>>> Dr. Jens Keilwagen
> >>>>>>
> >>>>>> Julius Kühn-Institut (JKI) - Federal Research Centre for Cultivated Plants
> >>>>>> 	Institute for Biosafety in Plant Biotechnology
> >>>>>>
> >>>>>> Erwin-Baur-Straße 27
> >>>>>> 06484 Quedlinburg
> >>>>>> Germany
> >>>>>>
> >>>>>> Phone: ++49 (0)3946 47 510
> >>>>>> EMail: jens.keilwagen at jki.bund.de
> >>>>>>
> >>>>>>
> >>>>>
> >>>
> >
> > <maker_opts.ctl><slurm-278767.out>


From venyao at qq.com  Sun Sep 24 04:08:43 2017
From: venyao at qq.com (=?ISO-8859-1?B?V2VuIFlhbw==?=)
Date: Sun, 24 Sep 2017 17:08:43 +0800
Subject: [maker-devel] integrate gmap into Maker
Message-ID: <tencent_C1B73CB316EAE11D6C6509793B207C5AE908@qq.com>

Dear Guys,


I am using Maker to annotate my genome sequence. However, it takes too much time.

By default, Maker uses Blastn and Exonerate to align ESTs or assembled transcripts to the genome.

I am wondering if I can use GMAP to align the assembled transcripts to the genome and then provide the alignment to Maker. If so, this may save much time, as GMAP is very fast.


Thanks!


Best regards,


Wen Yao

From eennadi at gmail.com  Sun Sep 24 16:24:10 2017
From: eennadi at gmail.com (Emmanuel Nnadi)
Date: Sun, 24 Sep 2017 22:24:10 +0100
Subject: [maker-devel] Maker not installing
In-Reply-To: <8576F599-08C0-455A-B0D7-8A1110C6F18F@mail.ufl.edu>
References: <CAOXM=CXp_YU=4kaNCFPVnoWKTzAP5we5XTFX6uFTWJSY_11fng@mail.gmail.com>
	<8576F599-08C0-455A-B0D7-8A1110C6F18F@mail.ufl.edu>
Message-ID: <CAOXM=CUH4NjOev7cW2-d8EUKJPmQ1Ob_hLQX-mFG_jTSVfzD3Q@mail.gmail.com>

Hello,

Good day,

I am trying to assign putative gene function to the maker generated fasta.
I am using NCBI

I keep getting this error
  Command line argument error: Argument "query". File is not accessible:
`muc1_genome_snap2.all.maker.snap_masked.proteins.fasta'

What do I do?

Can I use blast2go in place of the NCBI command-line software?

Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:
https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Mon, Aug 28, 2017 at 2:00 PM, Ence,daniel <d.ence at ufl.edu> wrote:

> Hi Emmanuel, In order for anyone to help you, you need to post to the mailing
> list the command and output (including errors) of the step that didn't
> work.
>
> Thanks,
> Daniel Ence
>
>
> On Aug 27, 2017, at 10:16 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>
> Hello all,
>
> I downloaded Maker and tried to install it. I succeeded in installing all
> prerequisites however running maker ./build install, it showed that maker
> installed.
>
> However trying to run maker it wouldn't run.
>
> Please how do I install maker to run on local computer?
>
> Thanks
>
> Nnadi Nnaemeka Emmanuel
> Department of Microbiology,
> Faculty of Natural and Applied Science,
> Plateau State University, Bokkos, Plateau State, Nigeria.
> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/
> publications
>
>
>
>
>

From dandence at gmail.com  Mon Sep 25 09:11:31 2017
From: dandence at gmail.com (Daniel Ence)
Date: Mon, 25 Sep 2017 10:11:31 -0400
Subject: [maker-devel] integrate gmap into Maker
In-Reply-To: <tencent_C1B73CB316EAE11D6C6509793B207C5AE908@qq.com>
References: <tencent_C1B73CB316EAE11D6C6509793B207C5AE908@qq.com>
Message-ID: <7E5F06C8-05B2-447F-A695-DDE7673BDEFF@gmail.com>

Without commenting on the merits of GMAP vs Blastn or Exonerate, you can provide evidence alignments from any source in gff format in the maker control files (for transcript alignments, the est_gff option). I think for GMAP this would mean converting the sam/bam outputs to a gff3 format, but I don't know those steps off the top of my head.
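
To make the conversion idea a little more concrete, here is a minimal Python sketch that formats one spliced alignment as GFF3 match/match_part lines, the feature types mentioned elsewhere on this list for est_gff input. It is only a sketch under assumptions: it starts from exon coordinates that have already been extracted from the aligner output (parsing GMAP or SAM/BAM is not shown), the function name `alignment_to_gff3` and the attribute layout are made up for the example, and it has not been checked against what MAKER requires beyond the feature types.

```python
def alignment_to_gff3(seqid, source, name, strand, exons):
    """Format one spliced alignment as GFF3 match/match_part lines.

    `exons` is a list of (start, end) pairs in genome coordinates,
    1-based and inclusive, as GFF3 expects.
    """
    exons = sorted(exons)
    start = exons[0][0]
    end = max(e for _, e in exons)
    # Parent feature spanning the whole alignment.
    lines = [
        "\t".join([seqid, source, "match", str(start), str(end),
                   ".", strand, ".", f"ID={name};Name={name}"])
    ]
    # One match_part per aligned block (exon), linked to the parent.
    for i, (s, e) in enumerate(exons, 1):
        lines.append(
            "\t".join([seqid, source, "match_part", str(s), str(e),
                       ".", strand, ".", f"ID={name}:part{i};Parent={name}"])
        )
    return lines


# Hypothetical two-exon transcript alignment on a contig named contig1.
for gff_line in alignment_to_gff3("contig1", "gmap", "tx1", "+",
                                  [(100, 200), (300, 400)]):
    print(gff_line)
```

A file assembled this way would start with a `##gff-version 3` header line and then be given to est_gff in maker_opts.ctl.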

~Daniel 


> On Sep 24, 2017, at 5:08 AM, Wen Yao <venyao at qq.com> wrote:
> 
> Dear Guys,
> 
>  
> 
> I am using Maker to annotate my genome sequence. However, it costs too much time.
> 
> By default, Maker use Blastn and Exonerate to align EST or assembled transcripts to the genome.
> 
> I am wondering if I can use GMAP to align the assembled transcripts to the genome and then provide the
> 
> alignment to Maker. If so, this may save much time, as GMAP is very fast.
> 
> 
> 
> Thanks!
> 
>  
> 
> Best regards,
> 
>  
> 
> Wen Yao
> 


From carsonhh at gmail.com  Mon Sep 25 11:07:39 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 25 Sep 2017 10:07:39 -0600
Subject: [maker-devel] Maker not installing
In-Reply-To: <CAOXM=CUH4NjOev7cW2-d8EUKJPmQ1Ob_hLQX-mFG_jTSVfzD3Q@mail.gmail.com>
References: <CAOXM=CXp_YU=4kaNCFPVnoWKTzAP5we5XTFX6uFTWJSY_11fng@mail.gmail.com>
	<8576F599-08C0-455A-B0D7-8A1110C6F18F@mail.ufl.edu>
	<CAOXM=CUH4NjOev7cW2-d8EUKJPmQ1Ob_hLQX-mFG_jTSVfzD3Q@mail.gmail.com>
Message-ID: <07342091-897A-46C2-B000-76A283FE5FB1@gmail.com>

I'm not sure what you mean by NCBI. Do you mean BLAST? If so, you probably did not format and index your input database before running BLAST. See BLAST documentation.

Also the file you are using --> muc1_genome_snap2.all.maker.snap_masked.proteins.fasta

That is not the maker result file. That is a reference fasta of raw SNAP results. The MAKER result file will have a name like this (see maker documentation) --> muc1_genome_snap2.all.maker.proteins.fasta
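
For what it is worth, the two points above (index the database first, and use the MAKER protein file as the query) can be sketched in Python as below. The function only builds the `makeblastdb` and `blastp` argument lists and checks that both FASTA files exist, since the quoted error names the query file as not accessible. The function name `blast_commands`, the E-value cutoff, and the tabular output format are assumptions chosen for the example, not settings recommended in this thread.

```python
from pathlib import Path


def blast_commands(query_fasta, db_fasta, out_tsv):
    """Return the makeblastdb and blastp argument lists for one search.

    Raises FileNotFoundError early if either FASTA path is missing,
    instead of leaving BLAST to report an inaccessible file.
    """
    for path in (query_fasta, db_fasta):
        if not Path(path).is_file():
            raise FileNotFoundError(f"cannot read FASTA file: {path}")
    # Step 1: format and index the protein database.
    make_db = ["makeblastdb", "-in", str(db_fasta), "-dbtype", "prot"]
    # Step 2: search the MAKER proteins against it, tabular output.
    search = ["blastp", "-query", str(query_fasta), "-db", str(db_fasta),
              "-outfmt", "6", "-evalue", "1e-6", "-out", str(out_tsv)]
    return make_db, search
```

Each returned list can be passed to `subprocess.run(..., check=True)`. Here the query would be muc1_genome_snap2.all.maker.proteins.fasta, and the database would be whichever protein FASTA you want to compare against.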

--Carson


> On Sep 24, 2017, at 3:24 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
> 
> Hello,
> 
> Good day,
> 
> I am trying to assign putative gene function to the maker generated fasta. I am using NCBI
> 
> I keep getting this error
>   Command line argument error: Argument "query". File is not accessible:  `muc1_genome_snap2.all.maker.snap_masked.proteins.fasta'
> 
> What do I do?
> 
> can I use blast2go in place of ncbi command line software?
> 
> Nnadi Nnaemeka Emmanuel
> Department of Microbiology,
> Faculty of Natural and Applied Science,
> Plateau State University, Bokkos, Plateau State, Nigeria.
> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications <https://www.researchgate.net/profile/Emmanuel_Nnadi/publications>
> On Mon, Aug 28, 2017 at 2:00 PM, Ence,daniel <d.ence at ufl.edu <mailto:d.ence at ufl.edu>> wrote:
> Hi Emmanuel, In order for anyone to help you, you need to post to the mailing list the command and output (including errors) of the step that didn't work. 
> 
> Thanks,
> Daniel Ence
> 
> 
>> On Aug 27, 2017, at 10:16 AM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
>> 
>> Hello all,
>> 
>> I downloaded Maker and tried to install it. I succeeded in installing all prerequisites however running maker ./build install, it showed that maker installed.
>> 
>> However trying to run maker it wouldn't run.
>> 
>> Please how do I install maker to run on local computer?
>> 
>> Thanks
>> 
>> Nnadi Nnaemeka Emmanuel
>> Department of Microbiology,
>> Faculty of Natural and Applied Science,
>> Plateau State University, Bokkos, Plateau State, Nigeria.
>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications <https://www.researchgate.net/profile/Emmanuel_Nnadi/publications>
>> 
>>    
> 
> 


From xvazquezc at gmail.com  Tue Sep 26 02:23:13 2017
From: xvazquezc at gmail.com (=?UTF-8?Q?Xabier_V=C3=A1zquez=2DCampos?=)
Date: Tue, 26 Sep 2017 17:23:13 +1000
Subject: [maker-devel] question about Maker-MPI
Message-ID: <CAL0hg4EemGn5F25U6xQjBpG0i_ZYzSa=E3E2OXgBu8Q_2qm87A@mail.gmail.com>

Hi Carson,
We finally got Maker working with MPI (mpich, openmpi was a dead end...)
and I have a question about how Maker distributes the computation load.
I know, correct me if I'm wrong, that with MPI, Maker runs blast in
parallel (1 instance per thread) for protein2genome and est2genome. This
indeed improves enormously the speed for the initial run.
But does it take advantage of this when running the gene
predictors? I think there is no benefit from multiple CPUs in non-MPI mode,
but I have no idea about MPI mode.
Thank you in advance,
Xabi

-- 
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From carsonhh at gmail.com  Tue Sep 26 10:28:58 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 26 Sep 2017 09:28:58 -0600
Subject: [maker-devel] question about Maker-MPI
In-Reply-To: <CAL0hg4EemGn5F25U6xQjBpG0i_ZYzSa=E3E2OXgBu8Q_2qm87A@mail.gmail.com>
References: <CAL0hg4EemGn5F25U6xQjBpG0i_ZYzSa=E3E2OXgBu8Q_2qm87A@mail.gmail.com>
Message-ID: <E29F4653-61A3-4E33-967A-4E1A9C8C4721@gmail.com>

MAKER parallelizes at multiple levels. For the ab initio predictors, it will run multiple contigs simultaneously (so each contig gets its own ab initio predictor process). For large contigs it will further divide each contig into 10Mb chunks, and the chunks will run simultaneously.
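
To picture the chunking, here is a minimal Python sketch that splits one contig into fixed-size pieces. It is an illustration only: the function name `chunk_contig` is made up, the chunks are assumed to be simple non-overlapping intervals, and how MAKER actually handles chunk boundaries is not described in this message.

```python
def chunk_contig(length, chunk_size=10_000_000):
    """Split a contig of `length` bases into (start, end) chunks.

    Coordinates are 1-based and inclusive; the last chunk may be shorter.
    """
    if length <= 0 or chunk_size <= 0:
        raise ValueError("length and chunk_size must be positive")
    return [(start, min(start + chunk_size - 1, length))
            for start in range(1, length + 1, chunk_size)]


# An 18 Mb contig yields two chunks that could be processed in parallel.
print(chunk_contig(18_000_000))
```

With this layout the number of pieces available to run at once grows with both the number of contigs and their length.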

--Carson


> On Sep 26, 2017, at 1:23 AM, Xabier Vázquez-Campos <xvazquezc at gmail.com> wrote:
> 
> Hi Carson,
> We finally got Maker working with MPI (mpich, openmpi was a dead end...) and I have a question about how Maker distributes the computation load.
> I know, correct me if I'm wrong, that with MPI, Maker runs blast in parallel (1 instance per thread) for protein2genome and est2genome. This indeed improves enormously the speed for the initial run.
> But does it take advantage of this when running the gene predictors? I think there is no benefit from multiple CPUs in non-MPI mode, but I have no idea about MPI mode.
> Thank you in advance,
> Xabi
> 
> -- 
> Xabier Vázquez-Campos, PhD
> Research Associate
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA


From cjfields at illinois.edu  Mon Sep 25 09:53:39 2017
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 25 Sep 2017 14:53:39 +0000
Subject: [maker-devel] Maker not installing
In-Reply-To: <78A8137E-42B4-4766-9E4A-1B3C2F4FC578@gmail.com>
References: <CAOXM=CXp_YU=4kaNCFPVnoWKTzAP5we5XTFX6uFTWJSY_11fng@mail.gmail.com>
	<8576F599-08C0-455A-B0D7-8A1110C6F18F@mail.ufl.edu>
	<CAOXM=CWWENFHJUhYPawM5+F4571HYOyp3oWD1xju9H9LABq9VQ@mail.gmail.com>
	<113031DF-E733-4EEF-BDB7-405C15F0CA24@mail.ufl.edu>
	<CAOXM=CVrYeNV9i6TxYDn1B-dCA-OYh+VqsEmc-PsA2rP14UZ9A@mail.gmail.com>
	<546610AC-0B7B-4D02-BBF3-E847B95D7F0D@mail.ufl.edu>
	<C3F7BAD9-C134-4EC9-9792-F2C27A78FBFD@gmail.com>
	<CAOXM=CXeT-zmFubptPnWsAygnv8o+EQnCWD60iLf5eFFBkb6rg@mail.gmail.com>
	<8EAFE412-9EF7-4DB7-85A3-632BAC3372FD@gmail.com>
	<CAOXM=CVbJonkz=O5=N_NmB1L+524D6Uk9Fjh-ZBoJmu+9AsMFQ@mail.gmail.com>
	<C9D0377D-1C89-4910-A6BE-FA7DD3D8CDCB@gmail.com>
	<CAOXM=CX7yZTTbBALhjBzz0qtMxQGxBNVO0UEnkis3aDvVf=GLA@mail.gmail.com>
	<C06B6277-1CFA-4B8D-8E80-D08910EBD77C@gmail.com>
	<CAOXM=CXSBG+h1DYoOBrFHYr_JXZX6YCFyo415HRGHozMcmHicg@mail.gmail.com>
	<EC4577AD-A3C5-45CD-ADE7-5D19ED089833@gmail.com>
	<CAOXM=CVby_i3vfF_H46RTZWoqBLMCBGNqJoywxG_2KQJPc5Ttw@mail.gmail.com>
	<7440971C-8A18-4A07-91B6-AD16D17F0766@gmail.com>
	<CAOXM=CXUBcvbqZdT2nviYnFyC3X4Xbj3oREVt650n0c01HESyg@mail.gmail.com>
	<426E63C1-8C82-4809-B2A1-EF7A909E6712@gmail.com>
	<CAOXM=CUPj9EGj_LNEhBpOw0oCiQKoVw8JuPv8-WASo2paSXkug@mail.gmail.com>
	<CAOXM=CUxE-MSWsjZwMYeZ-aOu9Jc9bP6k1uzDLrEyfyqwR-B_Q@mail.gmail.com>
	<78A8137E-42B4-4766-9E4A-1B3C2F4FC578@gmail.com>
Message-ID: <ED8DB3BD-0981-4883-8CE0-E920BCEE0CC6@illinois.edu>

Emmanuel,

Look for anything that will help calculate basic assembly metrics, such as N50, NG50, L50, etc.; these almost always give the overall assembly size and the total number of scaffolds/contigs. For instance, I've used this:

http://korflab.ucdavis.edu/datasets/Assemblathon/Assemblathon2/Basic_metrics/assemblathon_stats.pl

(it requires FAlite, which is here: http://korflab.ucdavis.edu/Unix_and_Perl/FAlite.pm )

The Broad also has GAEMR (http://software.broadinstitute.org/software/gaemr/ ), but I haven't tested it myself (I've heard it's a bit finicky).

Also, see this: https://www.biostars.org/p/237591/ , which has a few more options.
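
If none of those tools is at hand, the basic numbers are easy to compute directly. The following is a minimal sketch, not a replacement for the scripts above; the function names are made up for illustration, and it assumes a plain, uncompressed FASTA file.

```python
def read_fasta_lengths(path):
    """Return a list with the length of every sequence in a FASTA file."""
    lengths = []
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the sequence at which the running total, taken from the
    longest sequence downwards, first reaches half of the assembly size."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


def assembly_summary(path):
    """Number of sequences, total size in bp, and N50 for a FASTA file."""
    lengths = read_fasta_lengths(path)
    return {"sequences": len(lengths), "total_bp": sum(lengths), "n50": n50(lengths)}
```

Calling assembly_summary("genome.fasta") would report the number of scaffolds/contigs and the assembly size. Note that this counts sequences in the assembly, which is not the same thing as the number of chromosomes.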

chris

From: maker-devel <maker-devel-bounces at yandell-lab.org> on behalf of Carson Holt <carsonhh at gmail.com>
Date: Friday, September 22, 2017 at 3:09 PM
To: Emmanuel Nnadi <eennadi at gmail.com>
Cc: "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
Subject: Re: [maker-devel] Maker not installing

MAKER can't give you those details. All MAKER does is try to identify gene models against the assembly you provide.

--Carson

On Sep 22, 2017, at 1:27 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:

Hello all,
Please how can I determine the following in maker:
1. The total number of chromosomes
2. The size of my genome


Thanks

Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Fri, Sep 1, 2017 at 10:52 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
Ok, thanks.
Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications


On Sep 1, 2017 10:50 PM, "Carson Holt" <carsonhh at gmail.com> wrote:
It would need to be a new run. You won't be able to use the updated contig names with the old run.
--Carson

Sent from my iPhone

On Sep 1, 2017, at 3:41 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
Hi Carson,
Thanks for the tip:
perl -ane 's/1_S7_R1_001_\(paired\)_trimmed_\(paired\)_//g; print' genome.fasta

It worked well: when I ran it, it removed 1_S7_R1_001_(paired)_trimmed_(paired)_ from the contig names.

However, I have already run MAKER with 1_S7_R1_001_(paired)_trimmed_(paired)_ in the contig names.

1. How can I apply the change when MAKER has already produced some files from the old sequence names?

I have spent more than 24 hours running MAKER and it has produced some folders already.

How can I make this change?

Thanks


Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Fri, Sep 1, 2017 at 4:54 PM, Carson Holt <carsonhh at gmail.com> wrote:
BLAST, which is used by MAKER, cannot handle really long contig names. MAKER tries to get around this by adding a secondary tag to the fasta header when long names are detected. Even then it would be better to change the IDs of your contigs to avoid downstream failures.

I would recommend removing '1_S7_R1_001_(paired)_trimmed_(paired)_' from each contig name.

Example command to do that -->
perl -ane 's/1_S7_R1_001_\(paired\)_trimmed_\(paired\)_//g; print' genome.fasta
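
Note that the perl command above prints the result to standard output, so it needs to be redirected to a new file. For anyone who prefers Python, here is a minimal sketch of the same renaming. The function name is hypothetical, and unlike the perl one-liner it only touches header lines (those starting with ">") and writes a new file rather than printing.

```python
import re

# The prefix recommended for removal above; parentheses are escaped for the regex.
PREFIX = re.compile(r"1_S7_R1_001_\(paired\)_trimmed_\(paired\)_")


def strip_prefix(in_path, out_path, pattern=PREFIX):
    """Copy a FASTA file, removing `pattern` from header lines only.

    Returns the number of headers that were changed.
    """
    renamed = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                new_line = pattern.sub("", line)
                if new_line != line:
                    renamed += 1
                line = new_line
            dst.write(line)
    return renamed
```

For example, strip_prefix("genome.fasta", "genome.renamed.fasta") leaves the original file untouched, which makes it easy to compare the two before starting a run on the renamed file.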

--Carson


On Aug 30, 2017, at 3:54 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:

Hi Carson,
Thanks for your response; it's been helpful.

Please bear with me as I work through this.

1. How do I generate ESTs for my novel sequences?
2. I am currently running MAKER without EST and protein sequences. Is that wrong? Can it still predict properly?
3. One contig just returned this error:
FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 )
 at /usr/local/bin/RepeatMasker line 1464.
FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 )
 at /usr/local/bin/RepeatMasker line 1464.
FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 )
 at /usr/local/bin/RepeatMasker line 1464.
ERROR: RepeatMasker failed
--> rank=NA, hostname=emmannaemekas-MacBook-Pro.local
ERROR: Failed while doing repeat masking
ERROR: Chunk failed at level:0, tier_type:1
FAILED CONTIG:1_S7_R1_001_(paired)_trimmed_(paired)_contig_2

ERROR: Chunk failed at level:2, tier_type:0
FAILED CONTIG:1_S7_R1_001_(paired)_trimmed_(paired)_contig_2

examining contents of the fasta file and run log


Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Wed, Aug 30, 2017 at 4:12 PM, Carson Holt <carsonhh at gmail.com> wrote:
You can query valid species names using the queryTaxonomyDatabase.pl script that comes with RepeatMasker. Try not to be too specific. In general you should use the genus rather than the species, for example (or even use all of RepBase).

Example -->
perl .../RepeatMasker/util/queryTaxonomyDatabase.pl -species "drosophila"

--Carson


On Aug 30, 2017, at 9:05 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:

Hi Carson,

Thanks.
I was able to start using MAKER.

However, I am working with a novel plant genome. I set the repeat masking to
Dcotrep, a name from the RepBase release, but MAKER reported it as not known to RepeatMasker.

How can I use specific known genomes for repeat masking?
Thanks
Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications


On Aug 29, 2017 4:26 PM, "Carson Holt" <carsonhh at gmail.com> wrote:
MAKER will read the genome= option from the maker_opts.ctl file in your current directory or the maker_opts.ctl you specified on the command line. The error means you have left the value empty. Perhaps you did not save the changes you made or you did not specify the location of the maker_opts.ctl file to use.

You can check the contents of the file using cat. Example --> cat maker_opts.ctl
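
The same check can be scripted. This is a minimal sketch that assumes the control file uses the "key=value #comment" layout shown further down in this thread; the function names are made up for illustration, and a value that itself contains "#" would be cut short by this simple parser.

```python
def read_ctl(path):
    """Parse a MAKER-style control file into a dict of option -> value."""
    options = {}
    with open(path) as handle:
        for line in handle:
            line = line.split("#", 1)[0].strip()  # drop trailing comments
            if "=" not in line:
                continue
            key, value = line.split("=", 1)
            options[key.strip()] = value.strip()
    return options


def genome_is_set(path="maker_opts.ctl"):
    """True if the control file at `path` has a non-empty genome= value."""
    return bool(read_ctl(path).get("genome"))
```

Running genome_is_set() from the directory where MAKER is launched shows whether the file MAKER will actually read has the genome value filled in.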

--Carson


On Aug 29, 2017, at 5:11 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:

Hi Carson,
Thanks a lot for yesterday. I was able to resolve the issue of running MAKER and I followed the commands in the tutorial.
However, I encountered another problem.

When I ran the command nano -c maker_opts.ctl, I specified the name of the genome, 1_S7_assembly.fa, so the file reads as follows:

#-----Genome (these are always required)
genome=1_S7_assembly.fa #genome sequence (fasta file or fasta embeded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #MAKER derived GFF3 file
est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no

#-----EST Evidence (for best results provide a file for at least one)
est= #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closly relate species in GFF3 format

#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=  #protein sequence file in fasta format (i.e. from mutiple oransisms)
protein_gff=  #aligned protein homology evidence from an external GFF3 file

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=all #select a model organism for RepBase masking in RepeatMasker
rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein=/Users/emmannaemeka/Desktop/Gpm/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)


I ran the maker command in another tab and it returned the following:
STATUS: Parsing control files...
ERROR: You have failed to provide a value for 'genome' in the control files.

--> rank=NA, hostname=emmannamekasMBP


Questions
1. After specifying the genome location, do I need to run MAKER in the same tab or open another bash tab?
2. My genome is novel and I do not have proteins. How do I generate protein FASTA and EST files for the de novo sequence?


Thanks

Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Mon, Aug 28, 2017 at 6:47 PM, Carson Holt <carsonhh at gmail.com> wrote:
Here is a class on how to use MAKER taught a couple of years back --> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014

There is also a linked video as well as an amazon image of the class material where you can run the image in the cloud and follow along.

Thanks,
Carson


On Aug 28, 2017, at 11:43 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:

Hi Carson,
Thanks a lot.

I ran the command maker -h and it returned the output below.

The last thing I wish to ask you: how can I load my genome file and begin annotation?

Thanks

emmannamekasMBP:maker emmannaemeka$ maker -h

MAKER version 2.31.9

Usage:

     maker [options] <maker_opts> <maker_bopts> <maker_exe>


Description:

     MAKER is a program that produces gene annotations in GFF3 format using
     evidence such as EST alignments and protein homology. MAKER can be used to
     produce gene annotations for new genomes as well as update annotations
     from existing genome databases.

     The three input arguments are control files that specify how MAKER should
     behave. All options for MAKER should be set in the control files, but a
     few can also be set on the command line. Command line options provide a
     convenient machanism to override commonly altered control file values.
     MAKER will automatically search for the control files in the current
     working directory if they are not specified on the command line.

     Input files listed in the control options files must be in fasta format
     unless otherwise specified. Please see MAKER documentation to learn more
     about control file  configuration.  MAKER will automatically try and
     locate the user control files in the current working directory if these
     arguments are not supplied when initializing MAKER.

     It is important to note that MAKER does not try and recalculated data that
     it has already calculated.  For example, if you run an analysis twice on
     the same dataset you will notice that MAKER does not rerun any of the
     BLAST analyses, but instead uses the blast analyses stored from the
     previous run. To force MAKER to rerun all analyses, use the -f flag.

     MAKER also supports parallelization via MPI on computer clusters. Just
     launch MAKER via mpiexec (i.e. mpiexec -n 40 maker). MPI support must be
     configured during the MAKER installation process for this to work though


Options:

     -genome|g <file>    Overrides the genome file path in the control files

     -RM_off|R           Turns all repeat masking options off.

     -datastore/         Forcably turn on/off MAKER's two deep directory
      nodatastore        structure for output.  Always on by default.

     -old_struct         Use the old directory styles (MAKER 2.26 and lower)

     -base    <string>   Set the base name MAKER uses to save output files.
                         MAKER uses the input genome file name by default.

     -tries|t <integer>  Run contigs up to the specified number of tries.

     -cpus|c  <integer>  Tells how many cpus to use for BLAST analysis.
                         Note: this is for BLAST and not for MPI!

     -force|f            Forces MAKER to delete old files before running again.
                         This will require all blast analyses to be rerun.

     -again|a            recaculate all annotations and output files even if no
                         settings have changed. Does not delete old analyses.

     -quiet|q            Regular quiet. Only a handlful of status messages.

     -qq                 Even more quiet. There are no status messages.

     -dsindex            Quickly generate datastore index file. Note that this
                         will not check if run settings have changed on contigs

     -nolock             Turn off file locks. May be usful on some file systems,
                         but can cause race conditions if running in parallel.

     -TMP                Specify temporary directory to use.

     -CTL                Generate empty control files in the current directory.

     -OPTS               Generates just the maker_opts.ctl file.

     -BOPTS              Generates just the maker_bopts.ctl file.

     -EXE                Generates just the maker_exe.ctl file.

     -MWAS    <option>   Easy way to control mwas_server for web-based GUI

                              options:  STOP
                                        START
                                        RESTART

     -version            Prints the MAKER version.

     -help|?             Prints this usage statement.


Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Mon, Aug 28, 2017 at 6:36 PM, Carson Holt <carsonhh at gmail.com> wrote:
PATH needs to be a list of directories to search (you specified an executable location).

So not this --> /Users/emmannaemeka/Desktop/Gpm/maker/bin/maker

Instead it needs to be this --> /Users/emmannaemeka/Desktop/Gpm/maker/bin
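
The difference can be demonstrated with Python's shutil.which, which searches the directories listed in a PATH-style string much as the shell does. This is a minimal sketch assuming a POSIX system; the temporary directory and the placeholder script stand in for the real maker/bin directory and executable.

```python
import os
import shutil
import stat
import tempfile


def find_on_path(command, path_value):
    """Return the full path of `command` if a directory in `path_value` holds it."""
    return shutil.which(command, path=path_value)


def demo():
    """Compare a PATH entry that names the executable with one that names its directory."""
    with tempfile.TemporaryDirectory() as root:
        bin_dir = os.path.join(root, "bin")
        os.mkdir(bin_dir)
        exe = os.path.join(bin_dir, "maker")  # placeholder, not the real MAKER
        with open(exe, "w") as handle:
            handle.write("#!/bin/sh\necho placeholder\n")
        os.chmod(exe, os.stat(exe).st_mode | stat.S_IXUSR)

        wrong = find_on_path("maker", exe)      # PATH entry is the executable itself
        right = find_on_path("maker", bin_dir)  # PATH entry is the containing directory
        return wrong, right, exe
```

In demo(), the lookup fails when the PATH entry is the executable itself and succeeds when the entry is the directory that contains it, which matches the behaviour described above.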

--Carson


On Aug 28, 2017, at 11:32 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:

Thanks.

I tried to export PATH.

Running echo $PATH in the maker directory returned:

/usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:/Users/emmannaemeka/Desktop/Gpm/maker/bin/maker

1. Does it mean that PATH has been exported?

Secondly, I tried to run the commands maker -h, which maker, and maker -CTL.

Nothing was returned.

2. How do I start up MAKER?
3. Do I need to be in the maker directory to start MAKER?

Thanks


Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Mon, Aug 28, 2017 at 4:49 PM, Carson Holt <carsonhh at gmail.com> wrote:
After install the executables will be in the .../maker/bin directory. Example (if you did the install in your home directory) --> ~/maker/bin/maker

You need to add the .../maker/bin directory to your PATH for it to be found just by typing 'maker'.

Explanation of the Linux PATH --> http://www.linfo.org/path_env_var.html

--Carson


On Aug 28, 2017, at 8:07 AM, Ence,daniel <d.ence at ufl.edu> wrote:

Sorry, I should have typed "maker -CTL". If that doesn't work, what is the result of "which maker"?


On Aug 28, 2017, at 10:00 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:

Hi Daniel
The reply is
emmannamekasMBP:maker emmannaemeka$ MAKER -ctl
-bash: MAKER: command not found

Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Mon, Aug 28, 2017 at 2:57 PM, Ence,daniel <d.ence at ufl.edu> wrote:
Hi, It looks like MAKER installed ok. What is the command that you used to try to run MAKER? Can you show the result of running "MAKER -ctl"?

Thanks,
Daniel Ence


On Aug 28, 2017, at 9:24 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:

Hi Ence,
Thanks for your reply,

This is the step and error received

emmannamekasMBP:src emmannaemeka$ ./build install

Installing MAKER...

Building MAKER

Skip /Users/emmannaemeka/desktop/Gpm/maker/src/../perl/config-darwin-thread-multi-2level-5.018002 (unchanged)


The build status is


=============================================================================

STATUS MAKER v2.31.9

==============================================================================

PERL Dependencies:  VERIFIED

External Programs:  VERIFIED

External C Libraries:   VERIFIED

MPI SUPPORT:        DISABLED

MWAS Web Interface: DISABLED

MAKER PACKAGE:      CONFIGURATION OK

Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Mon, Aug 28, 2017 at 2:00 PM, Ence,daniel <d.ence at ufl.edu> wrote:
Hi Emmanuel, In order for anyone to help you, you need to post to the mailing list the command and output (including errors) of the step that didn't work.

Thanks,
Daniel Ence


On Aug 27, 2017, at 10:16 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:

Hello all,

I downloaded MAKER and tried to install it. I succeeded in installing all prerequisites, and running ./build install showed that MAKER was installed.

However, when I try to run MAKER, it won't run.

How do I install MAKER to run on a local computer?

Thanks
Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications


_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org



From tfallon at mit.edu  Tue Sep 26 12:40:21 2017
From: tfallon at mit.edu (Tim Fallon)
Date: Tue, 26 Sep 2017 13:40:21 -0400
Subject: [maker-devel] MAKER changelog?
Message-ID: <DF018D85-759C-4C02-8D97-C45AADACF9A0@mit.edu>

Hi there,

I recently noticed the MAKER 3.0 beta version incremented from 3.0.0., to 3.01.1.  Is there a changelog which describes the updates between the two releases?

All the best,
-Tim

Timothy R. Fallon
PhD candidate
Laboratory of Jing-Ke Weng
Department of Biology
MIT

tfallon at mit.edu


From carsonhh at gmail.com  Tue Sep 26 13:34:16 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 26 Sep 2017 12:34:16 -0600
Subject: [maker-devel] MAKER changelog?
In-Reply-To: <DF018D85-759C-4C02-8D97-C45AADACF9A0@mit.edu>
References: <DF018D85-759C-4C02-8D97-C45AADACF9A0@mit.edu>
Message-ID: <C32D3C31-125B-4D3D-8E0B-CD4ED629E541@gmail.com>

Here you go.

*updated the locations for repbase and augustus
*make library install more portable for newer perl versions
*fix for cdna2genome single exon strand
*updates for better hints in augustus (exact rather than partial intron match)
*added allow_overlap for UTR in fungi and prokaryotes
*uri escape snap name in zff conversion
*fix for BioPerl-live related error (also submitted fix to BioPerl)
*jaccard cluster and bug fixes for cigar string
*Added zff2genebank script for training augustus (adapted from Jason Stajich's zff2augustus_gbk.pl)

--Carson



From qwzhang0601 at gmail.com  Wed Sep 27 09:30:28 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 27 Sep 2017 10:30:28 -0400
Subject: [maker-devel] Suggestions if too many predicted genes
Message-ID: <CAOW6FSLpSEfgyutvK=3oTn=71GsX=TUGXvM4mMbeZj2sV+TuuQ@mail.gmail.com>

Hello:

Thank you for all your previous comments and suggestions. We annotated a
new rodent species using the maker2 pipeline. The assembly is about 3.2Gb
with N50 24.3Mb. I included all scaffolds longer than 300bp for gene
annotation (about 250k scaffolds).

For repeat masking, we also built a species-specific library. We used both
transcriptome and protein sequences as evidence (including 10k reviewed
mammalian and 340k predicted rodent protein sequences from UniProt). We
predicted 28800 genes with AED<1 (the "default" gene set).

For the 28800 predicted proteins, about 90% have an AED value less than 0.5,
and 74% have domains found by "InterProScan". It seems the genome was well
annotated, but I still feel 28800 protein-coding genes are too many for a
rodent species. Do you think this gene set is good for downstream analysis
(e.g., gene family expansion analysis, positive selection analysis)? Or can
I do further filtering to bring the number of genes closer to the estimated
number (e.g., 22,000)?

Thanks

Best
Quanwei

From dandence at gmail.com  Wed Sep 27 09:54:30 2017
From: dandence at gmail.com (Daniel Ence)
Date: Wed, 27 Sep 2017 10:54:30 -0400
Subject: [maker-devel] Suggestions if too many predicted genes
In-Reply-To: <CAOW6FSLpSEfgyutvK=3oTn=71GsX=TUGXvM4mMbeZj2sV+TuuQ@mail.gmail.com>
References: <CAOW6FSLpSEfgyutvK=3oTn=71GsX=TUGXvM4mMbeZj2sV+TuuQ@mail.gmail.com>
Message-ID: <42F0CCB8-20DB-4E6F-85CB-CBE939F8208E@gmail.com>

Hi Quanwei, I think that your genome assembly probably contains many contigs that are too small to contain full gene sequences. Rather than 300bp, a minimum scaffold length of 5kbp or 10kbp is a more useful threshold. This is mentioned in the maker_opts.ctl file with the min_contig parameter: "skip genome contigs below this length (under 10kbp are often useless)".

I don't know how many genes are annotated on small (<10kbp) scaffolds and contigs, but excluding those contigs would probably reduce your gene count. These may be fragments or duplicates of genes present on these sequences that weren't assembled properly.
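
To see how many sequences such a threshold would exclude before rerunning anything, the assembly can be filtered outside MAKER. This is a minimal sketch with hypothetical function names, assuming a plain FASTA file; setting min_contig in maker_opts.ctl applies the cutoff inside MAKER itself.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line, []
            elif header is not None:
                chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(in_path, out_path, min_length=10_000):
    """Write records with at least `min_length` bases; return (kept, dropped).

    Each kept sequence is written on a single line.
    """
    kept = dropped = 0
    with open(out_path, "w") as dst:
        for header, seq in read_fasta(in_path):
            if len(seq) >= min_length:
                dst.write(f"{header}\n{seq}\n")
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

Comparing the kept and dropped counts at 5kbp and at 10kbp gives a quick sense of how much of the 250k-scaffold set falls below each cutoff.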

Also, using predicted protein sequences from UniProt as evidence in your annotation is probably not advisable, since those sequences are not from genes with experimental evidence. This is the TrEMBL vs. Swiss-Prot issue that you asked about earlier.

Additionally requiring a minimum protein length as you asked about earlier could also reduce the gene count. 

Ultimately, you may do whatever filtering you find necessary and justifiable for your annotation depending on the biology of your organism and the methods that generated your assembly, and your annotation. 

Hope this helps, 
Daniel


From michael.s.campbell1 at gmail.com  Wed Sep 27 10:34:11 2017
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Wed, 27 Sep 2017 11:34:11 -0400
Subject: [maker-devel] Suggestions if too many predicted genes
In-Reply-To: <42F0CCB8-20DB-4E6F-85CB-CBE939F8208E@gmail.com>
References: <CAOW6FSLpSEfgyutvK=3oTn=71GsX=TUGXvM4mMbeZj2sV+TuuQ@mail.gmail.com>
	<42F0CCB8-20DB-4E6F-85CB-CBE939F8208E@gmail.com>
Message-ID: <546F5254-0C46-4BB6-93E8-80B5EC7E40D1@gmail.com>

Hi Quanwei,

The first thing that comes to mind with too many genes is undermasked repeats. You could check the Pfam domains for things like integrase, GAG proteins, and other transposon-related domains. I would also look a bit closer at the genes with AEDs greater than 0.5. Looking at things like the average number of exons per transcript and the average gene and transcript lengths can help pick out dodgy genes. You could also do some filtering on the QI values output by MAKER. It is defensible to create a "higher quality" set by limiting it to genes with AEDs less than 0.5 and putting some requirement on the fraction of splice sites confirmed by EST/mRNA-seq alignments.
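
As a starting point for the AED cutoff, here is a minimal sketch that lists transcripts below a threshold. It assumes the AED is stored as an _AED=... attribute in column 9 of the mRNA lines of the GFF3; check this against your own output before relying on it. The function names are made up for illustration, and the QI-based splice-site requirement is not covered.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def transcripts_below_aed(gff_lines, max_aed=0.5):
    """Return IDs of mRNA features whose _AED attribute is below `max_aed`."""
    selected = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        # Assumption: the AED is recorded under the key "_AED".
        if "_AED" in attrs and float(attrs["_AED"]) < max_aed:
            selected.append(attrs.get("ID"))
    return selected
```

It can be used as transcripts_below_aed(open("annotations.gff")), where the file name is a placeholder; the resulting ID list can then be used to subset the GFF3 and the protein/transcript FASTA files.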

Take care,
Mike
> On Sep 27, 2017, at 10:54 AM, Daniel Ence <dandence at gmail.com> wrote:
> 
> Hi Quanwei, I think that your genome assembly probably contains many contigs that are too small to contain full gene sequences. Rather than 300bp, a minimum scaffold length of 5kbp or 10kbp is a more useful threshold. This is mentioned in the maker_opts.ctl file with the min_contig paramter: ?skip genome contigs below this length (under 10kbp are often useless)?. 
> 
> I don?t know how many genes are annotated on small (<10kbp) scaffolds and contigs but excluding those contigs would probably reduce your gene count. These may be fragments or duplicates of genes present on these sequences that weren?t assembled properly.
> 
> Also using predicted protein sequences from uniprot as evidence in your annotation is probably not advisable since those sequences are not from genes with experiment evidence. This is the trEMBL vs swiss-prot issue that that you asked about earlier. 
> 
> Additionally requiring a minimum protein length as you asked about earlier could also reduce the gene count. 
> 
> Ultimately, you may do whatever filtering you find necessary and justifiable for your annotation depending on the biology of your organism and the methods that generated your assembly, and your annotation. 
> 
> Hope this helps, 
> Daniel
> 
>> On Sep 27, 2017, at 10:30 AM, Quanwei Zhang <qwzhang0601 at gmail.com <mailto:qwzhang0601 at gmail.com>> wrote:
>> 
>> Hello:
>> 
>> Thank you for all your previous comments and suggestions. We annotated a new rodent species using the MAKER2 pipeline. The assembly is about 3.2Gb with N50 24.3Mb. I included all scaffolds longer than 300bp for gene annotation (about 250k scaffolds). 
>> 
>> For repeat masking, we also built a species-specific library. We used both transcriptome and protein sequences as evidence (including 10k reviewed mammalian and 340k predicted rodent protein sequences from UniProt). We predicted 28800 genes with AED<1 (the "default" gene set). 
>> 
>> For the 28800 predicted proteins, about 90% have an AED value less than 0.5, and 74% have domains identified by InterProScan. It seems the genome was well annotated, but I still feel 28800 protein-coding genes are too many for a rodent species. Do you think this gene set is good for downstream analysis (e.g., gene family expansion analysis, positive selection analysis)? Or can I do further filtering to bring the number of genes closer to the estimated number (e.g., 22,000)?
>> 
>> Thanks
>> 
>> Best
>> Quanwei
>> 
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

From xvazquezc at gmail.com  Wed Sep 27 19:32:30 2017
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Thu, 28 Sep 2017 10:32:30 +1000
Subject: [maker-devel] Suggestions if too many predicted genes
In-Reply-To: <546F5254-0C46-4BB6-93E8-80B5EC7E40D1@gmail.com>
References: <CAOW6FSLpSEfgyutvK=3oTn=71GsX=TUGXvM4mMbeZj2sV+TuuQ@mail.gmail.com>
	<42F0CCB8-20DB-4E6F-85CB-CBE939F8208E@gmail.com>
	<546F5254-0C46-4BB6-93E8-80B5EC7E40D1@gmail.com>
Message-ID: <CAL0hg4EXwhZBSBE23X4auu8UW0hHW+=Vghwt8dMttfSbgriF=w@mail.gmail.com>

Hi Quanwei,
Following Michael's comment, even if you use Swiss-Prot, there are over 2700
transposases in it. If there is some undermasking, they will show up as
evidence.
Cheers,
Xabi

On 28 September 2017 at 01:34, Michael Campbell <
michael.s.campbell1 at gmail.com> wrote:

> Hi Quanwei,
>
> The first thing that comes to mind with too many genes is undermasked
> repeats. You could check the Pfam domains for things like integrase, GAG
> proteins, and other transposon-related domains. I would also look a bit
> closer at the genes with AEDs greater than 0.5. Looking at things like
> average number of exons per transcript and average gene and transcript
> lengths can help pick out dodgy genes. You could also do some filtering on
> the QI values output by MAKER. It is defensible to create a "higher
> quality" set by limiting it to genes with AEDs less than 0.5 and putting
> some requirement on the fraction of splice sites confirmed by EST/mRNA-seq
> alignments.
>
> Take care,
> Mike


-- 
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From qwzhang0601 at gmail.com  Wed Sep 27 21:04:43 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 27 Sep 2017 22:04:43 -0400
Subject: [maker-devel] Suggestions if too many predicted genes
In-Reply-To: <CAL0hg4EXwhZBSBE23X4auu8UW0hHW+=Vghwt8dMttfSbgriF=w@mail.gmail.com>
References: <CAOW6FSLpSEfgyutvK=3oTn=71GsX=TUGXvM4mMbeZj2sV+TuuQ@mail.gmail.com>
	<42F0CCB8-20DB-4E6F-85CB-CBE939F8208E@gmail.com>
	<546F5254-0C46-4BB6-93E8-80B5EC7E40D1@gmail.com>
	<CAL0hg4EXwhZBSBE23X4auu8UW0hHW+=Vghwt8dMttfSbgriF=w@mail.gmail.com>
Message-ID: <CAOW6FSJPZBiriKh9L5knuGp_ZCSEVxw4+eftyddk+o3kFwTTCw@mail.gmail.com>

Thank you all for your comments and suggestions. Yes, even when I only use
Swiss-Prot I still have 26.5k protein-coding genes. As you mentioned, one
reason may be related to repeat masking, and another may be the inclusion
of short scaffolds, which can lead to protein fragments.

About the repeat masking: I used the latest RepeatMasker and Repbase
(selected Mammalian), and I also built species-specific repeat libraries
following
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic.
About transposases: I know the MAKER pipeline already provides
"transposable element proteins". I do not know what else I can do.

About the short scaffolds: in fact, among the 26.5k genes, only about 400
genes are predicted from scaffolds shorter than 10kb. Besides, I know there
are some very short proteins (e.g., the mouse protein RL41 (60S ribosomal
protein) has length 25). I think short scaffolds may also include some short
proteins.

Now, I plan to start from the 26.5k protein-coding genes. I think the less
reliable ones will be filtered out in downstream analysis. For example,
when we construct the gene families, those fragments or falsely predicted
proteins will more likely be excluded from gene families.

Thank you all for your suggestions.

Best
Quanwei
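
For anyone who wants to try the AED cutoff Mike suggested earlier in this
thread, here is a minimal Python sketch. It assumes the MAKER GFF3 stores
the score on each mRNA line as an `_AED=` attribute in column 9; the
function name and the 0.5 default are only illustrative, and this is not
how quality_filter.pl or MAKER itself does the filtering.

```python
import re

# Match the _AED attribute only (not _eAED), at the start of column 9 or after ';'.
AED_RE = re.compile(r"(?:^|;)_AED=([0-9.]+)")
ID_RE = re.compile(r"(?:^|;)ID=([^;]+)")


def mrna_ids_below_aed(gff_lines, max_aed=0.5):
    """Return IDs of mRNA features whose _AED value is below max_aed.

    gff_lines is any iterable of GFF3 lines (e.g. an open file handle).
    mRNA lines without an _AED attribute are skipped rather than guessed at.
    """
    kept = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section follows; no more features
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        aed = AED_RE.search(cols[8])
        feature_id = ID_RE.search(cols[8])
        if aed and feature_id and float(aed.group(1)) < max_aed:
            kept.append(feature_id.group(1))
    return kept
```

Passing an open GFF3 file handle to `mrna_ids_below_aed` should give the
transcript IDs to keep; comparing the length of that list with the total
mRNA count shows how many models a given cutoff would drop.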



From qwzhang0601 at gmail.com  Thu Sep 28 07:05:19 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Thu, 28 Sep 2017 08:05:19 -0400
Subject: [maker-devel] gene annotation for a better genome
Message-ID: <CAOW6FSKn__0Gcco+7V+rtv9of4hbTp=LE4J+V51LPFZ1gnm7OQ@mail.gmail.com>

Hello:

Recently, we got a new version of the NMR genome, which had previously been
assembled and annotated a few years ago. We can download the gene
annotation from NCBI.

Now we want to annotate the new genome using the MAKER2 pipeline. I wonder
how I can make full use of the existing annotations. On the other hand,
since the previous genome was not very well assembled, some gene
annotations may be false positives. I hope those false-positive genes in
the previous annotation won't mislead MAKER2 for the current gene annotation.

Do you have any suggestions? Thanks

Best
Quanwei

From carsonhh at gmail.com  Fri Sep 29 11:36:09 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 29 Sep 2017 10:36:09 -0600
Subject: [maker-devel] gene annotation for a better genome
In-Reply-To: <CAOW6FSKn__0Gcco+7V+rtv9of4hbTp=LE4J+V51LPFZ1gnm7OQ@mail.gmail.com>
References: <CAOW6FSKn__0Gcco+7V+rtv9of4hbTp=LE4J+V51LPFZ1gnm7OQ@mail.gmail.com>
Message-ID: <5AFEDD05-DF02-463F-A6EE-1619A9BB968D@gmail.com>

You can try using the est2genome=1 option to map the old models forward onto the new assembly as if they were ESTs (add a line that says est_forward=1 to the control file to maintain old naming, and set est= to the old model transcript file). Then provide the final models as a pred_gff for a subsequent run (i.e. a traditional MAKER run where you are annotating the new assembly with transcript and protein evidence and ab initio predictors). Don't supply the old models to est= on that run.

The idea behind doing it this way is:
1. You need to get old models onto the new assembly so coordinates will change. So by doing it this way, you will at least be able to move many models forward based on homology.
2. By providing the models to pred_gff on a subsequent MAKER run, you are just letting old models compete against new annotations. They will be rejected if they have no evidence support, or can be kept if they score better than alternate models from SNAP/Augustus. That way you have the chance to integrate old models while at the same time rejecting some old models that have no evidence overlap.

--Carson
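
The two runs described above differ only in a handful of control-file
settings, so one way to keep them straight is a small script that patches
maker_opts.ctl for each pass. The following is a rough Python sketch, not
part of MAKER: the helper name and all file names are hypothetical, and it
assumes the control file uses plain `key=value #comment` lines. The option
names (est, est2genome, est_forward, pred_gff) are the ones mentioned in
the message.

```python
def set_ctl_options(ctl_text, overrides):
    """Return control-file text with the given key=value settings applied.

    Existing keys are rewritten in place (any trailing '#comment' is kept);
    keys not found in the file, such as est_forward, are appended at the end.
    """
    remaining = dict(overrides)
    out = []
    for line in ctl_text.splitlines():
        key, sep, rest = line.partition("=")
        key = key.strip()
        if sep and key in remaining and not line.lstrip().startswith("#"):
            _, hash_mark, comment = rest.partition("#")
            new_line = f"{key}={remaining.pop(key)}"
            if hash_mark:
                new_line += f" #{comment}"
            out.append(new_line)
        else:
            out.append(line)
    out.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"


# Pass 1: map the old models onto the new assembly as if they were ESTs.
# (hypothetical file name)
PASS1 = {"est": "old_models.transcripts.fasta", "est2genome": 1, "est_forward": 1}

# Pass 2: ordinary annotation run; the old models only compete via pred_gff.
# est= now points at real transcript evidence, not the old models.
# (hypothetical file names)
PASS2 = {"est": "new_transcripts.fasta", "pred_gff": "pass1_models.gff"}
```

Reading maker_opts.ctl, calling `set_ctl_options(text, PASS1)`, and writing
the result to the first run's directory (then the same with PASS2 for the
second run) keeps the old models out of est= on the second pass.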




From willett4 at email.unc.edu  Fri Sep 29 12:20:46 2017
From: willett4 at email.unc.edu (Willett, Christopher S)
Date: Fri, 29 Sep 2017 17:20:46 +0000
Subject: [maker-devel] question on gene numbers with quality_filter.pl
Message-ID: <16C1890A-2042-4BE1-93CE-8A8DC0C18151@ad.unc.edu>

Hello-

We are getting to the final stages (hopefully) of a reannotation of a new assembly of a copepod genome using MAKER and we had some questions about which set of genes to use.  Our latest runs were using Pfam domains to define the default vs standard set using the quality_filter.pl script, and I had a question about the stringency of the filters for this script. It appears that the default is more stringent than the output that we get from MAKER without using this script (all with AED max set to 1). Are there additional filters in this script beyond AED that would cause this?

Here is what we are seeing, if more details would be helpful. Our final MAKER run gives ~21500 predicted genes with keep_preds turned on and ~15200 with it turned off. What I was wondering about was why this 15200 is higher than the default set (which gives ~14500 genes) after we filter the gff using the -d setting in quality_filter.pl. For completeness, the standard set (-s setting) retains ~14800 genes, and if I filter the 15200 gff file with the default parameters that yields ~14100 genes. So I was curious what else was going on in the filter script beyond AED that would trim out genes?

The gene sets look pretty good overall and seem like reasonable numbers, so we were debating which set to use as our final set. I am also trying a few other analyses in InterProScan to see if that identifies additional genes beyond Pfam for retention, but that seems a bit independent from the question above.

Thanks for your help,

Best,

Chris Willett


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Research Associate Professor
Department of Biology
CB#3280 Coker Hall
University of North Carolina, Chapel Hill
Chapel Hill, NC, 27599-3280

Office: 2252 Genome Science Building
phone: 919-843-8663
fax: 919-962-1625


http://labs.bio.unc.edu/Willett/


From willett4 at email.unc.edu  Fri Sep  1 09:22:34 2017
From: willett4 at email.unc.edu (Willett, Christopher S)
Date: Fri, 1 Sep 2017 15:22:34 +0000
Subject: [maker-devel] ERROR: Can't call method "start" on an undefined
 value at ../lib/maker/join.pm line 535
Message-ID: <FC5A69D8-3FE8-45F7-B902-2847E8DC802A@ad.unc.edu>

Hi Everyone-

I was wondering if anyone had any ideas about this error that I am seeing for some of our contigs that is causing them to fail but not others. Here is the error message:

"Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535."

This is either coming up in the augustus or snap portions of the analysis. See below for errors from 4 different contigs compiled. They are from two different runs where I was trying to exclude a couple different parts in each to see if I could isolate the problem. It seems like slightly different errors but the same contigs are failing in both. 

We recently improved our genome and are trying to do a reannotation on it. I am using a very similar MAKER setup that I used previously with an earlier version of the genome and did not have any contigs fail in that case but now 8 of the 12 large contigs are failing (sizes ~18Mb). The MAKER version is 2.31.8.

If anyone has any thoughts on what the issue might be and how I could resolve it I would appreciate it (or if I should provide more information).

Thanks,

Best,

Chris Willett


error 48600

#--------- command -------------#
Widget::augustus:
/nas02/apps/maker-2.31.8/src/augustus-3.2.2/bin/augustus --species=Tigriopus_californicus --strand=forward --UTR=off --hintsfile=/tmp/maker_H_uEHb/0/19_0.1142953-1147539.Tigriopus_californicus.auto_annotator.xdef.augustus --extrinsicCfgFile=/nas02/apps/maker-2.31.8/src/augustus-3.2.2/config/extrinsic/extrinsic.MPE.cfg /tmp/maker_H_uEHb/0/19_0.1142953-1147539.Tigriopus_californicus.auto_annotator.augustus.fasta > /tmp/maker_H_uEHb/0/19_0.1142953-1147539.Tigriopus_californicus.auto_annotator.augustus
#-------------------------------#
deleted:0 genes
 ...processing 0 of 5
 ...processing 1 of 5
 ...processing 2 of 5
 ...processing 3 of 5
 ...processing 4 of 5
Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
--> rank=NA, hostname=c-195-51.kd.unc.edu
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Chromosome_3

ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Chromosome_3

error 48599

Widget::augustus:
/nas02/apps/maker-2.31.8/src/augustus-3.2.2/bin/augustus --species=Tigriopus_californicus --strand=forward --UTR=off --hintsfile=/tmp/maker_TQWBu_/0/49_0.7061374-7062552.Tigriopus_californicus.auto_annotator.xdef.augustus --extrinsicCfgFile=/nas02/apps/maker-2.31.8/src/augustus-3.2.2/config/extrinsic/extrinsic.MPE.cfg /tmp/maker_TQWBu_/0/49_0.7061374-7062552.Tigriopus_californicus.auto_annotator.augustus.fasta > /tmp/maker_TQWBu_/0/49_0.7061374-7062552.Tigriopus_californicus.auto_annotator.augustus
#-------------------------------#
deleted:0 genes
 ...processing 0 of 10
 ...processing 1 of 10
 ...processing 2 of 10
 ...processing 3 of 10
 ...processing 4 of 10
 ...processing 5 of 10
 ...processing 6 of 10
 ...processing 7 of 10
 ...processing 8 of 10
 ...processing 9 of 10
Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
--> rank=NA, hostname=c-195-51.kd.unc.edu
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Chromosome_11

ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Chromosome_11

error 48592

#--------- command -------------#
Widget::snap:
/nas02/apps/maker-2.31.8/src/maker/exe/snap/snap -plus /proj/willetlb/users/cwillett/MAKER_analyses/SD_full/PB11_12_eukscafs_sp3_sum_hm1.hmm -xdef /tmp/maker_WADb1j/0/38_0.1656654-1673307.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.xdef.snap  /tmp/maker_WADb1j/0/38_0.1656654-1673307.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap.fasta > /tmp/maker_WADb1j/0/38_0.1656654-1673307.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap
#-------------------------------#
scoring....decoding.10.20.30.40.50.60.70.80.90.100 done
deleted:0 genes
 ...processing 0 of 5
 ...processing 1 of 5
 ...processing 2 of 5
 ...processing 3 of 5
 ...processing 4 of 5
Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
--> rank=NA, hostname=c-193-25.kd.unc.edu
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Chromosome_5

ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Chromosome_5

error 47069

#--------- command -------------#
Widget::snap:
/nas02/apps/maker-2.31.8/src/maker/exe/snap/snap -plus /proj/willetlb/users/cwillett/MAKER_analyses/SD_full/PB11_12_eukscafs_sp3_sum_hm1.hmm -xdef /tmp/maker__PSNX1/0/53_0.6154539-6169567.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.xdef.snap  /tmp/maker__PSNX1/0/53_0.6154539-6169567.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap.fasta > /tmp/maker__PSNX1/0/53_0.6154539-6169567.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap
#-------------------------------#
scoring....decoding.10.20.30.40.50.60.70.80.90.100 done
deleted:0 genes
 ...processing 0 of 10
 ...processing 1 of 10
 ...processing 2 of 10
 ...processing 3 of 10
 ...processing 4 of 10
 ...processing 5 of 10
 ...processing 6 of 10
 ...processing 7 of 10
 ...processing 8 of 10
 ...processing 9 of 10
Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
--> rank=NA, hostname=c-183-35.kd.unc.edu
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Chromosome_12

ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Chromosome_12


Can't call method "start" on an undefined value at ../lib/maker/join.pm line 535.
 

From chzelin at gmail.com  Tue Sep  5 07:59:09 2017
From: chzelin at gmail.com (zl c)
Date: Tue, 5 Sep 2017 09:59:09 -0400
Subject: [maker-devel] MSG: Can't get HSPs: data not collected.
Message-ID: <CAO_vRvZyMrfhL=nN79xnJeoXwFbEqTPwF8siioHJ_DFQTUpL2Q@mail.gmail.com>

Hello,

I ran MAKER successfully on most sequences, but it fails on some long
sequences. The error is:

Widget::tblastx:

/usr/local/apps/blast/ncbi-blast-2.5.0+/bin/tblastx -db
db.778415-832259.for_tblastx.fasta -query ...778415.832259.0
-num_alignments 10000 -num_descriptions 10000 -evalue 1e-10 -dbsize 1000
-searchsp 500000000 -num_threads 16 -lcase_masking -seg yes -soft_masking
true -show_gis -out   OUT.tblastx

#-------------------------------#


------------- EXCEPTION: Bio::Root::Exception -------------

MSG: Can't get HSPs: data not collected.

STACK: Error::throw

STACK: Bio::Root::Root::throw
/usr/local/Perl/5.18.2/lib/perl5/site_perl/5.18.2/Bio/Root/Root.pm:486

STACK: Bio::Search::Hit::PhatHit::Base::hsps
/spin1/home/linux/chenz11/program/maker/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm:552

STACK: Widget::tblastx::keepers
/spin1/home/linux/chenz11/program/maker/bin/../lib/Widget/tblastx.pm:192

STACK: Widget::tblastx::parse
/spin1/home/linux/chenz11/program/maker/bin/../lib/Widget/tblastx.pm:133

STACK: GI::tblastx
/spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:3251

STACK: GI::tblastx
/spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:3260

STACK: GI::reblast_merged_hits
/spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:471

STACK: GI::merge_resolve_hits
/spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:291

STACK: Process::MpiChunk::_go
/spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:2320

STACK: Process::MpiChunk::run
/spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:340

STACK: Process::MpiChunk::run_all
/spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:356

STACK: Process::MpiTiers::run_all
/spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiTiers.pm:287

STACK: Process::MpiTiers::run_all
/spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiTiers.pm:287

STACK: /home/chenz11/program/maker/bin/maker:695

-----------------------------------------------------------

--> rank=NA, hostname=cn3544

--> rank=NA, hostname=cn3544

--> rank=NA, hostname=cn3544

--> rank=NA, hostname=cn3544

ERROR: Failed while collecting tblastx reports

ERROR: Chunk failed at level:5, tier_type:3

FAILED CONTIG:tig00011625_arrow


ERROR: Chunk failed at level:4, tier_type:0

FAILED CONTIG:tig00011625_arrow


examining contents of the fasta file and run log

I've read a related thread on the Google group and checked my tblastx
output. I found that the number of HSPs should be larger than 1,000,000, but
only 1,000,000 were output, which leaves some alignments with no HSPs. Is
there any setting that could solve the problem?

Thanks,
Zelin

--------------------------------------------
Zelin Chen [chzelin at gmail.com]


NIH/NHGRI
Building 50, Room 5531
50 SOUTH DR, MSC 8004
BETHESDA, MD 20892-8004

From qwzhang0601 at gmail.com  Tue Sep  5 14:24:34 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Tue, 5 Sep 2017 16:24:34 -0400
Subject: [maker-devel] Some errors reported by Maker2
Message-ID: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>

Hello:

We are doing genome annotation for a new rodent species. We have finished
training the ab initio gene predictors successfully using the following
settings: split_hit=40000, max_dna_len=1000000, and 99k mammalian
Swiss-Prot protein sequences as evidence.

But when I used the trained models to do the genome annotation, I got the
following kinds of errors (shown below). I used the same parameters as
those for training, except for the addition of 340k rodent TrEMBL protein
sequences as protein evidence (i.e., I used both the 99k mammalian
Swiss-Prot protein sequences and the 340k rodent TrEMBL protein sequences).

I am doing the annotation on a cluster and started multiple MAKER instances
in the same directory (I had tried to use MPI but ran into some problems).

Do you have any suggestions? Many thanks
#some kinds of errors
open3: fork failed: Cannot allocate memory at
/gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
--> rank=NA, hostname=n520
ERROR: Failed while doing blastx of proteins
ERROR: Chunk failed at level:8, tier_type:3
FAILED CONTIG:Contig2


setting up GFF3 output and fasta chunks
doing repeat masking
Can't kill a non-numeric process ID at
/gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
--> rank=NA, hostname=n513
ERROR: Failed while doing repeat masking
ERROR: Chunk failed at level:0, tier_type:1
FAILED CONTIG:Contig12378


Best
Quanwei

From carsonhh at gmail.com  Tue Sep  5 14:56:01 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 5 Sep 2017 14:56:01 -0600
Subject: [maker-devel] ERROR: Can't call method "start" on an undefined
 value at ../lib/maker/join.pm line 535
In-Reply-To: <FC5A69D8-3FE8-45F7-B902-2847E8DC802A@ad.unc.edu>
References: <FC5A69D8-3FE8-45F7-B902-2847E8DC802A@ad.unc.edu>
Message-ID: <7DCB519E-9AFA-4D10-8046-72DE99C5E4FF@gmail.com>

Did you use GFF3 input to MAKER for any steps (for example, pred_gff or est_gff)?

--Carson

> On Sep 1, 2017, at 9:22 AM, Willett, Christopher S <willett4 at email.unc.edu> wrote:
> 
> Hi Everyone-
> 
> I was wondering if anyone had any ideas about this error that I am seeing for some of our contigs that is causing them to fail but not others. Here is the error message:
> 
> "Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535."
> 
> This is either coming up in the augustus or snap portions of the analysis. See below for errors from 4 different contigs compiled. They are from two different runs where I was trying to exclude a couple different parts in each to see if I could isolate the problem. It seems like slightly different errors but the same contigs are failing in both. 
> 
> We recently improved our genome and are trying to do a reannotation on it. I am using a very similar MAKER setup that I used previously with an earlier version of the genome and did not have any contigs fail in that case but now 8 of the 12 large contigs are failing (sizes ~18Mb). The MAKER version is 2.31.8.
> 
> If anyone has any thoughts on what the issue might be and how I could resolve it I would appreciate it (or if I should provide more information).
> 
> Thanks,
> 
> Best,
> 
> Chris Willett
> 
> 
> 
> error 48600
> 
> #--------- command -------------#
> Widget::augustus:
> /nas02/apps/maker-2.31.8/src/augustus-3.2.2/bin/augustus --species=Tigriopus_californicus --strand=forward --UTR=off --hintsfile=/tmp/maker_H_uEHb/0/19_0.1142953-1147539.Tigriopus_californicus.auto_annotator.xdef.augustus --extrinsicCfgFile=/nas02/apps/maker-2.31.8/src/augustus-3.2.2/config/extrinsic/extrinsic.MPE.cfg /tmp/maker_H_uEHb/0/19_0.1142953-1147539.Tigriopus_californicus.auto_annotator.augustus.fasta > /tmp/maker_H_uEHb/0/19_0.1142953-1147539.Tigriopus_californicus.auto_annotator.augustus
> #-------------------------------#
> deleted:0 genes
> ...processing 0 of 5
> ...processing 1 of 5
> ...processing 2 of 5
> ...processing 3 of 5
> ...processing 4 of 5
> Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
> --> rank=NA, hostname=c-195-51.kd.unc.edu
> ERROR: Failed while annotating transcripts
> ERROR: Chunk failed at level:1, tier_type:4
> FAILED CONTIG:Chromosome_3
> 
> ERROR: Chunk failed at level:6, tier_type:0
> FAILED CONTIG:Chromosome_3
> 
> error 48599
> 
> Widget::augustus:
> /nas02/apps/maker-2.31.8/src/augustus-3.2.2/bin/augustus --species=Tigriopus_californicus --strand=forward --UTR=off --hintsfile=/tmp/maker_TQWBu_/0/49_0.7061374-7062552.Tigriopus_californicus.auto_annotator.xdef.augustus --extrinsicCfgFile=/nas02/apps/maker-2.31.8/src/augustus-3.2.2/config/extrinsic/extrinsic.MPE.cfg /tmp/maker_TQWBu_/0/49_0.7061374-7062552.Tigriopus_californicus.auto_annotator.augustus.fasta > /tmp/maker_TQWBu_/0/49_0.7061374-7062552.Tigriopus_californicus.auto_annotator.augustus
> #-------------------------------#
> deleted:0 genes
> ...processing 0 of 10
> ...processing 1 of 10
> ...processing 2 of 10
> ...processing 3 of 10
> ...processing 4 of 10
> ...processing 5 of 10
> ...processing 6 of 10
> ...processing 7 of 10
> ...processing 8 of 10
> ...processing 9 of 10
> Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
> --> rank=NA, hostname=c-195-51.kd.unc.edu
> ERROR: Failed while annotating transcripts
> ERROR: Chunk failed at level:1, tier_type:4
> FAILED CONTIG:Chromosome_11
> 
> ERROR: Chunk failed at level:6, tier_type:0
> FAILED CONTIG:Chromosome_11
> 
> error 48592
> 
> #--------- command -------------#
> Widget::snap:
> /nas02/apps/maker-2.31.8/src/maker/exe/snap/snap -plus /proj/willetlb/users/cwillett/MAKER_analyses/SD_full/PB11_12_eukscafs_sp3_sum_hm1.hmm -xdef /tmp/maker_WADb1j/0/38_0.1656654-1673307.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.xdef.snap  /tmp/maker_WADb1j/0/38_0.1656654-1673307.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap.fasta > /tmp/maker_WADb1j/0/38_0.1656654-1673307.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap
> #-------------------------------#
> scoring....decoding.10.20.30.40.50.60.70.80.90.100 done
> deleted:0 genes
> ...processing 0 of 5
> ...processing 1 of 5
> ...processing 2 of 5
> ...processing 3 of 5
> ...processing 4 of 5
> Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
> --> rank=NA, hostname=c-193-25.kd.unc.edu
> ERROR: Failed while annotating transcripts
> ERROR: Chunk failed at level:1, tier_type:4
> FAILED CONTIG:Chromosome_5
> 
> ERROR: Chunk failed at level:6, tier_type:0
> FAILED CONTIG:Chromosome_5
> 
> error 47069
> 
> #--------- command -------------#
> Widget::snap:
> /nas02/apps/maker-2.31.8/src/maker/exe/snap/snap -plus /proj/willetlb/users/cwillett/MAKER_analyses/SD_full/PB11_12_eukscafs_sp3_sum_hm1.hmm -xdef /tmp/maker__PSNX1/0/53_0.6154539-6169567.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.xdef.snap  /tmp/maker__PSNX1/0/53_0.6154539-6169567.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap.fasta > /tmp/maker__PSNX1/0/53_0.6154539-6169567.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap
> #-------------------------------#
> scoring....decoding.10.20.30.40.50.60.70.80.90.100 done
> deleted:0 genes
> ...processing 0 of 10
> ...processing 1 of 10
> ...processing 2 of 10
> ...processing 3 of 10
> ...processing 4 of 10
> ...processing 5 of 10
> ...processing 6 of 10
> ...processing 7 of 10
> ...processing 8 of 10
> ...processing 9 of 10
> Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
> --> rank=NA, hostname=c-183-35.kd.unc.edu
> ERROR: Failed while annotating transcripts
> ERROR: Chunk failed at level:1, tier_type:4
> FAILED CONTIG:Chromosome_12
> 
> ERROR: Chunk failed at level:6, tier_type:0
> FAILED CONTIG:Chromosome_12
> 
> 
> Can't call method "start" on an undefined value at ../lib/maker/join.pm line 535.
> 
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org


From carsonhh at gmail.com  Tue Sep  5 15:48:56 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 5 Sep 2017 15:48:56 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
Message-ID: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>

You ran out of memory. You probably set max_dna_len too high for the machines you are using. There is a note in the maker_opts.ctl file that tells you that this value affects memory usage.

So you can either set it lower, or if running under MPI, use fewer CPUs per node (how you do this is MPI flavor dependent, but some flavors let you do this by setting process count lower combined with the round robin option).
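
A minimal sketch of the first option, assuming a bash shell and the usual `key=value #comment` layout of maker_opts.ctl. The helper name `set_ctl_value`, the throwaway demo file, and the value 300000 are illustrative assumptions, not part of MAKER:

```shell
#!/bin/bash
# Hypothetical helper (not part of MAKER): set key=value in a control file,
# keeping any trailing "#comment" on that line.
set_ctl_value() {
  local file=$1 key=$2 value=$3 tmp
  tmp=$(mktemp) || return 1
  # Replace everything between "key=" and the first space or "#".
  if sed -E "s|^(${key}=)[^ #]*|\1${value}|" "$file" > "$tmp"; then
    cat "$tmp" > "$file"   # overwrite in place so file permissions are kept
  fi
  rm -f "$tmp"
}

# Demo on a throwaway file so that no real control file is modified.
demo_ctl=$(mktemp)
printf 'max_dna_len=1000000 #length for dividing up contigs into chunks\n' > "$demo_ctl"
set_ctl_value "$demo_ctl" max_dna_len 300000
cat "$demo_ctl"
```

On a real run the call would look like `set_ctl_value maker_opts.ctl max_dna_len 300000`, ideally after backing up the control file.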

--Carson


> On Sep 5, 2017, at 2:24 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Hello:
> 
> We are doing genome annotation for a new rodent species. We successfully finished training the ab initio gene predictors with the following settings: split_hit=40000, max_dna_len=1000000, and 99k mammalian Swiss-Prot protein sequences as evidence. 
> 
> But when I used the trained models for the genome annotation, I got the following kinds of errors (shown below). I used the same parameters as for training, except that I added 340k rodent TrEMBL protein sequences as protein evidence (i.e., I use both the 99k mammalian Swiss-Prot sequences and the 340k rodent TrEMBL sequences). 
> 
> I am doing the annotation on a cluster and started multiple MAKER instances in the same directory (I tried to use MPI but ran into problems).  
> 
> Do you have any suggestions? Many thanks
> #some kinds of errors
> open3: fork failed: Cannot allocate memory at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
> --> rank=NA, hostname=n520
> ERROR: Failed while doing blastx of proteins
> ERROR: Chunk failed at level:8, tier_type:3
> FAILED CONTIG:Contig2
> 
> 
> setting up GFF3 output and fasta chunks
> doing repeat masking
> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
> --> rank=NA, hostname=n513
> ERROR: Failed while doing repeat masking
> ERROR: Chunk failed at level:0, tier_type:1
> FAILED CONTIG:Contig12378
> 
> 
> Best
> Quanwei


From carsonhh at gmail.com  Tue Sep  5 16:04:00 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 5 Sep 2017 16:04:00 -0600
Subject: [maker-devel] MSG: Can't get HSPs: data not collected.
In-Reply-To: <CAO_vRvZyMrfhL=nN79xnJeoXwFbEqTPwF8siioHJ_DFQTUpL2Q@mail.gmail.com>
References: <CAO_vRvZyMrfhL=nN79xnJeoXwFbEqTPwF8siioHJ_DFQTUpL2Q@mail.gmail.com>
Message-ID: <846D5971-E6EE-40F3-AF26-3124AA370029@gmail.com>

The last time I saw this error, it was with blast-2.5.1+. Not sure if the failure you are seeing is the same one. It was caused by a truncated BLAST report (so some results have no alignments). I have an open ticket with the BLAST development group. I never received confirmation that it is fixed, but you can try updating to 2.6 and see if that fixes it.  If not, switch to legacy BLAST (not blast plus) and see if it goes away.

--Carson


> On Sep 5, 2017, at 7:59 AM, zl c <chzelin at gmail.com> wrote:
> 
> Hello,
> 
> I ran MAKER successfully on most sequences, but it fails on some long ones. The error is: 
> 
> Widget::tblastx:
> /usr/local/apps/blast/ncbi-blast-2.5.0+/bin/tblastx -db db.778415-832259.for_tblastx.fasta -query ...778415.832259.0 -num_alignments 10000 -num_descriptions 10000 -evalue 1e-10 -dbsize 1000 -searchsp 500000000 -num_threads 16 -lcase_masking -seg yes -soft_masking true -show_gis -out   OUT.tblastx
> #-------------------------------#
> 
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: Can't get HSPs: data not collected.
> STACK: Error::throw
> STACK: Bio::Root::Root::throw /usr/local/Perl/5.18.2/lib/perl5/site_perl/5.18.2/Bio/Root/Root.pm:486
> STACK: Bio::Search::Hit::PhatHit::Base::hsps /spin1/home/linux/chenz11/program/maker/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm:552
> STACK: Widget::tblastx::keepers /spin1/home/linux/chenz11/program/maker/bin/../lib/Widget/tblastx.pm:192
> STACK: Widget::tblastx::parse /spin1/home/linux/chenz11/program/maker/bin/../lib/Widget/tblastx.pm:133
> STACK: GI::tblastx /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:3251
> STACK: GI::tblastx /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:3260
> STACK: GI::reblast_merged_hits /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:471
> STACK: GI::merge_resolve_hits /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:291
> STACK: Process::MpiChunk::_go /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:2320
> STACK: Process::MpiChunk::run /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:340
> STACK: Process::MpiChunk::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:356
> STACK: Process::MpiTiers::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiTiers.pm:287
> STACK: Process::MpiTiers::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiTiers.pm:287
> STACK: /home/chenz11/program/maker/bin/maker:695
> -----------------------------------------------------------
> --> rank=NA, hostname=cn3544
> --> rank=NA, hostname=cn3544
> --> rank=NA, hostname=cn3544
> --> rank=NA, hostname=cn3544
> ERROR: Failed while collecting tblastx reports
> ERROR: Chunk failed at level:5, tier_type:3
> FAILED CONTIG:tig00011625_arrow
> 
> ERROR: Chunk failed at level:4, tier_type:0
> FAILED CONTIG:tig00011625_arrow
> 
> examining contents of the fasta file and run log
> 
> I've read a related thread on the Google group and checked my tblastx output. The number of HSPs should be larger than 1,000,000, but only 1,000,000 were output, which leaves some alignments with no HSPs. Is there any setting that could solve the problem?
>  
> Thanks,
> Zelin
> 
> --------------------------------------------
> Zelin Chen [chzelin at gmail.com]
> 
> 
> NIH/NHGRI
> Building 50, Room 5531
> 50 SOUTH DR, MSC 8004 
> BETHESDA, MD 20892-8004


From qwzhang0601 at gmail.com  Tue Sep  5 16:04:23 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Tue, 5 Sep 2017 18:04:23 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
Message-ID: <CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>

Dear Carson:

Thanks. I wonder whether a smaller "max_dna_len" will split longer scaffolds.
I set max_dna_len to 1Mb because there are quite a few long scaffolds
(e.g., the longest one is about 100Mb). Could you explain whether a smaller
"max_dna_len" will decrease the quality of the annotation (e.g., split some
genes in the same scaffold)?


Best
Quanwei


From carsonhh at gmail.com  Tue Sep  5 16:08:28 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 5 Sep 2017 16:08:28 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
Message-ID: <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>

max_dna_len is the window size for keeping data in RAM. Smaller values do not split genes. But values lower than 100kb can create issues (if a single gene model spans 3 or more windows, it creates a weird failure).
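
As a rough illustration of that trade-off, here is a small sketch that assumes windows are simple non-overlapping slices of max_dna_len bases. That is a simplification of whatever MAKER does internally, and both function names are made up for this example:

```shell
#!/bin/bash
# Number of windows needed to cover a scaffold (ceiling division).
windows_for() {
  local scaffold_len=$1 window=$2
  echo $(( (scaffold_len + window - 1) / window ))
}

# Shortest gene that can touch three consecutive windows: the last base of
# one window, the whole window in between, and the first base of the next.
min_len_spanning_three() {
  local window=$1
  echo $(( window + 2 ))
}

for window in 1000000 400000 100000; do
  echo "max_dna_len=$window: $(windows_for 100000000 "$window") windows for a 100 Mb scaffold;" \
       "genes of $(min_len_spanning_three "$window") bp or more could span 3 windows"
done
```

Under this assumption, smaller windows mean less data held in RAM at once but more windows per scaffold, and any gene model longer than roughly one window becomes a candidate for the three-window failure.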

--Carson


> On Sep 5, 2017, at 4:04 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> Thanks. I wonder whether smaller "max_dna_len" will split longer scaffolds. I set max_dna_len as 1Mb, because there are quite many long scaffolds (e.g., the longest one is about 100Mb). Would you explain whether smaller "max_dna_len" will decrease the quality of annotation (e.g., split some genes in the same scaffold)? 
> 
> 
> Best
> Quanwei  
> 


From qwzhang0601 at gmail.com  Wed Sep  6 09:51:54 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 6 Sep 2017 11:51:54 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
Message-ID: <CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>

Dear Carson:

(1) Thank you for your explanation. I will try setting max_dna_len to 400kb
for our rodent species, which is a little higher than the value suggested
for large vertebrate genomes (the MAKER manual says "300,000 is a good
max_dna_len on large vertebrate genomes if memory is not a limiting
factor").

(2) From some of your replies in the MAKER Google group, I noticed that
setting the depth_blast parameters to a certain number can reduce memory
use and annotation time, so I changed the following parameters. Will this
decrease the quality of the annotation? If it does not affect the quality,
can I use an even smaller number (e.g., 20) to save more memory and time?

depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking

(3) I also have some concerns about speed, especially for the long
scaffolds (around 100Mb). Which part of genome annotation is the most time
consuming (repeat masking, BLAST, or polishing)? In particular, I wonder
whether the blastx of protein evidence takes the majority of the time. I
have prepared 99k mammalian Swiss-Prot protein sequences and 340k rodent
TrEMBL protein sequences as protein evidence, and I am considering whether
I can save much time by using only the 99k mammalian Swiss-Prot sequences.

(4) For some reason, I cannot run MAKER through MPI on our cluster, so I
can only start multiple MAKER instances. Is it possible to have multiple
instances annotate the same long scaffold (i.e., start multiple MAKER
processes for a single sequence, without splitting the long sequence into
shorter ones)?

(5) Still on the speed issue: I read some of your comments about the "cpus"
parameter in the maker_opts file (
http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html).
I understand that it sets the number of CPUs for a single chunk. So if I set
"cpus=2" in the maker_opts file, I can use the following script to
submit the job, right?

**************** the bash file used to submit the maker job
#!/bin/bash

#$ -cwd
#$ -S /bin/bash
#$ -j y
#$ -N makerT2
#$ -l h_vmem=8g
#$ -pe smp 2

module load MAKER/2.31.9/perl.5.22.1

maker --q 2> maker_test.error


Many thanks

Best
Quanwei



From carsonhh at gmail.com  Wed Sep  6 10:06:46 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 6 Sep 2017 10:06:46 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
Message-ID: <9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>


> (2) By reading some of your replies in the maker google group, and I noticed that it can reduce memory and save time for annotation if I set depth_blast to a certain number. So I changed the following parameters. But I wonder, whether it will decrease the quality of annotation? If it won't affect the quality, can I even use a smaller number (e.g., 20) to save more memory and time?
> 
> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking

These values really only affect the final evidence kept in the GFF3 when you look at it in a browser. They have no effect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. By setting the values lower, you are just letting MAKER know it can throw things away even sooner since you don't want them in the GFF3. It provides a minor improvement in memory use, but max_dna_len is the setting with the greatest effect.


> (3) I also have some concerns about the speed, especially for the long scaffolds (around 100Mb). I wonder which part is the most time consuming for genome annotation (repeat masking, blast, or polishing?).  Particularly, I wonder whether the blastx of protein evidence will take majority of time. Now, I have prepared 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences as protein evidences. I am considering whether I can save much time if I only use the 99k mammalian Swiss protein sequences as evidences.

BLASTN (ESTs) -> fastest as it is searching nucleotide space
BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN
TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX

Also, doubling the dataset size doubles the runtime. Larger window sizes via max_dna_len will also increase runtimes.
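
A back-of-the-envelope sketch of that linear scaling, applied to the protein sets mentioned in the question. It assumes that runtime grows in proportion to the number of sequences and that both sets have similar average sequence lengths; the function name is made up for this example:

```shell
#!/bin/bash
# Relative BLASTX runtime, in percent of the baseline, when extra sequences
# are added to the evidence set (integer arithmetic, rounded down).
relative_runtime_pct() {
  local base=$1 extra=$2
  echo $(( (base + extra) * 100 / base ))
}

# 99k Swiss-Prot sequences alone versus 99k Swiss-Prot plus 340k TrEMBL.
echo "BLASTX runtime with both sets: ~$(relative_runtime_pct 99000 340000)% of Swiss-Prot alone"
```

Under these assumptions, adding the TrEMBL set makes the BLASTX step take a bit more than four times as long as with the Swiss-Prot set alone.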


> (4) For some reasons, I can not run maker though MPI on our cluster. So I can only start multiple maker. I wonder if it is possible to let multiple maker to annotate the same long scaffold (i.e., for a single sequence I start multiple maker, without splitting the long sequence into shorter ones).

Without MPI you won't be able to split up large contigs. At the very least you can try to run on a single node and set MPI to use all CPUs on that node. It's less difficult to set up than cross-node jobs via MPI.


> (5) Still about the speed issue. I read some of your comments about "cpus" parameters in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html). And I know it indicate the number of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?  

The cpus parameter only affects how many CPUs are given to the BLAST command line, so only the BLAST step will speed up. I recommend using MPI so that all steps speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
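
A sketch of what a single-node submission could look like, adapted from the Grid Engine script in the question. It assumes MAKER was built with MPI support, that `mpiexec` is on the PATH once the module is loaded, and that the `smp` parallel environment and the slot count of 8 suit your cluster; the helper `maker_mpi_cmd` is made up for this example:

```shell
#!/bin/bash
#$ -cwd
#$ -S /bin/bash
#$ -j y
#$ -N makerT2
#$ -l h_vmem=8g
#$ -pe smp 8

# Build the launch command. NSLOTS is set by Grid Engine to the number of
# slots granted by "-pe"; outside the scheduler it falls back to 1.
maker_mpi_cmd() {
  local nprocs=${1:-${NSLOTS:-1}}
  echo "mpiexec -n $nprocs maker"
}

# Dry run: print the command instead of launching it.
maker_mpi_cmd

# To launch for real, load the MAKER module first and then run:
# $(maker_mpi_cmd) 2> maker_test.error
```

With MPI handling the parallelism, the cpus value in maker_opts.ctl would normally be left at 1 so that the BLAST threads do not oversubscribe the node.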


--Carson

From carsonhh at gmail.com  Thu Sep  7 09:12:46 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 7 Sep 2017 09:12:46 -0600
Subject: [maker-devel] MSG: Can't get HSPs: data not collected.
In-Reply-To: <CAO_vRvbQLQfsRUfXMC1f8=U9L2kJ=PnFSuePjwHpRJE39zdV1Q@mail.gmail.com>
References: <CAO_vRvZyMrfhL=nN79xnJeoXwFbEqTPwF8siioHJ_DFQTUpL2Q@mail.gmail.com>
	<846D5971-E6EE-40F3-AF26-3124AA370029@gmail.com>
	<CAO_vRvbQLQfsRUfXMC1f8=U9L2kJ=PnFSuePjwHpRJE39zdV1Q@mail.gmail.com>
Message-ID: <2B046506-1E32-4840-B3B6-6DABB4A5D4C2@gmail.com>

I'm glad it fixed it.

--Carson

> On Sep 6, 2017, at 8:27 PM, zl c <chzelin at gmail.com> wrote:
> 
> Hi Carson,
> 
> I tried blast-2.6.0+ and it works. Thank you very much.
> 
> Thanks
> Zelin Chen
> 
> --------------------------------------------
> Zelin Chen [chzelin at gmail.com]
> 
> NIH/NHGRI
> Building 50, Room 5531
> 50 SOUTH DR, MSC 8004 
> BETHESDA, MD 20892-8004
> 
> On Tue, Sep 5, 2017 at 6:04 PM, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
> The last time I saw this error, it was with blast-2.5.1+. Not sure if the failure you are seeing is the same one. It was caused by a truncated BLAST report (so some results have no alignments). I have an open ticket with the BLAST development group. I never received confirmation that it is fixed, but you can try updating to 2.6 and see if that fixes it.  If not, switch to legacy BLAST (not blast plus) and see if it goes away.
> 
> ?Carson
> 
> 
>> On Sep 5, 2017, at 7:59 AM, zl c <chzelin at gmail.com <mailto:chzelin at gmail.com>> wrote:
>> 
>> Hello,
>> 
>> I run maker for most sequences successfully but fail some long sequences. The error is: 
>> 
>> Widget::tblastx:
>> /usr/local/apps/blast/ncbi-blast-2.5.0+/bin/tblastx -db db.778415-832259.for_tblastx.fasta -query ...778415.832259.0 -num_alignments 10000 -num_descriptions 10000 -evalue 1e-10 -dbsize 1000 -searchsp 500000000 -num_threads 16 -lcase_masking -seg yes -soft_masking true -show_gis -out   OUT.tblastx
>> #-------------------------------#
>> 
>> ------------- EXCEPTION: Bio::Root::Exception -------------
>> MSG: Can't get HSPs: data not collected.
>> STACK: Error::throw
>> STACK: Bio::Root::Root::throw /usr/local/Perl/5.18.2/lib/perl5/site_perl/5.18.2/Bio/Root/Root.pm:486
>> STACK: Bio::Search::Hit::PhatHit::Base::hsps /spin1/home/linux/chenz11/program/maker/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm:552
>> STACK: Widget::tblastx::keepers /spin1/home/linux/chenz11/program/maker/bin/../lib/Widget/tblastx.pm:192 <http://tblastx.pm:192/>
>> STACK: Widget::tblastx::parse /spin1/home/linux/chenz11/program/maker/bin/../lib/Widget/tblastx.pm:133 <http://tblastx.pm:133/>
>> STACK: GI::tblastx /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:3251
>> STACK: GI::tblastx /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:3260
>> STACK: GI::reblast_merged_hits /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:471
>> STACK: GI::merge_resolve_hits /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:291
>> STACK: Process::MpiChunk::_go /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:2320
>> STACK: Process::MpiChunk::run /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:340
>> STACK: Process::MpiChunk::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:356
>> STACK: Process::MpiTiers::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiTiers.pm:287
>> STACK: Process::MpiTiers::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiTiers.pm:287
>> STACK: /home/chenz11/program/maker/bin/maker:695
>> -----------------------------------------------------------
>> --> rank=NA, hostname=cn3544
>> --> rank=NA, hostname=cn3544
>> --> rank=NA, hostname=cn3544
>> --> rank=NA, hostname=cn3544
>> ERROR: Failed while collecting tblastx reports
>> ERROR: Chunk failed at level:5, tier_type:3
>> FAILED CONTIG:tig00011625_arrow
>> 
>> ERROR: Chunk failed at level:4, tier_type:0
>> FAILED CONTIG:tig00011625_arrow
>> 
>> examining contents of the fasta file and run log
>> 
>> I read a related thread on the Google group and checked my tblastx output. The number of HSPs should be larger than 1,000,000, but only 1,000,000 were written, which leaves some alignments with no HSPs. Is there any setting that could solve the problem?
>>  
>> Thanks,
>> Zelin
>> 
>> --------------------------------------------
>> Zelin Chen [chzelin at gmail.com]
>> 
>> 
>> NIH/NHGRI
>> Building 50, Room 5531
>> 50 SOUTH DR, MSC 8004 
>> BETHESDA, MD 20892-8004

From qwzhang0601 at gmail.com  Fri Sep  8 21:25:29 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Fri, 8 Sep 2017 23:25:29 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
Message-ID: <CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>

Dear Carson:

I got the following error again. Is it still related to memory issues, or
could something else lead to this error? This time it occurred while
training the SNAP model. Previously, even with max_dna_len set to 1 Mb, I
could train the model successfully. In the current training run (where I
get the error below) I have decreased max_dna_len to 300 kb and requested
the same amount of memory as before. The only difference is that I am now
using both the mammalian repeat library and a species-specific repeat
library, whereas previously I used only the mammalian one. Would using both
repeat libraries greatly increase the memory requirement (even with
max_dna_len decreased from 1 Mb to 300 kb)? I have also set the depth_blast
parameters to 30 in the current run.

Thank you! Have a nice weekend!


#---------------------------------------------------------------------
Now starting the contig!!
SeqID: Contig10
Length: 18773588
#---------------------------------------------------------------------


setting up GFF3 output and fasta chunks
doing repeat masking
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
collecting blastx repeatmasking
processing all repeats
doing repeat masking
Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
--> rank=NA, hostname=n224
ERROR: Failed while doing repeat masking
ERROR: Chunk failed at level:0, tier_type:1
FAILED CONTIG:Contig10

ERROR: Chunk failed at level:2, tier_type:0
FAILED CONTIG:Contig10

Best
Quanwei

2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:

>
>> (2) After reading some of your replies in the MAKER Google group, I
>> noticed that setting depth_blast to a certain number can reduce memory
>> use and save annotation time, so I changed the following parameters.
>> Will this decrease the quality of the annotation? If not, can I use an
>> even smaller number (e.g., 20) to save more memory and time?
>>
>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>
> These values really only affect the final evidence kept in the GFF3 when
> you look at it in a browser. They have no effect on the annotation,
> because internally MAKER already collapses evidence down to the 10 best
> non-redundant features per evidence set per locus. The rest are put in the
> GFF3 just for reference. By setting the values lower, you are just letting
> MAKER know it can throw things away even sooner since you don't want them
> in the GFF3. This provides a minor improvement in memory use, but
> max_dna_len is the setting with the greatest effect.
>
>
>> (3) I also have some concerns about speed, especially for the long
>> scaffolds (around 100 Mb). Which part of genome annotation is the most
>> time-consuming (repeat masking, BLAST, or polishing)? In particular, I
>> wonder whether the blastx of protein evidence takes the majority of the
>> time. I have prepared 99k mammalian Swiss-Prot protein sequences and
>> 340k rodent TrEMBL protein sequences as protein evidence, and I am
>> considering whether I could save much time by using only the 99k
>> mammalian Swiss-Prot sequences.
>
> BLASTN (ESTs) -> fastest, as it searches nucleotide space
> BLASTX (proteins) -> must search 6 reading frames, so it will be at least
> 6 times slower than BLASTN
> TBLASTX (alt-ESTs) -> must search 12 reading frames, so it will be at
> least 12 times slower than BLASTN and twice as slow as BLASTX
>
> Also, doubling the dataset size doubles the runtime. Larger window sizes
> via max_dna_len will also increase runtimes.
>
>
>> (4) For some reason, I cannot run MAKER through MPI on our cluster, so I
>> can only start multiple MAKER instances. Is it possible to have multiple
>> MAKER instances annotate the same long scaffold (i.e., start multiple
>> MAKER processes for a single sequence, without splitting the long
>> sequence into shorter ones)?
>
> Without MPI you won't be able to split up large contigs. At the very least
> you can try to run on a single node and set MPI to use all CPUs on that
> node. That is less difficult to set up than cross-node jobs via MPI.
>
>
>> (5) Still on the speed issue: I read some of your comments about the
>> "cpus" parameter in the maker_opts file
>> (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html),
>> and I understand that it sets the number of CPUs for a single chunk. So
>> if I set "cpus=2" in the maker_opts file, I can then use the following
>> command to submit the job, right?
>
> The cpus parameter only affects how many CPUs are given to the BLAST
> command line, so only the BLAST step will speed up. I recommend using MPI
> so that all steps speed up. Even if you are only running on a single
> node, you can give all CPUs to the mpiexec command.
>
>
> --Carson
>
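
As a rough worked example of the rule of thumb in the quoted reply above (6x for BLASTX, 12x for TBLASTX, runtime proportional to dataset size), the sketch below estimates how much slower the protein step might get when the 340k TrEMBL sequences are added to the 99k Swiss-Prot set. It is only an illustration: the function name and the use of sequence counts as a stand-in for dataset size are assumptions, and the factors are lower bounds, not measurements.

```python
# Rule-of-thumb cost factors from the reply above, relative to BLASTN.
# They are rough lower bounds, not benchmark results.
FRAME_FACTOR = {"blastn": 1, "blastx": 6, "tblastx": 12}


def relative_runtime(program, n_sequences, baseline_sequences):
    """Estimated runtime relative to BLASTN on the baseline dataset.

    Assumes runtime grows linearly with the number of evidence sequences
    ("double the dataset size, double the runtime"). Sequence count is
    used as a proxy for dataset size, which is an assumption.
    """
    return FRAME_FACTOR[program] * n_sequences / baseline_sequences


swissprot_only = relative_runtime("blastx", 99_000, 99_000)
with_trembl = relative_runtime("blastx", 99_000 + 340_000, 99_000)

print(f"BLASTX, Swiss-Prot only:     {swissprot_only:.1f}x")
print(f"BLASTX, Swiss-Prot + TrEMBL: {with_trembl:.1f}x")
print(f"slowdown from adding TrEMBL: {with_trembl / swissprot_only:.1f}x")
```

Under these assumptions, dropping the TrEMBL set would cut the BLASTX work by a factor of about 4.4 (439k versus 99k sequences).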

From xvazquezc at gmail.com  Sun Sep 10 19:03:11 2017
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Mon, 11 Sep 2017 11:03:11 +1000
Subject: [maker-devel] augustus underpredicting
Message-ID: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>

Hi,
I have been annotating a fungal genome as usual, using BUSCO-trained
Augustus (in addition to GeneMark and SNAP), but for some reason Augustus
is predicting a mere 207 genes, compared to 15-20k from the other two.
I've never had this problem before. The genome has an unusually high repeat
content, close to 50%; I'm not sure whether that might pose a problem.
Has anybody come across a similar issue?
I also asked the BUSCO developers whether they have any ideas:
https://gitlab.com/ezlab/busco/issues/49
Cheers,
Xabi

-- 
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From qwzhang0601 at gmail.com  Mon Sep 11 10:19:50 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 11 Sep 2017 12:19:50 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
Message-ID: <CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>

Dear Carson:

About the error in my previous email: I found that the contig was annotated
correctly on the second retry, so please ignore that email. Now, however,
for a few scaffolds I ran into problems processing the repeats (as shown
below). I used both the Mammalia repeat library and a species-specific
repeat library (generated with your pipeline,
"http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic").
There were no such problems when I used only the Mammalia repeat library.
Do you have any idea what the reason could be, or any suggestions for how I
could track it down? Many thanks.

Here are some parameters I used

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe

max_dna_len=300000
split_hit=40000
depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking


Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
--> rank=NA, hostname=n409
ERROR: Failed while processing all repeats
ERROR: Chunk failed at level:3, tier_type:1
FAILED CONTIG:Contig31


Best
Quanwei
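
When a run that used to work starts failing, it can help to compare the control-file settings of the two runs mechanically. The sketch below parses "key=value #comment" lines like the ones pasted above and reports which options differ. The helper names are hypothetical, the two sample snippets are reconstructed from the values mentioned in this thread (they are not complete maker_opts files), and the parser assumes that values never contain a "#" character.

```python
def parse_ctl(text):
    """Parse key=value lines, dropping '#' comments and blank lines."""
    options = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # assumes '#' only starts comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()
    return options


def diff_options(old, new):
    """Return {key: (old_value, new_value)} for every option that changed."""
    keys = sorted(old.keys() | new.keys())
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Reconstructed from the thread: the earlier run used only the Mammalia
# library with max_dna_len of 1 Mb; the current run adds rmlib and uses 300 kb.
earlier_run = """\
model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
rmlib= #provide an organism specific repeat library
max_dna_len=1000000
"""
current_run = """\
model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library
max_dna_len=300000
"""

changes = diff_options(parse_ctl(earlier_run), parse_ctl(current_run))
for key, (before, after) in changes.items():
    print(f"{key}: {before!r} -> {after!r}")
```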


From carsonhh at gmail.com  Mon Sep 11 10:48:16 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 10:48:16 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
Message-ID: <5C2477A3-CDBA-458A-95CA-E6DC912417B3@gmail.com>

It may be a memory issue or an I/O issue. Some resource is being taxed and creating a non-responsive bottleneck. If you are running MAKER multiple times in the same directory, you may have to run fewer processes. Also, if you are running without MPI, run with MPI instead, as it manages the parallelization better and uses fewer resources than multiple individual processes.

--Carson
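
For reference, a minimal sketch of a single-node MPI launch, built as an argument list in Python. It assumes MAKER was compiled with MPI support and that mpiexec and maker are on the PATH; the helper name is hypothetical, and the exact launcher options depend on the MPI implementation and the cluster's scheduler.

```python
import os
import shlex


def maker_mpi_command(n_procs=None):
    """Build an 'mpiexec -n N maker' command line as an argument list.

    Defaults to the number of CPUs visible on the current node.
    """
    if n_procs is None:
        n_procs = os.cpu_count() or 1
    return ["mpiexec", "-n", str(n_procs), "maker"]


cmd = maker_mpi_command(16)
print(shlex.join(cmd))

# To actually launch it from the directory holding the control files:
# import subprocess
# subprocess.run(cmd, check=True)
```

A single MPI job like this replaces several independent MAKER processes started in the same directory.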



From carsonhh at gmail.com  Mon Sep 11 10:50:41 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 10:50:41 -0600
Subject: [maker-devel] augustus underpredicting
In-Reply-To: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>
References: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>
Message-ID: <E0AFDA01-3142-4A14-A4F0-E621E5BD4C2F@gmail.com>

BUSCO may be generating too few models. BUSCO also identifies classes of conserved short genes that may not represent enough training diversity for your organism. Try running MAKER in protein2genome or est2genome mode, and then train with those results.

--Carson



From carsonhh at gmail.com  Mon Sep 11 11:07:12 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 11:07:12 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
Message-ID: <92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>

I think the cause of the error may have been a little further upstream from what you pasted in the e-mail. One thing that may be happening is that you are taxing resources (like I/O) by running MAKER multiple times or on too many CPUs. That can lead to failures because of truncated BLAST reports, etc., in which case you can just retry, and that will get around those types of I/O-derived errors. MAKER can generate a lot of I/O, and network-mounted locations (i.e. the storage being used is actually across the network) can be less robust than local storage: under heavy load, NFS can falsely report success on read/write operations that actually failed. That is the reason we built retry capabilities into MAKER.

For contigs that continuously fail, you may need to set clean_try=1. That will cause failures to start from scratch (i.e. delete all old reports on failure rather than just those suspected of being truncated).

--Carson
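
To see which contigs fail continuously, and so might be candidates for clean_try=1, one option is to tally the "FAILED CONTIG:" lines in the run log. The sketch below is only an illustration: the function name is hypothetical, and the sample text is stitched together from the two log excerpts posted in this thread rather than taken from one real log.

```python
import re
from collections import Counter

# Matches log lines such as "FAILED CONTIG:Contig10".
FAILED = re.compile(r"^FAILED CONTIG:(\S+)")


def failed_contigs(log_text):
    """Count 'FAILED CONTIG:' report lines per contig ID."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


# Stitched together from the excerpts quoted in this thread.
sample_log = """\
ERROR: Failed while doing repeat masking
ERROR: Chunk failed at level:0, tier_type:1
FAILED CONTIG:Contig10

ERROR: Chunk failed at level:2, tier_type:0
FAILED CONTIG:Contig10

ERROR: Failed while processing all repeats
ERROR: Chunk failed at level:3, tier_type:1
FAILED CONTIG:Contig31
"""

counts = failed_contigs(sample_log)
for contig, n in counts.most_common():
    print(contig, n)
```

Note that one failure can be reported at more than one level (Contig10 appears twice above for what may be a single failed attempt), so the counts are report lines, not distinct retries.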


> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> About the error in my above email, I found the contig was correctly annotated at the second time RETRY. So please ignore my last email. But now, for a few number of scaffolds, I met problems to process the repeats (as shown below in red). I used both Mammalia repeat library and species specific repeat library (which is generated by your pipeline "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic <http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic>"). There were no such problems when I only used Mammalia repeat library. Do you have any ideas about this? What could be the reason? Or do you have any suggestions for me to find the reason? Many thanks  
> 
> Here are some parameters I used
> 
> #-----Repeat Masking (leave values blank to skip repeat masking)
> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
> 
> max_dna_len=300000
> split_hit=40000
> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
> 
> 
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
> 33708 --> rank=NA, hostname=n409
> 33709 ERROR: Failed while processing all repeats
> 33710 ERROR: Chunk failed at level:3, tier_type:1
> 33711 FAILED CONTIG:Contig31
> 
> 
> Best
> Quanwei
> 
> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com <mailto:qwzhang0601 at gmail.com>>:
> Dear Carson:
> 
> I got the following error again. Is this still related to memory issues? I wonder whether there can be other reasons lead to this error? This time, I got this error during training of the SNAP model. Before, even I set  max_dna_len=1Mb, I can train the model successfully.  And in the current training (where I get the following error),  I have decreased the max_dna_len to 300kb. I required the same amount memory as before. The only difference is that I am using both mammalian repeat library and species specific repeat library, while previously I only use the mammalian repeat library. Will it greatly increases the requirement of memory to use both repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? I have also set the depth_blast as 30 in current training.
> 
> Thank you! Have a nice weekend! 
> 
> 
> 
> #---------------------------------------------------------------------
> Now starting the contig!!
> SeqID: Contig10
> Length: 18773588
> #---------------------------------------------------------------------
> 
> 
> setting up GFF3 output and fasta chunks
> doing repeat masking
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> collecting blastx repeatmasking
> processing all repeats
> doing repeat masking
> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
> --> rank=NA, hostname=n224
> ERROR: Failed while doing repeat masking
> ERROR: Chunk failed at level:0, tier_type:1
> FAILED CONTIG:Contig10
> 
> ERROR: Chunk failed at level:2, tier_type:0
> FAILED CONTIG:Contig10
> 
> Best
> Quanwei
> 
> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>>:
> 
>> (2) By reading some of your replies in the maker google group, and I noticed that it can reduce memory and save time for annotation if I set depth_blast to a certain number. So I changed the following parameters. But I wonder, whether it will decrease the quality of annotation? If it won't affect the quality, can I even use a smaller number (e.g., 20) to save more memory and time?
>> 
>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
> 
> This values really only affects the final evidence kept in the GFF3 when you look at it in a browser. It has not affect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. by setting it lower, you are just letting MAKER know it can through things away even sooner since you don?t want them in the GFF3. It provides a minor improvement for memory use, but max_dna_length is the big one that has the greatest effect.
> 
> 
>> (3) I also have some concerns about speed, especially for the long scaffolds (around 100Mb). I wonder which part of genome annotation is the most time consuming (repeat masking, BLAST, or polishing?). In particular, I wonder whether the blastx of protein evidence will take the majority of the time. I have prepared 99k mammalian Swiss-Prot protein sequences and 340k rodent TrEMBL protein sequences as protein evidence. I am considering whether I can save much time if I only use the 99k mammalian Swiss-Prot sequences as evidence.
> 
> BLASTN (ESTs) -> fastest as it is searching nucleotide space
> BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN
> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX
> 
> Also, doubling the dataset size doubles the runtime. Larger window sizes via max_dna_len will also increase runtimes.
> 
> 
>> (4) For some reason, I cannot run MAKER through MPI on our cluster, so I can only start multiple MAKER instances. I wonder whether it is possible to let multiple MAKER instances annotate the same long scaffold (i.e., start multiple instances for a single sequence without splitting the long sequence into shorter ones).
> 
> Without MPI you won't be able to split up large contigs. At the very least you can try to run on a single node and set MPI to use all CPUs on that node. It's less difficult to set up than cross-node jobs via MPI.
> 
> 
>> (5) Still about the speed issue. I read some of your comments about the "cpus" parameter in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html). And I know it indicates the number of CPUs for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?
> 
> The cpus parameter only affects how many CPUs are given to the BLAST command line, so only the BLAST step will speed up. I recommend using MPI to get all steps to speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
> 
> 
> --Carson
> 
> 
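
The rules of thumb quoted above (BLASTX at least 6 times slower than BLASTN, TBLASTX at least 12 times slower, and runtime doubling with dataset size) can be turned into a rough back-of-the-envelope estimate. The sketch below is only that: the factors are the "at least" lower bounds from the reply, linear scaling with dataset size is taken at face value, sequence counts stand in for true database size, and the function name is made up for illustration.

```python
# Lower-bound slowdown factors relative to BLASTN, taken from the reply above.
FRAME_FACTOR = {"blastn": 1, "blastx": 6, "tblastx": 12}


def relative_runtime(program, n_sequences, baseline_sequences):
    """Rough runtime relative to a BLASTN search against baseline_sequences.

    Assumes runtime grows linearly with dataset size ("double the dataset
    size, double the runtime"); real runs will deviate from this.
    """
    if baseline_sequences <= 0:
        raise ValueError("baseline_sequences must be positive")
    return FRAME_FACTOR[program] * n_sequences / baseline_sequences


# Protein sets mentioned in the thread: 99k Swiss-Prot alone versus
# 99k Swiss-Prot plus 340k TrEMBL sequences, both searched with BLASTX.
swiss_only = relative_runtime("blastx", 99_000, 99_000)
swiss_plus_trembl = relative_runtime("blastx", 99_000 + 340_000, 99_000)
print(f"{swiss_plus_trembl / swiss_only:.1f}x")  # prints "4.4x"
```

Under these assumptions, dropping the TrEMBL set would cut the protein BLASTX time to a bit under a quarter. Sequence lengths differ between databases, so the real saving may be larger or smaller.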


From qwzhang0601 at gmail.com  Mon Sep 11 11:12:29 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 11 Sep 2017 13:12:29 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
Message-ID: <CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>

Dear Carson:

I only run 5 MAKER instances in each directory (with cpus=2). If this is
a memory or IO issue, I am not sure why the much longer scaffolds were all
annotated successfully while the relatively shorter ones failed.

I have set "tries=5" (#number of times to try a contig if there is a
failure for some reason). I will try "clean_try=1" and test the failed
scaffolds individually with more memory to see whether they can be
annotated.
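
A minimal sketch of that settings change, assuming the control file is the usual maker_opts.ctl with one `key=value` per line and an optional trailing `#` comment. The helper name `set_option` and the starting values in the example are hypothetical and not part of MAKER.

```python
def set_option(ctl_text, key, value):
    """Return ctl_text with `key` set to `value`, keeping any trailing comment."""
    out, found = [], False
    for line in ctl_text.splitlines():
        # Split "key=value #comment" into the setting and its comment.
        setting, sep, comment = line.partition("#")
        if "=" in setting and setting.split("=", 1)[0].strip() == key:
            line = f"{key}={value}" + (f" #{comment}" if sep else "")
            found = True
        out.append(line)
    if not found:
        raise KeyError(key)
    return "\n".join(out) + "\n"


# Placeholder starting values; only the two option names come from the thread.
example = (
    "tries=2 #number of times to try a contig if there is a failure for some reason\n"
    "clean_try=0\n"
)
updated = set_option(set_option(example, "tries", 5), "clean_try", 1)
print(updated, end="")
```

Editing the file by hand works just as well. A helper like this is mainly useful when the same change has to be applied to several run directories.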

Thank you!

Best
Quanwei

2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:

> I think the cause of the error may have been a little further upstream
> from what you pasted in the e-mail. One thing that may be happening is that
> you are taxing resources (like IO) by running MAKER multiple times or on
> too many CPUs. That can lead to failures because of truncated BLAST reports,
> etc. In that case you can just retry, and that will get around those types
> of IO-derived errors. MAKER can generate a lot of IO, and if you are
> working on network-mounted locations (i.e. the storage being used is
> actually across the network), they can be less robust than local
> storage (under heavy load, NFS can falsely report success on read/write
> operations that actually failed). It's the reason we built the retry
> capabilities into MAKER.
>
> For contigs that continuously fail, you may need to set clean_try=1. That
> will cause failures to start from scratch (i.e. delete all old reports on
> failure rather than just those suspected of being truncated).
>
> --Carson
>
>
>
> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>
> Dear Carson:
>
> About the error in my previous email: I found that the contig was annotated
> correctly on the second retry, so please ignore that email. But now, for a
> few scaffolds, I am having problems processing the repeats (as shown
> below). I used both the Mammalia repeat library and a species-specific
> repeat library (generated with your pipeline,
> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic).
> There were no such problems when I only used the Mammalia repeat library.
> Do you have any idea what the reason could be, or any suggestions for how
> I can find it? Many thanks.
>
> Here are some parameters I used
>
> #-----Repeat Masking (leave values blank to skip repeat masking)
> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
>
> max_dna_len=300000
> split_hit=40000
> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>
>
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
> 33708 --> rank=NA, hostname=n409
> 33709 ERROR: Failed while processing all repeats
> 33710 ERROR: Chunk failed at level:3, tier_type:1
> 33711 FAILED CONTIG:Contig31
>
>
> Best
> Quanwei
>
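
To rerun failed scaffolds individually, their IDs first have to be collected from the run log. The sketch below pulls them from lines of the form `FAILED CONTIG:<id>` as they appear in the quoted output. The function name is hypothetical, and the sample log is stitched together from the messages in this thread.

```python
import re

# Matches the "FAILED CONTIG:<id>" lines that MAKER prints in the logs above.
FAILED_RE = re.compile(r"FAILED CONTIG:\s*(\S+)")


def failed_contigs(log_text):
    """Return contig IDs reported as failed, in first-seen order, no repeats."""
    seen = {}
    for match in FAILED_RE.finditer(log_text):
        seen.setdefault(match.group(1), None)
    return list(seen)


sample_log = """\
ERROR: Failed while doing repeat masking
ERROR: Chunk failed at level:0, tier_type:1
FAILED CONTIG:Contig10

ERROR: Chunk failed at level:2, tier_type:0
FAILED CONTIG:Contig10
33711 FAILED CONTIG:Contig31
"""
print(failed_contigs(sample_log))  # prints ['Contig10', 'Contig31']
```

A contig that failed once and then succeeded on a retry still shows up in this list, so treat the result as candidates to check rather than a final status.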

From carsonhh at gmail.com  Mon Sep 11 11:14:11 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 11:14:11 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
Message-ID: <488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>

It could be either. Please use MPI instead of starting multiple instances. It will greatly reduce both IO and RAM usage.
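
A minimal sketch of what a single-node MPI launch could look like, assuming MAKER was built with MPI support, `mpiexec` and `maker` are on the PATH, and the control files sit in the current directory. The helper name and the process count of 8 are placeholders. The sketch only builds and prints the command so that it can be inspected before anything is run.

```python
import shlex


def maker_mpi_command(n_procs, mpiexec="mpiexec", maker="maker"):
    """Build an mpiexec command line that starts n_procs MAKER processes."""
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    return [mpiexec, "-n", str(n_procs), maker]


cmd = maker_mpi_command(8)  # 8 is a placeholder for the cores on the node
print(shlex.join(cmd))  # prints "mpiexec -n 8 maker"

# To actually launch (uncomment once the command looks right):
# import subprocess
# subprocess.run(cmd, check=True)
```

One job started this way takes the place of the several independent MAKER instances per directory described earlier in the thread.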

--Carson


> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> I only run 5 MAKER instances in each directory (with cpus=2). If this is a memory or IO issue, I am not sure why the much longer scaffolds were all annotated successfully while the relatively shorter ones failed.
> 
> I have set "tries=5" (#number of times to try a contig if there is a failure for some reason). I will try "clean_try=1" and test the failed scaffolds individually with more memory to see whether they can be annotated.
> 
> Thank you!
> 
> Best
> Quanwei
> 


From qwzhang0601 at gmail.com  Mon Sep 11 11:16:49 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 11 Sep 2017 13:16:49 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
Message-ID: <CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>

Dear Carson:

I ran into some problems using MPI. I will give it another try.
Thank you!

Best
Quanwei

2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:

> It could be either. Please use MPI instead of starting multiple instances.
> It will greatly reduce both IO and RAM usage.
>
> --Carson
>
>
>
> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>
> Dear Carson:
>
> I only run 5 Maker instances in each directory (and set cpus=2). If it is
> related to memory issue or an IO issue, I am not sure why the much longer
> scaffolds (than the failed ones) were all annotated successfully, but the
> relatively shorter ones failed.
>
> I have set "tries=5" (#number of times to try a contig if there is a
> failure for some reason). I will try "clean_try=1" and test on the failed
> scaffolds individually with larger memory to see whether they can be
> annotated.
>
> Thank you!
>
> Best
> Quanwei
>
> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>
>> I think the cause of the error may have been a little further upstream
>> from what you pasted in the e-mail. One thing that may be happening is that
>> you are taxing resources (like IO) if running MAKER multiple times or on
>> too many CPUs. That can lead to failures because of truncated BLAST reports
>> etc. In which case you can just retry and that will get around those types
>> of IO derived errors. MAKER can generate a lot of IO, and if you are
>> working on network mounted locations (i.e. the storage being used is
>> actually across the network), then they can be lest robust than local
>> storage (when under heavy load NFS can falsely report success on read/write
>> operations that actually failed). It?s the reason we built in the retry
>> capabilities of MAKER.
>>
>> For contigs that continuously fail, you may need to set clean_try=1. That
>> will cause failures to start from scratch (i.e. delete all old reports on
>> failure rather than just those suspected of being truncated).
>>
>> ?Carson
>>
>>
>>
>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>> wrote:
>>
>> Dear Carson:
>>
>> About the error in my above email, I found the contig was correctly
>> annotated at the second time RETRY. So please ignore my last email. But
>> now, for a few number of scaffolds, I met problems to process the repeats
>> (as shown below in red). I used both Mammalia repeat library and species
>> specific repeat library (which is generated by your pipeline "
>> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Rep
>> eat_Library_Construction--Basic"). There were no such problems when I
>> only used Mammalia repeat library. Do you have any ideas about this? What
>> could be the reason? Or do you have any suggestions for me to find the
>> reason? Many thanks
>>
>> Here are some parameters I used
>>
>> #-----Repeat Masking (leave values blank to skip repeat masking)
>> model_org=Mammalia #select a model organism for RepBase masking in
>> RepeatMasker
>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific
>> repeat library in fasta format for Repe
>>
>> max_dna_len=300000
>> split_hit=40000
>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>
>>
>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm
>> line 188.
>> 33708 --> rank=NA, hostname=n409
>> 33709 ERROR: Failed while processing all repeats
>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>> 33711 FAILED CONTIG:Contig31
>>
>>
>> Best
>> Quanwei
>>
>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>
>>> Dear Carson:
>>>
>>> I got the following error again. Is this still related to memory issues?
>>> I wonder whether there can be other reasons lead to this error? This time,
>>> I got this error during training of the SNAP model. Before, even I set
>>> max_dna_len=1Mb, I can train the model successfully.  And in the current
>>> training (where I get the following error),  I have decreased the
>>> max_dna_len to 300kb. I required the same amount memory as before. The only
>>> difference is that I am using both mammalian repeat library and species
>>> specific repeat library, while previously I only use the mammalian repeat
>>> library. Will it greatly increases the requirement of memory to use both
>>> repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? I
>>> have also set the depth_blast as 30 in current training.
>>>
>>> Thank you! Have a nice weekend!
>>>
>>>
>>>
>>> #---------------------------------------------------------------------
>>> Now starting the contig!!
>>> SeqID: Contig10
>>> Length: 18773588
>>> #---------------------------------------------------------------------
>>>
>>>
>>> setting up GFF3 output and fasta chunks
>>> doing repeat masking
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> collecting blastx repeatmasking
>>> processing all repeats
>>> doing repeat masking
>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm
>>> line 1050.
>>> --> rank=NA, hostname=n224
>>> ERROR: Failed while doing repeat masking
>>> ERROR: Chunk failed at level:0, tier_type:1
>>> FAILED CONTIG:Contig10
>>>
>>> ERROR: Chunk failed at level:2, tier_type:0
>>> FAILED CONTIG:Contig10
>>>
>>> Best
>>> Quanwei
>>>
>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>
>>>>
>>>> (2) By reading some of your replies in the maker google group, and I
>>>> noticed that it can reduce memory and save time for annotation if I set
>>>> depth_blast to a certain number. So I changed the following parameters. But
>>>> I wonder, whether it will decrease the quality of annotation? If it won't
>>>> affect the quality, can I even use a smaller number (e.g., 20) to save more
>>>> memory and time?
>>>>
>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>
>>>>
>>>> This values really only affects the final evidence kept in the GFF3
>>>> when you look at it in a browser. It has not affect on the annotation. This
>>>> is because internally MAKER already collapses evidence down to the 10 best
>>>> non-redundant features per evidence set per locus. The rest are put in the
>>>> GFF3 just for reference. by setting it lower, you are just letting MAKER
>>>> know it can through things away even sooner since you don?t want them in
>>>> the GFF3. It provides a minor improvement for memory use, but
>>>> max_dna_length is the big one that has the greatest effect.
>>>>
>>>>
>>>> (3) I also have some concerns about the speed, especially for the long
>>>> scaffolds (around 100Mb). I wonder which part is the most time consuming
>>>> for genome annotation (repeat masking, blast, or polishing?).
>>>> Particularly, I wonder whether the blastx of protein evidence will take
>>>> majority of time. Now, I have prepared 99k mammalian Swiss protein
>>>> sequences and 340k rodent TrEMBL protein sequences as protein evidences. I
>>>> am considering whether I can save much time if I only use the 99k mammalian
>>>> Swiss protein sequences as evidences.
>>>>
>>>>
>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>> BLASTX (proteins) -> must search 6 reading frames so will be at least 6
>>>> times slower than BLASTN
>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least
>>>> 12 times slower than BLASTN and twice as slow as BLASTX
>>>>
>>>> Also double the dataset size, double the runtime. Larger window sizes
>>>> via max_dna_length will also increase runtimes.
>>>>
>>>>
>>>> (4) For some reasons, I can not run maker though MPI on our cluster. So
>>>> I can only start multiple maker. I wonder if it is possible to let multiple
>>>> maker to annotate the same long scaffold (i.e., for a single sequence I
>>>> start multiple maker, without splitting the long sequence into shorter
>>>> ones).
>>>>
>>>>
>>>> Without MPI you won't be able to split up large contigs. At the very
>>>> least you can try and run on a single node and set MPI to use all CPUs on
>>>> that node. It's less difficult to set up compared to cross node jobs via
>>>> MPI.
>>>>
>>>>
>>>> (5) Still about the speed issue. I read some of your comments about the
>>>> "cpus" parameter in the maker_opts file
>>>> (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html).
>>>> I know it indicates the number of CPUs for a single chunk. So if I set
>>>> "cpus=2" in the maker_opts file, then I can use the following command to
>>>> submit the job, right?
>>>>
>>>>
>>>> The cpus parameter only affects how many CPUs are given to the BLAST
>>>> command line, so only the BLAST step will speed up. I recommend using
>>>> MPI to get all steps to speed up. Even if you are only running on a single
>>>> node, you can give all CPUs to the mpiexec command.
>>>>
>>>>
>>>> --Carson
>>>>
>>>
>>>
>>
>>
>
>

From carsonhh at gmail.com  Mon Sep 11 11:18:14 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 11:18:14 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
Message-ID: <3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>

If you are just using a single machine (and not cross-machine MPI), use MPICH3 -> https://www.mpich.org

It's easy to install yourself, and tends to be very robust to failure.

--Carson


> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> I met some problems to use MPI. I will give it another try.
> Thank you!
> 
> Best
> Quanwei
> 
> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
> It could be either. Please use MPI instead of starting multiple instances. It will greatly reduce both IO and RAM usage.
> 
> --Carson
> 
> 
> 
>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>> 
>> Dear Carson:
>> 
>> I only run 5 Maker instances in each directory (and set cpus=2). If it is related to memory issue or an IO issue, I am not sure why the much longer scaffolds (than the failed ones) were all annotated successfully, but the relatively shorter ones failed.  
>> 
>> I have set "tries=5" (#number of times to try a contig if there is a failure for some reason). I will try "clean_try=1" and test on the failed scaffolds individually with larger memory to see whether they can be annotated. 
>> 
>> Thank you!
>> 
>> Best
>> Quanwei
>> 
>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>> I think the cause of the error may have been a little further upstream from what you pasted in the e-mail. One thing that may be happening is that you are taxing resources (like IO) if running MAKER multiple times or on too many CPUs. That can lead to failures because of truncated BLAST reports etc., in which case you can just retry and that will get around those types of IO-derived errors. MAKER can generate a lot of IO, and if you are working on network mounted locations (i.e. the storage being used is actually across the network), then they can be less robust than local storage (under heavy load, NFS can falsely report success on read/write operations that actually failed). It's the reason we built the retry capabilities into MAKER.
>> 
>> For contigs that continuously fail, you may need to set clean_try=1. That will cause failures to start from scratch (i.e. delete all old reports on failure rather than just those suspected of being truncated).
>> 
>> --Carson
>> 
>> 
>> 
>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>> 
>>> Dear Carson:
>>> 
>>> About the error in my above email, I found the contig was correctly annotated at the second time RETRY. So please ignore my last email. But now, for a few number of scaffolds, I met problems to process the repeats (as shown below in red). I used both Mammalia repeat library and species specific repeat library (which is generated by your pipeline "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic"). There were no such problems when I only used Mammalia repeat library. Do you have any ideas about this? What could be the reason? Or do you have any suggestions for me to find the reason? Many thanks
>>> 
>>> Here are some parameters I used
>>> 
>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
>>> 
>>> max_dna_len=300000
>>> split_hit=40000
>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>> 
>>> 
>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>>> 33708 --> rank=NA, hostname=n409
>>> 33709 ERROR: Failed while processing all repeats
>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>> 33711 FAILED CONTIG:Contig31
>>> 
>>> 
>>> Best
>>> Quanwei
>>> 
>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>> Dear Carson:
>>> 
>>> I got the following error again. Is this still related to memory issues? I wonder whether there can be other reasons lead to this error? This time, I got this error during training of the SNAP model. Before, even I set  max_dna_len=1Mb, I can train the model successfully.  And in the current training (where I get the following error),  I have decreased the max_dna_len to 300kb. I required the same amount memory as before. The only difference is that I am using both mammalian repeat library and species specific repeat library, while previously I only use the mammalian repeat library. Will it greatly increases the requirement of memory to use both repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? I have also set the depth_blast as 30 in current training.
>>> 
>>> Thank you! Have a nice weekend! 
>>> 
>>> 
>>> 
>>> #---------------------------------------------------------------------
>>> Now starting the contig!!
>>> SeqID: Contig10
>>> Length: 18773588
>>> #---------------------------------------------------------------------
>>> 
>>> 
>>> setting up GFF3 output and fasta chunks
>>> doing repeat masking
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> collecting blastx repeatmasking
>>> processing all repeats
>>> doing repeat masking
>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
>>> --> rank=NA, hostname=n224
>>> ERROR: Failed while doing repeat masking
>>> ERROR: Chunk failed at level:0, tier_type:1
>>> FAILED CONTIG:Contig10
>>> 
>>> ERROR: Chunk failed at level:2, tier_type:0
>>> FAILED CONTIG:Contig10
>>> 
>>> Best
>>> Quanwei
>>> 
>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>> 
>>>> (2) By reading some of your replies in the maker google group, and I noticed that it can reduce memory and save time for annotation if I set depth_blast to a certain number. So I changed the following parameters. But I wonder, whether it will decrease the quality of annotation? If it won't affect the quality, can I even use a smaller number (e.g., 20) to save more memory and time?
>>>> 
>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>> 
>>> These values really only affect the final evidence kept in the GFF3 when you look at it in a browser. They have no effect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. By setting them lower, you are just letting MAKER know it can throw things away even sooner since you don't want them in the GFF3. It provides a minor improvement for memory use, but max_dna_len is the big one that has the greatest effect.
>>> 
>>> 
>>>> (3) I also have some concerns about the speed, especially for the long scaffolds (around 100Mb). I wonder which part is the most time consuming for genome annotation (repeat masking, blast, or polishing?).  Particularly, I wonder whether the blastx of protein evidence will take majority of time. Now, I have prepared 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences as protein evidences. I am considering whether I can save much time if I only use the 99k mammalian Swiss protein sequences as evidences.
>>> 
>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>> BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN
>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX
>>> 
>>> Also, doubling the dataset size doubles the runtime. Larger window sizes via max_dna_len will also increase runtimes.
>>> 
>>> 
>>>> (4) For some reasons, I can not run maker though MPI on our cluster. So I can only start multiple maker. I wonder if it is possible to let multiple maker to annotate the same long scaffold (i.e., for a single sequence I start multiple maker, without splitting the long sequence into shorter ones).
>>> 
>>> Without MPI you won't be able to split up large contigs. At the very least you can try and run on a single node and set MPI to use all CPUs on that node. It's less difficult to set up compared to cross node jobs via MPI.
>>> 
>>> 
>>>> (5) Still about the speed issue. I read some of your comments about the "cpus" parameter in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html). I know it indicates the number of CPUs for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?
>>> 
>>> The cpus parameter only affects how many CPUs are given to the BLAST command line, so only the BLAST step will speed up. I recommend using MPI to get all steps to speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
>>> 
>>> 
>>> --Carson
>>> 
>>> 
>> 
>> 
> 
> 


From qwzhang0601 at gmail.com  Mon Sep 11 11:27:22 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 11 Sep 2017 13:27:22 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
Message-ID: <CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>

Dear Carson:

Would you please explain what you mean by "a single machine"? I am
running maker2 on our high performance cluster. The cluster has more than
1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine is used
as the scheduler. Can I use MPICH3?

Thanks

Best
Quanwei

2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:

> If you are just using a single machine (and not cross-machine MPI), use
> MPICH3 -> https://www.mpich.org
>
> It's easy to install yourself, and tends to be very robust to failure.
>
> --Carson
>
>
>
> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>
> Dear Carson:
>
> I met some problems to use MPI. I will give it another try.
> Thank you!
>
> Best
> Quanwei
>
> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>
>> It could be either. Please use MPI instead of starting multiple
>> instances. It will greatly reduce both IO and RAM usage.
>>
>> --Carson
>>
>>
>>
>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>> wrote:
>>
>> Dear Carson:
>>
>> I only run 5 Maker instances in each directory (and set cpus=2). If it is
>> related to memory issue or an IO issue, I am not sure why the much longer
>> scaffolds (than the failed ones) were all annotated successfully, but the
>> relatively shorter ones failed.
>>
>> I have set "tries=5" (#number of times to try a contig if there is a
>> failure for some reason). I will try "clean_try=1" and test on the failed
>> scaffolds individually with larger memory to see whether they can be
>> annotated.
>>
>> Thank you!
>>
>> Best
>> Quanwei
>>
>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>
>>> I think the cause of the error may have been a little further upstream
>>> from what you pasted in the e-mail. One thing that may be happening is that
>>> you are taxing resources (like IO) if running MAKER multiple times or on
>>> too many CPUs. That can lead to failures because of truncated BLAST reports
>>> etc., in which case you can just retry and that will get around those types
>>> of IO-derived errors. MAKER can generate a lot of IO, and if you are
>>> working on network mounted locations (i.e. the storage being used is
>>> actually across the network), then they can be less robust than local
>>> storage (under heavy load, NFS can falsely report success on read/write
>>> operations that actually failed). It's the reason we built the retry
>>> capabilities into MAKER.
>>>
>>> For contigs that continuously fail, you may need to set clean_try=1.
>>> That will cause failures to start from scratch (i.e. delete all old reports
>>> on failure rather than just those suspected of being truncated).
>>>
>>> --Carson
>>>
>>>
>>>
>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>> wrote:
>>>
>>> Dear Carson:
>>>
>>> About the error in my above email, I found the contig was correctly
>>> annotated at the second time RETRY. So please ignore my last email. But
>>> now, for a few number of scaffolds, I met problems to process the repeats
>>> (as shown below in red). I used both Mammalia repeat library and species
>>> specific repeat library (which is generated by your pipeline
>>> "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic").
>>> There were no such problems when I
>>> only used Mammalia repeat library. Do you have any ideas about this? What
>>> could be the reason? Or do you have any suggestions for me to find the
>>> reason? Many thanks
>>>
>>> Here are some parameters I used
>>>
>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>> model_org=Mammalia #select a model organism for RepBase masking in
>>> RepeatMasker
>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism
>>> specific repeat library in fasta format for Repe
>>>
>>> max_dna_len=300000
>>> split_hit=40000
>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>
>>>
>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>>> 33708 --> rank=NA, hostname=n409
>>> 33709 ERROR: Failed while processing all repeats
>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>> 33711 FAILED CONTIG:Contig31
>>>
>>>
>>> Best
>>> Quanwei
>>>
>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>>
>>>> Dear Carson:
>>>>
>>>> I got the following error again. Is this still related to memory
>>>> issues? I wonder whether there can be other reasons lead to this error?
>>>> This time, I got this error during training of the SNAP model. Before, even
>>>> I set  max_dna_len=1Mb, I can train the model successfully.  And in the
>>>> current training (where I get the following error),  I have decreased the
>>>> max_dna_len to 300kb. I required the same amount memory as before. The only
>>>> difference is that I am using both mammalian repeat library and species
>>>> specific repeat library, while previously I only use the mammalian repeat
>>>> library. Will it greatly increases the requirement of memory to use both
>>>> repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? I
>>>> have also set the depth_blast as 30 in current training.
>>>>
>>>> Thank you! Have a nice weekend!
>>>>
>>>>
>>>>
>>>> #---------------------------------------------------------------------
>>>> Now starting the contig!!
>>>> SeqID: Contig10
>>>> Length: 18773588
>>>> #---------------------------------------------------------------------
>>>>
>>>>
>>>> setting up GFF3 output and fasta chunks
>>>> doing repeat masking
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> collecting blastx repeatmasking
>>>> processing all repeats
>>>> doing repeat masking
>>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
>>>> --> rank=NA, hostname=n224
>>>> ERROR: Failed while doing repeat masking
>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>> FAILED CONTIG:Contig10
>>>>
>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>> FAILED CONTIG:Contig10
>>>>
>>>> Best
>>>> Quanwei
>>>>
>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>
>>>>>
>>>>> (2) By reading some of your replies in the maker google group, and I
>>>>> noticed that it can reduce memory and save time for annotation if I set
>>>>> depth_blast to a certain number. So I changed the following parameters. But
>>>>> I wonder, whether it will decrease the quality of annotation? If it won't
>>>>> affect the quality, can I even use a smaller number (e.g., 20) to save more
>>>>> memory and time?
>>>>>
>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>
>>>>>
>>>>> These values really only affect the final evidence kept in the GFF3
>>>>> when you look at it in a browser. They have no effect on the annotation. This
>>>>> is because internally MAKER already collapses evidence down to the 10 best
>>>>> non-redundant features per evidence set per locus. The rest are put in the
>>>>> GFF3 just for reference. By setting them lower, you are just letting MAKER
>>>>> know it can throw things away even sooner since you don't want them in
>>>>> the GFF3. It provides a minor improvement for memory use, but
>>>>> max_dna_len is the big one that has the greatest effect.
>>>>>
>>>>>
>>>>> (3) I also have some concerns about the speed, especially for the long
>>>>> scaffolds (around 100Mb). I wonder which part is the most time consuming
>>>>> for genome annotation (repeat masking, blast, or polishing?).
>>>>> Particularly, I wonder whether the blastx of protein evidence will take
>>>>> majority of time. Now, I have prepared 99k mammalian Swiss protein
>>>>> sequences and 340k rodent TrEMBL protein sequences as protein evidences. I
>>>>> am considering whether I can save much time if I only use the 99k mammalian
>>>>> Swiss protein sequences as evidences.
>>>>>
>>>>>
>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at least
>>>>> 6 times slower than BLASTN
>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at
>>>>> least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>>
>>>>> Also, doubling the dataset size doubles the runtime. Larger window sizes
>>>>> via max_dna_len will also increase runtimes.
>>>>>
>>>>>
>>>>> (4) For some reasons, I can not run maker though MPI on our cluster.
>>>>> So I can only start multiple maker. I wonder if it is possible to let
>>>>> multiple maker to annotate the same long scaffold (i.e., for a single
>>>>> sequence I start multiple maker, without splitting the long sequence into
>>>>> shorter ones).
>>>>>
>>>>>
>>>>> Without MPI you won't be able to split up large contigs. At the very
>>>>> least you can try and run on a single node and set MPI to use all CPUs on
>>>>> that node. It's less difficult to set up compared to cross node jobs via
>>>>> MPI.
>>>>>
>>>>>
>>>>> (5) Still about the speed issue. I read some of your comments about the
>>>>> "cpus" parameter in the maker_opts file
>>>>> (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html).
>>>>> I know it indicates the number of CPUs for a single chunk. So if I set
>>>>> "cpus=2" in the maker_opts file, then I can use the following command to
>>>>> submit the job, right?
>>>>>
>>>>>
>>>>> The cpus parameter only affects how many CPUs are given to the BLAST
>>>>> command line, so only the BLAST step will speed up. I recommend using
>>>>> MPI to get all steps to speed up. Even if you are only running on a single
>>>>> node, you can give all CPUs to the mpiexec command.
>>>>>
>>>>>
>>>>> --Carson
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>

From carsonhh at gmail.com  Mon Sep 11 11:46:39 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 11:46:39 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
Message-ID: <A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>

Each node is a single machine. Because you currently run without MPI, each MAKER job you submit runs on a single machine. So you are either running multiple times on the same node, or you submitted 5 separate batch jobs in which case you may have a single maker process on each of 5 nodes.

MPI can parallelize on the same node or across nodes. If you request 10 nodes, then it can communicate across nodes to run the job on all hardware. Or you can run MPI on a single node and ask for all CPUs on that node. In that case it will split up work within a single node and use only the resources on that node. So if you can't get MPI to work across nodes, you can just submit a job that goes to a single node and ask for all CPUs on that node (multinode jobs may be hard to configure, but single node jobs are very easy). Just set the -n parameter of mpiexec to the CPU count of that node, and it will parallelize within the node.

Example command for a 20-CPU node -> mpiexec -n 20 maker
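
A minimal sketch of the single-node setup described above, written in Python purely for illustration. The helper name `build_maker_mpi_command` and its parameters are hypothetical (they are not part of MAKER or MPICH); only the `mpiexec -n <count> maker` form comes from this thread. Note that `os.cpu_count()` reports the CPUs visible on the machine, which may be more than the scheduler actually allocated to the job, so passing the count explicitly is probably safer under a batch system.

```python
import os
import shlex


def build_maker_mpi_command(cpu_count=None, maker_exe="maker"):
    """Return the argv list for a single-node MPI run of MAKER.

    cpu_count: number of MPI processes to start; when None, falls back to
               the CPU count reported by the operating system (assumption:
               the job owns the whole node).
    maker_exe: name or path of the maker executable (hypothetical default).
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    # Same shape as the command in the message: mpiexec -n <count> maker
    return ["mpiexec", "-n", str(cpu_count), maker_exe]


if __name__ == "__main__":
    # Print the command for a 20-CPU node without running it.
    print(shlex.join(build_maker_mpi_command(20)))
```

To actually launch the job, the returned list could be handed to `subprocess.run(cmd, check=True)` from the directory that holds the MAKER control files.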

--Carson


> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson: 
> 
> Would you please explain what do you mean by "a single machine"? I am running maker2 on our high performance cluster. The cluster has more than 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine was used as the scheduler. Can I use MPICH3?
> 
> Thanks
> 
> Best
> Quanwei
> 
> 2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
> If you are just using a single machine (and not cross-machine MPI), use MPICH3 -> https://www.mpich.org
> 
> It's easy to install yourself, and tends to be very robust to failure.
> 
> --Carson
> 
> 
> 
>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>> 
>> Dear Carson:
>> 
>> I met some problems to use MPI. I will give it another try.
>> Thank you!
>> 
>> Best
>> Quanwei
>> 
>> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>> It could be either. Please use MPI instead of starting multiple instances. It will greatly reduce both IO and RAM usage.
>> 
>> --Carson
>> 
>> 
>> 
>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>> 
>>> Dear Carson:
>>> 
>>> I only run 5 Maker instances in each directory (and set cpus=2). If it is related to memory issue or an IO issue, I am not sure why the much longer scaffolds (than the failed ones) were all annotated successfully, but the relatively shorter ones failed.  
>>> 
>>> I have set "tries=5" (#number of times to try a contig if there is a failure for some reason). I will try "clean_try=1" and test on the failed scaffolds individually with larger memory to see whether they can be annotated. 
>>> 
>>> Thank you!
>>> 
>>> Best
>>> Quanwei
>>> 
>>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>> I think the cause of the error may have been a little further upstream from what you pasted in the e-mail. One thing that may be happening is that you are taxing resources (like IO) if running MAKER multiple times or on too many CPUs. That can lead to failures because of truncated BLAST reports etc., in which case you can just retry and that will get around those types of IO-derived errors. MAKER can generate a lot of IO, and if you are working on network mounted locations (i.e. the storage being used is actually across the network), then they can be less robust than local storage (under heavy load, NFS can falsely report success on read/write operations that actually failed). It's the reason we built the retry capabilities into MAKER.
>>> 
>>> For contigs that continuously fail, you may need to set clean_try=1. That will cause failures to start from scratch (i.e. delete all old reports on failure rather than just those suspected of being truncated).
>>> 
>>> --Carson
>>> 
>>> 
>>> 
>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>>> 
>>>> Dear Carson:
>>>> 
>>>> About the error in my above email, I found the contig was correctly annotated at the second time RETRY. So please ignore my last email. But now, for a few number of scaffolds, I met problems to process the repeats (as shown below in red). I used both Mammalia repeat library and species specific repeat library (which is generated by your pipeline "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic"). There were no such problems when I only used Mammalia repeat library. Do you have any ideas about this? What could be the reason? Or do you have any suggestions for me to find the reason? Many thanks
>>>> 
>>>> Here are some parameters I used
>>>> 
>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
>>>> 
>>>> max_dna_len=300000
>>>> split_hit=40000
>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>> 
>>>> 
>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>>>> 33708 --> rank=NA, hostname=n409
>>>> 33709 ERROR: Failed while processing all repeats
>>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>>> 33711 FAILED CONTIG:Contig31
>>>> 
>>>> 
>>>> Best
>>>> Quanwei
>>>> 
>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>>> Dear Carson:
>>>> 
>>>> I got the following error again. Is this still related to memory issues? I wonder whether there can be other reasons lead to this error? This time, I got this error during training of the SNAP model. Before, even I set  max_dna_len=1Mb, I can train the model successfully.  And in the current training (where I get the following error),  I have decreased the max_dna_len to 300kb. I required the same amount memory as before. The only difference is that I am using both mammalian repeat library and species specific repeat library, while previously I only use the mammalian repeat library. Will it greatly increases the requirement of memory to use both repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? I have also set the depth_blast as 30 in current training.
>>>> 
>>>> Thank you! Have a nice weekend! 
>>>> 
>>>> 
>>>> 
>>>> #---------------------------------------------------------------------
>>>> Now starting the contig!!
>>>> SeqID: Contig10
>>>> Length: 18773588
>>>> #---------------------------------------------------------------------
>>>> 
>>>> 
>>>> setting up GFF3 output and fasta chunks
>>>> doing repeat masking
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> collecting blastx repeatmasking
>>>> processing all repeats
>>>> doing repeat masking
>>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
>>>> --> rank=NA, hostname=n224
>>>> ERROR: Failed while doing repeat masking
>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>> FAILED CONTIG:Contig10
>>>> 
>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>> FAILED CONTIG:Contig10
>>>> 
>>>> Best
>>>> Quanwei
>>>> 
>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>> 
>>>>> (2) From some of your replies in the MAKER Google group, I noticed that setting the depth_blast parameters to a fixed number can reduce memory use and save annotation time, so I changed the following parameters. Will this decrease the quality of the annotation? If not, can I use an even smaller number (e.g., 20) to save more memory and time?
>>>>> 
>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>> 
>>>> These values really only affect the final evidence kept in the GFF3 when you look at it in a browser. They have no effect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. By setting the values lower, you are just letting MAKER know it can throw things away even sooner since you don't want them in the GFF3. This provides a minor improvement in memory use, but max_dna_len is the parameter with the greatest effect.
>>>> 
>>>> 
>>>>> (3) I also have some concerns about speed, especially for the long scaffolds (around 100 Mb). Which part of genome annotation is the most time-consuming (repeat masking, BLAST, or polishing)? In particular, does the blastx of protein evidence take the majority of the time? I have prepared 99k mammalian Swiss-Prot protein sequences and 340k rodent TrEMBL protein sequences as protein evidence, and I am considering whether I can save much time by using only the 99k mammalian Swiss-Prot sequences.
>>>> 
>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>> BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN
>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX
>>>> 
>>>> Also, doubling the dataset size doubles the runtime. Larger window sizes via max_dna_len will also increase runtimes.
>>>> 
>>>> 
>>>>> (4) For some reason, I cannot run MAKER through MPI on our cluster, so I can only start multiple MAKER instances. Is it possible to have multiple MAKER instances annotate the same long scaffold (i.e., start multiple instances for a single sequence, without splitting the long sequence into shorter ones)?
>>>> 
>>>> Without MPI you won't be able to split up large contigs. At the very least you can try to run on a single node and set MPI to use all CPUs on that node. It's less difficult to set up than cross-node jobs via MPI.
>>>> 
>>>> 
>>>>> (5) Still on the speed issue: I read some of your comments about the "cpus" parameter in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html), and I understand that it indicates the number of CPUs for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?
>>>> 
>>>> The cpus parameter only affects how many CPUs are given to the BLAST command line, so only the BLAST step will speed up. I recommend using MPI to get all steps to speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
>>>> 
>>>> 
>>>> --Carson
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 
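
As a rough illustration of the BLAST cost comparison quoted above, the sketch below combines the two rules of thumb from the thread: the frame factors (BLASTX at least 6 times and TBLASTX at least 12 times slower than BLASTN) and runtime growing linearly with dataset size. The function name and the treatment of the "at least" factors as exact multipliers are assumptions made for illustration; real runtimes also depend on max_dna_len, hardware, and IO.

```python
# Toy cost model built from the rules of thumb quoted in the thread.
# The factors are lower bounds ("at least"), used here as plain multipliers.
FRAME_FACTOR = {"blastn": 1, "blastx": 6, "tblastx": 12}


def relative_cost(program: str, n_sequences: int, baseline_sequences: int) -> float:
    """Hypothetical helper: cost relative to BLASTN on the baseline dataset."""
    return FRAME_FACTOR[program] * n_sequences / baseline_sequences


swissprot = 99_000  # mammalian Swiss-Prot sequences mentioned in the thread
trembl = 340_000    # rodent TrEMBL sequences mentioned in the thread

only_swissprot = relative_cost("blastx", swissprot, swissprot)
both_sets = relative_cost("blastx", swissprot + trembl, swissprot)

# Ratio of the BLASTX cost with both protein sets to the cost with Swiss-Prot only.
print(f"{both_sets / only_swissprot:.2f}")
```

Under these assumptions, adding the TrEMBL set makes the protein search roughly four to five times as expensive as using the Swiss-Prot set alone.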


From qwzhang0601 at gmail.com  Mon Sep 11 12:33:51 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 11 Sep 2017 14:33:51 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
	<A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
Message-ID: <CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>

Dear Carson:

I see. Thank you. I will try it.

Best
Quanwei
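
The single-node advice in the quoted reply (set the -n parameter of mpiexec to the CPU count of the node) could be scripted roughly as follows. This is only a sketch: the helper name is hypothetical, and os.cpu_count() reports the CPUs of the machine running the script, which may be more than a scheduler such as Grid Engine has actually allocated to the job.

```python
import os
import shlex


def single_node_maker_command(cpu_count=None):
    """Hypothetical helper: build the `mpiexec -n <cpus> maker` command line.

    If cpu_count is not given, fall back to the CPU count visible to Python,
    which is an assumption and may not match the scheduler's allocation.
    """
    n = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return ["mpiexec", "-n", str(n), "maker"]


# For the 20-CPU node used as the example in the thread:
print(shlex.join(single_node_maker_command(20)))
```

The list form can be passed to subprocess.run from the directory that holds the MAKER control files.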

2017-09-11 13:46 GMT-04:00 Carson Holt <carsonhh at gmail.com>:

> Each node is a single machine. Because you currently run without MPI, each
> MAKER job you submit runs on a single machine. So you are either running
> multiple times on the same node, or you submitted 5 separate batch jobs in
> which case you may have a single maker process on each of 5 nodes.
>
> MPI can parallelize on the same node or across nodes. If you request 10
> nodes, then it can communicate across nodes to run the job on all hardware.
> Or you can run MPI on a single node and ask for all CPUs on that node. In
> that case it will split up work within a single node and use all resources
> just on that node. So if you can't get MPI to work across nodes, you can
> just submit a job that goes to a single node and ask for all CPUs on that
> node (multinode jobs may be hard to configure, but single node jobs are
> very easy). Just set the -n parameter of mpiexec to the CPU count of that
> node, and it will parallelize within the node.
>
> Example command for a 20 CPU node -->  mpiexec -n 20 maker
>
> --Carson

From qwzhang0601 at gmail.com  Wed Sep 13 08:51:32 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 13 Sep 2017 10:51:32 -0400
Subject: [maker-devel] Repeats annotation
Message-ID: <CAOW6FS+YBobDCpEAZ=8YYoV+LKrTV7qJhmKYktm15pAh84i3Kw@mail.gmail.com>

Dear Carson:

We generated a species-specific repeat library following your pipeline (
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic)
and annotated the genome with maker2 using both the species-specific repeat
library and the mammalian repeat library.

Now we want to compare the repeat content among different species. So I want
to generate species-specific repeat libraries for the other species and again
use both their species-specific repeat library and the mammalian repeat
library. But I found that I can provide only one of the two libraries to
RepeatMasker at a time (not both). Can I run maker2 on those genomes for
repeat masking only?

BTW, running RepeatMasker gives a summary report (as below). Is there any
script in maker2 to analyze repeat elements (or any other tool to process the
output of maker2)?

Many thanks


file name: test_scaffold31.fasta
sequences:             1
total length:     863590 bp  (858757 bp excl N/X-runs)
GC level:         37.02 %
bases masked:     301634 bp ( 34.93 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:               134        14362 bp    1.66 %
      Alu/B1          28         2183 bp    0.25 %
      MIRs            21         2860 bp    0.33 %

LINEs:               188       129104 bp   14.95 %
      LINE1          168       124633 bp   14.43 %
      LINE2           16         4266 bp    0.49 %
      L3/CR1           4          205 bp    0.02 %
      RTE              0            0 bp    0.00 %

LTR elements:        127       101129 bp   11.71 %
      ERVL            10         3057 bp    0.35 %
      ERVL-MaLRs      22         6902 bp    0.80 %
      ERV_classI      66        80258 bp    9.29 %
      ERV_classII     29        10912 bp    1.26 %

DNA elements:         27         4402 bp    0.51 %
      hAT-Charlie     13         1836 bp    0.21 %
      TcMar-Tigger     8         1651 bp    0.19 %

Unclassified:          4         1590 bp    0.18 %

Total interspersed repeats:    250587 bp   29.02 %


Small RNA:             9          616 bp    0.07 %

Satellites:           66        40820 bp    4.73 %
Simple repeats:      159         7235 bp    0.84 %
Low complexity:       50         2766 bp    0.32 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element


The query species was assumed to be mammalia
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127

run with rmblastn version 2.2.27+
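
For comparing such summary reports across species, a small parser is one option. The sketch below is not a MAKER or RepeatMasker utility; the function name and the regular expressions are assumptions based only on the report layout pasted above, and the sample text is an excerpt of that report. It reads the "bases masked" line and the top-level class rows (indented sub-class rows such as LINE1 are skipped).

```python
import re

# Excerpt of the summary report shown above.
SAMPLE = """\
total length:     863590 bp  (858757 bp excl N/X-runs)
bases masked:     301634 bp ( 34.93 %)
SINEs:               134        14362 bp    1.66 %
LINEs:               188       129104 bp   14.95 %
LTR elements:        127       101129 bp   11.71 %
DNA elements:         27         4402 bp    0.51 %
"""

# Top-level rows: "<name>:  <count>  <length> bp  <percent> %", starting in column 1.
ROW = re.compile(r"^(\S[^:\n]*):[ \t]+(\d+)[ \t]+(\d+) bp[ \t]+([\d.]+) %", re.M)
MASKED = re.compile(r"^bases masked:[ \t]+(\d+) bp \([ \t]*([\d.]+) %\)", re.M)


def parse_summary(text):
    """Hypothetical helper: extract masked totals and per-class rows."""
    masked = MASKED.search(text)
    if masked is None:
        raise ValueError("no 'bases masked' line found")
    classes = {
        name: (int(count), int(length), float(percent))
        for name, count, length, percent in ROW.findall(text)
    }
    return {
        "masked_bp": int(masked.group(1)),
        "masked_pct": float(masked.group(2)),
        "classes": classes,
    }


summary = parse_summary(SAMPLE)
print(summary["masked_pct"], summary["classes"]["LINEs"])
```

Parsing one report per species into such dictionaries makes it straightforward to tabulate masked percentages side by side.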

From qwzhang0601 at gmail.com  Wed Sep 13 08:32:34 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 13 Sep 2017 10:32:34 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
	<A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
	<CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>
Message-ID: <CAOW6FSJcdMTuLbN8-O-Vb7OxEXvU=F3Nwz7Rhx6qzXJ30ubV1w@mail.gmail.com>

Dear Carson:

I did more tests on one of the contigs (863 kb long) that failed during
repeat masking. It fails only when I add the species-specific repeat
library; it is annotated successfully when only the mammalian repeat library
is used. For this test I ran MAKER on this contig alone with 64 GB of memory.
So I do not think the failure is a memory or IO problem, because even contigs
98 Mb long can be annotated with 32 GB of memory.

I also ran RepeatMasker on this contig with the mammalian and the
species-specific repeat libraries separately. With the mammalian repeat
library about 35% of the contig was masked as repeats, while with the
species-specific repeat library it was 65% (report shown below). Could the
high level of repeats cause this contig to fail? Do you have any ideas about
this? Thanks


file name: test_scaffold31.fasta
sequences:             1
total length:     863590 bp  (858757 bp excl N/X-runs)
GC level:         37.02 %
bases masked:     562909 bp ( 65.18 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:              113        16134 bp    1.87 %
      ALUs           71        12479 bp    1.45 %
      MIRs            1          133 bp    0.02 %

LINEs:              251       380142 bp   44.02 %
      LINE1         211       210623 bp   24.39 %
      LINE2           1           86 bp    0.01 %
      L3/CR1          0            0 bp    0.00 %

LTR elements:       246       101221 bp   11.72 %
      ERVL            5         1037 bp    0.12 %
      ERVL-MaLRs     18         2744 bp    0.32 %
      ERV_classI    201        90942 bp   10.53 %
      ERV_classII    18         5964 bp    0.69 %

DNA elements:        39        14177 bp    1.64 %
     hAT-Charlie      7         3864 bp    0.45 %
     TcMar-Tigger     7         1706 bp    0.20 %

Unclassified:       196        45831 bp    5.31 %

Total interspersed repeats:   557505 bp   64.56 %


Small RNA:            3          823 bp    0.10 %

Satellites:           2          237 bp    0.03 %
Simple repeats:      94         4472 bp    0.52 %
Low complexity:      18          766 bp    0.09 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element


The query species was assumed to be homo
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127

run with rmblastn version 2.2.27+
The query was compared to classified sequences in
".../consensi.fa.classifiednoProtFinal"


Best
Quanwei

2017-09-11 14:33 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:

> Dear Carson:
>
> I see. Thank you. I will try it.
>
> Best
> Quanwei
>
> 2017-09-11 13:46 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>
>> Each node is a single machine. Because you currently run without MPI,
>> each MAKER job you submit runs on a single machine. So you are either
>> running multiple times on the same node, or you submitted 5 separate batch
>> jobs in which case you may have a single maker process on each of 5 nodes.
>>
>> MPI can parallelize on the same node or across nodes. If you request 10
>> nodes, then it can communicate across nodes to run the job on all hardware.
>> Or you can run MPI on a single node and ask for all CPUs on that node. In
>> that case it will split up work within a single node and use all resources
>> just on that node. So if you can't get MPI to work across nodes, you can
>> just submit a job that goes to a single node and ask for all CPUs on that
>> node (multinode jobs may be hard to configure, but single node jobs are
>> very easy). Just set the -n parameter of mpiexec to the CPU count of that
>> node, and it will parallelize within the node.
>>
>> Example command for a 20 CPU node -->  mpiexec -n 20 maker
>>
>> --Carson
>>
>>
>>
>>
>>
>> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>> wrote:
>>
>> Dear Carson:
>>
>> Would you please explain what do you mean by "a single machine"? I am
>> running maker2 on our high performance cluster. The cluster has more than
>> 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine was used
>> as the scheduler. Can I use MPICH3?
>>
>> Thanks
>>
>> Best
>> Quanwei
>>
>> 2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>
>>> If you are just using a single machine (and not cross machine MPI), use
>>> MPICH3 --> https://www.mpich.org
>>>
>>> It's easy to install yourself, and tends to be very robust to failure.
>>>
>>> --Carson
>>>
>>>
>>>
>>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>> wrote:
>>>
>>> Dear Carson:
>>>
>>> I met some problems to use MPI. I will give it another try.
>>> Thank you!
>>>
>>> Best
>>> Quanwei
>>>
>>> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>
>>>> It could be either. Please use MPI instead of starting multiple
>>>> instances. It will greatly reduce both IO and RAM usage.
>>>>
>>>> --Carson
>>>>
>>>>
>>>>
>>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>> wrote:
>>>>
>>>> Dear Carson:
>>>>
>>>> I only run 5 Maker instances in each directory (and set cpus=2). If it
>>>> is related to memory issue or an IO issue, I am not sure why the much
>>>> longer scaffolds (than the failed ones) were all annotated successfully,
>>>> but the relatively shorter ones failed.
>>>>
>>>> I have set "tries=5" (#number of times to try a contig if there is a
>>>> failure for some reason). I will try "clean_try=1" and test on the failed
>>>> scaffolds individually with larger memory to see whether they can be
>>>> annotated.
>>>>
>>>> Thank you!
>>>>
>>>> Best
>>>> Quanwei
>>>>
>>>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>
>>>>> I think the cause of the error may have been a little further upstream
>>>>> from what you pasted in the e-mail. One thing that may be happening is that
>>>>> you are taxing resources (like IO) if running MAKER multiple times or on
>>>>> too many CPUs. That can lead to failures because of truncated BLAST reports
>>>>> etc. In which case you can just retry and that will get around those types
>>>>> of IO derived errors. MAKER can generate a lot of IO, and if you are
>>>>> working on network mounted locations (i.e. the storage being used is
>>>>> actually across the network), then they can be less robust than local
>>>>> storage (when under heavy load NFS can falsely report success on read/write
>>>>> operations that actually failed). It's the reason we built in the retry
>>>>> capabilities of MAKER.
>>>>>
>>>>> For contigs that continuously fail, you may need to set clean_try=1.
>>>>> That will cause failures to start from scratch (i.e. delete all old reports
>>>>> on failure rather than just those suspected of being truncated).
>>>>>
>>>>> --Carson
>>>>>
>>>>>
>>>>>
>>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>>> wrote:
>>>>>
>>>>> Dear Carson:
>>>>>
>>>>> About the error in my above email, I found the contig was correctly
>>>>> annotated at the second time RETRY. So please ignore my last email. But
>>>>> now, for a few number of scaffolds, I met problems to process the repeats
>>>>> (as shown below in red). I used both Mammalia repeat library and species
>>>>> specific repeat library (which is generated by your pipeline "
>>>>> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic"). There were no such problems when I
>>>>> only used Mammalia repeat library. Do you have any ideas about this? What
>>>>> could be the reason? Or do you have any suggestions for me to find the
>>>>> reason? Many thanks
>>>>>
>>>>> Here are some parameters I used
>>>>>
>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>> model_org=Mammalia #select a model organism for RepBase masking in
>>>>> RepeatMasker
>>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism
>>>>> specific repeat library in fasta format for Repe
>>>>>
>>>>> max_dna_len=300000
>>>>> split_hit=40000
>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>
>>>>>
>>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>>>>> 33708 --> rank=NA, hostname=n409
>>>>> 33709 ERROR: Failed while processing all repeats
>>>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>>>> 33711 FAILED CONTIG:Contig31
>>>>>
>>>>>
>>>>> Best
>>>>> Quanwei
>>>>>
>>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>>>>
>>>>>> Dear Carson:
>>>>>>
>>>>>> I got the following error again. Is this still related to memory
>>>>>> issues? I wonder whether there can be other reasons lead to this error?
>>>>>> This time, I got this error during training of the SNAP model. Before, even
>>>>>> I set  max_dna_len=1Mb, I can train the model successfully.  And in the
>>>>>> current training (where I get the following error),  I have decreased the
>>>>>> max_dna_len to 300kb. I required the same amount memory as before. The only
>>>>>> difference is that I am using both mammalian repeat library and species
>>>>>> specific repeat library, while previously I only use the mammalian repeat
>>>>>> library. Will it greatly increases the requirement of memory to use both
>>>>>> repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? I
>>>>>> have also set the depth_blast as 30 in current training.
>>>>>>
>>>>>> Thank you! Have a nice weekend!
>>>>>>
>>>>>>
>>>>>>
>>>>>> #---------------------------------------------------------------------
>>>>>> Now starting the contig!!
>>>>>> SeqID: Contig10
>>>>>> Length: 18773588
>>>>>> #---------------------------------------------------------------------
>>>>>>
>>>>>>
>>>>>> setting up GFF3 output and fasta chunks
>>>>>> doing repeat masking
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> collecting blastx repeatmasking
>>>>>> processing all repeats
>>>>>> doing repeat masking
>>>>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
>>>>>> --> rank=NA, hostname=n224
>>>>>> ERROR: Failed while doing repeat masking
>>>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>>>> FAILED CONTIG:Contig10
>>>>>>
>>>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>>>> FAILED CONTIG:Contig10
>>>>>>
>>>>>> Best
>>>>>> Quanwei
>>>>>>
>>>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>>
>>>>>>>
>>>>>>> (2) By reading some of your replies in the maker google group, and I
>>>>>>> noticed that it can reduce memory and save time for annotation if I set
>>>>>>> depth_blast to a certain number. So I changed the following parameters. But
>>>>>>> I wonder, whether it will decrease the quality of annotation? If it won't
>>>>>>> affect the quality, can I even use a smaller number (e.g., 20) to save more
>>>>>>> memory and time?
>>>>>>>
>>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>>>
>>>>>>>
>>>>>>> These values really only affect the final evidence kept in the GFF3
>>>>>>> when you look at it in a browser. They have no effect on the annotation. This
>>>>>>> is because internally MAKER already collapses evidence down to the 10 best
>>>>>>> non-redundant features per evidence set per locus. The rest are put in the
>>>>>>> GFF3 just for reference. By setting it lower, you are just letting MAKER
>>>>>>> know it can throw things away even sooner since you don't want them in
>>>>>>> the GFF3. It provides a minor improvement for memory use, but
>>>>>>> max_dna_len is the big one that has the greatest effect.
>>>>>>>
>>>>>>>
>>>>>>> (3) I also have some concerns about the speed, especially for the
>>>>>>> long scaffolds (around 100Mb). I wonder which part is the most time
>>>>>>> consuming for genome annotation (repeat masking, blast, or polishing?).
>>>>>>> Particularly, I wonder whether the blastx of protein evidence will take
>>>>>>> majority of time. Now, I have prepared 99k mammalian Swiss protein
>>>>>>> sequences and 340k rodent TrEMBL protein sequences as protein evidences. I
>>>>>>> am considering whether I can save much time if I only use the 99k mammalian
>>>>>>> Swiss protein sequences as evidences.
>>>>>>>
>>>>>>>
>>>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at
>>>>>>> least 6 times slower than BLASTN
>>>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at
>>>>>>> least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>>>>
>>>>>>> Also, doubling the dataset size doubles the runtime. Larger window
>>>>>>> sizes via max_dna_len will also increase runtimes.
>>>>>>>
>>>>>>>
>>>>>>> (4) For some reasons, I can not run maker though MPI on our cluster.
>>>>>>> So I can only start multiple maker. I wonder if it is possible to let
>>>>>>> multiple maker to annotate the same long scaffold (i.e., for a single
>>>>>>> sequence I start multiple maker, without splitting the long sequence into
>>>>>>> shorter ones).
>>>>>>>
>>>>>>>
>>>>>>> Without MPI you won't be able to split up large contigs. At the very
>>>>>>> least you can try to run on a single node and set MPI to use all CPUs on
>>>>>>> that node. It's less difficult to set up compared to cross-node jobs via
>>>>>>> MPI.
>>>>>>>
>>>>>>>
>>>>>>> (5) Still about the speed issue. I read some of your comments about
>>>>>>> "cpus" parameters in the maker_opts file
>>>>>>> (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html).
>>>>>>> And I know it indicates the number
>>>>>>> of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file,
>>>>>>> then I can use the following command to submit the job, right?
>>>>>>>
>>>>>>>
>>>>>>> The cpus parameter only affects how many CPUs are given to the BLAST
>>>>>>> command line, so only the BLAST step will speed up. I recommend using
>>>>>>> MPI to get all steps to speed up. Even if you are only running on a single
>>>>>>> node, you can give all CPUs to the mpiexec command.
>>>>>>>
>>>>>>>
>>>>>>> --Carson
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>

From mathog at caltech.edu  Wed Sep 13 12:01:11 2017
From: mathog at caltech.edu (mathog)
Date: Wed, 13 Sep 2017 11:01:11 -0700
Subject: [maker-devel] OpenMPI issues,
 no response in two attempts to subscribe to list
Message-ID: <a1a5313e90b4f2043f9910f2624986eb@saf.bio.caltech.edu>

Greetings,

I'm trying to run maker 2.31.9 with OpenMPI on a Centos 6.7 system.  It 
just won't start.  OpenMPI works fine with a small test program, it just 
doesn't work with maker.  It fails in exactly the same way on a second 
Centos system with minor software differences (Centos 6.9 and perl 5.20 
compiled without thread support, the perl on the first machine had 
thread support.) The gory details were posted already in a Centos forum 
so rather than repeat it all here, this is a link to that thread:

    https://www.centos.org/forums/viewtopic.php?f=14&t=64099

maker was unpacked from the maker-2.31.9.tgz a second time (after moving 
the original) after setting up the "module add openmpi-x86_64" to my 
.bash_profile
and logging in cleanly.  It was rebuilt.  The build messages were 
identical to the previous ones and when a run was attempted it also 
failed in exactly the same way.

I also tried to subscribe to the list here

   
https://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

once yesterday, and once today, but no email ever came back.  Hopefully 
this message gets through!

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From carsonhh at gmail.com  Wed Sep 13 12:23:11 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Sep 2017 12:23:11 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSJcdMTuLbN8-O-Vb7OxEXvU=F3Nwz7Rhx6qzXJ30ubV1w@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
	<A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
	<CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>
	<CAOW6FSJcdMTuLbN8-O-Vb7OxEXvU=F3Nwz7Rhx6qzXJ30ubV1w@mail.gmail.com>
Message-ID: <6B4E0F47-7186-4B12-97A1-EB63D2EB7079@gmail.com>

These are the 3 errors you have shown in your e-mails:
open3: fork failed: Cannot allocate memory at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.

The first two are memory related. The second occurs because MAKER cannot kill a lock-maintainer thread that it was never able to start for lack of memory.

The third one is IO related. It is a truncated file that succeeded on the second try according to the e-mail you sent.


IO errors are quite common with NFS (network-mounted file systems). They are among the most frequent issues submitted to the devel list. MAKER can hit IO limits long before it hits CPU limits. One of the most frequent causes of these issues is that the user set TMP= in the control files to a manual location that is not suitable for high IO (note that TMP= defaults to /tmp). The location should always be a true locally mounted disk. Sometimes what looks like a local directory is actually a virtual location (a network-mounted disk or an in-memory location rather than real local disk). With the former you will get frequent IO failures, and with the latter you will also get out-of-memory issues.

Note that when you supply more data files you will also use more memory (to hold analysis results). According to your e-mail, the last error you got was 'Can't kill a non-numeric process ID'. Correct? So getting the error with two input files but not with a single input file further suggests you are running low on RAM.

Some things to check:
1. Make sure TMP= is not being set to a network-mounted location.
2. Make sure your temporary directory is not a virtual in-memory directory on the node being used.
3. If nodes are shared, you may run out of memory because of other users or because you did not request enough RAM during job submission.

Finally, try running interactively so you can see what the memory and directory locations look like on the node you get assigned for the job (check space and mount points; is /tmp, or wherever you set TMP=, in fact a local disk?). Also run with MPI rather than starting multiple MAKER instances. It uses resources better.
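
A minimal sketch of that mount-point check, assuming a Linux node whose df accepts -P and -T (GNU coreutils does). The helper names classify_fstype and check_tmp_dir are made up for illustration and are not part of MAKER, and the list of file system types is incomplete, so anything unlisted is only reported as "likely-local":

```shell
#!/bin/sh
# Sketch only: these helper names are hypothetical and not part of MAKER.

# Map a file system type name to a rough category.
# The lists are illustrative and incomplete; extend them for your site.
classify_fstype() {
    case "$1" in
        nfs*|cifs|smb*|lustre|gpfs|beegfs|ceph|fuse.glusterfs|fuse.sshfs)
            echo network ;;
        tmpfs|ramfs)
            echo memory ;;
        "")
            echo unknown ;;
        *)
            echo likely-local ;;
    esac
}

# Report the file system type behind a directory (default: /tmp).
# Assumes a df that accepts -P and -T.
check_tmp_dir() {
    dir=${1:-/tmp}
    fstype=$(df -PT "$dir" 2>/dev/null | awk 'NR == 2 { print $2 }')
    echo "$dir: type=${fstype:-?} category=$(classify_fstype "$fstype")"
}

check_tmp_dir "${1:-/tmp}"

# Show memory on the node, if the 'free' utility is available.
if command -v free >/dev/null 2>&1; then
    free -g
fi
```

Run it on the compute node itself (for example inside an interactive job), passing the directory you set in TMP= as the first argument. A "network" or "memory" category is the situation described above.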

Thanks,
Carson


> On Sep 13, 2017, at 8:32 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> I did more tests on one of the contigs (with length 863kb) that failed when doing repeat masking. I found it only fails when I add the species-specific repeat library, and it can be successfully annotated when only the mammalian repeat library is used. For this test I ran maker on this contig alone with 64G memory. So I think the failure should not be a memory or IO problem, because even contigs with length 98Mb can be annotated with 32G memory.
> 
> I also ran RepeatMasker on this contig with the mammalian and species-specific repeat libraries, separately. With the mammalian repeat library about 35% was masked as repeats, while it is 65% with the species-specific repeat library (as shown below in blue). I wonder whether the high level of repeats can lead to the failure of this contig. Do you have any ideas about this? Thanks
> 
> 
> 
> file name: test_scaffold31.fasta    
> sequences:             1
> total length:     863590 bp  (858757 bp excl N/X-runs)
> GC level:         37.02 %
> bases masked:     562909 bp ( 65.18 %)
> ==================================================
>                number of      length   percentage
>                elements*    occupied  of sequence
> --------------------------------------------------
> SINEs:              113        16134 bp    1.87 %
>       ALUs           71        12479 bp    1.45 %
>       MIRs            1          133 bp    0.02 %
> 
> LINEs:              251       380142 bp   44.02 %
>       LINE1         211       210623 bp   24.39 %
>       LINE2           1           86 bp    0.01 %
>       L3/CR1          0            0 bp    0.00 %
> 
> LTR elements:       246       101221 bp   11.72 %
>       ERVL            5         1037 bp    0.12 %
>       ERVL-MaLRs     18         2744 bp    0.32 %
>       ERV_classI    201        90942 bp   10.53 %
>       ERV_classII    18         5964 bp    0.69 %
> 
> DNA elements:        39        14177 bp    1.64 %
>      hAT-Charlie      7         3864 bp    0.45 %
>      TcMar-Tigger     7         1706 bp    0.20 %
> 
> Unclassified:       196        45831 bp    5.31 %
> 
> Total interspersed repeats:   557505 bp   64.56 %
> 
> 
> Small RNA:            3          823 bp    0.10 %
> 
> Satellites:           2          237 bp    0.03 %
> Simple repeats:      94         4472 bp    0.52 %
> Low complexity:      18          766 bp    0.09 %
> ==================================================
> 
> * most repeats fragmented by insertions or deletions
>   have been counted as one element
>                                                       
> 
> The query species was assumed to be homo          
> RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
>         
> run with rmblastn version 2.2.27+
> The query was compared to classified sequences in ".../consensi.fa.classifiednoProtFinal"  
> 
> 
> Best
> Quanwei
> 
> 2017-09-11 14:33 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
> Dear Carson:
> 
> I see. Thank you. I will try it.
> 
> Best
> Quanwei
> 
> 2017-09-11 13:46 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
> Each node is a single machine. Because you currently run without MPI, each MAKER job you submit runs on a single machine. So you are either running multiple times on the same node, or you submitted 5 separate batch jobs, in which case you may have a single maker process on each of 5 nodes.
> 
> MPI can parallelize on the same node or across nodes. If you request 10 nodes, then it can communicate across nodes to run the job on all hardware. Or you can run MPI on a single node and ask for all CPUs on that node. In that case it will split up work within a single node and use all resources just on that node. So if you can't get MPI to work across nodes, you can just submit a job that goes to a single node and ask for all CPUs on that node (multinode jobs may be hard to configure, but single node jobs are very easy). Just set the -n parameter of mpiexec to the CPU count of that node, and it will parallelize within the node.
> 
> Example command for a 20 CPU node:  mpiexec -n 20 maker
> 
> --Carson
> 
> 
> 
> 
> 
>> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>> 
>> Dear Carson: 
>> 
>> Would you please explain what do you mean by "a single machine"? I am running maker2 on our high performance cluster. The cluster has more than 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine was used as the scheduler. Can I use MPICH3?
>> 
>> Thanks
>> 
>> Best
>> Quanwei
>> 
>> 2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>> If you are just using a single machine (and not cross-machine MPI), use MPICH3: https://www.mpich.org
>> 
>> It's easy to install yourself, and tends to be very robust to failure.
>> 
>> --Carson
>> 
>> 
>> 
>>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>> 
>>> Dear Carson:
>>> 
>>> I met some problems to use MPI. I will give it another try.
>>> Thank you!
>>> 
>>> Best
>>> Quanwei
>>> 
>>> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>> It could be either. Please use MPI instead of starting multiple instances. It will greatly reduce both IO and RAM usage.
>>> 
>>> --Carson
>>> 
>>> 
>>> 
>>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>>> 
>>>> Dear Carson:
>>>> 
>>>> I only run 5 Maker instances in each directory (and set cpus=2). If it is related to memory issue or an IO issue, I am not sure why the much longer scaffolds (than the failed ones) were all annotated successfully, but the relatively shorter ones failed.  
>>>> 
>>>> I have set "tries=5" (#number of times to try a contig if there is a failure for some reason). I will try "clean_try=1" and test on the failed scaffolds individually with larger memory to see whether they can be annotated. 
>>>> 
>>>> Thank you!
>>>> 
>>>> Best
>>>> Quanwei
>>>> 
>>>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>> I think the cause of the error may have been a little further upstream from what you pasted in the e-mail. One thing that may be happening is that you are taxing resources (like IO) by running MAKER multiple times or on too many CPUs. That can lead to failures because of truncated BLAST reports etc. In that case you can just retry, and that will get around those types of IO-derived errors. MAKER can generate a lot of IO, and if you are working on network-mounted locations (i.e. the storage being used is actually across the network), then they can be less robust than local storage (when under heavy load, NFS can falsely report success on read/write operations that actually failed). It's the reason we built the retry capabilities into MAKER.
>>>> 
>>>> For contigs that continuously fail, you may need to set clean_try=1. That will cause failures to start from scratch (i.e. delete all old reports on failure rather than just those suspected of being truncated).
>>>> 
>>>> --Carson
>>>> 
>>>> 
>>>> 
>>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>>>> 
>>>>> Dear Carson:
>>>>> 
>>>>> About the error in my above email, I found the contig was correctly annotated at the second time RETRY. So please ignore my last email. But now, for a few number of scaffolds, I met problems to process the repeats (as shown below in red). I used both Mammalia repeat library and species specific repeat library (which is generated by your pipeline "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic"). There were no such problems when I only used Mammalia repeat library. Do you have any ideas about this? What could be the reason? Or do you have any suggestions for me to find the reason? Many thanks
>>>>> 
>>>>> Here are some parameters I used
>>>>> 
>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
>>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
>>>>> 
>>>>> max_dna_len=300000
>>>>> split_hit=40000
>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>> 
>>>>> 
>>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>>>>> 33708 --> rank=NA, hostname=n409
>>>>> 33709 ERROR: Failed while processing all repeats
>>>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>>>> 33711 FAILED CONTIG:Contig31
>>>>> 
>>>>> 
>>>>> Best
>>>>> Quanwei
>>>>> 
>>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>>>> Dear Carson:
>>>>> 
>>>>> I got the following error again. Is this still related to memory issues? I wonder whether there can be other reasons lead to this error? This time, I got this error during training of the SNAP model. Before, even I set  max_dna_len=1Mb, I can train the model successfully.  And in the current training (where I get the following error),  I have decreased the max_dna_len to 300kb. I required the same amount memory as before. The only difference is that I am using both mammalian repeat library and species specific repeat library, while previously I only use the mammalian repeat library. Will it greatly increases the requirement of memory to use both repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? I have also set the depth_blast as 30 in current training.
>>>>> 
>>>>> Thank you! Have a nice weekend! 
>>>>> 
>>>>> 
>>>>> 
>>>>> #---------------------------------------------------------------------
>>>>> Now starting the contig!!
>>>>> SeqID: Contig10
>>>>> Length: 18773588
>>>>> #---------------------------------------------------------------------
>>>>> 
>>>>> 
>>>>> setting up GFF3 output and fasta chunks
>>>>> doing repeat masking
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> collecting blastx repeatmasking
>>>>> processing all repeats
>>>>> doing repeat masking
>>>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
>>>>> --> rank=NA, hostname=n224
>>>>> ERROR: Failed while doing repeat masking
>>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>>> FAILED CONTIG:Contig10
>>>>> 
>>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>>> FAILED CONTIG:Contig10
>>>>> 
>>>>> Best
>>>>> Quanwei
>>>>> 
>>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>> 
>>>>>> (2) By reading some of your replies in the maker google group, and I noticed that it can reduce memory and save time for annotation if I set depth_blast to a certain number. So I changed the following parameters. But I wonder, whether it will decrease the quality of annotation? If it won't affect the quality, can I even use a smaller number (e.g., 20) to save more memory and time?
>>>>>> 
>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>> 
>>>>> These values really only affect the final evidence kept in the GFF3 when you look at it in a browser. They have no effect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. By setting it lower, you are just letting MAKER know it can throw things away even sooner since you don't want them in the GFF3. It provides a minor improvement for memory use, but max_dna_len is the big one that has the greatest effect.
>>>>> 
>>>>> 
>>>>>> (3) I also have some concerns about the speed, especially for the long scaffolds (around 100Mb). I wonder which part is the most time consuming for genome annotation (repeat masking, blast, or polishing?).  Particularly, I wonder whether the blastx of protein evidence will take majority of time. Now, I have prepared 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences as protein evidences. I am considering whether I can save much time if I only use the 99k mammalian Swiss protein sequences as evidences.
>>>>> 
>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN
>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>> 
>>>>> Also, doubling the dataset size doubles the runtime. Larger window sizes via max_dna_len will also increase runtimes.
>>>>> 
>>>>> 
>>>>>> (4) For some reasons, I can not run maker though MPI on our cluster. So I can only start multiple maker. I wonder if it is possible to let multiple maker to annotate the same long scaffold (i.e., for a single sequence I start multiple maker, without splitting the long sequence into shorter ones).
>>>>> 
>>>>> Without MPI you won't be able to split up large contigs. At the very least you can try to run on a single node and set MPI to use all CPUs on that node. It's less difficult to set up compared to cross-node jobs via MPI.
>>>>> 
>>>>> 
>>>>>> (5) Still about the speed issue. I read some of your comments about "cpus" parameters in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html). And I know it indicates the number of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?
>>>>> 
>>>>> The cpus parameter only affects how many CPUs are given to the BLAST command line, so only the BLAST step will speed up. I recommend using MPI to get all steps to speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
>>>>> 
>>>>> 
>>>>> --Carson
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 
> 


From carsonhh at gmail.com  Wed Sep 13 12:26:08 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Sep 2017 12:26:08 -0600
Subject: [maker-devel] Repeats annotation
In-Reply-To: <CAOW6FS+YBobDCpEAZ=8YYoV+LKrTV7qJhmKYktm15pAh84i3Kw@mail.gmail.com>
References: <CAOW6FS+YBobDCpEAZ=8YYoV+LKrTV7qJhmKYktm15pAh84i3Kw@mail.gmail.com>
Message-ID: <40F80C42-836A-41FF-9C9F-1F45C5816283@gmail.com>

I don't know of any tool to analyze the repeat info. MAKER really only focuses on getting the masking done for the gene prediction, and while it does keep the repeats as features in the GFF3, it does not do any kind of analysis. You would have to do that outside of MAKER.
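
As a rough illustration of doing that outside of MAKER, a per-source tally can be pulled from the GFF3 with awk. This is only a sketch: it assumes the repeat features carry a source name starting with "repeat" (for example repeatmasker or repeatrunner) and the type "match", which should be checked against your own file. The function name summarize_repeats is made up. Overlapping features are counted more than once, so the base totals are an upper bound on the masked sequence.

```shell
#!/bin/sh
# Sketch only: summarize_repeats is a hypothetical helper, not a MAKER script.
# Prints one line per repeat source: source, feature count, summed length (bp).
summarize_repeats() {
    awk -F'\t' '
        /^##FASTA/ { exit }          # stop before any embedded sequence
        /^#/       { next }          # skip comments and directives
        $3 == "match" && $2 ~ /^repeat/ {
            n[$2]++                  # number of features for this source
            bp[$2] += $5 - $4 + 1    # GFF3 coordinates are 1-based, inclusive
        }
        END { for (s in n) printf "%s\t%d\t%d\n", s, n[s], bp[s] }
    ' "$@" | sort
}

# Run on the GFF3 files given as arguments, if any.
if [ "$#" -gt 0 ]; then
    summarize_repeats "$@"
fi
```

Dividing the summed length by the genome size gives an approximate repeat fraction that can be compared across species, with the overlap caveat above.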

--Carson


> On Sep 13, 2017, at 8:51 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> We have generated a species-specific repeat library following your pipeline (http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic), and did genome annotation with maker2 using both the species-specific repeat library and the mammalian repeat library.
> 
> Now we want to compare the repeat content among different species. So I want to generate species-specific repeat libraries for other species and also use both their species-specific repeat library and the mammalian repeat library. But I found I can only provide either the species-specific repeat library or the mammalian repeat library to RepeatMasker (not both). I wonder whether I can run maker2 on those genomes but only for repeat masking.
> 
> BTW, by running RepeatMasker we can get a summary report (as below). I wonder whether there is any script from maker2 to analyze repeat elements (or other tools to process the output of maker2).
> 
> Many thanks
> 
> 
> file name: test_scaffold31.fasta    
> sequences:             1
> total length:     863590 bp  (858757 bp excl N/X-runs)
> GC level:         37.02 %
> bases masked:     301634 bp ( 34.93 %)
> ==================================================
>                number of      length   percentage
>                elements*    occupied  of sequence
> --------------------------------------------------
> SINEs:               134        14362 bp    1.66 %
>       Alu/B1          28         2183 bp    0.25 %
>       MIRs            21         2860 bp    0.33 %
> 
> LINEs:               188       129104 bp   14.95 %
>       LINE1          168       124633 bp   14.43 %
>       LINE2           16         4266 bp    0.49 %
>       L3/CR1           4          205 bp    0.02 %
>       RTE              0            0 bp    0.00 %
> 
> LTR elements:        127       101129 bp   11.71 %
>       ERVL            10         3057 bp    0.35 %
>       ERVL-MaLRs      22         6902 bp    0.80 %
>       ERV_classI      66        80258 bp    9.29 %
>       ERV_classII     29        10912 bp    1.26 %
> 
> DNA elements:         27         4402 bp    0.51 %
>       hAT-Charlie     13         1836 bp    0.21 %
>       TcMar-Tigger     8         1651 bp    0.19 %
> 
> Unclassified:          4         1590 bp    0.18 %
> 
> Total interspersed repeats:    250587 bp   29.02 %
> 
> 
> Small RNA:             9          616 bp    0.07 %
> 
> Satellites:           66        40820 bp    4.73 %
> Simple repeats:      159         7235 bp    0.84 %
> Low complexity:       50         2766 bp    0.32 %
> ==================================================
> 
> * most repeats fragmented by insertions or deletions
>   have been counted as one element
>                                                       
> 
> The query species was assumed to be mammalia      
> RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
>         
> run with rmblastn version 2.2.27+ 


From carsonhh at gmail.com  Wed Sep 13 12:41:24 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Sep 2017 12:41:24 -0600
Subject: [maker-devel] OpenMPI issues,
 no response in two attempts to subscribe to list
In-Reply-To: <a1a5313e90b4f2043f9910f2624986eb@saf.bio.caltech.edu>
References: <a1a5313e90b4f2043f9910f2624986eb@saf.bio.caltech.edu>
Message-ID: <BA16E294-BE01-47DC-8113-C018C38480FC@gmail.com>

Hi David,

First thing: MAKER binds shared C libraries using Perl, so you have to tell MAKER where to find the needed files before you install it. Then it compiles the bindings and saves them for MAKER to use. If you have two MPI installations, you may have MAKER set up to use one of them while you are trying to call it with the other. That would break the compiled bindings.

Also make sure you did the following (info from the .../maker/INSTALL instructions file):

"make sure to set LD_PRELOAD to the location of libmpi.so before even trying to install MAKER. It must also be set before running MAKER (or any program that binds OpenMPI's shared libraries), so it's best just to add it to your ~/.bash_profile. (i.e. export LD_PRELOAD=/usr/local/openmpi/lib/libmpi.so)."

Remember to replace '/usr/local/openmpi/lib/libmpi.so' with the actual location of the file.

Second, once you can get MAKER to start under OpenMPI, you may get freezes or failures partway into a run, because the OpenFabrics libraries use registered memory in a way that can cause system calls in a program to fail, with a snowballing error effect. Adding this to the mpiexec options can stop this from occurring --> '-mca btl ^openib'

That option has the side effect of disabling InfiniBand and using the Ethernet adapter instead. However, if you need to use the InfiniBand adapter, you can use these flags instead --> '--mca btl vader,tcp,self --mca btl_tcp_if_include ib0'

That command will use IP over InfiniBand rather than native InfiniBand, which has the same effect of disabling the OpenFabrics libraries.
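
A minimal shell sketch of the two mpiexec variants described above. The helper name maker_mpiexec_cmd is hypothetical (it is not part of MAKER or OpenMPI), and the libmpi.so path in the comment is only the placeholder from the INSTALL text. The helper just prints the command line so it can be checked before anything is launched.

```shell
# Sketch only. The libmpi.so path below is the placeholder from the INSTALL
# text; replace it with the real location on your system before using it:
#   export LD_PRELOAD=/usr/local/openmpi/lib/libmpi.so

# maker_mpiexec_cmd is a hypothetical helper (not part of MAKER or OpenMPI).
# It only prints the command line; it does not run mpiexec.
#   $1 = number of MPI processes
#   $2 = "tcp"   -> exclude the openib BTL (default)
#        "ipoib" -> IP over InfiniBand on interface ib0
maker_mpiexec_cmd() {
    local nproc="$1"
    local mode="${2:-tcp}"
    case "$mode" in
        tcp)
            echo "mpiexec -n ${nproc} -mca btl ^openib maker"
            ;;
        ipoib)
            echo "mpiexec -n ${nproc} --mca btl vader,tcp,self --mca btl_tcp_if_include ib0 maker"
            ;;
        *)
            echo "unknown mode: ${mode}" >&2
            return 1
            ;;
    esac
}
```

For example, `maker_mpiexec_cmd 20` prints the openib-excluding form for 20 processes, and `maker_mpiexec_cmd 20 ipoib` prints the IP-over-InfiniBand form. The interface name ib0 is taken from the text above and may differ on other clusters.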

Thanks,
Carson


> On Sep 13, 2017, at 12:01 PM, mathog <mathog at caltech.edu> wrote:
> 
> Greetings,
> 
> I'm trying to run maker 2.31.9 with OpenMPI on a Centos 6.7 system.  It just won't start.  OpenMPI works fine with a small test program, it just doesn't work with maker.  It fails in exactly the same way on a second Centos system with minor software differences (Centos 6.9 and perl 5.20 compiled without thread support, the perl on the first machine had thread support.) The gory details were posted already in a Centos forum so rather than repeat it all here, this is a link to that thread:
> 
>   https://www.centos.org/forums/viewtopic.php?f=14&t=64099
> 
> maker was unpacked from the maker-2.31.9.tgz a second time (after moving the original) after setting up the "module add openmpi-x86_64" to my .bash_profile
> and logging in cleanly.  It was rebuilt.  The build messages were identical to the previous ones and when a run was attempted it also failed in exactly the same way.
> 
> I also tried to subscribe to the list here
> 
>  https://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
> 
> once yesterday, and once today, but no email ever came back.  Hopefully this message gets through!
> 
> Regards,
> 
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> 


From qwzhang0601 at gmail.com  Wed Sep 13 13:42:01 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 13 Sep 2017 15:42:01 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <6B4E0F47-7186-4B12-97A1-EB63D2EB7079@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
	<A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
	<CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>
	<CAOW6FSJcdMTuLbN8-O-Vb7OxEXvU=F3Nwz7Rhx6qzXJ30ubV1w@mail.gmail.com>
	<6B4E0F47-7186-4B12-97A1-EB63D2EB7079@gmail.com>
Message-ID: <CAOW6FS+JddppLEgbNjS3ipBUc67tbC1FeUfi9=8TzN+NB35X8g@mail.gmail.com>

Dear Carson:

Thank you for your explanation, and sorry for not describing my problem
clearly. The first two errors went away after I changed the parameters
you suggested (e.g., max_dna_len, depth_blast). Now I only get the
following error for two contigs out of thousands. One of the two failed
contigs is 863 kb long, and I have run more tests on it individually.
Running RepeatMasker on this contig masks 65% of it with the
species-specific repeat library, but only 35% with the mammalian repeat
library. Since longer contigs (even 98 Mb) can all be annotated, I do not
see why this much shorter one would fail because of IO.

I did not set "TMP", and I am running on a high-performance cluster. I am
not sure whether the temporary directory there is a virtual (in-memory)
location or not. I will check later. Many thanks
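
One way such a check could look, as a rough shell sketch. Both helper names are hypothetical, the filesystem-type lists are illustrative rather than exhaustive, and the sketch assumes the util-linux findmnt tool is available on the compute node.

```shell
# classify_fstype is a hypothetical helper: it maps a filesystem type name
# to a rough category. The lists are illustrative, not exhaustive.
classify_fstype() {
    case "$1" in
        nfs|nfs4|cifs|lustre|gpfs|beegfs)
            echo "network"   # network-mounted storage
            ;;
        tmpfs|ramfs)
            echo "memory"    # in-memory: temporary files count against RAM
            ;;
        *)
            echo "other"     # e.g. ext4 or xfs; still confirm it is a local disk
            ;;
    esac
}

# tmp_fstype prints the filesystem type backing a directory (default /tmp).
# Assumes util-linux findmnt is installed.
tmp_fstype() {
    findmnt -n -o FSTYPE --target "${1:-/tmp}"
}
```

Run on the node that actually executes the job, `classify_fstype "$(tmp_fstype /tmp)"` gives a first hint about whether the default TMP location is network-mounted or in-memory.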

Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm
line 188.
33708 --> rank=NA, hostname=n409
33709 ERROR: Failed while processing all repeats
33710 ERROR: Chunk failed at level:3, tier_type:1
33711 FAILED CONTIG:Contig31

Best
Quanwei

2017-09-13 14:23 GMT-04:00 Carson Holt <carsonhh at gmail.com>:

> These are the 3 errors you have shown in your e-mails -->
> open3: fork failed: Cannot allocate memory at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>
> The first two are memory related with the second being because it cannot
> kill a lock maintainer thread that it was not able to start because of lack
> of memory.
>
> The third one is IO related. It is a truncated file that succeeded on the
> second try according to the e-mail you sent.
>
>
> IO errors are quite common with NFS (network mounted file systems). It's
> one of the most frequent issues submitted to the devel list. MAKER can hit
> IO limits long before it hits CPU limits. One of the most frequent causes
> of these issues is that the user set TMP= in the control files to a manual
> location that is not suitable for high IO (note TMP= defaults to /tmp). The
> location should always be a true locally mounted disk. Sometimes the
> location is virtual (not really local disk, but a network mounted disk or
> an in-memory location). With the former you will get frequent IO failures,
> and with the latter you will also get out-of-memory issues.
>
> Note that when you supply more data files you will also use more memory
> (to hold analysis results). According to your e-mail the last error you got
> was 'Can't kill a non-numeric process ID'. Correct? So getting the error
> with two input files but not when you supply a single input file further
> suggests you are running low on RAM.
>
> Some things to check:
> 1. Make sure TMP= is not being set to a network mounted location.
> 2. Make sure your temporary directory is not a virtual in-memory directory
> on the node being used.
> 3. If nodes are shared, you may run out of memory because of other users
> or because you failed to request enough RAM during job submission.
>
> Finally, try running interactively so you can see what the memory and
> directory locations look like on the node you get assigned for the job
> (check space and mount points. Is /tmp or wherever you set TMP= in fact a
> local disk?). Also run with MPI rather than starting multiple MAKER
> instances. It uses resources better.
>
> Thanks,
> Carson
>
>
>
>
>
>
> On Sep 13, 2017, at 8:32 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>
> Dear Carson:
>
> I did more tests on one of the contigs (length 863 kb) that failed
> during repeat masking. I found that it only fails when I add the
> species-specific repeat library; it is annotated successfully when only
> the mammalian repeat library is used. For this test I picked only this
> contig and ran MAKER with 64 GB of memory. So I think the failure should
> not be a memory or IO problem, because even the contigs of length 98 Mb
> can be annotated with 32 GB of memory.
>
> I also ran RepeatMasker on this contig with the mammalian and the
> species-specific repeat libraries separately. With the mammalian repeat
> library about 35% was masked as repeats, while with the species-specific
> repeat library it is 65% (as shown below). I wonder whether the high
> level of repeats can lead to the failure of this contig. Do you have any
> ideas about this? Thanks
>
>
>
> file name: test_scaffold31.fasta
> sequences:             1
> total length:     863590 bp  (858757 bp excl N/X-runs)
> GC level:         37.02 %
> bases masked:     562909 bp ( 65.18 %)
> ==================================================
>                number of      length   percentage
>                elements*    occupied  of sequence
> --------------------------------------------------
> SINEs:              113        16134 bp    1.87 %
>       ALUs           71        12479 bp    1.45 %
>       MIRs            1          133 bp    0.02 %
>
> LINEs:              251       380142 bp   44.02 %
>       LINE1         211       210623 bp   24.39 %
>       LINE2           1           86 bp    0.01 %
>       L3/CR1          0            0 bp    0.00 %
>
> LTR elements:       246       101221 bp   11.72 %
>       ERVL            5         1037 bp    0.12 %
>       ERVL-MaLRs     18         2744 bp    0.32 %
>       ERV_classI    201        90942 bp   10.53 %
>       ERV_classII    18         5964 bp    0.69 %
>
> DNA elements:        39        14177 bp    1.64 %
>      hAT-Charlie      7         3864 bp    0.45 %
>      TcMar-Tigger     7         1706 bp    0.20 %
>
> Unclassified:       196        45831 bp    5.31 %
>
> Total interspersed repeats:   557505 bp   64.56 %
>
>
> Small RNA:            3          823 bp    0.10 %
>
> Satellites:           2          237 bp    0.03 %
> Simple repeats:      94         4472 bp    0.52 %
> Low complexity:      18          766 bp    0.09 %
> ==================================================
>
> * most repeats fragmented by insertions or deletions
>   have been counted as one element
>
>
> The query species was assumed to be homo
> RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
>
> run with rmblastn version 2.2.27+
> The query was compared to classified sequences in ".../consensi.fa.classifiednoProtFinal"
>
>
>
> Best
> Quanwei
>
> 2017-09-11 14:33 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>
>> Dear Carson:
>>
>> I see. Thank you. I will try it.
>>
>> Best
>> Quanwei
>>
>> 2017-09-11 13:46 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>
>>> Each node is a single machine. Because you currently run without MPI,
>>> each MAKER job you submit runs on a single machine. So you are either
>>> running multiple times on the same node, or you submitted 5 separate batch
>>> jobs in which case you may have a single maker process on each of 5 nodes.
>>>
>>> MPI can parallelize on the same node or across nodes. If you request 10
>>> nodes, then it can communicate across nodes to run the job on all hardware.
>>> Or you can run MPI on a single node and ask for all CPUs on that node. In
>>> that case it will split up work within a single node and use all resources
>>> just on that node. So if you can't get MPI to work across nodes, you can
>>> just submit a job that goes to a single node and ask for all CPUs on that
>>> node (multinode jobs may be hard to configure, but single node jobs are
>>> very easy). Just set the -n parameter of mpiexec to the CPU count of that
>>> node, and it will parallelize within the node.
>>>
>>> Example command for a 20 CPU node -->  mpiexec -n 20 maker
>>>
>>> --Carson
>>>
>>>
>>>
>>>
>>>
>>> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>> wrote:
>>>
>>> Dear Carson:
>>>
>>> Would you please explain what do you mean by "a single machine"? I am
>>> running maker2 on our high performance cluster. The cluster has more than
>>> 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine was used
>>> as the scheduler. Can I use MPICH3?
>>>
>>> Thanks
>>>
>>> Best
>>> Quanwei
>>>
>>> 2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>
>>>> If you are just using a single machine (and not cross machine MPI), use
>>>> MPICH3 --> https://www.mpich.org
>>>>
>>>> It's easy to install yourself, and tends to be very robust to failure.
>>>>
>>>> --Carson
>>>>
>>>>
>>>>
>>>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>> wrote:
>>>>
>>>> Dear Carson:
>>>>
>>>> I ran into some problems using MPI. I will give it another try.
>>>> Thank you!
>>>>
>>>> Best
>>>> Quanwei
>>>>
>>>> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>
>>>>> It could be either. Please use MPI instead of starting multiple
>>>>> instances. It will greatly reduce both IO and RAM usage.
>>>>>
>>>>> --Carson
>>>>>
>>>>>
>>>>>
>>>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>>> wrote:
>>>>>
>>>>> Dear Carson:
>>>>>
>>>>> I only run 5 Maker instances in each directory (and set cpus=2). If it
>>>>> is related to memory issue or an IO issue, I am not sure why the much
>>>>> longer scaffolds (than the failed ones) were all annotated successfully,
>>>>> but the relatively shorter ones failed.
>>>>>
>>>>> I have set "tries=5" (#number of times to try a contig if there is a
>>>>> failure for some reason). I will try "clean_try=1" and test on the failed
>>>>> scaffolds individually with larger memory to see whether they can be
>>>>> annotated.
>>>>>
>>>>> Thank you!
>>>>>
>>>>> Best
>>>>> Quanwei
>>>>>
>>>>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>
>>>>>> I think the cause of the error may have been a little further
>>>>>> upstream from what you pasted in the e-mail. One thing that may be
>>>>>> happening is that you are taxing resources (like IO) by running MAKER
>>>>>> multiple times or on too many CPUs. That can lead to failures because of
>>>>>> truncated BLAST reports etc., in which case you can just retry and that will
>>>>>> get around those types of IO-derived errors. MAKER can generate a lot of
>>>>>> IO, and if you are working on network mounted locations (i.e. the storage
>>>>>> being used is actually across the network), they can be less robust
>>>>>> than local storage (under heavy load, NFS can falsely report success on
>>>>>> read/write operations that actually failed). It's the reason we built
>>>>>> the retry capabilities into MAKER.
>>>>>>
>>>>>> For contigs that continuously fail, you may need to set clean_try=1.
>>>>>> That will cause failures to start from scratch (i.e. delete all old reports
>>>>>> on failure rather than just those suspected of being truncated).
>>>>>>
>>>>>> --Carson
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Dear Carson:
>>>>>>
>>>>>> About the error in my previous email: I found the contig was correctly
>>>>>> annotated on the second RETRY, so please ignore my last email. But
>>>>>> now, for a few scaffolds, I have problems processing the repeats
>>>>>> (as shown below). I used both the Mammalia repeat library and a
>>>>>> species-specific repeat library (generated by your pipeline
>>>>>> "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic").
>>>>>> There were no such problems when I only used the Mammalia repeat library.
>>>>>> Do you have any ideas about this? What could be the reason? Or do you
>>>>>> have any suggestions for how I can find the reason? Many thanks
>>>>>>
>>>>>> Here are some parameters I used
>>>>>>
>>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
>>>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
>>>>>>
>>>>>> max_dna_len=300000
>>>>>> split_hit=40000
>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>>
>>>>>>
>>>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm
>>>>>> line 188.
>>>>>> 33708 --> rank=NA, hostname=n409
>>>>>> 33709 ERROR: Failed while processing all repeats
>>>>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>>>>> 33711 FAILED CONTIG:Contig31
>>>>>>
>>>>>>
>>>>>> Best
>>>>>> Quanwei
>>>>>>
>>>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>>>>>
>>>>>>> Dear Carson:
>>>>>>>
>>>>>>> I got the following error again. Is this still related to memory
>>>>>>> issues, or could something else lead to this error?
>>>>>>> This time I got the error while training the SNAP model. Before, even
>>>>>>> with max_dna_len=1Mb, I could train the model successfully. In the
>>>>>>> current training (where I get the following error), I have decreased
>>>>>>> max_dna_len to 300kb and requested the same amount of memory as before.
>>>>>>> The only difference is that I am now using both the mammalian repeat
>>>>>>> library and the species-specific repeat library, while previously I
>>>>>>> only used the mammalian repeat library. Does using both repeat
>>>>>>> libraries greatly increase the memory requirement (even when I decrease
>>>>>>> max_dna_len from 1Mb to 300kb)? I have also set depth_blast to 30 in
>>>>>>> the current training.
>>>>>>>
>>>>>>> Thank you! Have a nice weekend!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> #---------------------------------------------------------------------
>>>>>>> Now starting the contig!!
>>>>>>> SeqID: Contig10
>>>>>>> Length: 18773588
>>>>>>> #---------------------------------------------------------------------
>>>>>>>
>>>>>>>
>>>>>>> setting up GFF3 output and fasta chunks
>>>>>>> doing repeat masking
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> collecting blastx repeatmasking
>>>>>>> processing all repeats
>>>>>>> doing repeat masking
>>>>>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm
>>>>>>> line 1050.
>>>>>>> --> rank=NA, hostname=n224
>>>>>>> ERROR: Failed while doing repeat masking
>>>>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>>>>> FAILED CONTIG:Contig10
>>>>>>>
>>>>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>>>>> FAILED CONTIG:Contig10
>>>>>>>
>>>>>>> Best
>>>>>>> Quanwei
>>>>>>>
>>>>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>>>
>>>>>>>>
>>>>>>>> (2) By reading some of your replies in the maker google group, and
>>>>>>>> I noticed that it can reduce memory and save time for annotation if I set
>>>>>>>> depth_blast to a certain number. So I changed the following parameters. But
>>>>>>>> I wonder, whether it will decrease the quality of annotation? If it won't
>>>>>>>> affect the quality, can I even use a smaller number (e.g., 20) to save more
>>>>>>>> memory and time?
>>>>>>>>
>>>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>>>>
>>>>>>>>
>>>>>>>> These values really only affect the final evidence kept in the GFF3
>>>>>>>> when you look at it in a browser. They have no effect on the annotation. This
>>>>>>>> is because internally MAKER already collapses evidence down to the 10 best
>>>>>>>> non-redundant features per evidence set per locus. The rest are put in the
>>>>>>>> GFF3 just for reference. By setting it lower, you are just letting MAKER
>>>>>>>> know it can throw things away even sooner since you don't want them in
>>>>>>>> the GFF3. It provides a minor improvement for memory use, but
>>>>>>>> max_dna_len is the big one that has the greatest effect.
>>>>>>>>
>>>>>>>>
>>>>>>>> (3) I also have some concerns about the speed, especially for the
>>>>>>>> long scaffolds (around 100Mb). I wonder which part is the most time
>>>>>>>> consuming for genome annotation (repeat masking, blast, or polishing?).
>>>>>>>> Particularly, I wonder whether the blastx of protein evidence will take
>>>>>>>> majority of time. Now, I have prepared 99k mammalian Swiss protein
>>>>>>>> sequences and 340k rodent TrEMBL protein sequences as protein evidences. I
>>>>>>>> am considering whether I can save much time if I only use the 99k mammalian
>>>>>>>> Swiss protein sequences as evidences.
>>>>>>>>
>>>>>>>>
>>>>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at
>>>>>>>> least 6 times slower than BLASTN
>>>>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at
>>>>>>>> least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>>>>>
>>>>>>>> Also, if you double the dataset size, you double the runtime. Larger window
>>>>>>>> sizes via max_dna_len will also increase runtimes.
>>>>>>>>
>>>>>>>>
>>>>>>>> (4) For some reason, I cannot run MAKER through MPI on our
>>>>>>>> cluster, so I can only start multiple MAKER instances. I wonder if it is
>>>>>>>> possible to let multiple MAKER instances annotate the same long scaffold
>>>>>>>> (i.e., start multiple instances for a single sequence, without splitting
>>>>>>>> the long sequence into shorter ones).
>>>>>>>>
>>>>>>>>
>>>>>>>> Without MPI you won't be able to split up large contigs. At the
>>>>>>>> very least you can try to run on a single node and set MPI to use all CPUs
>>>>>>>> on that node. It's less difficult to set up compared to cross node jobs via
>>>>>>>> MPI.
>>>>>>>>
>>>>>>>>
>>>>>>>> (5) Still about the speed issue. I read some of your comments about
>>>>>>>> "cpus" parameters in the maker_opts file (
>>>>>>>> http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html).
>>>>>>>> And I know it indicates the number
>>>>>>>> of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file,
>>>>>>>> then I can use the following command to submit the job, right?
>>>>>>>>
>>>>>>>>
>>>>>>>> The cpus parameter only affects how many CPUs are given to the BLAST
>>>>>>>> command line. So only the BLAST step will speed up, and I recommend using
>>>>>>>> MPI to get all steps to speed up. Even if you are only running on a single
>>>>>>>> node, you can give all CPUs to the mpiexec command.
>>>>>>>>
>>>>>>>>
>>>>>>>> --Carson
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
>

From carsonhh at gmail.com  Wed Sep 13 14:21:14 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Sep 2017 14:21:14 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FS+JddppLEgbNjS3ipBUc67tbC1FeUfi9=8TzN+NB35X8g@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
	<A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
	<CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>
	<CAOW6FSJcdMTuLbN8-O-Vb7OxEXvU=F3Nwz7Rhx6qzXJ30ubV1w@mail.gmail.com>
	<6B4E0F47-7186-4B12-97A1-EB63D2EB7079@gmail.com>
	<CAOW6FS+JddppLEgbNjS3ipBUc67tbC1FeUfi9=8TzN+NB35X8g@mail.gmail.com>
Message-ID: <55597FCD-B589-4FFB-A8FA-37659A48BF58@gmail.com>

One final thought. If you are using rmblast as part of the RepeatMasker installation, it may be suffering from a bug that affects some BLAST versions and can sometimes lead to truncation of a BLAST report (example of a separate error related to BLAST report truncation here) --> https://groups.google.com/forum/#!topic/maker-devel/96KGgiQMcxQ

As a result there is a special update to rmblast --> http://www.repeatmasker.org/RMBlast.html

So if you are not using the update, try it; but if you are using the update and it is giving the error, roll back to rmblast 2.2.28 (i.e. the update may be the cause or the cure of RepeatMasker errors).
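
To see which rmblast version a given run actually used, the "run with rmblastn version ..." line that RepeatMasker prints in its summary can be extracted. The following is a small shell sketch; the helper name is hypothetical, and it assumes the summary text has the same wording as the report quoted below in this thread.

```shell
# rmblast_version_from_tbl is a hypothetical helper: it reads RepeatMasker
# summary text on stdin and prints the rmblastn version number from any
# "run with rmblastn version X.Y.Z" line. It prints nothing if no such
# line is present.
rmblast_version_from_tbl() {
    sed -n 's/.*run with rmblastn version \([0-9][0-9.]*\).*/\1/p'
}
```

Usage would be along the lines of `rmblast_version_from_tbl < test_scaffold31.fasta.tbl`, where the .tbl file name is an assumption based on the input file name shown in the quoted report.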

--Carson


> On Sep 13, 2017, at 1:42 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> Thank you for your explanation, and sorry for not describing my problem clearly. The first two errors went away after I changed the parameters you suggested (e.g., max_dna_len, depth_blast). Now I only get the following error for two contigs out of thousands. One of the two failed contigs is 863 kb long, and I have run more tests on it individually. Running RepeatMasker on this contig masks 65% of it with the species-specific repeat library, but only 35% with the mammalian repeat library. Since longer contigs (even 98 Mb) can all be annotated, I do not see why this much shorter one would fail because of IO.
> 
> I did not set "TMP", and I am running on a high-performance cluster. I am not sure whether the temporary directory there is a virtual (in-memory) location or not. I will check later. Many thanks
> 
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
> 33708 --> rank=NA, hostname=n409
> 33709 ERROR: Failed while processing all repeats
> 33710 ERROR: Chunk failed at level:3, tier_type:1
> 33711 FAILED CONTIG:Contig31
> 
> Best
> Quanwei
> 
> 2017-09-13 14:23 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
> These are the 3 errors you have shown in your e-mails -->
> open3: fork failed: Cannot allocate memory at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
> 
> The first two are memory related with the second being because it cannot kill a lock maintainer thread that it was not able to start because of lack of memory.
> 
> The third one is IO related. It is a truncated file that succeeded on the second try according to the e-mail you sent.
> 
> 
> IO errors are quite common with NFS (network mounted file systems). It's one of the most frequent issues submitted to the devel list. MAKER can hit IO limits long before it hits CPU limits. One of the most frequent causes of these issues is that the user set TMP= in the control files to a manual location that is not suitable for high IO (note TMP= defaults to /tmp). The location should always be a true locally mounted disk. Sometimes the location is virtual (not really local disk, but a network mounted disk or an in-memory location). With the former you will get frequent IO failures, and with the latter you will also get out-of-memory issues.
> 
> Note that when you supply more data files you will also use more memory (to hold analysis results). According to your e-mail the last error you got was 'Can't kill a non-numeric process ID'. Correct? So getting the error with two input files but not when you supply a single input file further suggests you are running low on RAM.
> 
> Some things to check:
> 1. Make sure TMP= is not being set to a network mounted location.
> 2. Make sure your temporary directory is not a virtual in-memory directory on the node being used.
> 3. If nodes are shared, you may run out of memory because of other users or because you failed to request enough RAM during job submission.
> 
> Finally, try running interactively so you can see what the memory and directory locations look like on the node you get assigned for the job (check space and mount points. Is /tmp or wherever you set TMP= in fact a local disk?). Also run with MPI rather than starting multiple MAKER instances. It uses resources better.
> 
> Thanks,
> Carson
> 
> 
> 
> 
> 
> 
>> On Sep 13, 2017, at 8:32 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>> 
>> Dear Carson:
>> 
>> I did more tests on one of the contigs (length 863 kb) that failed during repeat masking. I found that it only fails when I add the species-specific repeat library; it is annotated successfully when only the mammalian repeat library is used. For this test I picked only this contig and ran MAKER with 64 GB of memory. So I think the failure should not be a memory or IO problem, because even the contigs of length 98 Mb can be annotated with 32 GB of memory.
>> 
>> I also ran RepeatMasker on this contig with the mammalian and the species-specific repeat libraries separately. With the mammalian repeat library about 35% was masked as repeats, while with the species-specific repeat library it is 65% (as shown below). I wonder whether the high level of repeats can lead to the failure of this contig. Do you have any ideas about this? Thanks
>> 
>> 
>> 
>> file name: test_scaffold31.fasta    
>> sequences:             1
>> total length:     863590 bp  (858757 bp excl N/X-runs)
>> GC level:         37.02 %
>> bases masked:     562909 bp ( 65.18 %)
>> ==================================================
>>                number of      length   percentage
>>                elements*    occupied  of sequence
>> --------------------------------------------------
>> SINEs:              113        16134 bp    1.87 %
>>       ALUs           71        12479 bp    1.45 %
>>       MIRs            1          133 bp    0.02 %
>> 
>> LINEs:              251       380142 bp   44.02 %
>>       LINE1         211       210623 bp   24.39 %
>>       LINE2           1           86 bp    0.01 %
>>       L3/CR1          0            0 bp    0.00 %
>> 
>> LTR elements:       246       101221 bp   11.72 %
>>       ERVL            5         1037 bp    0.12 %
>>       ERVL-MaLRs     18         2744 bp    0.32 %
>>       ERV_classI    201        90942 bp   10.53 %
>>       ERV_classII    18         5964 bp    0.69 %
>> 
>> DNA elements:        39        14177 bp    1.64 %
>>      hAT-Charlie      7         3864 bp    0.45 %
>>      TcMar-Tigger     7         1706 bp    0.20 %
>> 
>> Unclassified:       196        45831 bp    5.31 %
>> 
>> Total interspersed repeats:   557505 bp   64.56 %
>> 
>> 
>> Small RNA:            3          823 bp    0.10 %
>> 
>> Satellites:           2          237 bp    0.03 %
>> Simple repeats:      94         4472 bp    0.52 %
>> Low complexity:      18          766 bp    0.09 %
>> ==================================================
>> 
>> * most repeats fragmented by insertions or deletions
>>   have been counted as one element
>>                                                       
>> 
>> The query species was assumed to be homo          
>> RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
>>         
>> run with rmblastn version 2.2.27+
>> The query was compared to classified sequences in ".../consensi.fa.classifiednoProtFinal"  
>> 
>> 
>> Best
>> Quanwei
>> 
>> 2017-09-11 14:33 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>> Dear Carson:
>> 
>> I see. Thank you. I will try it.
>> 
>> Best
>> Quanwei
>> 
>> 2017-09-11 13:46 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>> Each node is a single machine. Because you currently run without MPI, each MAKER job you submit runs on a single machine. So you are either running multiple times on the same node, or you submitted 5 separate batch jobs in which case you may have a single maker process on each of 5 nodes.
>> 
>> MPI can parallelize on the same node or across nodes. If you request 10 nodes, then it can communicate across nodes to run the job on all hardware. Or you can run MPI on a single node and ask for all CPUs on that node. In that case it will split up work within a single node and use all resources just on that node. So if you can?t get MPI to work across nodes, you can just submit a job that goes to a single node and ask for all CPUs on that node (multinode jobs may be hard to configure, but single node jobs are very easy). Just set the -n parameter of mpiexec to the CPU count of that node, and it will parallelize within the node.
>> 
>> Example command for a 20 CPU node ?>  mpiexec -n 20 maker
>> 
>> ?Carson
>> 
>> 
>> 
>> 
>> 
>>> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>> 
>>> Dear Carson:
>>> 
>>> Would you please explain what you mean by "a single machine"? I am running maker2 on our high performance cluster. The cluster has more than 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine is used as the scheduler. Can I use MPICH3?
>>> 
>>> Thanks
>>> 
>>> Best
>>> Quanwei
>>> 
>>> 2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>> If you are just using a single machine (and not cross machine MPI), use MPICH3 --> https://www.mpich.org
>>> 
>>> It's easy to install yourself, and tends to be very robust to failure.
>>> 
>>> --Carson
>>> 
>>> 
>>> 
>>>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>>> 
>>>> Dear Carson:
>>>> 
>>>> I ran into some problems using MPI. I will give it another try.
>>>> Thank you!
>>>> 
>>>> Best
>>>> Quanwei
>>>> 
>>>> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>> It could be either. Please use MPI instead of starting multiple instances. It will greatly reduce both IO and RAM usage.
>>>> 
>>>> --Carson
>>>> 
>>>> 
>>>> 
>>>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>>>> 
>>>>> Dear Carson:
>>>>> 
>>>>> I only run 5 MAKER instances in each directory (and set cpus=2). If it is related to a memory issue or an IO issue, I am not sure why the much longer scaffolds (than the failed ones) were all annotated successfully, but the relatively shorter ones failed.
>>>>> 
>>>>> I have set "tries=5" (#number of times to try a contig if there is a failure for some reason). I will try "clean_try=1" and test the failed scaffolds individually with more memory to see whether they can be annotated.
>>>>> 
>>>>> Thank you!
>>>>> 
>>>>> Best
>>>>> Quanwei
>>>>> 
>>>>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>> I think the cause of the error may have been a little further upstream from what you pasted in the e-mail. One thing that may be happening is that you are taxing resources (like IO) if running MAKER multiple times or on too many CPUs. That can lead to failures because of truncated BLAST reports etc. In which case you can just retry and that will get around those types of IO derived errors. MAKER can generate a lot of IO, and if you are working on network mounted locations (i.e. the storage being used is actually across the network), then they can be less robust than local storage (when under heavy load NFS can falsely report success on read/write operations that actually failed). It's the reason we built in the retry capabilities of MAKER.
>>>>> 
>>>>> For contigs that continuously fail, you may need to set clean_try=1. That will cause failures to start from scratch (i.e. delete all old reports on failure rather than just those suspected of being truncated).
>>>>> 
>>>>> --Carson
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>>>>> 
>>>>>> Dear Carson:
>>>>>> 
>>>>>> About the error in my email above: I found that the contig was annotated correctly on the second RETRY, so please ignore my last email. But now, for a small number of scaffolds, I am running into problems when processing the repeats (as shown below in red). I used both the Mammalia repeat library and a species-specific repeat library (generated with your pipeline "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic"). There were no such problems when I used only the Mammalia repeat library. Do you have any ideas about this? What could be the reason? Or do you have any suggestions for how I can find the reason? Many thanks
>>>>>> 
>>>>>> Here are some parameters I used
>>>>>> 
>>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
>>>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
>>>>>> 
>>>>>> max_dna_len=300000
>>>>>> split_hit=40000
>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>> 
>>>>>> 
>>>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>>>>>> 33708 --> rank=NA, hostname=n409
>>>>>> 33709 ERROR: Failed while processing all repeats
>>>>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>>>>> 33711 FAILED CONTIG:Contig31
>>>>>> 
>>>>>> 
>>>>>> Best
>>>>>> Quanwei
>>>>>> 
>>>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>>>>> Dear Carson:
>>>>>> 
>>>>>> I got the following error again. Is this still related to memory issues? I wonder whether there could be other reasons for this error. This time, I got the error during training of the SNAP model. Before, even when I set max_dna_len=1Mb, I could train the model successfully. In the current training (where I get the following error), I have decreased max_dna_len to 300kb. I requested the same amount of memory as before. The only difference is that I am using both the mammalian repeat library and the species-specific repeat library, while previously I used only the mammalian repeat library. Will using both repeat libraries greatly increase the memory requirement (even when I decrease max_dna_len from 1Mb to 300kb)? I have also set depth_blast to 30 in the current training.
>>>>>> 
>>>>>> Thank you! Have a nice weekend! 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> #---------------------------------------------------------------------
>>>>>> Now starting the contig!!
>>>>>> SeqID: Contig10
>>>>>> Length: 18773588
>>>>>> #---------------------------------------------------------------------
>>>>>> 
>>>>>> 
>>>>>> setting up GFF3 output and fasta chunks
>>>>>> doing repeat masking
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> collecting blastx repeatmasking
>>>>>> processing all repeats
>>>>>> doing repeat masking
>>>>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
>>>>>> --> rank=NA, hostname=n224
>>>>>> ERROR: Failed while doing repeat masking
>>>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>>>> FAILED CONTIG:Contig10
>>>>>> 
>>>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>>>> FAILED CONTIG:Contig10
>>>>>> 
>>>>>> Best
>>>>>> Quanwei
>>>>>> 
>>>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>> 
>>>>>>> (2) By reading some of your replies in the maker google group, and I noticed that it can reduce memory and save time for annotation if I set depth_blast to a certain number. So I changed the following parameters. But I wonder, whether it will decrease the quality of annotation? If it won't affect the quality, can I even use a smaller number (e.g., 20) to save more memory and time?
>>>>>>> 
>>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>> 
>>>>>> These values really only affect the final evidence kept in the GFF3 when you look at it in a browser. They have no effect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. By setting it lower, you are just letting MAKER know it can throw things away even sooner since you don't want them in the GFF3. It provides a minor improvement for memory use, but max_dna_len is the big one that has the greatest effect.
>>>>>> 
>>>>>> 
>>>>>>> (3) I also have some concerns about the speed, especially for the long scaffolds (around 100Mb). I wonder which part is the most time consuming for genome annotation (repeat masking, blast, or polishing?).  Particularly, I wonder whether the blastx of protein evidence will take majority of time. Now, I have prepared 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences as protein evidences. I am considering whether I can save much time if I only use the 99k mammalian Swiss protein sequences as evidences.
>>>>>> 
>>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN
>>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>>> 
>>>>>> Also double the dataset size, double the runtime. Larger window sizes via max_dna_len will also increase runtimes.
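
The rule of thumb above can be turned into a rough back-of-the-envelope estimate. The sketch below is only an illustration: it assumes cost scales linearly with the number of evidence sequences (the thread says "double the dataset size, double the runtime"), it treats the 6x and 12x frame factors as exact although they are stated as lower bounds, and the names FRAME_FACTOR and relative_cost are hypothetical.

```python
# Relative search-space factors quoted in the thread (BLASTN = 1).
# These are lower bounds in practice; here they are used as plain multipliers.
FRAME_FACTOR = {"blastn": 1, "blastx": 6, "tblastx": 12}


def relative_cost(program, n_sequences):
    """Rule-of-thumb cost: frame factor times evidence dataset size."""
    return FRAME_FACTOR[program] * n_sequences


if __name__ == "__main__":
    # Dataset sizes taken from question (3): 99k Swiss-Prot and 340k TrEMBL proteins.
    swissprot_only = relative_cost("blastx", 99_000)
    both_sets = relative_cost("blastx", 99_000 + 340_000)
    print(f"BLASTX with both protein sets vs Swiss-Prot only: {both_sets / swissprot_only:.1f}x")
```

Under these assumptions, dropping the TrEMBL set would cut the BLASTX step to a bit under a quarter of its cost; real runtimes also depend on sequence lengths and on max_dna_len.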
>>>>>> 
>>>>>> 
>>>>>>> (4) For some reason, I cannot run MAKER through MPI on our cluster, so I can only start multiple MAKER instances. I wonder if it is possible to let multiple MAKER instances annotate the same long scaffold (i.e., for a single sequence I start multiple MAKER instances, without splitting the long sequence into shorter ones).
>>>>>> 
>>>>>> Without MPI you won't be able to split up large contigs. At the very least you can try and run on a single node and set MPI to use all CPUs on that node. It's less difficult to set up compared to cross node jobs via MPI.
>>>>>> 
>>>>>> 
>>>>>>> (5) Still about the speed issue. I read some of your comments about the "cpus" parameter in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html). And I know it indicates the number of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?
>>>>>> 
>>>>>> The cpus parameter only affects how many CPUs are given to the blast command line. So only the BLAST step will speed up, so I recommend using MPI to get all steps to speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
>>>>>> 
>>>>>> 
>>>>>> --Carson
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
> 
> 

From qwzhang0601 at gmail.com  Wed Sep 13 14:26:11 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 13 Sep 2017 16:26:11 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <55597FCD-B589-4FFB-A8FA-37659A48BF58@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
	<A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
	<CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>
	<CAOW6FSJcdMTuLbN8-O-Vb7OxEXvU=F3Nwz7Rhx6qzXJ30ubV1w@mail.gmail.com>
	<6B4E0F47-7186-4B12-97A1-EB63D2EB7079@gmail.com>
	<CAOW6FS+JddppLEgbNjS3ipBUc67tbC1FeUfi9=8TzN+NB35X8g@mail.gmail.com>
	<55597FCD-B589-4FFB-A8FA-37659A48BF58@gmail.com>
Message-ID: <CAOW6FSKU9Tn6HN3fZAnXquVU0OrdsxZuHB8GCG76BNQAZ_kdKg@mail.gmail.com>

Dear Carson:

I will take a look and try it. Thank you.

Best
Quanwei

2017-09-13 16:21 GMT-04:00 Carson Holt <carsonhh at gmail.com>:

> One final thought. If you are using rmblast as part of the RepeatMasker
> installation, it may be suffering from a bug found in some blast versions
> that can sometimes lead to truncation of a blast report (example of a
> separate error related to blast report truncation here) -->
> https://groups.google.com/forum/#!topic/maker-devel/96KGgiQMcxQ
>
> As a result there is a special update to rmblast -->
> http://www.repeatmasker.org/RMBlast.html
>
> So if you are not using the update try it, but if you are using the update
> and it is giving the error, roll back to rmblast 2.2.28 (i.e. the update
> may be the cause or the cure of RepeatMasker errors).
>
> --Carson
>
>
>
> On Sep 13, 2017, at 1:42 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>
> Dear Carson:
>
> Thank you for your explanation.  Sorry for not describing my problem
> clearly. The first two errors were all gone after I changed the parameters
> you suggested (e.g., max_dna_len, depth_blast). Now I only get the
> following error for two contigs among thousands of contigs. One of the two
> failed contigs has length 863k, and I have done more tests on this contig
> individually. Running RepeatMasker on this contig masked 65% of it with
> the species specific repeat library, but only 35% with the
> mammalian repeat library. Since longer contigs (even 98Mb) can all be
> annotated, I do not see why this much shorter one would fail due to IO.
>
> I did not set "TMP", and I am running on a high performance cluster. I am
> not sure whether it is a virtual memory or not. I will check it later. Many
> thanks
>
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm
> line 188.
> 33708 --> rank=NA, hostname=n409
> 33709 ERROR: Failed while processing all repeats
> 33710 ERROR: Chunk failed at level:3, tier_type:1
> 33711 FAILED CONTIG:Contig31
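
To list which contigs need a closer look (for example a rerun with clean_try=1), the "FAILED CONTIG:" lines shown above can be pulled out of a saved log. This is a minimal sketch that assumes the log format is exactly what appears in this thread; failed_contigs is a hypothetical helper, not part of MAKER.

```python
import re

# Matches lines such as "33711 FAILED CONTIG:Contig31" or "FAILED CONTIG:Contig10".
FAILED_RE = re.compile(r"FAILED CONTIG:\s*(\S+)")


def failed_contigs(log_text):
    """Return contig names reported as failed, in first-seen order, without repeats."""
    seen = []
    for match in FAILED_RE.finditer(log_text):
        name = match.group(1)
        if name not in seen:
            seen.append(name)
    return seen


if __name__ == "__main__":
    sample = (
        "33709 ERROR: Failed while processing all repeats\n"
        "33710 ERROR: Chunk failed at level:3, tier_type:1\n"
        "33711 FAILED CONTIG:Contig31\n"
    )
    print(failed_contigs(sample))
```

A contig that fails on one try and succeeds on a later retry will still appear in this list, so the result is a starting point for inspection rather than a final verdict.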
>
> Best
> Quanwei
>
> 2017-09-13 14:23 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>
>> These are the 3 errors you have shown in your e-mails -->
>> open3: fork failed: Cannot allocate memory at
>> /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm
>> line 1050.
>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm
>> line 188.
>>
>> The first two are memory related with the second being because it cannot
>> kill a lock maintainer thread that it was not able to start because of lack
>> of memory.
>>
>> The third one is IO related. It is a truncated file that succeeded on the
>> second try according to the e-mail you sent.
>>
>>
>> IO errors are quite common with NFS (network mounted file systems). It's
>> one of the most frequent issues submitted to the devel list. MAKER can hit
>> IO limits long before it hits CPU limits. One of the most frequent causes
>> of these issues is that the user set TMP= in the control files to a manual
>> location that is not suitable for high IO (note TMP= defaults to /tmp). The
>> location should always be a true locally mounted disk. Sometimes this is a
>> virtual location (not really local disk but network mounted disk or an in
>> memory location). With the former you will get frequent IO failures and
>> with the latter you will also get out of memory issues.
>>
>> Note that when you supply more data files you will also use more memory
>> (to hold analysis results). According to your e-mail the last error you got
>> was 'Can't kill a non-numeric process ID'. Correct? So getting the error
>> with two input files but not when you supply a single input file further
>> suggests you are running low on RAM.
>>
>> 1. Some things to check. Make sure TMP= is not being set to a network
>> mounted location.
>> 2. Make sure your temporary directory is not a virtual in memory
>> directory on the node being used.
>> 3. If nodes are shared, you may run out of memory because of other users
>> or because you failed to request enough RAM during job submission.
>>
>> Finally, try running interactively so you can see what the memory and
>> directory locations look like on the node you get assigned for the job
>> (check space and mount points. Is /tmp or wherever you set TMP= in fact a
>> local disk?). Also run with MPI rather than starting multiple MAKER
>> instances. It uses resources better.
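
One way to run checks 1 and 2 above on the assigned node is to look up which mounted file system holds the temporary directory. The sketch below is an assumption-laden illustration for Linux only: it reads /proc/mounts, the file system type lists are examples rather than a complete catalogue, and find_mount and classify_tmp are hypothetical helper names.

```python
import posixpath

# Illustrative (incomplete) lists of file system types as they appear in /proc/mounts.
NETWORK_FS = {"nfs", "nfs4", "cifs", "lustre", "gpfs", "beegfs", "ceph", "glusterfs"}
MEMORY_FS = {"tmpfs", "ramfs"}


def find_mount(path, mounts_text):
    """Return (mount_point, fs_type) for the longest mount point containing path."""
    path = posixpath.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix):
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, fs_type)
    return best


def classify_tmp(path, mounts_text):
    """Classify the storage behind path as 'network', 'memory', 'local' or 'unknown'."""
    found = find_mount(path, mounts_text)
    if found is None:
        return "unknown"
    if found[1] in NETWORK_FS:
        return "network"
    if found[1] in MEMORY_FS:
        return "memory"
    return "local"


if __name__ == "__main__":
    # Run this on the compute node itself; "/tmp" is the default the thread mentions for TMP=.
    with open("/proc/mounts") as handle:
        print(classify_tmp("/tmp", handle.read()))
```

A result of "network" or "memory" would match the two failure patterns described above (frequent IO failures and out-of-memory errors, respectively); "local" only means the type was not in either example list.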
>>
>> Thanks,
>> Carson
>>
>>
>>
>>
>>
>>
>> On Sep 13, 2017, at 8:32 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>
>> Dear Carson:
>>
>> I did more tests on one of the contigs (with length 863kb) that failed
>> when doing repeat masking. I found it only fails when I add the species
>> specific repeat library, and it can be successfully annotated when only
>> the mammalian repeat library is used. For the test I picked only
>> this contig and ran MAKER with 64G memory. So I think the failure should
>> not be a problem with memory or IO, because even the contigs with length
>> 98Mb can be annotated with 32G memory.
>>
>> I also ran RepeatMasker on this contig with the mammalian and species
>> specific repeat libraries, separately. I found that with the mammalian repeat
>> library, about 35% was masked as repeats, while it is 65% with the
>> species specific repeat library (as shown below in blue). I wonder whether
>> the high level of repeats can lead to the failure of this contig.  Do you
>> have any ideas about this? Thanks
>>
>>
>>
>> file name: test_scaffold31.fasta
>> sequences:             1
>> total length:     863590 bp  (858757 bp excl N/X-runs)
>> GC level:         37.02 %
>> bases masked:     562909 bp ( 65.18 %)
>> ==================================================
>>                number of      length   percentage
>>                elements*    occupied  of sequence
>> --------------------------------------------------
>> SINEs:              113        16134 bp    1.87 %
>>       ALUs           71        12479 bp    1.45 %
>>       MIRs            1          133 bp    0.02 %
>>
>> LINEs:              251       380142 bp   44.02 %
>>       LINE1         211       210623 bp   24.39 %
>>       LINE2           1           86 bp    0.01 %
>>       L3/CR1          0            0 bp    0.00 %
>>
>> LTR elements:       246       101221 bp   11.72 %
>>       ERVL            5         1037 bp    0.12 %
>>       ERVL-MaLRs     18         2744 bp    0.32 %
>>       ERV_classI    201        90942 bp   10.53 %
>>       ERV_classII    18         5964 bp    0.69 %
>>
>> DNA elements:        39        14177 bp    1.64 %
>>      hAT-Charlie      7         3864 bp    0.45 %
>>      TcMar-Tigger     7         1706 bp    0.20 %
>>
>> Unclassified:       196        45831 bp    5.31 %
>>
>> Total interspersed repeats:   557505 bp   64.56 %
>>
>>
>> Small RNA:            3          823 bp    0.10 %
>>
>> Satellites:           2          237 bp    0.03 %
>> Simple repeats:      94         4472 bp    0.52 %
>> Low complexity:      18          766 bp    0.09 %
>> ==================================================
>>
>> * most repeats fragmented by insertions or deletions
>>   have been counted as one element
>>
>>
>> The query species was assumed to be homo
>> RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
>>
>> run with rmblastn version 2.2.27+
>> The query was compared to classified sequences in
>> ".../consensi.fa.classifiednoProtFinal"
>>
>>
>> Best
>> Quanwei
>>
>> 2017-09-11 14:33 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>
>>> Dear Carson:
>>>
>>> I see. Thank you. I will try it.
>>>
>>> Best
>>> Quanwei
>>>
>>> 2017-09-11 13:46 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>
>>>> Each node is a single machine. Because you currently run without MPI,
>>>> each MAKER job you submit runs on a single machine. So you are either
>>>> running multiple times on the same node, or you submitted 5 separate batch
>>>> jobs in which case you may have a single maker process on each of 5 nodes.
>>>>
>>>> MPI can parallelize on the same node or across nodes. If you request 10
>>>> nodes, then it can communicate across nodes to run the job on all hardware.
>>>> Or you can run MPI on a single node and ask for all CPUs on that node. In
>>>> that case it will split up work within a single node and use all resources
>>>> just on that node. So if you can't get MPI to work across nodes, you can
>>>> just submit a job that goes to a single node and ask for all CPUs on that
>>>> node (multinode jobs may be hard to configure, but single node jobs are
>>>> very easy). Just set the -n parameter of mpiexec to the CPU count of that
>>>> node, and it will parallelize within the node.
>>>>
>>>> Example command for a 20 CPU node -->  mpiexec -n 20 maker
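
The single-node advice above amounts to matching mpiexec's -n value to the CPU count of the node. A minimal sketch of building that command follows; single_node_mpiexec_cmd is a hypothetical helper, and using the CPU count Python reports is an assumption that only holds when the whole node is allocated to the job (on a shared node, pass the number of slots the scheduler granted).

```python
import os
import shlex


def single_node_mpiexec_cmd(n_cpus=None, maker_exe="maker"):
    """Build the `mpiexec -n <N> maker` command for one node as an argument list."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1  # os.cpu_count() may return None
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    return ["mpiexec", "-n", str(n_cpus), maker_exe]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed or pasted
    # into a job script.
    print(shlex.join(single_node_mpiexec_cmd(20)))
```

The returned list can be handed to subprocess.run from inside a job script once the printed command looks right.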
>>>>
>>>> --Carson
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>> wrote:
>>>>
>>>> Dear Carson:
>>>>
>>>> Would you please explain what you mean by "a single machine"? I am
>>>> running maker2 on our high performance cluster. The cluster has more than
>>>> 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine is used
>>>> as the scheduler. Can I use MPICH3?
>>>>
>>>> Thanks
>>>>
>>>> Best
>>>> Quanwei
>>>>
>>>> 2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>
>>>>> If you are just using a single machine (and not cross machine MPI),
>>>>> use MPICH3 --> https://www.mpich.org
>>>>>
>>>>> It's easy to install yourself, and tends to be very robust to failure.
>>>>>
>>>>> --Carson
>>>>>
>>>>>
>>>>>
>>>>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>>> wrote:
>>>>>
>>>>> Dear Carson:
>>>>>
>>>>> I ran into some problems using MPI. I will give it another try.
>>>>> Thank you!
>>>>>
>>>>> Best
>>>>> Quanwei
>>>>>
>>>>> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>
>>>>>> It could be either. Please use MPI instead of starting multiple
>>>>>> instances. It will greatly reduce both IO and RAM usage.
>>>>>>
>>>>>> --Carson
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Dear Carson:
>>>>>>
>>>>>> I only run 5 Maker instances in each directory (and set cpus=2). If
>>>>>> it is related to memory issue or an IO issue, I am not sure why the much
>>>>>> longer scaffolds (than the failed ones) were all annotated successfully,
>>>>>> but the relatively shorter ones failed.
>>>>>>
>>>>>> I have set "tries=5" (#number of times to try a contig if there is a
>>>>>> failure for some reason). I will try "clean_try=1" and test on the failed
>>>>>> scaffolds individually with larger memory to see whether they can be
>>>>>> annotated.
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> Best
>>>>>> Quanwei
>>>>>>
>>>>>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>>
>>>>>>> I think the cause of the error may have been a little further
>>>>>>> upstream from what you pasted in the e-mail. One thing that may be
>>>>>>> happening is that you are taxing resources (like IO) if running MAKER
>>>>>>> multiple times or on too many CPUs. That can lead to failures because of
>>>>>>> truncated BLAST reports etc. In which case you can just retry and that will
>>>>>>> get around those types of IO derived errors. MAKER can generate a lot of
>>>>>>> IO, and if you are working on network mounted locations (i.e. the storage
>>>>>>> being used is actually across the network), then they can be less robust
>>>>>>> than local storage (when under heavy load NFS can falsely report success on
>>>>>>> read/write operations that actually failed). It's the reason we built in
>>>>>>> the retry capabilities of MAKER.
>>>>>>>
>>>>>>> For contigs that continuously fail, you may need to set clean_try=1.
>>>>>>> That will cause failures to start from scratch (i.e. delete all old reports
>>>>>>> on failure rather than just those suspected of being truncated).
>>>>>>>
>>>>>>> --Carson
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Dear Carson:
>>>>>>>
>>>>>>> About the error in my email above: I found that the contig was annotated
>>>>>>> correctly on the second RETRY, so please ignore my last email. But
>>>>>>> now, for a small number of scaffolds, I am running into problems when
>>>>>>> processing the repeats (as shown below in red). I used both the Mammalia
>>>>>>> repeat library and a species-specific repeat library (generated with
>>>>>>> your pipeline
>>>>>>> "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic").
>>>>>>> There were no such problems when I used only the Mammalia repeat library.
>>>>>>> Do you have any ideas about this? What could be the reason? Or do you have
>>>>>>> any suggestions for how I can find the reason? Many thanks
>>>>>>>
>>>>>>> Here are some parameters I used
>>>>>>>
>>>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>>>> model_org=Mammalia #select a model organism for RepBase masking in
>>>>>>> RepeatMasker
>>>>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism
>>>>>>> specific repeat library in fasta format for Repe
>>>>>>>
>>>>>>> max_dna_len=300000
>>>>>>> split_hit=40000
>>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>>>
>>>>>>>
>>>>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm
>>>>>>> line 188.
>>>>>>> 33708 --> rank=NA, hostname=n409
>>>>>>> 33709 ERROR: Failed while processing all repeats
>>>>>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>>>>>> 33711 FAILED CONTIG:Contig31
>>>>>>>
>>>>>>>
>>>>>>> Best
>>>>>>> Quanwei
>>>>>>>
>>>>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>>>>>>
>>>>>>>> Dear Carson:
>>>>>>>>
>>>>>>>> I got the following error again. Is this still related to memory
>>>>>>>> issues? I wonder whether there could be other reasons for this error.
>>>>>>>> This time, I got the error during training of the SNAP model. Before, even
>>>>>>>> when I set max_dna_len=1Mb, I could train the model successfully. In the
>>>>>>>> current training (where I get the following error), I have decreased
>>>>>>>> max_dna_len to 300kb. I requested the same amount of memory as before. The
>>>>>>>> only difference is that I am using both the mammalian repeat library and
>>>>>>>> the species-specific repeat library, while previously I used only the
>>>>>>>> mammalian repeat library. Will using both repeat libraries greatly increase
>>>>>>>> the memory requirement (even when I decrease max_dna_len from 1Mb to
>>>>>>>> 300kb)? I have also set depth_blast to 30 in the current training.
>>>>>>>>
>>>>>>>> Thank you! Have a nice weekend!
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> #---------------------------------------------------------------------
>>>>>>>> Now starting the contig!!
>>>>>>>> SeqID: Contig10
>>>>>>>> Length: 18773588
>>>>>>>> #---------------------------------------------------------------------
>>>>>>>>
>>>>>>>>
>>>>>>>> setting up GFF3 output and fasta chunks
>>>>>>>> doing repeat masking
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> collecting blastx repeatmasking
>>>>>>>> processing all repeats
>>>>>>>> doing repeat masking
>>>>>>>> Can't kill a non-numeric process ID at
>>>>>>>> /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line
>>>>>>>> 1050.
>>>>>>>> --> rank=NA, hostname=n224
>>>>>>>> ERROR: Failed while doing repeat masking
>>>>>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>>>>>> FAILED CONTIG:Contig10
>>>>>>>>
>>>>>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>>>>>> FAILED CONTIG:Contig10
>>>>>>>>
>>>>>>>> Best
>>>>>>>> Quanwei
>>>>>>>>
>>>>>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> (2) By reading some of your replies in the maker google group, and
>>>>>>>>> I noticed that it can reduce memory and save time for annotation if I set
>>>>>>>>> depth_blast to a certain number. So I changed the following parameters. But
>>>>>>>>> I wonder, whether it will decrease the quality of annotation? If it won't
>>>>>>>>> affect the quality, can I even use a smaller number (e.g., 20) to save more
>>>>>>>>> memory and time?
>>>>>>>>>
>>>>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element
>>>>>>>>> masking
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> These values really only affect the final evidence kept in the
>>>>>>>>> GFF3 when you look at it in a browser. They have no effect on the annotation.
>>>>>>>>> This is because internally MAKER already collapses evidence down to the 10
>>>>>>>>> best non-redundant features per evidence set per locus. The rest are put in
>>>>>>>>> the GFF3 just for reference. By setting it lower, you are just letting
>>>>>>>>> MAKER know it can throw things away even sooner since you don't want them
>>>>>>>>> in the GFF3. It provides a minor improvement for memory use, but
>>>>>>>>> max_dna_len is the big one that has the greatest effect.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> (3) I also have some concerns about the speed, especially for the
>>>>>>>>> long scaffolds (around 100Mb). I wonder which part is the most time
>>>>>>>>> consuming for genome annotation (repeat masking, blast, or polishing?).
>>>>>>>>> Particularly, I wonder whether the blastx of protein evidence will take
>>>>>>>>> majority of time. Now, I have prepared 99k mammalian Swiss protein
>>>>>>>>> sequences and 340k rodent TrEMBL protein sequences as protein evidences. I
>>>>>>>>> am considering whether I can save much time if I only use the 99k mammalian
>>>>>>>>> Swiss protein sequences as evidences.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>>>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at
>>>>>>>>> least 6 times slower than BLASTN
>>>>>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at
>>>>>>>>> least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>>>>>>
>>>>>>>>> Also, doubling the dataset size doubles the runtime. Larger window
>>>>>>>>> sizes via max_dna_length will also increase runtimes.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> (4) For some reasons, I can not run maker though MPI on our
>>>>>>>>> cluster. So I can only start multiple maker. I wonder if it is possible to
>>>>>>>>> let multiple maker to annotate the same long scaffold (i.e., for a single
>>>>>>>>> sequence I start multiple maker, without splitting the long sequence into
>>>>>>>>> shorter ones).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Without MPI you won't be able to split up large contigs. At the
>>>>>>>>> very least you can try to run on a single node and set MPI to use all CPUs
>>>>>>>>> on that node. It's less difficult to set up than cross-node jobs via
>>>>>>>>> MPI.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> (5) Still about the speed issue. I read some of your comments
>>>>>>>>> about "cpus" parameters in the maker_opts file (
>>>>>>>>> http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-a
>>>>>>>>> llocate-memory-td4025117.html). And I know it indicate the number
>>>>>>>>> of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file,
>>>>>>>>> then I can use the following command to submit the job, right?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The cpus parameter only affects how many CPUs are given to the
>>>>>>>>> BLAST command line, so only the BLAST step will speed up. I recommend
>>>>>>>>> using MPI to get all steps to speed up. Even if you are only running on a
>>>>>>>>> single node, you can give all CPUs to the mpiexec command.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --Carson
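
The rule of thumb in the reply above (BLASTX at least 6 times slower than BLASTN, TBLASTX at least 12 times slower, and runtime growing in proportion to dataset size) can be turned into a rough back-of-the-envelope estimate. This is only a sketch: the function name and the strictly linear scaling are assumptions made for illustration, and the factors are the lower bounds quoted in the thread, not measured timings.

```python
# Relative search cost per BLAST flavour, normalised to BLASTN = 1.
# These factors are the lower bounds quoted in the reply, not benchmarks.
RELATIVE_COST = {"blastn": 1, "blastx": 6, "tblastx": 12}


def relative_runtime(program, dataset_scale=1.0):
    """Return a unitless lower-bound cost relative to BLASTN on a baseline set.

    dataset_scale is the evidence set size relative to the baseline
    (2.0 means twice as many sequences); linear scaling is assumed.
    """
    if dataset_scale < 0:
        raise ValueError("dataset_scale must be non-negative")
    return RELATIVE_COST[program.lower()] * dataset_scale


# Evidence sets mentioned in the question: 99k Swiss-Prot proteins alone
# versus 99k Swiss-Prot plus 340k TrEMBL proteins, both searched with BLASTX.
swissprot_only = relative_runtime("blastx", 99_000 / 99_000)
with_trembl = relative_runtime("blastx", (99_000 + 340_000) / 99_000)
speedup = with_trembl / swissprot_only
```

Under these assumptions, dropping the TrEMBL set would shorten the BLASTX step roughly 4.4-fold; the other pipeline steps are not covered by this estimate.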

From xvazquezc at gmail.com  Sun Sep 17 19:12:56 2017
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Mon, 18 Sep 2017 11:12:56 +1000
Subject: [maker-devel] augustus underpredicting
In-Reply-To: <E0AFDA01-3142-4A14-A4F0-E621E5BD4C2F@gmail.com>
References: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>
	<E0AFDA01-3142-4A14-A4F0-E621E5BD4C2F@gmail.com>
Message-ID: <CAL0hg4HWHjmEJ680+c4XiajWaE5WKqfbVK4oeNYVDWPuz_Oh+Q@mail.gmail.com>

I did it that way and AUGUSTUS is now predicting a more reasonable number of
genes: about 12500 in MAKER, but about 19000 in the model assessment step.
In comparison, SNAP gives 16000 and GeneMark 19000.

I haven't found any reference on this, but would it be a good idea to train
Augustus on the masked genome instead?
Thanks,


On 12 September 2017 at 02:50, Carson Holt <carsonhh at gmail.com> wrote:

> BUSCO may be generating too few models. BUSCO also identifies classes of
> conserved short genes that may not represent enough training diversity for
> your organism. Try running MAKER in protein2genome or est2genome mode, and
> then train with those results.
>
> --Carson
>
>
> On Sep 10, 2017, at 7:03 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com>
> wrote:
>
> Hi,
> I have been annotating a fungal genome as usual, using Busco-trained
> Augustus (in addition to GeneMark and SNAP), but for some reason, Augustus
> is predicting a mere 207 genes compared to 15-20k from the other two.
> I've never had this problem. The genome has an unusual repeat content
> close to 50%, not sure if that might suppose a problem.
> Has anybody come up with any similar issue?
> I also asked to Busco developers if they have any idea
> https://gitlab.com/ezlab/busco/issues/49
> Cheers,
> Xabi
>
> --
> Xabier Vázquez-Campos, *PhD*
> *Research Associate*
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA
>
>
>


-- 
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From qwzhang0601 at gmail.com  Mon Sep 18 21:07:25 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 18 Sep 2017 23:07:25 -0400
Subject: [maker-devel] Question about "maker-", "augustus_masked",
 "snap_masked" gene model
Message-ID: <CAOW6FS+MkBOsHct0Ph+mdHD-VTeyN2RbHS2_1Neye6qZtUDWrA@mail.gmail.com>

Hello:

Would you please explain the difference between
"maker-...-augustus..." and "augustus_masked..." gene models?

I know "augustus_masked..." gene models are raw Augustus predictions, while
"maker-...-augustus..." are hint-derived gene models. But by default, MAKER2
only reports gene models with evidence support (protein sequences or
transcripts). So why are some gene models hint-derived while other models
(with evidence support) are raw Augustus predictions, even though there is
protein or transcript evidence?

BTW, is it true that the "maker-...-augustus..." gene models are generally
more reliable than the "augustus_masked..." gene models?

Thanks

Best
Quanwei

From qwzhang0601 at gmail.com  Mon Sep 18 22:14:38 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Tue, 19 Sep 2017 00:14:38 -0400
Subject: [maker-devel] about min_protein
Message-ID: <CAOW6FSK5PcBbqnOt2s-ObUoEGFhOpXNYhcj=wxzGLWRqJpjusg@mail.gmail.com>

Hello:

I am working on a rodent species and got 28k annotated genes. Do you
have any suggestions about the "min_protein" parameter?

I did not change the parameter in my current annotation, and I get several very
short predicted proteins (some with only 1 amino acid).

min_protein=0 #require at least this many amino acids in predicted proteins

Thanks

Best
Quanwei

From qwzhang0601 at gmail.com  Tue Sep 19 06:47:00 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Tue, 19 Sep 2017 08:47:00 -0400
Subject: [maker-devel] about min_protein
In-Reply-To: <CADE62ED-68F2-4E48-9013-C9BF66E56CAD@gmail.com>
References: <CAOW6FSK5PcBbqnOt2s-ObUoEGFhOpXNYhcj=wxzGLWRqJpjusg@mail.gmail.com>
	<CADE62ED-68F2-4E48-9013-C9BF66E56CAD@gmail.com>
Message-ID: <CAOW6FSLxd=oEHcpq73yGjDQjsdTbU=AE89u6HT3m8md=uiceSQ@mail.gmail.com>

Thank you Daniel. I wonder whether there is a suggested value for the
"min_protein" parameter (e.g., 20 or 50 amino acids) that people
often use. I am studying a rodent species.

Thank you.

Best
Quanwei

2017-09-19 8:29 GMT-04:00 Daniel Ence <dandence at gmail.com>:

> Hi Quanwei,
>
> Increasing the "min_protein" parameter should get rid of those very short
> predicted proteins.
>
>
>
> > On Sep 19, 2017, at 12:14 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
> wrote:
> >
> > Hello:
> >
> > I am working on a rodent species and get 28k annotated genes, I wonder
> whether you have any suggestions about the "min_protein" parameter?
> >
> > I did not change the parameter in my current annotation. I get several
> very short predicted proteins (even those with only 1 amino acid).
> >
> > min_protein=0 #require at least this many amino acids in predicted
> proteins
> >
> > Thanks
> >
> > Best
> > Quanwei
> > _______________________________________________
> > maker-devel mailing list
> > maker-devel at box290.bluehost.com
> > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
>

From dandence at gmail.com  Tue Sep 19 06:29:35 2017
From: dandence at gmail.com (Daniel Ence)
Date: Tue, 19 Sep 2017 08:29:35 -0400
Subject: [maker-devel] about min_protein
In-Reply-To: <CAOW6FSK5PcBbqnOt2s-ObUoEGFhOpXNYhcj=wxzGLWRqJpjusg@mail.gmail.com>
References: <CAOW6FSK5PcBbqnOt2s-ObUoEGFhOpXNYhcj=wxzGLWRqJpjusg@mail.gmail.com>
Message-ID: <CADE62ED-68F2-4E48-9013-C9BF66E56CAD@gmail.com>

Hi Quanwei, 

Increasing the "min_protein" parameter should get rid of those very short predicted proteins.


> On Sep 19, 2017, at 12:14 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Hello:
> 
> I am working on a rodent species and get 28k annotated genes, I wonder whether you have any suggestions about the "min_protein" parameter? 
> 
> I did not change the parameter in my current annotation. I get several very short predicted proteins (even those with only 1 amino acid). 
>  
> min_protein=0 #require at least this many amino acids in predicted proteins
> 
> Thanks
> 
> Best
> Quanwei
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org


From tuanduonganh at gmail.com  Tue Sep 19 11:23:39 2017
From: tuanduonganh at gmail.com (Tuan Duong Anh)
Date: Tue, 19 Sep 2017 19:23:39 +0200
Subject: [maker-devel] MAKER3 beta - EVM under predicting
Message-ID: <CAPpYYpyY_4kz4L10qMPU9RuRgmW+viFU8wVwOUaYV=WZnV=wHQ@mail.gmail.com>

Dear MAKER-devel group

I have been testing the MAKER3 beta version and found that EVM always
returns far fewer models. Has anyone experienced this before? I
do expect EVM to return fewer models than the other predictors, but not
to this extent (only 20% of the expected gene models). Any suggestions would
be much appreciated.

## Number of models obtained by each gene predictor:
HLIG.all.maker.augustus_masked.proteins.fasta:11224
HLIG.all.maker.evm.proteins.fasta:1974
HLIG.all.maker.genemark.proteins.fasta:11352
HLIG.all.maker.proteins.fasta:13672
HLIG.all.maker.snap_masked.proteins.fasta:13404

## maker_evm.ctl
#-----Transcript weights
evmtrans=10 #default weight for source unspecified est/alt_est alignments
evmtrans:blastn=0 #weight for blastn sourced alignments
evmtrans:est2genome=10 #weight for est2genome sourced alignments
evmtrans:tblastx=0 #weight for tblastx sourced alignments
evmtrans:cdna2genome=7 #weight for cdna2genome sourced alignments

#-----Protein weights
evmprot=10 #default weight for source unspecified protein alignments
evmprot:blastx=2 #weight for blastx sourced alignments
evmprot:protein2genome=10 #weight for protein2genome sourced alignments

#-----Abinitio Prediction weights
evmab=10 #default weight for source unspecified ab initio predictions
evmab:snap=7 #weight for snap sourced predictions
evmab:augustus=10 #weight for augustus sourced predictions
evmab:fgenesh=10 #weight for fgenesh sourced predictions
evmab:genemark=10 #weight for genemark sourced predictions


Regards,

Tuan
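
The per-file counts in this message can be reproduced and compared with a few lines of Python. The sketch below is illustrative only: the helper names are hypothetical, and it assumes each protein FASTA record starts with a ">" header line. The numbers in the dictionary are the ones reported above.

```python
def count_fasta_headers(lines):
    """Count FASTA records by counting lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def fraction_of(counts, name, reference):
    """Return counts[name] as a fraction of counts[reference]."""
    return counts[name] / counts[reference]


# Counts reported in the message, keyed by the label in each file name.
reported = {
    "augustus_masked": 11224,
    "evm": 1974,
    "genemark": 11352,
    "maker": 13672,
    "snap_masked": 13404,
}
evm_vs_maker = fraction_of(reported, "evm", "maker")
evm_vs_augustus = fraction_of(reported, "evm", "augustus_masked")

# To count a real file (path taken from the listing above):
#   with open("HLIG.all.maker.evm.proteins.fasta") as handle:
#       n_models = count_fasta_headers(handle)
```

With the reported numbers, the EVM set is roughly 14% of the final MAKER set and roughly 18% of the raw Augustus set.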

From carsonhh at gmail.com  Tue Sep 19 15:34:40 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 Sep 2017 15:34:40 -0600
Subject: [maker-devel] augustus underpredicting
In-Reply-To: <CAL0hg4HWHjmEJ680+c4XiajWaE5WKqfbVK4oeNYVDWPuz_Oh+Q@mail.gmail.com>
References: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>
	<E0AFDA01-3142-4A14-A4F0-E621E5BD4C2F@gmail.com>
	<CAL0hg4HWHjmEJ680+c4XiajWaE5WKqfbVK4oeNYVDWPuz_Oh+Q@mail.gmail.com>
Message-ID: <3B1C5B03-58F9-4427-815A-1D725C740A04@gmail.com>

Gene predictors tend to overpredict, so I would not take the high numbers given by SNAP and GeneMark as true counts. You will probably end up with something like 7-10k in the final results. But now that Augustus is giving a higher count, you should be good to start running MAKER.

--Carson


> On Sep 17, 2017, at 7:12 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com> wrote:
> 
> I did it that way and AUGUSTUS is predicting a more reasonable number of genes, about 12500 in Maker, but about 19000 in the model assessment step.
> In comparison, SNAP gives 16000 and GeneMark 19000.
> 
> I haven't found any reference about but, would it be a good idea to train Augustus over the masked genome instead?
> Thanks,
> 
> 
> 
> On 12 September 2017 at 02:50, Carson Holt <carsonhh at gmail.com> wrote:
> BUSCO may be generating too few models. BUSCO also identifies classes of conserved short genes that may not represent enough training diversity for your organism. Try running MAKER in protein2genome or est2genome mode, and then train with those results.
> 
> --Carson
> 
> 
>> On Sep 10, 2017, at 7:03 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com> wrote:
>> 
>> Hi,
>> I have been annotating a fungal genome as usual, using Busco-trained Augustus (in addition to GeneMark and SNAP), but for some reason, Augustus is predicting a mere 207 genes compared to 15-20k from the other two.
>> I've never had this problem. The genome has an unusual repeat content close to 50%, not sure if that might suppose a problem.
>> Has anybody come up with any similar issue?
>> I also asked the BUSCO developers if they have any idea: https://gitlab.com/ezlab/busco/issues/49
>> Cheers,
>> Xabi
>> 
>> -- 
>> Xabier Vázquez-Campos, PhD
>> Research Associate
>> NSW Systems Biology Initiative
>> School of Biotechnology and Biomolecular Sciences
>> The University of New South Wales
>> Sydney NSW 2052 AUSTRALIA
> 
> 
> 
> 
> -- 
> Xabier Vázquez-Campos, PhD
> Research Associate
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA


From carsonhh at gmail.com  Tue Sep 19 15:40:27 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 Sep 2017 15:40:27 -0600
Subject: [maker-devel] Question about "maker-", "augustus_masked",
 "snap_masked" gene model
In-Reply-To: <CAOW6FS+MkBOsHct0Ph+mdHD-VTeyN2RbHS2_1Neye6qZtUDWrA@mail.gmail.com>
References: <CAOW6FS+MkBOsHct0Ph+mdHD-VTeyN2RbHS2_1Neye6qZtUDWrA@mail.gmail.com>
Message-ID: <56CC4BEB-083E-4DE6-99F3-CB34A1735AB4@gmail.com>

MAKER uses all derived models as a pool of alternate models for a given locus. The one that best matches the aligned evidence is then selected using the AED calculation described in the MAKER2 publication. Overall, hint-based models tend to perform better than the raw models because they get extra information about the observed intron/exon structure from the alignments. There is also a discussion of this in the MAKER2 paper.

--Carson
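
As a rough illustration of the selection step described above: Annotation Edit Distance (AED) is defined as 1 minus the average of sensitivity and specificity of a model against its evidence, so 0 means perfect agreement and 1 means no support. The sketch below is a simplified base-level version; the function names, the candidate labels, and the toy coordinates are assumptions for illustration and are not MAKER's actual code, which handles several evidence types.

```python
def covered_bases(intervals):
    """Expand 1-based inclusive (start, end) intervals into a set of positions."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def aed(model_exons, evidence_exons):
    """Base-level AED: 1 - (sensitivity + specificity) / 2."""
    model = covered_bases(model_exons)
    evidence = covered_bases(evidence_exons)
    if not model or not evidence:
        return 1.0  # nothing to compare against: treat as unsupported
    shared = len(model & evidence)
    sensitivity = shared / len(evidence)  # fraction of evidence recovered
    specificity = shared / len(model)     # fraction of model that is supported
    return 1.0 - (sensitivity + specificity) / 2.0


def best_model(candidates, evidence_exons):
    """Return the name of the candidate model with the lowest AED."""
    return min(candidates, key=lambda name: aed(candidates[name], evidence_exons))


# Toy locus: evidence covers two exons; the raw prediction overextends exon 2.
evidence = [(100, 199), (300, 399)]
candidates = {
    "raw_prediction": [(100, 199), (300, 449)],
    "hint_based": [(100, 199), (300, 399)],
}
chosen = best_model(candidates, evidence)
```

In this toy case the hint-based model matches the evidence exactly (AED 0) while the raw prediction scores about 0.1, so the hint-based model is the one kept.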


> On Sep 18, 2017, at 9:07 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Hello:
> 
> Would you please explain what is the difference between "maker-...-agustus..." and "augustus_masked..." gene models? 
> 
> I know  "augustus_masked..." gene models are raw august predictions, while "maker-...-agustus..." are hit derived gene models. But by default, maker2 reports gene models with evidence support (protein sequences or transcripts). Then why some gene models are hit derived while other models (with evidence support) are raw augustus prediction (even there are protein sequences or transcript evidence)?
> 
> BTW, is it true that generally the "maker-...-agustus..." gene models are more reliable than the "augustus_masked..." gene models?  
> 
> Thanks
> 
> Best
> Quanwei


From carsonhh at gmail.com  Tue Sep 19 15:41:40 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 Sep 2017 15:41:40 -0600
Subject: [maker-devel] about min_protein
In-Reply-To: <CAOW6FSLxd=oEHcpq73yGjDQjsdTbU=AE89u6HT3m8md=uiceSQ@mail.gmail.com>
References: <CAOW6FSK5PcBbqnOt2s-ObUoEGFhOpXNYhcj=wxzGLWRqJpjusg@mail.gmail.com>
	<CADE62ED-68F2-4E48-9013-C9BF66E56CAD@gmail.com>
	<CAOW6FSLxd=oEHcpq73yGjDQjsdTbU=AE89u6HT3m8md=uiceSQ@mail.gmail.com>
Message-ID: <FFA05628-32ED-4036-9FDC-E6C7BC4EAE4C@gmail.com>

The value is arbitrary, but some submission databases like NCBI will flag entries under ~20-30 amino acids as errors if you try to submit them (I can't remember the exact number).

--Carson
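
Before picking a min_protein value, it can help to see which predicted proteins would fall below a candidate cutoff. The sketch below is a hypothetical post-hoc check on a protein FASTA, not part of MAKER; the cutoff of 30 is only an example taken from the range mentioned above, and the filter itself would still be applied by setting min_protein in maker_opts.ctl.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def short_proteins(lines, min_length):
    """Return headers of proteins shorter than min_length amino acids.

    A trailing '*' (stop codon symbol) is not counted as an amino acid.
    """
    return [
        header
        for header, seq in read_fasta(lines)
        if len(seq.rstrip("*")) < min_length
    ]


# Toy input: one 1-residue protein and one 40-residue protein split over lines.
example = [">gene1", "M*", ">gene2", "M" + "A" * 19, "K" * 20]
flagged = short_proteins(example, 30)
```

Here only gene1 is flagged; running the same function over the real proteins FASTA with a few candidate cutoffs shows how many models each choice would remove.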


> On Sep 19, 2017, at 6:47 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Thank you Daniel. I wonder whether there is a suggested value for the "min_protein" parameter (e.g., 20 amino acid, 50 amino acid?), that people often use. I am studying a rodent species.
> 
> Thank you.
> 
> Best
> Quanwei
> 
> 2017-09-19 8:29 GMT-04:00 Daniel Ence <dandence at gmail.com>:
> Hi Quanwei,
> 
> Increasing the "min_protein" parameter should get rid of those very short predicted proteins.
> 
> 
> 
> > On Sep 19, 2017, at 12:14 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> >
> > Hello:
> >
> > I am working on a rodent species and get 28k annotated genes, I wonder whether you have any suggestions about the "min_protein" parameter?
> >
> > I did not change the parameter in my current annotation. I get several very short predicted proteins (even those with only 1 amino acid).
> >
> > min_protein=0 #require at least this many amino acids in predicted proteins
> >
> > Thanks
> >
> > Best
> > Quanwei
> > _______________________________________________
> > maker-devel mailing list
> > maker-devel at box290.bluehost.com
> > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
> 
> 
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org


From carsonhh at gmail.com  Tue Sep 19 15:47:42 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 Sep 2017 15:47:42 -0600
Subject: [maker-devel] MAKER3 beta - EVM under predicting
In-Reply-To: <CAPpYYpyY_4kz4L10qMPU9RuRgmW+viFU8wVwOUaYV=WZnV=wHQ@mail.gmail.com>
References: <CAPpYYpyY_4kz4L10qMPU9RuRgmW+viFU8wVwOUaYV=WZnV=wHQ@mail.gmail.com>
Message-ID: <12FE3318-F0DE-485B-B43A-25A4A6EC9390@gmail.com>

If ab initio predictors and evidence alignments aren't in high concordance, then EVM won't produce results. This often indicates minor sequencing errors in the assembly (very common in draft assemblies). Ab initio predictors will slightly alter splicing and extend introns/exons to make a model work around these variations, but the result does not always agree well with the alignments, so EVM produces nothing. In these cases it is often better just to train the predictors as well as you can, and then take the standard MAKER results.

--Carson


> On Sep 19, 2017, at 11:23 AM, Tuan Duong Anh <tuanduonganh at gmail.com> wrote:
> 
> Dear MAKER-devel group
> 
> I have been testing out MAKER3 beta version and found out that EVM always returns much less number of models. Did any one experience this before? I do expect that EVM will return less models when compare to other, but not to this extend (only 20% of the expected gene models). Any suggestion would be much appreciated.
> 
> ## Number of models obtained by each gene predictors:
> HLIG.all.maker.augustus_masked.proteins.fasta:11224
> HLIG.all.maker.evm.proteins.fasta:1974
> HLIG.all.maker.genemark.proteins.fasta:11352
> HLIG.all.maker.proteins.fasta:13672
> HLIG.all.maker.snap_masked.proteins.fasta:13404
> 
> ## maker_evm.ctl
> #-----Transcript weights
> evmtrans=10 #default weight for source unspecified est/alt_est alignments
> evmtrans:blastn=0 #weight for blastn sourced alignments
> evmtrans:est2genome=10 #weight for est2genome sourced alignments
> evmtrans:tblastx=0 #weight for tblastx sourced alignments
> evmtrans:cdna2genome=7 #weight for cdna2genome sourced alignments
> 
> #-----Protein weights
> evmprot=10 #default weight for source unspecified protein alignments
> evmprot:blastx=2 #weight for blastx sourced alignments
> evmprot:protein2genome=10 #weight for protein2genome sourced alignments
> 
> #-----Abinitio Prediction weights
> evmab=10 #default weight for source unspecified ab initio predictions
> evmab:snap=7 #weight for snap sourced predictions
> evmab:augustus=10 #weight for augustus sourced predictions
> evmab:fgenesh=10 #weight for fgenesh sourced predictions
> evmab:genemark=10 #weight for genemark sourced predictions
> 
> 
> Regards,
> Tuan
> 
> 
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org


From xvazquezc at gmail.com  Tue Sep 19 18:02:04 2017
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Wed, 20 Sep 2017 10:02:04 +1000
Subject: [maker-devel] augustus underpredicting
In-Reply-To: <3B1C5B03-58F9-4427-815A-1D725C740A04@gmail.com>
References: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>
	<E0AFDA01-3142-4A14-A4F0-E621E5BD4C2F@gmail.com>
	<CAL0hg4HWHjmEJ680+c4XiajWaE5WKqfbVK4oeNYVDWPuz_Oh+Q@mail.gmail.com>
	<3B1C5B03-58F9-4427-815A-1D725C740A04@gmail.com>
Message-ID: <CAL0hg4FDD8JMXCauxuSjqsY0z=H92utgqMfmoH+s_Los6_c39g@mail.gmail.com>

Thanks Carson.

One last quick question. After the first run (before using the gene predictors)
I ran fasta_merge to get an idea of the numbers I should be looking for.
In summary, I got 14000 genes, using only Swiss-Prot and a closely related, highly
curated reference genome (to avoid any "fake" or partial proteins
from draft annotations), plus assembled RNA-seq from my genome.
How should I use this as a guide, if I can at all? Is this a
number I should be aiming for as a minimum number of genes? A maximum? Something
around that?

PS: my genome (fungus) is 80+ Mbp in just 150 contigs, so I expect very few
fragmented genes due to the assembly (sequencing errors aside).

On 20 September 2017 at 07:34, Carson Holt <carsonhh at gmail.com> wrote:

> Gene predictors tend to over predict, so I would not take the high numbers
> given by SNAP and GeneMark as true counts. You will probably end up with
> something like 7-10k in the final results. But now Augustus is giving a
> higher count, you should be good to start running MAKER.
>
> --Carson
>
>
>
>
> On Sep 17, 2017, at 7:12 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com>
> wrote:
>
> I did it that way and AUGUSTUS is predicting a more reasonable number of
> genes, about 12500 in Maker, but about 19000 in the model assessment step.
> In comparison, SNAP gives 16000 and GeneMark 19000.
>
> I haven't found any reference about but, would it be a good idea to train
> Augustus over the masked genome instead?
> Thanks,
>
>
>
> On 12 September 2017 at 02:50, Carson Holt <carsonhh at gmail.com> wrote:
>
>> BUSCO may be generating too few models. BUSCO also identifies classes of
>> conserved short genes that may not represent enough training diversity for
>> your organism. Try running MAKER in protein2genome or est2genome mode, and
>> then train with those results.
>>
>> --Carson
>>
>>
>> On Sep 10, 2017, at 7:03 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com>
>> wrote:
>>
>> Hi,
>> I have been annotating a fungal genome as usual, using Busco-trained
>> Augustus (in addition to GeneMark and SNAP), but for some reason, Augustus
>> is predicting a mere 207 genes compared to 15-20k from the other two.
>> I've never had this problem. The genome has an unusual repeat content
>> close to 50%, not sure if that might suppose a problem.
>> Has anybody come up with any similar issue?
>> I also asked to Busco developers if they have any idea
>> https://gitlab.com/ezlab/busco/issues/49
>> Cheers,
>> Xabi
>>
>> --
>> Xabier Vázquez-Campos, *PhD*
>> *Research Associate*
>> NSW Systems Biology Initiative
>> School of Biotechnology and Biomolecular Sciences
>> The University of New South Wales
>> Sydney NSW 2052 AUSTRALIA
>>
>>
>>
>
>
> --
> Xabier Vázquez-Campos, *PhD*
> *Research Associate*
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA
>
>
>


-- 
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From himanimalhotra89 at gmail.com  Tue Sep 19 22:56:55 2017
From: himanimalhotra89 at gmail.com (himani malhotra)
Date: Wed, 20 Sep 2017 10:26:55 +0530
Subject: [maker-devel] Fwd: maker error
In-Reply-To: <CACOCz95arkZmq+F_cVjdcCH1GZap6UYvSsTY3oLJZY1+gJsY1w@mail.gmail.com>
References: <CACOCz95arkZmq+F_cVjdcCH1GZap6UYvSsTY3oLJZY1+gJsY1w@mail.gmail.com>
Message-ID: <CACOCz94NFJoo2GmohrCnvNGi3YDtK8TV5dX-0En7XMHRfLiSdA@mail.gmail.com>

---------- Forwarded message ----------
From: himani malhotra <himanimalhotra89 at gmail.com>
Date: Wed, Sep 20, 2017 at 10:24 AM
Subject: maker error
To: maker-devel-request at box290.bluehost.com


Hello,
I am using MAKER for gene prediction and am getting an error in the Repbase
installation. I have attached the error message. I
installed Repbase manually and unpacked its libraries into the RepeatMasker
library directory, but I still get the error.
Please help me.


Thanks

Himani
-------------- next part --------------
A non-text attachment was scrubbed...
Name: makererror.png
Type: image/png
Size: 212522 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20170920/b8709d7b/attachment-0001.png>

From munholl at uwindsor.ca  Wed Sep 20 08:53:04 2017
From: munholl at uwindsor.ca (Seth Munholland)
Date: Wed, 20 Sep 2017 10:53:04 -0400
Subject: [maker-devel] Fwd: maker error
In-Reply-To: <CACOCz94NFJoo2GmohrCnvNGi3YDtK8TV5dX-0En7XMHRfLiSdA@mail.gmail.com>
References: <CACOCz95arkZmq+F_cVjdcCH1GZap6UYvSsTY3oLJZY1+gJsY1w@mail.gmail.com>
	<CACOCz94NFJoo2GmohrCnvNGi3YDtK8TV5dX-0En7XMHRfLiSdA@mail.gmail.com>
Message-ID: <CAL=sJwrjccQC0GdDa3Km1TojWMdN1aYoujntVsjdMjJ9ha2YUw@mail.gmail.com>

Hello,

When this happened to me, it was caused by an incorrect path I had set when I
configured RepeatMasker (which I had also installed manually).

Seth Munholland, B.Sc., Ph.D. Candidate
Department of Biological Sciences
Rm. 304 Biology Building
University of Windsor
401 Sunset Ave. N9B 3P4
T: (519) 253-3000 Ext: 4755

On Wed, Sep 20, 2017 at 12:56 AM, himani malhotra <
himanimalhotra89 at gmail.com> wrote:

>
> ---------- Forwarded message ----------
> From: himani malhotra <himanimalhotra89 at gmail.com>
> Date: Wed, Sep 20, 2017 at 10:24 AM
> Subject: maker error
> To: maker-devel-request at box290.bluehost.com
>
>
> hello
> I am using MAKER for gene prediction.I am getting error in Repbase
> installation.I am sending you the error also,please help me.I have
> installed repbase manually and unpacked its libraries in RepeatMasker
> Library but still I am getting error.
> Please help me.
>
>
>
> Thanks
>
> Himani
>
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
>

From Jimmy.Cross at uea.ac.uk  Wed Sep 20 08:02:53 2017
From: Jimmy.Cross at uea.ac.uk (James Cross (ITCS - Staff))
Date: Wed, 20 Sep 2017 14:02:53 +0000
Subject: [maker-devel] Maker MPI across nodes
Message-ID: <DBXPR04MB413DD222F9446CDC493A376BD610@DBXPR04MB413.eurprd04.prod.outlook.com>

Hi Maker Developers,

We are trying to run MAKER with Open MPI on our HPC cluster across two nodes (each node has 28 cores, so 56 cores in total). While MAKER seems to be running correctly, it goes slower when split across two nodes (56 cores) than when run on a single node (28 cores). We are trying to reduce the time MAKER takes to complete its run. Do you know of any reason why MAKER might slow down when split across two nodes?

Our cluster OS is CentOS 6.7 and the HPC scheduler is LSF. We are running Open MPI on a Mellanox InfiniBand network.

The genome data we wish to annotate comprises 1948 scaffolds with an average length of 324890 bp (longest scaffold 6948830 bp).

The command we are using in batch mode is: mpiexec -mca btl ^openib -n 56 maker

Any help or advice you could give would be greatly appreciated.

Best Wishes
Jimmy
----------------------------------------------------------------------
Mr  James Cross
HPC Systems Developer
University of East Anglia
Norwich Research Park
ITCS
Norwich, Norfolk
NR4 7TJ

Information Services


From patrick.tranvan at unil.ch  Thu Sep 21 03:26:52 2017
From: patrick.tranvan at unil.ch (Patrick Tran Van)
Date: Thu, 21 Sep 2017 09:26:52 +0000
Subject: [maker-devel] Advice on my pipeline
In-Reply-To: <58E904BF-9AB8-4AC7-B10B-C902F414E03D@gmail.com>
References: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch>
	<A39F4213-70A8-4E07-AB13-00C427E4F244@gmail.com>
	<1498470630221.84642@unil.ch>
	<696C51C6-5606-4ECB-A8B8-9C077182FFFA@gmail.com>
	<1498908228256.16549@unil.ch>,
	<58E904BF-9AB8-4AC7-B10B-C902F414E03D@gmail.com>
Message-ID: <1505986013492.52354@unil.ch>

Hi Carson,


I have a doubt for the round 2, so in a previous reply you said:


" Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can?t be run in the same directory). "


Does it means that I don't need to modify the section :


#-----Re-annotation Using MAKER Derived GFF3


?


If I let everything by default such as :


altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no


will MAKER then skip redoing the repeat masking and the protein and transcriptome alignments?

Patrick Tran Van

Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206

________________________________
From: Carson Holt <carsonhh at gmail.com>
Sent: Monday, July 3, 2017 10:50 PM
To: Patrick Tran Van
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Advice on my pipeline

maker2zff is just for SNAP training and not for gene filtering (please do not use it for filtering, it does not do what you think).

So the final annotation set after maker with correct_est_fusion is 16,850. To decide which set is better, look at them in a browser (gene counts are not useful for gauging results). A well annotated genome will have evidence clusters that closely match the final models. A poorly annotated genome will have evidence clusters that are split or merged by the models.

The correct_est_fusion option does two things. It trims long overlapping UTR fragments, and it stops evidence clusters from being merged on BLASTP evidence alone (so gene predictors will get unmerged hint regions if clusters are split).

You may also find that using jaccard_clip with Trinity has reduced sensitivity for the transcript data (you may lose things that were there before, but now have better specificity, i.e. fewer false positives). Make sure you provide protein data from at least two related species to help maintain the sensitivity lost from the transcript data. You can also add rejected gene models back in after the fact by using iprscan to identify unsupported models with identifiable protein domains -> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4286374/

Thanks,
Carson


On Jul 1, 2017, at 5:21 AM, Patrick Tran Van <Patrick.TranVan at unil.ch> wrote:

So I have assembled my transcriptome with Trinity using the jaccard clip option, and I have run MAKER with and without correct_est_fusion.

I then used SNAP to train/filter it with:

maker2zff  specie.all.gff

Here are my results:

Number of genes after maker -> Number of genes after maker2zff

- Without correct_est_fusion: 21621 -> 13875
- With correct_est_fusion: 16850 -> 9098

1) If I understand correctly how correct_est_fusion works, it prevents gene merging, so shouldn't it be the other way around?
Normally I should find more genes with correct_est_fusion, right?

2) I think I should find something like 13000-14000 genes for my species. Should I go with the "without correct_est_fusion" set for the 2nd iteration of MAKER?

 Thanks for your help


Patrick Tran Van

Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206

________________________________
From: Carson Holt <carsonhh at gmail.com>
Sent: Monday, June 26, 2017 11:38 PM
To: Patrick Tran Van
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Advice on my pipeline

Sorry, the option is -> correct_est_fusion

It is in the maker_opts.ctl file.
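
A minimal sketch of flipping that switch from the shell. Note that set_ctl_option is a hypothetical helper (it is not part of MAKER), and it assumes the control file uses the usual "key=value #comment" layout:

```shell
# Hypothetical helper (not part of MAKER): set key=value in a control file,
# preserving any trailing "#comment" on the same line.
set_ctl_option() {
    key="$1"; value="$2"; ctl="$3"
    sed "s|^${key}=[^#]*|${key}=${value} |" "$ctl" > "$ctl.tmp" && mv "$ctl.tmp" "$ctl"
}

# Enable the option in the control file in the current directory, if present.
if [ -f maker_opts.ctl ]; then
    set_ctl_option correct_est_fusion 1 maker_opts.ctl
    grep '^correct_est_fusion=' maker_opts.ctl
fi
```

Editing the file by hand in a text editor does exactly the same thing; the helper is only convenient when the same change has to be applied to several run directories.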

I would use both SNAP and Augustus on a few large contigs then review the results manually. If one of them is not behaving well, then drop it. If both behave well (i.e. correlate well with evidence alignments) then keep them both.

--Carson


On Jun 26, 2017, at 3:48 AM, Patrick Tran Van <Patrick.TranVan at unil.ch> wrote:

Thanks for your answer.

1) Do you think that adding Augustus training in addition to SNAP at steps 3 and 5 will add more confidence (instead of adding Augustus only for the final round)?
I ask because I am using autoAug for this and it takes a while to compute.

2) I don't see this option: 'avoid_est_fusion=1'. I tried to add it but I got this error:

WARNING: Invalid option 'avoid_est_fusion' in control file maker_opts.ctl

(I am using v 2.31.8 )


Patrick Tran Van

Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206

________________________________
From: Carson Holt <carsonhh at gmail.com>
Sent: Monday, June 5, 2017 8:29 PM
To: Patrick Tran Van
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Advice on my pipeline

Your plan sounds good. A couple of related notes.

Insect genomes tend to have high gene density, so gene merging will be the primary difficulty. You can avoid merging of mRNA-seq evidence by using options like jaccard_clip in Trinity. Then use avoid_est_fusion=1 inside of MAKER.

Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can't be run in the same directory).

--Carson


On Jun 2, 2017, at 3:56 AM, Patrick Tran Van <Patrick.TranVan at unil.ch> wrote:

Hello,

This is my first time running Maker for an insect genome annotation.

I have found various resources and tried to build a consensus from them. I am looking for your thoughts and advice about my pipeline, whether I can improve something or am doing anything useless:


What I have:
- RNA evidence: transcriptome
- Protein evidence: swissprot/uniprot + BUSCO insect protein set
- CEGMA and BUSCO results for my genome


1) Train SNAP with CEGMA

2) Run MAKER (run A) with repeat masking, using the transcript and protein evidence, the new SNAP file (from step 1) and the Augustus file (from BUSCO).

3) Create a SNAP model from run A.

4) Run MAKER (run B) with the new SNAP model (from step 3), with est2genome=0 and protein2genome=0, providing the GFF file (maker_gff=run_A.gff), turning off repeat masking (rm_pass=1), and reusing previous mapping results (altest_pass=1 and protein_pass=1).

5) Create a SNAP model from run B.

6) Run MAKER (run C) with the new SNAP model (from step 5), with est2genome=0 and protein2genome=0, providing the GFF file (maker_gff=run_B.gff), turning off repeat masking (rm_pass=1), and reusing previous mapping results (altest_pass=1 and protein_pass=1).

7) Create a SNAP model from run C AND create an Augustus gene model from run C.

8) Run MAKER (run D) with the new SNAP model (from step 7) + the Augustus file (from step 7), with est2genome=0 and protein2genome=0, providing the GFF file (maker_gff=run_C.gff), turning off repeat masking (rm_pass=1), and reusing previous mapping results (altest_pass=1 and protein_pass=1). Also use keep_preds=1.


Does this seem coherent?

Cheers,

Patrick Tran Van

Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206

_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org



From carsonhh at gmail.com  Fri Sep 22 11:57:56 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 22 Sep 2017 11:57:56 -0600
Subject: [maker-devel] augustus underpredicting
In-Reply-To: <CAL0hg4FDD8JMXCauxuSjqsY0z=H92utgqMfmoH+s_Los6_c39g@mail.gmail.com>
References: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>
	<E0AFDA01-3142-4A14-A4F0-E621E5BD4C2F@gmail.com>
	<CAL0hg4HWHjmEJ680+c4XiajWaE5WKqfbVK4oeNYVDWPuz_Oh+Q@mail.gmail.com>
	<3B1C5B03-58F9-4427-815A-1D725C740A04@gmail.com>
	<CAL0hg4FDD8JMXCauxuSjqsY0z=H92utgqMfmoH+s_Los6_c39g@mail.gmail.com>
Message-ID: <06E8D6C3-B278-4820-B309-5CF61186FDCB@gmail.com>

I don't think you can use the protein2genome option to estimate gene count. It will turn any alignment that matches at least 50% into a gene model, so you can get a lot of partial models, which will inflate the gene count. It's good enough for training but not so much for annotation.

--Carson


> On Sep 19, 2017, at 6:02 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com> wrote:
> 
> Thanks Carson.
> 
> Last quick question. After the first run (before using the gene predictors) I ran fasta_merge to get an idea of the numbers I should be looking for.
> In summary, I got 14000 genes, only using Swissprot and a close highly curated reference genome to avoid any "fake" protein or partial proteins from draft annotations, plus assembled RNA-seq from my genome. 
> How should I consider this as a guide? (if I can do so) ... Is this a number I should be aiming as a minimum number of genes? maximum? something around that?
> 
> PS my genome (fungus) is 80+ Mbp and just 150 contigs, so I expect very few possible fragments due to assembly (seq errors aside)
> 
> On 20 September 2017 at 07:34, Carson Holt <carsonhh at gmail.com> wrote:
> Gene predictors tend to over predict, so I would not take the high numbers given by SNAP and GeneMark as true counts. You will probably end up with something like 7-10k in the final results. But now that Augustus is giving a higher count, you should be good to start running MAKER.
> 
> --Carson
> 
> 
> 
> 
>> On Sep 17, 2017, at 7:12 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com> wrote:
>> 
>> I did it that way and AUGUSTUS is predicting a more reasonable number of genes, about 12500 in Maker, but about 19000 in the model assessment step.
>> In comparison, SNAP gives 16000 and GeneMark 19000.
>> 
>> I haven't found any reference about this, but would it be a good idea to train Augustus on the masked genome instead?
>> Thanks,
>> 
>> 
>> 
>> On 12 September 2017 at 02:50, Carson Holt <carsonhh at gmail.com> wrote:
>> BUSCO may be generating too few models. BUSCO also identifies classes of conserved short genes that may not represent enough training diversity for your organism. Try running MAKER in protein2genome or est2genome mode, and then train with those results.
>> 
>> --Carson
>> 
>> 
>>> On Sep 10, 2017, at 7:03 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com> wrote:
>>> 
>>> Hi,
>>> I have been annotating a fungal genome as usual, using BUSCO-trained Augustus (in addition to GeneMark and SNAP), but for some reason, Augustus is predicting a mere 207 genes compared to 15-20k from the other two.
>>> I've never had this problem. The genome has an unusual repeat content close to 50%; not sure if that might pose a problem.
>>> Has anybody come across a similar issue?
>>> I also asked the BUSCO developers if they have any idea: https://gitlab.com/ezlab/busco/issues/49
>>> Cheers,
>>> Xabi
>>> 
>>> -- 
>>> Xabier Vázquez-Campos, PhD
>>> Research Associate
>>> NSW Systems Biology Initiative
>>> School of Biotechnology and Biomolecular Sciences
>>> The University of New South Wales
>>> Sydney NSW 2052 AUSTRALIA
>> 
>> 
>> 
>> 
>> -- 
>> Xabier Vázquez-Campos, PhD
>> Research Associate
>> NSW Systems Biology Initiative
>> School of Biotechnology and Biomolecular Sciences
>> The University of New South Wales
>> Sydney NSW 2052 AUSTRALIA
> 
> 
> 
> 
> -- 
> Xabier Vázquez-Campos, PhD
> Research Associate
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA


From carsonhh at gmail.com  Fri Sep 22 13:47:36 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 22 Sep 2017 13:47:36 -0600
Subject: [maker-devel] Fwd: maker error
In-Reply-To: <CACOCz94NFJoo2GmohrCnvNGi3YDtK8TV5dX-0En7XMHRfLiSdA@mail.gmail.com>
References: <CACOCz95arkZmq+F_cVjdcCH1GZap6UYvSsTY3oLJZY1+gJsY1w@mail.gmail.com>
	<CACOCz94NFJoo2GmohrCnvNGi3YDtK8TV5dX-0En7XMHRfLiSdA@mail.gmail.com>
Message-ID: <5196E0C2-9FDC-4B6A-9D14-CA8514E002EF@gmail.com>

You have a couple of errors at the start indicating that you may have an issue with the perl forks module as well as the RepeatMasker installation. I'd recommend redoing both installations. Also, the screenshot you show is not the failure; it is MAKER giving up after failing 2 times. To capture the actual failure, set the try count to 3, then rerun and see what comes up in STDERR. Redirect STDERR to a file using "&>".
Example:
maker &> err.log
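
Once err.log exists, one way to get at the real failure is to pull out the error lines. This is only a rough sketch: it assumes failures show up on lines containing "ERROR" or "FAILED" (as in the messages usually posted to this list), and summarize_maker_log is just an illustrative name.

```shell
# Sketch: list the error lines (with line numbers) from a captured MAKER log.
# Assumes the log was written with `maker &> err.log` (bash syntax; the
# POSIX equivalent is `maker > err.log 2>&1`).
summarize_maker_log() {
    grep -nE 'ERROR|FAILED' "$1"
}

# Show the first 20 matches if the log is present in the current directory.
if [ -f err.log ]; then
    summarize_maker_log err.log | head -n 20
fi
```

The earliest matches are normally the ones worth reading, since later ones tend to be consequences of the first failure.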

Thanks,
Carson


On Sep 19, 2017, at 10:56 PM, himani malhotra <himanimalhotra89 at gmail.com> wrote:

> 
> ---------- Forwarded message ----------
> From: himani malhotra <himanimalhotra89 at gmail.com>
> Date: Wed, Sep 20, 2017 at 10:24 AM
> Subject: maker error
> To: maker-devel-request at box290.bluehost.com
> 
> 
> hello
> I am using MAKER for gene prediction. I am getting an error in the Repbase installation. I am also sending you the error; please help me. I have installed Repbase manually and unpacked its libraries in the RepeatMasker Library, but I am still getting the error.
> Please help me.
> 
> 
> 
> Thanks 
> 
> Himani
> 
> <makererror.png>

From carsonhh at gmail.com  Fri Sep 22 13:59:17 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 22 Sep 2017 13:59:17 -0600
Subject: [maker-devel] Maker MPI across nodes
In-Reply-To: <DBXPR04MB413DD222F9446CDC493A376BD610@DBXPR04MB413.eurprd04.prod.outlook.com>
References: <DBXPR04MB413DD222F9446CDC493A376BD610@DBXPR04MB413.eurprd04.prod.outlook.com>
Message-ID: <BD2A6E4D-280B-4B38-AA1C-05C03503848C@gmail.com>

The "-mca btl ^openib" flag has the side effect of bypassing InfiniBand and using Ethernet. If the alternate communicators are too slow, you can switch back to indirect InfiniBand by using '--mca btl vader,tcp,self --mca btl_tcp_if_include ib0'. That option forces IP over InfiniBand instead of direct InfiniBand. The OpenFabrics libraries used by InfiniBand have a known issue because of how they use registered memory (they generate seg faults whenever a program does system calls, i.e. MAKER calling BLAST), so you can't use direct InfiniBand with MAKER. So try this instead -> '--mca btl vader,tcp,self --mca btl_tcp_if_include ib0'
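
Put together with the original batch command, that would look something like the sketch below. It assumes the IP-over-InfiniBand interface is named ib0 on every node (worth checking with `ip addr`) and keeps the original -n 56; the variable names are only for illustration.

```shell
# Sketch: the original batch command with the IP-over-InfiniBand flags swapped in.
# Assumption: the IPoIB interface is called ib0 on all nodes.
NP=56
MAKER_MPI_CMD="mpiexec --mca btl vader,tcp,self --mca btl_tcp_if_include ib0 -n $NP maker"

# Print the command for review; paste it into the LSF batch script to launch.
echo "$MAKER_MPI_CMD"
```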

Also if it stays slow, it likely means you are hitting IO limits. If that is the case, make sure you are not setting TMP= to a network-mounted disk location, and that whatever temp space you use on your cluster is real, locally mounted, per-node disk and not network-mounted disk.
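
A quick way to check what kind of disk the temp directory sits on, as a sketch. It assumes GNU df (as on CentOS), and both helper names and the list of network filesystem types are illustrative rather than exhaustive:

```shell
# Hypothetical helper: succeed if the filesystem type is a common network filesystem.
is_network_fs() {
    case "$1" in
        nfs|nfs4|cifs|smb*|lustre|gpfs|beegfs|glusterfs|fuse.sshfs) return 0 ;;
        *) return 1 ;;
    esac
}

# Report the filesystem type backing a directory (second column of `df -PT`).
tmp_fs_type() {
    df -PT "$1" | awk 'NR==2 {print $2}'
}

dir="${TMPDIR:-/tmp}"
fstype=$(tmp_fs_type "$dir")
if is_network_fs "$fstype"; then
    echo "WARNING: $dir is on $fstype (network mounted); point TMP= at node-local disk"
else
    echo "$dir is on $fstype; looks node-local"
fi
```

Run it on a compute node rather than the login node, since the two can mount /tmp differently.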

--Carson


> On Sep 20, 2017, at 8:02 AM, James Cross (ITCS - Staff) <Jimmy.Cross at uea.ac.uk> wrote:
> 
> Hi Maker Developers,
>  
> We are trying to run Maker with OpenMPI on our HPC cluster across two nodes (each node containing 28 cores, so 56 cores in total). While Maker seems to be running correctly, it is going slower when split across two nodes (56 cores) than when run on a single node (28 cores). We are trying to reduce the time Maker takes to complete its run. Do you know of any reason why Maker might slow down when split across two nodes?
>  
> Our cluster OS is CentOS 6.7 and the HPC scheduler is LSF. We are running Open MPI on a Mellanox InfiniBand network.
>  
> The genome data we wish to annotate consists of 1948 scaffolds with an average length of 324890 bp (longest scaffold 6948830 bp).
>  
> The command we are using in batch mode is: mpiexec -mca btl ^openib -n 56 maker
>  
> Any help or advice you could give would be greatly appreciated.
>  
> Best Wishes
> Jimmy
> ----------------------------------------------------------------------
> Mr  James Cross
> HPC Systems Developer
> University of East Anglia
> Norwich Research Park
> ITCS
> Norwich, Norfolk
> NR4 7TJ
>  
> Information Services
>  

From carson.holt at genetics.utah.edu  Fri Sep 22 14:04:10 2017
From: carson.holt at genetics.utah.edu (Carson Hinton Holt)
Date: Fri, 22 Sep 2017 20:04:10 +0000
Subject: [maker-devel] MAKER
In-Reply-To: <f9b68e0588dd4ce6a65dcd63cb4f7b4e@JKI-EXPF03.braunschweig.bba.intern>
References: <80e44920c4d64d6e82d18e5535656a32@JKI-EXPF03.braunschweig.bba.intern>
	<D2EB4BE8.3069C%myandell@genetics.utah.edu>
	<3BE2DF2C-9443-4314-9524-F0109AAE42E8@genetics.utah.edu>
	<ddad51a1aab5452b816be741d2969d81@JKI-EXPF03.braunschweig.bba.intern>
	<9867D99C-E4AD-4673-BB0B-804A1551F9F3@genetics.utah.edu>
	<f9b68e0588dd4ce6a65dcd63cb4f7b4e@JKI-EXPF03.braunschweig.bba.intern>
Message-ID: <B11ADE34-41A4-4EA7-A0AC-D4DFD649991D@genetics.utah.edu>

MAKER won't produce est2genome results for est_gff. This is partially because est2genome results are only used for training gene predictors. So you are essentially just getting protein2genome results from your runs. Once you get a gene predictor trained you will see a difference, as it will use the intron/exon structure of alignments as hints to improve gene predictor performance.
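
A small sanity check for this situation, as a sketch only. Both function names are hypothetical (they are not MAKER tools), and the check assumes the usual "key=value #comment" control file layout:

```shell
# Hypothetical helper: print the value of a key from a MAKER control file,
# with the trailing comment and whitespace stripped.
ctl_value() {
    sed -n "s/^$1=\([^#]*\).*/\1/p" "$2" | tr -d ' \t' | head -n 1
}

# Warn when est2genome=1 is set but transcripts are only supplied via est_gff=.
check_est2genome() {
    ctl="$1"
    if [ "$(ctl_value est2genome "$ctl")" = "1" ] \
        && [ -z "$(ctl_value est "$ctl")" ] \
        && [ -n "$(ctl_value est_gff "$ctl")" ]; then
        echo "WARNING: est2genome=1 produces nothing for est_gff input; give a transcript fasta to est="
        return 1
    fi
    return 0
}

if [ -f maker_opts.ctl ]; then
    check_est2genome maker_opts.ctl || true
fi
```

If the warning fires, the transcripts would need to be supplied as an assembled fasta through est= for est2genome to have anything to work on.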

--Carson


> On Sep 21, 2017, at 1:57 AM, Keilwagen, Jens <jens.keilwagen at julius-kuehn.de> wrote:
> 
> Hi Carson,
> 
> I have tried the proposed options for a small example (yeast).
> 
> I had 
> - proteins (fasta) from another yeast and 
> - transcript annotation (gff) from cufflinks and StringTie
> 
> I'd like to compare the maker results for 
> - proteins and StringTie
> Vs.
> - proteins and cufflinks
> 
> I used the default options, except:
> genome=<genome fasta>
> 
> protein=<protein fasta>
> est_gff=<transcript gff>
> 
> est2genome=1
> protein2genome=1
> 
> (An example is attached.)
> 
> Then I ran maker:
> 
> maker -RM_off -c 24
> find . -type f -name '*.gff' -exec cat {} + | grep maker > filtered-maker-prediction.gff
> 
> (The run seems to be okay. There were no FAILED, ... in the log. Cf. attachment)
> 
> Each maker run was started in a separate subdirectory.
> However, I realized that both maker runs yielded almost the same result (just one minor edit). This made me curious. 
> As far as I understood the files, I received the (filtered?) exonerate predictions for the proteins (from the other yeast). Is this correct? Why did I not receive any predictions (purely) based on the RNA-seq data? Did I do something wrong?
> 
> I'm looking forward to your reply.
> 
> Best regards, Jens
> 
> 
>> -----Original Message-----
>> From: Carson Hinton Holt [mailto:carson.holt at genetics.utah.edu]
>> Sent: Tuesday, 19 September 2017 23:37
>> To: Keilwagen, Jens
>> Subject: Re: MAKER
>> 
>> MAKER cannot use the BAM directly, but you can use something like
>> stringtie or trinity to assemble a transcript fasta that can be given
>> to the est= option.
>> 
>> Ab initio gene prediction is only enabled if you specify an hmm or
>> species file to use.  If all you want is homology based annotation, you
>> can try the est2genome and protein2genome options. Note the final
>> models may be partial if the alignments do not cover the gene end to
>> end.
>> 
>> --Carson
>> 
>> 
>> 
>>> On Sep 18, 2017, at 4:02 AM, Keilwagen, Jens <jens.keilwagen at julius-kuehn.de> wrote:
>>> 
>>> Hi Carson,
>>> 
>>> thanks a lot for your last email that .
>>> 
>>> I was asked to do homology-based gene prediction using RNA-seq, and MAKER was proposed as one option.
>>> Hence I'd like to ask how to do that in the best possible way.
>>> I have mapped RNA-seq data (SAM/BAM) and a fasta of proteins from a related species. How can I integrate the RNA-seq data?
>>> 
>>> Is it possible to deactivate ab-initio gene prediction by Augustus or SNAP?
>>> 
>>> Thanks a lot in advance.
>>> 
>>> Best regards, Jens
>>> 
>>>> -----Original Message-----
>>>> From: Carson Holt [mailto:carson.holt at genetics.utah.edu]
>>>> Sent: Thursday, 18 February 2016 19:03
>>>> To: Keilwagen, Jens
>>>> Cc: Mark Yandell
>>>> Subject: Re: MAKER
>>>> 
>>>> GeMoMa sounds like an interesting tool.  If it produces GFF3, you
>>>> could give the GFF3 results to the pred_gff= option in MAKER (comma
>>>> separated lists accepted). The GFF3 file of predictions must be in
>>>> the same coordinate space as the assembly being annotated (genome=
>>>> option). Whatever you give to pred_gff will be treated as raw
>>>> predictions by MAKER and will only be accepted as a final model if
>>>> there are evidence alignments (protein/EST) that support the model,
>>>> and if there are multiple alternate models at the same locus, only
>>>> the model that is best supported by the protein/transcript evidence
>>>> is kept.
>>>>
>>>> You can also set the keep_preds=1 option when using pred_gff. This
>>>> will cause even raw predictions with no evidence support to be
>>>> maintained. In the event of multiple models with no evidence
>>>> support, the model best matching the consensus of alternate models
>>>> will be maintained.
>>>>
>>>> Alternatively you can use the model_gff= option (comma separated
>>>> list ok) to input the GFF3 file.  model_gff features are given
>>>> higher confidence than pred_gff. At least one model will always be
>>>> kept regardless of evidence support (same rules as pred_gff
>>>> selection for which model to keep when there are multiple). But
>>>> model_gff will also affect how evidence clusters are determined
>>>> compared to pred_gff (model_gff features are allowed to merge
>>>> bridging evidence clusters). MAKER will also go to extra lengths to
>>>> pull forward existing names and other data in the GFF3 for
>>>> model_gff features.
>>>>
>>>> If you do not have GFF3 files in the right coordinate space, but do
>>>> have protein fasta or transcript fasta for the GeMoMa predictions,
>>>> you can supply these to the protein= and transcript= options in
>>>> MAKER together with est2genome=1 or protein2genome=1. This will
>>>> cause MAKER to place the models using exonerate. You would probably
>>>> also need to add est_forward=1 to the control files to have MAKER
>>>> try and derive model names from the name of evidence alignments
>>>> they were derived from if you go this route.
>>>>
>>>> You can also try treating the GFF3 predictions as hints to
>>>> traditional ab initio gene finders like SNAP or Augustus by giving
>>>> them to the est_gff= or protein_gff= options (i.e. make GeMoMa
>>>> predictions inform the behavior of predictors like SNAP and
>>>> Augustus). Might be interesting. You would have to alter results to
>>>> be match/match_part GFF3 features to give them to the est_gff or
>>>> protein_gff options.
>>>>
>>>> Let me know if you have any more questions, and I'll do my best to
>>>> help.
>>>> 
>>>> Thanks,
>>>> Carson
>>>> 
>>>> 
>>>> 
>>>>> On Feb 18, 2016, at 10:22 AM, Mark Yandell <myandell at genetics.utah.edu> wrote:
>>>>> 
>>>>> 
>>>>> Mark Yandell
>>>>> Professor of Human Genetics
>>>>> H.A. & Edna Benning Presidential Endowed Chair Co-director USTAR
>>>>> Center for Genetic Discovery Eccles Institute of Human Genetics
>>>>> University of Utah
>>>>> 15 North 2030 East, Room 2100
>>>>> Salt Lake City, UT 84112-5330
>>>>> ph:801-587-7707
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On 2/18/16, 8:34 AM, "Keilwagen, Jens" <jens.keilwagen at jki.bund.de> wrote:
>>>>> 
>>>>>> Dear Prof. Yandell,
>>>>>> 
>>>>>> we have published a homology-based gene prediction program today:
>>>>>> https://nar.oxfordjournals.org/content/early/2016/02/17/nar.gkw092
>>>>>> and I'd like to ask how we can use MAKER to combine predictions of
>>>>>> GeMoMa using different reference organisms, i.e. we try to predict
>>>>>> the genes of a target organism (e.g. wheat) using the annotated
>>>>>> genes of other reference organisms (e.g. grasses). GeMoMa returns
>>>>>> for each reference organism a GFF with the predicted gene models
>>>>>> in the target organism.
>>>>>>
>>>>>> It would be great if you or someone from your team could give us
>>>>>> some hints or point us to the correct paragraph in the documentation.
>>>>>> 
>>>>>> Thanks a lot and best regards, Jens
>>>>>> 
>>>>>> ---
>>>>>> 
>>>>>> Dr. Jens Keilwagen
>>>>>> 
>>>>>> Julius Kühn-Institut (JKI) - Federal Research Centre for Cultivated Plants
>>>>>> 	Institute for Biosafety in Plant Biotechnology
>>>>>>
>>>>>> Erwin-Baur-Straße 27
>>>>>> 06484 Quedlinburg
>>>>>> Germany
>>>>>> 
>>>>>> Phone: ++49 (0)3946 47 510
>>>>>> EMail: jens.keilwagen at jki.bund.de
>>>>>> 
>>>>>> 
>>>>> 
>>> 
> 
> <maker_opts.ctl><slurm-278767.out>


From eennadi at gmail.com  Fri Sep 22 13:27:37 2017
From: eennadi at gmail.com (Emmanuel Nnadi)
Date: Fri, 22 Sep 2017 20:27:37 +0100
Subject: [maker-devel] Maker not installing
In-Reply-To: <CAOXM=CUPj9EGj_LNEhBpOw0oCiQKoVw8JuPv8-WASo2paSXkug@mail.gmail.com>
References: <CAOXM=CXp_YU=4kaNCFPVnoWKTzAP5we5XTFX6uFTWJSY_11fng@mail.gmail.com>
	<8576F599-08C0-455A-B0D7-8A1110C6F18F@mail.ufl.edu>
	<CAOXM=CWWENFHJUhYPawM5+F4571HYOyp3oWD1xju9H9LABq9VQ@mail.gmail.com>
	<113031DF-E733-4EEF-BDB7-405C15F0CA24@mail.ufl.edu>
	<CAOXM=CVrYeNV9i6TxYDn1B-dCA-OYh+VqsEmc-PsA2rP14UZ9A@mail.gmail.com>
	<546610AC-0B7B-4D02-BBF3-E847B95D7F0D@mail.ufl.edu>
	<C3F7BAD9-C134-4EC9-9792-F2C27A78FBFD@gmail.com>
	<CAOXM=CXeT-zmFubptPnWsAygnv8o+EQnCWD60iLf5eFFBkb6rg@mail.gmail.com>
	<8EAFE412-9EF7-4DB7-85A3-632BAC3372FD@gmail.com>
	<CAOXM=CVbJonkz=O5=N_NmB1L+524D6Uk9Fjh-ZBoJmu+9AsMFQ@mail.gmail.com>
	<C9D0377D-1C89-4910-A6BE-FA7DD3D8CDCB@gmail.com>
	<CAOXM=CX7yZTTbBALhjBzz0qtMxQGxBNVO0UEnkis3aDvVf=GLA@mail.gmail.com>
	<C06B6277-1CFA-4B8D-8E80-D08910EBD77C@gmail.com>
	<CAOXM=CXSBG+h1DYoOBrFHYr_JXZX6YCFyo415HRGHozMcmHicg@mail.gmail.com>
	<EC4577AD-A3C5-45CD-ADE7-5D19ED089833@gmail.com>
	<CAOXM=CVby_i3vfF_H46RTZWoqBLMCBGNqJoywxG_2KQJPc5Ttw@mail.gmail.com>
	<7440971C-8A18-4A07-91B6-AD16D17F0766@gmail.com>
	<CAOXM=CXUBcvbqZdT2nviYnFyC3X4Xbj3oREVt650n0c01HESyg@mail.gmail.com>
	<426E63C1-8C82-4809-B2A1-EF7A909E6712@gmail.com>
	<CAOXM=CUPj9EGj_LNEhBpOw0oCiQKoVw8JuPv8-WASo2paSXkug@mail.gmail.com>
Message-ID: <CAOXM=CUxE-MSWsjZwMYeZ-aOu9Jc9bP6k1uzDLrEyfyqwR-B_Q@mail.gmail.com>

Hello all,
Please how can I determine the following in maker:
1. The total number of chromosomes
2. The size of my genome


Thanks

Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:
https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Fri, Sep 1, 2017 at 10:52 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:

> Ok, thanks.
> Nnadi Nnaemeka Emmanuel
> Department of Microbiology,
> Faculty of Natural and Applied Science,
> Plateau State University, Bokkos, Plateau State, Nigeria.
> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
>
>
>
> On Sep 1, 2017 10:50 PM, "Carson Holt" <carsonhh at gmail.com> wrote:
>
>> It would need to be a new run. You won't be able to use the updated
>> contig names with the old run.
>>
>> --Carson
>>
>> Sent from my iPhone
>>
>> On Sep 1, 2017, at 3:41 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>>
>> Hi carson
>> Thanks for the tip
>> perl -ane 's/1_S7_R1_001_\(paired\)_trimmed_\(paired\)_//g; print' genome.fasta
>>
>> It worked well and removed 1_S7_R1_001_(paired)_trimmed_(paired)_.
>> However, I have already run MAKER with the old names containing
>> 1_S7_R1_001_(paired)_trimmed_(paired)_.
>>
>> 1. How can I apply the change when MAKER has already produced some files
>> from the old sequence names?
>>
>> I have spent more than 24 hours running MAKER and it has produced some
>> folders already.
>>
>> How can I make this change?
>>
>> Thanks
>>
>>
>>
>>
>> Nnadi Nnaemeka Emmanuel
>> Department of Microbiology,
>> Faculty of Natural and Applied Science,
>> Plateau State University, Bokkos, Plateau State, Nigeria.
>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
>>
>> On Fri, Sep 1, 2017 at 4:54 PM, Carson Holt <carsonhh at gmail.com> wrote:
>>
>>> BLAST, which is used by MAKER, cannot handle really long contig names.
>>> MAKER tries to get around this by adding a secondary tag to the fasta
>>> header when long names are detected. Even then it would be better to change
>>> the IDs of your contigs to avoid downstream failures.
>>>
>>> I would recommend removing '1_S7_R1_001_(paired)_trimmed_(paired)_'
>>> from each contig name.
>>>
>>> Example command to do that ->
>>> perl -ane 's/1_S7_R1_001_\(paired\)_trimmed_\(paired\)_//g; print' genome.fasta
>>>
>>> --Carson
>>>
>>>
>>> On Aug 30, 2017, at 3:54 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>>>
>>> Hi Carson,
>>> Thanks for your response, it's been helpful.
>>>
>>> Please bear with me as I work through this
>>>
>>> 1. Please how do I generate EST for my novel sequences?
>>> 2. I am currently running maker without EST and protein sequences is it
>>> wrong? Can it predict properly?
>>> 3. One error in the contig just returned this value
>>> FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence
>>> identifier which is too long ( max id length = 50 )
>>>  at /usr/local/bin/RepeatMasker line 1464.
>>> FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence
>>> identifier which is too long ( max id length = 50 )
>>>  at /usr/local/bin/RepeatMasker line 1464.
>>> FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence
>>> identifier which is too long ( max id length = 50 )
>>>  at /usr/local/bin/RepeatMasker line 1464.
>>> ERROR: RepeatMasker failed
>>> --> rank=NA, hostname=emmannaemekas-MacBook-Pro.local
>>> ERROR: Failed while doing repeat masking
>>> ERROR: Chunk failed at level:0, tier_type:1
>>> FAILED CONTIG:1_S7_R1_001_(paired)_trimmed_(paired)_contig_2
>>>
>>> ERROR: Chunk failed at level:2, tier_type:0
>>> FAILED CONTIG:1_S7_R1_001_(paired)_trimmed_(paired)_contig_2
>>>
>>> examining contents of the fasta file and run log
>>>
>>>
>>> Nnadi Nnaemeka Emmanuel
>>> Department of Microbiology,
>>> Faculty of Natural and Applied Science,
>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
>>>
>>> On Wed, Aug 30, 2017 at 4:12 PM, Carson Holt <carsonhh at gmail.com> wrote:
>>>
>>>> You can query valid species names using the queryTaxonomyDatabase.pl
>>>> script that comes with RepeatMasker. Try not to be too specific. In general
>>>> you should use the genus rather than the species, for example (or even use
>>>> all of RepBase).
>>>>
>>>> Example ->
>>>> perl .../RepeatMasker/util/queryTaxonomyDatabase.pl -species "drosophila"
>>>>
>>>> --Carson
>>>>
>>>>
>>>>
>>>> On Aug 30, 2017, at 9:05 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>>>>
>>>> Hi Carson,
>>>>
>>>>  Thanks
>>>> I was able to start using maker.
>>>>
>>>> However I am working with a plant Genome novel. I had set the
>>>> repeatmasking to
>>>> 1. Dcotrep a names from the repbase release but maker returned it back
>>>> as not known to repeat masker
>>>>
>>>> How can I use specific known genomes for repeat masking
>>>> Thanks
>>>>
>>>> Nnadi Nnaemeka Emmanuel
>>>> Department of Microbiology,
>>>> Faculty of Natural and Applied Science,
>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
>>>>
>>>>
>>>>
>>>> On Aug 29, 2017 4:26 PM, "Carson Holt" <carsonhh at gmail.com> wrote:
>>>>
>>>>> MAKER will read the genome= options from the maker_opts.ctl file in
>>>>> your current directory or the maker_opts.ctl you specified on the command
>>>>> line. The error means you have left the value empty. Perhaps you did not
>>>>> save the changes you made or you did not specify the location of
>>>>> the maker_opts.ctl file to use.
>>>>>
>>>>> You can check the contents of the file using cat. Example -->
>>>>> cat maker_opts.ctl
>>>>>
>>>>> --Carson
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Aug 29, 2017, at 5:11 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>>>>>
>>>>> Hi Carson,
>>>>> Thanks a lot for yesterday. I was able to resolve the issue of running
>>>>> maker and i followed the commands in the tutorial.
>>>>> I however encountered another problem
>>>>>
>>>>> when I ran the command nano -c maker_opts.ctl
>>>>>
>>>>> It gave the following *1_S7_assembly.fa I specified the name of the
>>>>> genome but when I ran maker in another tab it gave *
>>>>>
>>>>> #-----Genome (these are always required)
>>>>> genome=*1_S7_assembly.fa* #genome sequence (fasta file or fasta
>>>>> embeded in GFF3 file)
>>>>> organism_type=eukaryotic #eukaryotic or prokaryotic. Default is
>>>>> eukaryotic
>>>>>
>>>>> #-----Re-annotation Using MAKER Derived GFF3
>>>>> maker_gff= #MAKER derived GFF3 file
>>>>> est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
>>>>> altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 =
>>>>> no
>>>>> protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
>>>>> rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
>>>>> model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
>>>>> pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
>>>>> other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no
>>>>>
>>>>> #-----EST Evidence (for best results provide a file for at least one)
>>>>> est= #set of ESTs or assembled mRNA-seq in fasta format
>>>>> altest= #EST/cDNA sequence file in fasta format from an alternate
>>>>> organism
>>>>> est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
>>>>> altest_gff= #aligned ESTs from a closly relate species in GFF3 format
>>>>>
>>>>> #-----Protein Homology Evidence (for best results provide a file for
>>>>> at least one)
>>>>> protein=  #protein sequence file in fasta format (i.e. from mutiple
>>>>> oransisms)
>>>>> protein_gff=  #aligned protein homology evidence from an external GFF3
>>>>> file
>>>>>
>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>> model_org=all #select a model organism for RepBase masking in
>>>>> RepeatMasker
>>>>> rmlib= #provide an organism specific repeat library in fasta format
>>>>> for RepeatMasker
>>>>> repeat_protein=/Users/emmannaemeka/Desktop/Gpm/maker/data/te_proteins.fasta
>>>>> #provide a fasta file of transposable element proteins for RepeatRunner
>>>>> rm_gff= #pre-identified repeat elements from an external GFF3 file
>>>>> prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change
>>>>> this), 1 = yes, 0 = no
>>>>> softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e.
>>>>> seg and dust filtering)
>>>>>
>>>>>
>>>>> *I ran maker command on another tab and it returned the following*
>>>>> STATUS: Parsing control files...
>>>>> ERROR: You have failed to provide a value for 'genome' in the control
>>>>> files.
>>>>>
>>>>> --> rank=NA, hostname=emmannamekasMBP
>>>>>
>>>>>
>>>>> Questions
>>>>> 1. Specifying the genome location, do I need to run maker on the same
>>>>> tab or open another bash tab?
>>>>> 2. My genome is novel and do not have proteins, how do I generate
>>>>> protein fast for the de novo sequence and EST?
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>> Nnadi Nnaemeka Emmanuel
>>>>> Department of Microbiology,
>>>>> Faculty of Natural and Applied Science,
>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
>>>>>
>>>>> On Mon, Aug 28, 2017 at 6:47 PM, Carson Holt <carsonhh at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Here is a class on how to use MAKER taught a couple of years back -->
>>>>>> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014
>>>>>>
>>>>>> There is also a linked video as well as an amazon image of the class
>>>>>> material where you can run the image in the cloud and follow along.
>>>>>>
>>>>>> Thanks,
>>>>>> Carson
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Aug 28, 2017, at 11:43 AM, Emmanuel Nnadi <eennadi at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hi Carson,
>>>>>> Thanks a lot
>>>>>>
>>>>>> I ran this command maker -h it returned the following
>>>>>>
>>>>>> The last thing I wish to ask you, how can I load my genome fine and
>>>>>> being annotation?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> emmannamekasMBP:maker emmannaemeka$ maker -h
>>>>>>
>>>>>> MAKER version 2.31.9
>>>>>>
>>>>>> Usage:
>>>>>>
>>>>>>      maker [options] <maker_opts> <maker_bopts> <maker_exe>
>>>>>>
>>>>>>
>>>>>> Description:
>>>>>>
>>>>>>      MAKER is a program that produces gene annotations in GFF3 format
>>>>>> using
>>>>>>      evidence such as EST alignments and protein homology. MAKER can
>>>>>> be used to
>>>>>>      produce gene annotations for new genomes as well as update
>>>>>> annotations
>>>>>>      from existing genome databases.
>>>>>>
>>>>>>      The three input arguments are control files that specify how
>>>>>> MAKER should
>>>>>>      behave. All options for MAKER should be set in the control
>>>>>> files, but a
>>>>>>      few can also be set on the command line. Command line options
>>>>>> provide a
>>>>>>      convenient machanism to override commonly altered control file
>>>>>> values.
>>>>>>      MAKER will automatically search for the control files in the
>>>>>> current
>>>>>>      working directory if they are not specified on the command line.
>>>>>>
>>>>>>      Input files listed in the control options files must be in fasta
>>>>>> format
>>>>>>      unless otherwise specified. Please see MAKER documentation to
>>>>>> learn more
>>>>>>      about control file  configuration.  MAKER will automatically try
>>>>>> and
>>>>>>      locate the user control files in the current working directory
>>>>>> if these
>>>>>>      arguments are not supplied when initializing MAKER.
>>>>>>
>>>>>>      It is important to note that MAKER does not try and recalculated
>>>>>> data that
>>>>>>      it has already calculated.  For example, if you run an analysis
>>>>>> twice on
>>>>>>      the same dataset you will notice that MAKER does not rerun any
>>>>>> of the
>>>>>>      BLAST analyses, but instead uses the blast analyses stored from
>>>>>> the
>>>>>>      previous run. To force MAKER to rerun all analyses, use the -f
>>>>>> flag.
>>>>>>
>>>>>>      MAKER also supports parallelization via MPI on computer
>>>>>> clusters. Just
>>>>>>      launch MAKER via mpiexec (i.e. mpiexec -n 40 maker). MPI support
>>>>>> must be
>>>>>>      configured during the MAKER installation process for this to
>>>>>> work though
>>>>>>
>>>>>>
>>>>>> Options:
>>>>>>
>>>>>>      -genome|g <file>    Overrides the genome file path in the
>>>>>> control files
>>>>>>
>>>>>>      -RM_off|R           Turns all repeat masking options off.
>>>>>>
>>>>>>      -datastore/         Forcably turn on/off MAKER's two deep
>>>>>> directory
>>>>>>       nodatastore        structure for output.  Always on by default.
>>>>>>
>>>>>>      -old_struct         Use the old directory styles (MAKER 2.26 and
>>>>>> lower)
>>>>>>
>>>>>>      -base    <string>   Set the base name MAKER uses to save output
>>>>>> files.
>>>>>>                          MAKER uses the input genome file name by
>>>>>> default.
>>>>>>
>>>>>>      -tries|t <integer>  Run contigs up to the specified number of
>>>>>> tries.
>>>>>>
>>>>>>      -cpus|c  <integer>  Tells how many cpus to use for BLAST
>>>>>> analysis.
>>>>>>                          Note: this is for BLAST and not for MPI!
>>>>>>
>>>>>>      -force|f            Forces MAKER to delete old files before
>>>>>> running again.
>>>>>> This will require all blast analyses to be rerun.
>>>>>>
>>>>>>      -again|a            recaculate all annotations and output files
>>>>>> even if no
>>>>>> settings have changed. Does not delete old analyses.
>>>>>>
>>>>>>      -quiet|q            Regular quiet. Only a handlful of status
>>>>>> messages.
>>>>>>
>>>>>>      -qq                 Even more quiet. There are no status
>>>>>> messages.
>>>>>>
>>>>>>      -dsindex            Quickly generate datastore index file. Note
>>>>>> that this
>>>>>>                          will not check if run settings have changed
>>>>>> on contigs
>>>>>>
>>>>>>      -nolock             Turn off file locks. May be usful on some
>>>>>> file systems,
>>>>>>                          but can cause race conditions if running in
>>>>>> parallel.
>>>>>>
>>>>>>      -TMP                Specify temporary directory to use.
>>>>>>
>>>>>>      -CTL                Generate empty control files in the current
>>>>>> directory.
>>>>>>
>>>>>>      -OPTS               Generates just the maker_opts.ctl file.
>>>>>>
>>>>>>      -BOPTS              Generates just the maker_bopts.ctl file.
>>>>>>
>>>>>>      -EXE                Generates just the maker_exe.ctl file.
>>>>>>
>>>>>>      -MWAS    <option>   Easy way to control mwas_server for
>>>>>> web-based GUI
>>>>>>
>>>>>>                               options:  STOP
>>>>>>                                         START
>>>>>>                                         RESTART
>>>>>>
>>>>>>      -version            Prints the MAKER version.
>>>>>>
>>>>>>      -help|?             Prints this usage statement.
>>>>>>
>>>>>>
>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>> Department of Microbiology,
>>>>>> Faculty of Natural and Applied Science,
>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
>>>>>>
>>>>>> On Mon, Aug 28, 2017 at 6:36 PM, Carson Holt <carsonhh at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Path needs to be a list of directories to search (you specified an
>>>>>>> executable location).
>>>>>>>
>>>>>>> So not this --> /Users/emmannaemeka/Desktop/Gpm/maker/bin/maker
>>>>>>>
>>>>>>> Instead it needs to be this --> /Users/emmannaemeka/Desktop/Gpm/maker/bin
>>>>>>>
>>>>>>> --Carson
>>>>>>>
>>>>>>>
>>>>>>> On Aug 28, 2017, at 11:32 AM, Emmanuel Nnadi <eennadi at gmail.com>
>>>>>>>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> I tried to export PATH
>>>>>>>
>>>>>>> running
>>>>>>> echo $PATH in the maker directory this returned
>>>>>>>
>>>>>>> /usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:/Users/emmannaemeka/Desktop/Gpm/maker/bin/maker
>>>>>>>
>>>>>>> 1. Does it mean that PATH has been exported?
>>>>>>>
>>>>>>>
>>>>>>> secondly,
>>>>>>>
>>>>>>> I tried to run
>>>>>>> the command maker -h, which maker, maker -CTL
>>>>>>>
>>>>>>> nothing returned.
>>>>>>>
>>>>>>> 2. how do i start up maker?
>>>>>>> 3. Do I need to be in maker directory to start maker?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>>> Department of Microbiology,
>>>>>>> Faculty of Natural and Applied Science,
>>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
>>>>>>>
>>>>>>> On Mon, Aug 28, 2017 at 4:49 PM, Carson Holt <carsonhh at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> After install the executables will be in the .../maker/bin directory.
>>>>>>>> Example (if you did the install in your home directory) --> ~/maker/bin/maker
>>>>>>>>
>>>>>>>> You need to add the .../maker/bin directory to your PATH for it to be
>>>>>>>> found just by typing 'maker'
>>>>>>>>
>>>>>>>> Explanation of the Linux PATH --> http://www.linfo.org/path_env_var.html
>>>>>>>>
>>>>>>>> --Carson
>>>>>>>>
>>>>>>>>
>>>>>>>> On Aug 28, 2017, at 8:07 AM, Ence,daniel <d.ence at ufl.edu> wrote:
>>>>>>>>
>>>>>>>> Sorry, I should have typed "maker -CTL". If that doesn't work, what
>>>>>>>> is the result of "which maker"?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Aug 28, 2017, at 10:00 AM, Emmanuel Nnadi <eennadi at gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Daniel
>>>>>>>> The reply is
>>>>>>>> emmannamekasMBP:maker emmannaemeka$ MAKER -ctl
>>>>>>>> -bash: MAKER: command not found
>>>>>>>>
>>>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>>>> Department of Microbiology,
>>>>>>>> Faculty of Natural and Applied Science,
>>>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
>>>>>>>>
>>>>>>>> On Mon, Aug 28, 2017 at 2:57 PM, Ence,daniel <d.ence at ufl.edu>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi, It looks like MAKER installed ok. What is the command that you
>>>>>>>>> used to try to run MAKER? Can you show the result of running "MAKER -ctl"?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Daniel Ence
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Aug 28, 2017, at 9:24 AM, Emmanuel Nnadi <eennadi at gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Ence,
>>>>>>>>> Thanks for your reply,
>>>>>>>>>
>>>>>>>>> This is the step and error received
>>>>>>>>>
>>>>>>>>> emmannamekasMBP:src emmannaemeka$ ./build install
>>>>>>>>> Installing MAKER...
>>>>>>>>> Building MAKER
>>>>>>>>> Skip /Users/emmannaemeka/desktop/Gpm/maker/src/../perl/config-darwin-thread-multi-2level-5.018002 (unchanged)
>>>>>>>>>
>>>>>>>>> The build status is
>>>>>>>>> ==============================================================================
>>>>>>>>> STATUS MAKER v2.31.9
>>>>>>>>> ==============================================================================
>>>>>>>>> PERL Dependencies:  VERIFIED
>>>>>>>>> External Programs:  VERIFIED
>>>>>>>>> External C Libraries:   VERIFIED
>>>>>>>>> MPI SUPPORT:        DISABLED
>>>>>>>>> MWAS Web Interface: DISABLED
>>>>>>>>> MAKER PACKAGE:      CONFIGURATION OK
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>>>>> Department of Microbiology,
>>>>>>>>> Faculty of Natural and Applied Science,
>>>>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>>>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
>>>>>>>>>
>>>>>>>>> On Mon, Aug 28, 2017 at 2:00 PM, Ence,daniel <d.ence at ufl.edu>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Emmanuel, In order for anyone to help you, you need to post to
>>>>>>>>>> the mailing list the command and output (including errors) of the step that
>>>>>>>>>> didn't work.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Daniel Ence
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Aug 27, 2017, at 10:16 AM, Emmanuel Nnadi <eennadi at gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hello all,
>>>>>>>>>>
>>>>>>>>>> I downloaded Maker and tried to install it. I succeeded in
>>>>>>>>>> installing all prerequisites however running maker ./build install, it
>>>>>>>>>> showed that maker installed.
>>>>>>>>>>
>>>>>>>>>> However trying to run maker it wouldn't run.
>>>>>>>>>>
>>>>>>>>>> Please how do I install maker to run on local computer?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>>>>>> Department of Microbiology,
>>>>>>>>>> Faculty of Natural and Applied Science,
>>>>>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>>>>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> maker-devel mailing list
>>>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> maker-devel mailing list
>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>

From carsonhh at gmail.com  Fri Sep 22 14:06:06 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 22 Sep 2017 14:06:06 -0600
Subject: [maker-devel] Maker not installing
In-Reply-To: <CAOXM=CUxE-MSWsjZwMYeZ-aOu9Jc9bP6k1uzDLrEyfyqwR-B_Q@mail.gmail.com>
References: <CAOXM=CXp_YU=4kaNCFPVnoWKTzAP5we5XTFX6uFTWJSY_11fng@mail.gmail.com>
	<8576F599-08C0-455A-B0D7-8A1110C6F18F@mail.ufl.edu>
	<CAOXM=CWWENFHJUhYPawM5+F4571HYOyp3oWD1xju9H9LABq9VQ@mail.gmail.com>
	<113031DF-E733-4EEF-BDB7-405C15F0CA24@mail.ufl.edu>
	<CAOXM=CVrYeNV9i6TxYDn1B-dCA-OYh+VqsEmc-PsA2rP14UZ9A@mail.gmail.com>
	<546610AC-0B7B-4D02-BBF3-E847B95D7F0D@mail.ufl.edu>
	<C3F7BAD9-C134-4EC9-9792-F2C27A78FBFD@gmail.com>
	<CAOXM=CXeT-zmFubptPnWsAygnv8o+EQnCWD60iLf5eFFBkb6rg@mail.gmail.com>
	<8EAFE412-9EF7-4DB7-85A3-632BAC3372FD@gmail.com>
	<CAOXM=CVbJonkz=O5=N_NmB1L+524D6Uk9Fjh-ZBoJmu+9AsMFQ@mail.gmail.com>
	<C9D0377D-1C89-4910-A6BE-FA7DD3D8CDCB@gmail.com>
	<CAOXM=CX7yZTTbBALhjBzz0qtMxQGxBNVO0UEnkis3aDvVf=GLA@mail.gmail.com>
	<C06B6277-1CFA-4B8D-8E80-D08910EBD77C@gmail.com>
	<CAOXM=CXSBG+h1DYoOBrFHYr_JXZX6YCFyo415HRGHozMcmHicg@mail.gmail.com>
	<EC4577AD-A3C5-45CD-ADE7-5D19ED089833@gmail.com>
	<CAOXM=CVby_i3vfF_H46RTZWoqBLMCBGNqJoywxG_2KQJPc5Ttw@mail.gmail.com>
	<7440971C-8A18-4A07-91B6-AD16D17F0766@gmail.com>
	<CAOXM=CXUBcvbqZdT2nviYnFyC3X4Xbj3oREVt650n0c01HESyg@mail.gmail.com>
	<426E63C1-8C82-4809-B2A1-EF7A909E6712@gmail.com>
	<CAOXM=CUPj9EGj_LNEhBpOw0oCiQKoVw8JuPv8-WASo2paSXkug@mail.gmail.com>
	<CAOXM=CUxE-MSWsjZwMYeZ-aOu9Jc9bP6k1uzDLrEyfyqwR-B_Q@mail.gmail.com>
Message-ID: <78A8137E-42B4-4766-9E4A-1B3C2F4FC578@gmail.com>

MAKER can't give you those details. All MAKER does is try to identify gene models in the assembly you provide.
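
Both figures describe the assembly file itself, so they can be read straight from the FASTA. The following is only a minimal sketch, assuming a plain (uncompressed) FASTA file; the function name `fasta_stats` and the demo sequences are made up for illustration and are not part of MAKER. Note that the sequence count is the number of contigs or scaffolds in the file, which matches the chromosome count only for a chromosome-level assembly.

```python
import os
import tempfile


def fasta_stats(path):
    """Return (number_of_sequences, total_bases) for a plain FASTA file."""
    n_seqs = 0
    total_bases = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                n_seqs += 1  # every header line starts a new sequence
            else:
                total_bases += len(line)  # sequence lines add to the size
    return n_seqs, total_bases


if __name__ == "__main__":
    # Hypothetical two-contig assembly, written to a temporary file for the demo.
    with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
        tmp.write(">contig_1\nACGTACGT\nACGT\n>contig_2\nGGCC\n")
    print(fasta_stats(tmp.name))
    os.remove(tmp.name)
```

The total counts every character on the sequence lines, including any N gap characters, so it reports assembly length rather than an independent estimate of genome size.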

--Carson

> On Sep 22, 2017, at 1:27 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
> 
> Hello all,
> Please how can I determine the following in maker:
> 1. The total number of chromosomes
> 2. The size of my genome
> 
> 
> Thanks
> 
> Nnadi Nnaemeka Emmanuel
> Department of Microbiology,
> Faculty of Natural and Applied Science,
> Plateau State University, Bokkos, Plateau State, Nigeria.
> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
> On Fri, Sep 1, 2017 at 10:52 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
> Ok, thanks. 
> Nnadi Nnaemeka Emmanuel
> Department of Microbiology,
> Faculty of Natural and Applied Science,
> Plateau State University, Bokkos, Plateau State, Nigeria.
> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
> 
>    
> 
> On Sep 1, 2017 10:50 PM, "Carson Holt" <carsonhh at gmail.com> wrote:
> It would need to be a new run. You won't be able to use the updated contig names with the old run. 
> 
> --Carson
> 
> Sent from my iPhone
> 
> On Sep 1, 2017, at 3:41 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
> 
>> Hi carson
>> Thanks for the tip
>> perl -ane 's/1_S7_R1_001_\(paired\)_trimmed_\(paired\)_//g; print' genome.fasta
>> 
>> It worked well; however, when I ran it, it removed 1_S7_R1_001_\(paired\)_trimmed_\(paired\)_.
>> 
>> I have already run maker with 1_S7_R1_001_\(paired\)_trimmed_\(paired\)_ in the contig names.
>> 
>> 1. How can I effect the change when maker has produced some files from the old sequence?
>> 
>> I have spent more than 24 hours running maker and it has produced some folders already.
>> 
>> How can I make this change?
>> 
>> Thanks
>> 
>> 
>> 
>> 
>> Nnadi Nnaemeka Emmanuel
>> Department of Microbiology,
>> Faculty of Natural and Applied Science,
>> Plateau State University, Bokkos, Plateau State, Nigeria.
>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
>> On Fri, Sep 1, 2017 at 4:54 PM, Carson Holt <carsonhh at gmail.com> wrote:
>> BLAST, which MAKER uses, cannot handle very long contig names. MAKER tries to get around this by adding a secondary tag to the fasta header when long names are detected. Even so, it would be better to change the IDs of your contigs to avoid downstream failures.
>> 
>> I would recommend removing '1_S7_R1_001_(paired)_trimmed_(paired)_' from each contig name.
>> 
>> Example command to do that -->
>> perl -ane 's/1_S7_R1_001_\(paired\)_trimmed_\(paired\)_//g; print' genome.fasta
>> 
>> --Carson
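
The same renaming can be sketched in Python, writing the result to a new file instead of to the screen and reporting any ID that is still longer than the 50-character maximum quoted in the RepeatMasker error. This is only an illustration under stated assumptions: the function name `strip_id_prefix` and the output file name are hypothetical, and unlike the perl one-liner it edits header lines only.

```python
import os
import tempfile


def strip_id_prefix(in_path, out_path, prefix, max_len=50):
    """Copy a FASTA file, removing `prefix` from header lines only.

    Returns the sequence IDs that are still longer than `max_len`.
    """
    too_long = []
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                header = line[1:].rstrip("\n").replace(prefix, "", 1)
                fields = header.split()
                seq_id = fields[0] if fields else ""  # ID is the first word
                if len(seq_id) > max_len:
                    too_long.append(seq_id)
                line = ">" + header + "\n"
            dst.write(line)  # sequence lines are copied unchanged
    return too_long


if __name__ == "__main__":
    # Demo on a throwaway copy; the contig name is taken from the error above.
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "genome.fasta")
    dst = os.path.join(workdir, "genome.renamed.fasta")
    with open(src, "w") as fh:
        fh.write(">1_S7_R1_001_(paired)_trimmed_(paired)_contig_2\nACGT\n")
    print(strip_id_prefix(src, dst, "1_S7_R1_001_(paired)_trimmed_(paired)_"))
    with open(dst) as fh:
        print(fh.read())
```

Writing to a separate file keeps the original assembly untouched, so the renamed copy can be given to a fresh run.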
>> 
>> 
>>> On Aug 30, 2017, at 3:54 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>>> 
>>> Hi Carson
>>> Thanks for your response its been helpful
>>> 
>>> Please bear with me as I work through this
>>> 
>>> 1. Please how do I generate EST for my novel sequences?
>>> 2. I am currently running maker without EST and protein sequences is it wrong? Can it predict properly?
>>> 3. One error in the contig just returned this value
>>> FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 )
>>>  at /usr/local/bin/RepeatMasker line 1464.
>>> FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 )
>>>  at /usr/local/bin/RepeatMasker line 1464.
>>> FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 )
>>>  at /usr/local/bin/RepeatMasker line 1464.
>>> ERROR: RepeatMasker failed
>>> --> rank=NA, hostname=emmannaemekas-MacBook-Pro.local
>>> ERROR: Failed while doing repeat masking
>>> ERROR: Chunk failed at level:0, tier_type:1
>>> FAILED CONTIG:1_S7_R1_001_(paired)_trimmed_(paired)_contig_2
>>> 
>>> ERROR: Chunk failed at level:2, tier_type:0
>>> FAILED CONTIG:1_S7_R1_001_(paired)_trimmed_(paired)_contig_2
>>> 
>>> examining contents of the fasta file and run log
>>> 
>>> 
>>> Nnadi Nnaemeka Emmanuel
>>> Department of Microbiology,
>>> Faculty of Natural and Applied Science,
>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
>>> On Wed, Aug 30, 2017 at 4:12 PM, Carson Holt <carsonhh at gmail.com> wrote:
>>> You can query valid species names using the queryTaxonomyDatabase.pl script that comes with RepeatMasker. Try not to be too specific. In general you should use the genus rather than the species for example (or even use all of RepBase).
>>> 
>>> Example -->
>>> perl .../RepeatMasker/util/queryTaxonomyDatabase.pl -species "drosophila"
>>> 
>>> --Carson
>>> 
>>> 
>>> 
>>>> On Aug 30, 2017, at 9:05 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>>>> 
>>>> Hi Carson,
>>>> 
>>>>  Thanks
>>>> I was able to start using maker.
>>>> 
>>>> However I am working with a plant Genome novel. I had set the repeatmasking to 
>>>> 1. Dcotrep a names from the repbase release but maker returned it back as not known to repeat masker
>>>> 
>>>> How can I use specific known genomes for repeat masking
>>>> Thanks
>>>> 
>>>> Nnadi Nnaemeka Emmanuel
>>>> Department of Microbiology,
>>>> Faculty of Natural and Applied Science,
>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
>>>> 
>>>>    
>>>> 
>>>> On Aug 29, 2017 4:26 PM, "Carson Holt" <carsonhh at gmail.com> wrote:
>>>> MAKER will read the genome= options from the maker_opts.ctl file in your current directory or the maker_opts.ctl you specified on the command line. The error means you have left the value empty. Perhaps you did not save the changes you made or you did not specify the location of the maker_opts.ctl file to use.
>>>> 
>>>> You can check the contents of the file using cat. Example --> cat maker_opts.ctl
>>>> 
>>>> --Carson
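
Beyond reading the file with cat, the value MAKER would see for a given key can be pulled out with a few lines of Python. This is a minimal sketch, assuming the `key=value #comment` layout visible in the quoted maker_opts.ctl; the function name `read_ctl_value` is hypothetical and is not part of MAKER.

```python
import os
import tempfile


def read_ctl_value(path, key):
    """Return the value set for `key` in a control file, or None if absent."""
    with open(path) as handle:
        for line in handle:
            setting = line.split("#", 1)[0].strip()  # drop the trailing comment
            if "=" not in setting:
                continue  # section headers and blank lines
            name, value = setting.split("=", 1)
            if name.strip() == key:
                return value.strip()
    return None


if __name__ == "__main__":
    # Demo on a two-line excerpt shaped like the control file quoted below.
    with tempfile.NamedTemporaryFile("w", suffix=".ctl", delete=False) as tmp:
        tmp.write("#-----Genome (these are always required)\n")
        tmp.write("genome=1_S7_assembly.fa #genome sequence\n")
    print(repr(read_ctl_value(tmp.name, "genome")))
    os.remove(tmp.name)
```

An empty string returned for `genome` corresponds to the situation the error message describes: the key is present in the file that was read, but no value was saved for it.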
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On Aug 29, 2017, at 5:11 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>>>>> 
>>>>> Hi Carson,
>>>>> Thanks a lot for yesterday. I was able to resolve the issue of running maker and i followed the commands in the tutorial.
>>>>> I however encountered another problem
>>>>> 
>>>>> when I ran the command nano -c maker_opts.ctl
>>>>> 
>>>>> It gave the following 1_S7_assembly.fa I specified the name of the genome but when I ran maker in another tab it gave 
>>>>> 
>>>>> #-----Genome (these are always required)
>>>>> genome=1_S7_assembly.fa #genome sequence (fasta file or fasta embeded in GFF3 file)
>>>>> organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic
>>>>> 
>>>>> #-----Re-annotation Using MAKER Derived GFF3
>>>>> maker_gff= #MAKER derived GFF3 file
>>>>> est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
>>>>> altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
>>>>> protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
>>>>> rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
>>>>> model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
>>>>> pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
>>>>> other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no
>>>>> 
>>>>> #-----EST Evidence (for best results provide a file for at least one)
>>>>> est= #set of ESTs or assembled mRNA-seq in fasta format
>>>>> altest= #EST/cDNA sequence file in fasta format from an alternate organism
>>>>> est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
>>>>> altest_gff= #aligned ESTs from a closly relate species in GFF3 format
>>>>> 
>>>>> #-----Protein Homology Evidence (for best results provide a file for at least one)
>>>>> protein=  #protein sequence file in fasta format (i.e. from mutiple oransisms)
>>>>> protein_gff=  #aligned protein homology evidence from an external GFF3 file
>>>>> 
>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>> model_org=all #select a model organism for RepBase masking in RepeatMasker
>>>>> rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
>>>>> repeat_protein=/Users/emmannaemeka/Desktop/Gpm/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner
>>>>> rm_gff= #pre-identified repeat elements from an external GFF3 file
>>>>> prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
>>>>> softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
>>>>> 
>>>>> 
>>>>> I ran maker command on another tab and it returned the following
>>>>> STATUS: Parsing control files...
>>>>> ERROR: You have failed to provide a value for 'genome' in the control files.
>>>>> 
>>>>> --> rank=NA, hostname=emmannamekasMBP
>>>>> 
>>>>> 
>>>>> Questions
>>>>> 1. Specifying the genome location, do I need to run maker on the same tab or open another bash tab?
>>>>> 2. My genome is novel and do not have proteins, how do I generate protein fast for the de novo sequence and EST?
>>>>> 
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> Nnadi Nnaemeka Emmanuel
>>>>> Department of Microbiology,
>>>>> Faculty of Natural and Applied Science,
>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
>>>>> On Mon, Aug 28, 2017 at 6:47 PM, Carson Holt <carsonhh at gmail.com> wrote:
>>>>> Here is a class on how to use MAKER taught a couple of years back --> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014
>>>>> 
>>>>> There is also a linked video as well as an amazon image of the class material where you can run the image in the cloud and follow along.
>>>>> 
>>>>> Thanks,
>>>>> Carson
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Aug 28, 2017, at 11:43 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>>>>>> 
>>>>>> Hi Carson,
>>>>>> Thanks a lot 
>>>>>> 
>>>>>> I ran this command maker -h it returned the following
>>>>>> 
>>>>>> The last thing I wish to ask you, how can I load my genome fine and being annotation?
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>> emmannamekasMBP:maker emmannaemeka$ maker -h
>>>>>> 
>>>>>> MAKER version 2.31.9
>>>>>> 
>>>>>> Usage:
>>>>>> 
>>>>>>      maker [options] <maker_opts> <maker_bopts> <maker_exe>
>>>>>> 
>>>>>> 
>>>>>> Description:
>>>>>> 
>>>>>>      MAKER is a program that produces gene annotations in GFF3 format using
>>>>>>      evidence such as EST alignments and protein homology. MAKER can be used to
>>>>>>      produce gene annotations for new genomes as well as update annotations
>>>>>>      from existing genome databases.
>>>>>> 
>>>>>>      The three input arguments are control files that specify how MAKER should
>>>>>>      behave. All options for MAKER should be set in the control files, but a
>>>>>>      few can also be set on the command line. Command line options provide a
>>>>>>      convenient machanism to override commonly altered control file values.
>>>>>>      MAKER will automatically search for the control files in the current
>>>>>>      working directory if they are not specified on the command line.
>>>>>> 
>>>>>>      Input files listed in the control options files must be in fasta format
>>>>>>      unless otherwise specified. Please see MAKER documentation to learn more
>>>>>>      about control file  configuration.  MAKER will automatically try and
>>>>>>      locate the user control files in the current working directory if these
>>>>>>      arguments are not supplied when initializing MAKER.
>>>>>> 
>>>>>>      It is important to note that MAKER does not try and recalculated data that
>>>>>>      it has already calculated.  For example, if you run an analysis twice on
>>>>>>      the same dataset you will notice that MAKER does not rerun any of the
>>>>>>      BLAST analyses, but instead uses the blast analyses stored from the
>>>>>>      previous run. To force MAKER to rerun all analyses, use the -f flag.
>>>>>> 
>>>>>>      MAKER also supports parallelization via MPI on computer clusters. Just
>>>>>>      launch MAKER via mpiexec (i.e. mpiexec -n 40 maker). MPI support must be
>>>>>>      configured during the MAKER installation process for this to work though
>>>>>>      
>>>>>> 
>>>>>> Options:
>>>>>> 
>>>>>>      -genome|g <file>    Overrides the genome file path in the control files
>>>>>> 
>>>>>>      -RM_off|R           Turns all repeat masking options off.
>>>>>> 
>>>>>>      -datastore/         Forcibly turn on/off MAKER's two deep directory
>>>>>>       nodatastore        structure for output.  Always on by default.
>>>>>> 
>>>>>>      -old_struct         Use the old directory styles (MAKER 2.26 and lower)
>>>>>> 
>>>>>>      -base    <string>   Set the base name MAKER uses to save output files.
>>>>>>                          MAKER uses the input genome file name by default.
>>>>>> 
>>>>>>      -tries|t <integer>  Run contigs up to the specified number of tries.
>>>>>> 
>>>>>>      -cpus|c  <integer>  Tells how many cpus to use for BLAST analysis.
>>>>>>                          Note: this is for BLAST and not for MPI!
>>>>>> 
>>>>>>      -force|f            Forces MAKER to delete old files before running again.
>>>>>> 			 This will require all blast analyses to be rerun.
>>>>>> 
>>>>>>      -again|a            Recalculate all annotations and output files even if no
>>>>>> 			 settings have changed. Does not delete old analyses.
>>>>>> 
>>>>>>      -quiet|q            Regular quiet. Only a handful of status messages.
>>>>>> 
>>>>>>      -qq                 Even more quiet. There are no status messages.
>>>>>> 
>>>>>>      -dsindex            Quickly generate datastore index file. Note that this
>>>>>>                          will not check if run settings have changed on contigs
>>>>>> 
>>>>>>      -nolock             Turn off file locks. May be useful on some file systems,
>>>>>>                          but can cause race conditions if running in parallel.
>>>>>> 
>>>>>>      -TMP                Specify temporary directory to use.
>>>>>> 
>>>>>>      -CTL                Generate empty control files in the current directory.
>>>>>> 
>>>>>>      -OPTS               Generates just the maker_opts.ctl file.
>>>>>> 
>>>>>>      -BOPTS              Generates just the maker_bopts.ctl file.
>>>>>> 
>>>>>>      -EXE                Generates just the maker_exe.ctl file.
>>>>>> 
>>>>>>      -MWAS    <option>   Easy way to control mwas_server for web-based GUI
>>>>>> 
>>>>>>                               options:  STOP
>>>>>>                                         START
>>>>>>                                         RESTART
>>>>>> 
>>>>>>      -version            Prints the MAKER version.
>>>>>> 
>>>>>>      -help|?             Prints this usage statement.
>>>>>> 
>>>>>> 
>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>> Department of Microbiology,
>>>>>> Faculty of Natural and Applied Science,
>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
>>>>>> On Mon, Aug 28, 2017 at 6:36 PM, Carson Holt <carsonhh at gmail.com> wrote:
>>>>>> PATH needs to be a list of directories to search (you specified an executable location).
>>>>>> 
>>>>>> So not this --> /Users/emmannaemeka/Desktop/Gpm/maker/bin/maker
>>>>>> 
>>>>>> Instead it needs to be this --> /Users/emmannaemeka/Desktop/Gpm/maker/bin
>>>>>> 
>>>>>> --Carson
>>>>>> 
>>>>>> 
>>>>>>> On Aug 28, 2017, at 11:32 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>>>>>>> 
>>>>>>> Thanks 
>>>>>>> 
>>>>>>> I tried to export PATH.
>>>>>>> 
>>>>>>> Running echo $PATH in the maker directory returned:
>>>>>>> 
>>>>>>> /usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:/Users/emmannaemeka/Desktop/Gpm/maker/bin/maker
>>>>>>> 
>>>>>>> 1. Does this mean that PATH has been exported?
>>>>>>> 
>>>>>>> Secondly, I tried to run the commands maker -h, which maker, and maker -CTL.
>>>>>>> 
>>>>>>> Nothing was returned.
>>>>>>> 
>>>>>>> 2. How do I start up maker?
>>>>>>> 3. Do I need to be in the maker directory to start maker?
>>>>>>> 
>>>>>>> Thanks
>>>>>>> 
>>>>>>> On Mon, Aug 28, 2017 at 4:49 PM, Carson Holt <carsonhh at gmail.com> wrote:
>>>>>>> After install the executables will be in the .../maker/bin directory. Example (if you did the install in your home directory) --> ~/maker/bin/maker
>>>>>>> 
>>>>>>> You need to add the .../maker/bin directory to your PATH for it to be found just by typing 'maker'
>>>>>>> 
>>>>>>> Explanation of the Linux PATH --> http://www.linfo.org/path_env_var.html
>>>>>>> 
>>>>>>> --Carson
>>>>>>> 
>>>>>>> 
>>>>>>>> On Aug 28, 2017, at 8:07 AM, Ence,daniel <d.ence at ufl.edu> wrote:
>>>>>>>> 
>>>>>>>> Sorry I should have typed "maker -CTL". If that doesn't work, what is the result of "which maker"? 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Aug 28, 2017, at 10:00 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Daniel
>>>>>>>>> The reply is 
>>>>>>>>> emmannamekasMBP:maker emmannaemeka$ MAKER -ctl
>>>>>>>>> -bash: MAKER: command not found
>>>>>>>>> 
>>>>>>>>> On Mon, Aug 28, 2017 at 2:57 PM, Ence,daniel <d.ence at ufl.edu> wrote:
>>>>>>>>> Hi, It looks like MAKER installed ok. What is the command that you used to try to run MAKER? Can you show the result of running "MAKER -ctl"? 
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Daniel Ence
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Aug 28, 2017, at 9:24 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi Ence,
>>>>>>>>>> Thanks for your reply,
>>>>>>>>>> 
>>>>>>>>>> This is the step and error received
>>>>>>>>>> emmannamekasMBP:src emmannaemeka$ ./build install
>>>>>>>>>> Installing MAKER...
>>>>>>>>>> Building MAKER
>>>>>>>>>> Skip /Users/emmannaemeka/desktop/Gpm/maker/src/../perl/config-darwin-thread-multi-2level-5.018002 (unchanged)
>>>>>>>>>> 
>>>>>>>>>> The build status is
>>>>>>>>>> 
>>>>>>>>>> =============================================================================
>>>>>>>>>> STATUS MAKER v2.31.9
>>>>>>>>>> ==============================================================================
>>>>>>>>>> PERL Dependencies:  VERIFIED
>>>>>>>>>> External Programs:  VERIFIED
>>>>>>>>>> External C Libraries:   VERIFIED
>>>>>>>>>> MPI SUPPORT:        DISABLED
>>>>>>>>>> MWAS Web Interface: DISABLED
>>>>>>>>>> MAKER PACKAGE:      CONFIGURATION OK
>>>>>>>>>> 
>>>>>>>>>> On Mon, Aug 28, 2017 at 2:00 PM, Ence,daniel <d.ence at ufl.edu> wrote:
>>>>>>>>>> Hi Emmanuel, In order for anyone to help you, you need to post to the mailing list the command and output (including errors) of the step that didn't work. 
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Daniel Ence
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Aug 27, 2017, at 10:16 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hello all,
>>>>>>>>>>> 
>>>>>>>>>>> I downloaded MAKER and tried to install it. I succeeded in installing all prerequisites, and running ./build install reported that MAKER was installed.
>>>>>>>>>>> 
>>>>>>>>>>> However, when I try to run maker, it will not run.
>>>>>>>>>>> 
>>>>>>>>>>> How do I install maker so that it runs on a local computer?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks
>>>>>>>>>>> 
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> maker-devel mailing list
>>>>>>>>>>> maker-devel at box290.bluehost.com <mailto:maker-devel at box290.bluehost.com>
>>>>>>>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org <http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org>


From carsonhh at gmail.com  Fri Sep 22 14:08:36 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 22 Sep 2017 14:08:36 -0600
Subject: [maker-devel] Advice on my pipeline
In-Reply-To: <1505986013492.52354@unil.ch>
References: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch>
	<A39F4213-70A8-4E07-AB13-00C427E4F244@gmail.com>
	<1498470630221.84642@unil.ch>
	<696C51C6-5606-4ECB-A8B8-9C077182FFFA@gmail.com>
	<1498908228256.16549@unil.ch>
	<58E904BF-9AB8-4AC7-B10B-C902F414E03D@gmail.com>
	<1505986013492.52354@unil.ch>
Message-ID: <651D4267-0FD7-4A92-B778-8976B47353BB@gmail.com>

The gff3 passthrough options are there to help users get old data into MAKER when they have lost access to the original files. But for iterative running of the pipeline, it is more effective just to rerun in place so MAKER can access the raw alignment reports. The raw reports from the alignments have more detail than what is stored in the GFF3, and those details are lost when trying to use the GFF3 as input.

--Carson
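
As an illustration of the rerun-in-place approach described above, here is a minimal sketch (not part of MAKER) that rewrites a few key=value lines in maker_opts.ctl between rounds while leaving the trailing comments intact. The option names est2genome and protein2genome come from this thread; snaphmm is assumed to be the control-file key for the SNAP HMM file, and both the helper name set_ctl_options and the file name round1.hmm are hypothetical. The control-file excerpt is illustrative only.

```python
import re


def set_ctl_options(ctl_text, updates):
    """Return ctl_text with `key=value` lines rewritten for the keys in updates.

    Any trailing `#comment` on a rewritten line is preserved. A KeyError is
    raised if a requested key does not appear in the control file text.
    """
    remaining = dict(updates)
    out = []
    for line in ctl_text.splitlines():
        m = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if m and m.group(1) in remaining:
            key = m.group(1)
            comment = m.group(3) or ""
            value = str(remaining.pop(key))
            line = f"{key}={value} {comment}".rstrip()
        out.append(line)
    if remaining:
        raise KeyError(f"options not found in control file: {sorted(remaining)}")
    return "\n".join(out) + "\n"


# Illustrative excerpt of a control file (not a complete maker_opts.ctl).
example = """\
#-----Gene Prediction
snaphmm= #SNAP HMM file
est2genome=1 #1 = yes, 0 = no
protein2genome=1 #1 = yes, 0 = no
"""

round2 = set_ctl_options(example, {
    "snaphmm": "round1.hmm",  # hypothetical file name from a SNAP training step
    "est2genome": 0,
    "protein2genome": 0,
})
print(round2)
```

With the control file updated this way, the next round would be started in the same run directory, leaving the maker_gff and *_pass options at their defaults so that MAKER reuses its stored alignment reports rather than a GFF3 passthrough.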


> On Sep 21, 2017, at 3:26 AM, Patrick Tran Van <Patrick.TranVan at unil.ch> wrote:
> 
> Hi Carson,
> 
> I have a question about round 2. In a previous reply you said:
> 
> " Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can't be run in the same directory). "
>  
> Does this mean that I don't need to modify the section:
> 
> #-----Re-annotation Using MAKER Derived GFF3
> 
> ?
> 
> If I leave everything at the defaults, such as:
> 
> altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
> protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
> rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no 
> 
> 
> will MAKER not search again for repeats and for protein + transcriptome alignments?
> 
> Patrick Tran Van
> 
> Groups Chapuisat, Robinson-Rechavi & Schwander
> Department of Ecology and Evolution
> University of Lausanne
> Le Biophore
> CH-1015 Lausanne
> Switzerland
> Office 3206
> 
> From: Carson Holt <carsonhh at gmail.com>
> Sent: Monday, July 3, 2017 10:50 PM
> To: Patrick Tran Van
> Cc: maker-devel at yandell-lab.org
> Subject: Re: [maker-devel] Advice on my pipeline
>  
> maker2zff is just for SNAP training and not for gene filtering (please do not use it for filtering, it does not do what you think).
> 
> So the final annotation set after maker with correct_est_fusion is 16,850. To decide which set is better, look at them in a browser (gene counts are not useful for gauging results). A well annotated genome will have evidence clusters that closely match the final models. A poorly annotated genome will have evidence clusters that are split or merged by the models.
> 
> The correct_est_fusion option does two things. It trims long overlapping UTR fragments, and it stops evidence clusters from being merged on BLASTP evidence alone (so gene predictors will get unmerged hint regions if clusters are split).
> 
> You may also find that using jaccard_clip with Trinity has reduced sensitivity for the transcript data (you may lose things that were there before, but now have better specificity, i.e. fewer false positives). Make sure you provide protein data from at least two related species to help maintain the sensitivity lost from the transcript data. You can also add rejected gene models back in after the fact by using iprscan to identify unsupported models with identifiable protein domains --> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4286374/
> 
> Thanks,
> Carson
> 
> 
> 
> 
>> On Jul 1, 2017, at 5:21 AM, Patrick Tran Van <Patrick.TranVan at unil.ch> wrote:
>> 
>> So I have assembled my transcriptome with Trinity using the jaccard clip option and I have run maker with and without correct_est_fusion.
>> 
>> I have then used SNAP to train/filter it with:
>> 
>> maker2zff  specie.all.gff
>> 
>> Here are my results:
>> 
>> Number of genes after maker -> Number of genes after maker2zff
>> 
>> - Without correct_est_fusion: 21621 -> 13875
>> - With correct_est_fusion: 16850 -> 9098
>> 
>> 1) If I understand correctly how correct_est_fusion works, shouldn't it be the reverse, because it prevents gene merging?
>> Normally I should find more genes with correct_est_fusion, right?
>> 
>> 2) I think I should find something like 13000-14000 genes for my species. Should I go with the "Without correct_est_fusion" run for the 2nd iteration of maker?
>> 
>>  Thanks for your help 
>> 
>> 
>> From: Carson Holt <carsonhh at gmail.com>
>> Sent: Monday, June 26, 2017 11:38 PM
>> To: Patrick Tran Van
>> Cc: maker-devel at yandell-lab.org
>> Subject: Re: [maker-devel] Advice on my pipeline
>>  
>> Sorry the option is --> correct_est_fusion
>> 
>> It is in the maker_opts.ctl file.
>> 
>> I would use both SNAP and Augustus on a few large contigs then review the results manually. If one of them is not behaving well, then drop it. If both behave well (i.e. correlate well with evidence alignments) then keep them both.
>> 
>> --Carson
>> 
>> 
>> 
>>> On Jun 26, 2017, at 3:48 AM, Patrick Tran Van <Patrick.TranVan at unil.ch> wrote:
>>> 
>>> Thanks for your answer.
>>> 
>>> 1) Do you think that adding an Augustus training in addition to SNAP at steps 3 and 5 will add more confidence (instead of adding Augustus only for the final round)?
>>> I ask because I am using autoAug for this and it takes a while to compute.
>>> 
>>> 2) I don't see this option : 'avoid_est_fusion=1' . I have tried to add it but I got this error:
>>> 
>>> WARNING: Invalid option 'avoid_est_fusion' in control file maker_opts.ctl
>>> 
>>> (I am using v 2.31.8 )
>>> 
>>> 
>>> From: Carson Holt <carsonhh at gmail.com>
>>> Sent: Monday, June 5, 2017 8:29 PM
>>> To: Patrick Tran Van
>>> Cc: maker-devel at yandell-lab.org
>>> Subject: Re: [maker-devel] Advice on my pipeline
>>>  
>>> Your plan sounds good. A couple of related notes.
>>> 
>>> Insect genomes tend to have high gene density, so gene merging will be the primary difficulty. You can avoid merging of mRNA-seq evidence by using options like jaccard_clip in Trinity. Then use avoid_est_fusion=1 inside of MAKER.
>>> 
>>> Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can't be run in the same directory).
>>> 
>>> --Carson
>>> 
>>> 
>>>> On Jun 2, 2017, at 3:56 AM, Patrick Tran Van <Patrick.TranVan at unil.ch> wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> This is my first time running Maker for an insect genome annotation. 
>>>> 
>>>> I have found various resources and tried to build a consensus from them. I am looking for your thoughts and advice about my pipeline: whether I can improve something or am doing anything useless.
>>>> 
>>>> 
>>>> What I have:
>>>> - RNA evidence: transcriptome
>>>> - Protein evidence: swissprot/uniprot + busco protein set of insect
>>>> - Cegma and busco results of my genome
>>>> 
>>>> 
>>>> 1) Train SNAP with CEGMA
>>>> 
>>>> 2) Run (run A) maker with repeat masking with transcript, protein, the new SNAP file (from step 1) and augustus file (from busco).
>>>> 
>>>> 3) Create SNAP model from run A.
>>>> 
>>>> 4) Run (run B ) with the new SNAP (done at step 3) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_A.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).
>>>> 
>>>> 5) Create SNAP model from run B.
>>>> 
>>>> 6) Run (run C) with the new SNAP (done at step 5) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_B.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).
>>>> 
>>>> 7)  Create SNAP model from run C AND Create Augustus gene model from run C
>>>> 
>>>> 8) Run (run D) with the new SNAP (done at step 7) + AUGUSTUS file (step 7) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_C.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1). + Use keep_preds=1
>>>> 
>>>> 
>>>> 
>>>> Does it seem coherent?
>>>> 
>>>> Cheers,
>>>> 
>>>> Patrick Tran Van


From carson.holt at genetics.utah.edu  Fri Sep 22 14:19:22 2017
From: carson.holt at genetics.utah.edu (Carson Hinton Holt)
Date: Fri, 22 Sep 2017 20:19:22 +0000
Subject: [maker-devel] MAKER
In-Reply-To: <1f033c3dcf414407a888bbef8201d469@JKI-EXPF03.braunschweig.bba.intern>
References: <80e44920c4d64d6e82d18e5535656a32@JKI-EXPF03.braunschweig.bba.intern>
	<D2EB4BE8.3069C%myandell@genetics.utah.edu>
	<3BE2DF2C-9443-4314-9524-F0109AAE42E8@genetics.utah.edu>
	<ddad51a1aab5452b816be741d2969d81@JKI-EXPF03.braunschweig.bba.intern>
	<9867D99C-E4AD-4673-BB0B-804A1551F9F3@genetics.utah.edu>
	<f9b68e0588dd4ce6a65dcd63cb4f7b4e@JKI-EXPF03.braunschweig.bba.intern>
	<B11ADE34-41A4-4EA7-A0AC-D4DFD649991D@genetics.utah.edu>
	<1f033c3dcf414407a888bbef8201d469@JKI-EXPF03.braunschweig.bba.intern>
Message-ID: <ADB216BF-2828-4906-A32F-58CC3989102F@genetics.utah.edu>

All est2genome and protein2genome do is take exonerate alignments of the fasta inputs and translate the longest ORF to get a rough base model that can be used to train a gene predictor. That is why the documentation says they should be turned off once the predictor is trained.

Once you get the gene predictor trained, MAKER will feed hints to the gene predictor derived from alignments and input GFF3. These hints greatly improve the performance of the gene predictors. MAKER will also use the alignments to filter out predictions that do not match the evidence alignments.

--Carson
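
To make the "translate the longest ORF" step above concrete, here is a minimal sketch of a longest-ORF search on a single transcript sequence. It is an illustration only and not MAKER's implementation: MAKER works from exonerate alignments against the genome, whereas this sketch scans the three forward reading frames of a plain sequence. The function name longest_orf is hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG-to-stop ORF on the forward strand.

    Coordinates are 0-based and end-exclusive, and the span includes the stop
    codon. Returns None if no complete ORF (ATG followed by an in-frame stop)
    is found.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None
    return best


transcript = "AAATGAAACCCGGGTAAAA"
span = longest_orf(transcript)
if span is not None:
    print(span, transcript[span[0]:span[1]])
```

A rough model built this way is only as complete as the alignment it came from, which is consistent with the advice above to use such models for training and then switch est2genome and protein2genome off.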


> On Sep 22, 2017, at 2:15 PM, Keilwagen, Jens <jens.keilwagen at julius-kuehn.de> wrote:
> 
> Hi Carson,
> 
> Thanks a lot for the information.
> 
> Just to be sure that I understand you right: It is impossible to obtain MAKER results based on RNA-seq and homology that differ from purely homology-based MAKER results?
> 
> Could you confirm that?
> 
> Thanks a lot and best regards, Jens
> 
>> -----Original Message-----
>> From: Carson Hinton Holt [mailto:carson.holt at genetics.utah.edu]
>> Sent: Friday, 22 September 2017 22:04
>> To: Keilwagen, Jens
>> Cc: Maker Mailing List
>> Subject: Re: MAKER
>> 
>> MAKER won't produce est2genome results for est_gff. This is partially
>> because est2genome results are only used for training gene predictors.
>> So you are essentially just getting protein2genome results from your
>> runs. Once you get a gene predictor trained you will see a difference,
>> as it will use the intron/exon structure of alignments as hints to
>> improve gene predictor performance.
>> 
>> --Carson
>> 
>> 
>>> On Sep 21, 2017, at 1:57 AM, Keilwagen, Jens <jens.keilwagen at julius-kuehn.de> wrote:
>>> 
>>> Hi Carson,
>>> 
>>> I have tried the proposed options for a small example (yeast).
>>> 
>>> I had
>>> - proteins (fasta) from another yeast and
>>> - transcript annotation (gff) from cufflinks and StringTie
>>> 
>>> I'd like to compare the maker results for
>>> - proteins and StringTie
>>> Vs.
>>> - proteins and cufflinks
>>> 
>>> I used the default options, except:
>>> genome=<genome fasta>
>>> 
>>> protein=<protein fasta>
>>> est_gff=<transcript gff>
>>> 
>>> est2genome=1
>>> protein2genome=1
>>> 
>>> (An example is attached.)
>>> 
>>> Then I ran maker:
>>> 
>>> maker -RM_off -c 24
>>> find . -type f -name *.gff -exec cat {} + | grep maker > filtered-maker-prediction.gff
>>> 
>>> (The run seems to be okay. There were no FAILED, ... in the log. Cf.
>>> attachment)
>>> 
>>> Each maker run was started in a separate subdirectory.
>>> However, I realized that both maker runs yielded almost the same
>>> result (just one minor edit). This made me curious.
>>> As far as I understood the files, I received the (filtered?)
>>> exonerate predictions for the proteins (from the other yeast). Is this
>>> correct? Why did I not receive any predictions (purely) based on the
>>> RNA-seq data? Did I do something wrong?
>>> 
>>> I'm looking forward to your reply.
>>> 
>>> Best regards, Jens
>>> 
>>> 
>>>> -----Original Message-----
>>>> From: Carson Hinton Holt [mailto:carson.holt at genetics.utah.edu]
>>>> Sent: Tuesday, 19 September 2017 23:37
>>>> To: Keilwagen, Jens
>>>> Subject: Re: MAKER
>>>> 
>>>> MAKER cannot use the BAM directly, but you can use something like
>>>> stringtie or trinity to assemble a transcript fasta that can be given
>>>> to the est= option.
>>>> 
>>>> Ab initio gene prediction is only enabled if you specify an hmm or
>>>> species file to use.  If all you want is homology based annotation,
>>>> you can try the est2genome and protein2genome options. Note the final
>>>> models may be partial if the alignments do not cover the gene end to
>>>> end.
>>>> 
>>>> --Carson
>>>> 
>>>> 
>>>> 
>>>>> On Sep 18, 2017, at 4:02 AM, Keilwagen, Jens <jens.keilwagen at julius-kuehn.de> wrote:
>>>>> 
>>>>> Hi Carson,
>>>>> 
>>>>> thanks a lot for your last email that .
>>>>> 
>>>>> I was asked to do homology-based gene prediction using RNA-seq and
>>>>> Maker was proposed as one option.
>>>>> Hence I'd like to ask how to do that in the best possible way.
>>>>> I have mapped RNA-seq data (SAM/BAM) and a fasta of proteins from a
>>>>> related species. How can I integrate the RNA-seq data?
>>>>> 
>>>>> Is it possible to deactivate ab-initio gene prediction by Augustus
>>>>> or SNAP?
>>>>> 
>>>>> Thanks a lot in advance.
>>>>> 
>>>>> Best regards, Jens
>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Carson Holt [mailto:carson.holt at genetics.utah.edu]
>>>>>> Sent: Thursday, 18 February 2016 19:03
>>>>>> To: Keilwagen, Jens
>>>>>> Cc: Mark Yandell
>>>>>> Subject: Re: MAKER
>>>>>> 
>>>>>> GeMoMa sounds like an interesting tool.  If it produces GFF3, you
>>>>>> could give the GFF3 results to the pred_gff= option in MAKER (comma
>>>>>> separated lists accepted). The GFF3 file of predictions must be in
>>>>>> the same coordinate space as the assembly being annotated (genome=
>>>>>> option). Whatever you give to pred_gff will be treated as raw
>>>>>> predictions by MAKER and will only be accepted as a final model if
>>>>>> there are evidence alignments (protein/EST) that support the model.
>>>>>> If there are multiple alternate models at the same locus, only the
>>>>>> model that is best supported by the protein/transcript evidence is
>>>>>> kept.
>>>>>> 
>>>>>> You can also set the keep_preds=1 option when using pred_gff. This
>>>>>> will cause even raw predictions with no evidence support to be
>>>>>> maintained. In the event of multiple models with no evidence
>>>>>> support, the model best matching the consensus of alternate models
>>>>>> will be maintained.
>>>>>> 
>>>>>> Alternatively you can use the model_gff= option (comma separated
>>>>>> list ok) to input the GFF3 file.  model_gff features are given
>>>>>> higher confidence than pred_gff. At least one model will always be
>>>>>> kept regardless of evidence support (same rules as pred_gff
>>>>>> selection for which model to keep when there are multiple). But
>>>>>> model_gff will also affect how evidence clusters are determined
>>>>>> compared to pred_gff (model_gff features are allowed to merge
>>>>>> bridging evidence clusters). MAKER will also go to extra lengths to
>>>>>> pull forward existing names and other data in the GFF3 for
>>>>>> model_gff features.
>>>>>> 
>>>>>> If you do not have GFF3 files in the right coordinate space, but do
>>>>>> have protein fasta or transcript fasta for the GeMoMa predictions,
>>>>>> you can supply these to the protein= and transcript= options in
>>>>>> MAKER together with est2genome=1 or protein2genome=1. This will
>>>>>> cause MAKER to place the models using exonerate. You would probably
>>>>>> also need to add est_forward=1 to the control files to have MAKER
>>>>>> try and derive model names from the name of evidence alignments
>>>>>> they were derived from if you go this route.
>>>>>> 
>>>>>> You can also try treating the GFF3 predictions as hints to
>>>>>> traditional ab initio gene finders like SNAP or Augustus by giving
>>>>>> them to the est_gff= or protein_gff= options (i.e. make GeMoMa
>>>>>> predictions inform the behavior of predictors like SNAP and
>>>>>> Augustus). Might be interesting. You would have to alter results to
>>>>>> be match/match_part GFF3 features to give them to the est_gff or
>>>>>> protein_gff options.
>>>>>> 
>>>>>> Let me know if you have any more questions, and I'll do my best to
>>>>>> help.
>>>>>> 
>>>>>> Thanks,
>>>>>> Carson
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Feb 18, 2016, at 10:22 AM, Mark Yandell <myandell at genetics.utah.edu> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> Mark Yandell
>>>>>>> Professor of Human Genetics
>>>>>>> H.A. & Edna Benning Presidential Endowed Chair Co-director USTAR
>>>>>>> Center for Genetic Discovery Eccles Institute of Human Genetics
>>>>>>> University of Utah
>>>>>>> 15 North 2030 East, Room 2100
>>>>>>> Salt Lake City, UT 84112-5330
>>>>>>> ph:801-587-7707
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On 2/18/16, 8:34 AM, "Keilwagen, Jens" <jens.keilwagen at jki.bund.de> wrote:
>>>>>>> 
>>>>>>>> Dear Prof. Yandell,
>>>>>>>> 
>>>>>>>> we have published a homology-based gene prediction program today:
>>>>>>>> https://nar.oxfordjournals.org/content/early/2016/02/17/nar.gkw092
>>>>>>>> and I'd like to ask how we can use MAKER to combine predictions
>>>>>>>> of GeMoMa using different reference organisms, i.e. we try to
>>>>>>>> predict the genes of a target organism (e.g. wheat) using the
>>>>>>>> annotated genes of other reference organisms (e.g. grasses).
>>>>>>>> GeMoMa returns for each reference organism a GFF with the
>>>>>>>> predicted gene models in the target organism.
>>>>>>>> 
>>>>>>>> It would be great if you or someone from your team could give us
>>>>>>>> some hints or point us to the correct paragraph in the documentation.
>>>>>>>> Thanks a lot and best regards, Jens
>>>>>>>> 
>>>>>>>> ---
>>>>>>>> 
>>>>>>>> Dr. Jens Keilwagen
>>>>>>>> 
>>>>>>>> Julius Kühn-Institut (JKI) - Federal Research Centre for Cultivated
>>>>>>>> Plants
>>>>>>>> Institute for Biosafety in Plant Biotechnology
>>>>>>>> 
>>>>>>>> Erwin-Baur-Straße 27
>>>>>>>> 06484 Quedlinburg
>>>>>>>> Germany
>>>>>>>> 
>>>>>>>> Phone: ++49 (0)3946 47 510
>>>>>>>> EMail: jens.keilwagen at jki.bund.de
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>> 
>>> <maker_opts.ctl><slurm-278767.out>
> 


From jens.keilwagen at julius-kuehn.de  Fri Sep 22 14:15:23 2017
From: jens.keilwagen at julius-kuehn.de (Keilwagen, Jens)
Date: Fri, 22 Sep 2017 20:15:23 +0000
Subject: [maker-devel] MAKER
In-Reply-To: <B11ADE34-41A4-4EA7-A0AC-D4DFD649991D@genetics.utah.edu>
References: <80e44920c4d64d6e82d18e5535656a32@JKI-EXPF03.braunschweig.bba.intern>
	<D2EB4BE8.3069C%myandell@genetics.utah.edu>
	<3BE2DF2C-9443-4314-9524-F0109AAE42E8@genetics.utah.edu>
	<ddad51a1aab5452b816be741d2969d81@JKI-EXPF03.braunschweig.bba.intern>
	<9867D99C-E4AD-4673-BB0B-804A1551F9F3@genetics.utah.edu>
	<f9b68e0588dd4ce6a65dcd63cb4f7b4e@JKI-EXPF03.braunschweig.bba.intern>
	<B11ADE34-41A4-4EA7-A0AC-D4DFD649991D@genetics.utah.edu>
Message-ID: <1f033c3dcf414407a888bbef8201d469@JKI-EXPF03.braunschweig.bba.intern>

Hi Carson,

Thanks a lot for the information.

Just to be sure that I understand you right: It is impossible to obtain MAKER results based on RNA-seq and homology that differ from purely homology-based MAKER results?

Could you confirm that?

Thanks a lot and best regards, Jens

> -----Original Message-----
> From: Carson Hinton Holt [mailto:carson.holt at genetics.utah.edu]
> Sent: Friday, 22 September 2017 22:04
> To: Keilwagen, Jens
> Cc: Maker Mailing List
> Subject: Re: MAKER
> 
> MAKER won't produce est2genome results for est_gff. This is partially
> because est2genome results are only used for training gene predictors.
> So you are essentially just getting protein2genome results from your
> runs. Once you get a gene predictor trained you will see a difference,
> as it will use the intron/exon structure of alignments as hints to
> improve gene predictor performance.
> 
> --Carson
> 
> 
> > On Sep 21, 2017, at 1:57 AM, Keilwagen, Jens <jens.keilwagen at julius-kuehn.de> wrote:
> >
> > Hi Carson,
> >
> > I have tried the proposed options for a small example (yeast).
> >
> > I had
> > - proteins (fasta) from another yeast and
> > - transcript annotation (gff) from cufflinks and StringTie
> >
> > I'd like to compare the maker results for
> > - proteins and StringTie
> > Vs.
> > - proteins and cufflinks
> >
> > I used the default options, except:
> > genome=<genome fasta>
> >
> > protein=<protein fasta>
> > est_gff=<transcript gff>
> >
> > est2genome=1
> > protein2genome=1
> >
> > (An example is attached.)
> >
> > Then I ran maker:
> >
> > maker -RM_off -c 24
> > find . -type f -name '*.gff' -exec cat {} + | grep maker > filtered-maker-prediction.gff
> >
> > (The run seems to be okay. There were no FAILED, ... in the log. Cf.
> > attachment)
> >
> > Each maker run was started in a separate subdirectory.
> > However, I realized that both maker runs yielded almost the same
> > result (just one minor edit). This made me curious.
> > As far as I understood the files, I received the (filtered?)
> > exonerate predictions for the proteins (from the other yeast). Is this
> > correct? Why did I not receive any predictions (purely) based on the
> > RNA-seq data? Did I do something wrong?
> >
> > I'm looking forward to your reply.
> >
> > Best regards, Jens
> >
> >
> >> -----Original Message-----
> >> From: Carson Hinton Holt [mailto:carson.holt at genetics.utah.edu]
> >> Sent: Tuesday, 19 September 2017 23:37
> >> To: Keilwagen, Jens
> >> Subject: Re: MAKER
> >>
> >> MAKER cannot use the BAM directly, but you can use something like
> >> stringtie or trinity to assemble a transcript fasta that can be given
> >> to the est= option.
> >>
> >> Ab initio gene prediction is only enabled if you specify an hmm or
> >> species file to use.  If all you want is homology based annotation,
> >> you can try the est2genome and protein2genome options. Note the final
> >> models may be partial if the alignments do not cover the gene end to
> >> end.
> >>
> >> --Carson
> >>
> >>
> >>
> >>> On Sep 18, 2017, at 4:02 AM, Keilwagen, Jens <jens.keilwagen at julius-kuehn.de> wrote:
> >>>
> >>> Hi Carson,
> >>>
> >>> thanks a lot for your last email that .
> >>>
> >>> I was asked to do homology-based gene prediction using RNA-seq and
> >>> Maker was proposed as one option.
> >>> Hence I'd like to ask how to do that in the best possible way.
> >>> I have mapped RNA-seq data (SAM/BAM) and a fasta of proteins from a
> >>> related species. How can I integrate the RNA-seq data?
> >>>
> >>> Is it possible to deactivate ab-initio gene prediction by Augustus
> >>> or SNAP?
> >>>
> >>> Thanks a lot in advance.
> >>>
> >>> Best regards, Jens
> >>>
> >>>> -----Original Message-----
> >>>> From: Carson Holt [mailto:carson.holt at genetics.utah.edu]
> >>>> Sent: Thursday, 18 February 2016 19:03
> >>>> To: Keilwagen, Jens
> >>>> Cc: Mark Yandell
> >>>> Subject: Re: MAKER
> >>>>
> >>>> GeMoMa sounds like an interesting tool.  If it produces GFF3, you
> >>>> could give the GFF3 results to the pred_gff= option in MAKER (comma
> >>>> separated lists accepted). The GFF3 file of predictions must be in
> >>>> the same coordinate space as the assembly being annotated (genome=
> >>>> option). Whatever you give to pred_gff will be treated as raw
> >>>> predictions by MAKER and will only be accepted as a final model if
> >>>> there are evidence alignments (protein/EST) that support the model,
> >>>> and if there are multiple alternate models at the same locus, only
> >>>> the model that is best supported by the protein/transcript evidence
> >>>> is kept.
> >>>>
> >>>> You can also set the keep_preds=1 option when using pred_gff. This
> >>>> will cause even raw predictions with no evidence support to be
> >>>> maintained. In the event of multiple models with no evidence
> >>>> support, the model best matching the consensus of alternate models
> >>>> will be maintained.
> >>>>
> >>>> Alternatively you can use the model_gff= option (comma separated
> >>>> list ok) to input the GFF3 file.  model_gff features are given higher
> >>>> confidence than pred_gff. At least one model will always be kept
> >>>> regardless of evidence support (same rules as pred_gff selection
> >>>> for which model to keep when there are multiple). But model_gff
> >>>> will also affect how evidence clusters are determined compared to
> >>>> pred_gff (model_gff features are allowed to merge bridging evidence
> >>>> clusters). MAKER will also go to extra lengths to pull forward
> >>>> existing names and other data in the GFF3 for model_gff features.
> >>>>
> >>>> If you do not have GFF3 files in the right coordinate space, but do
> >>>> have protein fasta or transcript fasta for the GeMoMa predictions,
> >>>> you can supply these to the protein= and transcript= options in
> >>>> MAKER together with est2genome=1 or protein2genome=1. This will
> >>>> cause MAKER to place the models using exonerate. You would probably
> >>>> also need to add est_forward=1 to the control files to have MAKER
> >>>> try and derive model names from the name of evidence alignments
> >>>> they were derived from if you go this route.
> >>>>
> >>>> You can also try treating the GFF3 predictions as hints to
> >>>> traditional ab initio gene finders like SNAP or Augustus by giving
> >>>> them to the est_gff= or protein_gff= options (i.e. make GeMoMa
> >>>> predictions inform the behavior of predictors like SNAP and
> >>>> Augustus). Might be interesting. You would have to alter results to
> >>>> be match/match_part GFF3 features to give them to the est_gff or
> >>>> protein_gff options.
> >>>>
> >>>> Let me know if you have any more questions, and I'll do my best to
> >>>> help.
> >>>>
> >>>> Thanks,
> >>>> Carson
> >>>>
> >>>>
> >>>>
> >>>>> On Feb 18, 2016, at 10:22 AM, Mark Yandell <myandell at genetics.utah.edu> wrote:
> >>>>>
> >>>>>
> >>>>> Mark Yandell
> >>>>> Professor of Human Genetics
> >>>>> H.A. & Edna Benning Presidential Endowed Chair Co-director USTAR
> >>>>> Center for Genetic Discovery Eccles Institute of Human Genetics
> >>>>> University of Utah
> >>>>> 15 North 2030 East, Room 2100
> >>>>> Salt Lake City, UT 84112-5330
> >>>>> ph:801-587-7707
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 2/18/16, 8:34 AM, "Keilwagen, Jens" <jens.keilwagen at jki.bund.de> wrote:
> >>>>>
> >>>>>> Dear Prof. Yandell,
> >>>>>>
> >>>>>> we have published a homology-based gene prediction program today:
> >>>>>> https://nar.oxfordjournals.org/content/early/2016/02/17/nar.gkw092
> >>>>>> and I'd like to ask how we can use MAKER to combine predictions
> >>>>>> of GeMoMa using different reference organisms, i.e. we try to
> >>>>>> predict the genes of a target organism (e.g. wheat) using the
> >>>>>> annotated genes of other reference organisms (e.g. grasses).
> >>>>>> GeMoMa returns for each reference organism a GFF with the
> >>>>>> predicted gene models in the target organism.
> >>>>>>
> >>>>>> It would be great if you or someone from your team could give us
> >>>>>> some hints or point us to the correct paragraph in the
> >>>>>> documentation.
> >>>>>>
> >>>>>> Thanks a lot and best regards, Jens
> >>>>>>
> >>>>>> ---
> >>>>>>
> >>>>>> Dr. Jens Keilwagen
> >>>>>>
> >>>>>> Julius Kühn-Institut (JKI) - Federal Research Centre for
> >>>>>> Cultivated Plants
> >>>>>> 	Institute for Biosafety in Plant Biotechnology
> >>>>>>
> >>>>>> Erwin-Baur-Straße 27
> >>>>>> 06484 Quedlinburg
> >>>>>> Germany
> >>>>>>
> >>>>>> Phone: ++49 (0)3946 47 510
> >>>>>> EMail: jens.keilwagen at jki.bund.de
> >>>>>>
> >>>>>>
> >>>>>
> >>>
> >
> > <maker_opts.ctl><slurm-278767.out>


From venyao at qq.com  Sun Sep 24 03:08:43 2017
From: venyao at qq.com (Wen Yao)
Date: Sun, 24 Sep 2017 17:08:43 +0800
Subject: [maker-devel] integrate gmap into Maker
Message-ID: <tencent_C1B73CB316EAE11D6C6509793B207C5AE908@qq.com>

Dear Guys,

I am using Maker to annotate my genome sequence. However, it takes too much time.

By default, Maker uses Blastn and Exonerate to align ESTs or assembled transcripts to the genome.

I am wondering if I can use GMAP to align the assembled transcripts to the genome and then provide the alignment to Maker. If so, this may save much time, as GMAP is very fast.

Thanks!

Best regards,

Wen Yao

From eennadi at gmail.com  Sun Sep 24 15:24:10 2017
From: eennadi at gmail.com (Emmanuel Nnadi)
Date: Sun, 24 Sep 2017 22:24:10 +0100
Subject: [maker-devel] Maker not installing
In-Reply-To: <8576F599-08C0-455A-B0D7-8A1110C6F18F@mail.ufl.edu>
References: <CAOXM=CXp_YU=4kaNCFPVnoWKTzAP5we5XTFX6uFTWJSY_11fng@mail.gmail.com>
	<8576F599-08C0-455A-B0D7-8A1110C6F18F@mail.ufl.edu>
Message-ID: <CAOXM=CUH4NjOev7cW2-d8EUKJPmQ1Ob_hLQX-mFG_jTSVfzD3Q@mail.gmail.com>

Hello,

Good day,

I am trying to assign putative gene function to the maker generated fasta.
I am using NCBI

I keep getting this error
  Command line argument error: Argument "query". File is not accessible:
`muc1_genome_snap2.all.maker.snap_masked.proteins.fasta'

What do I do?

Can I use Blast2GO in place of the NCBI command-line software?

Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:
https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Mon, Aug 28, 2017 at 2:00 PM, Ence,daniel <d.ence at ufl.edu> wrote:

> Hi Emmanuel, In order for anyone to help you, you need to post to the mailing
> list the command and output (including errors) of the step that didn't
> work.
>
> Thanks,
> Daniel Ence
>
>
> On Aug 27, 2017, at 10:16 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>
> Hello all,
>
> I downloaded Maker and tried to install it. I succeeded in installing all
> prerequisites however running maker ./build install, it showed that maker
> installed.
>
> However trying to run maker it wouldn't run.
>
> Please how do I install maker to run on local computer?
>
> Thanks
>
> Nnadi Nnaemeka Emmanuel
> Department of Microbiology,
> Faculty of Natural and Applied Science,
> Plateau State University, Bokkos, Plateau State, Nigeria.
> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
>
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
>
>

From dandence at gmail.com  Mon Sep 25 08:11:31 2017
From: dandence at gmail.com (Daniel Ence)
Date: Mon, 25 Sep 2017 10:11:31 -0400
Subject: [maker-devel] integrate gmap into Maker
In-Reply-To: <tencent_C1B73CB316EAE11D6C6509793B207C5AE908@qq.com>
References: <tencent_C1B73CB316EAE11D6C6509793B207C5AE908@qq.com>
Message-ID: <7E5F06C8-05B2-447F-A695-DDE7673BDEFF@gmail.com>

Without commenting on the merits of GMAP vs Blastn or Exonerate, you can provide evidence alignments from any source in GFF format in the MAKER control files. I think for GMAP this would mean converting the SAM/BAM outputs to GFF3 format, but I don't know those steps off the top of my head.
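
As a rough illustration of the GFF3 route (this is not a GMAP converter): in another thread on this list, Carson notes that results given to est_gff have to be match/match_part GFF3 features. Assuming the exon blocks of each transcript alignment have already been pulled out of the aligner output, a minimal Python sketch of writing them in that shape might look like the following. The function name, the "gmap" source label, and the example coordinates are all hypothetical.

```python
def alignment_to_gff3(seqid, name, strand, blocks, source="gmap"):
    """Format one spliced alignment as a GFF3 match with match_part children.

    blocks: list of (start, end) genomic coordinates, 1-based and inclusive.
    The default source label "gmap" is only an illustrative choice.
    """
    blocks = sorted(blocks)
    start = min(b_start for b_start, _ in blocks)
    end = max(b_end for _, b_end in blocks)
    # Parent feature spanning the whole alignment
    rows = [(seqid, source, "match", start, end, ".", strand, ".",
             f"ID={name};Name={name}")]
    # One child feature per aligned block (exon)
    for i, (b_start, b_end) in enumerate(blocks, 1):
        rows.append((seqid, source, "match_part", b_start, b_end, ".", strand, ".",
                     f"ID={name}:part{i};Parent={name}"))
    return ["\t".join(str(field) for field in row) for row in rows]


# Hypothetical transcript with two exon blocks on a contig named contig_2
lines = ["##gff-version 3"] + alignment_to_gff3(
    "contig_2", "transcript_1", "+", [(100, 250), (400, 520)])
print("\n".join(lines))
```

If the aligner can already write GFF3 match features itself, this step may not be needed at all; the coordinates must in any case be in the same coordinate space as the genome= assembly.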

~Daniel 


> On Sep 24, 2017, at 5:08 AM, Wen Yao <venyao at qq.com> wrote:
> 
> Dear Guys,
> 
> I am using Maker to annotate my genome sequence. However, it takes too much time.
> 
> By default, Maker uses Blastn and Exonerate to align ESTs or assembled transcripts to the genome.
> 
> I am wondering if I can use GMAP to align the assembled transcripts to the genome and then provide the alignment to Maker. If so, this may save much time, as GMAP is very fast.
> 
> Thanks!
> 
> Best regards,
> 
> Wen Yao
> 


From carsonhh at gmail.com  Mon Sep 25 10:07:39 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 25 Sep 2017 10:07:39 -0600
Subject: [maker-devel] Maker not installing
In-Reply-To: <CAOXM=CUH4NjOev7cW2-d8EUKJPmQ1Ob_hLQX-mFG_jTSVfzD3Q@mail.gmail.com>
References: <CAOXM=CXp_YU=4kaNCFPVnoWKTzAP5we5XTFX6uFTWJSY_11fng@mail.gmail.com>
	<8576F599-08C0-455A-B0D7-8A1110C6F18F@mail.ufl.edu>
	<CAOXM=CUH4NjOev7cW2-d8EUKJPmQ1Ob_hLQX-mFG_jTSVfzD3Q@mail.gmail.com>
Message-ID: <07342091-897A-46C2-B000-76A283FE5FB1@gmail.com>

I'm not sure what you mean by NCBI. Do you mean BLAST? If so, you probably did not format and index your input database before running BLAST. See BLAST documentation.

Also the file you are using --> muc1_genome_snap2.all.maker.snap_masked.proteins.fasta

That is not the maker result file. That is a reference fasta of raw SNAP results. The MAKER result file will have a name like this (see maker documentation) --> muc1_genome_snap2.all.maker.proteins.fasta
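
A small Python sketch of the two checks implied here, under stated assumptions: that the final MAKER protein file ends in `.all.maker.proteins.fasta` (as in the name above), and that a "File is not accessible" message typically means the query path could not be opened from the directory BLAST was started in. The helper names are hypothetical.

```python
from pathlib import Path


def is_merged_maker_proteins(path):
    """True if the file name matches the final MAKER protein naming shown above."""
    return Path(path).name.endswith(".all.maker.proteins.fasta")


def check_query(path):
    """Return a short diagnosis for a FASTA file intended as a BLAST query."""
    p = Path(path)
    if not p.is_file():
        return f"not found from {Path.cwd()}: {path}"
    if not is_merged_maker_proteins(p):
        return "file exists, but the name looks like per-predictor output"
    return "ok"


print(check_query("muc1_genome_snap2.all.maker.proteins.fasta"))
```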

--Carson


> On Sep 24, 2017, at 3:24 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
> 
> Hello,
> 
> Good day,
> 
> I am trying to assign putative gene function to the maker generated fasta. I am using NCBI
> 
> I keep getting this error
>   Command line argument error: Argument "query". File is not accessible:  `muc1_genome_snap2.all.maker.snap_masked.proteins.fasta'
> 
> What do I do?
> 
> can I use blast2go in place of ncbi command line software?
> 
> Nnadi Nnaemeka Emmanuel
> Department of Microbiology,
> Faculty of Natural and Applied Science,
> Plateau State University, Bokkos, Plateau State, Nigeria.
> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
> On Mon, Aug 28, 2017 at 2:00 PM, Ence,daniel <d.ence at ufl.edu> wrote:
> Hi Emmanuel, In order for anyone to help you, you need to post to the mailing list the command and output (including errors) of the step that didn't work.
> 
> Thanks,
> Daniel Ence
> 
> 
>> On Aug 27, 2017, at 10:16 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>> 
>> Hello all,
>> 
>> I downloaded Maker and tried to install it. I succeeded in installing all prerequisites however running maker ./build install, it showed that maker installed.
>> 
>> However trying to run maker it wouldn't run.
>> 
>> Please how do I install maker to run on local computer?
>> 
>> Thanks
>> 
>> Nnadi Nnaemeka Emmanuel
>> Department of Microbiology,
>> Faculty of Natural and Applied Science,
>> Plateau State University, Bokkos, Plateau State, Nigeria.
>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
>> 
>>    
> 
> 


From xvazquezc at gmail.com  Tue Sep 26 01:23:13 2017
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Tue, 26 Sep 2017 17:23:13 +1000
Subject: [maker-devel] question about Maker-MPI
Message-ID: <CAL0hg4EemGn5F25U6xQjBpG0i_ZYzSa=E3E2OXgBu8Q_2qm87A@mail.gmail.com>

Hi Carson,
We finally got Maker working with MPI (mpich, openmpi was a dead end...)
and I have a question about how Maker distributes the computation load.
I know, correct me if I'm wrong, that with MPI, Maker runs blast in
parallel (1 instance per thread) for protein2genome and est2genome. This
indeed improves enormously the speed for the initial run.
But, does it take advantage of this at the time of running the gene
predictors? I think there is no benefit on multiple cpus in non-MPI mode
but I have no idea in MPI.
Thank you in advance,
Xabi

-- 
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From carsonhh at gmail.com  Tue Sep 26 09:28:58 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 26 Sep 2017 09:28:58 -0600
Subject: [maker-devel] question about Maker-MPI
In-Reply-To: <CAL0hg4EemGn5F25U6xQjBpG0i_ZYzSa=E3E2OXgBu8Q_2qm87A@mail.gmail.com>
References: <CAL0hg4EemGn5F25U6xQjBpG0i_ZYzSa=E3E2OXgBu8Q_2qm87A@mail.gmail.com>
Message-ID: <E29F4653-61A3-4E33-967A-4E1A9C8C4721@gmail.com>

MAKER parallelizes at multiple levels. For the ab initio predictors, it will run multiple contigs simultaneously (so each one will get their own ab initio predictor running). For large contigs it will further divide it into 10Mb chunks, and each will run simultaneously.
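
To make the chunking arithmetic concrete, here is a minimal Python sketch. It is not MAKER's implementation, only an illustration of cutting a contig into fixed-size pieces; the function name is hypothetical and the 10 Mb default is taken from the figure quoted above.

```python
def split_into_chunks(contig_length, chunk_size=10_000_000):
    """Return 1-based inclusive (start, end) ranges covering a contig."""
    return [(offset + 1, min(offset + chunk_size, contig_length))
            for offset in range(0, contig_length, chunk_size)]


# An 18 Mb contig yields two chunks under this scheme
for start, end in split_into_chunks(18_000_000):
    print(start, end)
```

Under this scheme a contig shorter than the chunk size stays in one piece, so for small contigs the parallelism comes from running several contigs at once.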

--Carson


> On Sep 26, 2017, at 1:23 AM, Xabier Vázquez-Campos <xvazquezc at gmail.com> wrote:
> 
> Hi Carson,
> We finally got Maker working with MPI (mpich, openmpi was a dead end...) and I have a question about how Maker distributes the computation load.
> I know, correct me if I'm wrong, that with MPI, Maker runs blast in parallel (1 instance per thread) for protein2genome and est2genome. This indeed improves enormously the speed for the initial run.
> But, does it take advantage of this at the time of running the gene predictors? I think there is no benefit on multiple cpus in non-MPI mode but I have no idea in MPI.
> Thank you in advance,
> Xabi
> 
> -- 
> Xabier Vázquez-Campos, PhD
> Research Associate
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA


From cjfields at illinois.edu  Mon Sep 25 08:53:39 2017
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 25 Sep 2017 14:53:39 +0000
Subject: [maker-devel] Maker not installing
In-Reply-To: <78A8137E-42B4-4766-9E4A-1B3C2F4FC578@gmail.com>
References: <CAOXM=CXp_YU=4kaNCFPVnoWKTzAP5we5XTFX6uFTWJSY_11fng@mail.gmail.com>
	<8576F599-08C0-455A-B0D7-8A1110C6F18F@mail.ufl.edu>
	<CAOXM=CWWENFHJUhYPawM5+F4571HYOyp3oWD1xju9H9LABq9VQ@mail.gmail.com>
	<113031DF-E733-4EEF-BDB7-405C15F0CA24@mail.ufl.edu>
	<CAOXM=CVrYeNV9i6TxYDn1B-dCA-OYh+VqsEmc-PsA2rP14UZ9A@mail.gmail.com>
	<546610AC-0B7B-4D02-BBF3-E847B95D7F0D@mail.ufl.edu>
	<C3F7BAD9-C134-4EC9-9792-F2C27A78FBFD@gmail.com>
	<CAOXM=CXeT-zmFubptPnWsAygnv8o+EQnCWD60iLf5eFFBkb6rg@mail.gmail.com>
	<8EAFE412-9EF7-4DB7-85A3-632BAC3372FD@gmail.com>
	<CAOXM=CVbJonkz=O5=N_NmB1L+524D6Uk9Fjh-ZBoJmu+9AsMFQ@mail.gmail.com>
	<C9D0377D-1C89-4910-A6BE-FA7DD3D8CDCB@gmail.com>
	<CAOXM=CX7yZTTbBALhjBzz0qtMxQGxBNVO0UEnkis3aDvVf=GLA@mail.gmail.com>
	<C06B6277-1CFA-4B8D-8E80-D08910EBD77C@gmail.com>
	<CAOXM=CXSBG+h1DYoOBrFHYr_JXZX6YCFyo415HRGHozMcmHicg@mail.gmail.com>
	<EC4577AD-A3C5-45CD-ADE7-5D19ED089833@gmail.com>
	<CAOXM=CVby_i3vfF_H46RTZWoqBLMCBGNqJoywxG_2KQJPc5Ttw@mail.gmail.com>
	<7440971C-8A18-4A07-91B6-AD16D17F0766@gmail.com>
	<CAOXM=CXUBcvbqZdT2nviYnFyC3X4Xbj3oREVt650n0c01HESyg@mail.gmail.com>
	<426E63C1-8C82-4809-B2A1-EF7A909E6712@gmail.com>
	<CAOXM=CUPj9EGj_LNEhBpOw0oCiQKoVw8JuPv8-WASo2paSXkug@mail.gmail.com>
	<CAOXM=CUxE-MSWsjZwMYeZ-aOu9Jc9bP6k1uzDLrEyfyqwR-B_Q@mail.gmail.com>
	<78A8137E-42B4-4766-9E4A-1B3C2F4FC578@gmail.com>
Message-ID: <ED8DB3BD-0981-4883-8CE0-E920BCEE0CC6@illinois.edu>

Emmanuel,

Look for anything that will help calculate basic assembly metrics, such as N50, NG50, L50, etc.; these almost always give overall assembly size, and total scaffolds/contigs.  For instance I've used this:

http://korflab.ucdavis.edu/datasets/Assemblathon/Assemblathon2/Basic_metrics/assemblathon_stats.pl

(it requires FAlite, which is here: http://korflab.ucdavis.edu/Unix_and_Perl/FAlite.pm )

The Broad also has GAEMR (http://software.broadinstitute.org/software/gaemr/ ), but I haven't tested it myself (I've heard it's a bit finicky).

Also, see this: https://www.biostars.org/p/237591/ , which has a few more options.
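
If none of those tools is at hand, the basic numbers can also be sketched in a few lines of Python. This is only a minimal illustration, not a replacement for the scripts above; the function names and the tiny example records are hypothetical.

```python
def sequence_lengths(fasta_lines):
    """Return {sequence_id: length} for FASTA text given as an iterable of lines."""
    lengths = {}
    current = None
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def n50(lengths):
    """Length of the sequence at which the running total reaches half the assembly."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def assembly_summary(fasta_lines):
    """Sequence count, total size in bp, and N50 for a FASTA file."""
    lengths = list(sequence_lengths(fasta_lines).values())
    return {"sequences": len(lengths), "total_bp": sum(lengths), "n50": n50(lengths)}


# Tiny made-up example; for a real assembly pass an open file handle instead,
# e.g. `with open("genome.fasta") as handle: assembly_summary(handle)`
example = [">contig_1", "ACGTACGTAC", ">contig_2", "ACGTACGT", ">contig_3", "AC"]
print(assembly_summary(example))
```

Note that the sequence count equals the number of chromosomes only if the assembly is chromosome-level; for a contig or scaffold assembly it is just the number of pieces.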

chris

From: maker-devel <maker-devel-bounces at yandell-lab.org> on behalf of Carson Holt <carsonhh at gmail.com>
Date: Friday, September 22, 2017 at 3:09 PM
To: Emmanuel Nnadi <eennadi at gmail.com>
Cc: "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
Subject: Re: [maker-devel] Maker not installing

MAKER can't give you those details. All MAKER does is try and identify gene models against the assembly you provide.

--Carson

On Sep 22, 2017, at 1:27 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:

Hello all,
Please how can I determine the following in maker:
1. The total number of chromosomes
2. The size of my genome


Thanks

Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Fri, Sep 1, 2017 at 10:52 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
Ok, thanks.
Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications


On Sep 1, 2017 10:50 PM, "Carson Holt" <carsonhh at gmail.com> wrote:
It would need to be a new run. You won't be able to use the updated contig names with the old run.
--Carson

Sent from my iPhone

On Sep 1, 2017, at 3:41 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
Hi carson
Thanks for the tip
perl -ane 's/1_S7_R1_001_\(paired\)_trimmed_\(paired\)_//g; print' genome.fasta

It worked well: when I ran it, it removed 1_S7_R1_001_(paired)_trimmed_(paired)_ from the names.

However, I have already run maker with 1_S7_R1_001_(paired)_trimmed_(paired)_ in the names.

1. How can I effect the change when maker has produced some files from the old sequence?

I have spent more than 24 hours running maker and it has produced some folders already.

How can I make this change?

Thanks


Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Fri, Sep 1, 2017 at 4:54 PM, Carson Holt <carsonhh at gmail.com> wrote:
BLAST, which is used by MAKER, cannot handle really long contig names. MAKER tries to get around this by adding a secondary tag to the fasta header when long names are detected. Even then it would be better to change the IDs of your contigs to avoid downstream failures.

I would recommend removing '1_S7_R1_001_(paired)_trimmed_(paired)_' from each contig name.

Example command to do that -->
perl -ane 's/1_S7_R1_001_\(paired\)_trimmed_\(paired\)_//g; print' genome.fasta
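
Note that the one-liner prints the renamed FASTA to standard output rather than editing genome.fasta in place, so the output has to be redirected to a new file. The same idea as a minimal Python sketch, with an added length check: the function name is hypothetical, and the limit of 50 characters is taken from the RepeatMasker error quoted further down in this thread, not from any MAKER setting.

```python
def shorten_fasta_ids(lines, prefix, max_id_len=50):
    """Yield FASTA lines with the first occurrence of `prefix` removed from each header.

    Raises ValueError if a sequence ID is still longer than max_id_len afterwards.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            line = ">" + line[1:].replace(prefix, "", 1)
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            if len(seq_id) > max_id_len:
                raise ValueError(
                    f"ID still longer than {max_id_len} characters: {seq_id}")
        yield line


# Example on two in-memory lines; for a real file, iterate over an open handle
# and write each yielded line (plus a newline) to a new FASTA file.
records = [">1_S7_R1_001_(paired)_trimmed_(paired)_contig_2\n", "ACGTACGT\n"]
for out_line in shorten_fasta_ids(records, "1_S7_R1_001_(paired)_trimmed_(paired)_"):
    print(out_line)
```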

--Carson


On Aug 30, 2017, at 3:54 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:

Hi Carson
Thanks for your response, it's been helpful.

Please bear with me as I work through this

1. Please how do I generate EST for my novel sequences?
2. I am currently running maker without EST and protein sequences. Is that wrong? Can it predict properly?
3. One error in the contig just returned this value
FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 )
 at /usr/local/bin/RepeatMasker line 1464.
FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 )
 at /usr/local/bin/RepeatMasker line 1464.
FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 )
 at /usr/local/bin/RepeatMasker line 1464.
ERROR: RepeatMasker failed
--> rank=NA, hostname=emmannaemekas-MacBook-Pro.local
ERROR: Failed while doing repeat masking
ERROR: Chunk failed at level:0, tier_type:1
FAILED CONTIG:1_S7_R1_001_(paired)_trimmed_(paired)_contig_2

ERROR: Chunk failed at level:2, tier_type:0
FAILED CONTIG:1_S7_R1_001_(paired)_trimmed_(paired)_contig_2

examining contents of the fasta file and run log


Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Wed, Aug 30, 2017 at 4:12 PM, Carson Holt <carsonhh at gmail.com> wrote:
You can query valid species names using the queryTaxonomyDatabase.pl script that comes with RepeatMasker. Try not to be too specific. In general you should use the genus rather than the species for example (or even use all of RepBase).

Example -->
perl .../RepeatMasker/util/queryTaxonomyDatabase.pl -species "drosophila"

--Carson


On Aug 30, 2017, at 9:05 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:

Hi Carson,

 Thanks
I was able to start using maker.

However, I am working with a novel plant genome. I had set the repeat masking to
Dcotrep, a name from the RepBase release, but maker reported it as not known to RepeatMasker.

How can I use specific known genomes for repeat masking?
Thanks
Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications


On Aug 29, 2017 4:26 PM, "Carson Holt" <carsonhh at gmail.com> wrote:
MAKER will read the genome= option from the maker_opts.ctl file in your current directory or the maker_opts.ctl you specified on the command line. The error means you have left the value empty. Perhaps you did not save the changes you made or you did not specify the location of the maker_opts.ctl file to use.

You can check the contents of the file using cat. Example --> cat maker_opts.ctl
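
The same check can be sketched in Python, assuming the control file uses the `key=value #comment` layout shown in the quoted maker_opts.ctl below. This is a simplified reader for illustration, not MAKER's own parser, and the function names are hypothetical.

```python
def read_ctl(text):
    """Parse control-file text into {option: value}, dropping '#' comments."""
    options = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" in line:
            key, value = line.split("=", 1)
            options[key.strip()] = value.strip()
    return options


def empty_required(options, required=("genome",)):
    """Return the required option names that are missing or left blank."""
    return [name for name in required if not options.get(name)]


# Example with a blank genome= line, as in the reported error
sample = (
    "#-----Genome (these are always required)\n"
    "genome= #genome sequence (fasta file or fasta embeded in GFF3 file)\n"
    "organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic\n"
)
print(empty_required(read_ctl(sample)))
```

Running it from the same directory in which maker is launched, on the text of maker_opts.ctl, shows whether the edited value was actually saved to the file MAKER will read.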

--Carson


On Aug 29, 2017, at 5:11 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:

Hi Carson,
Thanks a lot for yesterday. I was able to resolve the issue of running maker and i followed the commands in the tutorial.
I however encountered another problem

when I ran the command nano -c maker_opts.ctl

It gave the following 1_S7_assembly.fa I specified the name of the genome but when I ran maker in another tab it gave

#-----Genome (these are always required)
genome=1_S7_assembly.fa #genome sequence (fasta file or fasta embeded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #MAKER derived GFF3 file
est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no

#-----EST Evidence (for best results provide a file for at least one)
est= #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closly relate species in GFF3 format

#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=  #protein sequence file in fasta format (i.e. from mutiple oransisms)
protein_gff=  #aligned protein homology evidence from an external GFF3 file

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=all #select a model organism for RepBase masking in RepeatMasker
rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein=/Users/emmannaemeka/Desktop/Gpm/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)


I ran maker command on another tab and it returned the following
STATUS: Parsing control files...
ERROR: You have failed to provide a value for 'genome' in the control files.

--> rank=NA, hostname=emmannamekasMBP


Questions
1. Specifying the genome location, do I need to run maker on the same tab or open another bash tab?
2. My genome is novel and do not have proteins, how do I generate protein fast for the de novo sequence and EST?


Thanks

Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Mon, Aug 28, 2017 at 6:47 PM, Carson Holt <carsonhh at gmail.com> wrote:
Here is a class on how to use MAKER taught a couple of years back --> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014

There is also a linked video as well as an amazon image of the class material where you can run the image in the cloud and follow along.

Thanks,
Carson


On Aug 28, 2017, at 11:43 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:

Hi Carson,
Thanks a lot

I ran this command maker -h it returned the following

The last thing I wish to ask you: how can I load my genome file and begin annotation?

Thanks

emmannamekasMBP:maker emmannaemeka$ maker -h

MAKER version 2.31.9

Usage:

     maker [options] <maker_opts> <maker_bopts> <maker_exe>


Description:

     MAKER is a program that produces gene annotations in GFF3 format using
     evidence such as EST alignments and protein homology. MAKER can be used to
     produce gene annotations for new genomes as well as update annotations
     from existing genome databases.

     The three input arguments are control files that specify how MAKER should
     behave. All options for MAKER should be set in the control files, but a
     few can also be set on the command line. Command line options provide a
     convenient machanism to override commonly altered control file values.
     MAKER will automatically search for the control files in the current
     working directory if they are not specified on the command line.

     Input files listed in the control options files must be in fasta format
     unless otherwise specified. Please see MAKER documentation to learn more
     about control file  configuration.  MAKER will automatically try and
     locate the user control files in the current working directory if these
     arguments are not supplied when initializing MAKER.

     It is important to note that MAKER does not try and recalculated data that
     it has already calculated.  For example, if you run an analysis twice on
     the same dataset you will notice that MAKER does not rerun any of the
     BLAST analyses, but instead uses the blast analyses stored from the
     previous run. To force MAKER to rerun all analyses, use the -f flag.

     MAKER also supports parallelization via MPI on computer clusters. Just
     launch MAKER via mpiexec (i.e. mpiexec -n 40 maker). MPI support must be
     configured during the MAKER installation process for this to work though


Options:

     -genome|g <file>    Overrides the genome file path in the control files

     -RM_off|R           Turns all repeat masking options off.

     -datastore/         Forcably turn on/off MAKER's two deep directory
      nodatastore        structure for output.  Always on by default.

     -old_struct         Use the old directory styles (MAKER 2.26 and lower)

     -base    <string>   Set the base name MAKER uses to save output files.
                         MAKER uses the input genome file name by default.

     -tries|t <integer>  Run contigs up to the specified number of tries.

     -cpus|c  <integer>  Tells how many cpus to use for BLAST analysis.
                         Note: this is for BLAST and not for MPI!

     -force|f            Forces MAKER to delete old files before running again.
                         This will require all blast analyses to be rerun.

     -again|a            recaculate all annotations and output files even if no
                         settings have changed. Does not delete old analyses.

     -quiet|q            Regular quiet. Only a handlful of status messages.

     -qq                 Even more quiet. There are no status messages.

     -dsindex            Quickly generate datastore index file. Note that this
                         will not check if run settings have changed on contigs

     -nolock             Turn off file locks. May be usful on some file systems,
                         but can cause race conditions if running in parallel.

     -TMP                Specify temporary directory to use.

     -CTL                Generate empty control files in the current directory.

     -OPTS               Generates just the maker_opts.ctl file.

     -BOPTS              Generates just the maker_bopts.ctl file.

     -EXE                Generates just the maker_exe.ctl file.

     -MWAS    <option>   Easy way to control mwas_server for web-based GUI

                              options:  STOP
                                        START
                                        RESTART

     -version            Prints the MAKER version.

     -help|?             Prints this usage statement.


Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Mon, Aug 28, 2017 at 6:36 PM, Carson Holt <carsonhh at gmail.com> wrote:
Path needs to be a list of directories to search (you specified an executable location).

So not this -> /Users/emmannaemeka/Desktop/Gpm/maker/bin/maker

Instead it needs to be this -> /Users/emmannaemeka/Desktop/Gpm/maker/bin
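
A minimal sketch of that fix, assuming a bash-like shell. The directory is the one quoted in this thread, so substitute your own install location; the variable name MAKER_BIN is only illustrative:

```shell
# Directory that contains the maker executable (path taken from this thread;
# adjust it to wherever you ran ./build install).
MAKER_BIN="/Users/emmannaemeka/Desktop/Gpm/maker/bin"

# Append the directory itself, not the executable, to the search path.
export PATH="$PATH:$MAKER_BIN"

# Check whether the shell can now locate maker.
command -v maker || echo "maker still not found; check the directory above"
```

The export only lasts for the current shell session. To keep it, the same export line would typically go in a shell startup file such as ~/.bash_profile (an assumption based on the bash prompt shown earlier in the thread).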

--Carson


On Aug 28, 2017, at 11:32 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:

Thanks

I tried to export PATH

Running
echo $PATH in the maker directory returned

/usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:/Users/emmannaemeka/Desktop/Gpm/maker/bin/maker


1. Does it mean that PATH has been exported?


Secondly,

I tried to run
the commands maker -h, which maker, and maker -CTL.

Nothing was returned.

2. How do I start up maker?
3. Do I need to be in the maker directory to start maker?

Thanks


Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Mon, Aug 28, 2017 at 4:49 PM, Carson Holt <carsonhh at gmail.com> wrote:
After install the executables will be in the .../maker/bin directory. Example (if you did the install in your home directory) -> ~/maker/bin/maker

You need to add the .../maker/bin directory to your PATH for it to be found just by typing 'maker'

Explanation of the Linux PATH -> http://www.linfo.org/path_env_var.html

--Carson


On Aug 28, 2017, at 8:07 AM, Ence,daniel <d.ence at ufl.edu> wrote:

Sorry I should have typed "maker -CTL". If that doesn't work, what is the result of "which maker"?


On Aug 28, 2017, at 10:00 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:

Hi Daniel
The reply is
emmannamekasMBP:maker emmannaemeka$ MAKER -ctl
-bash: MAKER: command not found

Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Mon, Aug 28, 2017 at 2:57 PM, Ence,daniel <d.ence at ufl.edu> wrote:
Hi, It looks like MAKER installed ok. What is the command that you used to try to run MAKER? Can you show the result of running "MAKER -ctl"?

Thanks,
Daniel Ence


On Aug 28, 2017, at 9:24 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:

Hi Ence,
Thanks for your reply,

This is the step and error received

emmannamekasMBP:src emmannaemeka$ ./build install

Installing MAKER...

Building MAKER

Skip /Users/emmannaemeka/desktop/Gpm/maker/src/../perl/config-darwin-thread-multi-2level-5.018002 (unchanged)


The build status is


=============================================================================

STATUS MAKER v2.31.9

==============================================================================

PERL Dependencies:  VERIFIED

External Programs:  VERIFIED

External C Libraries:   VERIFIED

MPI SUPPORT:        DISABLED

MWAS Web Interface: DISABLED

MAKER PACKAGE:      CONFIGURATION OK

Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Mon, Aug 28, 2017 at 2:00 PM, Ence,daniel <d.ence at ufl.edu> wrote:
Hi Emmanuel, In order for anyone to help you, you need to post to the mailing list the command and output (including errors) of the step that didn't work.

Thanks,
Daniel Ence


On Aug 27, 2017, at 10:16 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:

Hello all,

I downloaded MAKER and tried to install it. I succeeded in installing all prerequisites, and running ./build install showed that MAKER installed.

However, when I try to run maker, it won't run.

Please, how do I install MAKER to run on a local computer?

Thanks
Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications


_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org




From tfallon at mit.edu  Tue Sep 26 11:40:21 2017
From: tfallon at mit.edu (Tim Fallon)
Date: Tue, 26 Sep 2017 13:40:21 -0400
Subject: [maker-devel] MAKER changelog?
Message-ID: <DF018D85-759C-4C02-8D97-C45AADACF9A0@mit.edu>

Hi there,

I recently noticed the MAKER 3.0 beta version incremented from 3.0.0 to 3.01.1. Is there a changelog which describes the updates between the two releases?

All the best,
-Tim

Timothy R. Fallon
PhD candidate
Laboratory of Jing-Ke Weng
Department of Biology
MIT

tfallon at mit.edu


From carsonhh at gmail.com  Tue Sep 26 12:34:16 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 26 Sep 2017 12:34:16 -0600
Subject: [maker-devel] MAKER changelog?
In-Reply-To: <DF018D85-759C-4C02-8D97-C45AADACF9A0@mit.edu>
References: <DF018D85-759C-4C02-8D97-C45AADACF9A0@mit.edu>
Message-ID: <C32D3C31-125B-4D3D-8E0B-CD4ED629E541@gmail.com>

Here you go.

*updated the locations for repbase and augustus
*make library install more portable for newer perl versions
*fix for cdna2genome single exon strand
*updates for better hints in augustus (exact rather than partial intron match)
*added allow_overlap for UTR in fungi and prokaryotes
*uri escape snap name in zff conversion
*fix for BioPerl-live related error (also submitted fix to BioPerl)
*jaccard cluster and bug fixes for cigar string
*Added zff2genebank script for training augustus (adapted from Jason Stajich's zff2augustus_gbk.pl)

--Carson



From qwzhang0601 at gmail.com  Wed Sep 27 08:30:28 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 27 Sep 2017 10:30:28 -0400
Subject: [maker-devel] Suggestions if too many predicted genes
Message-ID: <CAOW6FSLpSEfgyutvK=3oTn=71GsX=TUGXvM4mMbeZj2sV+TuuQ@mail.gmail.com>

Hello:

Thank you for all your previous comments and suggestions. We annotated a
new rodent species using the MAKER2 pipeline. The assembly is about 3.2Gb
with an N50 of 24.3Mb. I included all scaffolds longer than 300bp for gene
annotation (about 250k scaffolds).

For repeat masking, we also built a species-specific library. We used both
transcriptome and protein sequences as evidence (including 10k reviewed
mammalian and 340k predicted rodent protein sequences from UniProt). We
predicted 28,800 genes with AED<1 (the "default" gene set).

Of the 28,800 predicted proteins, about 90% have an AED value less than 0.5,
and 74% have domains according to InterProScan. It seems the genome was well
annotated, but I still feel 28,800 protein-coding genes are too many for a
rodent species. Do you think this gene set is good for downstream analysis
(e.g., gene family expansion analysis, positive selection analysis)? Or can
I do further filtering to bring the number of genes closer to the estimated
number (e.g., 22,000)?

Thanks

Best
Quanwei

From dandence at gmail.com  Wed Sep 27 08:54:30 2017
From: dandence at gmail.com (Daniel Ence)
Date: Wed, 27 Sep 2017 10:54:30 -0400
Subject: [maker-devel] Suggestions if too many predicted genes
In-Reply-To: <CAOW6FSLpSEfgyutvK=3oTn=71GsX=TUGXvM4mMbeZj2sV+TuuQ@mail.gmail.com>
References: <CAOW6FSLpSEfgyutvK=3oTn=71GsX=TUGXvM4mMbeZj2sV+TuuQ@mail.gmail.com>
Message-ID: <42F0CCB8-20DB-4E6F-85CB-CBE939F8208E@gmail.com>

Hi Quanwei, I think that your genome assembly probably contains many contigs that are too small to contain full gene sequences. Rather than 300bp, a minimum scaffold length of 5kbp or 10kbp is a more useful threshold. This is mentioned in the maker_opts.ctl file with the min_contig parameter: "skip genome contigs below this length (under 10kbp are often useless)".

I don't know how many genes are annotated on small (<10kbp) scaffolds and contigs, but excluding those contigs would probably reduce your gene count. The genes on them may be fragments or duplicates of genes that weren't assembled properly.
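
One way to gauge the effect before rerunning is to count how many sequences fall below a candidate cutoff. The following is a rough sketch, assuming a plain (possibly multi-line) FASTA file and a POSIX awk; count_short is a hypothetical helper name, not part of MAKER:

```shell
# Hypothetical helper: count FASTA records shorter than a length cutoff.
# usage: count_short GENOME.fasta MIN_LENGTH
count_short() {
  awk -v min="$2" '
    # On each header, score the record that just ended, then reset.
    /^>/ { if (seen && len < min) n++; len = 0; seen = 1; next }
    # Sequence lines: accumulate residues (records may span many lines).
    { len += length($0) }
    # Score the final record and print the total (0 if none were short).
    END { if (seen && len < min) n++; print n + 0 }
  ' "$1"
}

# Example (file name is illustrative):
#   count_short genome.fasta 10000
```

If the count is large, setting min_contig=10000 in maker_opts.ctl (the option quoted above) would make MAKER skip those sequences.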

Also, using predicted protein sequences from UniProt as evidence in your annotation is probably not advisable, since those sequences are not from genes with experimental evidence. This is the TrEMBL vs Swiss-Prot issue that you asked about earlier.

Additionally requiring a minimum protein length as you asked about earlier could also reduce the gene count. 

Ultimately, you may do whatever filtering you find necessary and justifiable for your annotation depending on the biology of your organism and the methods that generated your assembly, and your annotation. 

Hope this helps, 
Daniel


From michael.s.campbell1 at gmail.com  Wed Sep 27 09:34:11 2017
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Wed, 27 Sep 2017 11:34:11 -0400
Subject: [maker-devel] Suggestions if too many predicted genes
In-Reply-To: <42F0CCB8-20DB-4E6F-85CB-CBE939F8208E@gmail.com>
References: <CAOW6FSLpSEfgyutvK=3oTn=71GsX=TUGXvM4mMbeZj2sV+TuuQ@mail.gmail.com>
	<42F0CCB8-20DB-4E6F-85CB-CBE939F8208E@gmail.com>
Message-ID: <546F5254-0C46-4BB6-93E8-80B5EC7E40D1@gmail.com>

Hi Quanwei,

The first thing that comes to mind with too many genes is undermasked repeats. You could check the Pfam domains for things like integrase, GAG proteins, and other transposon-related domains. I would also look a bit closer at the genes with AEDs greater than 0.5. Looking at things like average number of exons per transcript and average gene and transcript lengths can help pick out dodgy genes. You could also do some filtering on the QI values output by MAKER. It is defensible to create a "higher quality" set by limiting it to genes with AEDs less than 0.5 and putting some requirement on the fraction of splice sites confirmed by EST/mRNA-seq alignments.
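
The AED cutoff can be sketched with awk. This assumes MAKER-style GFF3 in which each mRNA row carries an _AED= attribute in column 9; aed_below is a hypothetical helper name, and the QI-based splice-site requirement is not implemented here:

```shell
# Hypothetical helper: print IDs of mRNA features whose _AED is below a cutoff.
# usage: aed_below ANNOTATION.gff MAX_AED
aed_below() {
  awk -F'\t' -v max="$2" '
    $3 == "mRNA" {
      id = ""; aed = ""
      # Column 9 holds semicolon-separated key=value attributes.
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++) {
        if (attrs[i] ~ /^ID=/)   id  = substr(attrs[i], 4)
        if (attrs[i] ~ /^_AED=/) aed = substr(attrs[i], 6)
      }
      # Skip rows without an _AED attribute; compare numerically.
      if (aed != "" && aed + 0 < max + 0) print id
    }
  ' "$1"
}

# Example (file name is illustrative):
#   aed_below genome.all.gff 0.5 > high_quality_mrna_ids.txt
```

The resulting ID list can then be used to subset the GFF3 and the protein/transcript FASTA files with whatever tool you prefer.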

Take care,
Mike

From xvazquezc at gmail.com  Wed Sep 27 18:32:30 2017
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Thu, 28 Sep 2017 10:32:30 +1000
Subject: [maker-devel] Suggestions if too many predicted genes
In-Reply-To: <546F5254-0C46-4BB6-93E8-80B5EC7E40D1@gmail.com>
References: <CAOW6FSLpSEfgyutvK=3oTn=71GsX=TUGXvM4mMbeZj2sV+TuuQ@mail.gmail.com>
	<42F0CCB8-20DB-4E6F-85CB-CBE939F8208E@gmail.com>
	<546F5254-0C46-4BB6-93E8-80B5EC7E40D1@gmail.com>
Message-ID: <CAL0hg4EXwhZBSBE23X4auu8UW0hHW+=Vghwt8dMttfSbgriF=w@mail.gmail.com>

Hi Quanwei,
Following Michael's comment, even if you use Swiss-Prot, there are over 2700
transposases in it. If there is some undermasking, they will show up as
evidence.
Cheers,
Xabi

-- 
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From qwzhang0601 at gmail.com  Wed Sep 27 20:04:43 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 27 Sep 2017 22:04:43 -0400
Subject: [maker-devel] Suggestions if too many predicted genes
In-Reply-To: <CAL0hg4EXwhZBSBE23X4auu8UW0hHW+=Vghwt8dMttfSbgriF=w@mail.gmail.com>
References: <CAOW6FSLpSEfgyutvK=3oTn=71GsX=TUGXvM4mMbeZj2sV+TuuQ@mail.gmail.com>
	<42F0CCB8-20DB-4E6F-85CB-CBE939F8208E@gmail.com>
	<546F5254-0C46-4BB6-93E8-80B5EC7E40D1@gmail.com>
	<CAL0hg4EXwhZBSBE23X4auu8UW0hHW+=Vghwt8dMttfSbgriF=w@mail.gmail.com>
Message-ID: <CAOW6FSJPZBiriKh9L5knuGp_ZCSEVxw4+eftyddk+o3kFwTTCw@mail.gmail.com>

Thank you all for your comments and suggestions. Yes, even when I use only
Swiss-Prot I still get 26.5k protein-coding genes. As you mentioned, one
reason may be related to repeat masking, and another may be the inclusion
of short scaffolds, which leads to protein fragments.

About the repeat masking, I used the latest RepeatMasker and Repbase
(selected Mammalian), and I also built species-specific repeat libraries
following
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic.
About transposases, I know the MAKER pipeline already provides
"transposable element proteins". I do not know what else I can do.

About the short scaffolds, in fact only about 400 of the 26.5k genes are
predicted from scaffolds shorter than 10kb. Besides, I know there are some
very short proteins (e.g., the mouse protein RL41 (60S ribosomal protein)
has length 25). I think short scaffolds may also include some short
proteins.

Now I plan to start from the 26.5k protein-coding genes. I think the less
reliable ones will be filtered out in downstream analysis. For example,
when we construct the gene families, fragments or falsely predicted
proteins will be more likely to be excluded from gene families.

Thank you all for your suggestions.

Best
Quanwei



From qwzhang0601 at gmail.com  Thu Sep 28 06:05:19 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Thu, 28 Sep 2017 08:05:19 -0400
Subject: [maker-devel] gene annotation for a better genome
Message-ID: <CAOW6FSKn__0Gcco+7V+rtv9of4hbTp=LE4J+V51LPFZ1gnm7OQ@mail.gmail.com>

Hello:

Recently, we got a new version of the NMR genome, which had previously been
assembled and annotated a few years ago. We can download that gene
annotation from NCBI.

Now we want to annotate the new genome using the MAKER2 pipeline. I wonder
how I can make full use of the existing annotations. On the other hand,
since the previous genome was not very well assembled, some gene annotations
may be false positives. I hope those false-positive genes in the previous
annotation won't mislead MAKER2 during the current gene annotation.

Do you have any suggestions? Thanks

Best
Quanwei

From carsonhh at gmail.com  Fri Sep 29 10:36:09 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 29 Sep 2017 10:36:09 -0600
Subject: [maker-devel] gene annotation for a better genome
In-Reply-To: <CAOW6FSKn__0Gcco+7V+rtv9of4hbTp=LE4J+V51LPFZ1gnm7OQ@mail.gmail.com>
References: <CAOW6FSKn__0Gcco+7V+rtv9of4hbTp=LE4J+V51LPFZ1gnm7OQ@mail.gmail.com>
Message-ID: <5AFEDD05-DF02-463F-A6EE-1619A9BB968D@gmail.com>

You can try using the est2genome=1 option to map the old models forward onto the new assembly as if they were ESTs (add a line that says est_forward=1 to the control file to maintain old naming, and set est= to the old model transcript file). Then provide the final models as a pred_gff for a subsequent run (i.e. a traditional MAKER run where you are annotating the new assembly with transcript and protein evidence and ab initio predictors). Don't supply the old models to est= on that run.

The idea behind doing it this way is:
1. You need to get old models onto the new assembly so coordinates will change. So by doing it this way, you will at least be able to move many models forward based on homology.
2. By providing the models to pred_gff on a subsequent MAKER run, you are just letting old models compete against new annotations. They will be rejected if they have no evidence support, or can be kept if they score better than alternate models from SNAP/Augustus. That way you have the chance to integrate old models while at the same time rejecting some old models that have no evidence overlap.
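
The two-run setup described above can be sketched as a small Python helper that edits a MAKER control file. This is only an illustration under stated assumptions: the file names are hypothetical, the helper set_ctl_options is not part of MAKER, and setting est2genome=0 on the second run is an assumption rather than something stated in the reply. Only the option names est, est2genome, est_forward and pred_gff come from the message.

```python
import re

def set_ctl_options(ctl_text, options):
    """Return ctl_text with every key in `options` set to the given value.

    Existing `key=value #comment` lines are rewritten in place and keep their
    trailing comment; keys that are absent (e.g. est_forward, which the reply
    says to add by hand) are appended at the end.
    """
    remaining = dict(options)
    out = []
    for line in ctl_text.splitlines():
        m = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if m and m.group(1) in remaining:
            key = m.group(1)
            new_line = f"{key}={remaining.pop(key)}"
            if m.group(3):
                new_line += " " + m.group(3)
            out.append(new_line)
        else:
            out.append(line)
    for key, value in remaining.items():
        out.append(f"{key}={value}")
    return "\n".join(out) + "\n"

# Run 1: map the old models forward as if they were ESTs (hypothetical file name).
RUN1 = {"est": "old_models.transcripts.fasta", "est2genome": 1, "est_forward": 1}

# Run 2: ordinary annotation run; old models compete via pred_gff only.
# est points at new transcript evidence, not the old models (hypothetical names);
# est2genome=0 here is an assumption.
RUN2 = {"est": "new_transcripts.fasta", "est2genome": 0,
        "pred_gff": "run1.forwarded_models.gff"}
```

A possible use is to read maker_opts.ctl, call set_ctl_options(text, RUN1), write the result to the first run's directory, and repeat with RUN2 for the second run.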

--Carson


> On Sep 28, 2017, at 6:05 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Hello:
> 
> Recently, we got a new version of NMR genome, whose genome had been assembled and annotated a few years ago. We can download the gene annotation from NCBI. 
> 
> Now we want to annotate the new genome using Maker2 pipeline. I wonder how can I fully make use of existing annotations. On the other hand, since the previous genome is not very well assemblies, some genes annotation maybe false positives. I hope those false positive genes in previous annotation won't mislead Maker2 for current gene annotation.
> 
> Do you have any suggestions. Thanks
> 
> Best
> Quanwei  
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org


From willett4 at email.unc.edu  Fri Sep 29 11:20:46 2017
From: willett4 at email.unc.edu (Willett, Christopher S)
Date: Fri, 29 Sep 2017 17:20:46 +0000
Subject: [maker-devel] question on gene numbers with quality_filter.pl
Message-ID: <16C1890A-2042-4BE1-93CE-8A8DC0C18151@ad.unc.edu>

Hello-

We are getting to the final stages (hopefully) of a reannotation of a new assembly of a copepod genome using MAKER and we had some questions about which set of genes to use.  Our latest runs were using Pfam domains to define default vs standard set using the quality_filter.pl script and I had a question about stringency of the filters for this script. It appears that the default is more stringent than the output that we get from MAKER without using this script (all with AED max set to 1). Are there additional filters in this script beyond AED that would cause this?

Here is what we are seeing, in case more details are helpful. Our final MAKER run gives ~21500 predicted genes with keep_pred turned on and ~15200 without it. What I was wondering is why this 15200 is higher than the default set (~14500 genes) that we get after filtering the gff with the -d setting in quality_filter.pl. For completeness, the standard set (-s setting) retains ~14800 genes, and filtering the 15200-gene gff file with the default parameters yields ~14100 genes. So I was curious what else in the filter script, beyond AED, would trim out genes.

The gene sets look pretty good overall and the numbers seem reasonable, so we were debating which set to use as our final set. I am also trying a few other analyses in InterProScan to see if that identifies additional genes beyond Pfam for retention, but that seems fairly independent of the question above.

Thanks for your help,

Best,

Chris Willett


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Research Associate Professor
Department of Biology
CB#3280 Coker Hall
University of North Carolina, Chapel Hill
Chapel Hill, NC, 27599-3280

Office: 2252 Genome Science Building
phone: 919-843-8663
fax: 919-962-1625


http://labs.bio.unc.edu/Willett/


From willett4 at email.unc.edu  Fri Sep  1 09:22:34 2017
From: willett4 at email.unc.edu (Willett, Christopher S)
Date: Fri, 1 Sep 2017 15:22:34 +0000
Subject: [maker-devel] ERROR: Can't call method "start" on an undefined
 value at ../lib/maker/join.pm line 535
Message-ID: <FC5A69D8-3FE8-45F7-B902-2847E8DC802A@ad.unc.edu>

Hi Everyone-

I was wondering if anyone had any ideas about this error that I am seeing for some of our contigs that is causing them to fail but not others. Here is the error message:

"Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535."

This is either coming up in the augustus or snap portions of the analysis. See below for errors from 4 different contigs compiled. They are from two different runs where I was trying to exclude a couple different parts in each to see if I could isolate the problem. It seems like slightly different errors but the same contigs are failing in both. 

We recently improved our genome and are trying to do a reannotation on it. I am using a very similar MAKER setup that I used previously with an earlier version of the genome and did not have any contigs fail in that case but now 8 of the 12 large contigs are failing (sizes ~18Mb). The MAKER version is 2.31.8.

If anyone has any thoughts on what the issue might be and how I could resolve it I would appreciate it (or if I should provide more information).

Thanks,

Best,

Chris Willett


error 48600

#--------- command -------------#
Widget::augustus:
/nas02/apps/maker-2.31.8/src/augustus-3.2.2/bin/augustus --species=Tigriopus_californicus --strand=forward --UTR=off --hintsfile=/tmp/maker_H_uEHb/0/19_0.1142953-1147539.Tigriopus_californicus.auto_annotator.xdef.augustus --extrinsicCfgFile=/nas02/apps/maker-2.31.8/src/augustus-3.2.2/config/extrinsic/extrinsic.MPE.cfg /tmp/maker_H_uEHb/0/19_0.1142953-1147539.Tigriopus_californicus.auto_annotator.augustus.fasta > /tmp/maker_H_uEHb/0/19_0.1142953-1147539.Tigriopus_californicus.auto_annotator.augustus
#-------------------------------#
deleted:0 genes
 ...processing 0 of 5
 ...processing 1 of 5
 ...processing 2 of 5
 ...processing 3 of 5
 ...processing 4 of 5
Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
--> rank=NA, hostname=c-195-51.kd.unc.edu
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Chromosome_3

ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Chromosome_3

error 48599

Widget::augustus:
/nas02/apps/maker-2.31.8/src/augustus-3.2.2/bin/augustus --species=Tigriopus_californicus --strand=forward --UTR=off --hintsfile=/tmp/maker_TQWBu_/0/49_0.7061374-7062552.Tigriopus_californicus.auto_annotator.xdef.augustus --extrinsicCfgFile=/nas02/apps/maker-2.31.8/src/augustus-3.2.2/config/extrinsic/extrinsic.MPE.cfg /tmp/maker_TQWBu_/0/49_0.7061374-7062552.Tigriopus_californicus.auto_annotator.augustus.fasta > /tmp/maker_TQWBu_/0/49_0.7061374-7062552.Tigriopus_californicus.auto_annotator.augustus
#-------------------------------#
deleted:0 genes
 ...processing 0 of 10
 ...processing 1 of 10
 ...processing 2 of 10
 ...processing 3 of 10
 ...processing 4 of 10
 ...processing 5 of 10
 ...processing 6 of 10
 ...processing 7 of 10
 ...processing 8 of 10
 ...processing 9 of 10
Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
--> rank=NA, hostname=c-195-51.kd.unc.edu
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Chromosome_11

ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Chromosome_11

error 48592

#--------- command -------------#
Widget::snap:
/nas02/apps/maker-2.31.8/src/maker/exe/snap/snap -plus /proj/willetlb/users/cwillett/MAKER_analyses/SD_full/PB11_12_eukscafs_sp3_sum_hm1.hmm -xdef /tmp/maker_WADb1j/0/38_0.1656654-1673307.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.xdef.snap  /tmp/maker_WADb1j/0/38_0.1656654-1673307.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap.fasta > /tmp/maker_WADb1j/0/38_0.1656654-1673307.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap
#-------------------------------#
scoring....decoding.10.20.30.40.50.60.70.80.90.100 done
deleted:0 genes
 ...processing 0 of 5
 ...processing 1 of 5
 ...processing 2 of 5
 ...processing 3 of 5
 ...processing 4 of 5
Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
--> rank=NA, hostname=c-193-25.kd.unc.edu
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Chromosome_5

ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Chromosome_5

error 47069

#--------- command -------------#
Widget::snap:
/nas02/apps/maker-2.31.8/src/maker/exe/snap/snap -plus /proj/willetlb/users/cwillett/MAKER_analyses/SD_full/PB11_12_eukscafs_sp3_sum_hm1.hmm -xdef /tmp/maker__PSNX1/0/53_0.6154539-6169567.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.xdef.snap  /tmp/maker__PSNX1/0/53_0.6154539-6169567.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap.fasta > /tmp/maker__PSNX1/0/53_0.6154539-6169567.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap
#-------------------------------#
scoring....decoding.10.20.30.40.50.60.70.80.90.100 done
deleted:0 genes
 ...processing 0 of 10
 ...processing 1 of 10
 ...processing 2 of 10
 ...processing 3 of 10
 ...processing 4 of 10
 ...processing 5 of 10
 ...processing 6 of 10
 ...processing 7 of 10
 ...processing 8 of 10
 ...processing 9 of 10
Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
--> rank=NA, hostname=c-183-35.kd.unc.edu
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Chromosome_12

ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Chromosome_12
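
To see at a glance which contigs fail and at which step, a log like the one above can be summarized with a short script. This is a minimal sketch that assumes the log uses exactly the "ERROR: Failed while ..." and "FAILED CONTIG:" line shapes shown in this message; the function name failed_contigs is made up for illustration and is not part of MAKER.

```python
import re
from collections import defaultdict

def failed_contigs(log_text):
    """Map each failed contig to the set of 'Failed while ...' messages
    that most recently preceded its 'FAILED CONTIG:' line."""
    failures = defaultdict(set)
    last_error = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.startswith("ERROR: Failed while"):
            # Remember the step that failed, e.g. "Failed while annotating transcripts".
            last_error = line[len("ERROR: "):]
        m = re.match(r"FAILED CONTIG:\s*(\S+)", line)
        if m:
            failures[m.group(1)].add(last_error or "unknown step")
    return dict(failures)
```

Running this over the combined stderr of both runs would show whether the same contigs always fail at the same step, regardless of which predictor ran last.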


Can't call method "start" on an undefined value at ../lib/maker/join.pm line 535.
 

From chzelin at gmail.com  Tue Sep  5 07:59:09 2017
From: chzelin at gmail.com (zl c)
Date: Tue, 5 Sep 2017 09:59:09 -0400
Subject: [maker-devel] MSG: Can't get HSPs: data not collected.
Message-ID: <CAO_vRvZyMrfhL=nN79xnJeoXwFbEqTPwF8siioHJ_DFQTUpL2Q@mail.gmail.com>

Hello,

MAKER runs successfully for most of my sequences but fails on some long ones.
The error is:

Widget::tblastx:
/usr/local/apps/blast/ncbi-blast-2.5.0+/bin/tblastx -db db.778415-832259.for_tblastx.fasta -query ...778415.832259.0 -num_alignments 10000 -num_descriptions 10000 -evalue 1e-10 -dbsize 1000 -searchsp 500000000 -num_threads 16 -lcase_masking -seg yes -soft_masking true -show_gis -out   OUT.tblastx
#-------------------------------#

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Can't get HSPs: data not collected.
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/local/Perl/5.18.2/lib/perl5/site_perl/5.18.2/Bio/Root/Root.pm:486
STACK: Bio::Search::Hit::PhatHit::Base::hsps /spin1/home/linux/chenz11/program/maker/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm:552
STACK: Widget::tblastx::keepers /spin1/home/linux/chenz11/program/maker/bin/../lib/Widget/tblastx.pm:192
STACK: Widget::tblastx::parse /spin1/home/linux/chenz11/program/maker/bin/../lib/Widget/tblastx.pm:133
STACK: GI::tblastx /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:3251
STACK: GI::tblastx /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:3260
STACK: GI::reblast_merged_hits /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:471
STACK: GI::merge_resolve_hits /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:291
STACK: Process::MpiChunk::_go /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:2320
STACK: Process::MpiChunk::run /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:340
STACK: Process::MpiChunk::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:356
STACK: Process::MpiTiers::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiTiers.pm:287
STACK: Process::MpiTiers::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiTiers.pm:287
STACK: /home/chenz11/program/maker/bin/maker:695
-----------------------------------------------------------
--> rank=NA, hostname=cn3544
--> rank=NA, hostname=cn3544
--> rank=NA, hostname=cn3544
--> rank=NA, hostname=cn3544
ERROR: Failed while collecting tblastx reports
ERROR: Chunk failed at level:5, tier_type:3
FAILED CONTIG:tig00011625_arrow

ERROR: Chunk failed at level:4, tier_type:0
FAILED CONTIG:tig00011625_arrow

examining contents of the fasta file and run log

I've read a related thread on the Google group and checked my tblastx
output. I found that the number of HSPs should be larger than 1,000,000, but
only 1,000,000 are output, which leaves some alignments with no HSPs. Is
there any setting that could solve the problem?

Thanks,
Zelin

--------------------------------------------
Zelin Chen [chzelin at gmail.com]


NIH/NHGRI
Building 50, Room 5531
50 SOUTH DR, MSC 8004
BETHESDA, MD 20892-8004

From qwzhang0601 at gmail.com  Tue Sep  5 14:24:34 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Tue, 5 Sep 2017 16:24:34 -0400
Subject: [maker-devel] Some errors reported by Maker2
Message-ID: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>

Hello:

We are doing genome annotation for a new rodent species. We finished
training the ab initio gene predictors successfully with the following
settings: split_hit=40000, max_dna_len=1000000, and 99k mammalian Swiss-Prot
protein sequences as evidence.

But when I used the trained models to do the genome annotation, I got the
following kinds of errors (shown in red). I used the same parameters as
those for training, except that I added 340k rodent TrEMBL protein
sequences as protein evidence (i.e., I use both the 99k mammalian Swiss-Prot
sequences and the 340k rodent TrEMBL sequences).

I am doing the annotation on a cluster and started multiple MAKER processes
in the same directory (I had tried to use MPI but ran into some problems).

Do you have any suggestions? Many thanks
#some kinds of errors
open3: fork failed: Cannot allocate memory at
/gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
--> rank=NA, hostname=n520
ERROR: Failed while doing blastx of proteins
ERROR: Chunk failed at level:8, tier_type:3
FAILED CONTIG:Contig2


setting up GFF3 output and fasta chunks
doing repeat masking
Can't kill a non-numeric process ID at
/gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
--> rank=NA, hostname=n513
ERROR: Failed while doing repeat masking
ERROR: Chunk failed at level:0, tier_type:1
FAILED CONTIG:Contig12378


Best
Quanwei

From carsonhh at gmail.com  Tue Sep  5 14:56:01 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 5 Sep 2017 14:56:01 -0600
Subject: [maker-devel] ERROR: Can't call method "start" on an undefined
 value at ../lib/maker/join.pm line 535
In-Reply-To: <FC5A69D8-3FE8-45F7-B902-2847E8DC802A@ad.unc.edu>
References: <FC5A69D8-3FE8-45F7-B902-2847E8DC802A@ad.unc.edu>
Message-ID: <7DCB519E-9AFA-4D10-8046-72DE99C5E4FF@gmail.com>

Did you use gff3 input to MAKER for any steps (example pred_gff or est_gff)?

--Carson

> On Sep 1, 2017, at 9:22 AM, Willett, Christopher S <willett4 at email.unc.edu> wrote:
> 
> Hi Everyone-
> 
> I was wondering if anyone had any ideas about this error that I am seeing for some of our contigs that is causing them to fail but not others. Here is the error message:
> 
> "Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535."
> 
> This is either coming up in the augustus or snap portions of the analysis. See below for errors from 4 different contigs compiled. They are from two different runs where I was trying to exclude a couple different parts in each to see if I could isolate the problem. It seems like slightly different errors but the same contigs are failing in both. 
> 
> We recently improved our genome and are trying to do a reannotation on it. I am using a very similar MAKER setup that I used previously with an earlier version of the genome and did not have any contigs fail in that case but now 8 of the 12 large contigs are failing (sizes ~18Mb). The MAKER version is 2.31.8.
> 
> If anyone has any thoughts on what the issue might be and how I could resolve it I would appreciate it (or if I should provide more information).
> 
> Thanks,
> 
> Best,
> 
> Chris Willett
> 
> 
> 
> error 48600
> 
> #--------- command -------------#
> Widget::augustus:
> /nas02/apps/maker-2.31.8/src/augustus-3.2.2/bin/augustus --species=Tigriopus_californicus --strand=forward --UTR=off --hintsfile=/tmp/maker_H_uEHb/0/19_0.1142953-1147539.Tigriopus_californicus.auto_annotator.xdef.augustus --extrinsicCfgFile=/nas02/apps/maker-2.31.8/src/augustus-3.2.2/config/extrinsic/extrinsic.MPE.cfg /tmp/maker_H_uEHb/0/19_0.1142953-1147539.Tigriopus_californicus.auto_annotator.augustus.fasta > /tmp/maker_H_uEHb/0/19_0.1142953-1147539.Tigriopus_californicus.auto_annotator.augustus
> #-------------------------------#
> deleted:0 genes
> ...processing 0 of 5
> ...processing 1 of 5
> ...processing 2 of 5
> ...processing 3 of 5
> ...processing 4 of 5
> Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
> --> rank=NA, hostname=c-195-51.kd.unc.edu
> ERROR: Failed while annotating transcripts
> ERROR: Chunk failed at level:1, tier_type:4
> FAILED CONTIG:Chromosome_3
> 
> ERROR: Chunk failed at level:6, tier_type:0
> FAILED CONTIG:Chromosome_3
> 
> error 48599
> 
> Widget::augustus:
> /nas02/apps/maker-2.31.8/src/augustus-3.2.2/bin/augustus --species=Tigriopus_californicus --strand=forward --UTR=off --hintsfile=/tmp/maker_TQWBu_/0/49_0.7061374-7062552.Tigriopus_californicus.auto_annotator.xdef.augustus --extrinsicCfgFile=/nas02/apps/maker-2.31.8/src/augustus-3.2.2/config/extrinsic/extrinsic.MPE.cfg /tmp/maker_TQWBu_/0/49_0.7061374-7062552.Tigriopus_californicus.auto_annotator.augustus.fasta > /tmp/maker_TQWBu_/0/49_0.7061374-7062552.Tigriopus_californicus.auto_annotator.augustus
> #-------------------------------#
> deleted:0 genes
> ...processing 0 of 10
> ...processing 1 of 10
> ...processing 2 of 10
> ...processing 3 of 10
> ...processing 4 of 10
> ...processing 5 of 10
> ...processing 6 of 10
> ...processing 7 of 10
> ...processing 8 of 10
> ...processing 9 of 10
> Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
> --> rank=NA, hostname=c-195-51.kd.unc.edu
> ERROR: Failed while annotating transcripts
> ERROR: Chunk failed at level:1, tier_type:4
> FAILED CONTIG:Chromosome_11
> 
> ERROR: Chunk failed at level:6, tier_type:0
> FAILED CONTIG:Chromosome_11
> 
> error 48592
> 
> #--------- command -------------#
> Widget::snap:
> /nas02/apps/maker-2.31.8/src/maker/exe/snap/snap -plus /proj/willetlb/users/cwillett/MAKER_analyses/SD_full/PB11_12_eukscafs_sp3_sum_hm1.hmm -xdef /tmp/maker_WADb1j/0/38_0.1656654-1673307.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.xdef.snap  /tmp/maker_WADb1j/0/38_0.1656654-1673307.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap.fasta > /tmp/maker_WADb1j/0/38_0.1656654-1673307.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap
> #-------------------------------#
> scoring....decoding.10.20.30.40.50.60.70.80.90.100 done
> deleted:0 genes
> ...processing 0 of 5
> ...processing 1 of 5
> ...processing 2 of 5
> ...processing 3 of 5
> ...processing 4 of 5
> Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
> --> rank=NA, hostname=c-193-25.kd.unc.edu
> ERROR: Failed while annotating transcripts
> ERROR: Chunk failed at level:1, tier_type:4
> FAILED CONTIG:Chromosome_5
> 
> ERROR: Chunk failed at level:6, tier_type:0
> FAILED CONTIG:Chromosome_5
> 
> error 47069
> 
> #--------- command -------------#
> Widget::snap:
> /nas02/apps/maker-2.31.8/src/maker/exe/snap/snap -plus /proj/willetlb/users/cwillett/MAKER_analyses/SD_full/PB11_12_eukscafs_sp3_sum_hm1.hmm -xdef /tmp/maker__PSNX1/0/53_0.6154539-6169567.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.xdef.snap  /tmp/maker__PSNX1/0/53_0.6154539-6169567.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap.fasta > /tmp/maker__PSNX1/0/53_0.6154539-6169567.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap
> #-------------------------------#
> scoring....decoding.10.20.30.40.50.60.70.80.90.100 done
> deleted:0 genes
> ...processing 0 of 10
> ...processing 1 of 10
> ...processing 2 of 10
> ...processing 3 of 10
> ...processing 4 of 10
> ...processing 5 of 10
> ...processing 6 of 10
> ...processing 7 of 10
> ...processing 8 of 10
> ...processing 9 of 10
> Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
> --> rank=NA, hostname=c-183-35.kd.unc.edu
> ERROR: Failed while annotating transcripts
> ERROR: Chunk failed at level:1, tier_type:4
> FAILED CONTIG:Chromosome_12
> 
> ERROR: Chunk failed at level:6, tier_type:0
> FAILED CONTIG:Chromosome_12
> 
> 
> Can't call method "start" on an undefined value at ../lib/maker/join.pm line 535.
> 
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org


From carsonhh at gmail.com  Tue Sep  5 15:48:56 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 5 Sep 2017 15:48:56 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
Message-ID: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>

You ran out of memory. You probably set max_dna_len too high for the machines you are using. There is a note in the maker_opts.ctl file that tells you that this value affects memory usage.

So you can either set it lower, or if running under MPI, use fewer CPUs per node (how you do this is MPI flavor dependent, but some flavors let you do this by setting process count lower combined with the round robin option).

--Carson


> On Sep 5, 2017, at 2:24 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Hello:
> 
> We are doing genome annotation for a new rodent species. We have finished the training of the ab initio gene predictors successful by setting the following parameters (split_hit=40000, max_dna_len=1000000, and 99k mammalian Swiss protein sequences as evidences. 
> 
> But when I used the trained model to do the genome annotation, I got the following kinds of errors (shown in red). I used the same parameters as those for training, except for addition of 340k rodent TrEMBL protein sequences for protein evidences (i.e., I use both 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences). 
> 
> I am doing the annotation on a cluster and started multiple Maker in the same directory (I had tried to use MPI but met some problems).  
> 
> Do you have any suggestions? Many thanks
> #some kinds of errors
> open3: fork failed: Cannot allocate memory at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
> --> rank=NA, hostname=n520
> ERROR: Failed while doing blastx of proteins
> ERROR: Chunk failed at level:8, tier_type:3
> FAILED CONTIG:Contig2
> 
> 
> setting up GFF3 output and fasta chunks
> doing repeat masking
> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
> --> rank=NA, hostname=n513
> ERROR: Failed while doing repeat masking
> ERROR: Chunk failed at level:0, tier_type:1
> FAILED CONTIG:Contig12378
> 
> 
> Best
> Quanwei


From carsonhh at gmail.com  Tue Sep  5 16:04:00 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 5 Sep 2017 16:04:00 -0600
Subject: [maker-devel] MSG: Can't get HSPs: data not collected.
In-Reply-To: <CAO_vRvZyMrfhL=nN79xnJeoXwFbEqTPwF8siioHJ_DFQTUpL2Q@mail.gmail.com>
References: <CAO_vRvZyMrfhL=nN79xnJeoXwFbEqTPwF8siioHJ_DFQTUpL2Q@mail.gmail.com>
Message-ID: <846D5971-E6EE-40F3-AF26-3124AA370029@gmail.com>

The last time I saw this error, it was with blast-2.5.1+. Not sure if the failure you are seeing is the same one. It was caused by a truncated BLAST report (so some results have no alignments). I have an open ticket with the BLAST development group. I never received confirmation that it is fixed, but you can try updating to 2.6 and see if that fixes it.  If not, switch to legacy BLAST (not blast plus) and see if it goes away.

--Carson


> On Sep 5, 2017, at 7:59 AM, zl c <chzelin at gmail.com> wrote:
> 
> Hello,
> 
> I run maker for most sequences successfully but fail some long sequences. The error is: 
> 
> Widget::tblastx:
> /usr/local/apps/blast/ncbi-blast-2.5.0+/bin/tblastx -db db.778415-832259.for_tblastx.fasta -query ...778415.832259.0 -num_alignments 10000 -num_descriptions 10000 -evalue 1e-10 -dbsize 1000 -searchsp 500000000 -num_threads 16 -lcase_masking -seg yes -soft_masking true -show_gis -out   OUT.tblastx
> #-------------------------------#
> 
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: Can't get HSPs: data not collected.
> STACK: Error::throw
> STACK: Bio::Root::Root::throw /usr/local/Perl/5.18.2/lib/perl5/site_perl/5.18.2/Bio/Root/Root.pm:486
> STACK: Bio::Search::Hit::PhatHit::Base::hsps /spin1/home/linux/chenz11/program/maker/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm:552
> STACK: Widget::tblastx::keepers /spin1/home/linux/chenz11/program/maker/bin/../lib/Widget/tblastx.pm:192
> STACK: Widget::tblastx::parse /spin1/home/linux/chenz11/program/maker/bin/../lib/Widget/tblastx.pm:133
> STACK: GI::tblastx /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:3251
> STACK: GI::tblastx /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:3260
> STACK: GI::reblast_merged_hits /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:471
> STACK: GI::merge_resolve_hits /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:291
> STACK: Process::MpiChunk::_go /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:2320
> STACK: Process::MpiChunk::run /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:340
> STACK: Process::MpiChunk::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:356
> STACK: Process::MpiTiers::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiTiers.pm:287
> STACK: Process::MpiTiers::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiTiers.pm:287
> STACK: /home/chenz11/program/maker/bin/maker:695
> -----------------------------------------------------------
> --> rank=NA, hostname=cn3544
> --> rank=NA, hostname=cn3544
> --> rank=NA, hostname=cn3544
> --> rank=NA, hostname=cn3544
> ERROR: Failed while collecting tblastx reports
> ERROR: Chunk failed at level:5, tier_type:3
> FAILED CONTIG:tig00011625_arrow
> 
> ERROR: Chunk failed at level:4, tier_type:0
> FAILED CONTIG:tig00011625_arrow
> 
> examining contents of the fasta file and run log
> 
> I've read a relative thread on the google group and checked my tblastx output. I found that the number of HSPs should be larger than 1000,000, but only output 1000,000, which make some alignments have no HSPs. Is there any setting that could solve the problem?
>  
> Thanks,
> Zelin
> 
> --------------------------------------------
> Zelin Chen [chzelin at gmail.com]
> 
> 
> NIH/NHGRI
> Building 50, Room 5531
> 50 SOUTH DR, MSC 8004 
> BETHESDA, MD 20892-8004
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org


From qwzhang0601 at gmail.com  Tue Sep  5 16:04:23 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Tue, 5 Sep 2017 18:04:23 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
Message-ID: <CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>

Dear Carson:

Thanks. I wonder whether a smaller "max_dna_len" will split longer scaffolds.
I set max_dna_len to 1Mb because there are quite a few long scaffolds
(e.g., the longest one is about 100Mb). Could you explain whether a smaller
"max_dna_len" will decrease the quality of the annotation (e.g., split some
genes in the same scaffold)?


Best
Quanwei

2017-09-05 17:48 GMT-04:00 Carson Holt <carsonhh at gmail.com>:

> You ran out of memory. You probably set max_dna_len too high for the
> machines you are using. There is a note in the maker_opts.ctl file that
> tells you that this value affects memory usage.
>
> So you can either set it lower, or if running under MPI, use fewer CPUs
> per node (how you do this is MPI flavor dependent, but some flavors let you
> do this by setting process count lower combined with the round robin
> option).
>
> --Carson
>
>
>
> On Sep 5, 2017, at 2:24 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>
> Hello:
>
> We are doing genome annotation for a new rodent species. We have finished
> training the ab initio gene predictors successfully with the following
> parameters: split_hit=40000, max_dna_len=1000000, and 99k mammalian
> Swiss-Prot protein sequences as evidence.
>
> But when I used the trained models to do the genome annotation, I got the
> following kinds of errors (shown below). I used the same parameters as
> for training, except for the addition of 340k rodent TrEMBL protein
> sequences as protein evidence (i.e., I use both the 99k mammalian
> Swiss-Prot sequences and the 340k rodent TrEMBL sequences).
>
> I am doing the annotation on a cluster and started multiple MAKER instances
> in the same directory (I had tried to use MPI but ran into some problems).
>
> Do you have any suggestions? Many thanks
> #some kinds of errors
> open3: fork failed: Cannot allocate memory at
> /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
> --> rank=NA, hostname=n520
> ERROR: Failed while doing blastx of proteins
> ERROR: Chunk failed at level:8, tier_type:3
> FAILED CONTIG:Contig2
>
>
> setting up GFF3 output and fasta chunks
> doing repeat masking
> Can't kill a non-numeric process ID at
> /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
> --> rank=NA, hostname=n513
> ERROR: Failed while doing repeat masking
> ERROR: Chunk failed at level:0, tier_type:1
> FAILED CONTIG:Contig12378
>
>
> Best
> Quanwei
>
>
>

From carsonhh at gmail.com  Tue Sep  5 16:08:28 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 5 Sep 2017 16:08:28 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
Message-ID: <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>

max_dna_len is the window size for keeping data in RAM. Smaller values do not split genes. But values lower than 100kb can create issues (if a single gene model spans 3 or more windows, it creates a weird failure).

--Carson
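
As a rough illustration of the window-size point above, the sketch below counts how many max_dna_len-sized windows a scaffold would be cut into. It assumes simple fixed-size, non-overlapping windows; that chunking model and the helper name window_count are assumptions for illustration only, and MAKER's internal chunking may differ. The scaffold length (about 100Mb) and the window sizes are the ones mentioned in this thread.

```python
def window_count(scaffold_len: int, max_dna_len: int) -> int:
    """Number of fixed-size windows needed to cover a scaffold.

    Hypothetical helper: assumes non-overlapping windows of max_dna_len bases.
    """
    if scaffold_len <= 0 or max_dna_len <= 0:
        raise ValueError("lengths must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return (scaffold_len + max_dna_len - 1) // max_dna_len


# Values from the thread: a ~100Mb scaffold with 1Mb, 300kb and 100kb windows.
for size in (1_000_000, 300_000, 100_000):
    print(size, window_count(100_000_000, size))
```

Under this simplified model, a smaller max_dna_len means more but smaller windows, so less sequence and evidence is held in RAM at any one time.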



From qwzhang0601 at gmail.com  Wed Sep  6 09:51:54 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 6 Sep 2017 11:51:54 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
Message-ID: <CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>

Dear Carson:

(1) Thank you for your explanation. I will try setting max_dna_len to 400kb
for our rodent species, which is a little higher than the value suggested
for large vertebrate genomes (the MAKER manual says "300,000 is a good
max_dna_len on large vertebrate genomes if memory is not a limiting
factor").

(2) From some of your replies in the maker Google group, I noticed that
setting depth_blast to a certain number can reduce memory use and save
annotation time. So I changed the following parameters. But I wonder whether
this will decrease the quality of annotation. If it won't affect the
quality, can I use an even smaller number (e.g., 20) to save more memory and
time?

depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking

(3) I also have some concerns about the speed, especially for the long
scaffolds (around 100Mb). I wonder which part of genome annotation is the
most time consuming (repeat masking, blast, or polishing?). In particular,
I wonder whether the blastx of protein evidence will take the majority of
the time. I have prepared 99k mammalian Swiss-Prot protein sequences and
340k rodent TrEMBL protein sequences as protein evidence. I am considering
whether I can save much time if I only use the 99k mammalian Swiss-Prot
sequences as evidence.

(4) For some reason, I cannot run maker through MPI on our cluster, so I
can only start multiple maker instances. I wonder whether it is possible to
let multiple maker instances annotate the same long scaffold (i.e., start
multiple maker processes for a single sequence, without splitting the long
sequence into shorter ones).

(5) Still on the speed issue: I read some of your comments about the "cpus"
parameter in the maker_opts file (
http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html).
I understand that it indicates the number of CPUs for a single chunk. So if
I set "cpus=2" in the maker_opts file, then I can use the following script
to submit the job, right?

**************** the bash file used to submit the maker job
#!/bin/bash

#$ -cwd
#$ -S /bin/bash
#$ -j y
#$ -N makerT2
#$ -l h_vmem=8g
#$ -pe smp 2

module load MAKER/2.31.9/perl.5.22.1

maker --q 2> maker_test.error


Many thanks

Best
Quanwei



From carsonhh at gmail.com  Wed Sep  6 10:06:46 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 6 Sep 2017 10:06:46 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
Message-ID: <9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>


> (2) By reading some of your replies in the maker google group, and I noticed that it can reduce memory and save time for annotation if I set depth_blast to a certain number. So I changed the following parameters. But I wonder, whether it will decrease the quality of annotation? If it won't affect the quality, can I even use a smaller number (e.g., 20) to save more memory and time?
> 
> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking

These values really only affect the final evidence kept in the GFF3 when you look at it in a browser. They have no effect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. By setting it lower, you are just letting MAKER know it can throw things away even sooner since you don't want them in the GFF3. It provides a minor improvement for memory use, but max_dna_len is the big one that has the greatest effect.


> (3) I also have some concerns about the speed, especially for the long scaffolds (around 100Mb). I wonder which part is the most time consuming for genome annotation (repeat masking, blast, or polishing?).  Particularly, I wonder whether the blastx of protein evidence will take majority of time. Now, I have prepared 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences as protein evidences. I am considering whether I can save much time if I only use the 99k mammalian Swiss protein sequences as evidences.

BLASTN (ESTs) -> fastest as it is searching nucleotide space
BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN
TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX

Also, if you double the dataset size, you double the runtime. Larger window sizes via max_dna_len will also increase runtimes.
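
The rules of thumb above can be turned into a back-of-the-envelope estimate. The sketch below assumes that runtime scales linearly with the number of evidence sequences and applies the stated lower-bound factors (6x for BLASTX and 12x for TBLASTX relative to BLASTN). The function name relative_cost and the linear model are illustrative assumptions, not measurements; real runtimes also depend on sequence lengths, hardware, and BLAST settings.

```python
# Lower-bound cost factors relative to BLASTN, as stated in the reply above.
FRAME_FACTOR = {"blastn": 1, "blastx": 6, "tblastx": 12}


def relative_cost(program: str, n_sequences: int) -> int:
    """Unitless cost: frame factor times dataset size (assumed linear scaling)."""
    return FRAME_FACTOR[program] * n_sequences


# Dataset sizes from the thread: 99k Swiss-Prot vs. 99k Swiss-Prot + 340k TrEMBL.
swiss_only = relative_cost("blastx", 99_000)
swiss_plus_trembl = relative_cost("blastx", 99_000 + 340_000)
print(f"adding TrEMBL: ~{swiss_plus_trembl / swiss_only:.1f}x the BLASTX work")
```

Counting sequences rather than residues is a simplification; it is only meant to show why dropping the larger TrEMBL set could cut the BLASTX time substantially.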


> (4) For some reasons, I can not run maker though MPI on our cluster. So I can only start multiple maker. I wonder if it is possible to let multiple maker to annotate the same long scaffold (i.e., for a single sequence I start multiple maker, without splitting the long sequence into shorter ones).

Without MPI you won't be able to split up large contigs. At the very least you can try to run on a single node and set MPI to use all CPUs on that node. It's less difficult to set up compared to cross-node jobs via MPI.


> (5) Still about the speed issue. I read some of your comments about "cpus" parameters in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html). And I know it indicate the number of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?

The cpus parameter only affects how many CPUs are given to the BLAST command line, so only the BLAST step will speed up. I recommend using MPI to get all steps to speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.


--Carson

From carsonhh at gmail.com  Thu Sep  7 09:12:46 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 7 Sep 2017 09:12:46 -0600
Subject: [maker-devel] MSG: Can't get HSPs: data not collected.
In-Reply-To: <CAO_vRvbQLQfsRUfXMC1f8=U9L2kJ=PnFSuePjwHpRJE39zdV1Q@mail.gmail.com>
References: <CAO_vRvZyMrfhL=nN79xnJeoXwFbEqTPwF8siioHJ_DFQTUpL2Q@mail.gmail.com>
	<846D5971-E6EE-40F3-AF26-3124AA370029@gmail.com>
	<CAO_vRvbQLQfsRUfXMC1f8=U9L2kJ=PnFSuePjwHpRJE39zdV1Q@mail.gmail.com>
Message-ID: <2B046506-1E32-4840-B3B6-6DABB4A5D4C2@gmail.com>

I'm glad it fixed it.

--Carson

> On Sep 6, 2017, at 8:27 PM, zl c <chzelin at gmail.com> wrote:
> 
> Hi Carson,
> 
> I tried blast-2.6.0+ and it works. Thank you very much.
> 
> Thanks
> Zelin Chen
> 
> --------------------------------------------
> Zelin Chen [chzelin at gmail.com]
> 
> NIH/NHGRI
> Building 50, Room 5531
> 50 SOUTH DR, MSC 8004 
> BETHESDA, MD 20892-8004
> 
> On Tue, Sep 5, 2017 at 6:04 PM, Carson Holt <carsonhh at gmail.com> wrote:
> The last time I saw this error, it was with blast-2.5.1+. Not sure if the failure you are seeing is the same one. It was caused by a truncated BLAST report (so some results have no alignments). I have an open ticket with the BLAST development group. I never received confirmation that it is fixed, but you can try updating to 2.6 and see if that fixes it. If not, switch to legacy BLAST (not BLAST+) and see if it goes away.
> 
> --Carson
> 
> 
>> On Sep 5, 2017, at 7:59 AM, zl c <chzelin at gmail.com> wrote:
>> 
>> Hello,
>> 
>> I ran maker successfully for most sequences, but it fails on some long sequences. The error is:
>> 
>> Widget::tblastx:
>> /usr/local/apps/blast/ncbi-blast-2.5.0+/bin/tblastx -db db.778415-832259.for_tblastx.fasta -query ...778415.832259.0 -num_alignments 10000 -num_descriptions 10000 -evalue 1e-10 -dbsize 1000 -searchsp 500000000 -num_threads 16 -lcase_masking -seg yes -soft_masking true -show_gis -out   OUT.tblastx
>> #-------------------------------#
>> 
>> ------------- EXCEPTION: Bio::Root::Exception -------------
>> MSG: Can't get HSPs: data not collected.
>> STACK: Error::throw
>> STACK: Bio::Root::Root::throw /usr/local/Perl/5.18.2/lib/perl5/site_perl/5.18.2/Bio/Root/Root.pm:486
>> STACK: Bio::Search::Hit::PhatHit::Base::hsps /spin1/home/linux/chenz11/program/maker/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm:552
>> STACK: Widget::tblastx::keepers /spin1/home/linux/chenz11/program/maker/bin/../lib/Widget/tblastx.pm:192
>> STACK: Widget::tblastx::parse /spin1/home/linux/chenz11/program/maker/bin/../lib/Widget/tblastx.pm:133
>> STACK: GI::tblastx /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:3251
>> STACK: GI::tblastx /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:3260
>> STACK: GI::reblast_merged_hits /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:471
>> STACK: GI::merge_resolve_hits /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:291
>> STACK: Process::MpiChunk::_go /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:2320
>> STACK: Process::MpiChunk::run /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:340
>> STACK: Process::MpiChunk::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:356
>> STACK: Process::MpiTiers::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiTiers.pm:287
>> STACK: Process::MpiTiers::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiTiers.pm:287
>> STACK: /home/chenz11/program/maker/bin/maker:695
>> -----------------------------------------------------------
>> --> rank=NA, hostname=cn3544
>> --> rank=NA, hostname=cn3544
>> --> rank=NA, hostname=cn3544
>> --> rank=NA, hostname=cn3544
>> ERROR: Failed while collecting tblastx reports
>> ERROR: Chunk failed at level:5, tier_type:3
>> FAILED CONTIG:tig00011625_arrow
>> 
>> ERROR: Chunk failed at level:4, tier_type:0
>> FAILED CONTIG:tig00011625_arrow
>> 
>> examining contents of the fasta file and run log
>> 
>> I've read a related thread on the Google group and checked my tblastx output. I found that the number of HSPs should be larger than 1,000,000, but only 1,000,000 were output, which leaves some alignments with no HSPs. Is there any setting that could solve the problem?
>>  
>> Thanks,
>> Zelin
>> 
>> --------------------------------------------
>> Zelin Chen [chzelin at gmail.com]
>> 
>> 
>> NIH/NHGRI
>> Building 50, Room 5531
>> 50 SOUTH DR, MSC 8004 
>> BETHESDA, MD 20892-8004

From qwzhang0601 at gmail.com  Fri Sep  8 21:25:29 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Fri, 8 Sep 2017 23:25:29 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
Message-ID: <CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>

Dear Carson:

I got the following error again. Is this still related to memory? Could
there be other causes for this error? This time, I got the error while
training the SNAP model. Previously, even with max_dna_len=1Mb, I could
train the model successfully. In the current training run (where I get the
following error), I have decreased max_dna_len to 300kb and requested the
same amount of memory as before. The only difference is that I am now using
both the mammalian repeat library and a species-specific repeat library,
while previously I only used the mammalian repeat library. Does using both
repeat libraries greatly increase the memory requirement (even when I
decrease max_dna_len from 1Mb to 300kb)? I have also set depth_blast to 30
in the current training run.

Thank you! Have a nice weekend!


#---------------------------------------------------------------------
Now starting the contig!!
SeqID: Contig10
Length: 18773588
#---------------------------------------------------------------------


setting up GFF3 output and fasta chunks
doing repeat masking
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
collecting blastx repeatmasking
processing all repeats
doing repeat masking
Can't kill a non-numeric process ID at
/gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
--> rank=NA, hostname=n224
ERROR: Failed while doing repeat masking
ERROR: Chunk failed at level:0, tier_type:1
FAILED CONTIG:Contig10

ERROR: Chunk failed at level:2, tier_type:0
FAILED CONTIG:Contig10

Best
Quanwei


From xvazquezc at gmail.com  Sun Sep 10 19:03:11 2017
From: xvazquezc at gmail.com (=?UTF-8?Q?Xabier_V=C3=A1zquez=2DCampos?=)
Date: Mon, 11 Sep 2017 11:03:11 +1000
Subject: [maker-devel] augustus underpredicting
Message-ID: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>

Hi,
I have been annotating a fungal genome as usual, using BUSCO-trained
Augustus (in addition to GeneMark and SNAP), but for some reason Augustus
is predicting a mere 207 genes compared to 15-20k from the other two.
I've never had this problem. The genome has an unusually high repeat content,
close to 50%; I'm not sure whether that might pose a problem.
Has anybody come across a similar issue?
I also asked the BUSCO developers whether they have any idea:
https://gitlab.com/ezlab/busco/issues/49
Cheers,
Xabi

-- 
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From qwzhang0601 at gmail.com  Mon Sep 11 10:19:50 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 11 Sep 2017 12:19:50 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
Message-ID: <CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>

Dear Carson:

Regarding the error in my previous email, I found that the contig was
annotated correctly on the second RETRY, so please ignore that email. But
now, for a few scaffolds, I have problems processing the repeats (as shown
below). I used both the Mammalia repeat library and a species-specific
repeat library (generated with your pipeline "
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic").
There were no such problems when I used only the Mammalia repeat library. Do
you have any idea what the reason could be, or any suggestions for how I can
find it? Many thanks

Here are some parameters I used

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe

max_dna_len=300000
split_hit=40000
depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking


Died at
/gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
--> rank=NA, hostname=n409
ERROR: Failed while processing all repeats
ERROR: Chunk failed at level:3, tier_type:1
FAILED CONTIG:Contig31


Best
Quanwei

2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:

> Dear Carson:
>
> I got the following error again. Is this still related to memory issues? I
> wonder whether there can be other reasons lead to this error? This time, I
> got this error during training of the SNAP model. Before, even I set
> max_dna_len=1Mb, I can train the model successfully.  And in the current
> training (where I get the following error),  I have decreased the
> max_dna_len to 300kb. I required the same amount memory as before. The only
> difference is that I am using both mammalian repeat library and species
> specific repeat library, while previously I only use the mammalian repeat
> library. Will it greatly increases the requirement of memory to use both
> repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? I
> have also set the depth_blast as 30 in current training.
>
> Thank you! Have a nice weekend!
>
>
>
> #---------------------------------------------------------------------
> Now starting the contig!!
> SeqID: Contig10
> Length: 18773588
> #---------------------------------------------------------------------
>
>
> setting up GFF3 output and fasta chunks
> doing repeat masking
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> collecting blastx repeatmasking
> processing all repeats
> doing repeat masking
> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.
> 31.9/bin/../lib/File/NFSLock.pm line 1050.
> --> rank=NA, hostname=n224
> ERROR: Failed while doing repeat masking
> ERROR: Chunk failed at level:0, tier_type:1
> FAILED CONTIG:Contig10
>
> ERROR: Chunk failed at level:2, tier_type:0
> FAILED CONTIG:Contig10
>
> Best
> Quanwei
>
> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>
>>
>> (2) By reading some of your replies in the maker google group, and I
>> noticed that it can reduce memory and save time for annotation if I set
>> depth_blast to a certain number. So I changed the following parameters. But
>> I wonder, whether it will decrease the quality of annotation? If it won't
>> affect the quality, can I even use a smaller number (e.g., 20) to save more
>> memory and time?
>>
>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>
>>
>> These values really only affect the final evidence kept in the GFF3 when
>> you look at it in a browser. They have no effect on the annotation. This is
>> because internally MAKER already collapses evidence down to the 10 best
>> non-redundant features per evidence set per locus. The rest are put in the
>> GFF3 just for reference. By setting it lower, you are just letting MAKER
>> know it can throw things away even sooner since you don't want them in
>> the GFF3. It provides a minor improvement for memory use, but
>> max_dna_len is the big one that has the greatest effect.
>>
>>
>> (3) I also have some concerns about the speed, especially for the long
>> scaffolds (around 100Mb). I wonder which part is the most time consuming
>> for genome annotation (repeat masking, blast, or polishing?).
>> Particularly, I wonder whether the blastx of protein evidence will take
>> majority of time. Now, I have prepared 99k mammalian Swiss protein
>> sequences and 340k rodent TrEMBL protein sequences as protein evidences. I
>> am considering whether I can save much time if I only use the 99k mammalian
>> Swiss protein sequences as evidences.
>>
>>
>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>> BLASTX (proteins) -> must search 6 reading frames so will be at least 6
>> times slower than BLASTN
>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least
>> 12 times slower than BLASTN and twice as slow as BLASTX
>>
>> Also, doubling the dataset size doubles the runtime. Larger window sizes
>> via max_dna_len will also increase runtimes.
>>
>>
>> (4) For some reasons, I can not run maker though MPI on our cluster. So I
>> can only start multiple maker. I wonder if it is possible to let multiple
>> maker to annotate the same long scaffold (i.e., for a single sequence I
>> start multiple maker, without splitting the long sequence into shorter
>> ones).
>>
>>
>> Without MPI you won't be able to split up large contigs. At the very
>> least you can try and run on a single node and set MPI to use all CPUs on
>> that node. It's less difficult to set up compared to cross node jobs via
>> MPI.
>>
>>
>> (5) Still about the speed issue. I read some of your comments about
>> "cpus" parameters in the maker_opts file
>> (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html).
>> And I know
>> it indicate the number of cpus for a single chunk. So if I set "cpus=2" in
>> the maker_opts file, then I can use the following command to submit the
>> job, right?
>>
>>
>> The cpus parameter only affects how many CPUs are given to the BLAST
>> command line, so only the BLAST step will speed up. I recommend using
>> MPI to get all steps to speed up. Even if you are only running on a single
>> node, you can give all CPUs to the mpiexec command.
>>
>>
>> --Carson
>>
>
>

From carsonhh at gmail.com  Mon Sep 11 10:48:16 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 10:48:16 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
Message-ID: <5C2477A3-CDBA-458A-95CA-E6DC912417B3@gmail.com>

It may be a memory issue or an IO issue. Some resource is being taxed and creating a non-responsive bottleneck. If you are running MAKER multiple times in the same directory, you may have to run fewer processes. Also, if you are running without MPI, run with MPI instead, as it will manage the parallelization better and use fewer resources than multiple individual processes.

--Carson


> On Sep 8, 2017, at 9:25 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> I got the following error again. Is this still related to memory issues? I wonder whether there can be other reasons lead to this error? This time, I got this error during training of the SNAP model. Before, even I set  max_dna_len=1Mb, I can train the model successfully.  And in the current training (where I get the following error),  I have decreased the max_dna_len to 300kb. I required the same amount memory as before. The only difference is that I am using both mammalian repeat library and species specific repeat library, while previously I only use the mammalian repeat library. Will it greatly increases the requirement of memory to use both repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? I have also set the depth_blast as 30 in current training.
> 
> Thank you! Have a nice weekend! 
> 
> 
> 
> #---------------------------------------------------------------------
> Now starting the contig!!
> SeqID: Contig10
> Length: 18773588
> #---------------------------------------------------------------------
> 
> 
> setting up GFF3 output and fasta chunks
> doing repeat masking
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> collecting blastx repeatmasking
> processing all repeats
> doing repeat masking
> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
> --> rank=NA, hostname=n224
> ERROR: Failed while doing repeat masking
> ERROR: Chunk failed at level:0, tier_type:1
> FAILED CONTIG:Contig10
> 
> ERROR: Chunk failed at level:2, tier_type:0
> FAILED CONTIG:Contig10
> 
> Best
> Quanwei
> 
> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
> 
>> (2) By reading some of your replies in the maker google group, and I noticed that it can reduce memory and save time for annotation if I set depth_blast to a certain number. So I changed the following parameters. But I wonder, whether it will decrease the quality of annotation? If it won't affect the quality, can I even use a smaller number (e.g., 20) to save more memory and time?
>> 
>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
> 
> These values really only affect the final evidence kept in the GFF3 when you look at it in a browser. They have no effect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. By setting it lower, you are just letting MAKER know it can throw things away even sooner since you don't want them in the GFF3. It provides a minor improvement for memory use, but max_dna_len is the big one that has the greatest effect.
> 
> 
>> (3) I also have some concerns about the speed, especially for the long scaffolds (around 100Mb). I wonder which part is the most time consuming for genome annotation (repeat masking, blast, or polishing?).  Particularly, I wonder whether the blastx of protein evidence will take majority of time. Now, I have prepared 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences as protein evidences. I am considering whether I can save much time if I only use the 99k mammalian Swiss protein sequences as evidences.
> 
> BLASTN (ESTs) -> fastest as it is searching nucleotide space
> BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN
> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX
> 
> Also, doubling the dataset size doubles the runtime. Larger window sizes via max_dna_len will also increase runtimes.
> 
> 
>> (4) For some reasons, I can not run maker though MPI on our cluster. So I can only start multiple maker. I wonder if it is possible to let multiple maker to annotate the same long scaffold (i.e., for a single sequence I start multiple maker, without splitting the long sequence into shorter ones).
> 
> Without MPI you won't be able to split up large contigs. At the very least you can try and run on a single node and set MPI to use all CPUs on that node. It's less difficult to set up compared to cross node jobs via MPI.
> 
> 
>> (5) Still about the speed issue. I read some of your comments about "cpus" parameters in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html). And I know it indicate the number of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?
> 
> The cpus parameter only affects how many CPUs are given to the BLAST command line, so only the BLAST step will speed up. I recommend using MPI to get all steps to speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
> 
> 
> --Carson
> 


From carsonhh at gmail.com  Mon Sep 11 10:50:41 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 10:50:41 -0600
Subject: [maker-devel] augustus underpredicting
In-Reply-To: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>
References: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>
Message-ID: <E0AFDA01-3142-4A14-A4F0-E621E5BD4C2F@gmail.com>

BUSCO may be generating too few models. BUSCO also identifies classes of conserved short genes that may not represent enough training diversity for your organism. Try running MAKER in protein2genome or est2genome mode, and then train with those results.

--Carson


> On Sep 10, 2017, at 7:03 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com> wrote:
> 
> Hi,
> I have been annotating a fungal genome as usual, using Busco-trained Augustus (in addition to GeneMark and SNAP), but for some reason, Augustus is predicting a mere 207 genes compared to 15-20k from the other two.
> I've never had this problem. The genome has an unusual repeat content close to 50%, not sure if that might suppose a problem.
> Has anybody come up with any similar issue?
> I also asked the Busco developers if they have any idea: https://gitlab.com/ezlab/busco/issues/49
> Cheers,
> Xabi
> 
> -- 
> Xabier Vázquez-Campos, PhD
> Research Associate
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA


From carsonhh at gmail.com  Mon Sep 11 11:07:12 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 11:07:12 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
Message-ID: <92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>

I think the cause of the error may have been a little further upstream from what you pasted in the e-mail. One thing that may be happening is that you are taxing resources (like IO) by running MAKER multiple times or on too many CPUs. That can lead to failures because of truncated BLAST reports etc., in which case you can just retry, and that will get around those types of IO-derived errors. MAKER can generate a lot of IO, and if you are working on network-mounted locations (i.e. the storage being used is actually across the network), they can be less robust than local storage (under heavy load, NFS can falsely report success on read/write operations that actually failed). That is the reason we built retry capabilities into MAKER.

For contigs that continuously fail, you may need to set clean_try=1. That will cause failures to start from scratch (i.e. delete all old reports on failure rather than just those suspected of being truncated).
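
As a rough illustration of the retry settings mentioned above, here is a minimal Python sketch that rewrites `tries` and `clean_try` in the text of a control file. The helper name `set_maker_opts` is hypothetical, and the two sample lines are illustrative stand-ins for a real maker_opts.ctl, not a copy of one.

```python
import re

def set_maker_opts(ctl_text, updates):
    """Return ctl_text with `key=value` lines rewritten for keys in updates.

    Trailing `#comment` text on a rewritten line is kept; keys that are not
    already present are appended at the end. (Hypothetical helper, not part
    of MAKER.)
    """
    remaining = dict(updates)
    out = []
    for line in ctl_text.splitlines():
        m = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if m and m.group(1) in remaining:
            key = m.group(1)
            comment = m.group(3) or ""
            out.append(f"{key}={remaining.pop(key)} {comment}".rstrip())
        else:
            out.append(line)
    out.extend(f"{k}={v}" for k, v in remaining.items())
    return "\n".join(out) + "\n"

# Illustrative stand-in for two lines of a control file.
example = (
    "tries=2 #number of times to try a contig\n"
    "clean_try=0 #start from scratch on retry\n"
)
updated = set_maker_opts(example, {"tries": 5, "clean_try": 1})
print(updated)
```

Working on a copy of the control file and diffing it against the original before a rerun is probably the safer way to use something like this.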

--Carson


> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> About the error in my above email, I found the contig was correctly annotated at the second time RETRY. So please ignore my last email. But now, for a few number of scaffolds, I met problems to process the repeats (as shown below in red). I used both Mammalia repeat library and species specific repeat library (which is generated by your pipeline "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic"). There were no such problems when I only used Mammalia repeat library. Do you have any ideas about this? What could be the reason? Or do you have any suggestions for me to find the reason? Many thanks
> 
> Here are some parameters I used
> 
> #-----Repeat Masking (leave values blank to skip repeat masking)
> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
> 
> max_dna_len=300000
> split_hit=40000
> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
> 
> 
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
> 33708 --> rank=NA, hostname=n409
> 33709 ERROR: Failed while processing all repeats
> 33710 ERROR: Chunk failed at level:3, tier_type:1
> 33711 FAILED CONTIG:Contig31
> 
> 
> Best
> Quanwei
> 
> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
> Dear Carson:
> 
> I got the following error again. Is this still related to memory issues? I wonder whether there can be other reasons lead to this error? This time, I got this error during training of the SNAP model. Before, even I set  max_dna_len=1Mb, I can train the model successfully.  And in the current training (where I get the following error),  I have decreased the max_dna_len to 300kb. I required the same amount memory as before. The only difference is that I am using both mammalian repeat library and species specific repeat library, while previously I only use the mammalian repeat library. Will it greatly increases the requirement of memory to use both repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? I have also set the depth_blast as 30 in current training.
> 
> Thank you! Have a nice weekend! 
> 
> 
> 
> #---------------------------------------------------------------------
> Now starting the contig!!
> SeqID: Contig10
> Length: 18773588
> #---------------------------------------------------------------------
> 
> 
> setting up GFF3 output and fasta chunks
> doing repeat masking
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> collecting blastx repeatmasking
> processing all repeats
> doing repeat masking
> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
> --> rank=NA, hostname=n224
> ERROR: Failed while doing repeat masking
> ERROR: Chunk failed at level:0, tier_type:1
> FAILED CONTIG:Contig10
> 
> ERROR: Chunk failed at level:2, tier_type:0
> FAILED CONTIG:Contig10
> 
> Best
> Quanwei
> 
> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
> 
>> (2) By reading some of your replies in the maker google group, and I noticed that it can reduce memory and save time for annotation if I set depth_blast to a certain number. So I changed the following parameters. But I wonder, whether it will decrease the quality of annotation? If it won't affect the quality, can I even use a smaller number (e.g., 20) to save more memory and time?
>> 
>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
> 
> These values really only affect the final evidence kept in the GFF3 when you look at it in a browser. They have no effect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. By setting it lower, you are just letting MAKER know it can throw things away even sooner since you don't want them in the GFF3. It provides a minor improvement for memory use, but max_dna_len is the big one that has the greatest effect.
> 
> 
>> (3) I also have some concerns about the speed, especially for the long scaffolds (around 100Mb). I wonder which part is the most time consuming for genome annotation (repeat masking, blast, or polishing?).  Particularly, I wonder whether the blastx of protein evidence will take majority of time. Now, I have prepared 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences as protein evidences. I am considering whether I can save much time if I only use the 99k mammalian Swiss protein sequences as evidences.
> 
> BLASTN (ESTs) -> fastest as it is searching nucleotide space
> BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN
> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX
> 
> Also, doubling the dataset size doubles the runtime. Larger window sizes via max_dna_len will also increase runtimes.
> 
> 
>> (4) For some reasons, I can not run maker though MPI on our cluster. So I can only start multiple maker. I wonder if it is possible to let multiple maker to annotate the same long scaffold (i.e., for a single sequence I start multiple maker, without splitting the long sequence into shorter ones).
> 
> Without MPI you won't be able to split up large contigs. At the very least you can try and run on a single node and set MPI to use all CPUs on that node. It's less difficult to set up compared to cross node jobs via MPI.
> 
> 
>> (5) Still about the speed issue. I read some of your comments about "cpus" parameters in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html). And I know it indicate the number of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?
> 
> The cpus parameter only affects how many CPUs are given to the BLAST command line, so only the BLAST step will speed up. I recommend using MPI to get all steps to speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
> 
> 
> --Carson
> 
> 


From qwzhang0601 at gmail.com  Mon Sep 11 11:12:29 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 11 Sep 2017 13:12:29 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
Message-ID: <CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>

Dear Carson:

I only run 5 MAKER instances in each directory (each with cpus=2). If this
is related to a memory or IO issue, I am not sure why the much longer
scaffolds were all annotated successfully while the relatively shorter
ones failed.

I have set "tries=5" (#number of times to try a contig if there is a
failure for some reason). I will try "clean_try=1" and test on the failed
scaffolds individually with larger memory to see whether they can be
annotated.

Thank you!

Best
Quanwei

2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:

> I think the cause of the error may have been a little further upstream
> from what you pasted in the e-mail. One thing that may be happening is that
> you are taxing resources (like IO) if running MAKER multiple times or on
> too many CPUs. That can lead to failures because of truncated BLAST reports
> etc. In which case you can just retry and that will get around those types
> of IO derived errors. MAKER can generate a lot of IO, and if you are
> working on network mounted locations (i.e. the storage being used is
> actually across the network), then they can be less robust than local
> storage (when under heavy load NFS can falsely report success on read/write
> operations that actually failed). It's the reason we built in the retry
> capabilities of MAKER.
>
> For contigs that continuously fail, you may need to set clean_try=1. That
> will cause failures to start from scratch (i.e. delete all old reports on
> failure rather than just those suspected of being truncated).
>
> --Carson
>
>
>
> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>
> Dear Carson:
>
> About the error in my above email, I found the contig was correctly
> annotated at the second time RETRY. So please ignore my last email. But
> now, for a few number of scaffolds, I met problems to process the repeats
> (as shown below in red). I used both Mammalia repeat library and species
> specific repeat library (which is generated by your pipeline
> "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic").
> There were no such problems when I
> only used Mammalia repeat library. Do you have any ideas about this? What
> could be the reason? Or do you have any suggestions for me to find the
> reason? Many thanks
>
> Here are some parameters I used
>
> #-----Repeat Masking (leave values blank to skip repeat masking)
> model_org=Mammalia #select a model organism for RepBase masking in
> RepeatMasker
> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific
> repeat library in fasta format for Repe
>
> max_dna_len=300000
> split_hit=40000
> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>
>
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm
> line 188.
> 33708 --> rank=NA, hostname=n409
> 33709 ERROR: Failed while processing all repeats
> 33710 ERROR: Chunk failed at level:3, tier_type:1
> 33711 FAILED CONTIG:Contig31
>
>
> Best
> Quanwei
>
> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>
>> Dear Carson:
>>
>> I got the following error again. Is this still related to memory issues?
>> I wonder whether there can be other reasons lead to this error? This time,
>> I got this error during training of the SNAP model. Before, even I set
>> max_dna_len=1Mb, I can train the model successfully.  And in the current
>> training (where I get the following error),  I have decreased the
>> max_dna_len to 300kb. I required the same amount memory as before. The only
>> difference is that I am using both mammalian repeat library and species
>> specific repeat library, while previously I only use the mammalian repeat
>> library. Will it greatly increases the requirement of memory to use both
>> repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? I
>> have also set the depth_blast as 30 in current training.
>>
>> Thank you! Have a nice weekend!
>>
>>
>>
>> #---------------------------------------------------------------------
>> Now starting the contig!!
>> SeqID: Contig10
>> Length: 18773588
>> #---------------------------------------------------------------------
>>
>>
>> setting up GFF3 output and fasta chunks
>> doing repeat masking
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> collecting blastx repeatmasking
>> processing all repeats
>> doing repeat masking
>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm
>> line 1050.
>> --> rank=NA, hostname=n224
>> ERROR: Failed while doing repeat masking
>> ERROR: Chunk failed at level:0, tier_type:1
>> FAILED CONTIG:Contig10
>>
>> ERROR: Chunk failed at level:2, tier_type:0
>> FAILED CONTIG:Contig10
>>
>> Best
>> Quanwei
>>
>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>
>>>
>>> (2) By reading some of your replies in the maker google group, and I
>>> noticed that it can reduce memory and save time for annotation if I set
>>> depth_blast to a certain number. So I changed the following parameters. But
>>> I wonder, whether it will decrease the quality of annotation? If it won't
>>> affect the quality, can I even use a smaller number (e.g., 20) to save more
>>> memory and time?
>>>
>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>
>>>
>>> These values really only affect the final evidence kept in the GFF3 when
>>> you look at it in a browser. They have no effect on the annotation. This is
>>> because internally MAKER already collapses evidence down to the 10 best
>>> non-redundant features per evidence set per locus. The rest are put in the
>>> GFF3 just for reference. By setting it lower, you are just letting MAKER
>>> know it can throw things away even sooner since you don't want them in
>>> the GFF3. It provides a minor improvement for memory use, but
>>> max_dna_len is the big one that has the greatest effect.
>>>
>>>
>>> (3) I also have some concerns about the speed, especially for the long
>>> scaffolds (around 100Mb). I wonder which part is the most time consuming
>>> for genome annotation (repeat masking, blast, or polishing?).
>>> Particularly, I wonder whether the blastx of protein evidence will take
>>> majority of time. Now, I have prepared 99k mammalian Swiss protein
>>> sequences and 340k rodent TrEMBL protein sequences as protein evidences. I
>>> am considering whether I can save much time if I only use the 99k mammalian
>>> Swiss protein sequences as evidences.
>>>
>>>
>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>> BLASTX (proteins) -> must search 6 reading frames so will be at least 6
>>> times slower than BLASTN
>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least
>>> 12 times slower than BLASTN and twice as slow as BLASTX
>>>
>>> Also, doubling the dataset size doubles the runtime. Larger window sizes
>>> via max_dna_len will also increase runtimes.
>>>
>>>
>>> (4) For some reasons, I can not run maker though MPI on our cluster. So
>>> I can only start multiple maker. I wonder if it is possible to let multiple
>>> maker to annotate the same long scaffold (i.e., for a single sequence I
>>> start multiple maker, without splitting the long sequence into shorter
>>> ones).
>>>
>>>
>>> Without MPI you won't be able to split up large contigs. At the very
>>> least you can try and run on a single node and set MPI to use all CPUs on
>>> that node. It's less difficult to set up compared to cross node jobs via
>>> MPI.
>>>
>>>
>>> (5) Still about the speed issue. I read some of your comments about
>>> "cpus" parameters in the maker_opts file
>>> (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html).
>>> And I
>>> know it indicate the number of cpus for a single chunk. So if I set
>>> "cpus=2" in the maker_opts file, then I can use the following command to
>>> submit the job, right?
>>>
>>>
>>> The cpus parameter only affects how many CPUs are given to the BLAST
>>> command line, so only the BLAST step will speed up. I recommend using
>>> MPI to get all steps to speed up. Even if you are only running on a single
>>> node, you can give all CPUs to the mpiexec command.
>>>
>>>
>>> --Carson
>>>
>>
>>
>
>

From carsonhh at gmail.com  Mon Sep 11 11:14:11 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 11:14:11 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
Message-ID: <488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>

It could be either. Please use MPI instead of starting multiple instances. It will greatly reduce both IO and RAM usage.
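
A minimal sketch of how the failed contigs could be pulled out of a MAKER log so they can be retried one at a time (for example with clean_try=1). It assumes the log uses the "ERROR:" and "FAILED CONTIG:" line formats seen in the excerpts quoted in this thread; the function name failed_contigs and the sample text are illustrative only and are not part of MAKER.

```python
import re

# Line formats taken from the log excerpts quoted in this thread.
FAILED_RE = re.compile(r"FAILED CONTIG:\s*(\S+)")
ERROR_RE = re.compile(r"ERROR:\s*(.+)")


def failed_contigs(log_text):
    """Map each failed contig ID to the ERROR messages logged just before it."""
    failures = {}
    pending = []  # ERROR messages seen since the last FAILED CONTIG line
    for line in log_text.splitlines():
        line = line.strip()
        failed = FAILED_RE.search(line)
        if failed:
            failures.setdefault(failed.group(1), []).extend(pending)
            pending = []
            continue
        error = ERROR_RE.search(line)
        if error:
            pending.append(error.group(1).strip())
    return failures


# Sample text copied from the quoted logs (hypothetical as a single file).
sample = """\
ERROR: Failed while doing repeat masking
ERROR: Chunk failed at level:0, tier_type:1
FAILED CONTIG:Contig10

ERROR: Chunk failed at level:2, tier_type:0
FAILED CONTIG:Contig10
33709 ERROR: Failed while processing all repeats
33710 ERROR: Chunk failed at level:3, tier_type:1
33711 FAILED CONTIG:Contig31
"""

for contig, errors in failed_contigs(sample).items():
    print(contig, errors)
```

Grouping the messages per contig makes it easier to see whether the same step (repeat masking, for instance) fails every time or whether the failures move around, which would point more toward IO or memory pressure.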

--Carson


> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> I only run 5 Maker instances in each directory (and set cpus=2). If it is related to memory issue or an IO issue, I am not sure why the much longer scaffolds (than the failed ones) were all annotated successfully, but the relatively shorter ones failed.  
> 
> I have set "tries=5" (#number of times to try a contig if there is a failure for some reason). I will try "clean_try=1" and test on the failed scaffolds individually with larger memory to see whether they can be annotated. 
> 
> Thank you!
> 
> Best
> Quanwei
> 
> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>>:
> I think the cause of the error may have been a little further upstream from what you pasted in the e-mail. One thing that may be happening is that you are taxing resources (like IO) by running MAKER multiple times or on too many CPUs. That can lead to failures because of truncated BLAST reports etc. In that case you can just retry, and that will get around those types of IO-derived errors. MAKER can generate a lot of IO, and if you are working on network-mounted locations (i.e. the storage being used is actually across the network), they can be less robust than local storage (under heavy load, NFS can falsely report success on read/write operations that actually failed). That is the reason we built the retry capabilities into MAKER.
> 
> For contigs that continuously fail, you may need to set clean_try=1. That will cause failures to start from scratch (i.e. delete all old reports on failure rather than just those suspected of being truncated).
> 
> --Carson
> 
> 
> 
>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com <mailto:qwzhang0601 at gmail.com>> wrote:
>> 
>> Dear Carson:
>> 
>> About the error in my previous email, I found the contig was correctly annotated on the second RETRY, so please ignore that email. But now, for a few scaffolds, I ran into problems while processing the repeats (as shown below). I used both the Mammalia repeat library and a species-specific repeat library (generated with your pipeline at http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic). There were no such problems when I only used the Mammalia repeat library. Do you have any ideas about what the reason could be, or any suggestions for how I can find it? Many thanks
>> 
>> Here are some parameters I used
>> 
>> #-----Repeat Masking (leave values blank to skip repeat masking)
>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
>> 
>> max_dna_len=300000
>> split_hit=40000
>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>> 
>> 
>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>> 33708 --> rank=NA, hostname=n409
>> 33709 ERROR: Failed while processing all repeats
>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>> 33711 FAILED CONTIG:Contig31
>> 
>> 
>> Best
>> Quanwei
>> 
>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com <mailto:qwzhang0601 at gmail.com>>:
>> Dear Carson:
>> 
>> I got the following error again. Is this still related to memory issues, or could there be other reasons for it? This time I got the error during training of the SNAP model. Before, even with max_dna_len set to 1Mb, I could train the model successfully. In the current training (where I get the following error), I have decreased max_dna_len to 300kb and requested the same amount of memory as before. The only difference is that I am now using both the mammalian repeat library and a species-specific repeat library, while previously I only used the mammalian repeat library. Does using both repeat libraries greatly increase the memory requirement (even when I decrease max_dna_len from 1Mb to 300kb)? I have also set the depth_blast parameters to 30 in the current training.
>> 
>> Thank you! Have a nice weekend! 
>> 
>> 
>> 
>> #---------------------------------------------------------------------
>> Now starting the contig!!
>> SeqID: Contig10
>> Length: 18773588
>> #---------------------------------------------------------------------
>> 
>> 
>> setting up GFF3 output and fasta chunks
>> doing repeat masking
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> collecting blastx repeatmasking
>> processing all repeats
>> doing repeat masking
>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
>> --> rank=NA, hostname=n224
>> ERROR: Failed while doing repeat masking
>> ERROR: Chunk failed at level:0, tier_type:1
>> FAILED CONTIG:Contig10
>> 
>> ERROR: Chunk failed at level:2, tier_type:0
>> FAILED CONTIG:Contig10
>> 
>> Best
>> Quanwei
>> 
>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>>:
>> 
>>> (2) Reading some of your replies in the maker google group, I noticed that setting depth_blast to a certain number can reduce memory and save time for annotation. So I changed the following parameters. But I wonder whether it will decrease the quality of the annotation? If it won't affect the quality, can I use an even smaller number (e.g., 20) to save more memory and time?
>>> 
>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>> 
>> These values really only affect the final evidence kept in the GFF3 when you look at it in a browser. They have no effect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. By setting it lower, you are just letting MAKER know it can throw things away even sooner since you don't want them in the GFF3. It provides a minor improvement for memory use, but max_dna_len is the big one that has the greatest effect.
>> 
>> 
>>> (3) I also have some concerns about the speed, especially for the long scaffolds (around 100Mb). I wonder which part of genome annotation is the most time consuming (repeat masking, blast, or polishing?). In particular, I wonder whether the blastx of protein evidence will take the majority of the time. I have prepared 99k mammalian Swiss-Prot protein sequences and 340k rodent TrEMBL protein sequences as protein evidence. I am considering whether I can save much time if I only use the 99k mammalian Swiss-Prot protein sequences as evidence.
>> 
>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>> BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN
>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX
>> 
>> Also, doubling the dataset size doubles the runtime. Larger window sizes via max_dna_len will also increase runtimes.
>> 
>> 
>>> (4) For some reason, I cannot run maker through MPI on our cluster, so I can only start multiple maker instances. I wonder if it is possible to let multiple maker instances annotate the same long scaffold (i.e., start multiple maker instances for a single sequence, without splitting the long sequence into shorter ones).
>> 
>> Without MPI you won't be able to split up large contigs. At the very least you can try and run on a single node and set MPI to use all CPUs on that node. It's less difficult to set up compared to cross node jobs via MPI.
>> 
>> 
>>> (5) Still about the speed issue. I read some of your comments about the "cpus" parameter in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html). And I know it indicates the number of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?
>> 
>> The cpus parameter only affects how many CPUs are given to the BLAST command line, so only the BLAST step will speed up. I recommend using MPI to get all steps to speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
>> 
>> 
>> --Carson
>> 
>> 
> 
> 


From qwzhang0601 at gmail.com  Mon Sep 11 11:16:49 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 11 Sep 2017 13:16:49 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
Message-ID: <CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>

Dear Carson:

I ran into some problems using MPI. I will give it another try.
Thank you!

Best
Quanwei

2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:

> It could be either. Please use MPI instead of starting multiple instances.
> It will greatly reduce both IO and RAM usage.
>
> --Carson
>

From carsonhh at gmail.com  Mon Sep 11 11:18:14 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 11:18:14 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
Message-ID: <3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>

If you are just using a single machine (and not cross machine MPI), use MPICH3 -> https://www.mpich.org

It's easy to install yourself, and tends to be very robust to failure.
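
A minimal sketch of the single-node launch described above: it only builds and prints the command line, it does not run anything. It assumes that mpiexec (from MPICH or another MPI) and an MPI-enabled maker executable are both on the PATH and that it is run from the directory holding the maker control files. The helper name single_node_maker_command is illustrative and not part of MAKER.

```python
import os
import shlex


def single_node_maker_command(n_procs=None):
    """Build an mpiexec command line that starts MAKER on one node.

    n_procs defaults to the number of CPUs Python can see on this machine.
    """
    if n_procs is None:
        n_procs = os.cpu_count() or 1
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    # "-n" is the standard mpiexec option for the number of processes.
    return ["mpiexec", "-n", str(n_procs), "maker"]


# Print the command for an 8-core node; pass no argument to use all visible CPUs.
print(shlex.join(single_node_maker_command(8)))
```

On a scheduler-managed cluster the process count would normally match the number of slots requested for the job on that one node, rather than os.cpu_count().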

--Carson


> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> I ran into some problems using MPI. I will give it another try.
> Thank you!
> 
> Best
> Quanwei
> 

From qwzhang0601 at gmail.com  Mon Sep 11 11:27:22 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 11 Sep 2017 13:27:22 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
Message-ID: <CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>

Dear Carson:

Would you please explain what you mean by "a single machine"? I am
running maker2 on our high performance cluster. The cluster has more than
1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine is used
as the scheduler. Can I use MPICH3?

Thanks

Best
Quanwei

2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:

> If you are just using a single machine (and not cross machine MPI), use
> MPICH3 -> https://www.mpich.org
>
> It's easy to install yourself, and tends to be very robust to failure.
>
> --Carson
>
>
>
> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>
> Dear Carson:
>
> I met some problems to use MPI. I will give it another try.
> Thank you!
>
> Best
> Quanwei
>
> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>
>> It could be either. Please use MPI instead of starting multiple
>> instances. It will greatly reduce both IO and RAM usage.
>>
>> --Carson
>>
>>
>>
>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>> wrote:
>>
>> Dear Carson:
>>
>> I only run 5 Maker instances in each directory (and set cpus=2). If it is
>> related to memory issue or an IO issue, I am not sure why the much longer
>> scaffolds (than the failed ones) were all annotated successfully, but the
>> relatively shorter ones failed.
>>
>> I have set "tries=5" (#number of times to try a contig if there is a
>> failure for some reason). I will try "clean_try=1" and test on the failed
>> scaffolds individually with larger memory to see whether they can be
>> annotated.
>>
>> Thank you!
>>
>> Best
>> Quanwei
>>
>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>
>>> I think the cause of the error may have been a little further upstream
>>> from what you pasted in the e-mail. One thing that may be happening is that
>>> you are taxing resources (like IO) if running MAKER multiple times or on
>>> too many CPUs. That can lead to failures because of truncated BLAST reports
>>> etc. In which case you can just retry and that will get around those types
>>> of IO derived errors. MAKER can generate a lot of IO, and if you are
>>> working on network mounted locations (i.e. the storage being used is
>>> actually across the network), then they can be less robust than local
>>> storage (when under heavy load NFS can falsely report success on read/write
>>> operations that actually failed). It's the reason we built in the retry
>>> capabilities of MAKER.
>>>
>>> For contigs that continuously fail, you may need to set clean_try=1.
>>> That will cause failures to start from scratch (i.e. delete all old reports
>>> on failure rather than just those suspected of being truncated).
>>>
>>> --Carson
>>>
>>>
>>>
>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>> wrote:
>>>
>>> Dear Carson:
>>>
>>> About the error in my above email, I found the contig was correctly
>>> annotated on the second retry. So please ignore my last email. But
>>> now, for a small number of scaffolds, I ran into problems processing the
>>> repeats (as shown below). I used both the Mammalia repeat library and a
>>> species-specific repeat library (which was generated by your pipeline
>>> "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic").
>>> There were no such problems when I
>>> only used the Mammalia repeat library. Do you have any ideas about this? What
>>> could be the reason? Or do you have any suggestions for me to find the
>>> reason? Many thanks
>>>
>>> Here are some parameters I used
>>>
>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>> model_org=Mammalia #select a model organism for RepBase masking in
>>> RepeatMasker
>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism
>>> specific repeat library in fasta format for Repe
>>>
>>> max_dna_len=300000
>>> split_hit=40000
>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>
>>>
>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm
>>> line 188.
>>> 33708 --> rank=NA, hostname=n409
>>> 33709 ERROR: Failed while processing all repeats
>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>> 33711 FAILED CONTIG:Contig31
>>>
>>>
>>> Best
>>> Quanwei
>>>
>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>>
>>>> Dear Carson:
>>>>
>>>> I got the following error again. Is this still related to memory
>>>> issues? I wonder whether there can be other reasons lead to this error?
>>>> This time, I got this error during training of the SNAP model. Before, even
>>>> I set  max_dna_len=1Mb, I can train the model successfully.  And in the
>>>> current training (where I get the following error),  I have decreased the
>>>> max_dna_len to 300kb. I required the same amount memory as before. The only
>>>> difference is that I am using both mammalian repeat library and species
>>>> specific repeat library, while previously I only use the mammalian repeat
>>>> library. Will it greatly increases the requirement of memory to use both
>>>> repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? I
>>>> have also set the depth_blast as 30 in current training.
>>>>
>>>> Thank you! Have a nice weekend!
>>>>
>>>>
>>>>
>>>> #---------------------------------------------------------------------
>>>> Now starting the contig!!
>>>> SeqID: Contig10
>>>> Length: 18773588
>>>> #---------------------------------------------------------------------
>>>>
>>>>
>>>> setting up GFF3 output and fasta chunks
>>>> doing repeat masking
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> collecting blastx repeatmasking
>>>> processing all repeats
>>>> doing repeat masking
>>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm
>>>> line 1050.
>>>> --> rank=NA, hostname=n224
>>>> ERROR: Failed while doing repeat masking
>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>> FAILED CONTIG:Contig10
>>>>
>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>> FAILED CONTIG:Contig10
>>>>
>>>> Best
>>>> Quanwei
>>>>
>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>
>>>>>
>>>>> (2) By reading some of your replies in the maker google group, and I
>>>>> noticed that it can reduce memory and save time for annotation if I set
>>>>> depth_blast to a certain number. So I changed the following parameters. But
>>>>> I wonder, whether it will decrease the quality of annotation? If it won't
>>>>> affect the quality, can I even use a smaller number (e.g., 20) to save more
>>>>> memory and time?
>>>>>
>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>
>>>>>
>>>>> These values really only affect the final evidence kept in the GFF3
>>>>> when you look at it in a browser. They have no effect on the annotation. This
>>>>> is because internally MAKER already collapses evidence down to the 10 best
>>>>> non-redundant features per evidence set per locus. The rest are put in the
>>>>> GFF3 just for reference. By setting it lower, you are just letting MAKER
>>>>> know it can throw things away even sooner since you don't want them in
>>>>> the GFF3. It provides a minor improvement for memory use, but
>>>>> max_dna_len is the big one that has the greatest effect.
>>>>>
>>>>>
>>>>> (3) I also have some concerns about the speed, especially for the long
>>>>> scaffolds (around 100Mb). I wonder which part is the most time consuming
>>>>> for genome annotation (repeat masking, blast, or polishing?).
>>>>> Particularly, I wonder whether the blastx of protein evidence will take
>>>>> majority of time. Now, I have prepared 99k mammalian Swiss protein
>>>>> sequences and 340k rodent TrEMBL protein sequences as protein evidences. I
>>>>> am considering whether I can save much time if I only use the 99k mammalian
>>>>> Swiss protein sequences as evidences.
>>>>>
>>>>>
>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at least
>>>>> 6 times slower than BLASTN
>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at
>>>>> least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>>
>>>>> Also, doubling the dataset size doubles the runtime. Larger window sizes
>>>>> via max_dna_len will also increase runtimes.
>>>>>
>>>>>
>>>>> (4) For some reason, I cannot run maker through MPI on our cluster.
>>>>> So I can only start multiple maker. I wonder if it is possible to let
>>>>> multiple maker to annotate the same long scaffold (i.e., for a single
>>>>> sequence I start multiple maker, without splitting the long sequence into
>>>>> shorter ones).
>>>>>
>>>>>
>>>>> Without MPI you won't be able to split up large contigs. At the very
>>>>> least you can try and run on a single node and set MPI to use all CPUs on
>>>>> that node. It's less difficult to set up compared to cross node jobs via
>>>>> MPI.
>>>>>
>>>>>
>>>>> (5) Still about the speed issue. I read some of your comments about
>>>>> "cpus" parameters in the maker_opts file
>>>>> (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html).
>>>>> And I know it indicates the number of
>>>>> cpus for a single chunk. So if I set "cpus=2" in the maker_opts file, then
>>>>> I can use the following command to submit the job, right?
>>>>>
>>>>>
>>>>> The cpu parameter only affects how many CPUs are given to the blast
>>>>> command line. So only the BLAST step will speed up, so I recommend using
>>>>> MPI to get all steps to speed up. Even if you are only running on a single
>>>>> node, you can give all CPUs to the mpiexec command.
>>>>>
>>>>>
>>>>> --Carson
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>

From carsonhh at gmail.com  Mon Sep 11 11:46:39 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 11:46:39 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
Message-ID: <A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>

Each node is a single machine. Because you currently run without MPI, each MAKER job you submit runs on a single machine. So you are either running multiple times on the same node, or you submitted 5 separate batch jobs in which case you may have a single maker process on each of 5 nodes.

MPI can parallelize on the same node or across nodes. If you request 10 nodes, then it can communicate across nodes to run the job on all hardware. Or you can run MPI on a single node and ask for all CPUs on that node. In that case it will split up work within a single node and use all resources just on that node. So if you can't get MPI to work across nodes, you can just submit a job that goes to a single node and ask for all CPUs on that node (multinode jobs may be hard to configure, but single node jobs are very easy). Just set the -n parameter of mpiexec to the CPU count of that node, and it will parallelize within the node.

Example command for a 20 CPU node ->  mpiexec -n 20 maker
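
A minimal Python sketch of the single-node launch described above, in case the command needs to be built inside a job script. The helper names (slots_for_node, single_node_maker_command) are hypothetical, and reading NSLOTS is an assumption that the scheduler is Grid Engine style and exports the granted slot count in that variable; adjust for your site.

```python
import os
import shlex


def slots_for_node(env=None):
    """Return how many MPI processes to start on this node."""
    env = os.environ if env is None else env
    # Assumption: Grid Engine style schedulers export the granted slot
    # count as NSLOTS. Outside a scheduled job, use the local CPU count.
    value = env.get("NSLOTS", "")
    if value.isdigit() and int(value) > 0:
        return int(value)
    return os.cpu_count() or 1


def single_node_maker_command(cpu_count, maker_args=()):
    """Build the mpiexec argument list for running MAKER on one node."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return ["mpiexec", "-n", str(cpu_count), "maker", *maker_args]


# Simulate a job that was granted 20 slots on one node.
cmd = single_node_maker_command(slots_for_node({"NSLOTS": "20"}))
print(shlex.join(cmd))  # mpiexec -n 20 maker
```

To actually launch it, the list can be passed to subprocess.run(cmd, check=True) from the directory that holds the maker_opts file. Building an argument list rather than a shell string avoids quoting problems if extra MAKER arguments are added later.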

--Carson


> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson: 
> 
> Would you please explain what you mean by "a single machine"? I am running maker2 on our high performance cluster. The cluster has more than 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine is used as the scheduler. Can I use MPICH3?
> 
> Thanks
> 
> Best
> Quanwei
> 
> 2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
> If you are just using a single machine (and not cross machine MPI), use MPICH3 -> https://www.mpich.org
> 
> It's easy to install yourself, and tends to be very robust to failure.
> 
> --Carson
> 
> 
> 
>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>> 
>> Dear Carson:
>> 
>> I met some problems to use MPI. I will give it another try.
>> Thank you!
>> 
>> Best
>> Quanwei
>> 
>> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>> It could be either. Please use MPI instead of starting multiple instances. It will greatly reduce both IO and RAM usage.
>> 
>> --Carson
>> 
>> 
>> 
>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>> 
>>> Dear Carson:
>>> 
>>> I only run 5 Maker instances in each directory (and set cpus=2). If it is related to memory issue or an IO issue, I am not sure why the much longer scaffolds (than the failed ones) were all annotated successfully, but the relatively shorter ones failed.  
>>> 
>>> I have set "tries=5" (#number of times to try a contig if there is a failure for some reason). I will try "clean_try=1" and test on the failed scaffolds individually with larger memory to see whether they can be annotated. 
>>> 
>>> Thank you!
>>> 
>>> Best
>>> Quanwei
>>> 
>>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>> I think the cause of the error may have been a little further upstream from what you pasted in the e-mail. One thing that may be happening is that you are taxing resources (like IO) if running MAKER multiple times or on too many CPUs. That can lead to failures because of truncated BLAST reports etc. In which case you can just retry and that will get around those types of IO derived errors. MAKER can generate a lot of IO, and if you are working on network mounted locations (i.e. the storage being used is actually across the network), then they can be less robust than local storage (when under heavy load NFS can falsely report success on read/write operations that actually failed). It's the reason we built in the retry capabilities of MAKER.
>>> 
>>> For contigs that continuously fail, you may need to set clean_try=1. That will cause failures to start from scratch (i.e. delete all old reports on failure rather than just those suspected of being truncated).
>>> 
>>> --Carson
>>> 
>>> 
>>> 
>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>>> 
>>>> Dear Carson:
>>>> 
>>>> About the error in my above email, I found the contig was correctly annotated on the second retry. So please ignore my last email. But now, for a small number of scaffolds, I ran into problems processing the repeats (as shown below). I used both the Mammalia repeat library and a species-specific repeat library (which was generated by your pipeline "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic"). There were no such problems when I only used the Mammalia repeat library. Do you have any ideas about this? What could be the reason? Or do you have any suggestions for me to find the reason? Many thanks
>>>> 
>>>> Here are some parameters I used
>>>> 
>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
>>>> 
>>>> max_dna_len=300000
>>>> split_hit=40000
>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>> 
>>>> 
>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>>>> 33708 --> rank=NA, hostname=n409
>>>> 33709 ERROR: Failed while processing all repeats
>>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>>> 33711 FAILED CONTIG:Contig31
>>>> 
>>>> 
>>>> Best
>>>> Quanwei
>>>> 
>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>>> Dear Carson:
>>>> 
>>>> I got the following error again. Is this still related to memory issues? I wonder whether there can be other reasons lead to this error? This time, I got this error during training of the SNAP model. Before, even I set  max_dna_len=1Mb, I can train the model successfully.  And in the current training (where I get the following error),  I have decreased the max_dna_len to 300kb. I required the same amount memory as before. The only difference is that I am using both mammalian repeat library and species specific repeat library, while previously I only use the mammalian repeat library. Will it greatly increases the requirement of memory to use both repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? I have also set the depth_blast as 30 in current training.
>>>> 
>>>> Thank you! Have a nice weekend! 
>>>> 
>>>> 
>>>> 
>>>> #---------------------------------------------------------------------
>>>> Now starting the contig!!
>>>> SeqID: Contig10
>>>> Length: 18773588
>>>> #---------------------------------------------------------------------
>>>> 
>>>> 
>>>> setting up GFF3 output and fasta chunks
>>>> doing repeat masking
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> collecting blastx repeatmasking
>>>> processing all repeats
>>>> doing repeat masking
>>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
>>>> --> rank=NA, hostname=n224
>>>> ERROR: Failed while doing repeat masking
>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>> FAILED CONTIG:Contig10
>>>> 
>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>> FAILED CONTIG:Contig10
>>>> 
>>>> Best
>>>> Quanwei
>>>> 
>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>> 
>>>>> (2) By reading some of your replies in the maker google group, and I noticed that it can reduce memory and save time for annotation if I set depth_blast to a certain number. So I changed the following parameters. But I wonder, whether it will decrease the quality of annotation? If it won't affect the quality, can I even use a smaller number (e.g., 20) to save more memory and time?
>>>>> 
>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>> 
>>>> These values really only affect the final evidence kept in the GFF3 when you look at it in a browser. They have no effect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. By setting it lower, you are just letting MAKER know it can throw things away even sooner since you don't want them in the GFF3. It provides a minor improvement for memory use, but max_dna_len is the big one that has the greatest effect.
>>>> 
>>>> 
>>>>> (3) I also have some concerns about the speed, especially for the long scaffolds (around 100Mb). I wonder which part is the most time consuming for genome annotation (repeat masking, blast, or polishing?).  Particularly, I wonder whether the blastx of protein evidence will take majority of time. Now, I have prepared 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences as protein evidences. I am considering whether I can save much time if I only use the 99k mammalian Swiss protein sequences as evidences.
>>>> 
>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>> BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN
>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX
>>>> 
>>>> Also, doubling the dataset size doubles the runtime. Larger window sizes via max_dna_len will also increase runtimes.
>>>> 
>>>> 
>>>>> (4) For some reason, I cannot run maker through MPI on our cluster. So I can only start multiple maker. I wonder if it is possible to let multiple maker to annotate the same long scaffold (i.e., for a single sequence I start multiple maker, without splitting the long sequence into shorter ones).
>>>> 
>>>> Without MPI you won't be able to split up large contigs. At the very least you can try and run on a single node and set MPI to use all CPUs on that node. It's less difficult to set up compared to cross node jobs via MPI.
>>>> 
>>>> 
>>>>> (5) Still about the speed issue. I read some of your comments about "cpus" parameters in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html). And I know it indicates the number of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?
>>>> 
>>>> The cpu parameter only affects how many CPUs are given to the blast command line. So only the BLAST step will speed up, so I recommend using MPI to get all steps to speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
>>>> 
>>>> 
>>>> --Carson
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 


From qwzhang0601 at gmail.com  Mon Sep 11 12:33:51 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 11 Sep 2017 14:33:51 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
	<A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
Message-ID: <CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>

Dear Carson:

I see. Thank you. I will try it.

Best
Quanwei

2017-09-11 13:46 GMT-04:00 Carson Holt <carsonhh at gmail.com>:

> Each node is a single machine. Because you currently run without MPI, each
> MAKER job you submit runs on a single machine. So you are either running
> multiple times on the same node, or you submitted 5 separate batch jobs in
> which case you may have a single maker process on each of 5 nodes.
>
> MPI can parallelize on the same node or across nodes. If you request 10
> nodes, then it can communicate across nodes to run the job on all hardware.
> Or you can run MPI on a single node and ask for all CPUs on that node. In
> that case it will split up work within a single node and use all resources
> just on that node. So if you can't get MPI to work across nodes, you can
> just submit a job that goes to a single node and ask for all CPUs on that
> node (multinode jobs may be hard to configure, but single node jobs are
> very easy). Just set the -n parameter of mpiexec to the CPU count of that
> node, and it will parallelize within the node.
>
> Example command for a 20 CPU node ->  mpiexec -n 20 maker
>
> --Carson
>
>
>
>
>
> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>
> Dear Carson:
>
> Would you please explain what you mean by "a single machine"? I am
> running maker2 on our high performance cluster. The cluster has more than
> 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine is used
> as the scheduler. Can I use MPICH3?
>
> Thanks
>
> Best
> Quanwei
>
> 2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>
>> If you are just using a single machine (and not cross machine MPI), use
>> MPICH3 -> https://www.mpich.org
>>
>> It's easy to install yourself, and tends to be very robust to failure.
>>
>> --Carson
>>
>>
>>
>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>> wrote:
>>
>> Dear Carson:
>>
>> I met some problems to use MPI. I will give it another try.
>> Thank you!
>>
>> Best
>> Quanwei
>>
>> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>
>>> It could be either. Please use MPI instead of starting multiple
>>> instances. It will greatly reduce both IO and RAM usage.
>>>
>>> --Carson
>>>
>>>
>>>
>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>> wrote:
>>>
>>> Dear Carson:
>>>
>>> I only run 5 Maker instances in each directory (and set cpus=2). If it
>>> is related to memory issue or an IO issue, I am not sure why the much
>>> longer scaffolds (than the failed ones) were all annotated successfully,
>>> but the relatively shorter ones failed.
>>>
>>> I have set "tries=5" (#number of times to try a contig if there is a
>>> failure for some reason). I will try "clean_try=1" and test on the failed
>>> scaffolds individually with larger memory to see whether they can be
>>> annotated.
>>>
>>> Thank you!
>>>
>>> Best
>>> Quanwei
>>>
>>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>
>>>> I think the cause of the error may have been a little further upstream
>>>> from what you pasted in the e-mail. One thing that may be happening is that
>>>> you are taxing resources (like IO) if running MAKER multiple times or on
>>>> too many CPUs. That can lead to failures because of truncated BLAST reports
>>>> etc. In which case you can just retry and that will get around those types
>>>> of IO derived errors. MAKER can generate a lot of IO, and if you are
>>>> working on network mounted locations (i.e. the storage being used is
>>>> actually across the network), then they can be less robust than local
>>>> storage (when under heavy load NFS can falsely report success on read/write
>>>> operations that actually failed). It's the reason we built in the retry
>>>> capabilities of MAKER.
>>>>
>>>> For contigs that continuously fail, you may need to set clean_try=1.
>>>> That will cause failures to start from scratch (i.e. delete all old reports
>>>> on failure rather than just those suspected of being truncated).
>>>>
>>>> --Carson
>>>>
>>>>
>>>>
>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>> wrote:
>>>>
>>>> Dear Carson:
>>>>
>>>> About the error in my above email, I found the contig was correctly
>>>> annotated on the second retry. So please ignore my last email. But
>>>> now, for a small number of scaffolds, I ran into problems processing the
>>>> repeats (as shown below). I used both the Mammalia repeat library and a
>>>> species-specific repeat library (which was generated by your pipeline
>>>> "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic").
>>>> There were no such problems when I
>>>> only used the Mammalia repeat library. Do you have any ideas about this? What
>>>> could be the reason? Or do you have any suggestions for me to find the
>>>> reason? Many thanks
>>>>
>>>> Here are some parameters I used
>>>>
>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>> model_org=Mammalia #select a model organism for RepBase masking in
>>>> RepeatMasker
>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism
>>>> specific repeat library in fasta format for Repe
>>>>
>>>> max_dna_len=300000
>>>> split_hit=40000
>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>
>>>>
>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm
>>>> line 188.
>>>> 33708 --> rank=NA, hostname=n409
>>>> 33709 ERROR: Failed while processing all repeats
>>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>>> 33711 FAILED CONTIG:Contig31
>>>>
>>>>
>>>> Best
>>>> Quanwei
>>>>
>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>>>
>>>>> Dear Carson:
>>>>>
>>>>> I got the following error again. Is this still related to memory
>>>>> issues, or could something else lead to this error? This time I got
>>>>> the error while training the SNAP model. Before, even with
>>>>> max_dna_len set to 1Mb, I could train the model successfully. In the
>>>>> current training (where I get the following error) I have decreased
>>>>> max_dna_len to 300kb and requested the same amount of memory as before.
>>>>> The only difference is that I am now using both the mammalian repeat
>>>>> library and the species-specific repeat library, while previously I only
>>>>> used the mammalian repeat library. Does using both repeat libraries
>>>>> greatly increase the memory requirement (even when I decrease
>>>>> max_dna_len from 1Mb to 300kb)? I have also set the depth_blast
>>>>> parameters to 30 in the current training.
>>>>>
>>>>> Thank you! Have a nice weekend!
>>>>>
>>>>>
>>>>>
>>>>> #---------------------------------------------------------------------
>>>>> Now starting the contig!!
>>>>> SeqID: Contig10
>>>>> Length: 18773588
>>>>> #---------------------------------------------------------------------
>>>>>
>>>>>
>>>>> setting up GFF3 output and fasta chunks
>>>>> doing repeat masking
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> collecting blastx repeatmasking
>>>>> processing all repeats
>>>>> doing repeat masking
>>>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm
>>>>> line 1050.
>>>>> --> rank=NA, hostname=n224
>>>>> ERROR: Failed while doing repeat masking
>>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>>> FAILED CONTIG:Contig10
>>>>>
>>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>>> FAILED CONTIG:Contig10
>>>>>
>>>>> Best
>>>>> Quanwei
>>>>>
>>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>
>>>>>>
>>>>>> (2) By reading some of your replies in the maker google group, and I
>>>>>> noticed that it can reduce memory and save time for annotation if I set
>>>>>> depth_blast to a certain number. So I changed the following parameters. But
>>>>>> I wonder, whether it will decrease the quality of annotation? If it won't
>>>>>> affect the quality, can I even use a smaller number (e.g., 20) to save more
>>>>>> memory and time?
>>>>>>
>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>>
>>>>>>
>>>>>> These values really only affect the final evidence kept in the GFF3
>>>>>> when you look at it in a browser. They have no effect on the annotation.
>>>>>> This is because internally MAKER already collapses evidence down to the
>>>>>> 10 best non-redundant features per evidence set per locus. The rest are
>>>>>> put in the GFF3 just for reference. By setting the values lower, you are
>>>>>> just letting MAKER know it can throw things away even sooner since you
>>>>>> don't want them in the GFF3. It provides a minor improvement for memory
>>>>>> use, but max_dna_len is the big one that has the greatest effect.
>>>>>>
>>>>>>
>>>>>> (3) I also have some concerns about the speed, especially for the
>>>>>> long scaffolds (around 100Mb). I wonder which part is the most time
>>>>>> consuming for genome annotation (repeat masking, blast, or polishing?).
>>>>>> Particularly, I wonder whether the blastx of protein evidence will take
>>>>>> majority of time. Now, I have prepared 99k mammalian Swiss protein
>>>>>> sequences and 340k rodent TrEMBL protein sequences as protein evidences. I
>>>>>> am considering whether I can save much time if I only use the 99k mammalian
>>>>>> Swiss protein sequences as evidences.
>>>>>>
>>>>>>
>>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at least
>>>>>> 6 times slower than BLASTN
>>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at
>>>>>> least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>>>
>>>>>> Also, double the dataset size and you double the runtime. Larger window
>>>>>> sizes via max_dna_len will also increase runtimes.
>>>>>>
>>>>>>
>>>>>> (4) For some reason, I cannot run maker through MPI on our cluster,
>>>>>> so I can only start multiple maker instances. Is it possible to let
>>>>>> multiple maker instances annotate the same long scaffold (i.e., start
>>>>>> multiple maker instances on a single sequence, without splitting the long
>>>>>> sequence into shorter ones)?
>>>>>>
>>>>>>
>>>>>> Without MPI you won't be able to split up large contigs. At the very
>>>>>> least you can try and run on a single node and set MPI to use all CPUs on
>>>>>> that node. It's less difficult to set up compared to cross node jobs via
>>>>>> MPI.
>>>>>>
>>>>>>
>>>>>> (5) Still about the speed issue. I read some of your comments about
>>>>>> the "cpus" parameter in the maker_opts file (
>>>>>> http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html).
>>>>>> I understand that it indicates the number of cpus for a single chunk. So
>>>>>> if I set "cpus=2" in the maker_opts file, then I can use the following
>>>>>> command to submit the job, right?
>>>>>>
>>>>>>
>>>>>> The cpus parameter only affects how many CPUs are given to the blast
>>>>>> command line, so only the BLAST step will speed up. I recommend using
>>>>>> MPI to get all steps to speed up. Even if you are only running on a single
>>>>>> node, you can give all CPUs to the mpiexec command.
>>>>>>
>>>>>>
>>>>>> --Carson
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>

From qwzhang0601 at gmail.com  Wed Sep 13 08:51:32 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 13 Sep 2017 10:51:32 -0400
Subject: [maker-devel] Repeats annotation
Message-ID: <CAOW6FS+YBobDCpEAZ=8YYoV+LKrTV7qJhmKYktm15pAh84i3Kw@mail.gmail.com>

Dear Carson:

We generated a species-specific repeat library following your pipeline (
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic)
and annotated the genome with maker2 using both the species-specific repeat
library and the mammalian repeat library.

Now we want to compare the repeat content among different species, so I
want to generate species-specific repeat libraries for the other species and
again use both the species-specific and the mammalian repeat library for
each of them. But I found that I can give RepeatMasker either the
species-specific repeat library or the mammalian repeat library, not both
at once. Can I run maker2 on those genomes for repeat masking only?

BTW, running RepeatMasker gives a summary report (as below). Is there a
script in maker2 that analyzes repeat elements in this way (or another tool
that processes the output of maker2)?

Many thanks


file name: test_scaffold31.fasta
sequences:             1
total length:     863590 bp  (858757 bp excl N/X-runs)
GC level:         37.02 %
bases masked:     301634 bp ( 34.93 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:               134        14362 bp    1.66 %
      Alu/B1          28         2183 bp    0.25 %
      MIRs            21         2860 bp    0.33 %

LINEs:               188       129104 bp   14.95 %
      LINE1          168       124633 bp   14.43 %
      LINE2           16         4266 bp    0.49 %
      L3/CR1           4          205 bp    0.02 %
      RTE              0            0 bp    0.00 %

LTR elements:        127       101129 bp   11.71 %
      ERVL            10         3057 bp    0.35 %
      ERVL-MaLRs      22         6902 bp    0.80 %
      ERV_classI      66        80258 bp    9.29 %
      ERV_classII     29        10912 bp    1.26 %

DNA elements:         27         4402 bp    0.51 %
      hAT-Charlie     13         1836 bp    0.21 %
      TcMar-Tigger     8         1651 bp    0.19 %

Unclassified:          4         1590 bp    0.18 %

Total interspersed repeats:    250587 bp   29.02 %


Small RNA:             9          616 bp    0.07 %

Satellites:           66        40820 bp    4.73 %
Simple repeats:      159         7235 bp    0.84 %
Low complexity:       50         2766 bp    0.32 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element


The query species was assumed to be mammalia
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127

run with rmblastn version 2.2.27+
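
A RepeatMasker summary like the one above can be reduced to a few numbers
for cross-species or cross-library comparison. The following is a minimal
Python sketch, not a MAKER or RepeatMasker script; the function names
parse_summary and compare are hypothetical, and it assumes the plain-text
layout shown above (a "bases masked:" line plus unindented class rows such
as "LINEs:").

```python
import re

# "bases masked:     301634 bp ( 34.93 %)"
MASKED_RE = re.compile(r"^bases masked:\s+(\d+)\s+bp\s+\(\s*([\d.]+)\s*%\)")
# Unindented class rows only, e.g. "LINEs:   188   129104 bp   14.95 %".
# Indented subfamily rows (LINE1, Alu/B1, ...) do not match on purpose.
CLASS_RE = re.compile(r"^([A-Za-z][A-Za-z ]*?):\s+(\d+)\s+(\d+)\s+bp\s+([\d.]+)\s*%")


def parse_summary(text):
    """Extract total masked bases and per-class rows from one summary."""
    summary = {"masked_bp": None, "masked_pct": None, "classes": {}}
    for line in text.splitlines():
        match = MASKED_RE.match(line)
        if match:
            summary["masked_bp"] = int(match.group(1))
            summary["masked_pct"] = float(match.group(2))
            continue
        match = CLASS_RE.match(line)
        if match:
            name, count, length_bp, pct = match.groups()
            summary["classes"][name] = {
                "count": int(count),
                "bp": int(length_bp),
                "pct": float(pct),
            }
    return summary


def compare(first, second):
    """Return (class, pct_first, pct_second, difference) rows, sorted by class."""
    names = sorted(set(first["classes"]) | set(second["classes"]))
    rows = []
    for name in names:
        pct_a = first["classes"].get(name, {}).get("pct", 0.0)
        pct_b = second["classes"].get(name, {}).get("pct", 0.0)
        rows.append((name, pct_a, pct_b, round(pct_b - pct_a, 2)))
    return rows
```

Calling parse_summary on the text of each summary file and passing the two
results to compare gives one row per repeat class with both percentages and
their difference.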

From qwzhang0601 at gmail.com  Wed Sep 13 08:32:34 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 13 Sep 2017 10:32:34 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
	<A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
	<CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>
Message-ID: <CAOW6FSJcdMTuLbN8-O-Vb7OxEXvU=F3Nwz7Rhx6qzXJ30ubV1w@mail.gmail.com>

Dear Carson:

I did more tests on one of the contigs (length 863kb) that failed during
repeat masking. It only fails when I add the species-specific repeat
library; it is annotated successfully when only the mammalian repeat library
is used. For the test I ran maker on this contig alone with 64G of memory,
so I do not think the failure is a memory or IO problem, because even
contigs of length 98Mb can be annotated with 32G of memory.

I also ran RepeatMasker on this contig with the mammalian and the
species-specific repeat library separately. With the mammalian repeat
library about 35% was masked as repeats, while with the species-specific
repeat library it was 65% (as shown below). Could the high level of repeats
lead to the failure of this contig? Do you have any ideas about this?
Thanks


file name: test_scaffold31.fasta
sequences:             1
total length:     863590 bp  (858757 bp excl N/X-runs)
GC level:         37.02 %
bases masked:     562909 bp ( 65.18 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:              113        16134 bp    1.87 %
      ALUs           71        12479 bp    1.45 %
      MIRs            1          133 bp    0.02 %

LINEs:              251       380142 bp   44.02 %
      LINE1         211       210623 bp   24.39 %
      LINE2           1           86 bp    0.01 %
      L3/CR1          0            0 bp    0.00 %

LTR elements:       246       101221 bp   11.72 %
      ERVL            5         1037 bp    0.12 %
      ERVL-MaLRs     18         2744 bp    0.32 %
      ERV_classI    201        90942 bp   10.53 %
      ERV_classII    18         5964 bp    0.69 %

DNA elements:        39        14177 bp    1.64 %
     hAT-Charlie      7         3864 bp    0.45 %
     TcMar-Tigger     7         1706 bp    0.20 %

Unclassified:       196        45831 bp    5.31 %

Total interspersed repeats:   557505 bp   64.56 %


Small RNA:            3          823 bp    0.10 %

Satellites:           2          237 bp    0.03 %
Simple repeats:      94         4472 bp    0.52 %
Low complexity:      18          766 bp    0.09 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element


The query species was assumed to be homo
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127

run with rmblastn version 2.2.27+
The query was compared to classified sequences in
".../consensi.fa.classifiednoProtFinal"


Best
Quanwei

2017-09-11 14:33 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:

> Dear Carson:
>
> I see. Thank you. I will try it.
>
> Best
> Quanwei
>
> 2017-09-11 13:46 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>
>> Each node is a single machine. Because you currently run without MPI,
>> each MAKER job you submit runs on a single machine. So you are either
>> running multiple times on the same node, or you submitted 5 separate batch
>> jobs in which case you may have a single maker process on each of 5 nodes.
>>
>> MPI can parallelize on the same node or across nodes. If you request 10
>> nodes, then it can communicate across nodes to run the job on all hardware.
>> Or you can run MPI on a single node and ask for all CPUs on that node. In
>> that case it will split up work within a single node and use all resources
>> just on that node. So if you can't get MPI to work across nodes, you can
>> just submit a job that goes to a single node and ask for all CPUs on that
>> node (multinode jobs may be hard to configure, but single node jobs are
>> very easy). Just set the -n parameter of mpiexec to the CPU count of that
>> node, and it will parallelize within the node.
>>
>> Example command for a 20 CPU node ->  mpiexec -n 20 maker
>>
>> --Carson
>>
>>
>>
>>
>>
>> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>> wrote:
>>
>> Dear Carson:
>>
>> Would you please explain what do you mean by "a single machine"? I am
>> running maker2 on our high performance cluster. The cluster has more than
>> 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine was used
>> as the scheduler. Can I use MPICH3?
>>
>> Thanks
>>
>> Best
>> Quanwei
>>
>> 2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>
>>> If you are just using a single machine (and not cross machine MPI), use
>>> MPICH3 -> https://www.mpich.org
>>>
>>> It's easy to install yourself, and tends to be very robust to failure.
>>>
>>> --Carson
>>>
>>>
>>>
>>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>> wrote:
>>>
>>> Dear Carson:
>>>
>>> I met some problems to use MPI. I will give it another try.
>>> Thank you!
>>>
>>> Best
>>> Quanwei
>>>
>>> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>
>>>> It could be either. Please use MPI instead of starting multiple
>>>> instances. It will greatly reduce both IO and RAM usage.
>>>>
>>>> --Carson
>>>>
>>>>
>>>>
>>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>> wrote:
>>>>
>>>> Dear Carson:
>>>>
>>>> I only run 5 Maker instances in each directory (and set cpus=2). If it
>>>> is related to memory issue or an IO issue, I am not sure why the much
>>>> longer scaffolds (than the failed ones) were all annotated successfully,
>>>> but the relatively shorter ones failed.
>>>>
>>>> I have set "tries=5" (#number of times to try a contig if there is a
>>>> failure for some reason). I will try "clean_try=1" and test on the failed
>>>> scaffolds individually with larger memory to see whether they can be
>>>> annotated.
>>>>
>>>> Thank you!
>>>>
>>>> Best
>>>> Quanwei
>>>>
>>>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>
>>>>> I think the cause of the error may have been a little further upstream
>>>>> from what you pasted in the e-mail. One thing that may be happening is that
>>>>> you are taxing resources (like IO) if running MAKER multiple times or on
>>>>> too many CPUs. That can lead to failures because of truncated BLAST reports
>>>>> etc. In which case you can just retry and that will get around those types
>>>>> of IO derived errors. MAKER can generate a lot of IO, and if you are
>>>>> working on network mounted locations (i.e. the storage being used is
>>>>> actually across the network), then they can be less robust than local
>>>>> storage (when under heavy load NFS can falsely report success on read/write
>>>>> operations that actually failed). It's the reason we built in the retry
>>>>> capabilities of MAKER.
>>>>>
>>>>> For contigs that continuously fail, you may need to set clean_try=1.
>>>>> That will cause failures to start from scratch (i.e. delete all old reports
>>>>> on failure rather than just those suspected of being truncated).
>>>>>
>>>>> --Carson
>>>>>
>>>>>
>>>>>
>>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>>> wrote:
>>>>>
>>>>> Dear Carson:
>>>>>
>>>>> About the error in my above email, I found the contig was correctly
>>>>> annotated on the second RETRY, so please ignore my last email. But
>>>>> now, for a few scaffolds, I have problems processing the repeats
>>>>> (as shown below). I used both the Mammalia repeat library and a
>>>>> species-specific repeat library (generated with your pipeline,
>>>>> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic).
>>>>> There were no such problems when I only used the Mammalia repeat library.
>>>>> Do you have any ideas about this? What could be the reason? Or do you have
>>>>> any suggestions for finding the reason? Many thanks
>>>>>
>>>>> Here are some parameters I used
>>>>>
>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>> model_org=Mammalia #select a model organism for RepBase masking in
>>>>> RepeatMasker
>>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism
>>>>> specific repeat library in fasta format for Repe
>>>>>
>>>>> max_dna_len=300000
>>>>> split_hit=40000
>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>
>>>>>
>>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm
>>>>> line 188.
>>>>> 33708 --> rank=NA, hostname=n409
>>>>> 33709 ERROR: Failed while processing all repeats
>>>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>>>> 33711 FAILED CONTIG:Contig31
>>>>>
>>>>>
>>>>> Best
>>>>> Quanwei
>>>>>
>>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>>>>
>>>>>> Dear Carson:
>>>>>>
>>>>>> I got the following error again. Is this still related to memory
>>>>>> issues, or could something else lead to this error? This time I got
>>>>>> the error while training the SNAP model. Before, even with
>>>>>> max_dna_len set to 1Mb, I could train the model successfully. In the
>>>>>> current training (where I get the following error) I have decreased
>>>>>> max_dna_len to 300kb and requested the same amount of memory as before.
>>>>>> The only difference is that I am now using both the mammalian repeat
>>>>>> library and the species-specific repeat library, while previously I only
>>>>>> used the mammalian repeat library. Does using both repeat libraries
>>>>>> greatly increase the memory requirement (even when I decrease
>>>>>> max_dna_len from 1Mb to 300kb)? I have also set the depth_blast
>>>>>> parameters to 30 in the current training.
>>>>>>
>>>>>> Thank you! Have a nice weekend!
>>>>>>
>>>>>>
>>>>>>
>>>>>> #---------------------------------------------------------------------
>>>>>> Now starting the contig!!
>>>>>> SeqID: Contig10
>>>>>> Length: 18773588
>>>>>> #---------------------------------------------------------------------
>>>>>>
>>>>>>
>>>>>> setting up GFF3 output and fasta chunks
>>>>>> doing repeat masking
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> collecting blastx repeatmasking
>>>>>> processing all repeats
>>>>>> doing repeat masking
>>>>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm
>>>>>> line 1050.
>>>>>> --> rank=NA, hostname=n224
>>>>>> ERROR: Failed while doing repeat masking
>>>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>>>> FAILED CONTIG:Contig10
>>>>>>
>>>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>>>> FAILED CONTIG:Contig10
>>>>>>
>>>>>> Best
>>>>>> Quanwei
>>>>>>
>>>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>>
>>>>>>>
>>>>>>> (2) By reading some of your replies in the maker google group, and I
>>>>>>> noticed that it can reduce memory and save time for annotation if I set
>>>>>>> depth_blast to a certain number. So I changed the following parameters. But
>>>>>>> I wonder, whether it will decrease the quality of annotation? If it won't
>>>>>>> affect the quality, can I even use a smaller number (e.g., 20) to save more
>>>>>>> memory and time?
>>>>>>>
>>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>>>
>>>>>>>
>>>>>>> These values really only affect the final evidence kept in the GFF3
>>>>>>> when you look at it in a browser. They have no effect on the annotation.
>>>>>>> This is because internally MAKER already collapses evidence down to the
>>>>>>> 10 best non-redundant features per evidence set per locus. The rest are
>>>>>>> put in the GFF3 just for reference. By setting the values lower, you are
>>>>>>> just letting MAKER know it can throw things away even sooner since you
>>>>>>> don't want them in the GFF3. It provides a minor improvement for memory
>>>>>>> use, but max_dna_len is the big one that has the greatest effect.
>>>>>>>
>>>>>>>
>>>>>>> (3) I also have some concerns about the speed, especially for the
>>>>>>> long scaffolds (around 100Mb). I wonder which part is the most time
>>>>>>> consuming for genome annotation (repeat masking, blast, or polishing?).
>>>>>>> Particularly, I wonder whether the blastx of protein evidence will take
>>>>>>> majority of time. Now, I have prepared 99k mammalian Swiss protein
>>>>>>> sequences and 340k rodent TrEMBL protein sequences as protein evidences. I
>>>>>>> am considering whether I can save much time if I only use the 99k mammalian
>>>>>>> Swiss protein sequences as evidences.
>>>>>>>
>>>>>>>
>>>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at
>>>>>>> least 6 times slower than BLASTN
>>>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at
>>>>>>> least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>>>>
>>>>>>> Also, double the dataset size and you double the runtime. Larger window
>>>>>>> sizes via max_dna_len will also increase runtimes.
>>>>>>>
>>>>>>>
>>>>>>> (4) For some reason, I cannot run maker through MPI on our cluster,
>>>>>>> so I can only start multiple maker instances. Is it possible to let
>>>>>>> multiple maker instances annotate the same long scaffold (i.e., start
>>>>>>> multiple maker instances on a single sequence, without splitting the long
>>>>>>> sequence into shorter ones)?
>>>>>>>
>>>>>>>
>>>>>>> Without MPI you won't be able to split up large contigs. At the very
>>>>>>> least you can try and run on a single node and set MPI to use all CPUs on
>>>>>>> that node. It's less difficult to set up compared to cross node jobs via
>>>>>>> MPI.
>>>>>>>
>>>>>>>
>>>>>>> (5) Still about the speed issue. I read some of your comments about
>>>>>>> the "cpus" parameter in the maker_opts file (
>>>>>>> http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html).
>>>>>>> I understand that it indicates the number of cpus for a single chunk. So
>>>>>>> if I set "cpus=2" in the maker_opts file, then I can use the following
>>>>>>> command to submit the job, right?
>>>>>>>
>>>>>>>
>>>>>>> The cpus parameter only affects how many CPUs are given to the blast
>>>>>>> command line, so only the BLAST step will speed up. I recommend using
>>>>>>> MPI to get all steps to speed up. Even if you are only running on a single
>>>>>>> node, you can give all CPUs to the mpiexec command.
>>>>>>>
>>>>>>>
>>>>>>> --Carson
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>

From mathog at caltech.edu  Wed Sep 13 12:01:11 2017
From: mathog at caltech.edu (mathog)
Date: Wed, 13 Sep 2017 11:01:11 -0700
Subject: [maker-devel] OpenMPI issues,
 no response in two attempts to subscribe to list
Message-ID: <a1a5313e90b4f2043f9910f2624986eb@saf.bio.caltech.edu>

Greetings,

I'm trying to run maker 2.31.9 with OpenMPI on a Centos 6.7 system.  It 
just won't start.  OpenMPI works fine with a small test program, it just 
doesn't work with maker.  It fails in exactly the same way on a second 
Centos system with minor software differences (Centos 6.9 and perl 5.20 
compiled without thread support, the perl on the first machine had 
thread support.) The gory details were posted already in a Centos forum 
so rather than repeat it all here, this is a link to that thread:

    https://www.centos.org/forums/viewtopic.php?f=14&t=64099

maker was unpacked from maker-2.31.9.tgz a second time (after moving
the original) once I had added "module add openmpi-x86_64" to my
.bash_profile and logged in cleanly.  It was then rebuilt.  The build
messages were identical to the previous ones, and when a run was attempted
it failed in exactly the same way.

I also tried to subscribe to the list here

   
https://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

once yesterday, and once today, but no email ever came back.  Hopefully 
this message gets through!

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From carsonhh at gmail.com  Wed Sep 13 12:23:11 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Sep 2017 12:23:11 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSJcdMTuLbN8-O-Vb7OxEXvU=F3Nwz7Rhx6qzXJ30ubV1w@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
	<A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
	<CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>
	<CAOW6FSJcdMTuLbN8-O-Vb7OxEXvU=F3Nwz7Rhx6qzXJ30ubV1w@mail.gmail.com>
Message-ID: <6B4E0F47-7186-4B12-97A1-EB63D2EB7079@gmail.com>

These are the 3 errors you have shown in your e-mails ->
open3: fork failed: Cannot allocate memory at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.

The first two are memory related; the second occurs because MAKER cannot kill a lock maintainer thread that it was never able to start for lack of memory.

The third one is IO related. It is a truncated file that succeeded on the second try according to the e-mail you sent.


IO errors are quite common with NFS (network mounted file systems). It's one of the most frequent issues submitted to the devel list. MAKER can hit IO limits long before it hits CPU limits. One of the most frequent causes of these issues is that the user set TMP= in the control files to a manual location that is not suitable for high IO (note TMP= defaults to /tmp). The location should always be a true locally mounted disk. Sometimes the location is virtual (not really local disk, but network mounted disk or an in-memory location). With the former you will get frequent IO failures, and with the latter you will also get out-of-memory issues.

Note that when you supply more data files you will also use more memory (to hold analysis results). According to your e-mail the last error you got was 'Can't kill a non-numeric process ID'. Correct? So getting the error with two input files but not when you supply a single input file further suggests you are running low on RAM.

Some things to check:
1. Make sure TMP= is not being set to a network mounted location.
2. Make sure your temporary directory is not a virtual in-memory directory on the node being used.
3. If nodes are shared, you may run out of memory because of other users or because you failed to request enough RAM during job submission.
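
The first two checks can be approximated by looking up which filesystem type backs the temporary directory. This is a rough Python sketch that assumes a Linux node exposing /proc/mounts; the helper names and the lists of network and in-memory filesystem types are illustrative assumptions, not anything MAKER checks itself.

```python
import os

# Illustrative, incomplete lists (assumptions for this sketch).
NETWORK_FS = {"nfs", "nfs4", "cifs", "lustre", "gpfs", "beegfs"}
MEMORY_FS = {"tmpfs", "ramfs"}


def fs_type_for(path, mounts_text):
    """Return the filesystem type of the mount point that contains `path`.

    `mounts_text` uses the /proc/mounts layout:
    device mount_point fs_type options dump pass
    """
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Keep the longest mount point that is a parent of (or equal to) path.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def classify(path, mounts_text):
    """Label a directory as 'network', 'memory', 'local', or 'unknown'."""
    fs_type = fs_type_for(path, mounts_text)
    if fs_type is None:
        return "unknown"
    if fs_type in NETWORK_FS:
        return "network"
    if fs_type in MEMORY_FS:
        return "memory"
    return "local"


if __name__ == "__main__" and os.path.exists("/proc/mounts"):
    with open("/proc/mounts") as handle:
        mounts = handle.read()
    # Replace "/tmp" with whatever TMP= points to in the control file.
    print("/tmp", fs_type_for("/tmp", mounts), classify("/tmp", mounts))
```

A result of "network" or "memory" for the directory used as TMP= would match the two problem cases described above; "local" is only a best guess, since the type lists are not exhaustive.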

Finally, try running interactively so you can see what the memory and directory locations look like on the node you get assigned for the job (check space and mount points: is /tmp, or wherever you set TMP=, in fact a local disk?). Also run with MPI rather than starting multiple MAKER instances. It uses resources better.
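
During an interactive session, memory on the node can be read from /proc/meminfo, and the single-node MPI launch discussed earlier in this thread (mpiexec -n <CPU count> maker) can be built from the CPU count. A small Python sketch, assuming Linux; parse_meminfo and mpiexec_command are hypothetical helper names.

```python
import os


def parse_meminfo(text):
    """Return /proc/meminfo values (kB for most fields) keyed by field name."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def mpiexec_command(cpu_count, program="maker"):
    """Build the single-node launch from the thread: mpiexec -n <CPUs> maker."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return ["mpiexec", "-n", str(cpu_count), program]


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as handle:
        info = parse_meminfo(handle.read())
    # MemAvailable is missing on old kernels, so fall back to MemFree.
    available_kb = info.get("MemAvailable", info.get("MemFree", 0))
    print("available memory (GB):", round(available_kb / 1024 / 1024, 1))
    # os.cpu_count() reports every CPU on the node, which can exceed what
    # the scheduler actually granted to the job.
    print(" ".join(mpiexec_command(os.cpu_count() or 1)))
```

On a shared node the CPU count should come from the scheduler's allocation rather than from os.cpu_count(), for the same reason that shared nodes can run short of memory.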

Thanks,
Carson


> On Sep 13, 2017, at 8:32 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> I did more tests on one of the contigs (length 863kb) that failed during repeat masking. It only fails when I add the species-specific repeat library; it is annotated successfully when only the mammalian repeat library is used. For the test I ran maker on this contig alone with 64G of memory, so I do not think the failure is a memory or IO problem, because even contigs of length 98Mb can be annotated with 32G of memory.
> 
> I also ran RepeatMasker on this contig with the mammalian and the species-specific repeat library separately. With the mammalian repeat library about 35% was masked as repeats, while with the species-specific repeat library it was 65% (as shown below). Could the high level of repeats lead to the failure of this contig? Do you have any ideas about this? Thanks
> 
> 
> 
> file name: test_scaffold31.fasta    
> sequences:             1
> total length:     863590 bp  (858757 bp excl N/X-runs)
> GC level:         37.02 %
> bases masked:     562909 bp ( 65.18 %)
> ==================================================
>                number of      length   percentage
>                elements*    occupied  of sequence
> --------------------------------------------------
> SINEs:              113        16134 bp    1.87 %
>       ALUs           71        12479 bp    1.45 %
>       MIRs            1          133 bp    0.02 %
> 
> LINEs:              251       380142 bp   44.02 %
>       LINE1         211       210623 bp   24.39 %
>       LINE2           1           86 bp    0.01 %
>       L3/CR1          0            0 bp    0.00 %
> 
> LTR elements:       246       101221 bp   11.72 %
>       ERVL            5         1037 bp    0.12 %
>       ERVL-MaLRs     18         2744 bp    0.32 %
>       ERV_classI    201        90942 bp   10.53 %
>       ERV_classII    18         5964 bp    0.69 %
> 
> DNA elements:        39        14177 bp    1.64 %
>      hAT-Charlie      7         3864 bp    0.45 %
>      TcMar-Tigger     7         1706 bp    0.20 %
> 
> Unclassified:       196        45831 bp    5.31 %
> 
> Total interspersed repeats:   557505 bp   64.56 %
> 
> 
> Small RNA:            3          823 bp    0.10 %
> 
> Satellites:           2          237 bp    0.03 %
> Simple repeats:      94         4472 bp    0.52 %
> Low complexity:      18          766 bp    0.09 %
> ==================================================
> 
> * most repeats fragmented by insertions or deletions
>   have been counted as one element
>                                                       
> 
> The query species was assumed to be homo          
> RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
>         
> run with rmblastn version 2.2.27+
> The query was compared to classified sequences in ".../consensi.fa.classifiednoProtFinal"  
> 
> 
> Best
> Quanwei
> 
> 2017-09-11 14:33 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com <mailto:qwzhang0601 at gmail.com>>:
> Dear Carson:
> 
> I see. Thank you. I will try it.
> 
> Best
> Quanwei
> 
> 2017-09-11 13:46 GMT-04:00 Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>>:
> Each node is a single machine. Because you currently run without MPI, each MAKER job you submit runs on a single machine. So you are either running multiple times on the same node, or you submitted 5 separate batch jobs in which case you may have a single maker process on each of 5 nodes.
> 
> MPI can parallelize on the same node or across nodes. If you request 10 nodes, then it can communicate across nodes to run the job on all hardware. Or you can run MPI on a single node and ask for all CPUs on that node. In that case it will split up work within a single node and use all resources just on that node. So if you can't get MPI to work across nodes, you can just submit a job that goes to a single node and ask for all CPUs on that node (multinode jobs may be hard to configure, but single node jobs are very easy). Just set the -n parameter of mpiexec to the CPU count of that node, and it will parallelize within the node.
> 
> Example command for a 20 CPU node -->  mpiexec -n 20 maker
> 
> --Carson
> 
> 
> 
> 
> 
>> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang <qwzhang0601 at gmail.com <mailto:qwzhang0601 at gmail.com>> wrote:
>> 
>> Dear Carson: 
>> 
>> Would you please explain what do you mean by "a single machine"? I am running maker2 on our high performance cluster. The cluster has more than 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine was used as the scheduler. Can I use MPICH3?
>> 
>> Thanks
>> 
>> Best
>> Quanwei
>> 
>> 2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>>:
>> If you are just using a single machine (and not cross machine MPI), use MPICH3 --> https://www.mpich.org
>> 
>> It's easy to install yourself, and tends to be very robust to failure.
>> 
>> --Carson
>> 
>> 
>> 
>>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com <mailto:qwzhang0601 at gmail.com>> wrote:
>>> 
>>> Dear Carson:
>>> 
>>> I met some problems to use MPI. I will give it another try.
>>> Thank you!
>>> 
>>> Best
>>> Quanwei
>>> 
>>> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>>:
>>> It could be either. Please use MPI instead of starting multiple instances. It will greatly reduce both IO and RAM usage.
>>> 
>>> --Carson
>>> 
>>> 
>>> 
>>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com <mailto:qwzhang0601 at gmail.com>> wrote:
>>>> 
>>>> Dear Carson:
>>>> 
>>>> I only run 5 Maker instances in each directory (and set cpus=2). If it is related to memory issue or an IO issue, I am not sure why the much longer scaffolds (than the failed ones) were all annotated successfully, but the relatively shorter ones failed.  
>>>> 
>>>> I have set "tries=5" (#number of times to try a contig if there is a failure for some reason). I will try "clean_try=1" and test on the failed scaffolds individually with larger memory to see whether they can be annotated. 
>>>> 
>>>> Thank you!
>>>> 
>>>> Best
>>>> Quanwei
>>>> 
>>>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>>:
>>>> I think the cause of the error may have been a little further upstream from what you pasted in the e-mail. One thing that may be happening is that you are taxing resources (like IO) by running MAKER multiple times or on too many CPUs. That can lead to failures because of truncated BLAST reports etc., in which case you can just retry and that will get around those types of IO-derived errors. MAKER can generate a lot of IO, and if you are working on network mounted locations (i.e. the storage being used is actually across the network), they can be less robust than local storage (under heavy load NFS can falsely report success on read/write operations that actually failed). It's the reason we built the retry capabilities into MAKER.
>>>> 
>>>> For contigs that continuously fail, you may need to set clean_try=1. That will cause failures to start from scratch (i.e. delete all old reports on failure rather than just those suspected of being truncated).
>>>> 
>>>> --Carson
>>>> 
>>>> 
>>>> 
>>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com <mailto:qwzhang0601 at gmail.com>> wrote:
>>>>> 
>>>>> Dear Carson:
>>>>> 
>>>>> About the error in my email above: I found the contig was correctly annotated on the second RETRY, so please ignore my last email. But now, for a small number of scaffolds, I have problems processing the repeats (error shown below). I used both the Mammalia repeat library and a species-specific repeat library (generated with your pipeline, http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic). There were no such problems when I used only the Mammalia repeat library. Do you have any ideas about this? What could be the reason? Or do you have any suggestions for how I could find the reason? Many thanks
>>>>> 
>>>>> Here are some parameters I used
>>>>> 
>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
>>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
>>>>> 
>>>>> max_dna_len=300000
>>>>> split_hit=40000
>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>> 
>>>>> 
>>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>>>>> 33708 --> rank=NA, hostname=n409
>>>>> 33709 ERROR: Failed while processing all repeats
>>>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>>>> 33711 FAILED CONTIG:Contig31
>>>>> 
>>>>> 
>>>>> Best
>>>>> Quanwei
>>>>> 
>>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com <mailto:qwzhang0601 at gmail.com>>:
>>>>> Dear Carson:
>>>>> 
>>>>> I got the following error again. Is this still related to memory issues, or could other causes lead to this error? This time I got the error while training the SNAP model. Before, even with max_dna_len set to 1 Mb, I could train the model successfully. In the current training (where I get the following error), I have decreased max_dna_len to 300 kb and requested the same amount of memory as before. The only difference is that I am now using both the mammalian repeat library and the species-specific repeat library, whereas previously I used only the mammalian repeat library. Does using both repeat libraries greatly increase the memory requirement (even when I decrease max_dna_len from 1 Mb to 300 kb)? I have also set the depth_blast parameters to 30 in the current training.
>>>>> 
>>>>> Thank you! Have a nice weekend! 
>>>>> 
>>>>> 
>>>>> 
>>>>> #---------------------------------------------------------------------
>>>>> Now starting the contig!!
>>>>> SeqID: Contig10
>>>>> Length: 18773588
>>>>> #---------------------------------------------------------------------
>>>>> 
>>>>> 
>>>>> setting up GFF3 output and fasta chunks
>>>>> doing repeat masking
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> collecting blastx repeatmasking
>>>>> processing all repeats
>>>>> doing repeat masking
>>>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
>>>>> --> rank=NA, hostname=n224
>>>>> ERROR: Failed while doing repeat masking
>>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>>> FAILED CONTIG:Contig10
>>>>> 
>>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>>> FAILED CONTIG:Contig10
>>>>> 
>>>>> Best
>>>>> Quanwei
>>>>> 
>>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>>:
>>>>> 
>>>>>> (2) By reading some of your replies in the maker google group, and I noticed that it can reduce memory and save time for annotation if I set depth_blast to a certain number. So I changed the following parameters. But I wonder, whether it will decrease the quality of annotation? If it won't affect the quality, can I even use a smaller number (e.g., 20) to save more memory and time?
>>>>>> 
>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>> 
>>>>> These values really only affect the final evidence kept in the GFF3 when you look at it in a browser. They have no effect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. By setting it lower, you are just letting MAKER know it can throw things away even sooner since you don't want them in the GFF3. It provides a minor improvement for memory use, but max_dna_len is the big one that has the greatest effect.
>>>>> 
>>>>> 
>>>>>> (3) I also have some concerns about the speed, especially for the long scaffolds (around 100Mb). I wonder which part is the most time consuming for genome annotation (repeat masking, blast, or polishing?).  Particularly, I wonder whether the blastx of protein evidence will take majority of time. Now, I have prepared 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences as protein evidences. I am considering whether I can save much time if I only use the 99k mammalian Swiss protein sequences as evidences.
>>>>> 
>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN
>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>> 
>>>>> Also, doubling the dataset size doubles the runtime. Larger window sizes via max_dna_len will also increase runtimes.
>>>>> 
>>>>> 
>>>>>> (4) For some reasons, I can not run maker though MPI on our cluster. So I can only start multiple maker. I wonder if it is possible to let multiple maker to annotate the same long scaffold (i.e., for a single sequence I start multiple maker, without splitting the long sequence into shorter ones).
>>>>> 
>>>>> Without MPI you won't be able to split up large contigs. At the very least you can try to run on a single node and set MPI to use all CPUs on that node. It's less difficult to set up compared to cross-node jobs via MPI.
>>>>> 
>>>>> 
>>>>>> (5) Still about the speed issue. I read some of your comments about the "cpus" parameter in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html), and I understand that it indicates the number of CPUs for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?
>>>>> 
>>>>> The cpus parameter only affects how many CPUs are given to the BLAST command line, so only the BLAST step will speed up. I recommend using MPI to get all steps to speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
>>>>> 
>>>>> 
>>>>> --Carson
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 
> 


From carsonhh at gmail.com  Wed Sep 13 12:26:08 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Sep 2017 12:26:08 -0600
Subject: [maker-devel] Repeats annotation
In-Reply-To: <CAOW6FS+YBobDCpEAZ=8YYoV+LKrTV7qJhmKYktm15pAh84i3Kw@mail.gmail.com>
References: <CAOW6FS+YBobDCpEAZ=8YYoV+LKrTV7qJhmKYktm15pAh84i3Kw@mail.gmail.com>
Message-ID: <40F80C42-836A-41FF-9C9F-1F45C5816283@gmail.com>

I don't know of any tool to analyze the repeat info. MAKER really only focuses on getting the masking done for the gene prediction, and while it does keep the repeats as features in the GFF3, it does not do any kind of analysis. You would have to do that outside of MAKER.
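
As an illustration of what "outside of MAKER" could look like, here is a minimal Python sketch that totals the masked bases per sequence from the repeat features in a GFF3. It is not a MAKER script. It assumes the repeat features appear as top-level "match" features whose source column is "repeatmasker" or "repeatrunner"; check those values against your own GFF3 before relying on it. The function names are made up for this example.

```python
from collections import defaultdict


def merged_length(intervals):
    """Total bases covered by (start, end) intervals (1-based, inclusive)."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Close the previous merged block and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def repeat_coverage(gff_lines, sources=("repeatmasker", "repeatrunner")):
    """Return {seqid: masked_bases} from 'match' features of the given sources."""
    per_seq = defaultdict(list)
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # sequence section follows; no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        seqid, source, ftype = cols[0], cols[1], cols[2]
        if source in sources and ftype == "match":
            per_seq[seqid].append((int(cols[3]), int(cols[4])))
    return {seqid: merged_length(iv) for seqid, iv in per_seq.items()}
```

Overlapping features are merged before counting so that a base hit by both libraries is only counted once. Dividing the result by the sequence length gives a masked percentage that can be compared across species.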

--Carson


> On Sep 13, 2017, at 8:51 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> We have generated a species-specific repeat library following your pipeline (http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic), and annotated the genome with maker2 using both the species-specific repeat library and the mammalian repeat library.
> 
> Now we want to compare the repeat content among different species. So I want to generate species-specific repeat libraries for other species and again use both their species-specific repeat library and the mammalian repeat library. But I found that I can only provide either the species-specific repeat library or the mammalian repeat library to RepeatMasker (not both). I wonder whether I can run maker2 on those genomes for repeat masking only.
> 
> BTW, running RepeatMasker produces a summary report (as below). I wonder whether there is any script in maker2 to analyze repeat elements (or other tools to process the output of maker2).
> 
> Many thanks
> 
> 
> file name: test_scaffold31.fasta    
> sequences:             1
> total length:     863590 bp  (858757 bp excl N/X-runs)
> GC level:         37.02 %
> bases masked:     301634 bp ( 34.93 %)
> ==================================================
>                number of      length   percentage
>                elements*    occupied  of sequence
> --------------------------------------------------
> SINEs:               134        14362 bp    1.66 %
>       Alu/B1          28         2183 bp    0.25 %
>       MIRs            21         2860 bp    0.33 %
> 
> LINEs:               188       129104 bp   14.95 %
>       LINE1          168       124633 bp   14.43 %
>       LINE2           16         4266 bp    0.49 %
>       L3/CR1           4          205 bp    0.02 %
>       RTE              0            0 bp    0.00 %
> 
> LTR elements:        127       101129 bp   11.71 %
>       ERVL            10         3057 bp    0.35 %
>       ERVL-MaLRs      22         6902 bp    0.80 %
>       ERV_classI      66        80258 bp    9.29 %
>       ERV_classII     29        10912 bp    1.26 %
> 
> DNA elements:         27         4402 bp    0.51 %
>       hAT-Charlie     13         1836 bp    0.21 %
>       TcMar-Tigger     8         1651 bp    0.19 %
> 
> Unclassified:          4         1590 bp    0.18 %
> 
> Total interspersed repeats:    250587 bp   29.02 %
> 
> 
> Small RNA:             9          616 bp    0.07 %
> 
> Satellites:           66        40820 bp    4.73 %
> Simple repeats:      159         7235 bp    0.84 %
> Low complexity:       50         2766 bp    0.32 %
> ==================================================
> 
> * most repeats fragmented by insertions or deletions
>   have been counted as one element
>                                                       
> 
> The query species was assumed to be mammalia      
> RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
>         
> run with rmblastn version 2.2.27+ 


From carsonhh at gmail.com  Wed Sep 13 12:41:24 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Sep 2017 12:41:24 -0600
Subject: [maker-devel] OpenMPI issues,
 no response in two attempts to subscribe to list
In-Reply-To: <a1a5313e90b4f2043f9910f2624986eb@saf.bio.caltech.edu>
References: <a1a5313e90b4f2043f9910f2624986eb@saf.bio.caltech.edu>
Message-ID: <BA16E294-BE01-47DC-8113-C018C38480FC@gmail.com>

Hi David,

First thing: MAKER binds shared C libraries using Perl, so you have to tell MAKER where to find the needed files before you install it. It then compiles the bindings and saves them for MAKER to use. If you have two MPI installations, you may have MAKER set up to use one of them while you are trying to call it with the other. That would break the compiled bindings.

Also make sure you did the following (info from the .../maker/INSTALL instructions file) -->

"make sure to set LD_PRELOAD to the location of libmpi.so before even trying to install MAKER. It must also be set before running MAKER (or any program that binds OpenMPI's shared libraries), so it's best just to add it to your ~/.bash_profile. (i.e. export LD_PRELOAD=/usr/local/openmpi/lib/libmpi.so)."

Remember to replace '/usr/local/openmpi/lib/libmpi.so' with the actual location of the file.

Second, once you can get MAKER to start under OpenMPI, you may get freezes or failures partway into a run because the OpenFabrics libraries use registered memory in a way that can cause system calls in a program to fail, with a snowballing error effect. Adding this to the mpiexec options can stop this from occurring --> '-mca btl ^openib'

That option has the side effect of disabling infiniband and using the ethernet adapter instead. However if you need to use the infiniband adapter, you can use this flag instead '--mca btl vader,tcp,self --mca btl_tcp_if_include ib0'

That option uses IP over InfiniBand rather than native InfiniBand, which has the same effect of disabling the OpenFabrics libraries.
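
To make the two variants concrete, here is a small Python sketch that assembles the mpiexec argument list for each case and does a crude check that LD_PRELOAD mentions libmpi.so. The helper names and the process count are only for illustration; the flags themselves are the ones given above, and the check is an assumption about what a sanity test might look like, not something MAKER does.

```python
def openmpi_maker_command(n_procs, use_ipoib=False, ib_interface="ib0"):
    """Build the mpiexec argument list for running MAKER under OpenMPI.

    use_ipoib=False disables the openib transport entirely;
    use_ipoib=True uses IP over InfiniBand on the given interface instead.
    """
    if use_ipoib:
        mca = ["--mca", "btl", "vader,tcp,self",
               "--mca", "btl_tcp_if_include", ib_interface]
    else:
        mca = ["-mca", "btl", "^openib"]
    return ["mpiexec"] + mca + ["-n", str(n_procs), "maker"]


def ld_preload_problem(environ):
    """Return a warning string if LD_PRELOAD does not mention libmpi.so, else None."""
    value = environ.get("LD_PRELOAD", "")
    if "libmpi.so" not in value:
        return "LD_PRELOAD does not point to libmpi.so"
    return None
```

For example, openmpi_maker_command(20) gives the argument list for a 20 CPU node with openib disabled, which can be passed to subprocess.run after ld_preload_problem(os.environ) returns None.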

Thanks,
Carson


> On Sep 13, 2017, at 12:01 PM, mathog <mathog at caltech.edu> wrote:
> 
> Greetings,
> 
> I'm trying to run maker 2.31.9 with OpenMPI on a Centos 6.7 system.  It just won't start.  OpenMPI works fine with a small test program, it just doesn't work with maker.  It fails in exactly the same way on a second Centos system with minor software differences (Centos 6.9 and perl 5.20 compiled without thread support, the perl on the first machine had thread support.) The gory details were posted already in a Centos forum so rather than repeat it all here, this is a link to that thread:
> 
>   https://www.centos.org/forums/viewtopic.php?f=14&t=64099
> 
> maker was unpacked from the maker-2.31.9.tgz a second time (after moving the original) after setting up the "module add openmpi-x86_64" to my .bash_profile
> and logging in cleanly.  It was rebuilt.  The build messages were identical to the previous ones and when a run was attempted it also failed in exactly the same way.
> 
> I also tried to subscribe to the list here
> 
>  https://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
> 
> once yesterday, and once today, but no email ever came back.  Hopefully this message gets through!
> 
> Regards,
> 
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> 
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org


From qwzhang0601 at gmail.com  Wed Sep 13 13:42:01 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 13 Sep 2017 15:42:01 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <6B4E0F47-7186-4B12-97A1-EB63D2EB7079@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
	<A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
	<CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>
	<CAOW6FSJcdMTuLbN8-O-Vb7OxEXvU=F3Nwz7Rhx6qzXJ30ubV1w@mail.gmail.com>
	<6B4E0F47-7186-4B12-97A1-EB63D2EB7079@gmail.com>
Message-ID: <CAOW6FS+JddppLEgbNjS3ipBUc67tbC1FeUfi9=8TzN+NB35X8g@mail.gmail.com>

Dear Carson:

Thank you for your explanation, and sorry for not describing my problem
clearly. The first two errors were all gone after I changed the parameters
you suggested (e.g., max_dna_len, depth_blast). Now I only get the
following error for two contigs among thousands of contigs. One of the two
failed contigs has length 863 kb, and I have done more tests on this contig
individually. When running RepeatMasker on this contig, 65% was masked when
using the species-specific repeat library, while only 35% was masked when
using the mammalian repeat library. Since longer contigs (even 98 Mb) can
all be annotated, I wonder why this much shorter one would fail due to IO.

I did not set "TMP", and I am running on a high performance cluster. I am
not sure whether /tmp is a virtual (in-memory) location or not. I will check
it later. Many thanks

Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm
line 188.
33708 --> rank=NA, hostname=n409
33709 ERROR: Failed while processing all repeats
33710 ERROR: Chunk failed at level:3, tier_type:1
33711 FAILED CONTIG:Contig31

Best
Quanwei

2017-09-13 14:23 GMT-04:00 Carson Holt <carsonhh at gmail.com>:

> These are the 3 errors you have shown in your e-mails -->
> open3: fork failed: Cannot allocate memory at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>
> The first two are memory related with the second being because it cannot
> kill a lock maintainer thread that it was not able to start because of lack
> of memory.
>
> The third one is IO related. It is a truncated file that succeeded on the
> second try according to the e-mail you sent.
>
>
> IO errors are quite common with NFS (network mounted file systems). It's
> one of the most frequent issues submitted to the devel list. MAKER can hit
> IO limits long before it hits CPU limits. One of the most frequent causes
> of these issues is that the user set TMP= in the control files to a manual
> location that is not suitable for high IO (note TMP= defaults to /tmp). The
> location should always be a true locally mounted disk. Sometimes /tmp is a
> virtual location (not really local disk but network mounted disk or an in
> memory location). With the former you will get frequent IO failures and
> with the latter you will also get out of memory issues.
>
> Note that when you supply more data files you will also use more memory
> (to hold analysis results). According to your e-mail the last error you got
> was 'Can't kill a non-numeric process ID'. Correct? So getting the error
> with two input files but not when you supply a single input file further
> suggests you are running low on RAM.
>
> 1. Some things to check. Make sure TMP= is not being set to a network
> mounted location.
> 2. Make sure your temporary directory is not a virtual in memory directory
> on the node being used.
> 3. If nodes are shared, you may run out of memory because of other users
> or because you failed to request enough RAM during job submission.
>
> Finally, try running interactively so you can see what the memory and
> directory locations look like on the node you get assigned for the job
> (check space and mount points. Is /tmp or wherever you set TMP= in fact a
> local disk?). Also run with MPI rather than starting multiple MAKER
> instances. It uses resources better.
>
> Thanks,
> Carson
>
>
>
>
>
>
> On Sep 13, 2017, at 8:32 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>
> Dear Carson:
>
> I did more tests on one of the contigs (length 863 kb) that failed during
> repeat masking. It fails only when I add the species-specific repeat
> library; it is annotated successfully when I use only the mammalian repeat
> library. For this test I picked only this contig and ran MAKER with 64 GB
> of memory. So I think the failure should not be a memory or IO problem,
> because even contigs of length 98 Mb can be annotated with 32 GB of memory.
>
> I also ran RepeatMasker on this contig with the mammalian and the
> species-specific repeat library separately. With the mammalian repeat
> library about 35% was masked as repeats, while with the species-specific
> repeat library it was 65% (report shown below). I wonder whether the high
> level of repeats can lead to the failure of this contig. Do you have any
> ideas about this? Thanks
>
>
>
> file name: test_scaffold31.fasta
> sequences:             1
> total length:     863590 bp  (858757 bp excl N/X-runs)
> GC level:         37.02 %
> bases masked:     562909 bp ( 65.18 %)
> ==================================================
>                number of      length   percentage
>                elements*    occupied  of sequence
> --------------------------------------------------
> SINEs:              113        16134 bp    1.87 %
>       ALUs           71        12479 bp    1.45 %
>       MIRs            1          133 bp    0.02 %
>
> LINEs:              251       380142 bp   44.02 %
>       LINE1         211       210623 bp   24.39 %
>       LINE2           1           86 bp    0.01 %
>       L3/CR1          0            0 bp    0.00 %
>
> LTR elements:       246       101221 bp   11.72 %
>       ERVL            5         1037 bp    0.12 %
>       ERVL-MaLRs     18         2744 bp    0.32 %
>       ERV_classI    201        90942 bp   10.53 %
>       ERV_classII    18         5964 bp    0.69 %
>
> DNA elements:        39        14177 bp    1.64 %
>      hAT-Charlie      7         3864 bp    0.45 %
>      TcMar-Tigger     7         1706 bp    0.20 %
>
> Unclassified:       196        45831 bp    5.31 %
>
> Total interspersed repeats:   557505 bp   64.56 %
>
>
> Small RNA:            3          823 bp    0.10 %
>
> Satellites:           2          237 bp    0.03 %
> Simple repeats:      94         4472 bp    0.52 %
> Low complexity:      18          766 bp    0.09 %
> ==================================================
>
> * most repeats fragmented by insertions or deletions
>   have been counted as one element
>
>
> The query species was assumed to be homo
> RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
>
> run with rmblastn version 2.2.27+
> The query was compared to classified sequences in ".../consensi.fa.classifiednoProtFinal"
>
>
>
> Best
> Quanwei
>
> 2017-09-11 14:33 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>
>> Dear Carson:
>>
>> I see. Thank you. I will try it.
>>
>> Best
>> Quanwei
>>
>> 2017-09-11 13:46 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>
>>> Each node is a single machine. Because you currently run without MPI,
>>> each MAKER job you submit runs on a single machine. So you are either
>>> running multiple times on the same node, or you submitted 5 separate batch
>>> jobs in which case you may have a single maker process on each of 5 nodes.
>>>
>>> MPI can parallelize on the same node or across nodes. If you request 10
>>> nodes, then it can communicate across nodes to run the job on all hardware.
>>> Or you can run MPI on a single node and ask for all CPUs on that node. In
>>> that case it will split up work within a single node and use all resources
>>> just on that node. So if you can't get MPI to work across nodes, you can
>>> just submit a job that goes to a single node and ask for all CPUs on that
>>> node (multinode jobs may be hard to configure, but single node jobs are
>>> very easy). Just set the -n parameter of mpiexec to the CPU count of that
>>> node, and it will parallelize within the node.
>>>
>>> Example command for a 20 CPU node -->  mpiexec -n 20 maker
>>>
>>> --Carson
>>>
>>>
>>>
>>>
>>>
>>> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>> wrote:
>>>
>>> Dear Carson:
>>>
>>> Would you please explain what do you mean by "a single machine"? I am
>>> running maker2 on our high performance cluster. The cluster has more than
>>> 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine was used
>>> as the scheduler. Can I use MPICH3?
>>>
>>> Thanks
>>>
>>> Best
>>> Quanwei
>>>
>>> 2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>
>>>> If you are just using a single machine (and not cross machine MPI), use
>>>> MPICH3 --> https://www.mpich.org
>>>>
>>>> It's easy to install yourself, and tends to be very robust to failure.
>>>>
>>>> --Carson
>>>>
>>>>
>>>>
>>>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>> wrote:
>>>>
>>>> Dear Carson:
>>>>
>>>> I met some problems to use MPI. I will give it another try.
>>>> Thank you!
>>>>
>>>> Best
>>>> Quanwei
>>>>
>>>> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>
>>>>> It could be either. Please use MPI instead of starting multiple
>>>>> instances. It will greatly reduce both IO and RAM usage.
>>>>>
>>>>> --Carson
>>>>>
>>>>>
>>>>>
>>>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>>> wrote:
>>>>>
>>>>> Dear Carson:
>>>>>
>>>>> I only run 5 Maker instances in each directory (and set cpus=2). If it
>>>>> is related to memory issue or an IO issue, I am not sure why the much
>>>>> longer scaffolds (than the failed ones) were all annotated successfully,
>>>>> but the relatively shorter ones failed.
>>>>>
>>>>> I have set "tries=5" (#number of times to try a contig if there is a
>>>>> failure for some reason). I will try "clean_try=1" and test on the failed
>>>>> scaffolds individually with larger memory to see whether they can be
>>>>> annotated.
>>>>>
>>>>> Thank you!
>>>>>
>>>>> Best
>>>>> Quanwei
>>>>>
>>>>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>
>>>>>> I think the cause of the error may have been a little further
>>>>>> upstream from what you pasted in the e-mail. One thing that may be
>>>>>> happening is that you are taxing resources (like IO) by running MAKER
>>>>>> multiple times or on too many CPUs. That can lead to failures because of
>>>>>> truncated BLAST reports etc. In that case you can just retry, and that will
>>>>>> get around those types of IO-derived errors. MAKER can generate a lot of
>>>>>> IO, and if you are working on network mounted locations (i.e. the storage
>>>>>> being used is actually across the network), then they can be less robust
>>>>>> than local storage (when under heavy load NFS can falsely report success on
>>>>>> read/write operations that actually failed). It's the reason we built in
>>>>>> the retry capabilities of MAKER.
>>>>>>
>>>>>> For contigs that continuously fail, you may need to set clean_try=1.
>>>>>> That will cause failures to start from scratch (i.e. delete all old reports
>>>>>> on failure rather than just those suspected of being truncated).
>>>>>>
>>>>>> --Carson
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Dear Carson:
>>>>>>
>>>>>> About the error in my previous email, I found the contig was correctly
>>>>>> annotated on the second RETRY. So please ignore my last email. But
>>>>>> now, for a small number of scaffolds, I have problems processing the
>>>>>> repeats (as shown below). I used both the Mammalia repeat library and a
>>>>>> species specific repeat library (which was generated by your pipeline
>>>>>> "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic").
>>>>>> There were no such problems when I only used the Mammalia repeat library.
>>>>>> Do you have any ideas about this? What could be the reason? Or do you have
>>>>>> any suggestions for how I can find the reason? Many thanks
>>>>>>
>>>>>> Here are some parameters I used
>>>>>>
>>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
>>>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
>>>>>>
>>>>>> max_dna_len=300000
>>>>>> split_hit=40000
>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>>
>>>>>>
>>>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>>>>>> 33708 --> rank=NA, hostname=n409
>>>>>> 33709 ERROR: Failed while processing all repeats
>>>>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>>>>> 33711 FAILED CONTIG:Contig31
>>>>>>
>>>>>>
>>>>>> Best
>>>>>> Quanwei
>>>>>>
>>>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>>>>>
>>>>>>> Dear Carson:
>>>>>>>
>>>>>>> I got the following error again. Is this still related to memory
>>>>>>> issues, or could something else lead to this error?
>>>>>>> This time, I got the error while training the SNAP model. Before, even
>>>>>>> with max_dna_len=1Mb, I could train the model successfully. In the
>>>>>>> current training (where I get the following error), I have decreased
>>>>>>> max_dna_len to 300kb and requested the same amount of memory as before.
>>>>>>> The only difference is that I am using both the mammalian repeat library
>>>>>>> and a species specific repeat library, while previously I only used the
>>>>>>> mammalian repeat library. Does using both repeat libraries greatly
>>>>>>> increase the memory requirement (even when I decrease max_dna_len from
>>>>>>> 1Mb to 300kb)? I have also set the depth_blast parameters to 30 in the
>>>>>>> current training.
>>>>>>>
>>>>>>> Thank you! Have a nice weekend!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> #---------------------------------------------------------------------
>>>>>>> Now starting the contig!!
>>>>>>> SeqID: Contig10
>>>>>>> Length: 18773588
>>>>>>> #---------------------------------------------------------------------
>>>>>>>
>>>>>>>
>>>>>>> setting up GFF3 output and fasta chunks
>>>>>>> doing repeat masking
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> collecting blastx repeatmasking
>>>>>>> processing all repeats
>>>>>>> doing repeat masking
>>>>>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
>>>>>>> --> rank=NA, hostname=n224
>>>>>>> ERROR: Failed while doing repeat masking
>>>>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>>>>> FAILED CONTIG:Contig10
>>>>>>>
>>>>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>>>>> FAILED CONTIG:Contig10
>>>>>>>
>>>>>>> Best
>>>>>>> Quanwei
>>>>>>>
>>>>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>>>
>>>>>>>>
>>>>>>>> (2) Reading some of your replies in the maker google group, I
>>>>>>>> noticed that setting depth_blast to a certain number can reduce memory
>>>>>>>> and save time for annotation. So I changed the following parameters. But
>>>>>>>> I wonder whether it will decrease the quality of annotation? If it won't
>>>>>>>> affect the quality, can I even use a smaller number (e.g., 20) to save more
>>>>>>>> memory and time?
>>>>>>>>
>>>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>>>>
>>>>>>>>
>>>>>>>> These values really only affect the final evidence kept in the GFF3
>>>>>>>> when you look at it in a browser. They have no effect on the annotation. This
>>>>>>>> is because internally MAKER already collapses evidence down to the 10 best
>>>>>>>> non-redundant features per evidence set per locus. The rest are put in the
>>>>>>>> GFF3 just for reference. By setting it lower, you are just letting MAKER
>>>>>>>> know it can throw things away even sooner since you don't want them in
>>>>>>>> the GFF3. It provides a minor improvement for memory use, but
>>>>>>>> max_dna_len is the big one that has the greatest effect.
>>>>>>>>
>>>>>>>>
>>>>>>>> (3) I also have some concerns about the speed, especially for the
>>>>>>>> long scaffolds (around 100Mb). I wonder which part is the most time
>>>>>>>> consuming for genome annotation (repeat masking, blast, or polishing?).
>>>>>>>> Particularly, I wonder whether the blastx of protein evidence will take
>>>>>>>> majority of time. Now, I have prepared 99k mammalian Swiss protein
>>>>>>>> sequences and 340k rodent TrEMBL protein sequences as protein evidences. I
>>>>>>>> am considering whether I can save much time if I only use the 99k mammalian
>>>>>>>> Swiss protein sequences as evidences.
>>>>>>>>
>>>>>>>>
>>>>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at
>>>>>>>> least 6 times slower than BLASTN
>>>>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at
>>>>>>>> least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>>>>>
>>>>>>>> Also, if you double the dataset size, you double the runtime. Larger
>>>>>>>> window sizes via max_dna_len will also increase runtimes.
>>>>>>>>
>>>>>>>>
>>>>>>>> (4) For some reason, I cannot run maker through MPI on our
>>>>>>>> cluster. So I can only start multiple maker instances. I wonder if it is
>>>>>>>> possible to let multiple maker instances annotate the same long scaffold
>>>>>>>> (i.e., for a single sequence I start multiple maker instances, without
>>>>>>>> splitting the long sequence into shorter ones).
>>>>>>>>
>>>>>>>>
>>>>>>>> Without MPI you won't be able to split up large contigs. At the
>>>>>>>> very least you can try and run on a single node and set MPI to use all CPUs
>>>>>>>> on that node. It's less difficult to set up compared to cross node jobs via
>>>>>>>> MPI.
>>>>>>>>
>>>>>>>>
>>>>>>>> (5) Still about the speed issue. I read some of your comments about
>>>>>>>> the "cpus" parameter in the maker_opts file
>>>>>>>> (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html).
>>>>>>>> I understand it indicates the number
>>>>>>>> of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file,
>>>>>>>> then I can use the following command to submit the job, right?
>>>>>>>>
>>>>>>>>
>>>>>>>> The cpus parameter only affects how many CPUs are given to the blast
>>>>>>>> command line, so only the BLAST step will speed up. I recommend using
>>>>>>>> MPI to get all steps to speed up. Even if you are only running on a single
>>>>>>>> node, you can give all CPUs to the mpiexec command.
>>>>>>>>
>>>>>>>>
>>>>>>>> --Carson
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
>

From carsonhh at gmail.com  Wed Sep 13 14:21:14 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Sep 2017 14:21:14 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FS+JddppLEgbNjS3ipBUc67tbC1FeUfi9=8TzN+NB35X8g@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
	<A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
	<CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>
	<CAOW6FSJcdMTuLbN8-O-Vb7OxEXvU=F3Nwz7Rhx6qzXJ30ubV1w@mail.gmail.com>
	<6B4E0F47-7186-4B12-97A1-EB63D2EB7079@gmail.com>
	<CAOW6FS+JddppLEgbNjS3ipBUc67tbC1FeUfi9=8TzN+NB35X8g@mail.gmail.com>
Message-ID: <55597FCD-B589-4FFB-A8FA-37659A48BF58@gmail.com>

One final thought. If you are using rmblast as part of the RepeatMasker installation, it may be suffering from a bug that affects some BLAST versions and can sometimes lead to truncation of a BLAST report (example of a separate error related to BLAST report truncation here) -> https://groups.google.com/forum/#!topic/maker-devel/96KGgiQMcxQ

As a result there is a special update to rmblast -> http://www.repeatmasker.org/RMBlast.html

So if you are not using the update, try it; but if you are using the update and it is giving the error, roll back to rmblast 2.2.28 (i.e. the update may be the cause or the cure of RepeatMasker errors).
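
One way to see which rmblast a RepeatMasker run actually used is to read it from the RepeatMasker summary table, which (as in the table quoted further down this thread) contains a line such as "run with rmblastn version 2.2.27+". The following is only a minimal sketch of that idea; the function name rmblast_version is hypothetical and is not part of MAKER or RepeatMasker.

```python
import re


def rmblast_version(summary_text):
    """Return the rmblastn version found in a RepeatMasker summary as a
    tuple of ints, or None if no such line is present.

    Hypothetical helper for illustration; it only looks for the phrase
    'rmblastn version X.Y.Z' as it appears in the summary quoted in this thread.
    """
    match = re.search(r"rmblastn version (\d+)\.(\d+)\.(\d+)", summary_text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


# Line copied from the RepeatMasker summary quoted later in this message.
example = "run with rmblastn version 2.2.27+"
version = rmblast_version(example)
print(version, "older than 2.2.28:", version < (2, 2, 28))
```

Comparing the tuple against (2, 2, 28) only tells you which side of that release you are on; whether the patched or the unpatched build behaves better still has to be tested on the failing contig.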

--Carson


> On Sep 13, 2017, at 1:42 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> Thank you for your explanation.  Sorry for not describing my problem clearly. The first two errors were all gone after I changed the parameters you suggested (e.g., max_dna_len, depth_blast). Now I only get the following error for two contigs among thousands of contigs. One of the two failed contigs has length 863kb, and I have done more tests on this contig individually. When running RepeatMasker on this contig, 65% was masked when using the species specific repeat library, while only 35% was masked when using the mammalian repeat library. Since longer contigs (even 98Mb) can all be annotated, I wonder why this much shorter one would fail due to IO.
> 
> I did not set "TMP", and I am running on a high performance cluster. I am not sure whether it is a virtual memory or not. I will check it later. Many thanks
> 
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
> 33708 --> rank=NA, hostname=n409
> 33709 ERROR: Failed while processing all repeats
> 33710 ERROR: Chunk failed at level:3, tier_type:1
> 33711 FAILED CONTIG:Contig31
> 
> Best
> Quanwei
> 
> 2017-09-13 14:23 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
> These are the 3 errors you have shown in your e-mails ->
> open3: fork failed: Cannot allocate memory at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
> 
> The first two are memory related with the second being because it cannot kill a lock maintainer thread that it was not able to start because of lack of memory.
> 
> The third one is IO related. It is a truncated file that succeeded on the second try according to the e-mail you sent.
> 
> 
> IO errors are quite common with NFS (network mounted file systems). It's one of the most frequent issues submitted to the devel list. MAKER can hit IO limits long before it hits CPU limits. One of the most frequent causes of these issues is that the user set TMP= in the control files to a manual location that is not suitable for high IO (note TMP= defaults to /tmp). The location should always be a true locally mounted disk. Sometimes the temporary location is virtual (not really local disk but network mounted disk or an in-memory location). With the former you will get frequent IO failures and with the latter you will also get out of memory issues.
> 
> Note that when you supply more data files you will also use more memory (to hold analysis results). According to your e-mail the last error you got was 'Can't kill a non-numeric process ID'. Correct? So getting the error with two input files but not when you supply a single input file further suggests you are running low on RAM.
> 
> Some things to check:
> 1. Make sure TMP= is not being set to a network mounted location.
> 2. Make sure your temporary directory is not a virtual in-memory directory on the node being used.
> 3. If nodes are shared, you may run out of memory because of other users or because you failed to request enough RAM during job submission.
> 
> Finally, try running interactively so you can see what the memory and directory locations look like on the node you get assigned for the job (check space and mount points. Is /tmp or wherever you set TMP= in fact a local disk?). Also run with MPI rather than starting multiple MAKER instances. It uses resources better.
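
The mount point check above can be sketched in a few lines. This is only an illustration under stated assumptions: it assumes a Linux node whose mount table has the /proc/mounts layout (device, mount point, filesystem type, ...), the lists of network and in-memory filesystem types are not exhaustive, and the sample mount table (including the server name "fileserver") is made up for the example. The helper names mount_for and classify are hypothetical.

```python
import os

# Assumed, non-exhaustive lists of filesystem types.
NETWORK_FS = {"nfs", "nfs4", "cifs", "smbfs", "lustre", "gpfs", "beegfs"}
MEMORY_FS = {"tmpfs", "ramfs"}


def mount_for(path, mounts_text):
    """Return (mount_point, fstype) for the longest mount point containing path.

    mounts_text is text in /proc/mounts layout. Escaped characters in mount
    points (e.g. \\040 for a space) are not decoded in this sketch.
    """
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fstype = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or (path + "/").startswith(prefix):
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, fstype)
    return best


def classify(fstype):
    """Label a filesystem type as 'network', 'memory', or presumed 'local'."""
    if fstype in NETWORK_FS:
        return "network"
    if fstype in MEMORY_FS:
        return "memory"
    return "local"


# Made-up mount table for illustration only.
sample = """\
/dev/sda1 / ext4 rw 0 0
tmpfs /dev/shm tmpfs rw 0 0
fileserver:/export/gs /gs nfs4 rw 0 0
"""

for tmp_dir in ("/tmp", "/dev/shm/maker", "/gs/scratch"):
    mount_point, fstype = mount_for(tmp_dir, sample)
    print(tmp_dir, "->", mount_point, fstype, classify(fstype))
```

On a real node you would pass the contents of /proc/mounts (for example, open("/proc/mounts").read()) instead of the sample text, and check the directory that TMP= points to, or /tmp if TMP= is unset.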
> 
> Thanks,
> Carson
> 
> 
> 
> 
> 
> 
>> On Sep 13, 2017, at 8:32 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>> 
>> Dear Carson:
>> 
>> I did more tests on one of the contigs (with length 863kb) that failed during repeat masking. I found it only fails when I add the species specific repeat library, and it can be successfully annotated when only the mammalian repeat library is used. For this test I picked only this contig and ran maker with 64G of memory. So I think the failure should not be a problem with memory or IO, because even contigs with length 98Mb can be annotated with 32G of memory.
>> 
>> I also ran RepeatMasker on this contig with the mammalian and species specific repeat libraries separately. With the mammalian repeat library, about 35% was masked as repeats, while with the species specific repeat library it was 65% (as shown below). I wonder whether the high level of repeats can lead to the failure of this contig. Do you have any ideas about this? Thanks
>> 
>> 
>> 
>> file name: test_scaffold31.fasta    
>> sequences:             1
>> total length:     863590 bp  (858757 bp excl N/X-runs)
>> GC level:         37.02 %
>> bases masked:     562909 bp ( 65.18 %)
>> ==================================================
>>                number of      length   percentage
>>                elements*    occupied  of sequence
>> --------------------------------------------------
>> SINEs:              113        16134 bp    1.87 %
>>       ALUs           71        12479 bp    1.45 %
>>       MIRs            1          133 bp    0.02 %
>> 
>> LINEs:              251       380142 bp   44.02 %
>>       LINE1         211       210623 bp   24.39 %
>>       LINE2           1           86 bp    0.01 %
>>       L3/CR1          0            0 bp    0.00 %
>> 
>> LTR elements:       246       101221 bp   11.72 %
>>       ERVL            5         1037 bp    0.12 %
>>       ERVL-MaLRs     18         2744 bp    0.32 %
>>       ERV_classI    201        90942 bp   10.53 %
>>       ERV_classII    18         5964 bp    0.69 %
>> 
>> DNA elements:        39        14177 bp    1.64 %
>>      hAT-Charlie      7         3864 bp    0.45 %
>>      TcMar-Tigger     7         1706 bp    0.20 %
>> 
>> Unclassified:       196        45831 bp    5.31 %
>> 
>> Total interspersed repeats:   557505 bp   64.56 %
>> 
>> 
>> Small RNA:            3          823 bp    0.10 %
>> 
>> Satellites:           2          237 bp    0.03 %
>> Simple repeats:      94         4472 bp    0.52 %
>> Low complexity:      18          766 bp    0.09 %
>> ==================================================
>> 
>> * most repeats fragmented by insertions or deletions
>>   have been counted as one element
>>                                                       
>> 
>> The query species was assumed to be homo          
>> RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
>>         
>> run with rmblastn version 2.2.27+
>> The query was compared to classified sequences in ".../consensi.fa.classifiednoProtFinal"  
>> 
>> 
>> Best
>> Quanwei
>> 
>> 2017-09-11 14:33 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>> Dear Carson:
>> 
>> I see. Thank you. I will try it.
>> 
>> Best
>> Quanwei
>> 
>> 2017-09-11 13:46 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>> Each node is a single machine. Because you currently run without MPI, each MAKER job you submit runs on a single machine. So you are either running multiple times on the same node, or you submitted 5 separate batch jobs in which case you may have a single maker process on each of 5 nodes.
>> 
>> MPI can parallelize on the same node or across nodes. If you request 10 nodes, then it can communicate across nodes to run the job on all hardware. Or you can run MPI on a single node and ask for all CPUs on that node. In that case it will split up work within a single node and use all resources just on that node. So if you can't get MPI to work across nodes, you can just submit a job that goes to a single node and ask for all CPUs on that node (multinode jobs may be hard to configure, but single node jobs are very easy). Just set the -n parameter of mpiexec to the CPU count of that node, and it will parallelize within the node.
>> 
>> Example command for a 20 CPU node ->  mpiexec -n 20 maker
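
If the CPU count differs between nodes, the single-node command above can be assembled rather than typed by hand. This is only a sketch: the function name single_node_maker_command is hypothetical, and os.cpu_count() reports every CPU visible on the machine, which on a shared cluster may be more than the scheduler actually granted to the job.

```python
import os


def single_node_maker_command(cpu_count=None):
    """Build the argument list for 'mpiexec -n <cpus> maker' on one node.

    If cpu_count is None, fall back to os.cpu_count(), which counts all CPUs
    on the machine (an assumption that the whole node belongs to this job).
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return ["mpiexec", "-n", str(cpu_count), "maker"]


# Reproduces the example command for a 20 CPU node.
print(" ".join(single_node_maker_command(20)))
```

The list form can be handed to subprocess.run() inside a job script; passing the count explicitly is the safer choice whenever the node is shared with other jobs.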
>> 
>> --Carson
>> 
>> 
>> 
>> 
>> 
>>> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>> 
>>> Dear Carson: 
>>> 
>>> Would you please explain what you mean by "a single machine"? I am running maker2 on our high performance cluster. The cluster has more than 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine was used as the scheduler. Can I use MPICH3?
>>> 
>>> Thanks
>>> 
>>> Best
>>> Quanwei
>>> 
>>> 2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>> If you are just using a single machine (and not cross machine MPI), use MPICH3 -> https://www.mpich.org
>>> 
>>> It's easy to install yourself, and tends to be very robust to failure.
>>> 
>>> --Carson
>>> 
>>> 
>>> 
>>>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>>> 
>>>> Dear Carson:
>>>> 
>>>> I ran into some problems using MPI. I will give it another try.
>>>> Thank you!
>>>> 
>>>> Best
>>>> Quanwei
>>>> 
>>>> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>> It could be either. Please use MPI instead of starting multiple instances. It will greatly reduce both IO and RAM usage.
>>>> 
>>>> --Carson
>>>> 
>>>> 
>>>> 
>>>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>>>> 
>>>>> Dear Carson:
>>>>> 
>>>>> I only run 5 Maker instances in each directory (and set cpus=2). If it is related to memory issue or an IO issue, I am not sure why the much longer scaffolds (than the failed ones) were all annotated successfully, but the relatively shorter ones failed.  
>>>>> 
>>>>> I have set "tries=5" (#number of times to try a contig if there is a failure for some reason). I will try "clean_try=1" and test on the failed scaffolds individually with larger memory to see whether they can be annotated. 
>>>>> 
>>>>> Thank you!
>>>>> 
>>>>> Best
>>>>> Quanwei
>>>>> 
>>>>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>> I think the cause of the error may have been a little further upstream from what you pasted in the e-mail. One thing that may be happening is that you are taxing resources (like IO) by running MAKER multiple times or on too many CPUs. That can lead to failures because of truncated BLAST reports etc. In that case you can just retry, and that will get around those types of IO-derived errors. MAKER can generate a lot of IO, and if you are working on network mounted locations (i.e. the storage being used is actually across the network), then they can be less robust than local storage (when under heavy load NFS can falsely report success on read/write operations that actually failed). It's the reason we built in the retry capabilities of MAKER.
>>>>> 
>>>>> For contigs that continuously fail, you may need to set clean_try=1. That will cause failures to start from scratch (i.e. delete all old reports on failure rather than just those suspected of being truncated).
>>>>> 
>>>>> --Carson
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>>>>> 
>>>>>> Dear Carson:
>>>>>> 
>>>>>> About the error in my previous email, I found the contig was correctly annotated on the second RETRY. So please ignore my last email. But now, for a small number of scaffolds, I have problems processing the repeats (as shown below). I used both the Mammalia repeat library and a species specific repeat library (which was generated by your pipeline "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic"). There were no such problems when I only used the Mammalia repeat library. Do you have any ideas about this? What could be the reason? Or do you have any suggestions for how I can find the reason? Many thanks
>>>>>> 
>>>>>> Here are some parameters I used
>>>>>> 
>>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
>>>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
>>>>>> 
>>>>>> max_dna_len=300000
>>>>>> split_hit=40000
>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
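
When comparing settings between runs (for example the run with one repeat library against the run with two), it can help to load the control file into a dictionary and diff the values. The sketch below assumes only what the lines quoted above show: one key=value pair per line, with "#" starting a comment. The function name parse_control_lines is hypothetical and this is not MAKER's own parser.

```python
def parse_control_lines(text):
    """Parse 'key=value #comment' lines into a dict of strings.

    Assumes '#' always starts a comment, as in the lines quoted in this
    thread; a value that itself contains '#' would be cut short.
    """
    options = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()
    return options


# Lines taken from the parameters quoted above.
opts = parse_control_lines("""\
#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
max_dna_len=300000
split_hit=40000
depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
""")
print(opts["model_org"], int(opts["max_dna_len"]))
```

Values stay as strings so that blank settings (such as an unset TMP=) remain distinguishable from zero.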
>>>>>> 
>>>>>> 
>>>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>>>>>> 33708 --> rank=NA, hostname=n409
>>>>>> 33709 ERROR: Failed while processing all repeats
>>>>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>>>>> 33711 FAILED CONTIG:Contig31
>>>>>> 
>>>>>> 
>>>>>> Best
>>>>>> Quanwei
>>>>>> 
>>>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>>>>> Dear Carson:
>>>>>> 
>>>>>> I got the following error again. Is this still related to memory issues, or could something else lead to this error? This time, I got the error while training the SNAP model. Before, even with max_dna_len=1Mb, I could train the model successfully. In the current training (where I get the following error), I have decreased max_dna_len to 300kb and requested the same amount of memory as before. The only difference is that I am using both the mammalian repeat library and a species specific repeat library, while previously I only used the mammalian repeat library. Does using both repeat libraries greatly increase the memory requirement (even when I decrease max_dna_len from 1Mb to 300kb)? I have also set the depth_blast parameters to 30 in the current training.
>>>>>> 
>>>>>> Thank you! Have a nice weekend! 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> #---------------------------------------------------------------------
>>>>>> Now starting the contig!!
>>>>>> SeqID: Contig10
>>>>>> Length: 18773588
>>>>>> #---------------------------------------------------------------------
>>>>>> 
>>>>>> 
>>>>>> setting up GFF3 output and fasta chunks
>>>>>> doing repeat masking
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> collecting blastx repeatmasking
>>>>>> processing all repeats
>>>>>> doing repeat masking
>>>>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
>>>>>> --> rank=NA, hostname=n224
>>>>>> ERROR: Failed while doing repeat masking
>>>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>>>> FAILED CONTIG:Contig10
>>>>>> 
>>>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>>>> FAILED CONTIG:Contig10
>>>>>> 
>>>>>> Best
>>>>>> Quanwei
>>>>>> 
>>>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>> 
>>>>>>> (2) Reading some of your replies in the maker google group, I noticed that setting depth_blast to a certain number can reduce memory and save time for annotation. So I changed the following parameters. But I wonder whether it will decrease the quality of annotation? If it won't affect the quality, can I even use a smaller number (e.g., 20) to save more memory and time?
>>>>>>> 
>>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>> 
>>>>>> These values really only affect the final evidence kept in the GFF3 when you look at it in a browser. They have no effect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. By setting it lower, you are just letting MAKER know it can throw things away even sooner since you don't want them in the GFF3. It provides a minor improvement for memory use, but max_dna_len is the big one that has the greatest effect.
>>>>>> 
>>>>>> 
>>>>>>> (3) I also have some concerns about the speed, especially for the long scaffolds (around 100Mb). I wonder which part is the most time consuming for genome annotation (repeat masking, blast, or polishing?).  Particularly, I wonder whether the blastx of protein evidence will take majority of time. Now, I have prepared 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences as protein evidences. I am considering whether I can save much time if I only use the 99k mammalian Swiss protein sequences as evidences.
>>>>>> 
>>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN
>>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>>> 
>>>>>> Also, if you double the dataset size, you double the runtime. Larger window sizes via max_dna_len will also increase runtimes.
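
The scaling rules above can be turned into a rough back-of-the-envelope comparison for the protein sets mentioned in question (3). This is only a sketch: the factors 1, 6 and 12 are the "at least" lower bounds stated above, dataset size is approximated by sequence count (total residue count would be a better measure), and the name relative_cost is hypothetical.

```python
# Lower-bound slowdown relative to BLASTN, from the reading-frame counts above.
FRAME_FACTOR = {"blastn": 1, "blastx": 6, "tblastx": 12}


def relative_cost(program, n_sequences):
    """Unitless cost estimate: frame factor times dataset size.

    Assumes runtime grows linearly with the number of database sequences,
    which is a simplification of 'double the dataset, double the runtime'.
    """
    return FRAME_FACTOR[program] * n_sequences


# 99k Swiss-Prot proteins alone versus Swiss-Prot plus 340k TrEMBL proteins.
swissprot_only = relative_cost("blastx", 99_000)
with_trembl = relative_cost("blastx", 99_000 + 340_000)
print("BLASTX slowdown from adding TrEMBL: about %.1fx" % (with_trembl / swissprot_only))
```

Under these assumptions, adding the 340k TrEMBL sequences makes the BLASTX stage take roughly 4.4 times as long as using the 99k Swiss-Prot sequences alone; the other pipeline stages are not covered by this estimate.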
>>>>>> 
>>>>>> 
>>>>>>> (4) For some reason, I cannot run maker through MPI on our cluster. So I can only start multiple maker instances. I wonder if it is possible to let multiple maker instances annotate the same long scaffold (i.e., for a single sequence I start multiple maker instances, without splitting the long sequence into shorter ones).
>>>>>> 
>>>>>> Without MPI you won't be able to split up large contigs. At the very least you can try and run on a single node and set MPI to use all CPUs on that node. It's less difficult to set up compared to cross node jobs via MPI.
>>>>>> 
>>>>>> 
>>>>>>> (5) Still about the speed issue. I read some of your comments about the "cpus" parameter in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html). I understand it indicates the number of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?
>>>>>> 
>>>>>> The cpus parameter only affects how many CPUs are given to the blast command line, so only the BLAST step will speed up. I recommend using MPI to get all steps to speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
>>>>>> 
>>>>>> 
>>>>>> --Carson
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
> 
> 


From qwzhang0601 at gmail.com  Wed Sep 13 14:26:11 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 13 Sep 2017 16:26:11 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <55597FCD-B589-4FFB-A8FA-37659A48BF58@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
	<A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
	<CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>
	<CAOW6FSJcdMTuLbN8-O-Vb7OxEXvU=F3Nwz7Rhx6qzXJ30ubV1w@mail.gmail.com>
	<6B4E0F47-7186-4B12-97A1-EB63D2EB7079@gmail.com>
	<CAOW6FS+JddppLEgbNjS3ipBUc67tbC1FeUfi9=8TzN+NB35X8g@mail.gmail.com>
	<55597FCD-B589-4FFB-A8FA-37659A48BF58@gmail.com>
Message-ID: <CAOW6FSKU9Tn6HN3fZAnXquVU0OrdsxZuHB8GCG76BNQAZ_kdKg@mail.gmail.com>

Dear Carson:

I will take a look and try it. Thank you.

Best
Quanwei

2017-09-13 16:21 GMT-04:00 Carson Holt <carsonhh at gmail.com>:

> One final thought. If you are using rmblast as part of the RepeatMasker
> installation, it may be suffering from a bug that affects some BLAST versions
> and can sometimes lead to truncation of a BLAST report (example of a
> separate error related to BLAST report truncation here) -->
> https://groups.google.com/forum/#!topic/maker-devel/96KGgiQMcxQ
>
> As a result there is a special update to rmblast -->
> http://www.repeatmasker.org/RMBlast.html
>
> So if you are not using the update, try it; but if you are using the update
> and it is giving the error, roll back to rmblast 2.2.28 (i.e. the update
> may be the cause or the cure of RepeatMasker errors).
>
> --Carson
>
>
>
> On Sep 13, 2017, at 1:42 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>
> Dear Carson:
>
> Thank you for your explanation. Sorry for not describing my problem
> clearly. The first two errors were all gone after I changed the parameters
> you suggested (e.g., max_dna_len, depth_blast). Now I only get the
> following error for two contigs among thousands of contigs. One of the two
> failed contigs has length 863k, and I have done more tests on this contig
> individually. Running RepeatMasker on this contig, 65% was masked when
> using the species specific repeat library, while only 35% was masked when
> using the mammalian repeat library. Since longer contigs (even 98Mb) can all
> be annotated, I do not understand why this much shorter one would fail due
> to IO.
>
> I did not set "TMP", and I am running on a high performance cluster. I am
> not sure whether the temporary directory is a virtual (in-memory) location
> or not. I will check it later. Many thanks
>
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm
> line 188.
> 33708 --> rank=NA, hostname=n409
> 33709 ERROR: Failed while processing all repeats
> 33710 ERROR: Chunk failed at level:3, tier_type:1
> 33711 FAILED CONTIG:Contig31
>
> Best
> Quanwei
>
> 2017-09-13 14:23 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>
>> These are the 3 errors you have shown in your e-mails -->
>> open3: fork failed: Cannot allocate memory at
>> /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm
>> line 1050.
>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm
>> line 188.
>>
>> The first two are memory related with the second being because it cannot
>> kill a lock maintainer thread that it was not able to start because of lack
>> of memory.
>>
>> The third one is IO related. It is a truncated file that succeeded on the
>> second try according to the e-mail you sent.
>>
>>
>> IO errors are quite common with NFS (network mounted file systems). It's
>> one of the most frequent issues submitted to the devel list. MAKER can hit
>> IO limits long before it hits CPU limits. One of the most frequent causes
>> of these issues is that the user set TMP= in the control files to a manual
>> location that is not suitable for high IO (note TMP= defaults to /tmp). The
>> location should always be a true locally mounted disk. Sometimes the
>> temporary directory is actually a virtual location (not really local disk
>> but network mounted disk or an in memory location). With the former you
>> will get frequent IO failures and with the latter you will also get out of
>> memory issues.
>>
>> Note that when you supply more data files you will also use more memory
>> (to hold analysis results). According to your e-mail the last error you got
>> was 'Can't kill a non-numeric process ID'. Correct? So getting the error
>> with two input files but not when you supply a single input file further
>> suggests you are running low on RAM.
>>
>> 1. Some things to check. Make sure TMP= is not being set to a network
>> mounted location.
>> 2. Make sure your temporary directory is not a virtual in memory
>> directory on the node being used.
>> 3. If nodes are shared, you may run out of memory because of other users
>> or because you failed to request enough RAM during job submission.
>>
>> Finally, try running interactively so you can see what the memory and
>> directory locations look like on the node you get assigned for the job
>> (check space and mount points. Is /tmp or wherever you set TMP= in fact a
>> local disk?). Also run with MPI rather than starting multiple MAKER
>> instances. It uses resources better.
>>
>> Thanks,
>> Carson
>>
>>
>>
>>
>>
>>
>> On Sep 13, 2017, at 8:32 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>
>> Dear Carson:
>>
>> I did more tests on one of the contigs (with length 863kb) that failed
>> during repeat masking. I found it only fails when I add the species
>> specific repeat library, and it can be successfully annotated when only
>> the mammalian repeat library is used. For the test I picked only
>> this contig and ran MAKER with 64G of memory. So I think the failure should
>> not be a problem with memory or IO, because even contigs with length
>> 98Mb can be annotated with 32G of memory.
>>
>> I also ran RepeatMasker on this contig with the mammalian and species
>> specific repeat libraries separately. With the mammalian repeat
>> library, about 35% was masked as repeats, while it is 65% with the
>> species specific repeat library (as shown below). I wonder whether
>> the high level of repeats can lead to the failure of this contig. Do you
>> have any ideas about this? Thanks
>>
>>
>>
>> file name: test_scaffold31.fasta
>> sequences:             1
>> total length:     863590 bp  (858757 bp excl N/X-runs)
>> GC level:         37.02 %
>> bases masked:     562909 bp ( 65.18 %)
>> ==================================================
>>                number of      length   percentage
>>                elements*    occupied  of sequence
>> --------------------------------------------------
>> SINEs:              113        16134 bp    1.87 %
>>       ALUs           71        12479 bp    1.45 %
>>       MIRs            1          133 bp    0.02 %
>>
>> LINEs:              251       380142 bp   44.02 %
>>       LINE1         211       210623 bp   24.39 %
>>       LINE2           1           86 bp    0.01 %
>>       L3/CR1          0            0 bp    0.00 %
>>
>> LTR elements:       246       101221 bp   11.72 %
>>       ERVL            5         1037 bp    0.12 %
>>       ERVL-MaLRs     18         2744 bp    0.32 %
>>       ERV_classI    201        90942 bp   10.53 %
>>       ERV_classII    18         5964 bp    0.69 %
>>
>> DNA elements:        39        14177 bp    1.64 %
>>      hAT-Charlie      7         3864 bp    0.45 %
>>      TcMar-Tigger     7         1706 bp    0.20 %
>>
>> Unclassified:       196        45831 bp    5.31 %
>>
>> Total interspersed repeats:   557505 bp   64.56 %
>>
>>
>> Small RNA:            3          823 bp    0.10 %
>>
>> Satellites:           2          237 bp    0.03 %
>> Simple repeats:      94         4472 bp    0.52 %
>> Low complexity:      18          766 bp    0.09 %
>> ==================================================
>>
>> * most repeats fragmented by insertions or deletions
>>   have been counted as one element
>>
>>
>> The query species was assumed to be homo
>> RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
>>
>> run with rmblastn version 2.2.27+
>> The query was compared to classified sequences in
>> ".../consensi.fa.classifiednoProtFinal"
>>
>>
>> Best
>> Quanwei
>>
>> 2017-09-11 14:33 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>
>>> Dear Carson:
>>>
>>> I see. Thank you. I will try it.
>>>
>>> Best
>>> Quanwei
>>>
>>> 2017-09-11 13:46 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>
>>>> Each node is a single machine. Because you currently run without MPI,
>>>> each MAKER job you submit runs on a single machine. So you are either
>>>> running multiple times on the same node, or you submitted 5 separate batch
>>>> jobs in which case you may have a single maker process on each of 5 nodes.
>>>>
>>>> MPI can parallelize on the same node or across nodes. If you request 10
>>>> nodes, then it can communicate across nodes to run the job on all hardware.
>>>> Or you can run MPI on a single node and ask for all CPUs on that node. In
>>>> that case it will split up work within a single node and use all resources
>>>> just on that node. So if you can't get MPI to work across nodes, you can
>>>> just submit a job that goes to a single node and ask for all CPUs on that
>>>> node (multinode jobs may be hard to configure, but single node jobs are
>>>> very easy). Just set the -n parameter of mpiexec to the CPU count of that
>>>> node, and it will parallelize within the node.
>>>>
>>>> Example command for a 20 CPU node --> mpiexec -n 20 maker
>>>>
>>>> --Carson
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>> wrote:
>>>>
>>>> Dear Carson:
>>>>
>>>> Would you please explain what do you mean by "a single machine"? I am
>>>> running maker2 on our high performance cluster. The cluster has more than
>>>> 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine was used
>>>> as the scheduler. Can I use MPICH3?
>>>>
>>>> Thanks
>>>>
>>>> Best
>>>> Quanwei
>>>>
>>>> 2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>
>>>>> If you are just using a single machine (and not cross machine MPI),
>>>>> use MPICH3 --> https://www.mpich.org
>>>>>
>>>>> It's easy to install yourself, and tends to be very robust to failure.
>>>>>
>>>>> --Carson
>>>>>
>>>>>
>>>>>
>>>>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>>> wrote:
>>>>>
>>>>> Dear Carson:
>>>>>
>>>>> I ran into some problems using MPI. I will give it another try.
>>>>> Thank you!
>>>>>
>>>>> Best
>>>>> Quanwei
>>>>>
>>>>> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>
>>>>>> It could be either. Please use MPI instead of starting multiple
>>>>>> instances. It will greatly reduce both IO and RAM usage.
>>>>>>
>>>>>> --Carson
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Dear Carson:
>>>>>>
>>>>>> I only run 5 Maker instances in each directory (and set cpus=2). If
>>>>>> it is related to memory issue or an IO issue, I am not sure why the much
>>>>>> longer scaffolds (than the failed ones) were all annotated successfully,
>>>>>> but the relatively shorter ones failed.
>>>>>>
>>>>>> I have set "tries=5" (#number of times to try a contig if there is a
>>>>>> failure for some reason). I will try "clean_try=1" and test on the failed
>>>>>> scaffolds individually with larger memory to see whether they can be
>>>>>> annotated.
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> Best
>>>>>> Quanwei
>>>>>>
>>>>>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>>
>>>>>>> I think the cause of the error may have been a little further
>>>>>>> upstream from what you pasted in the e-mail. One thing that may be
>>>>>>> happening is that you are taxing resources (like IO) if running MAKER
>>>>>>> multiple times or on too many CPUs. That can lead to failures because of
>>>>>>> truncated BLAST reports etc. In that case you can just retry and that will
>>>>>>> get around those types of IO derived errors. MAKER can generate a lot of
>>>>>>> IO, and if you are working on network mounted locations (i.e. the storage
>>>>>>> being used is actually across the network), then they can be less robust
>>>>>>> than local storage (when under heavy load NFS can falsely report success on
>>>>>>> read/write operations that actually failed). It's the reason we built in
>>>>>>> the retry capabilities of MAKER.
>>>>>>>
>>>>>>> For contigs that continuously fail, you may need to set clean_try=1.
>>>>>>> That will cause failures to start from scratch (i.e. delete all old reports
>>>>>>> on failure rather than just those suspected of being truncated).
>>>>>>>
>>>>>>> --Carson
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Dear Carson:
>>>>>>>
>>>>>>> About the error in my above email, I found the contig was correctly
>>>>>>> annotated on the second RETRY. So please ignore my last email. But
>>>>>>> now, for a small number of scaffolds, I have problems processing the
>>>>>>> repeats (as shown below). I used both the Mammalia repeat library and a
>>>>>>> species specific repeat library (which was generated by your pipeline
>>>>>>> "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic").
>>>>>>> There were no such problems when I only used the Mammalia repeat
>>>>>>> library. Do you have any ideas about this? What
>>>>>>> could be the reason? Or do you have any suggestions for me to find the
>>>>>>> reason? Many thanks
>>>>>>>
>>>>>>> Here are some parameters I used
>>>>>>>
>>>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>>>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
>>>>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism
>>>>>>> specific repeat library in fasta format for Repe
>>>>>>>
>>>>>>> max_dna_len=300000
>>>>>>> split_hit=40000
>>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>>>
>>>>>>>
>>>>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm
>>>>>>> line 188.
>>>>>>> 33708 --> rank=NA, hostname=n409
>>>>>>> 33709 ERROR: Failed while processing all repeats
>>>>>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>>>>>> 33711 FAILED CONTIG:Contig31
>>>>>>>
>>>>>>>
>>>>>>> Best
>>>>>>> Quanwei
>>>>>>>
>>>>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>>>>>>
>>>>>>>> Dear Carson:
>>>>>>>>
>>>>>>>> I got the following error again. Is this still related to memory
>>>>>>>> issues? I wonder whether other causes could lead to this error.
>>>>>>>> This time, I got this error during training of the SNAP model. Before, even
>>>>>>>> when I set max_dna_len=1Mb, I could train the model successfully. And in the
>>>>>>>> current training (where I get the following error), I have decreased
>>>>>>>> max_dna_len to 300kb. I requested the same amount of memory as before. The
>>>>>>>> only difference is that I am using both the mammalian repeat library and the
>>>>>>>> species specific repeat library, while previously I only used the mammalian
>>>>>>>> repeat library. Will using both repeat libraries greatly increase the memory
>>>>>>>> requirement (even when I decrease max_dna_len from 1Mb to 300kb)? I
>>>>>>>> have also set depth_blast to 30 in the current training.
>>>>>>>>
>>>>>>>> Thank you! Have a nice weekend!
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> #---------------------------------------------------------------------
>>>>>>>> Now starting the contig!!
>>>>>>>> SeqID: Contig10
>>>>>>>> Length: 18773588
>>>>>>>> #---------------------------------------------------------------------
>>>>>>>>
>>>>>>>>
>>>>>>>> setting up GFF3 output and fasta chunks
>>>>>>>> doing repeat masking
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> collecting blastx repeatmasking
>>>>>>>> processing all repeats
>>>>>>>> doing repeat masking
>>>>>>>> Can't kill a non-numeric process ID at
>>>>>>>> /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line
>>>>>>>> 1050.
>>>>>>>> --> rank=NA, hostname=n224
>>>>>>>> ERROR: Failed while doing repeat masking
>>>>>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>>>>>> FAILED CONTIG:Contig10
>>>>>>>>
>>>>>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>>>>>> FAILED CONTIG:Contig10
>>>>>>>>
>>>>>>>> Best
>>>>>>>> Quanwei
>>>>>>>>
>>>>>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> (2) Reading some of your replies in the MAKER Google group, I
>>>>>>>>> noticed that setting depth_blast to a certain number can reduce memory
>>>>>>>>> use and save annotation time. So I changed the following parameters. But
>>>>>>>>> I wonder whether it will decrease the quality of annotation. If it won't
>>>>>>>>> affect the quality, can I even use a smaller number (e.g., 20) to save more
>>>>>>>>> memory and time?
>>>>>>>>>
>>>>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> These values really only affect the final evidence kept in the
>>>>>>>>> GFF3 when you look at it in a browser. They have no effect on the annotation.
>>>>>>>>> This is because internally MAKER already collapses evidence down to the 10
>>>>>>>>> best non-redundant features per evidence set per locus. The rest are put in
>>>>>>>>> the GFF3 just for reference. By setting it lower, you are just letting
>>>>>>>>> MAKER know it can throw things away even sooner since you don't want them
>>>>>>>>> in the GFF3. It provides a minor improvement for memory use, but
>>>>>>>>> max_dna_len is the big one that has the greatest effect.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> (3) I also have some concerns about the speed, especially for the
>>>>>>>>> long scaffolds (around 100Mb). I wonder which part is the most time
>>>>>>>>> consuming for genome annotation (repeat masking, blast, or polishing?).
>>>>>>>>> Particularly, I wonder whether the blastx of protein evidence will take
>>>>>>>>> majority of time. Now, I have prepared 99k mammalian Swiss protein
>>>>>>>>> sequences and 340k rodent TrEMBL protein sequences as protein evidences. I
>>>>>>>>> am considering whether I can save much time if I only use the 99k mammalian
>>>>>>>>> Swiss protein sequences as evidences.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>>>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at
>>>>>>>>> least 6 times slower than BLASTN
>>>>>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at
>>>>>>>>> least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>>>>>>
>>>>>>>>> Also, doubling the dataset size doubles the runtime. Larger window
>>>>>>>>> sizes via max_dna_len will also increase runtimes.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> (4) For some reason, I cannot run MAKER through MPI on our
>>>>>>>>> cluster, so I can only start multiple MAKER instances. I wonder whether
>>>>>>>>> it is possible to let multiple MAKER instances annotate the same long
>>>>>>>>> scaffold (i.e., for a single sequence I start multiple MAKER instances,
>>>>>>>>> without splitting the long sequence into shorter ones).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Without MPI you won't be able to split up large contigs. At the
>>>>>>>>> very least you can try to run on a single node and set MPI to use all CPUs
>>>>>>>>> on that node. It's less difficult to set up than cross-node jobs via
>>>>>>>>> MPI.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> (5) Still about the speed issue. I read some of your comments
>>>>>>>>> about the "cpus" parameter in the maker_opts file
>>>>>>>>> (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html),
>>>>>>>>> and I know it indicates the number
>>>>>>>>> of CPUs for a single chunk. So if I set "cpus=2" in the maker_opts file,
>>>>>>>>> then I can use the following command to submit the job, right?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The cpus parameter only affects how many CPUs are given to the
>>>>>>>>> BLAST command line, so only the BLAST step will speed up. I recommend
>>>>>>>>> using MPI to speed up all steps. Even if you are only running on a
>>>>>>>>> single node, you can give all CPUs to the mpiexec command.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --Carson
>>>>>>>>>

From xvazquezc at gmail.com  Sun Sep 17 19:12:56 2017
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Mon, 18 Sep 2017 11:12:56 +1000
Subject: [maker-devel] augustus underpredicting
In-Reply-To: <E0AFDA01-3142-4A14-A4F0-E621E5BD4C2F@gmail.com>
References: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>
	<E0AFDA01-3142-4A14-A4F0-E621E5BD4C2F@gmail.com>
Message-ID: <CAL0hg4HWHjmEJ680+c4XiajWaE5WKqfbVK4oeNYVDWPuz_Oh+Q@mail.gmail.com>

I did it that way and AUGUSTUS is predicting a more reasonable number of
genes, about 12500 in Maker, but about 19000 in the model assessment step.
In comparison, SNAP gives 16000 and GeneMark 19000.

I haven't found any reference about it, but would it be a good idea to train
Augustus on the masked genome instead?
Thanks,


On 12 September 2017 at 02:50, Carson Holt <carsonhh at gmail.com> wrote:

> BUSCO may be generating too few models. BUSCO also identifies classes of
> conserved short genes that may not represent enough training diversity for
> your organism. Try running MAKER in protein2genome or est2genome mode, and
> then train with those results.
>
> --Carson
>
>
> On Sep 10, 2017, at 7:03 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com>
> wrote:
>
> Hi,
> I have been annotating a fungal genome as usual, using Busco-trained
> Augustus (in addition to GeneMark and SNAP), but for some reason, Augustus
> is predicting a mere 207 genes compared to 15-20k from the other two.
> I've never had this problem. The genome has an unusual repeat content
> close to 50%, not sure if that might pose a problem.
> Has anybody come across any similar issue?
> I also asked the Busco developers if they have any idea
> https://gitlab.com/ezlab/busco/issues/49
> Cheers,
> Xabi
>
> --
> Xabier Vázquez-Campos, *PhD*
> *Research Associate*
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA
>
>
>


-- 
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From qwzhang0601 at gmail.com  Mon Sep 18 21:07:25 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 18 Sep 2017 23:07:25 -0400
Subject: [maker-devel] Question about "maker-", "augustus_masked",
 "snap_masked" gene model
Message-ID: <CAOW6FS+MkBOsHct0Ph+mdHD-VTeyN2RbHS2_1Neye6qZtUDWrA@mail.gmail.com>

Hello:

Would you please explain the difference between
"maker-...-augustus..." and "augustus_masked..." gene models?

I know "augustus_masked..." gene models are raw Augustus predictions, while
"maker-...-augustus..." are hit derived gene models. But by default, MAKER2
reports gene models with evidence support (protein sequences or
transcripts). Then why are some gene models hit derived while other models
(with evidence support) are raw Augustus predictions (even though there is
protein sequence or transcript evidence)?

BTW, is it true that the "maker-...-augustus..." gene models are generally
more reliable than the "augustus_masked..." gene models?

Thanks

Best
Quanwei

From qwzhang0601 at gmail.com  Mon Sep 18 22:14:38 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Tue, 19 Sep 2017 00:14:38 -0400
Subject: [maker-devel] about min_protein
Message-ID: <CAOW6FSK5PcBbqnOt2s-ObUoEGFhOpXNYhcj=wxzGLWRqJpjusg@mail.gmail.com>

Hello:

I am working on a rodent species and get 28k annotated genes. Do you have
any suggestions about the "min_protein" parameter?

I did not change the parameter in my current annotation, and I get several
very short predicted proteins (even some with only 1 amino acid).

min_protein=0 #require at least this many amino acids in predicted proteins

Thanks

Best
Quanwei

From qwzhang0601 at gmail.com  Tue Sep 19 06:47:00 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Tue, 19 Sep 2017 08:47:00 -0400
Subject: [maker-devel] about min_protein
In-Reply-To: <CADE62ED-68F2-4E48-9013-C9BF66E56CAD@gmail.com>
References: <CAOW6FSK5PcBbqnOt2s-ObUoEGFhOpXNYhcj=wxzGLWRqJpjusg@mail.gmail.com>
	<CADE62ED-68F2-4E48-9013-C9BF66E56CAD@gmail.com>
Message-ID: <CAOW6FSLxd=oEHcpq73yGjDQjsdTbU=AE89u6HT3m8md=uiceSQ@mail.gmail.com>

Thank you Daniel. I wonder whether there is a suggested value for the
"min_protein" parameter (e.g., 20 amino acids, 50 amino acids?) that people
often use. I am studying a rodent species.

Thank you.

Best
Quanwei

2017-09-19 8:29 GMT-04:00 Daniel Ence <dandence at gmail.com>:

> Hi Quanwei,
>
> Increasing the "min_protein" parameter should get rid of those very short
> predicted proteins.
>
>
>
> > On Sep 19, 2017, at 12:14 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
> wrote:
> >
> > Hello:
> >
> > I am working on a rodent species and get 28k annotated genes, I wonder
> whether you have any suggestions about the "min_protein" parameter?
> >
> > I did not change the parameter in my current annotation. I get several
> very short predicted proteins (even those with only 1 amino acid).
> >
> > min_protein=0 #require at least this many amino acids in predicted
> proteins
> >
> > Thanks
> >
> > Best
> > Quanwei
> > _______________________________________________
> > maker-devel mailing list
> > maker-devel at box290.bluehost.com
> > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
>

From dandence at gmail.com  Tue Sep 19 06:29:35 2017
From: dandence at gmail.com (Daniel Ence)
Date: Tue, 19 Sep 2017 08:29:35 -0400
Subject: [maker-devel] about min_protein
In-Reply-To: <CAOW6FSK5PcBbqnOt2s-ObUoEGFhOpXNYhcj=wxzGLWRqJpjusg@mail.gmail.com>
References: <CAOW6FSK5PcBbqnOt2s-ObUoEGFhOpXNYhcj=wxzGLWRqJpjusg@mail.gmail.com>
Message-ID: <CADE62ED-68F2-4E48-9013-C9BF66E56CAD@gmail.com>

Hi Quanwei, 

Increasing the "min_protein" parameter should get rid of those very short predicted proteins.
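
One way to pick a value is to look at the length distribution of the proteins MAKER has already predicted. The sketch below is a hypothetical helper, not part of MAKER: it parses protein FASTA text and reports how many sequences fall below a few candidate min_protein cutoffs. The cutoffs (20, 30, 50), the function names and the sample records are illustrative assumptions only.

```python
from typing import Dict, Iterable, List


def protein_lengths(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Return {sequence_id: length in amino acids} for protein FASTA text."""
    lengths: Dict[str, int] = {}
    current = None
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            # Stop symbols ('*') are not counted as amino acids.
            lengths[current] += len(line.replace("*", ""))
    return lengths


def count_below(lengths: Dict[str, int], cutoffs: List[int]) -> Dict[int, int]:
    """For each cutoff, count proteins shorter than that many amino acids."""
    return {c: sum(1 for n in lengths.values() if n < c) for c in cutoffs}


if __name__ == "__main__":
    # Made-up records; in practice, read the proteins FASTA written by MAKER.
    example = [">gene1", "M", ">gene2", "MKT" * 10, ">gene3", "MA" * 40]
    print(count_below(protein_lengths(example), [20, 30, 50]))
```

If only a handful of models fall below a cutoff, setting min_protein to that value would remove just those; the right threshold for a given species remains a judgment call.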


> On Sep 19, 2017, at 12:14 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Hello:
> 
> I am working on a rodent species and get 28k annotated genes, I wonder whether you have any suggestions about the "min_protein" parameter? 
> 
> I did not change the parameter in my current annotation. I get several very short predicted proteins (even those with only 1 amino acid). 
>  
> min_protein=0 #require at least this many amino acids in predicted proteins
> 
> Thanks
> 
> Best
> Quanwei


From tuanduonganh at gmail.com  Tue Sep 19 11:23:39 2017
From: tuanduonganh at gmail.com (Tuan Duong Anh)
Date: Tue, 19 Sep 2017 19:23:39 +0200
Subject: [maker-devel] MAKER3 beta - EVM under predicting
Message-ID: <CAPpYYpyY_4kz4L10qMPU9RuRgmW+viFU8wVwOUaYV=WZnV=wHQ@mail.gmail.com>

Dear MAKER-devel group

I have been testing the MAKER3 beta version and found that EVM always
returns far fewer models. Has anyone experienced this before? I do expect
EVM to return fewer models than the other predictors, but not to this
extent (only 20% of the expected gene models). Any suggestions would
be much appreciated.

## Number of models obtained by each gene predictor:

HLIG.all.maker.augustus_masked.proteins.fasta:11224
HLIG.all.maker.evm.proteins.fasta:1974
HLIG.all.maker.genemark.proteins.fasta:11352
HLIG.all.maker.proteins.fasta:13672
HLIG.all.maker.snap_masked.proteins.fasta:13404
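
To put the counts above in proportion, they can be turned into ratios. This is a small sketch that uses only the numbers reported in this message; the shortened dictionary keys and the function name are chosen for illustration.

```python
# Record counts copied from the listing above (label -> number of protein models).
counts = {
    "augustus_masked": 11224,
    "evm": 1974,
    "genemark": 11352,
    "maker": 13672,
    "snap_masked": 13404,
}


def fraction_of(counts, numerator, denominator):
    """Return counts[numerator] / counts[denominator] as a float."""
    return counts[numerator] / counts[denominator]


if __name__ == "__main__":
    # Compare the EVM count with every other set.
    for name in sorted(counts):
        if name != "evm":
            pct = 100 * fraction_of(counts, "evm", name)
            print(f"evm vs {name}: {pct:.1f}%")
```

With these numbers the EVM set is roughly 14% to 18% of each of the other sets.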

## maker_evm.ctl

#-----Transcript weights
evmtrans=10 #default weight for source unspecified est/alt_est alignments
evmtrans:blastn=0 #weight for blastn sourced alignments
evmtrans:est2genome=10 #weight for est2genome sourced alignments
evmtrans:tblastx=0 #weight for tblastx sourced alignments
evmtrans:cdna2genome=7 #weight for cdna2genome sourced alignments

#-----Protein weights
evmprot=10 #default weight for source unspecified protein alignments
evmprot:blastx=2 #weight for blastx sourced alignments
evmprot:protein2genome=10 #weight for protein2genome sourced alignments

#-----Abinitio Prediction weights
evmab=10 #default weight for source unspecified ab initio predictions
evmab:snap=7 #weight for snap sourced predictions
evmab:augustus=10 #weight for augustus sourced predictions
evmab:fgenesh=10 #weight for fgenesh sourced predictions
evmab:genemark=10 #weight for genemark sourced predictions


Regards,

Tuan

From carsonhh at gmail.com  Tue Sep 19 15:34:40 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 Sep 2017 15:34:40 -0600
Subject: [maker-devel] augustus underpredicting
In-Reply-To: <CAL0hg4HWHjmEJ680+c4XiajWaE5WKqfbVK4oeNYVDWPuz_Oh+Q@mail.gmail.com>
References: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>
	<E0AFDA01-3142-4A14-A4F0-E621E5BD4C2F@gmail.com>
	<CAL0hg4HWHjmEJ680+c4XiajWaE5WKqfbVK4oeNYVDWPuz_Oh+Q@mail.gmail.com>
Message-ID: <3B1C5B03-58F9-4427-815A-1D725C740A04@gmail.com>

Gene predictors tend to overpredict, so I would not take the high numbers given by SNAP and GeneMark as true counts. You will probably end up with something like 7-10k in the final results. But now that Augustus is giving a higher count, you should be good to start running MAKER.

--Carson


> On Sep 17, 2017, at 7:12 PM, Xabier V?zquez-Campos <xvazquezc at gmail.com> wrote:
> 
> I did it that way and AUGUSTUS is predicting a more reasonable number of genes, about 12500 in Maker, but about 19000 in the model assessment step.
> In comparison, SNAP gives 16000 and GeneMark 19000.
> 
> I haven't found any reference about but, would it be a good idea to train Augustus over the masked genome instead?
> Thanks,
> 
> 
> 
> On 12 September 2017 at 02:50, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
> BUSCO may be generating too few models. BUSCO also identifies classes of conserved short genes that may not represent enough training diversity for your organism. Try running MAKER in protein2genome or est2genome mode, and then train with those results.
> 
> --Carson
> 
> 
>> On Sep 10, 2017, at 7:03 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com> wrote:
>> 
>> Hi,
>> I have been annotating a fungal genome as usual, using Busco-trained Augustus (in addition to GeneMark and SNAP), but for some reason, Augustus is predicting a mere 207 genes compared to 15-20k from the other two.
>> I've never had this problem. The genome has an unusual repeat content close to 50%, not sure if that might suppose a problem.
>> Has anybody come up with any similar issue?
>> I also asked to Busco developers if they have any idea https://gitlab.com/ezlab/busco/issues/49 <https://gitlab.com/ezlab/busco/issues/49>
>> Cheers,
>> Xabi
>> 
>> -- 
>> Xabier Vázquez-Campos, PhD
>> Research Associate
>> NSW Systems Biology Initiative
>> School of Biotechnology and Biomolecular Sciences
>> The University of New South Wales
>> Sydney NSW 2052 AUSTRALIA
> 
> 
> 
> 
> -- 
> Xabier Vázquez-Campos, PhD
> Research Associate
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA


From carsonhh at gmail.com  Tue Sep 19 15:40:27 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 Sep 2017 15:40:27 -0600
Subject: [maker-devel] Question about "maker-", "augustus_masked",
 "snap_masked" gene model
In-Reply-To: <CAOW6FS+MkBOsHct0Ph+mdHD-VTeyN2RbHS2_1Neye6qZtUDWrA@mail.gmail.com>
References: <CAOW6FS+MkBOsHct0Ph+mdHD-VTeyN2RbHS2_1Neye6qZtUDWrA@mail.gmail.com>
Message-ID: <56CC4BEB-083E-4DE6-99F3-CB34A1735AB4@gmail.com>

MAKER uses all derived models as a pool of alternate models for a given locus. The one that best matches the aligned evidence is then selected using the AED calculation described in the MAKER2 publication. Overall, hint-based models tend to perform better than the raw models because they get extra information about observed intron/exon structure from alignments. There is also a discussion of this in the MAKER2 paper.
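
As a rough illustration of that selection step (a sketch only, not MAKER's actual code): the MAKER2 paper describes AED as 1 minus the average of sensitivity and specificity of a model's overlap with the evidence. The Python below computes a simplified nucleotide-level version and picks the candidate with the lowest AED. The function names and the toy coordinates are hypothetical.

```python
def covered(intervals):
    """Set of base positions covered by 1-based inclusive (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def aed(model_exons, evidence_exons):
    """Simplified nucleotide-level AED: 0 = perfect agreement, 1 = no overlap."""
    model = covered(model_exons)
    evidence = covered(evidence_exons)
    if not model or not evidence:
        return 1.0
    overlap = len(model & evidence)
    sensitivity = overlap / len(evidence)  # fraction of evidence bases the model covers
    specificity = overlap / len(model)     # fraction of model bases backed by evidence
    return 1.0 - (sensitivity + specificity) / 2.0


def best_model(candidates, evidence_exons):
    """Name of the candidate model with the lowest AED against the evidence."""
    return min(candidates, key=lambda name: aed(candidates[name], evidence_exons))


# Toy locus (made-up coordinates): the hint-derived model follows the aligned
# exon boundaries, while the raw prediction extends the second exon.
evidence = [(100, 200), (300, 400)]
candidates = {
    "augustus_masked": [(100, 200), (300, 450)],
    "maker-augustus": [(100, 200), (300, 400)],
}
print(best_model(candidates, evidence))
```

In this toy case the hint-derived model wins simply because its exon boundaries match the evidence exactly; on real data either kind of model can end up with the lower AED.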

--Carson


> On Sep 18, 2017, at 9:07 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Hello:
> 
> Would you please explain what is the difference between "maker-...-augustus..." and "augustus_masked..." gene models? 
> 
> I know "augustus_masked..." gene models are raw Augustus predictions, while "maker-...-augustus..." are hint-derived gene models. But by default, MAKER2 reports gene models with evidence support (protein sequences or transcripts). Then why are some gene models hint derived while other models (with evidence support) are raw Augustus predictions (even when there is protein sequence or transcript evidence)?
> 
> BTW, is it true that generally the "maker-...-augustus..." gene models are more reliable than the "augustus_masked..." gene models?  
> 
> Thanks
> 
> Best
> Quanwei


From carsonhh at gmail.com  Tue Sep 19 15:41:40 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 Sep 2017 15:41:40 -0600
Subject: [maker-devel] about min_protein
In-Reply-To: <CAOW6FSLxd=oEHcpq73yGjDQjsdTbU=AE89u6HT3m8md=uiceSQ@mail.gmail.com>
References: <CAOW6FSK5PcBbqnOt2s-ObUoEGFhOpXNYhcj=wxzGLWRqJpjusg@mail.gmail.com>
	<CADE62ED-68F2-4E48-9013-C9BF66E56CAD@gmail.com>
	<CAOW6FSLxd=oEHcpq73yGjDQjsdTbU=AE89u6HT3m8md=uiceSQ@mail.gmail.com>
Message-ID: <FFA05628-32ED-4036-9FDC-E6C7BC4EAE4C@gmail.com>

The value is arbitrary, but some submission databases like NCBI will flag entries under ~20-30 amino acids as errors if you try to submit them (I can't remember the exact number).
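
To see how many predicted proteins a given min_protein cutoff would remove before rerunning, here is a minimal Python sketch. It is plain FASTA parsing and nothing MAKER-specific; the function names are made up and the 30 aa cutoff is just an example value, not a recommendation.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    name, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(seq)
            name, seq = line[1:].strip(), []
        elif name is not None:
            seq.append(line)
    if name is not None:
        yield name, "".join(seq)


def short_proteins(lines, min_len):
    """Headers of proteins shorter than min_len amino acids.

    A trailing '*' (stop codon symbol) is not counted as a residue.
    """
    return [name for name, seq in read_fasta(lines)
            if len(seq.rstrip("*")) < min_len]


# Toy input with invented records: one 1-aa protein and one 40-aa protein.
example = """>gene1-mRNA-1
M
>gene2-mRNA-1
MSTNPKPQRKTKRNTNRRPQDVKFPGGGQIVGGVYLLPRR
""".splitlines()

print(short_proteins(example, 30))
```

On a real run you would pass an open file handle for the proteins FASTA instead of the toy list.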

--Carson


> On Sep 19, 2017, at 6:47 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Thank you Daniel. I wonder whether there is a suggested value for the "min_protein" parameter (e.g., 20 amino acids, 50 amino acids?) that people often use. I am studying a rodent species. 
> 
> Thank you.
> 
> Best
> Quanwei
> 
> 2017-09-19 8:29 GMT-04:00 Daniel Ence <dandence at gmail.com <mailto:dandence at gmail.com>>:
> Hi Quanwei,
> 
> Increasing the "min_protein" parameter should get rid of those very short predicted proteins.
> 
> 
> 
> > On Sep 19, 2017, at 12:14 AM, Quanwei Zhang <qwzhang0601 at gmail.com <mailto:qwzhang0601 at gmail.com>> wrote:
> >
> > Hello:
> >
> > I am working on a rodent species and get 28k annotated genes, I wonder whether you have any suggestions about the "min_protein" parameter?
> >
> > I did not change the parameter in my current annotation. I get several very short predicted proteins (even those with only 1 amino acid).
> >
> > min_protein=0 #require at least this many amino acids in predicted proteins
> >
> > Thanks
> >
> > Best
> > Quanwei
> > _______________________________________________
> > maker-devel mailing list
> > maker-devel at box290.bluehost.com <mailto:maker-devel at box290.bluehost.com>
> > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org <http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org>
> 
> 

From carsonhh at gmail.com  Tue Sep 19 15:47:42 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 Sep 2017 15:47:42 -0600
Subject: [maker-devel] MAKER3 beta - EVM under predicting
In-Reply-To: <CAPpYYpyY_4kz4L10qMPU9RuRgmW+viFU8wVwOUaYV=WZnV=wHQ@mail.gmail.com>
References: <CAPpYYpyY_4kz4L10qMPU9RuRgmW+viFU8wVwOUaYV=WZnV=wHQ@mail.gmail.com>
Message-ID: <12FE3318-F0DE-485B-B43A-25A4A6EC9390@gmail.com>

If ab initio predictors and evidence alignments aren't in high concordance, then EVM won't produce results. This often indicates minor sequencing errors in the assembly (this is very common in draft assemblies). Ab initio predictors will slightly alter splicing and extend introns/exons to make a model work around these variations, but the resulting model does not always agree well with the alignment, so EVM produces nothing. In these cases it is often better just to train the predictor as well as you can, and then take the standard MAKER results.

--Carson


> On Sep 19, 2017, at 11:23 AM, Tuan Duong Anh <tuanduonganh at gmail.com> wrote:
> 
> Dear MAKER-devel group
> 
> I have been testing out the MAKER3 beta version and found that EVM always returns far fewer models. Has anyone experienced this before? I do expect that EVM will return fewer models compared to the others, but not to this extent (only 20% of the expected gene models). Any suggestions would be much appreciated.
> 
> ## Number of models obtained by each gene predictors:
> HLIG.all.maker.augustus_masked.proteins.fasta:11224
> HLIG.all.maker.evm.proteins.fasta:1974
> HLIG.all.maker.genemark.proteins.fasta:11352
> HLIG.all.maker.proteins.fasta:13672
> HLIG.all.maker.snap_masked.proteins.fasta:13404
> 
> ## maker_evm.ctl
> #-----Transcript weights
> evmtrans=10 #default weight for source unspecified est/alt_est alignments
> evmtrans:blastn=0 #weight for blastn sourced alignments
> evmtrans:est2genome=10 #weight for est2genome sourced alignments
> evmtrans:tblastx=0 #weight for tblastx sourced alignments
> evmtrans:cdna2genome=7 #weight for cdna2genome sourced alignments
> 
> #-----Protein weights
> evmprot=10 #default weight for source unspecified protein alignments
> evmprot:blastx=2 #weight for blastx sourced alignments
> evmprot:protein2genome=10 #weight for protein2genome sourced alignments
> 
> #-----Abinitio Prediction weights
> evmab=10 #default weight for source unspecified ab initio predictions
> evmab:snap=7 #weight for snap sourced predictions
> evmab:augustus=10 #weight for augustus sourced predictions
> evmab:fgenesh=10 #weight for fgenesh sourced predictions
> evmab:genemark=10 #weight for genemark sourced predictions
> 
> 
> Regards,
> Tuan
> 
> 

From xvazquezc at gmail.com  Tue Sep 19 18:02:04 2017
From: xvazquezc at gmail.com (=?UTF-8?Q?Xabier_V=C3=A1zquez=2DCampos?=)
Date: Wed, 20 Sep 2017 10:02:04 +1000
Subject: [maker-devel] augustus underpredicting
In-Reply-To: <3B1C5B03-58F9-4427-815A-1D725C740A04@gmail.com>
References: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>
	<E0AFDA01-3142-4A14-A4F0-E621E5BD4C2F@gmail.com>
	<CAL0hg4HWHjmEJ680+c4XiajWaE5WKqfbVK4oeNYVDWPuz_Oh+Q@mail.gmail.com>
	<3B1C5B03-58F9-4427-815A-1D725C740A04@gmail.com>
Message-ID: <CAL0hg4FDD8JMXCauxuSjqsY0z=H92utgqMfmoH+s_Los6_c39g@mail.gmail.com>

Thanks Carson.

Last quick question. After the first run (before using the gene predictors)
I ran fasta_merge to get an idea of the numbers I should be looking for.
In summary, I got 14000 genes, only using Swissprot and a close highly
curated reference genome to avoid any "fake" protein or partial proteins
from draft annotations, plus assembled RNA-seq from my genome.
How should I consider this as a guide (if I can do so)? Is this a
number I should be aiming for as a minimum number of genes? A maximum?
Something around that?

PS my genome (fungus) is 80+ Mbp and just 150 contigs so I expect very few
possible fragments due to assembly (seq errors aside)

On 20 September 2017 at 07:34, Carson Holt <carsonhh at gmail.com> wrote:

> Gene predictors tend to over predict, so I would not take the high numbers
> given by SNAP and GeneMark as true counts. You will probably end up with
> something like 7-10k in the final results. But now Augustus is giving a
> higher count, you should be good to start running MAKER.
>
> --Carson
>
>
>
>
> On Sep 17, 2017, at 7:12 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com>
> wrote:
>
> I did it that way and AUGUSTUS is predicting a more reasonable number of
> genes, about 12500 in Maker, but about 19000 in the model assessment step.
> In comparison, SNAP gives 16000 and GeneMark 19000.
>
> I haven't found any reference about but, would it be a good idea to train
> Augustus over the masked genome instead?
> Thanks,
>
>
>
> On 12 September 2017 at 02:50, Carson Holt <carsonhh at gmail.com> wrote:
>
>> BUSCO may be generating too few models. BUSCO also identifies classes of
>> conserved short genes that may not represent enough training diversity for
>> your organism. Try running MAKER in protein2genome or est2genome mode, and
>> then train with those results.
>>
>> --Carson
>>
>>
>> On Sep 10, 2017, at 7:03 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com>
>> wrote:
>>
>> Hi,
>> I have been annotating a fungal genome as usual, using Busco-trained
>> Augustus (in addition to GeneMark and SNAP), but for some reason, Augustus
>> is predicting a mere 207 genes compared to 15-20k from the other two.
>> I've never had this problem. The genome has an unusual repeat content
>> close to 50%, not sure if that might suppose a problem.
>> Has anybody come up with any similar issue?
>> I also asked to Busco developers if they have any idea
>> https://gitlab.com/ezlab/busco/issues/49
>> Cheers,
>> Xabi
>>
>> --
>> Xabier Vázquez-Campos, *PhD*
>> *Research Associate*
>> NSW Systems Biology Initiative
>> School of Biotechnology and Biomolecular Sciences
>> The University of New South Wales
>> Sydney NSW 2052 AUSTRALIA
>>
>>
>>
>
>
> --
> Xabier Vázquez-Campos, *PhD*
> *Research Associate*
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA
>
>
>


-- 
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From himanimalhotra89 at gmail.com  Tue Sep 19 22:56:55 2017
From: himanimalhotra89 at gmail.com (himani malhotra)
Date: Wed, 20 Sep 2017 10:26:55 +0530
Subject: [maker-devel] Fwd: maker error
In-Reply-To: <CACOCz95arkZmq+F_cVjdcCH1GZap6UYvSsTY3oLJZY1+gJsY1w@mail.gmail.com>
References: <CACOCz95arkZmq+F_cVjdcCH1GZap6UYvSsTY3oLJZY1+gJsY1w@mail.gmail.com>
Message-ID: <CACOCz94NFJoo2GmohrCnvNGi3YDtK8TV5dX-0En7XMHRfLiSdA@mail.gmail.com>

---------- Forwarded message ----------
From: himani malhotra <himanimalhotra89 at gmail.com>
Date: Wed, Sep 20, 2017 at 10:24 AM
Subject: maker error
To: maker-devel-request at box290.bluehost.com


Hello,
I am using MAKER for gene prediction. I am getting an error in the Repbase
installation. I am sending you the error as well; please help me. I have
installed Repbase manually and unpacked its libraries in the RepeatMasker
Library, but I am still getting the error.
Please help me.


Thanks

Himani
-------------- next part --------------
A non-text attachment was scrubbed...
Name: makererror.png
Type: image/png
Size: 212522 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20170920/b8709d7b/attachment-0002.png>

From munholl at uwindsor.ca  Wed Sep 20 08:53:04 2017
From: munholl at uwindsor.ca (Seth Munholland)
Date: Wed, 20 Sep 2017 10:53:04 -0400
Subject: [maker-devel] Fwd: maker error
In-Reply-To: <CACOCz94NFJoo2GmohrCnvNGi3YDtK8TV5dX-0En7XMHRfLiSdA@mail.gmail.com>
References: <CACOCz95arkZmq+F_cVjdcCH1GZap6UYvSsTY3oLJZY1+gJsY1w@mail.gmail.com>
	<CACOCz94NFJoo2GmohrCnvNGi3YDtK8TV5dX-0En7XMHRfLiSdA@mail.gmail.com>
Message-ID: <CAL=sJwrjccQC0GdDa3Km1TojWMdN1aYoujntVsjdMjJ9ha2YUw@mail.gmail.com>

Hello,

When this happened to me, it was a faulty path on my part when I
configured RepeatMasker (which I also installed manually).

Seth Munholland, B.Sc., Ph.D. Candidate
Department of Biological Sciences
Rm. 304 Biology Building
University of Windsor
401 Sunset Ave. N9B 3P4
T: (519) 253-3000 Ext: 4755

On Wed, Sep 20, 2017 at 12:56 AM, himani malhotra <
himanimalhotra89 at gmail.com> wrote:

>
> ---------- Forwarded message ----------
> From: himani malhotra <himanimalhotra89 at gmail.com>
> Date: Wed, Sep 20, 2017 at 10:24 AM
> Subject: maker error
> To: maker-devel-request at box290.bluehost.com
>
>
> Hello,
> I am using MAKER for gene prediction. I am getting an error in the Repbase
> installation. I am sending you the error as well; please help me. I have
> installed Repbase manually and unpacked its libraries in the RepeatMasker
> Library, but I am still getting the error.
> Please help me.
>
>
>
> Thanks
>
> Himani
>
>

From Jimmy.Cross at uea.ac.uk  Wed Sep 20 08:02:53 2017
From: Jimmy.Cross at uea.ac.uk (James Cross (ITCS - Staff))
Date: Wed, 20 Sep 2017 14:02:53 +0000
Subject: [maker-devel] Maker MPI across nodes
Message-ID: <DBXPR04MB413DD222F9446CDC493A376BD610@DBXPR04MB413.eurprd04.prod.outlook.com>

Hi Maker Developers,

We are trying to run MAKER with Open MPI on our HPC cluster across two nodes (each node containing 28 cores, so 56 cores in total). While MAKER seems to be running correctly, it is going slower when split across two nodes (56 cores) than when run on a single node (28 cores). We are trying to reduce the time MAKER takes to complete its run. Do you know of any reason why MAKER might slow down when split across two nodes?

Our cluster OS is CentOS 6.7 and the HPC scheduler used is LSF. We are running Open MPI on a Mellanox InfiniBand network.

The genome data we wish to annotate is comprised of 1948 scaffolds with an average length of 324890bp (longest scaffold 6948830bp).

The command in batch mode we are using is: mpiexec -mca btl ^openib -n 56 maker

Any help or advice you could give would be greatly appreciated.

Best Wishes
Jimmy
----------------------------------------------------------------------
Mr  James Cross
HPC Systems Developer
University of East Anglia
Norwich Research Park
ITCS
Norwich, Norfolk
NR4 7TJ

Information Services


From patrick.tranvan at unil.ch  Thu Sep 21 03:26:52 2017
From: patrick.tranvan at unil.ch (Patrick Tran Van)
Date: Thu, 21 Sep 2017 09:26:52 +0000
Subject: [maker-devel] Advice on my pipeline
In-Reply-To: <58E904BF-9AB8-4AC7-B10B-C902F414E03D@gmail.com>
References: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch>
	<A39F4213-70A8-4E07-AB13-00C427E4F244@gmail.com>
	<1498470630221.84642@unil.ch>
	<696C51C6-5606-4ECB-A8B8-9C077182FFFA@gmail.com>
	<1498908228256.16549@unil.ch>,
	<58E904BF-9AB8-4AC7-B10B-C902F414E03D@gmail.com>
Message-ID: <1505986013492.52354@unil.ch>

Hi Carson,


I have a question about round 2. In a previous reply you said:


" Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can't be run in the same directory). "


Does it mean that I don't need to modify the section:


#-----Re-annotation Using MAKER Derived GFF3


?


If I leave everything at the defaults, such as:


altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no


will MAKER not search again for repeats and protein + transcriptome alignments?

Patrick Tran Van

Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206

________________________________
From: Carson Holt <carsonhh at gmail.com>
Sent: Monday, July 3, 2017 10:50 PM
To: Patrick Tran Van
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Advice on my pipeline

maker2zff is just for SNAP training and not for gene filtering (please do not use it for filtering, it does not do what you think).

So the final annotation set after MAKER with correct_est_fusion is 16,850. To decide which set is better, look at them in a browser (gene counts are not useful for gauging results). A well annotated genome will have evidence clusters that closely match the final models. A poorly annotated genome will have evidence clusters that are split or merged by the models.

The correct_est_fusion option does two things. It trims long overlapping UTR fragments, and it stops evidence clusters from being merged on BLASTP evidence alone (so gene predictors will get unmerged hint regions if clusters are split).

You may also find that using jaccard_clip with Trinity has reduced sensitivity for the transcript data (you may lose things that were there before, but now have better specificity, i.e. fewer false positives). Make sure you provide protein data from at least two related species to help maintain sensitivity lost from the transcript data. You can also add rejected gene models back in after the fact by using iprscan to identify unsupported models with identifiable protein domains -> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4286374/

Thanks,
Carson


On Jul 1, 2017, at 5:21 AM, Patrick Tran Van <Patrick.TranVan at unil.ch<mailto:Patrick.TranVan at unil.ch>> wrote:

So I have assembled my transcriptome with Trinity using the jaccard clip option and I have run maker with and without corrected_est_fusion.

I then used SNAP to train/filter it with:

maker2zff  specie.all.gff

Here are my results:

Number of gene after maker -> Number of gene after maker2zff

- Without corrected_est_fusion: 21621 -> 13875
- With corrected_est_fusion: 16850 -> 9098

1) If I understand correctly how corrected_est_fusion works, because it prevents gene merging, shouldn't it be the reverse?
Normally I should find more genes with corrected_est_fusion, right?

2) I think I should find something like 13000-14000 genes for my species. Should I go with the "Without corrected_est_fusion" set for the 2nd iteration of MAKER?

 Thanks for your help


Patrick Tran Van

Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206

________________________________
From: Carson Holt <carsonhh at gmail.com<mailto:carsonhh at gmail.com>>
Sent: Monday, June 26, 2017 11:38 PM
To: Patrick Tran Van
Cc: maker-devel at yandell-lab.org<mailto:maker-devel at yandell-lab.org>
Subject: Re: [maker-devel] Advice on my pipeline

Sorry the option is -> correct_est_fusion

It is in the maker_opts.ctl file.

I would use both SNAP and Augustus on a few large contigs then review the results manually. If one of them is not behaving well, then drop it. If both behave well (i.e. correlate well with evidence alignments) then keep them both.

--Carson


On Jun 26, 2017, at 3:48 AM, Patrick Tran Van <Patrick.TranVan at unil.ch<mailto:Patrick.TranVan at unil.ch>> wrote:

Thanks for your answer.

1) Do you think that adding Augustus training in addition to SNAP at steps 3 and 5 will add more confidence (instead of adding Augustus only for the final round)?
Because I am using autoAug for this and it took a while to compute.

2) I don't see this option : 'avoid_est_fusion=1' . I have tried to add it but I got this error:

WARNING: Invalid option 'avoid_est_fusion' in control file maker_opts.ctl

(I am using v 2.31.8 )


Patrick Tran Van

Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206

________________________________
From: Carson Holt <carsonhh at gmail.com<mailto:carsonhh at gmail.com>>
Sent: Monday, June 5, 2017 8:29 PM
To: Patrick Tran Van
Cc: maker-devel at yandell-lab.org<mailto:maker-devel at yandell-lab.org>
Subject: Re: [maker-devel] Advice on my pipeline

Your plan sounds good. A couple of related notes.

Insect genomes tend to have high gene density, so gene merging will be the primary difficulty. You can avoid merging of mRNA-seq evidence by using options like jaccard_clip in Trinity. Then use avoid_est_fusion=1 inside of MAKER.

Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can't be run in the same directory).

--Carson


On Jun 2, 2017, at 3:56 AM, Patrick Tran Van <Patrick.TranVan at unil.ch<mailto:Patrick.TranVan at unil.ch>> wrote:

Hello,

This is my first time running Maker for an insect genome annotation.

I have found various resources and tried to build a consensus. I am looking for your thoughts and advice about my pipeline, in case I can improve something or am doing anything useless:


What I have:
- RNA evidence: transcriptome
- Protein evidence: swissprot/uniprot + busco protein set of insect
- Cegma and busco results of my genome


1) Train SNAP with CEGMA

2) Run (run A) maker with repeat masking with transcript, protein, the new SNAP file (from step 1) and augustus file (from busco).

3) Create SNAP model from run A.

4) Run (run B ) with the new SNAP (done at step 3) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_A.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).

5) Create SNAP model from run B.

6) Run (run C) with the new SNAP (done at step 5) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_B.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).

7)  Create SNAP model from run C AND Create Augustus gene model from run C

8) Run (run D) with the new SNAP (done at step 7) + AUGUSTUS file (step 7) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_C.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1). + Use keep_preds=1


Does it seem coherent?

Cheers,

Patrick Tran Van

Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206


From carsonhh at gmail.com  Fri Sep 22 11:57:56 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 22 Sep 2017 11:57:56 -0600
Subject: [maker-devel] augustus underpredicting
In-Reply-To: <CAL0hg4FDD8JMXCauxuSjqsY0z=H92utgqMfmoH+s_Los6_c39g@mail.gmail.com>
References: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>
	<E0AFDA01-3142-4A14-A4F0-E621E5BD4C2F@gmail.com>
	<CAL0hg4HWHjmEJ680+c4XiajWaE5WKqfbVK4oeNYVDWPuz_Oh+Q@mail.gmail.com>
	<3B1C5B03-58F9-4427-815A-1D725C740A04@gmail.com>
	<CAL0hg4FDD8JMXCauxuSjqsY0z=H92utgqMfmoH+s_Los6_c39g@mail.gmail.com>
Message-ID: <06E8D6C3-B278-4820-B309-5CF61186FDCB@gmail.com>

I don't think you can use the protein2genome option to estimate gene count. It will turn any alignment that matches at least 50% into a gene model. So you can get a lot of partial models, which will inflate gene count. It's good enough for training but not so much for annotation.

--Carson


> On Sep 19, 2017, at 6:02 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com> wrote:
> 
> Thanks Carson.
> 
> Last quick question. After the first run (before using the gene predictors) I ran fasta_merge to get an idea of the numbers I should be looking for.
> In summary, I got 14000 genes, only using Swissprot and a close highly curated reference genome to avoid any "fake" protein or partial proteins from draft annotations, plus assembled RNA-seq from my genome. 
> How should I consider this as a guide? (if I can do so) ... Is this a number I should be aiming as a minimum number of genes? maximum? something around that?
> 
> PS my genome (fungus) is 80+ Mbp and just 150 contigs so I expect very few possible fragments due assembly (seq errors aside)
> 
> On 20 September 2017 at 07:34, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
> Gene predictors tend to over predict, so I would not take the high numbers given by SNAP and GeneMark as true counts. You will probably end up with something like 7-10k in the final results. But now Augustus is giving a higher count, you should be good to start running MAKER.
> 
> --Carson
> 
> 
> 
> 
>> On Sep 17, 2017, at 7:12 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com> wrote:
>> 
>> I did it that way and AUGUSTUS is predicting a more reasonable number of genes, about 12500 in Maker, but about 19000 in the model assessment step.
>> In comparison, SNAP gives 16000 and GeneMark 19000.
>> 
>> I haven't found any reference about but, would it be a good idea to train Augustus over the masked genome instead?
>> Thanks,
>> 
>> 
>> 
>> On 12 September 2017 at 02:50, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>> BUSCO may be generating too few models. BUSCO also identifies classes of conserved short genes that may not represent enough training diversity for your organism. Try running MAKER in protein2genome or est2genome mode, and then train with those results.
>> 
>> --Carson
>> 
>> 
>>> On Sep 10, 2017, at 7:03 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com> wrote:
>>> 
>>> Hi,
>>> I have been annotating a fungal genome as usual, using Busco-trained Augustus (in addition to GeneMark and SNAP), but for some reason, Augustus is predicting a mere 207 genes compared to 15-20k from the other two.
>>> I've never had this problem. The genome has an unusual repeat content close to 50%, not sure if that might suppose a problem.
>>> Has anybody come up with any similar issue?
>>> I also asked to Busco developers if they have any idea https://gitlab.com/ezlab/busco/issues/49 <https://gitlab.com/ezlab/busco/issues/49>
>>> Cheers,
>>> Xabi
>>> 
>>> -- 
>>> Xabier Vázquez-Campos, PhD
>>> Research Associate
>>> NSW Systems Biology Initiative
>>> School of Biotechnology and Biomolecular Sciences
>>> The University of New South Wales
>>> Sydney NSW 2052 AUSTRALIA
>> 
>> 
>> 
>> 
>> -- 
>> Xabier Vázquez-Campos, PhD
>> Research Associate
>> NSW Systems Biology Initiative
>> School of Biotechnology and Biomolecular Sciences
>> The University of New South Wales
>> Sydney NSW 2052 AUSTRALIA
> 
> 
> 
> 
> -- 
> Xabier Vázquez-Campos, PhD
> Research Associate
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA


From carsonhh at gmail.com  Fri Sep 22 13:47:36 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 22 Sep 2017 13:47:36 -0600
Subject: [maker-devel] Fwd: maker error
In-Reply-To: <CACOCz94NFJoo2GmohrCnvNGi3YDtK8TV5dX-0En7XMHRfLiSdA@mail.gmail.com>
References: <CACOCz95arkZmq+F_cVjdcCH1GZap6UYvSsTY3oLJZY1+gJsY1w@mail.gmail.com>
	<CACOCz94NFJoo2GmohrCnvNGi3YDtK8TV5dX-0En7XMHRfLiSdA@mail.gmail.com>
Message-ID: <5196E0C2-9FDC-4B6A-9D14-CA8514E002EF@gmail.com>

You have a couple of errors at the start indicating that you may have an issue with the perl forks module as well as the RepeatMasker installation. I'd recommend redoing both installations. Also, the screenshot you show is not the failure itself; it is MAKER giving up after failing 2 times. To capture the actual failure, set the try count to 3, then rerun and see what comes up in STDERR. Redirect STDERR to a file using '&>'.
Example:
maker &> err.log
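
A rough sketch of the capture step above, offered as an illustration rather than as part of MAKER. The -tries flag is the one listed in the 'maker -h' usage text; the show_failures helper and the ERROR/FAILED patterns are assumptions based on the kind of log lines MAKER prints. Note that '&>' is a bash shorthand; '> err.log 2>&1' is the portable form.

```shell
#!/bin/sh
# show_failures LOG: print lines mentioning ERROR or FAILED, with line numbers.
# (Hypothetical helper; the patterns are an assumption about MAKER's log text.)
show_failures() {
    grep -n -E 'ERROR|FAILED' "$1"
}

# Typical use, commented out so the sketch can be sourced without running MAKER:
#   maker -tries 3 > err.log 2>&1      # same effect as `maker &> err.log` in bash
#   show_failures err.log | head -n 20
```

The first matching lines are usually the interesting ones, since later messages tend to be MAKER reporting that it gave up on the contig.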

Thanks,
Carson


On Sep 19, 2017, at 10:56 PM, himani malhotra <himanimalhotra89 at gmail.com> wrote:

> 
> ---------- Forwarded message ----------
> From: himani malhotra <himanimalhotra89 at gmail.com>
> Date: Wed, Sep 20, 2017 at 10:24 AM
> Subject: maker error
> To: maker-devel-request at box290.bluehost.com
> 
> 
> hello
> I am using MAKER for gene prediction. I am getting an error in the Repbase installation. I am sending you the error as well; please help me. I have installed Repbase manually and unpacked its libraries into the RepeatMasker library, but I am still getting the error.
> Please help me.
> 
> 
> 
> Thanks 
> 
> Himani
> 
> <makererror.png>

From carsonhh at gmail.com  Fri Sep 22 13:59:17 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 22 Sep 2017 13:59:17 -0600
Subject: [maker-devel] Maker MPI across nodes
In-Reply-To: <DBXPR04MB413DD222F9446CDC493A376BD610@DBXPR04MB413.eurprd04.prod.outlook.com>
References: <DBXPR04MB413DD222F9446CDC493A376BD610@DBXPR04MB413.eurprd04.prod.outlook.com>
Message-ID: <BD2A6E4D-280B-4B38-AA1C-05C03503848C@gmail.com>

The '-mca btl ^openib' flag has the side effect of bypassing InfiniBand and using Ethernet. But if the alternate communicators are too slow, you can switch back to indirect InfiniBand by using '--mca btl vader,tcp,self --mca btl_tcp_if_include ib0'. That option forces IP over InfiniBand instead of direct InfiniBand. The OpenFabrics libraries used by InfiniBand have a known issue because of how they use registered memory (they generate seg faults whenever a program makes system calls, i.e. MAKER calling BLAST). So you can't use direct InfiniBand with MAKER. So try this instead -> '--mca btl vader,tcp,self --mca btl_tcp_if_include ib0'

Also, if it stays slow, it likely means you are hitting IO limits. If that is the case, make sure you are not setting TMP= to a network mounted disk location, and that whatever temp space exists on your cluster is real, locally mounted per-node disk and not network mounted disk.
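
One way to sanity-check the temp-space advice above before launching is sketched below. It is a minimal example under stated assumptions: GNU df (for the -T type column), Open MPI's mpiexec, an IPoIB interface that is actually named ib0 on your nodes, and /tmp as node-local scratch. The helper names and the list of network filesystem types are illustrative, not anything MAKER provides; -TMP is the option listed in 'maker -h', and -n 56 is the core count from the original question.

```shell
#!/bin/sh
# is_network_fs TYPE: succeed (exit 0) when TYPE names a network filesystem.
# The list of type names is an assumption; extend it for your site.
is_network_fs() {
    case "$1" in
        nfs|nfs4|lustre|gpfs|cifs|smb*|beegfs|afs|ceph|fuse.sshfs) return 0 ;;
        *) return 1 ;;
    esac
}

# fs_type_of DIR: print the filesystem type backing DIR (second df column).
fs_type_of() {
    df -PT "$1" | awk 'NR == 2 { print $2 }'
}

# Hypothetical node-local scratch path; replace with your cluster's.
LOCAL_TMP=${LOCAL_TMP:-/tmp}

if is_network_fs "$(fs_type_of "$LOCAL_TMP")"; then
    echo "WARNING: $LOCAL_TMP is network mounted; pick node-local disk" >&2
else
    # Print the command instead of running it, so it can be reviewed first.
    echo "mpiexec --mca btl vader,tcp,self --mca btl_tcp_if_include ib0 -n 56 maker -TMP $LOCAL_TMP"
fi
```

The check only looks at the node it runs on, so in a batch job it would need to run on each allocated node to be meaningful.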

--Carson


> On Sep 20, 2017, at 8:02 AM, James Cross (ITCS - Staff) <Jimmy.Cross at uea.ac.uk> wrote:
> 
> Hi Maker Developers,
>  
> We are trying to run Maker with OpenMPI on our HPC cluster across two nodes (each node containing 28 cores, so 56 cores in total). While Maker seems to be running correctly, it's going slower when split across two nodes (56 cores) as opposed to being run on a single node (28 cores). We are trying to increase the speed at which Maker completes its run. Do you know of any reason why Maker might slow down when split across two nodes?
>  
> Our cluster OS is: CentOS 6.7 and the HPC scheduler used is: LSF. We are running Open mpi on a Mellanox Infiniband network.
>  
> The genome data we wish to annotate is comprised of 1948 scaffolds with an average length of 324890bp (longest scaffold 6948830bp). 
>  
> The command in batch mode we are using is: mpiexec -mca btl ^openib -n 56 maker
>  
> Any help or advice you could give would be greatly appreciated. 
>  
> Best Wishes
> Jimmy
> ----------------------------------------------------------------------
> Mr  James Cross
> HPC Systems Developer
> University of East Anglia
> Norwich Research Park
> ITCS
> Norwich, Norfolk
> NR4 7TJ
>  
> Information Services
>  

From carson.holt at genetics.utah.edu  Fri Sep 22 14:04:10 2017
From: carson.holt at genetics.utah.edu (Carson Hinton Holt)
Date: Fri, 22 Sep 2017 20:04:10 +0000
Subject: [maker-devel] MAKER
In-Reply-To: <f9b68e0588dd4ce6a65dcd63cb4f7b4e@JKI-EXPF03.braunschweig.bba.intern>
References: <80e44920c4d64d6e82d18e5535656a32@JKI-EXPF03.braunschweig.bba.intern>
	<D2EB4BE8.3069C%myandell@genetics.utah.edu>
	<3BE2DF2C-9443-4314-9524-F0109AAE42E8@genetics.utah.edu>
	<ddad51a1aab5452b816be741d2969d81@JKI-EXPF03.braunschweig.bba.intern>
	<9867D99C-E4AD-4673-BB0B-804A1551F9F3@genetics.utah.edu>
	<f9b68e0588dd4ce6a65dcd63cb4f7b4e@JKI-EXPF03.braunschweig.bba.intern>
Message-ID: <B11ADE34-41A4-4EA7-A0AC-D4DFD649991D@genetics.utah.edu>

MAKER won't produce est2genome results for est_gff. This is partially because est2genome results are only used for training gene predictors. So you are essentially just getting protein2genome results from your runs. Once you get a gene predictor trained you will see a difference, as it will use the intron/exon structure of alignments as hints to improve gene predictor performance.
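
A minimal sketch of switching a run from est_gff= to est=, assuming (per the earlier reply quoted below) that an assembled transcript FASTA given to est= is what est2genome acts on. The set_ctl_opt helper and the file name transcripts.fasta are hypothetical; est=, est_gff= and est2genome= are the maker_opts.ctl keys already discussed in this thread. The helper assumes values contain no '|' or '&' characters.

```shell
#!/bin/sh
# set_ctl_opt KEY VALUE FILE: rewrite the "KEY=..." line of a MAKER control
# file in place, keeping any trailing "#comment" on that line.
set_ctl_opt() {
    key=$1
    value=$2
    file=$3
    tmp=$(mktemp) || return 1
    # Replace everything between "KEY=" and the first '#' with the new value.
    sed "s|^${key}=[^#]*|${key}=${value} |" "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}

# Only touch a control file if one exists in the current directory.
if [ -f maker_opts.ctl ]; then
    set_ctl_opt est transcripts.fasta maker_opts.ctl   # hypothetical FASTA name
    set_ctl_opt est_gff "" maker_opts.ctl              # drop the GFF3 input
    set_ctl_opt est2genome 1 maker_opts.ctl
fi
```

Editing the three lines by hand in a text editor does the same thing; the helper is only convenient when comparing several runs from separate subdirectories.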

--Carson


> On Sep 21, 2017, at 1:57 AM, Keilwagen, Jens <jens.keilwagen at julius-kuehn.de> wrote:
> 
> Hi Carson,
> 
> I have tried the proposed options for a small example (yeast).
> 
> I had 
> - proteins (fasta) from another yeast and 
> - transcript annotation (gff) from cufflinks and StringTie
> 
> I'd like to compare the maker results for 
> - proteins and StringTie
> Vs.
> - proteins and cufflinks
> 
> I used the default options, except:
> genome=<genome fasta>
> 
> protein=<protein fasta>
> est_gff=<transcript gff>
> 
> est2genome=1
> protein2genome=1
> 
> (An example is attached.)
> 
> Then I ran maker:
> 
> maker -RM_off -c 24
> find . -type f -name *.gff -exec cat {} + | grep maker > filtered-maker-prediction.gff
> 
> (The run seems to be okay. There were no FAILED, ... in the log. Cf. attachment)
> 
> Each maker run was started in a separate subdirectory.
> However, I realized that both maker runs yielded almost the same result (just one minor edit). This made me curious. 
> As far as I understood the files, I received the (filtered?) exonerate predictions for the proteins (from the other yeast). Is this correct? Why did I not receive any predictions (purely) based on the RNA-seq data? Did I do something wrong?
> 
> I'm looking forward to your reply.
> 
> Best regards, Jens
> 
> 
>> -----Original Message-----
>> From: Carson Hinton Holt [mailto:carson.holt at genetics.utah.edu]
>> Sent: Tuesday, 19 September 2017 23:37
>> To: Keilwagen, Jens
>> Subject: Re: MAKER
>> 
>> MAKER cannot use the BAM directly, but you can use something like
>> stringtie or trinity to assemble a transcript fasta that can be given
>> to the est= option.
>> 
>> Ab initio gene prediction is only enabled if you specify an hmm or
>> species file to use.  If all you want is homology based annotation, you
>> can try the est2genome and protein2genome options. Note the final
>> models may be partial if the alignments do not cover the gene end to
>> end.
>> 
>> --Carson
>> 
>> 
>> 
>>> On Sep 18, 2017, at 4:02 AM, Keilwagen, Jens <jens.keilwagen at julius-
>> kuehn.de> wrote:
>>> 
>>> Hi Carson,
>>> 
>>> thanks a lot for your last email that .
>>> 
>>> I was asked to do homology-based gene prediction using RNA-seq and
>> Maker was proposed as one option.
>>> Hence I'd like to ask how to do that in the best possible way.
>>> I have mapped RNA-seq data (SAM/BAM) and a fasta of proteins from a
>> related species. How can I integrate the RNA-seq data?
>>> 
>>> Is it possible to deactivate ab-initio gene prediction by Augustus or
>> SNAP?
>>> 
>>> Thanks a lot in advance.
>>> 
>>> Best regards, Jens
>>> 
>>>> -----Original Message-----
>>>> From: Carson Holt [mailto:carson.holt at genetics.utah.edu]
>>>> Sent: Thursday, 18 February 2016 19:03
>>>> To: Keilwagen, Jens
>>>> Cc: Mark Yandell
>>>> Subject: Re: MAKER
>>>> 
>>>> GeMoMa sounds like an interesting tool.  If it produces GFF3, you
>>>> could give the GFF3 results to the pred_gff= option in MAKER (comma
>>>> separated lists accepted). The GFF3 file of predictions must be in
>>>> the same coordinate space as the assembly being annotated (genome=
>> option).
>>>> Whatever you give to pred_gff will be treated as raw predictions
>> by
>>>> MAKER and will only be accepted as a final model if there are
>>>> evidence alignments (protein/EST) that support the model, and if
>>>> there are multiple alternate models at the same locus, only the
>> model
>>>> that is best supported by the protein/transcript evidence is kept.
>>>> 
>>>> You can also set the keep_preds=1 option when using pred_gff. This
>>>> will cause even raw predictions with no evidence support to be
>> maintained.
>>>> In the event of multiple models with no evidence support, the model
>>>> best matching the consensus of alternate models will be maintained.
>>>> 
>>>> Alternatively you can use the model_gff= options (comma separated
>>>> list
>>>> ok) to input the GFF3 file.  model_gff features are given higher
>>>> confidence than pred_gff. At least one model will always be kept
>>>> regardless of evidence support (same rules as pred_gff selection for
>>>> which model to keep when there are multiple). But model_gff will
>> also
>>>> affect how evidence clusters are determined compared to pred_gff
>>>> (model_gff features are allowed to merge bridging evidence
>> clusters).
>>>> MAKER will also go to extra lengths to pull forward existing names
>>>> and other data in the GFF3 for model_gff features.
>>>> 
>>>> If you do not have GFF3 files in the right coordinate space, but do
>>>> have protein fasta or transcript fasta for the GeMoMa predictions,
>>>> you can supply these to the protein= and est= options in
>> MAKER
>>>> together with est2genome=1 or protein2genome=1. This will cause
>> MAKER
>>>> to place the models using exonerate. You would probably also need to
>>>> add est_forward=1 to the control files to have MAKER try and derive
>>>> model names from the name of evidence alignments they were derived
>>>> from if you go this route.
>>>> 
>>>> You can also try treating the GFF3 predictions as hints to
>>>> traditional ab initio gene finders like SNAP or Augustus by giving
>>>> them to the est_gff= or protein_gff= options (i.e. make GeMoMa
>>>> predictions inform the behavior of predictors like SNAP and
>>>> Augustus). Might be interesting. You would have to alter results to
>>>> be match/match_part
>>>> GFF3 features to give them to the est_gff or protein_gff options.
>>>> 
>>>> Let me know if you have any more questions, and I'll do my best to
>>>> help.
>>>> 
>>>> Thanks,
>>>> Carson
>>>> 
>>>> 
>>>> 
>>>>> On Feb 18, 2016, at 10:22 AM, Mark Yandell
>>>> <myandell at genetics.utah.edu> wrote:
>>>>> 
>>>>> 
>>>>> Mark Yandell
>>>>> Professor of Human Genetics
>>>>> H.A. & Edna Benning Presidential Endowed Chair Co-director USTAR
>>>>> Center for Genetic Discovery Eccles Institute of Human Genetics
>>>>> University of Utah
>>>>> 15 North 2030 East, Room 2100
>>>>> Salt Lake City, UT 84112-5330
>>>>> ph:801-587-7707
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On 2/18/16, 8:34 AM, "Keilwagen, Jens" <jens.keilwagen at jki.bund.de>
>>>> wrote:
>>>>> 
>>>>>> Dear Prof. Yandell,
>>>>>> 
>>>>>> we have published a homology-based gene prediction program today:
>>>>>> https://nar.oxfordjournals.org/content/early/2016/02/17/nar.gkw092
>>>>>> and I'd like to ask how we can use MAKER to combine predictions of
>>>>>> GeMoMa using different reference organisms, i.e. we try to predict
>>>>>> the genes of an target organism (e.g. wheat) using the annotated
>>>>>> genes of other reference organisms (e.g. grasses). GeMoMa returns
>>>> for
>>>>>> each reference organism a GFF with the predicted gene models in
>> the
>>>> target organism.
>>>>>> 
>>>>>> It would be great if you or someone from your team could give us
>>>> some
>>>>>> hints or point us to correct paragraph in the documentation.
>>>>>> 
>>>>>> Thanks a lot and best regards, Jens
>>>>>> 
>>>>>> ---
>>>>>> 
>>>>>> Dr. Jens Keilwagen
>>>>>> 
>>>>>> Julius Kühn-Institut (JKI) - Federal Research Centre for
>> Cultivated
>>>>>> Plants
>>>>>> 	Institute for Biosafety in Plant Biotechnology
>>>>>> 
>>>>>> Erwin-Baur-Straße 27
>>>>>> 06484 Quedlinburg
>>>>>> Germany
>>>>>> 
>>>>>> Phone: ++49 (0)3946 47 510
>>>>>> EMail: jens.keilwagen at jki.bund.de
>>>>>> 
>>>>>> 
>>>>> 
>>> 
> 
> <maker_opts.ctl><slurm-278767.out>


From eennadi at gmail.com  Fri Sep 22 13:27:37 2017
From: eennadi at gmail.com (Emmanuel Nnadi)
Date: Fri, 22 Sep 2017 20:27:37 +0100
Subject: [maker-devel] Maker not installing
In-Reply-To: <CAOXM=CUPj9EGj_LNEhBpOw0oCiQKoVw8JuPv8-WASo2paSXkug@mail.gmail.com>
References: <CAOXM=CXp_YU=4kaNCFPVnoWKTzAP5we5XTFX6uFTWJSY_11fng@mail.gmail.com>
	<8576F599-08C0-455A-B0D7-8A1110C6F18F@mail.ufl.edu>
	<CAOXM=CWWENFHJUhYPawM5+F4571HYOyp3oWD1xju9H9LABq9VQ@mail.gmail.com>
	<113031DF-E733-4EEF-BDB7-405C15F0CA24@mail.ufl.edu>
	<CAOXM=CVrYeNV9i6TxYDn1B-dCA-OYh+VqsEmc-PsA2rP14UZ9A@mail.gmail.com>
	<546610AC-0B7B-4D02-BBF3-E847B95D7F0D@mail.ufl.edu>
	<C3F7BAD9-C134-4EC9-9792-F2C27A78FBFD@gmail.com>
	<CAOXM=CXeT-zmFubptPnWsAygnv8o+EQnCWD60iLf5eFFBkb6rg@mail.gmail.com>
	<8EAFE412-9EF7-4DB7-85A3-632BAC3372FD@gmail.com>
	<CAOXM=CVbJonkz=O5=N_NmB1L+524D6Uk9Fjh-ZBoJmu+9AsMFQ@mail.gmail.com>
	<C9D0377D-1C89-4910-A6BE-FA7DD3D8CDCB@gmail.com>
	<CAOXM=CX7yZTTbBALhjBzz0qtMxQGxBNVO0UEnkis3aDvVf=GLA@mail.gmail.com>
	<C06B6277-1CFA-4B8D-8E80-D08910EBD77C@gmail.com>
	<CAOXM=CXSBG+h1DYoOBrFHYr_JXZX6YCFyo415HRGHozMcmHicg@mail.gmail.com>
	<EC4577AD-A3C5-45CD-ADE7-5D19ED089833@gmail.com>
	<CAOXM=CVby_i3vfF_H46RTZWoqBLMCBGNqJoywxG_2KQJPc5Ttw@mail.gmail.com>
	<7440971C-8A18-4A07-91B6-AD16D17F0766@gmail.com>
	<CAOXM=CXUBcvbqZdT2nviYnFyC3X4Xbj3oREVt650n0c01HESyg@mail.gmail.com>
	<426E63C1-8C82-4809-B2A1-EF7A909E6712@gmail.com>
	<CAOXM=CUPj9EGj_LNEhBpOw0oCiQKoVw8JuPv8-WASo2paSXkug@mail.gmail.com>
Message-ID: <CAOXM=CUxE-MSWsjZwMYeZ-aOu9Jc9bP6k1uzDLrEyfyqwR-B_Q@mail.gmail.com>

Hello all,
Please how can I determine the following in maker:
1. The total number of chromosomes
2. The size of my genome


Thanks

Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:
https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Fri, Sep 1, 2017 at 10:52 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:

> Ok, thanks.
>
>
>
> On Sep 1, 2017 10:50 PM, "Carson Holt" <carsonhh at gmail.com> wrote:
>
>> It would need to be a new run. You won't be able to use the updated
>> contig names with the old run.
>>
>> --Carson
>>
>> Sent from my iPhone
>>
>> On Sep 1, 2017, at 3:41 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>>
>> Hi carson
>> Thanks for the tip
>> perl -ane 's/1_S7_R1_001_\(paired\)_trimmed_\(paired\)_//g; print'
>> genome.fasta
>>
>> It worked well however, when i ran it, it removed 1_S7_R1_001_\(paired\)_
>> trimmed_\(paired\)_,
>>
>> I have already run MAKER with the 1_S7_R1_001_\(paired\)_trimmed_\(paired\)_ prefix.
>>
>> 1. How can I effect the change when MAKER has produced some files from
>> the old sequence?
>>
>> I have spent more than 24 hours running maker and it has produced some
>> folders already.
>>
>> How can I make this change?
>>
>> Thanks
>>
>>
>>
>>
>>
>> On Fri, Sep 1, 2017 at 4:54 PM, Carson Holt <carsonhh at gmail.com> wrote:
>>
>>> BLAST which is used by MAKER can not handle really long contig names.
>>> MAKER tries to get around this by adding a secondary tag to the fasta
>>> header when long names are detected. Even then it would be better to change
>>> the IDs of your contigs to avoid downstream failures.
>>>
>>> I would recommend removing '1_S7_R1_001_(paired)_trimmed_(paired)_'
>>> from each contig name.
>>>
>>> Example command to do that ->
>>> perl -ane 's/1_S7_R1_001_\(paired\)_trimmed_\(paired\)_//g; print'
>>> genome.fasta
>>>
>>> --Carson
>>>
>>>
>>> On Aug 30, 2017, at 3:54 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>>>
>>> Hi Carson
>>> Thanks for your response its been helpful
>>>
>>> Please bear with me as I work through this
>>>
>>> 1. Please how do I generate EST for my novel sequences?
>>> 2. I am currently running maker without EST and protein sequences is it
>>> wrong? Can it predict properly?
>>> 3. One error in the contig just returned this value
>>> FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence
>>> identifier which is too long ( max id length = 50 )
>>>  at /usr/local/bin/RepeatMasker line 1464.
>>> FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence
>>> identifier which is too long ( max id length = 50 )
>>>  at /usr/local/bin/RepeatMasker line 1464.
>>> FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence
>>> identifier which is too long ( max id length = 50 )
>>>  at /usr/local/bin/RepeatMasker line 1464.
>>> ERROR: RepeatMasker failed
>>> --> rank=NA, hostname=emmannaemekas-MacBook-Pro.local
>>> ERROR: Failed while doing repeat masking
>>> ERROR: Chunk failed at level:0, tier_type:1
>>> FAILED CONTIG:1_S7_R1_001_(paired)_trimmed_(paired)_contig_2
>>>
>>> ERROR: Chunk failed at level:2, tier_type:0
>>> FAILED CONTIG:1_S7_R1_001_(paired)_trimmed_(paired)_contig_2
>>>
>>> examining contents of the fasta file and run log
>>>
>>>
>>>
>>> On Wed, Aug 30, 2017 at 4:12 PM, Carson Holt <carsonhh at gmail.com> wrote:
>>>
>>>> You can query valid species names using the queryTaxonomyDatabase.pl
>>>> script that comes with RepeatMasker. Try not to be too specific. In general
>>>> you should use the genus rather than the species for example (or even use
>>>> all of RepBase).
>>>>
>>>> Example ->
>>>> perl .../RepeatMasker/util/queryTaxonomyDatabase.pl -species "drosophila"
>>>>
>>>> --Carson
>>>>
>>>>
>>>>
>>>> On Aug 30, 2017, at 9:05 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>>>>
>>>> Hi Carson,
>>>>
>>>>  Thanks
>>>> I was able to start using maker.
>>>>
>>>> However, I am working with a novel plant genome. I had set the
>>>> repeat masking to Dcotrep, a name from the RepBase release, but MAKER
>>>> reported it as not known to RepeatMasker.
>>>>
>>>> How can I use specific known genomes for repeat masking
>>>> Thanks
>>>>
>>>>
>>>>
>>>>
>>>> On Aug 29, 2017 4:26 PM, "Carson Holt" <carsonhh at gmail.com> wrote:
>>>>
>>>>> MAKER will read the genome= options from the maker_opts.ctl file in
>>>>> your current directory or the maker_opts.ctl you specified on the command
>>>>> line. The error means you have left the value empty. Perhaps you did not
>>>>> save the changes you made or you did not specify the location of
>>>>> the maker_opts.ctl file to use.
>>>>>
>>>>> You can check the contents of the file using cat. Example ->
>>>>> cat maker_opts.ctl
>>>>>
>>>>> --Carson
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Aug 29, 2017, at 5:11 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>>>>>
>>>>> Hi Carson,
>>>>> Thanks a lot for yesterday. I was able to resolve the issue of running
>>>>> maker and i followed the commands in the tutorial.
>>>>> I however encountered another problem
>>>>>
>>>>> when I ran the command nano -c maker_opts.ctl
>>>>>
>>>>> It gave the following *1_S7_assembly.fa I specified the name of the
>>>>> genome but when I ran maker in another tab it gave *
>>>>>
>>>>> #-----Genome (these are always required)
>>>>> genome=*1_S7_assembly.fa* #genome sequence (fasta file or fasta
>>>>> embeded in GFF3 file)
>>>>> organism_type=eukaryotic #eukaryotic or prokaryotic. Default is
>>>>> eukaryotic
>>>>>
>>>>> #-----Re-annotation Using MAKER Derived GFF3
>>>>> maker_gff= #MAKER derived GFF3 file
>>>>> est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
>>>>> altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 =
>>>>> no
>>>>> protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
>>>>> rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
>>>>> model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
>>>>> pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
>>>>> other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no
>>>>>
>>>>> #-----EST Evidence (for best results provide a file for at least one)
>>>>> est= #set of ESTs or assembled mRNA-seq in fasta format
>>>>> altest= #EST/cDNA sequence file in fasta format from an alternate
>>>>> organism
>>>>> est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
>>>>> altest_gff= #aligned ESTs from a closly relate species in GFF3 format
>>>>>
>>>>> #-----Protein Homology Evidence (for best results provide a file for
>>>>> at least one)
>>>>> protein=  #protein sequence file in fasta format (i.e. from mutiple
>>>>> oransisms)
>>>>> protein_gff=  #aligned protein homology evidence from an external GFF3
>>>>> file
>>>>>
>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>> model_org=all #select a model organism for RepBase masking in
>>>>> RepeatMasker
>>>>> rmlib= #provide an organism specific repeat library in fasta format
>>>>> for RepeatMasker
>>>>> repeat_protein=/Users/emmannaemeka/Desktop/Gpm/maker/data/te_proteins.fasta
>>>>> #provide a fasta file of transposable element proteins for RepeatRunner
>>>>> rm_gff= #pre-identified repeat elements from an external GFF3 file
>>>>> prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change
>>>>> this), 1 = yes, 0 = no
>>>>> softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e.
>>>>> seg and dust filtering)
>>>>>
>>>>>
>>>>> *I ran maker command on another tab and it returned the following*
>>>>> STATUS: Parsing control files...
>>>>> ERROR: You have failed to provide a value for 'genome' in the control
>>>>> files.
>>>>>
>>>>> --> rank=NA, hostname=emmannamekasMBP
>>>>>
>>>>>
>>>>> Questions
>>>>> 1. Specifying the genome location, do I need to run maker on the same
>>>>> tab or open another bash tab?
>>>>> 2. My genome is novel and does not have proteins. How do I generate
>>>>> protein FASTA and ESTs for the de novo sequence?
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>> On Mon, Aug 28, 2017 at 6:47 PM, Carson Holt <carsonhh at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Here is a class on how to use MAKER taught a couple of years back ->
>>>>>> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014
>>>>>>
>>>>>> There is also a linked video as well as an amazon image of the class
>>>>>> material where you can run the image in the cloud and follow along.
>>>>>>
>>>>>> Thanks,
>>>>>> Carson
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Aug 28, 2017, at 11:43 AM, Emmanuel Nnadi <eennadi at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hi Carson,
>>>>>> Thanks a lot
>>>>>>
>>>>>> I ran this command maker -h it returned the following
>>>>>>
>>>>>> The last thing I wish to ask you: how can I load my genome file and
>>>>>> begin annotation?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> emmannamekasMBP:maker emmannaemeka$ maker -h
>>>>>>
>>>>>> MAKER version 2.31.9
>>>>>>
>>>>>> Usage:
>>>>>>
>>>>>>      maker [options] <maker_opts> <maker_bopts> <maker_exe>
>>>>>>
>>>>>>
>>>>>> Description:
>>>>>>
>>>>>>      MAKER is a program that produces gene annotations in GFF3 format
>>>>>> using
>>>>>>      evidence such as EST alignments and protein homology. MAKER can
>>>>>> be used to
>>>>>>      produce gene annotations for new genomes as well as update
>>>>>> annotations
>>>>>>      from existing genome databases.
>>>>>>
>>>>>>      The three input arguments are control files that specify how
>>>>>> MAKER should
>>>>>>      behave. All options for MAKER should be set in the control
>>>>>> files, but a
>>>>>>      few can also be set on the command line. Command line options
>>>>>> provide a
>>>>>>      convenient machanism to override commonly altered control file
>>>>>> values.
>>>>>>      MAKER will automatically search for the control files in the
>>>>>> current
>>>>>>      working directory if they are not specified on the command line.
>>>>>>
>>>>>>      Input files listed in the control options files must be in fasta
>>>>>> format
>>>>>>      unless otherwise specified. Please see MAKER documentation to
>>>>>> learn more
>>>>>>      about control file  configuration.  MAKER will automatically try
>>>>>> and
>>>>>>      locate the user control files in the current working directory
>>>>>> if these
>>>>>>      arguments are not supplied when initializing MAKER.
>>>>>>
>>>>>>      It is important to note that MAKER does not try and recalculated
>>>>>> data that
>>>>>>      it has already calculated.  For example, if you run an analysis
>>>>>> twice on
>>>>>>      the same dataset you will notice that MAKER does not rerun any
>>>>>> of the
>>>>>>      BLAST analyses, but instead uses the blast analyses stored from
>>>>>> the
>>>>>>      previous run. To force MAKER to rerun all analyses, use the -f
>>>>>> flag.
>>>>>>
>>>>>>      MAKER also supports parallelization via MPI on computer
>>>>>> clusters. Just
>>>>>>      launch MAKER via mpiexec (i.e. mpiexec -n 40 maker). MPI support
>>>>>> must be
>>>>>>      configured during the MAKER installation process for this to
>>>>>> work though
>>>>>>
>>>>>>
>>>>>> Options:
>>>>>>
>>>>>>      -genome|g <file>    Overrides the genome file path in the
>>>>>> control files
>>>>>>
>>>>>>      -RM_off|R           Turns all repeat masking options off.
>>>>>>
>>>>>>      -datastore/         Forcably turn on/off MAKER's two deep
>>>>>> directory
>>>>>>       nodatastore        structure for output.  Always on by default.
>>>>>>
>>>>>>      -old_struct         Use the old directory styles (MAKER 2.26 and
>>>>>> lower)
>>>>>>
>>>>>>      -base    <string>   Set the base name MAKER uses to save output
>>>>>> files.
>>>>>>                          MAKER uses the input genome file name by
>>>>>> default.
>>>>>>
>>>>>>      -tries|t <integer>  Run contigs up to the specified number of
>>>>>> tries.
>>>>>>
>>>>>>      -cpus|c  <integer>  Tells how many cpus to use for BLAST
>>>>>> analysis.
>>>>>>                          Note: this is for BLAST and not for MPI!
>>>>>>
>>>>>>      -force|f            Forces MAKER to delete old files before
>>>>>> running again.
>>>>>> This will require all blast analyses to be rerun.
>>>>>>
>>>>>>      -again|a            recaculate all annotations and output files
>>>>>> even if no
>>>>>> settings have changed. Does not delete old analyses.
>>>>>>
>>>>>>      -quiet|q            Regular quiet. Only a handlful of status
>>>>>> messages.
>>>>>>
>>>>>>      -qq                 Even more quiet. There are no status
>>>>>> messages.
>>>>>>
>>>>>>      -dsindex            Quickly generate datastore index file. Note
>>>>>> that this
>>>>>>                          will not check if run settings have changed
>>>>>> on contigs
>>>>>>
>>>>>>      -nolock             Turn off file locks. May be usful on some
>>>>>> file systems,
>>>>>>                          but can cause race conditions if running in
>>>>>> parallel.
>>>>>>
>>>>>>      -TMP                Specify temporary directory to use.
>>>>>>
>>>>>>      -CTL                Generate empty control files in the current
>>>>>> directory.
>>>>>>
>>>>>>      -OPTS               Generates just the maker_opts.ctl file.
>>>>>>
>>>>>>      -BOPTS              Generates just the maker_bopts.ctl file.
>>>>>>
>>>>>>      -EXE                Generates just the maker_exe.ctl file.
>>>>>>
>>>>>>      -MWAS    <option>   Easy way to control mwas_server for
>>>>>> web-based GUI
>>>>>>
>>>>>>                               options:  STOP
>>>>>>                                         START
>>>>>>                                         RESTART
>>>>>>
>>>>>>      -version            Prints the MAKER version.
>>>>>>
>>>>>>      -help|?             Prints this usage statement.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Aug 28, 2017 at 6:36 PM, Carson Holt <carsonhh at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Path needs to be a list of directories to search (you specified an
>>>>>>> executable location).
>>>>>>>
>>>>>>> So not this -> /Users/emmannaemeka/Desktop/Gpm/maker/bin/maker
>>>>>>>
>>>>>>> Instead it needs to be this -> /Users/emmannaemeka/Desktop/Gpm/maker/bin
>>>>>>>
>>>>>>> --Carson
>>>>>>>
>>>>>>>
>>>>>>> On Aug 28, 2017, at 11:32 AM, Emmanuel Nnadi <eennadi at gmail.com>
>>>>>>>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> I tried to export PATH
>>>>>>>
>>>>>>> running
>>>>>>> echo $PATH in the maker directory this returned
>>>>>>>
>>>>>>> /usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:/Users/emmannaemeka/Desktop/Gpm/maker/bin/maker
>>>>>>>
>>>>>>> 1. Does it mean that PATH has been exported?
>>>>>>>
>>>>>>>
>>>>>>> secondly,
>>>>>>>
>>>>>>> I tried to run
>>>>>>> the command maker -h, which maker, maker -CTL
>>>>>>>
>>>>>>> nothing returned.
>>>>>>>
>>>>>>> 2. how do i start up maker?
>>>>>>> 3. Do I need to be in maker directory to start maker?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Aug 28, 2017 at 4:49 PM, Carson Holt <carsonhh at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> After install the executables will be in the .../maker/bin directory.
>>>>>>>> Example (if you did the install in your home directory) -> ~/maker/bin/maker
>>>>>>>>
>>>>>>>> You need to add the .../maker/bin directory to your PATH for it to be
>>>>>>>> found just by typing 'maker'
>>>>>>>>
>>>>>>>> Explanation of the Linux PATH -> http://www.linfo.org/path_env_var.html
>>>>>>>>
>>>>>>>> --Carson
>>>>>>>>
>>>>>>>>
>>>>>>>> On Aug 28, 2017, at 8:07 AM, Ence,daniel <d.ence at ufl.edu> wrote:
>>>>>>>>
>>>>>>>> Sorry I should have typed 'maker -CTL'. If that doesn't work, what
>>>>>>>> is the result of 'which maker'?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Aug 28, 2017, at 10:00 AM, Emmanuel Nnadi <eennadi at gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Daniel
>>>>>>>> The reply is
>>>>>>>> emmannamekasMBP:maker emmannaemeka$ MAKER -ctl
>>>>>>>> -bash: MAKER: command not found
>>>>>>>>
>>>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>>>> Department of Microbiology,
>>>>>>>> Faculty of Natural and Applied Science,
>>>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
>>>>>>>>
>>>>>>>> On Mon, Aug 28, 2017 at 2:57 PM, Ence,daniel <d.ence at ufl.edu>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi, It looks like MAKER installed ok. What is the command that you
>>>>>>>>> used to try to run MAKER? Can you show the result of running "MAKER -ctl"?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Daniel Ence
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Aug 28, 2017, at 9:24 AM, Emmanuel Nnadi <eennadi at gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Ence,
>>>>>>>>> Thanks for your reply,
>>>>>>>>>
>>>>>>>>> This is the step and error received
>>>>>>>>>
>>>>>>>>> emmannamekasMBP:src emmannaemeka$ ./build install
>>>>>>>>> Installing MAKER...
>>>>>>>>> Building MAKER
>>>>>>>>> Skip /Users/emmannaemeka/desktop/Gpm/maker/src/../perl/config-darwin-thread-multi-2level-5.018002 (unchanged)
>>>>>>>>>
>>>>>>>>> The build status is
>>>>>>>>> =============================================================================
>>>>>>>>> STATUS MAKER v2.31.9
>>>>>>>>> ==============================================================================
>>>>>>>>> PERL Dependencies:  VERIFIED
>>>>>>>>> External Programs:  VERIFIED
>>>>>>>>> External C Libraries:   VERIFIED
>>>>>>>>> MPI SUPPORT:        DISABLED
>>>>>>>>> MWAS Web Interface: DISABLED
>>>>>>>>> MAKER PACKAGE:      CONFIGURATION OK
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>>>>> Department of Microbiology,
>>>>>>>>> Faculty of Natural and Applied Science,
>>>>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>>>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
>>>>>>>>>
>>>>>>>>> On Mon, Aug 28, 2017 at 2:00 PM, Ence,daniel <d.ence at ufl.edu>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Emmanuel, In order for anyone to help you, you need to post to
>>>>>>>>>> the mailing list the command and output (including errors) of the step that
>>>>>>>>>> didn't work.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Daniel Ence
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Aug 27, 2017, at 10:16 AM, Emmanuel Nnadi <eennadi at gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hello all,
>>>>>>>>>>
>>>>>>>>>> I downloaded Maker and tried to install it. I succeeded in
>>>>>>>>>> installing all prerequisites however running maker ./build install, it
>>>>>>>>>> showed that maker installed.
>>>>>>>>>>
>>>>>>>>>> However trying to run maker it wouldn't run.
>>>>>>>>>>
>>>>>>>>>> Please how do I install maker to run on local computer?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>>>>>> Department of Microbiology,
>>>>>>>>>> Faculty of Natural and Applied Science,
>>>>>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>>>>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> maker-devel mailing list
>>>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>

From carsonhh at gmail.com  Fri Sep 22 14:06:06 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 22 Sep 2017 14:06:06 -0600
Subject: [maker-devel] Maker not installing
In-Reply-To: <CAOXM=CUxE-MSWsjZwMYeZ-aOu9Jc9bP6k1uzDLrEyfyqwR-B_Q@mail.gmail.com>
References: <CAOXM=CXp_YU=4kaNCFPVnoWKTzAP5we5XTFX6uFTWJSY_11fng@mail.gmail.com>
	<8576F599-08C0-455A-B0D7-8A1110C6F18F@mail.ufl.edu>
	<CAOXM=CWWENFHJUhYPawM5+F4571HYOyp3oWD1xju9H9LABq9VQ@mail.gmail.com>
	<113031DF-E733-4EEF-BDB7-405C15F0CA24@mail.ufl.edu>
	<CAOXM=CVrYeNV9i6TxYDn1B-dCA-OYh+VqsEmc-PsA2rP14UZ9A@mail.gmail.com>
	<546610AC-0B7B-4D02-BBF3-E847B95D7F0D@mail.ufl.edu>
	<C3F7BAD9-C134-4EC9-9792-F2C27A78FBFD@gmail.com>
	<CAOXM=CXeT-zmFubptPnWsAygnv8o+EQnCWD60iLf5eFFBkb6rg@mail.gmail.com>
	<8EAFE412-9EF7-4DB7-85A3-632BAC3372FD@gmail.com>
	<CAOXM=CVbJonkz=O5=N_NmB1L+524D6Uk9Fjh-ZBoJmu+9AsMFQ@mail.gmail.com>
	<C9D0377D-1C89-4910-A6BE-FA7DD3D8CDCB@gmail.com>
	<CAOXM=CX7yZTTbBALhjBzz0qtMxQGxBNVO0UEnkis3aDvVf=GLA@mail.gmail.com>
	<C06B6277-1CFA-4B8D-8E80-D08910EBD77C@gmail.com>
	<CAOXM=CXSBG+h1DYoOBrFHYr_JXZX6YCFyo415HRGHozMcmHicg@mail.gmail.com>
	<EC4577AD-A3C5-45CD-ADE7-5D19ED089833@gmail.com>
	<CAOXM=CVby_i3vfF_H46RTZWoqBLMCBGNqJoywxG_2KQJPc5Ttw@mail.gmail.com>
	<7440971C-8A18-4A07-91B6-AD16D17F0766@gmail.com>
	<CAOXM=CXUBcvbqZdT2nviYnFyC3X4Xbj3oREVt650n0c01HESyg@mail.gmail.com>
	<426E63C1-8C82-4809-B2A1-EF7A909E6712@gmail.com>
	<CAOXM=CUPj9EGj_LNEhBpOw0oCiQKoVw8JuPv8-WASo2paSXkug@mail.gmail.com>
	<CAOXM=CUxE-MSWsjZwMYeZ-aOu9Jc9bP6k1uzDLrEyfyqwR-B_Q@mail.gmail.com>
Message-ID: <78A8137E-42B4-4766-9E4A-1B3C2F4FC578@gmail.com>

MAKER can't give you those details. All MAKER does is try to identify gene models in the assembly you provide.

--Carson
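
Both numbers can be read from the assembly FASTA itself rather than from MAKER. What follows is a minimal Python sketch, assuming an uncompressed FASTA file; the function name fasta_summary and the file name genome.fasta are illustrative and not part of MAKER. It counts the sequences in the file (contigs or scaffolds), which matches the chromosome count only for a chromosome-level assembly.

```python
def fasta_summary(path):
    """Return (number_of_sequences, total_bases) for an uncompressed FASTA file."""
    n_seqs = 0
    total_bases = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                n_seqs += 1  # each header line starts a new sequence
            else:
                total_bases += len(line)  # sequence lines add to the assembly size
    return n_seqs, total_bases


# Hypothetical usage (genome.fasta is a placeholder file name):
# count, size = fasta_summary("genome.fasta")
# print(count, size)
```

The total includes N characters, so a gapped assembly will report its padded length rather than the count of called bases.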

> On Sep 22, 2017, at 1:27 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
> 
> Hello all,
> Please how can I determine the following in maker:
> 1. The total number of chromosomes
> 2. The size of my genome
> 
> 
> Thanks
> 
> Nnadi Nnaemeka Emmanuel
> Department of Microbiology,
> Faculty of Natural and Applied Science,
> Plateau State University, Bokkos, Plateau State, Nigeria.
> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications <https://www.researchgate.net/profile/Emmanuel_Nnadi/publications>
> On Fri, Sep 1, 2017 at 10:52 PM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
> Ok, thanks. 
> Nnadi Nnaemeka Emmanuel
> Department of Microbiology,
> Faculty of Natural and Applied Science,
> Plateau State University, Bokkos, Plateau State, Nigeria.
> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications <https://www.researchgate.net/profile/Emmanuel_Nnadi/publications>
> 
>    
> 
> On Sep 1, 2017 10:50 PM, "Carson Holt" <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
> It would need to be a new run. You won't be able to use the updated contig names with the old run. 
> 
> --Carson
> 
> Sent from my iPhone
> 
> On Sep 1, 2017, at 3:41 PM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
> 
>> Hi carson
>> Thanks for the tip
>> perl -ane 's/1_S7_R1_001_\(paired\)_trimmed_\(paired\)_//g; print' genome.fasta
>> 
>> It worked well however, when i ran it, it removed 1_S7_R1_001_\(paired\)_trimmed_\(paired\)_, 
>> 
>> I have ran maker with 1_S7_R1_001_\(paired\)_trimmed_\(paired\)_, 
>> 
>> 1. How can I make the change when maker has already produced some files from the old sequence?
>> 
>> I have spent more than 24 hours running maker and it has produced some folders already.
>> 
>> How can I make this change?
>> 
>> Thanks
>> 
>> 
>> 
>> 
>> Nnadi Nnaemeka Emmanuel
>> Department of Microbiology,
>> Faculty of Natural and Applied Science,
>> Plateau State University, Bokkos, Plateau State, Nigeria.
>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications <https://www.researchgate.net/profile/Emmanuel_Nnadi/publications>
>> On Fri, Sep 1, 2017 at 4:54 PM, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>> BLAST which is used by MAKER can not handle really long contig names. MAKER tries to get around this by adding a secondary tag to the fasta header when long names are detected. Even then it would be better to change the IDs of your contigs to avoid downstream failures.
>> 
>> I would recommend removing '1_S7_R1_001_(paired)_trimmed_(paired)_' from each contig name.
>> 
>> Example command to do that -->
>> perl -ane 's/1_S7_R1_001_\(paired\)_trimmed_\(paired\)_//g; print' genome.fasta
>> 
>> --Carson
>> 
>> 
>>> On Aug 30, 2017, at 3:54 PM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
>>> 
>>> Hi Carson
>>> Thanks for your response its been helpful
>>> 
>>> Please bear with me as I work through this
>>> 
>>> 1. Please how do I generate EST for my novel sequences?
>>> 2. I am currently running maker without EST and protein sequences is it wrong? Can it predict properly?
>>> 3. One error in the contig just returned this value
>>> FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 )
>>>  at /usr/local/bin/RepeatMasker line 1464.
>>> FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 )
>>>  at /usr/local/bin/RepeatMasker line 1464.
>>> FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 )
>>>  at /usr/local/bin/RepeatMasker line 1464.
>>> ERROR: RepeatMasker failed
>>> --> rank=NA, hostname=emmannaemekas-MacBook-Pro.local
>>> ERROR: Failed while doing repeat masking
>>> ERROR: Chunk failed at level:0, tier_type:1
>>> FAILED CONTIG:1_S7_R1_001_(paired)_trimmed_(paired)_contig_2
>>> 
>>> ERROR: Chunk failed at level:2, tier_type:0
>>> FAILED CONTIG:1_S7_R1_001_(paired)_trimmed_(paired)_contig_2
>>> 
>>> examining contents of the fasta file and run log
>>> 
>>> 
>>> Nnadi Nnaemeka Emmanuel
>>> Department of Microbiology,
>>> Faculty of Natural and Applied Science,
>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications <https://www.researchgate.net/profile/Emmanuel_Nnadi/publications>
>>> On Wed, Aug 30, 2017 at 4:12 PM, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>>> You can query valid species names using the queryTaxonomyDatabase.pl script that comes with RepeatMasker. Try not to be too specific. In general you should use the genus rather than the species for example (or even use all of RepBase).
>>> 
>>> Example -->
>>> perl .../RepeatMasker/util/queryTaxonomyDatabase.pl -species "drosophila"
>>> 
>>> --Carson
>>> 
>>> 
>>> 
>>>> On Aug 30, 2017, at 9:05 AM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
>>>> 
>>>> Hi Carson,
>>>> 
>>>>  Thanks
>>>> I was able to start using maker.
>>>> 
>>>> However I am working with a plant Genome novel. I had set the repeatmasking to 
>>>> 1. Dcotrep a names from the repbase release but maker returned it back as not known to repeat masker
>>>> 
>>>> How can I use specific known genomes for repeat masking
>>>> Thanks
>>>> 
>>>> Nnadi Nnaemeka Emmanuel
>>>> Department of Microbiology,
>>>> Faculty of Natural and Applied Science,
>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications <https://www.researchgate.net/profile/Emmanuel_Nnadi/publications>
>>>> 
>>>>    
>>>> 
>>>> On Aug 29, 2017 4:26 PM, "Carson Holt" <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>>>> MAKER will read the genome= options from the maker_opts.ctl file in your current directory or the maker_opts.ctl you specified on the command line. The error means you have left the value empty. Perhaps you did not save the changes you made or you did not specify the location of the maker_opts.ctl file to use.
>>>> 
>>>> You can check the contents of the file using cat. Example --> cat maker_opts.ctl
>>>> 
>>>> --Carson
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On Aug 29, 2017, at 5:11 AM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
>>>>> 
>>>>> Hi Carson,
>>>>> Thanks a lot for yesterday. I was able to resolve the issue of running maker and i followed the commands in the tutorial.
>>>>> I however encountered another problem
>>>>> 
>>>>> when I ran the command nano -c maker_opts.ctl
>>>>> 
>>>>> It gave the following 1_S7_assembly.fa I specified the name of the genome but when I ran maker in another tab it gave 
>>>>> 
>>>>> #-----Genome (these are always required)
>>>>> genome=1_S7_assembly.fa #genome sequence (fasta file or fasta embeded in GFF3 file)
>>>>> organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic
>>>>> 
>>>>> #-----Re-annotation Using MAKER Derived GFF3
>>>>> maker_gff= #MAKER derived GFF3 file
>>>>> est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
>>>>> altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
>>>>> protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
>>>>> rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
>>>>> model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
>>>>> pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
>>>>> other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no
>>>>> 
>>>>> #-----EST Evidence (for best results provide a file for at least one)
>>>>> est= #set of ESTs or assembled mRNA-seq in fasta format
>>>>> altest= #EST/cDNA sequence file in fasta format from an alternate organism
>>>>> est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
>>>>> altest_gff= #aligned ESTs from a closly relate species in GFF3 format
>>>>> 
>>>>> #-----Protein Homology Evidence (for best results provide a file for at least one)
>>>>> protein=  #protein sequence file in fasta format (i.e. from mutiple oransisms)
>>>>> protein_gff=  #aligned protein homology evidence from an external GFF3 file
>>>>> 
>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>> model_org=all #select a model organism for RepBase masking in RepeatMasker
>>>>> rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
>>>>> repeat_protein=/Users/emmannaemeka/Desktop/Gpm/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner
>>>>> rm_gff= #pre-identified repeat elements from an external GFF3 file
>>>>> prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
>>>>> softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
>>>>> 
>>>>> 
>>>>> I ran maker command on another tab and it returned the following
>>>>> STATUS: Parsing control files...
>>>>> ERROR: You have failed to provide a value for 'genome' in the control files.
>>>>> 
>>>>> --> rank=NA, hostname=emmannamekasMBP
>>>>> 
>>>>> 
>>>>> Questions
>>>>> 1. Specifying the genome location, do I need to run maker on the same tab or open another bash tab?
>>>>> 2. My genome is novel and do not have proteins, how do I generate protein fast for the de novo sequence and EST?
>>>>> 
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> Nnadi Nnaemeka Emmanuel
>>>>> Department of Microbiology,
>>>>> Faculty of Natural and Applied Science,
>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications <https://www.researchgate.net/profile/Emmanuel_Nnadi/publications>
>>>>> On Mon, Aug 28, 2017 at 6:47 PM, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>>>>> Here is a class on how to use MAKER taught a couple of years back --> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014
>>>>> 
>>>>> There is also a linked video as well as an amazon image of the class material where you can run the image in the cloud and follow along.
>>>>> 
>>>>> Thanks,
>>>>> Carson
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Aug 28, 2017, at 11:43 AM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
>>>>>> 
>>>>>> Hi Carson,
>>>>>> Thanks a lot 
>>>>>> 
>>>>>> I ran this command maker -h it returned the following
>>>>>> 
>>>>>> The last thing I wish to ask you, how can I load my genome file and begin annotation?
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>> emmannamekasMBP:maker emmannaemeka$ maker -h
>>>>>> 
>>>>>> MAKER version 2.31.9
>>>>>> 
>>>>>> Usage:
>>>>>> 
>>>>>>      maker [options] <maker_opts> <maker_bopts> <maker_exe>
>>>>>> 
>>>>>> 
>>>>>> Description:
>>>>>> 
>>>>>>      MAKER is a program that produces gene annotations in GFF3 format using
>>>>>>      evidence such as EST alignments and protein homology. MAKER can be used to
>>>>>>      produce gene annotations for new genomes as well as update annotations
>>>>>>      from existing genome databases.
>>>>>> 
>>>>>>      The three input arguments are control files that specify how MAKER should
>>>>>>      behave. All options for MAKER should be set in the control files, but a
>>>>>>      few can also be set on the command line. Command line options provide a
>>>>>>      convenient machanism to override commonly altered control file values.
>>>>>>      MAKER will automatically search for the control files in the current
>>>>>>      working directory if they are not specified on the command line.
>>>>>> 
>>>>>>      Input files listed in the control options files must be in fasta format
>>>>>>      unless otherwise specified. Please see MAKER documentation to learn more
>>>>>>      about control file  configuration.  MAKER will automatically try and
>>>>>>      locate the user control files in the current working directory if these
>>>>>>      arguments are not supplied when initializing MAKER.
>>>>>> 
>>>>>>      It is important to note that MAKER does not try and recalculated data that
>>>>>>      it has already calculated.  For example, if you run an analysis twice on
>>>>>>      the same dataset you will notice that MAKER does not rerun any of the
>>>>>>      BLAST analyses, but instead uses the blast analyses stored from the
>>>>>>      previous run. To force MAKER to rerun all analyses, use the -f flag.
>>>>>> 
>>>>>>      MAKER also supports parallelization via MPI on computer clusters. Just
>>>>>>      launch MAKER via mpiexec (i.e. mpiexec -n 40 maker). MPI support must be
>>>>>>      configured during the MAKER installation process for this to work though
>>>>>>      
>>>>>> 
>>>>>> Options:
>>>>>> 
>>>>>>      -genome|g <file>    Overrides the genome file path in the control files
>>>>>> 
>>>>>>      -RM_off|R           Turns all repeat masking options off.
>>>>>> 
>>>>>>      -datastore/         Forcably turn on/off MAKER's two deep directory
>>>>>>       nodatastore        structure for output.  Always on by default.
>>>>>> 
>>>>>>      -old_struct         Use the old directory styles (MAKER 2.26 and lower)
>>>>>> 
>>>>>>      -base    <string>   Set the base name MAKER uses to save output files.
>>>>>>                          MAKER uses the input genome file name by default.
>>>>>> 
>>>>>>      -tries|t <integer>  Run contigs up to the specified number of tries.
>>>>>> 
>>>>>>      -cpus|c  <integer>  Tells how many cpus to use for BLAST analysis.
>>>>>>                          Note: this is for BLAST and not for MPI!
>>>>>> 
>>>>>>      -force|f            Forces MAKER to delete old files before running again.
>>>>>> 			 This will require all blast analyses to be rerun.
>>>>>> 
>>>>>>      -again|a            recaculate all annotations and output files even if no
>>>>>> 			 settings have changed. Does not delete old analyses.
>>>>>> 
>>>>>>      -quiet|q            Regular quiet. Only a handlful of status messages.
>>>>>> 
>>>>>>      -qq                 Even more quiet. There are no status messages.
>>>>>> 
>>>>>>      -dsindex            Quickly generate datastore index file. Note that this
>>>>>>                          will not check if run settings have changed on contigs
>>>>>> 
>>>>>>      -nolock             Turn off file locks. May be usful on some file systems,
>>>>>>                          but can cause race conditions if running in parallel.
>>>>>> 
>>>>>>      -TMP                Specify temporary directory to use.
>>>>>> 
>>>>>>      -CTL                Generate empty control files in the current directory.
>>>>>> 
>>>>>>      -OPTS               Generates just the maker_opts.ctl file.
>>>>>> 
>>>>>>      -BOPTS              Generates just the maker_bopts.ctl file.
>>>>>> 
>>>>>>      -EXE                Generates just the maker_exe.ctl file.
>>>>>> 
>>>>>>      -MWAS    <option>   Easy way to control mwas_server for web-based GUI
>>>>>> 
>>>>>>                               options:  STOP
>>>>>>                                         START
>>>>>>                                         RESTART
>>>>>> 
>>>>>>      -version            Prints the MAKER version.
>>>>>> 
>>>>>>      -help|?             Prints this usage statement.
>>>>>> 
>>>>>> 
>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>> Department of Microbiology,
>>>>>> Faculty of Natural and Applied Science,
>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications <https://www.researchgate.net/profile/Emmanuel_Nnadi/publications>
>>>>>> On Mon, Aug 28, 2017 at 6:36 PM, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>>>>>> Path needs to be a list of directories to search (you specified an executable location).
>>>>>> 
>>>>>> So not this --> /Users/emmannaemeka/Desktop/Gpm/maker/bin/maker
>>>>>> 
>>>>>> Instead it needs to be this --> /Users/emmannaemeka/Desktop/Gpm/maker/bin
>>>>>> 
>>>>>> --Carson
>>>>>> 
>>>>>> 
>>>>>>> On Aug 28, 2017, at 11:32 AM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>>
>>>>>>> wrote:
>>>>>>> 
>>>>>>> Thanks 
>>>>>>> 
>>>>>>> I tried to export PATH
>>>>>>> 
>>>>>>> running 
>>>>>>> echo $PATH in the maker directory this returned
>>>>>>> 
>>>>>>> /usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:/Users/emmannaemeka/Desktop/Gpm/maker/bin/maker
>>>>>>> 
>>>>>>> 1. Does it mean that PATH has been exported?
>>>>>>> 
>>>>>>> 
>>>>>>> secondly,
>>>>>>> 
>>>>>>> I tried to run
>>>>>>> the command maker -h, which maker, maker -CTL
>>>>>>> 
>>>>>>> nothing returned.
>>>>>>> 
>>>>>>> 2. how do i start up maker?
>>>>>>> 3. Do I need to be in maker directory to start maker?
>>>>>>> 
>>>>>>> Thanks
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>>> Department of Microbiology,
>>>>>>> Faculty of Natural and Applied Science,
>>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications <https://www.researchgate.net/profile/Emmanuel_Nnadi/publications>
>>>>>>> On Mon, Aug 28, 2017 at 4:49 PM, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>>>>>>> After install the executables will be in the .../maker/bin directory. Example (if you did the install in your home directory) --> ~/maker/bin/maker
>>>>>>> 
>>>>>>> You need to add the .../maker/bin directory to your PATH for it to be found just by typing 'maker'
>>>>>>> 
>>>>>>> Explanation of the Linux PATH --> http://www.linfo.org/path_env_var.html
>>>>>>> 
>>>>>>> --Carson
>>>>>>> 
>>>>>>> 
>>>>>>>> On Aug 28, 2017, at 8:07 AM, Ence,daniel <d.ence at ufl.edu <mailto:d.ence at ufl.edu>> wrote:
>>>>>>>> 
>>>>>>>> Sorry I should have typed "maker -CTL". If that doesn't work, what is the result of "which maker"?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Aug 28, 2017, at 10:00 AM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Daniel
>>>>>>>>> The reply is 
>>>>>>>>> emmannamekasMBP:maker emmannaemeka$ MAKER -ctl
>>>>>>>>> -bash: MAKER: command not found
>>>>>>>>> 
>>>>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>>>>> Department of Microbiology,
>>>>>>>>> Faculty of Natural and Applied Science,
>>>>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>>>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications <https://www.researchgate.net/profile/Emmanuel_Nnadi/publications>
>>>>>>>>> On Mon, Aug 28, 2017 at 2:57 PM, Ence,daniel <d.ence at ufl.edu <mailto:d.ence at ufl.edu>> wrote:
>>>>>>>>> Hi, It looks like MAKER installed ok. What is the command that you used to try to run MAKER? Can you show the result of running "MAKER -ctl"?
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Daniel Ence
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Aug 28, 2017, at 9:24 AM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi Ence,
>>>>>>>>>> Thanks for your reply,
>>>>>>>>>> 
>>>>>>>>>> This is the step and error received
>>>>>>>>>> emmannamekasMBP:src emmannaemeka$ ./build install
>>>>>>>>>> Installing MAKER...
>>>>>>>>>> Building MAKER
>>>>>>>>>> Skip /Users/emmannaemeka/desktop/Gpm/maker/src/../perl/config-darwin-thread-multi-2level-5.018002 (unchanged)
>>>>>>>>>> 
>>>>>>>>>> The build status is
>>>>>>>>>> 
>>>>>>>>>> =============================================================================
>>>>>>>>>> STATUS MAKER v2.31.9
>>>>>>>>>> ==============================================================================
>>>>>>>>>> PERL Dependencies:  VERIFIED
>>>>>>>>>> External Programs:  VERIFIED
>>>>>>>>>> External C Libraries:   VERIFIED
>>>>>>>>>> MPI SUPPORT:        DISABLED
>>>>>>>>>> MWAS Web Interface: DISABLED
>>>>>>>>>> MAKER PACKAGE:      CONFIGURATION OK
>>>>>>>>>> 
>>>>>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>>>>>> Department of Microbiology,
>>>>>>>>>> Faculty of Natural and Applied Science,
>>>>>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>>>>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications <https://www.researchgate.net/profile/Emmanuel_Nnadi/publications>
>>>>>>>>>> On Mon, Aug 28, 2017 at 2:00 PM, Ence,daniel <d.ence at ufl.edu <mailto:d.ence at ufl.edu>> wrote:
>>>>>>>>>> Hi Emmanuel, In order for anyone to help you, you need to post to the mailing list the command and output (including errors) of the step that didn't work.
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Daniel Ence
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Aug 27, 2017, at 10:16 AM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hello all,
>>>>>>>>>>> 
>>>>>>>>>>> I downloaded Maker and tried to install it. I succeeded in installing all prerequisites however running maker ./build install, it showed that maker installed.
>>>>>>>>>>> 
>>>>>>>>>>> However trying to run maker it wouldn't run.
>>>>>>>>>>> 
>>>>>>>>>>> Please how do I install maker to run on local computer?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks
>>>>>>>>>>> 
>>>>>>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>>>>>>> Department of Microbiology,
>>>>>>>>>>> Faculty of Natural and Applied Science,
>>>>>>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>>>>>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications <https://www.researchgate.net/profile/Emmanuel_Nnadi/publications>
>>>>>>>>>>> 
>>>>>>>>>>>    
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 


From carsonhh at gmail.com  Fri Sep 22 14:08:36 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 22 Sep 2017 14:08:36 -0600
Subject: [maker-devel] Advice on my pipeline
In-Reply-To: <1505986013492.52354@unil.ch>
References: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch>
	<A39F4213-70A8-4E07-AB13-00C427E4F244@gmail.com>
	<1498470630221.84642@unil.ch>
	<696C51C6-5606-4ECB-A8B8-9C077182FFFA@gmail.com>
	<1498908228256.16549@unil.ch>
	<58E904BF-9AB8-4AC7-B10B-C902F414E03D@gmail.com>
	<1505986013492.52354@unil.ch>
Message-ID: <651D4267-0FD7-4A92-B778-8976B47353BB@gmail.com>

The GFF3 passthrough options are there to help users get old data into MAKER when they have lost access to the original files. But for iterative running of the pipeline, it is more effective just to rerun in place so MAKER can access the raw alignment reports. The raw alignment reports have more detail than what is stored in the GFF3, and that detail is lost when the GFF3 is used as input.

--Carson
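
When rerunning in place, the passthrough section of maker_opts.ctl can therefore stay at its defaults. One way to confirm that before a second round is sketched below in Python, assuming the "key=value #comment" layout shown in the quoted control file; parse_ctl and passthrough_in_use are hypothetical helper names, not part of MAKER, and values that themselves contain a "#" are not handled.

```python
def parse_ctl(text):
    """Parse MAKER-style control file text into a dict of option -> value."""
    opts = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing or whole-line comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        opts[key.strip()] = value.strip()
    return opts


def passthrough_in_use(opts):
    """Return the names of GFF3 passthrough options that are switched on."""
    used = []
    if opts.get("maker_gff"):
        used.append("maker_gff")  # a previous MAKER GFF3 file was supplied
    for key, value in opts.items():
        if key.endswith("_pass") and value == "1":
            used.append(key)
    return used


# Hypothetical usage (maker_opts.ctl in the current directory is an assumption):
# with open("maker_opts.ctl") as handle:
#     print(passthrough_in_use(parse_ctl(handle.read())))
```

An empty list means no passthrough option is set, which is the expected state when MAKER is rerun in the same directory. The same parse also shows whether genome= was left empty.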


> On Sep 21, 2017, at 3:26 AM, Patrick Tran Van <Patrick.TranVan at unil.ch> wrote:
> 
> Hi Carson,
> 
> I have a doubt for the round 2, so in a previous reply you said:
> 
> " Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can't be run in the same directory). "
>  
> Does it mean that I don't need to modify the section :
> 
> #-----Re-annotation Using MAKER Derived GFF3
> 
> ?
> 
> If I let everything by default such as :
> 
> altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
> protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
> rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no 
> 
> 
> It will not look again for repeat and protein + transcriptome alignment ?
> 
> Patrick Tran Van
> 
> Groups Chapuisat, Robinson-Rechavi & Schwander
> Department of Ecology and Evolution
> University of Lausanne
> Le Biophore
> CH-1015 Lausanne
> Switzerland
> Office 3206
> 
> From: Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>>
> Sent: Monday, July 3, 2017 10:50 PM
> To: Patrick Tran Van
> Cc: maker-devel at yandell-lab.org <mailto:maker-devel at yandell-lab.org>
> Subject: Re: [maker-devel] Advice on my pipeline
>  
> maker2zff is just for SNAP training and not for gene filtering (please do not use it for filtering, it does not do what you think).
> 
> So the final annotation set after maker with correct_est_fusion is 16,850. To decide which set is better, look at them in a browser (gene counts are not useful for gauging results). A well annotated genome will have evidence clusters that closely match the final models. A poorly annotated genome will have evidence clusters that are split or merged by the models.
> 
> The correct_est_fusion option does two things. It trims long overlapping UTR fragments, and it stops evidence clusters from being merged on BLASTP evidence alone (so gene predictors will get unmerged hint regions if clusters are split).
> 
> You may also find that using jaccard_clip with Trinity has reduced sensitivity for the transcript data (you may lose things that were there before, but now have better specificity, i.e. fewer false positives). Make sure you provided protein data from at least two related species to help maintain sensitivity lost from the transcript data. You can also add rejected gene models back in after the fact by using iprscan to identify unsupported models with identifiable protein domains --> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4286374/
> 
> Thanks,
> Carson
> 
> 
> 
> 
>> On Jul 1, 2017, at 5:21 AM, Patrick Tran Van <Patrick.TranVan at unil.ch <mailto:Patrick.TranVan at unil.ch>> wrote:
>> 
>> So I have assembled my transcriptome with Trinity using the jaccard clip option and I have run maker with and without corrected_est_fusion.
>> 
>> I then used SNAP to train/filter it with:
>> 
>> maker2zff  specie.all.gff
>> 
>> Here are my results:
>> 
>> Number of gene after maker -> Number of gene after maker2zff
>> 
>> - Without corrected_est_fusion: 21621 -> 13875
>> - With corrected_est_fusion: 16850 -> 9098
>> 
>> 1) If I understand correctly how corrected_est_fusion works, it prevents gene merging, so shouldn't the result be the opposite?
>> Normally I should find more genes with corrected_est_fusion, right?
>> 
>> 2) I think I should find something like 13000-14000 genes for my species. Should I go with the "Without corrected_est_fusion" set for the 2nd iteration of maker?
>> 
>>  Thanks for your help 
>> 
>> 
>> 
>> Patrick Tran Van
>> 
>> Groups Chapuisat, Robinson-Rechavi & Schwander
>> Department of Ecology and Evolution
>> University of Lausanne
>> Le Biophore
>> CH-1015 Lausanne
>> Switzerland
>> Office 3206
>> 
>> From: Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>>
>> Sent: Monday, June 26, 2017 11:38 PM
>> To: Patrick Tran Van
>> Cc: maker-devel at yandell-lab.org <mailto:maker-devel at yandell-lab.org>
>> Subject: Re: [maker-devel] Advice on my pipeline
>>  
>> Sorry, the option is -> correct_est_fusion
>> 
>> It is in the maker_opts.ctl file.
>> 
>> I would use both SNAP and Augustus on a few large contigs, then review the results manually. If one of them is not behaving well, then drop it. If both behave well (i.e. correlate well with evidence alignments), then keep them both.
>> 
>> --Carson
>> 
>> 
>> 
>>> On Jun 26, 2017, at 3:48 AM, Patrick Tran Van <Patrick.TranVan at unil.ch <mailto:Patrick.TranVan at unil.ch>> wrote:
>>> 
>>> Thanks for your answer.
>>> 
>>> 1) Do you think that adding Augustus training in addition to SNAP at steps 3 and 5 will add more confidence (instead of adding Augustus only for the final round)?
>>> I ask because I am using autoAug for this and it takes a while to compute.
>>> 
>>> 2) I don't see this option : 'avoid_est_fusion=1' . I have tried to add it but I got this error:
>>> 
>>> WARNING: Invalid option 'avoid_est_fusion' in control file maker_opts.ctl
>>> 
>>> (I am using v 2.31.8 )
>>> 
>>> 
>>> Patrick Tran Van
>>> 
>>> Groups Chapuisat, Robinson-Rechavi & Schwander
>>> Department of Ecology and Evolution
>>> University of Lausanne
>>> Le Biophore
>>> CH-1015 Lausanne
>>> Switzerland
>>> Office 3206
>>> 
>>> From: Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>>
>>> Sent: Monday, June 5, 2017 8:29 PM
>>> To: Patrick Tran Van
>>> Cc: maker-devel at yandell-lab.org <mailto:maker-devel at yandell-lab.org>
>>> Subject: Re: [maker-devel] Advice on my pipeline
>>>  
>>> Your plan sounds good. A couple of related notes.
>>> 
>>> Insect genomes tend to have high gene density, so gene merging will be the primary difficulty. You can avoid merging of mRNA-seq evidence by using options like jaccard_clip in Trinity. Then use avoid_est_fusion=1 inside of MAKER.
>>> 
>>> Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can't be run in the same directory).
>>> 
>>> --Carson
>>> 
>>> 
>>>> On Jun 2, 2017, at 3:56 AM, Patrick Tran Van <Patrick.TranVan at unil.ch <mailto:Patrick.TranVan at unil.ch>> wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> This is my first time running Maker for an insect genome annotation. 
>>>> 
>>>> I have found various resources and tried to build a consensus from them. I am looking for your thoughts and advice on my pipeline: whether I can improve something, or whether I am doing useless things.
>>>> 
>>>> 
>>>> What I have:
>>>> - RNA evidence: transcriptome
>>>> - Protein evidence: swissprot/uniprot + busco protein set for insects
>>>> - Cegma and busco results of my genome
>>>> 
>>>> 
>>>> 1) Train SNAP with CEGMA
>>>> 
>>>> 2) Run (run A) maker with repeat masking with transcript, protein, the new SNAP file (from step 1) and augustus file (from busco).
>>>> 
>>>> 3) Create SNAP model from run A.
>>>> 
>>>> 4) Run (run B ) with the new SNAP (done at step 3) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_A.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).
>>>> 
>>>> 5) Create SNAP model from run B.
>>>> 
>>>> 6) Run (run C) with the new SNAP (done at step 5) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_B.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).
>>>> 
>>>> 7)  Create SNAP model from run C AND Create Augustus gene model from run C
>>>> 
>>>> 8) Run (run D) with the new SNAP (done at step 7) + AUGUSTUS file (step 7) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_C.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1). + Use keep_preds=1
>>>> 
>>>> 
>>>> 
>>>> Does it seem coherent?
>>>> 
>>>> Cheers,
>>>> 
>>>> Patrick Tran Van
>>>> 
>>>> Groups Chapuisat, Robinson-Rechavi & Schwander
>>>> Department of Ecology and Evolution
>>>> University of Lausanne
>>>> Le Biophore
>>>> CH-1015 Lausanne
>>>> Switzerland
>>>> Office 3206
>>>> 
>>>> _______________________________________________
>>>> maker-devel mailing list
>>>> maker-devel at box290.bluehost.com <mailto:maker-devel at box290.bluehost.com>
>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org <http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org>
>>> 
>>> 
>> 
>> 
> 
> 


From carson.holt at genetics.utah.edu  Fri Sep 22 14:19:22 2017
From: carson.holt at genetics.utah.edu (Carson Hinton Holt)
Date: Fri, 22 Sep 2017 20:19:22 +0000
Subject: [maker-devel] MAKER
In-Reply-To: <1f033c3dcf414407a888bbef8201d469@JKI-EXPF03.braunschweig.bba.intern>
References: <80e44920c4d64d6e82d18e5535656a32@JKI-EXPF03.braunschweig.bba.intern>
	<D2EB4BE8.3069C%myandell@genetics.utah.edu>
	<3BE2DF2C-9443-4314-9524-F0109AAE42E8@genetics.utah.edu>
	<ddad51a1aab5452b816be741d2969d81@JKI-EXPF03.braunschweig.bba.intern>
	<9867D99C-E4AD-4673-BB0B-804A1551F9F3@genetics.utah.edu>
	<f9b68e0588dd4ce6a65dcd63cb4f7b4e@JKI-EXPF03.braunschweig.bba.intern>
	<B11ADE34-41A4-4EA7-A0AC-D4DFD649991D@genetics.utah.edu>
	<1f033c3dcf414407a888bbef8201d469@JKI-EXPF03.braunschweig.bba.intern>
Message-ID: <ADB216BF-2828-4906-A32F-58CC3989102F@genetics.utah.edu>

All est2genome and protein2genome do is take exonerate alignments of the fasta inputs and translate the longest ORF to get a rough base model that can be used to train a gene predictor. That is why the documentation says they should be turned off once the predictor is trained.

Once you get the gene predictor trained, MAKER will feed hints to the gene predictor derived from alignments and input GFF3. These hints greatly improve the performance of the gene predictors. MAKER will also use the alignments to filter out predictions that do not match the evidence alignments.
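
The two stages described above can be sketched as a small helper that flips the relevant switches in maker_opts.ctl. This is only an illustration, not part of MAKER: it assumes the control file uses the usual `key=value #comment` layout, and the helper name `set_ctl_options` and the file name `my_species.hmm` are hypothetical.

```python
import re

def set_ctl_options(ctl_text, updates):
    """Return ctl_text with each `key=value` line named in `updates` rewritten.

    A trailing `#comment` on the line is preserved. Keys that do not
    already appear in the text are ignored rather than appended.
    """
    out = []
    for line in ctl_text.splitlines():
        m = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if m and m.group(1) in updates:
            comment = m.group(3) or ""
            line = f"{m.group(1)}={updates[m.group(1)]} {comment}".rstrip()
        out.append(line)
    return "\n".join(out)

# Stage 1: build rough models straight from alignments, for training only.
TRAINING_PASS = {"est2genome": "1", "protein2genome": "1"}

# Stage 2: a predictor is trained, so switch the direct-inference options off
# and point MAKER at the trained SNAP parameters (hypothetical file name).
PREDICTOR_PASS = {
    "est2genome": "0",
    "protein2genome": "0",
    "snaphmm": "my_species.hmm",
}

example = "\n".join([
    "snaphmm= #SNAP HMM file",
    "est2genome=0 #infer gene predictions directly from ESTs",
    "protein2genome=0 #infer predictions from protein homology",
])
print(set_ctl_options(example, TRAINING_PASS))
```

Running the same helper with `PREDICTOR_PASS` on the stage 1 control file would give the settings for the second run; in that run the alignments act as hints and as a filter rather than as the models themselves.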

--Carson


> On Sep 22, 2017, at 2:15 PM, Keilwagen, Jens <jens.keilwagen at julius-kuehn.de> wrote:
> 
> Hi Carson,
> 
> Thanks a lot for the information.
> 
> Just to be sure that I understand you right: It is impossible to obtain MAKER results based on RNA-seq and homology that differ from purely homology-based MAKER results?
> 
> Could you confirm that?
> 
> Thanks a lot and best regards, Jens
> 
>> -----Original Message-----
>> From: Carson Hinton Holt [mailto:carson.holt at genetics.utah.edu]
>> Sent: Friday, September 22, 2017 22:04
>> To: Keilwagen, Jens
>> Cc: Maker Mailing List
>> Subject: Re: MAKER
>> 
>> MAKER won't produce est2genome results for est_gff. This is partially
>> because est2genome results are only used for training gene predictors.
>> So you are essentially just getting protein2genome results from your
>> runs. Once you get a gene predictor trained you will see a difference,
>> as it will use the intron/exon structure of alignments as hints to
>> improve gene predictor performance.
>> 
>> --Carson
>> 
>> 
>>> On Sep 21, 2017, at 1:57 AM, Keilwagen, Jens <jens.keilwagen at julius-
>> kuehn.de> wrote:
>>> 
>>> Hi Carson,
>>> 
>>> I have tried the proposed options for a small example (yeast).
>>> 
>>> I had
>>> - proteins (fasta) from another yeast and
>>> - transcript annotation (gff) from cufflinks and StringTie
>>> 
>>> I'd like to compare the maker results for
>>> - proteins and StringTie
>>> Vs.
>>> - proteins and cufflinks
>>> 
>>> I used the default options, except:
>>> genome=<genome fasta>
>>> 
>>> protein=<protein fasta>
>>> est_gff=<transcript gff>
>>> 
>>> est2genome=1
>>> protein2genome=1
>>> 
>>> (An example is attached.)
>>> 
>>> Then I ran maker:
>>> 
>>> maker -RM_off -c 24
>>> find . -type f -name *.gff -exec cat {} + | grep maker >
>>> filtered-maker-prediction.gff
>>> 
>>> (The run seems to be okay. There were no FAILED, ... in the log. Cf.
>>> attachment)
>>> 
>>> Each maker run was started in a separate subdirectory.
>>> However, I realized that both maker runs yielded almost the same
>> result (just one minor edit). This made me curious.
>>> As far as I understood the files, I received the (filtered?)
>> exonerate predictions for the proteins (from the other yeast). Is this
>> correct? Why did I not receive any predictions (purely) based on the
>> RNA-seq data? Did I do something wrong?
>>> 
>>> I'm looking forward to your reply.
>>> 
>>> Best regards, Jens
>>> 
>>> 
>>>> -----Original Message-----
>>>> From: Carson Hinton Holt [mailto:carson.holt at genetics.utah.edu]
>>>> Sent: Tuesday, September 19, 2017 23:37
>>>> To: Keilwagen, Jens
>>>> Subject: Re: MAKER
>>>> 
>>>> MAKER cannot use the BAM directly, but you can use something like
>>>> stringtie or trinity to assemble a transcript fasta that can be given
>>>> to the est= option.
>>>> 
>>>> Ab initio gene prediction is only enabled if you specify an hmm or
>>>> species file to use.  If all you want is homology based annotation,
>>>> you can try the est2genome and protein2genome options. Note the final
>>>> models may be partial if the alignments do not cover the gene end to
>>>> end.
>>>> 
>>>> --Carson
>>>> 
>>>> 
>>>> 
>>>>> On Sep 18, 2017, at 4:02 AM, Keilwagen, Jens
>> <jens.keilwagen at julius-
>>>> kuehn.de> wrote:
>>>>> 
>>>>> Hi Carson,
>>>>> 
>>>>> thanks a lot for your last email that .
>>>>> 
>>>>> I was asked to do homology-based gene prediction using RNA-seq and
>>>> Maker was proposed as one option.
>>>>> Hence I'd like to ask how to do that in the best possible way.
>>>>> I have mapped RNA-seq data (SAM/BAM) and a fasta of proteins from a
>>>> related species. How can I integrate the RNA-seq data?
>>>>> 
>>>>> Is it possible to deactivate ab-initio gene prediction by Augustus
>>>>> or
>>>> SNAP?
>>>>> 
>>>>> Thanks a lot in advance.
>>>>> 
>>>>> Best regards, Jens
>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Carson Holt [mailto:carson.holt at genetics.utah.edu]
>>>>>> Sent: Thursday, February 18, 2016 19:03
>>>>>> To: Keilwagen, Jens
>>>>>> Cc: Mark Yandell
>>>>>> Subject: Re: MAKER
>>>>>> 
>>>>>> GeMoMa sounds like an interesting tool.  If it produces GFF3, you
>>>>>> could give the GFF3 results to the pred_gff= option in MAKER
>> (comma
>>>>>> separated lists accepted). The GFF3 file of predictions must be in
>>>>>> the same coordinate space as the assembly being annotated (genome=
>>>> option).
>>>>>> Whatever you give to pred_gff will be treated as a raw predictions
>>>> by
>>>>>> MAKER and will only be accepted as a final model if there are
>>>>>> evidence alignments (protein/EST) that support the model, and if
>>>>>> there are multiple alternate models at the same locus, only the
>>>> model
>>>>>> that is best supported by the protein/transcript evidence is kept.
>>>>>> 
>>>>>> You can also set the keep_preds=1 option when using pred_gff. This
>>>>>> will cause even raw predictions with no evidence support to be
>>>> maintained.
>>>>>> In the event of multiple models with no evidence support, the
>> model
>>>>>> best matching the consensus of alternate models will be
>> maintained.
>>>>>> 
>>>>>> Alternatively you can use the model_gff= options (comma separated
>>>>>> list
>>>>>> ok) to input the GFF3 file.  model_gff features are given higher
>>>>>> confidence than pred_gff. At least one model will always be kept
>>>>>> regardless of evidence support (same rules as pred_gff selection
>>>>>> for which model to keep when there are multiple). But model_gff
>>>>>> will
>>>> also
>>>>>> affect how evidence clusters are determined compared to pred_gff
>>>>>> (model_gff features are allowed to merge bridging evidence
>>>> clusters).
>>>>>> MAKER will also go to extra lengths to pull forward existing names
>>>>>> and other data in the GFF3 for model_gff features.
>>>>>> 
>>>>>> If you do not have GFF3 files in the right coordinate space, but
>> do
>>>>>> have protein fasta or transcript fasta for the GeMoMa predictions,
>>>>>> you can supply these to the protein= and est= options in
>>>> MAKER
>>>>>> together with est2genome=1 or protein2genome=1. This will cause
>>>> MAKER
>>>>>> to place the models using exonerate. You would probably also need
>>>>>> to add est_forward=1 to the control files to have MAKER try and
>>>>>> derive model names from the name of evidence alignments they were
>>>>>> derived from if you go this route.
>>>>>> 
>>>>>> You can also try treating the GFF3 predictions as hints to
>>>>>> traditional ab initio gene finders like SNAP or Augustus by giving
>>>>>> them to the est_gff= or protein_gff= options (i.e. make GeMoMa
>>>>>> predictions inform the behavior of predictors like SNAP and
>>>>>> Augustus). Might be interesting. You would have to alter results
>> to
>>>>>> be match/match_part
>>>>>> GFF3 features to give them to the est_gff or protein_gff options.
>>>>>> 
>>>>>> Let me know if you have any more questions, and I'll do my best to
>>>>>> help.
>>>>>> 
>>>>>> Thanks,
>>>>>> Carson
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Feb 18, 2016, at 10:22 AM, Mark Yandell
>>>>>> <myandell at genetics.utah.edu> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> Mark Yandell
>>>>>>> Professor of Human Genetics
>>>>>>> H.A. & Edna Benning Presidential Endowed Chair Co-director USTAR
>>>>>>> Center for Genetic Discovery Eccles Institute of Human Genetics
>>>>>>> University of Utah
>>>>>>> 15 North 2030 East, Room 2100
>>>>>>> Salt Lake City, UT 84112-5330
>>>>>>> ph:801-587-7707
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On 2/18/16, 8:34 AM, "Keilwagen, Jens"
>>>>>>> <jens.keilwagen at jki.bund.de>
>>>>>> wrote:
>>>>>>> 
>>>>>>>> Dear Prof. Yandell,
>>>>>>>> 
>>>>>>>> we have published a homology-based gene prediction program
>> today:
>>>>>>>> 
>> https://nar.oxfordjournals.org/content/early/2016/02/17/nar.gkw09
>>>>>>>> 2 and I'd like to ask how we can use MAKER to combine
>> predictions
>>>>>>>> of GeMoMa using different reference organisms, i.e. we try to
>>>>>>>> predict the genes of an target organism (e.g. wheat) using the
>>>>>>>> annotated genes of other reference organisms (e.g. grasses).
>>>>>>>> GeMoMa returns
>>>>>> for
>>>>>>>> each reference organism a GFF with the predicted gene models in
>>>> the
>>>>>> target organism.
>>>>>>>> 
>>>>>>>> It would be great if you or someone from your team could give us
>>>>>> some
>>>>>>>> hints or point us to correct paragraph in the documentation.
>>>>>>>> 
>>>>>>>> Thanks a lot and best regards, Jens
>>>>>>>> 
>>>>>>>> ---
>>>>>>>> 
>>>>>>>> Dr. Jens Keilwagen
>>>>>>>> 
>>>>>>>> Julius Kühn-Institut (JKI) - Federal Research Centre for
>>>> Cultivated
>>>>>>>> Plants
>>>>>>>> 	Institute for Biosafety in Plant Biotechnology
>>>>>>>> 
>>>>>>>> Erwin-Baur-Straße 27
>>>>>>>> 06484 Quedlinburg
>>>>>>>> Germany
>>>>>>>> 
>>>>>>>> Phone: ++49 (0)3946 47 510
>>>>>>>> EMail: jens.keilwagen at jki.bund.de
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>> 
>>> <maker_opts.ctl><slurm-278767.out>
> 


From jens.keilwagen at julius-kuehn.de  Fri Sep 22 14:15:23 2017
From: jens.keilwagen at julius-kuehn.de (Keilwagen, Jens)
Date: Fri, 22 Sep 2017 20:15:23 +0000
Subject: [maker-devel] MAKER
In-Reply-To: <B11ADE34-41A4-4EA7-A0AC-D4DFD649991D@genetics.utah.edu>
References: <80e44920c4d64d6e82d18e5535656a32@JKI-EXPF03.braunschweig.bba.intern>
	<D2EB4BE8.3069C%myandell@genetics.utah.edu>
	<3BE2DF2C-9443-4314-9524-F0109AAE42E8@genetics.utah.edu>
	<ddad51a1aab5452b816be741d2969d81@JKI-EXPF03.braunschweig.bba.intern>
	<9867D99C-E4AD-4673-BB0B-804A1551F9F3@genetics.utah.edu>
	<f9b68e0588dd4ce6a65dcd63cb4f7b4e@JKI-EXPF03.braunschweig.bba.intern>
	<B11ADE34-41A4-4EA7-A0AC-D4DFD649991D@genetics.utah.edu>
Message-ID: <1f033c3dcf414407a888bbef8201d469@JKI-EXPF03.braunschweig.bba.intern>

Hi Carson,

Thanks a lot for the information.

Just to be sure that I understand you right: It is impossible to obtain MAKER results based on RNA-seq and homology that differ from purely homology-based MAKER results?

Could you confirm that?

Thanks a lot and best regards, Jens

> -----Original Message-----
> From: Carson Hinton Holt [mailto:carson.holt at genetics.utah.edu]
> Sent: Friday, September 22, 2017 22:04
> To: Keilwagen, Jens
> Cc: Maker Mailing List
> Subject: Re: MAKER
> 
> MAKER won't produce est2genome results for est_gff. This is partially
> because est2genome results are only used for training gene predictors.
> So you are essentially just getting protein2genome results from your
> runs. Once you get a gene predictor trained you will see a difference,
> as it will use the intron/exon structure of alignments as hints to
> improve gene predictor performance.
> 
> --Carson
> 
> 
> > On Sep 21, 2017, at 1:57 AM, Keilwagen, Jens <jens.keilwagen at julius-
> kuehn.de> wrote:
> >
> > Hi Carson,
> >
> > I have tried the proposed options for a small example (yeast).
> >
> > I had
> > - proteins (fasta) from another yeast and
> > - transcript annotation (gff) from cufflinks and StringTie
> >
> > I'd like to compare the maker results for
> > - proteins and StringTie
> > Vs.
> > - proteins and cufflinks
> >
> > I used the default options, except:
> > genome=<genome fasta>
> >
> > protein=<protein fasta>
> > est_gff=<transcript gff>
> >
> > est2genome=1
> > protein2genome=1
> >
> > (An example is attached.)
> >
> > Then I ran maker:
> >
> > maker -RM_off -c 24
> > find . -type f -name *.gff -exec cat {} + | grep maker >
> > filtered-maker-prediction.gff
> >
> > (The run seems to be okay. There were no FAILED, ... in the log. Cf.
> > attachment)
> >
> > Each maker run was started in a separate subdirectory.
> > However, I realized that both maker runs yielded almost the same
> result (just one minor edit). This made me curious.
> > As far as I understood the files, I received the (filtered?)
> exonerate predictions for the proteins (from the other yeast). Is this
> correct? Why did I not receive any predictions (purely) based on the
> RNA-seq data? Did I do something wrong?
> >
> > I'm looking forward to your reply.
> >
> > Best regards, Jens
> >
> >
> >> -----Original Message-----
> >> From: Carson Hinton Holt [mailto:carson.holt at genetics.utah.edu]
> >> Sent: Tuesday, September 19, 2017 23:37
> >> To: Keilwagen, Jens
> >> Subject: Re: MAKER
> >>
> >> MAKER cannot use the BAM directly, but you can use something like
> >> stringtie or trinity to assemble a transcript fasta that can be given
> >> to the est= option.
> >>
> >> Ab initio gene prediction is only enabled if you specify an hmm or
> >> species file to use.  If all you want is homology based annotation,
> >> you can try the est2genome and protein2genome options. Note the final
> >> models may be partial if the alignments do not cover the gene end to
> >> end.
> >>
> >> --Carson
> >>
> >>
> >>
> >>> On Sep 18, 2017, at 4:02 AM, Keilwagen, Jens
> <jens.keilwagen at julius-
> >> kuehn.de> wrote:
> >>>
> >>> Hi Carson,
> >>>
> >>> thanks a lot for your last email that .
> >>>
> >>> I was asked to do homology-based gene prediction using RNA-seq and
> >> Maker was proposed as one option.
> >>> Hence I'd like to ask how to do that in the best possible way.
> >>> I have mapped RNA-seq data (SAM/BAM) and a fasta of proteins from a
> >> related species. How can I integrate the RNA-seq data?
> >>>
> >>> Is it possible to deactivate ab-initio gene prediction by Augustus
> >>> or
> >> SNAP?
> >>>
> >>> Thanks a lot in advance.
> >>>
> >>> Best regards, Jens
> >>>
> >>>> -----Original Message-----
> >>>> From: Carson Holt [mailto:carson.holt at genetics.utah.edu]
> >>>> Sent: Thursday, February 18, 2016 19:03
> >>>> To: Keilwagen, Jens
> >>>> Cc: Mark Yandell
> >>>> Subject: Re: MAKER
> >>>>
> >>>> GeMoMa sounds like an interesting tool.  If it produces GFF3, you
> >>>> could give the GFF3 results to the pred_gff= option in MAKER
> (comma
> >>>> separated lists accepted). The GFF3 file of predictions must be in
> >>>> the same coordinate space as the assembly being annotated (genome=
> >> option).
> >>>> Whatever you give to pred_gff will be treated as a raw predictions
> >> by
> >>>> MAKER and will only be accepted as a final model if there are
> >>>> evidence alignments (protein/EST) that support the model, and if
> >>>> there are multiple alternate models at the same locus, only the
> >> model
> >>>> that is best supported by the protein/transcript evidence is kept.
> >>>>
> >>>> You can also set the keep_preds=1 option when using pred_gff. This
> >>>> will cause even raw predictions with no evidence support to be
> >> maintained.
> >>>> In the event of multiple models with no evidence support, the
> model
> >>>> best matching the consensus of alternate models will be
> maintained.
> >>>>
> >>>> Alternatively you can use the model_gff= options (comma separated
> >>>> list
> >>>> ok) to input the GFF3 file.  model_gff features are given higher
> >>>> confidence than pred_gff. At least one model will always be kept
> >>>> regardless of evidence support (same rules as pred_gff selection
> >>>> for which model to keep when there are multiple). But model_gff
> >>>> will
> >> also
> >>>> affect how evidence clusters are determined compared to pred_gff
> >>>> (model_gff features are allowed to merge bridging evidence
> >> clusters).
> >>>> MAKER will also go to extra lengths to pull forward existing names
> >>>> and other data in the GFF3 for model_gff features.
> >>>>
> >>>> If you do not have GFF3 files in the right coordinate space, but
> do
> >>>> have protein fasta or transcript fasta for the GeMoMa predictions,
> >>>> you can supply these to the protein= and est= options in
> >> MAKER
> >>>> together with est2genome=1 or protein2genome=1. This will cause
> >> MAKER
> >>>> to place the models using exonerate. You would probably also need
> >>>> to add est_forward=1 to the control files to have MAKER try and
> >>>> derive model names from the name of evidence alignments they were
> >>>> derived from if you go this route.
> >>>>
> >>>> You can also try treating the GFF3 predictions as hints to
> >>>> traditional ab initio gene finders like SNAP or Augustus by giving
> >>>> them to the est_gff= or protein_gff= options (i.e. make GeMoMa
> >>>> predictions inform the behavior of predictors like SNAP and
> >>>> Augustus). Might be interesting. You would have to alter results
> to
> >>>> be match/match_part
> >>>> GFF3 features to give them to the est_gff or protein_gff options.
> >>>>
> >>>> Let me know if you have any more questions, and I'll do my best to
> >>>> help.
> >>>>
> >>>> Thanks,
> >>>> Carson
> >>>>
> >>>>
> >>>>
> >>>>> On Feb 18, 2016, at 10:22 AM, Mark Yandell
> >>>> <myandell at genetics.utah.edu> wrote:
> >>>>>
> >>>>>
> >>>>> Mark Yandell
> >>>>> Professor of Human Genetics
> >>>>> H.A. & Edna Benning Presidential Endowed Chair Co-director USTAR
> >>>>> Center for Genetic Discovery Eccles Institute of Human Genetics
> >>>>> University of Utah
> >>>>> 15 North 2030 East, Room 2100
> >>>>> Salt Lake City, UT 84112-5330
> >>>>> ph:801-587-7707
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 2/18/16, 8:34 AM, "Keilwagen, Jens"
> >>>>> <jens.keilwagen at jki.bund.de>
> >>>> wrote:
> >>>>>
> >>>>>> Dear Prof. Yandell,
> >>>>>>
> >>>>>> we have published a homology-based gene prediction program
> today:
> >>>>>>
> https://nar.oxfordjournals.org/content/early/2016/02/17/nar.gkw09
> >>>>>> 2 and I'd like to ask how we can use MAKER to combine
> predictions
> >>>>>> of GeMoMa using different reference organisms, i.e. we try to
> >>>>>> predict the genes of an target organism (e.g. wheat) using the
> >>>>>> annotated genes of other reference organisms (e.g. grasses).
> >>>>>> GeMoMa returns
> >>>> for
> >>>>>> each reference organism a GFF with the predicted gene models in
> >> the
> >>>> target organism.
> >>>>>>
> >>>>>> It would be great if you or someone from your team could give us
> >>>> some
> >>>>>> hints or point us to correct paragraph in the documentation.
> >>>>>>
> >>>>>> Thanks a lot and best regards, Jens
> >>>>>>
> >>>>>> ---
> >>>>>>
> >>>>>> Dr. Jens Keilwagen
> >>>>>>
> >>>>>> Julius Kühn-Institut (JKI) - Federal Research Centre for
> >> Cultivated
> >>>>>> Plants
> >>>>>> 	Institute for Biosafety in Plant Biotechnology
> >>>>>>
> >>>>>> Erwin-Baur-Straße 27
> >>>>>> 06484 Quedlinburg
> >>>>>> Germany
> >>>>>>
> >>>>>> Phone: ++49 (0)3946 47 510
> >>>>>> EMail: jens.keilwagen at jki.bund.de
> >>>>>>
> >>>>>>
> >>>>>
> >>>
> >
> > <maker_opts.ctl><slurm-278767.out>


From venyao at qq.com  Sun Sep 24 03:08:43 2017
From: venyao at qq.com (=?ISO-8859-1?B?V2VuIFlhbw==?=)
Date: Sun, 24 Sep 2017 17:08:43 +0800
Subject: [maker-devel] integrate gmap into Maker
Message-ID: <tencent_C1B73CB316EAE11D6C6509793B207C5AE908@qq.com>

Dear Guys,


I am using Maker to annotate my genome sequence. However, it costs too much time.


By default, Maker use Blastn and Exonerate to align EST or assembled transcripts to the genome.


I am wondering if I can use GMAP to align the assembled transcripts to the genome and then provide the


alignment to Maker. If so, this may save much time, as GMAP is very fast.


Thanks!


Best regards,


Wen Yao

From eennadi at gmail.com  Sun Sep 24 15:24:10 2017
From: eennadi at gmail.com (Emmanuel Nnadi)
Date: Sun, 24 Sep 2017 22:24:10 +0100
Subject: [maker-devel] Maker not installing
In-Reply-To: <8576F599-08C0-455A-B0D7-8A1110C6F18F@mail.ufl.edu>
References: <CAOXM=CXp_YU=4kaNCFPVnoWKTzAP5we5XTFX6uFTWJSY_11fng@mail.gmail.com>
	<8576F599-08C0-455A-B0D7-8A1110C6F18F@mail.ufl.edu>
Message-ID: <CAOXM=CUH4NjOev7cW2-d8EUKJPmQ1Ob_hLQX-mFG_jTSVfzD3Q@mail.gmail.com>

Hello,

Good day,

I am trying to assign putative gene function to the maker generated fasta.
I am using NCBI

I keep getting this error
  Command line argument error: Argument "query". File is not accessible:
`muc1_genome_snap2.all.maker.snap_masked.proteins.fasta'

What do I do?

can I use blast2go in place of ncbi command line software?

Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:
https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Mon, Aug 28, 2017 at 2:00 PM, Ence,daniel <d.ence at ufl.edu> wrote:

> Hi Emmanuel, in order for anyone to help you, you need to post to the mailing
> list the command and output (including errors) of the step that didn't
> work.
>
> Thanks,
> Daniel Ence
>
>
> On Aug 27, 2017, at 10:16 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>
> Hello all,
>
> I downloaded Maker and tried to install it. I succeeded in installing all
> prerequisites however running maker ./build install, it showed that maker
> installed.
>
> However trying to run maker it wouldn't run.
>
> Please how do I install maker to run on local computer?
>
> Thanks
>
> Nnadi Nnaemeka Emmanuel
> Department of Microbiology,
> Faculty of Natural and Applied Science,
> Plateau State University, Bokkos, Plateau State, Nigeria.
> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/
> publications
>
>
>
>
>

From dandence at gmail.com  Mon Sep 25 08:11:31 2017
From: dandence at gmail.com (Daniel Ence)
Date: Mon, 25 Sep 2017 10:11:31 -0400
Subject: [maker-devel] integrate gmap into Maker
In-Reply-To: <tencent_C1B73CB316EAE11D6C6509793B207C5AE908@qq.com>
References: <tencent_C1B73CB316EAE11D6C6509793B207C5AE908@qq.com>
Message-ID: <7E5F06C8-05B2-447F-A695-DDE7673BDEFF@gmail.com>

Without commenting on the merits of GMAP vs. Blastn or Exonerate: you can provide evidence alignments from any source in GFF3 format in the MAKER control files (for transcript alignments, via the est_gff option). I think for GMAP this would mean converting the SAM/BAM output to GFF3, but I don't know those steps off the top of my head.
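
Whatever tool produces the GFF3, it may help to check which feature types it contains before handing it to MAKER, since elsewhere on this list alignment evidence is described as needing match/match_part features. The sketch below only tallies column 3 of a GFF3 file; the function names and the sample lines are made up for illustration. As far as I know, GMAP can also write GFF3 directly through its `-f` output format option (for example `gff3_match_est`), but check that against your installed version, and the resulting feature types may still need adjusting.

```python
from collections import Counter

def gff3_feature_types(lines):
    """Count the feature types (column 3) in an iterable of GFF3 lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        if line.startswith(">"):
            break  # embedded FASTA section: no more features follow
        cols = line.split("\t")
        if len(cols) == 9:
            counts[cols[2]] += 1
    return counts

def non_match_types(counts, allowed=("match", "match_part")):
    """List feature types that are not in the match/match_part style."""
    return sorted(t for t in counts if t not in allowed)

# Hypothetical alignment records, tab-separated as in real GFF3.
sample = [
    "##gff-version 3",
    "chr1\taligner\tmatch\t100\t500\t.\t+\t.\tID=t1",
    "chr1\taligner\tmatch_part\t100\t200\t.\t+\t.\tID=t1.e1;Parent=t1",
    "chr1\taligner\texon\t300\t500\t.\t+\t.\tParent=t1",
]
counts = gff3_feature_types(sample)
print(dict(counts))
print(non_match_types(counts))
```

For a real file, pass an open file handle instead of the list, e.g. `gff3_feature_types(open("alignments.gff3"))`. Any types reported by `non_match_types` are candidates for renaming before the file is used as evidence.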

~Daniel 


> On Sep 24, 2017, at 5:08 AM, Wen Yao <venyao at qq.com> wrote:
> 
> Dear Guys,
> 
>  
> 
> I am using Maker to annotate my genome sequence. However, it costs too much time.
> 
> By default, Maker use Blastn and Exonerate to align EST or assembled transcripts to the genome.
> 
> I am wondering if I can use GMAP to align the assembled transcripts to the genome and then provide the
> 
> alignment to Maker. If so, this may save much time, as GMAP is very fast.
> 
> 
> 
> Thanks!
> 
>  
> 
> Best regards,
> 
>  
> 
> Wen Yao
> 


From carsonhh at gmail.com  Mon Sep 25 10:07:39 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 25 Sep 2017 10:07:39 -0600
Subject: [maker-devel] Maker not installing
In-Reply-To: <CAOXM=CUH4NjOev7cW2-d8EUKJPmQ1Ob_hLQX-mFG_jTSVfzD3Q@mail.gmail.com>
References: <CAOXM=CXp_YU=4kaNCFPVnoWKTzAP5we5XTFX6uFTWJSY_11fng@mail.gmail.com>
	<8576F599-08C0-455A-B0D7-8A1110C6F18F@mail.ufl.edu>
	<CAOXM=CUH4NjOev7cW2-d8EUKJPmQ1Ob_hLQX-mFG_jTSVfzD3Q@mail.gmail.com>
Message-ID: <07342091-897A-46C2-B000-76A283FE5FB1@gmail.com>

I'm not sure what you mean by NCBI. Do you mean BLAST? If so, you probably did not format and index your input database before running BLAST. See the BLAST documentation.

Also, the file you are using -> muc1_genome_snap2.all.maker.snap_masked.proteins.fasta

That is not the MAKER result file. It is a reference fasta of raw SNAP results. The MAKER result file will have a name like this (see the MAKER documentation) -> muc1_genome_snap2.all.maker.proteins.fasta
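
A quick way to avoid both problems before launching BLAST is to check that the query file exists where you are running from and that its name matches the final MAKER output rather than a per-predictor file. This is a minimal sketch based only on the two file names above; the function names are hypothetical and the naming rule is an assumption drawn from those examples.

```python
from pathlib import Path

def classify_maker_fasta(path):
    """Guess from the file name whether a FASTA is MAKER's final output."""
    name = Path(path).name
    for suffix in (".proteins.fasta", ".transcripts.fasta"):
        if name.endswith(".all.maker" + suffix):
            return "final MAKER models"
        if ".all.maker." in name and name.endswith(suffix):
            # e.g. *.all.maker.snap_masked.proteins.fasta
            return "raw predictor output"
    return "unknown"

def query_problems(path):
    """Return a list of reasons this file is a poor BLAST query."""
    problems = []
    if not Path(path).is_file():
        problems.append("file not found from the current directory")
    if classify_maker_fasta(path) != "final MAKER models":
        problems.append("not the final MAKER model set")
    return problems

print(classify_maker_fasta("muc1_genome_snap2.all.maker.snap_masked.proteins.fasta"))
print(classify_maker_fasta("muc1_genome_snap2.all.maker.proteins.fasta"))
```

An empty list from `query_problems` means the path resolves and the name looks like the final model set; a "file not found" entry usually means the command is being run from a different directory than the one holding the fasta.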

?Carson


> On Sep 24, 2017, at 3:24 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
> 
> Hello,
> 
> Good day,
> 
> I am trying to assign putative gene function to the maker generated fasta. I am using NCBI
> 
> I keep getting this error
>   Command line argument error: Argument "query". File is not accessible:  `muc1_genome_snap2.all.maker.snap_masked.proteins.fasta'
> 
> What do I do?
> 
> can I use blast2go in place of ncbi command line software?
> 
> Nnadi Nnaemeka Emmanuel
> Department of Microbiology,
> Faculty of Natural and Applied Science,
> Plateau State University, Bokkos, Plateau State, Nigeria.
> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications <https://www.researchgate.net/profile/Emmanuel_Nnadi/publications>
> On Mon, Aug 28, 2017 at 2:00 PM, Ence,daniel <d.ence at ufl.edu <mailto:d.ence at ufl.edu>> wrote:
> Hi Emmanuel, In order for anyone to help you, you need to post to the mailing list the command and output (including errors) of the step that didn't work.
> 
> Thanks,
> Daniel Ence
> 
> 
>> On Aug 27, 2017, at 10:16 AM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
>> 
>> Hello all,
>> 
>> I downloaded Maker and tried to install it. I succeeded in installing all prerequisites however running maker ./build install, it showed that maker installed.
>> 
>> However trying to run maker it wouldn't run.
>> 
>> Please how do I install maker to run on local computer?
>> 
>> Thanks
>> 
>> Nnadi Nnaemeka Emmanuel
>> Department of Microbiology,
>> Faculty of Natural and Applied Science,
>> Plateau State University, Bokkos, Plateau State, Nigeria.
>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications <https://www.researchgate.net/profile/Emmanuel_Nnadi/publications>
>> 
>>    
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com <mailto:maker-devel at box290.bluehost.com>
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org <http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org>
> 
> 
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org


From xvazquezc at gmail.com  Tue Sep 26 01:23:13 2017
From: xvazquezc at gmail.com (=?UTF-8?Q?Xabier_V=C3=A1zquez=2DCampos?=)
Date: Tue, 26 Sep 2017 17:23:13 +1000
Subject: [maker-devel] question about Maker-MPI
Message-ID: <CAL0hg4EemGn5F25U6xQjBpG0i_ZYzSa=E3E2OXgBu8Q_2qm87A@mail.gmail.com>

Hi Carson,
We finally got Maker working with MPI (mpich, openmpi was a dead end...)
and I have a question about how Maker distributes the computation load.
I know, correct me if I'm wrong, that with MPI, Maker runs blast in
parallel (1 instance per thread) for protein2genome and est2genome. This
indeed improves enormously the speed for the initial run.
But, does it take advantage of this at the time of running the gene
predictors? I think there is no benefit on multiple cpus in non-MPI mode
but I have no idea in MPI.
Thank you in advance,
Xabi

-- 
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From carsonhh at gmail.com  Tue Sep 26 09:28:58 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 26 Sep 2017 09:28:58 -0600
Subject: [maker-devel] question about Maker-MPI
In-Reply-To: <CAL0hg4EemGn5F25U6xQjBpG0i_ZYzSa=E3E2OXgBu8Q_2qm87A@mail.gmail.com>
References: <CAL0hg4EemGn5F25U6xQjBpG0i_ZYzSa=E3E2OXgBu8Q_2qm87A@mail.gmail.com>
Message-ID: <E29F4653-61A3-4E33-967A-4E1A9C8C4721@gmail.com>

MAKER parallelizes at multiple levels. For the ab initio predictors, it will run multiple contigs simultaneously (so each one will get their own ab initio predictor running). For large contigs it will further divide it into 10Mb chunks, and each will run simultaneously.
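To make the chunking arithmetic concrete, here is a minimal Python sketch. It is not MAKER's code: it assumes plain non-overlapping 10 Mb windows, and the names `CHUNK_BP` and `chunk_spans` are made up for illustration. MAKER's real chunk boundaries may differ.

```python
# Assumption: fixed, non-overlapping windows of 10 Mb, the figure quoted above.
CHUNK_BP = 10_000_000


def chunk_spans(contig_len, chunk_bp=CHUNK_BP):
    """Return 1-based inclusive (start, end) spans that cover a contig."""
    return [
        (start + 1, min(start + chunk_bp, contig_len))
        for start in range(0, contig_len, chunk_bp)
    ]


# An 18 Mb contig would be split into two spans that can run side by side.
for span in chunk_spans(18_000_000):
    print(span)
```

Under this assumption, a contig shorter than 10 Mb stays in one piece, so the extra parallelism only helps on the large contigs.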

--Carson


> On Sep 26, 2017, at 1:23 AM, Xabier Vázquez-Campos <xvazquezc at gmail.com> wrote:
> 
> Hi Carson,
> We finally got Maker working with MPI (mpich, openmpi was a dead end...) and I have a question about how Maker distributes the computation load.
> I know, correct me if I'm wrong, that with MPI, Maker runs blast in parallel (1 instance per thread) for protein2genome and est2genome. This indeed improves enormously the speed for the initial run.
> But, does it take advantage of this at the time of running the gene predictors? I think there is no benefit on multiple cpus in non-MPI mode but I have no idea in MPI.
> Thank you in advance,
> Xabi
> 
> -- 
> Xabier Vázquez-Campos, PhD
> Research Associate
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA


From cjfields at illinois.edu  Mon Sep 25 08:53:39 2017
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 25 Sep 2017 14:53:39 +0000
Subject: [maker-devel] Maker not installing
In-Reply-To: <78A8137E-42B4-4766-9E4A-1B3C2F4FC578@gmail.com>
References: <CAOXM=CXp_YU=4kaNCFPVnoWKTzAP5we5XTFX6uFTWJSY_11fng@mail.gmail.com>
	<8576F599-08C0-455A-B0D7-8A1110C6F18F@mail.ufl.edu>
	<CAOXM=CWWENFHJUhYPawM5+F4571HYOyp3oWD1xju9H9LABq9VQ@mail.gmail.com>
	<113031DF-E733-4EEF-BDB7-405C15F0CA24@mail.ufl.edu>
	<CAOXM=CVrYeNV9i6TxYDn1B-dCA-OYh+VqsEmc-PsA2rP14UZ9A@mail.gmail.com>
	<546610AC-0B7B-4D02-BBF3-E847B95D7F0D@mail.ufl.edu>
	<C3F7BAD9-C134-4EC9-9792-F2C27A78FBFD@gmail.com>
	<CAOXM=CXeT-zmFubptPnWsAygnv8o+EQnCWD60iLf5eFFBkb6rg@mail.gmail.com>
	<8EAFE412-9EF7-4DB7-85A3-632BAC3372FD@gmail.com>
	<CAOXM=CVbJonkz=O5=N_NmB1L+524D6Uk9Fjh-ZBoJmu+9AsMFQ@mail.gmail.com>
	<C9D0377D-1C89-4910-A6BE-FA7DD3D8CDCB@gmail.com>
	<CAOXM=CX7yZTTbBALhjBzz0qtMxQGxBNVO0UEnkis3aDvVf=GLA@mail.gmail.com>
	<C06B6277-1CFA-4B8D-8E80-D08910EBD77C@gmail.com>
	<CAOXM=CXSBG+h1DYoOBrFHYr_JXZX6YCFyo415HRGHozMcmHicg@mail.gmail.com>
	<EC4577AD-A3C5-45CD-ADE7-5D19ED089833@gmail.com>
	<CAOXM=CVby_i3vfF_H46RTZWoqBLMCBGNqJoywxG_2KQJPc5Ttw@mail.gmail.com>
	<7440971C-8A18-4A07-91B6-AD16D17F0766@gmail.com>
	<CAOXM=CXUBcvbqZdT2nviYnFyC3X4Xbj3oREVt650n0c01HESyg@mail.gmail.com>
	<426E63C1-8C82-4809-B2A1-EF7A909E6712@gmail.com>
	<CAOXM=CUPj9EGj_LNEhBpOw0oCiQKoVw8JuPv8-WASo2paSXkug@mail.gmail.com>
	<CAOXM=CUxE-MSWsjZwMYeZ-aOu9Jc9bP6k1uzDLrEyfyqwR-B_Q@mail.gmail.com>
	<78A8137E-42B4-4766-9E4A-1B3C2F4FC578@gmail.com>
Message-ID: <ED8DB3BD-0981-4883-8CE0-E920BCEE0CC6@illinois.edu>

Emmanuel,

Look for anything that will help calculate basic assembly metrics, such as N50, NG50, L50, etc.; these almost always give overall assembly size, and total scaffolds/contigs.  For instance I've used this:

http://korflab.ucdavis.edu/datasets/Assemblathon/Assemblathon2/Basic_metrics/assemblathon_stats.pl

(it requires FAlite, which is here: http://korflab.ucdavis.edu/Unix_and_Perl/FAlite.pm )

The Broad also has GAEMR (http://software.broadinstitute.org/software/gaemr/ ), but I haven't tested it myself (I've heard it's a bit finicky).

Also, see this: https://www.biostars.org/p/237591/ , which has a few more options.
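If all you need is the sequence count, the total size, and N50, a short script is enough. The following is a rough Python sketch, not a replacement for the tools above. The function names are illustrative only, and it counts every character on the sequence lines, including N gaps. Keep in mind that a scaffold count is not a chromosome count.

```python
def read_fasta_lengths(lines):
    """Return {sequence_id: length} from an iterable of FASTA lines."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def assembly_stats(lengths):
    """Sequence count, total size, and N50 for a {id: length} mapping."""
    sizes = sorted(lengths.values(), reverse=True)
    total = sum(sizes)
    running = 0
    n50 = 0
    for size in sizes:
        running += size
        # N50: length of the sequence at which half of the assembly is covered
        if running * 2 >= total:
            n50 = size
            break
    return {"sequences": len(sizes), "total_bp": total, "n50": n50}


# Usage with a file on disk (the file name is only an example):
# with open("genome.fasta") as handle:
#     print(assembly_stats(read_fasta_lengths(handle)))
```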

chris

From: maker-devel <maker-devel-bounces at yandell-lab.org> on behalf of Carson Holt <carsonhh at gmail.com>
Date: Friday, September 22, 2017 at 3:09 PM
To: Emmanuel Nnadi <eennadi at gmail.com>
Cc: "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
Subject: Re: [maker-devel] Maker not installing

MAKER can't give you those details. All MAKER does is try and identify gene models against the assembly you provide.

--Carson

On Sep 22, 2017, at 1:27 PM, Emmanuel Nnadi <eennadi at gmail.com<mailto:eennadi at gmail.com>> wrote:

Hello all,
Please how can I determine the following in maker:
1. The total number of chromosomes
2. The size of my genome


Thanks

Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Fri, Sep 1, 2017 at 10:52 PM, Emmanuel Nnadi <eennadi at gmail.com<mailto:eennadi at gmail.com>> wrote:
Ok, thanks.
Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications


On Sep 1, 2017 10:50 PM, "Carson Holt" <carsonhh at gmail.com<mailto:carsonhh at gmail.com>> wrote:
It would need to be a new run. You won't be able to use the updated contig names with the old run.
--Carson

Sent from my iPhone

On Sep 1, 2017, at 3:41 PM, Emmanuel Nnadi <eennadi at gmail.com<mailto:eennadi at gmail.com>> wrote:
Hi carson
Thanks for the tip
perl -ane 's/1_S7_R1_001_\(paired\)_trimmed_\(paired\)_//g; print' genome.fasta

It worked well however, when i ran it, it removed 1_S7_R1_001_\(paired\)_trimmed_\(paired\)_,

I have ran maker with 1_S7_R1_001_\(paired\)_trimmed_\(paired\)_,

1. How can I effect the change when maker has produced some files from the the old sequence?

I have spent more than 24 hours running maker and it has produced some folders already.

How can I make this change?

Thanks


Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Fri, Sep 1, 2017 at 4:54 PM, Carson Holt <carsonhh at gmail.com<mailto:carsonhh at gmail.com>> wrote:
BLAST, which is used by MAKER, cannot handle really long contig names. MAKER tries to get around this by adding a secondary tag to the fasta header when long names are detected. Even then it would be better to change the IDs of your contigs to avoid downstream failures.

I would recommend removing '1_S7_R1_001_(paired)_trimmed_(paired)_' from each contig name.

Example command to do that (it prints the renamed fasta to standard output, so redirect it to a new file) -->
perl -ane 's/1_S7_R1_001_\(paired\)_trimmed_\(paired\)_//g; print' genome.fasta
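For anyone who prefers Python, the same renaming could be sketched roughly as below. This is an illustration, not part of MAKER. Unlike the perl one-liner, it only touches header lines and only strips the prefix at the start of the ID. The 50-character limit is the one reported in the RepeatMasker error in this thread. The function names and the output file name are hypothetical.

```python
PREFIX = "1_S7_R1_001_(paired)_trimmed_(paired)_"  # prefix quoted in this thread
MAX_ID_LEN = 50  # limit named in the RepeatMasker error message in this thread


def strip_prefix(lines, prefix=PREFIX):
    """Yield FASTA lines with `prefix` removed from the start of each header."""
    for line in lines:
        if line.startswith(">") and line[1:].startswith(prefix):
            yield ">" + line[1 + len(prefix):]
        else:
            yield line


def long_ids(lines, max_len=MAX_ID_LEN):
    """Return sequence IDs (first word of each header) longer than max_len."""
    found = []
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts and len(parts[0]) > max_len:
                found.append(parts[0])
    return found


# Usage (file names are examples only):
# with open("genome.fasta") as src, open("genome.renamed.fasta", "w") as dst:
#     dst.writelines(strip_prefix(src))
```

Writing to a new file rather than editing in place keeps the original assembly available if the rename needs to be redone.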

--Carson


On Aug 30, 2017, at 3:54 PM, Emmanuel Nnadi <eennadi at gmail.com<mailto:eennadi at gmail.com>> wrote:

Hi Carson
Thanks for your response its been helpful

Please bear with me as I work through this

1. Please how do I generate EST for my novel sequences?
2. I am currently running maker without EST and protein sequences is it wrong? Can it predict properly?
3. One error in the contig just returned this value
FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 )
 at /usr/local/bin/RepeatMasker line 1464.
FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 )
 at /usr/local/bin/RepeatMasker line 1464.
FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 )
 at /usr/local/bin/RepeatMasker line 1464.
ERROR: RepeatMasker failed
--> rank=NA, hostname=emmannaemekas-MacBook-Pro.local
ERROR: Failed while doing repeat masking
ERROR: Chunk failed at level:0, tier_type:1
FAILED CONTIG:1_S7_R1_001_(paired)_trimmed_(paired)_contig_2

ERROR: Chunk failed at level:2, tier_type:0
FAILED CONTIG:1_S7_R1_001_(paired)_trimmed_(paired)_contig_2

examining contents of the fasta file and run log


Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Wed, Aug 30, 2017 at 4:12 PM, Carson Holt <carsonhh at gmail.com<mailto:carsonhh at gmail.com>> wrote:
You can query valid species names using the queryTaxonomyDatabase.pl script that comes with RepeatMasker. Try not to be too specific. In general you should use the genus rather than the species for example (or even use all of RepBase).

Example -->
perl .../RepeatMasker/util/queryTaxonomyDatabase.pl -species "drosophila"

--Carson


On Aug 30, 2017, at 9:05 AM, Emmanuel Nnadi <eennadi at gmail.com<mailto:eennadi at gmail.com>> wrote:

Hi Carson,

 Thanks
I was able to start using maker.

However I am working with a plant Genome novel. I had set the repeatmasking to
1. Dcotrep a names from the repbase release but maker returned it back as not known to repeat masker

How can I use specific known genomes for repeat masking
Thanks
Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications


On Aug 29, 2017 4:26 PM, "Carson Holt" <carsonhh at gmail.com<mailto:carsonhh at gmail.com>> wrote:
MAKER will read the genome= options from the maker_opts.ctl file in your current directory or the maker_opts.ctl you specified on the command line. The error means you have left the value empty. Perhaps you did not save the changes you made or you did not specify the location of the maker_opts.ctl file to use.

You can check the contents of the file using cat. Example --> cat maker_opts.ctl
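As a quick cross-check, the same inspection can be scripted. The sketch below is a simplified Python reader for `key=value #comment` lines; it is not MAKER's own parser, and the function names are made up. It assumes values never contain a `#` character. Run it against the maker_opts.ctl in the directory where you actually launch MAKER, since that is the file MAKER will read by default.

```python
def read_ctl_options(lines):
    """Parse MAKER-style `key=value #comment` lines into a dict."""
    options = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()
    return options


def missing_required(options, required=("genome",)):
    """Return the required keys that are absent or left empty."""
    return [key for key in required if not options.get(key)]


# Usage (run from the directory where MAKER is launched):
# with open("maker_opts.ctl") as handle:
#     print(missing_required(read_ctl_options(handle)))
```

An empty list means `genome=` has a value in that particular file.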

--Carson


On Aug 29, 2017, at 5:11 AM, Emmanuel Nnadi <eennadi at gmail.com<mailto:eennadi at gmail.com>> wrote:

Hi Carson,
Thanks a lot for yesterday. I was able to resolve the issue of running maker and i followed the commands in the tutorial.
I however encountered another problem

when I ran the command nano -c maker_opts.ctl

It gave the following 1_S7_assembly.fa I specified the name of the genome but when I ran maker in another tab it gave

#-----Genome (these are always required)
genome=1_S7_assembly.fa #genome sequence (fasta file or fasta embeded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #MAKER derived GFF3 file
est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no

#-----EST Evidence (for best results provide a file for at least one)
est= #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closly relate species in GFF3 format

#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=  #protein sequence file in fasta format (i.e. from mutiple oransisms)
protein_gff=  #aligned protein homology evidence from an external GFF3 file

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=all #select a model organism for RepBase masking in RepeatMasker
rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein=/Users/emmannaemeka/Desktop/Gpm/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)


I ran maker command on another tab and it returned the following
STATUS: Parsing control files...
ERROR: You have failed to provide a value for 'genome' in the control files.

--> rank=NA, hostname=emmannamekasMBP


Questions
1. Specifying the genome location, do I need to run maker on the same tab or open another bash tab?
2. My genome is novel and do not have proteins, how do I generate protein fast for the de novo sequence and EST?


Thanks

Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Mon, Aug 28, 2017 at 6:47 PM, Carson Holt <carsonhh at gmail.com<mailto:carsonhh at gmail.com>> wrote:
Here is a class on how to use MAKER taught a couple of years back --> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014

There is also a linked video as well as an amazon image of the class material where you can run the image in the cloud and follow along.

Thanks,
Carson


On Aug 28, 2017, at 11:43 AM, Emmanuel Nnadi <eennadi at gmail.com<mailto:eennadi at gmail.com>> wrote:

Hi Carson,
Thanks a lot

I ran this command maker -h it returned the following

The last thing I wish to ask you: how can I load my genome file and begin annotation?

Thanks

emmannamekasMBP:maker emmannaemeka$ maker -h

MAKER version 2.31.9

Usage:

     maker [options] <maker_opts> <maker_bopts> <maker_exe>


Description:

     MAKER is a program that produces gene annotations in GFF3 format using
     evidence such as EST alignments and protein homology. MAKER can be used to
     produce gene annotations for new genomes as well as update annotations
     from existing genome databases.

     The three input arguments are control files that specify how MAKER should
     behave. All options for MAKER should be set in the control files, but a
     few can also be set on the command line. Command line options provide a
     convenient machanism to override commonly altered control file values.
     MAKER will automatically search for the control files in the current
     working directory if they are not specified on the command line.

     Input files listed in the control options files must be in fasta format
     unless otherwise specified. Please see MAKER documentation to learn more
     about control file  configuration.  MAKER will automatically try and
     locate the user control files in the current working directory if these
     arguments are not supplied when initializing MAKER.

     It is important to note that MAKER does not try and recalculated data that
     it has already calculated.  For example, if you run an analysis twice on
     the same dataset you will notice that MAKER does not rerun any of the
     BLAST analyses, but instead uses the blast analyses stored from the
     previous run. To force MAKER to rerun all analyses, use the -f flag.

     MAKER also supports parallelization via MPI on computer clusters. Just
     launch MAKER via mpiexec (i.e. mpiexec -n 40 maker). MPI support must be
     configured during the MAKER installation process for this to work though


Options:

     -genome|g <file>    Overrides the genome file path in the control files

     -RM_off|R           Turns all repeat masking options off.

     -datastore/         Forcably turn on/off MAKER's two deep directory
      nodatastore        structure for output.  Always on by default.

     -old_struct         Use the old directory styles (MAKER 2.26 and lower)

     -base    <string>   Set the base name MAKER uses to save output files.
                         MAKER uses the input genome file name by default.

     -tries|t <integer>  Run contigs up to the specified number of tries.

     -cpus|c  <integer>  Tells how many cpus to use for BLAST analysis.
                         Note: this is for BLAST and not for MPI!

     -force|f            Forces MAKER to delete old files before running again.
                         This will require all blast analyses to be rerun.

     -again|a            recaculate all annotations and output files even if no
                         settings have changed. Does not delete old analyses.

     -quiet|q            Regular quiet. Only a handlful of status messages.

     -qq                 Even more quiet. There are no status messages.

     -dsindex            Quickly generate datastore index file. Note that this
                         will not check if run settings have changed on contigs

     -nolock             Turn off file locks. May be usful on some file systems,
                         but can cause race conditions if running in parallel.

     -TMP                Specify temporary directory to use.

     -CTL                Generate empty control files in the current directory.

     -OPTS               Generates just the maker_opts.ctl file.

     -BOPTS              Generates just the maker_bopts.ctl file.

     -EXE                Generates just the maker_exe.ctl file.

     -MWAS    <option>   Easy way to control mwas_server for web-based GUI

                              options:  STOP
                                        START
                                        RESTART

     -version            Prints the MAKER version.

     -help|?             Prints this usage statement.


Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Mon, Aug 28, 2017 at 6:36 PM, Carson Holt <carsonhh at gmail.com<mailto:carsonhh at gmail.com>> wrote:
Path needs to be a list of directories to search (you specified an executable location).

So not this --> /Users/emmannaemeka/Desktop/Gpm/maker/bin/maker

Instead it needs to be this --> /Users/emmannaemeka/Desktop/Gpm/maker/bin
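A small script can spot this kind of mistake. The following Python sketch (the function name `path_problems` is made up for illustration) walks a PATH-style string and flags entries that point at a file or that do not exist.

```python
import os


def path_problems(path_value, sep=os.pathsep):
    """Return (entry, message) pairs for PATH entries that are not directories."""
    problems = []
    for entry in path_value.split(sep):
        if not entry:
            continue
        if os.path.isfile(entry):
            # An executable was listed; PATH needs the directory that holds it.
            problems.append((entry, "is a file; use " + os.path.dirname(entry)))
        elif not os.path.isdir(entry):
            problems.append((entry, "does not exist"))
    return problems


# Check the PATH of the current shell session.
for entry, message in path_problems(os.environ.get("PATH", "")):
    print(entry, "-->", message)
```

With the PATH shown in this thread, the last entry would be reported as a file, and the suggested replacement would be the .../maker/bin directory.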

--Carson


On Aug 28, 2017, at 11:32 AM, Emmanuel Nnadi <eennadi at gmail.com<mailto:eennadi at gmail.com>>
wrote:

Thanks

I tried to export PATH

running
echo $PATH in the maker directory this returned

/usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:/Users/emmannaemeka/Desktop/Gpm/maker/bin/maker


1. Does it mean that PATH has been exported?


secondly,

I tried to run
the command maker -h, which maker, maker -CTL

nothing returned.

2. how do i start up maker?
3. Do I need to be in maker directory to start maker?

Thanks


Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Mon, Aug 28, 2017 at 4:49 PM, Carson Holt <carsonhh at gmail.com<mailto:carsonhh at gmail.com>> wrote:
After install the executables will be in the .../maker/bin directory. Example (if you did the install in your home directory) --> ~/maker/bin/maker

You need to add the .../maker/bin directory to your PATH for it to be found just by typing 'maker'

Explanation of the Linux PATH --> http://www.linfo.org/path_env_var.html

--Carson


On Aug 28, 2017, at 8:07 AM, Ence,daniel <d.ence at ufl.edu<mailto:d.ence at ufl.edu>> wrote:

Sorry I should have typed "maker -CTL". If that doesn't work, what is the result of "which maker"?


On Aug 28, 2017, at 10:00 AM, Emmanuel Nnadi <eennadi at gmail.com<mailto:eennadi at gmail.com>> wrote:

Hi Daniel
The reply is
emmannamekasMBP:maker emmannaemeka$ MAKER -ctl
-bash: MAKER: command not found

Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Mon, Aug 28, 2017 at 2:57 PM, Ence,daniel <d.ence at ufl.edu<mailto:d.ence at ufl.edu>> wrote:
Hi, It looks like MAKER installed ok. What is the command that you used to try to run MAKER? Can you show the result of running "MAKER -ctl"?

Thanks,
Daniel Ence


On Aug 28, 2017, at 9:24 AM, Emmanuel Nnadi <eennadi at gmail.com<mailto:eennadi at gmail.com>> wrote:

Hi Ence,
Thanks for your reply,

This is the step and error received

emmannamekasMBP:src emmannaemeka$ ./build install

Installing MAKER...

Building MAKER

Skip /Users/emmannaemeka/desktop/Gpm/maker/src/../perl/config-darwin-thread-multi-2level-5.018002 (unchanged)


The build status is


=============================================================================

STATUS MAKER v2.31.9

==============================================================================

PERL Dependencies:  VERIFIED

External Programs:  VERIFIED

External C Libraries:   VERIFIED

MPI SUPPORT:        DISABLED

MWAS Web Interface: DISABLED

MAKER PACKAGE:      CONFIGURATION OK

Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Mon, Aug 28, 2017 at 2:00 PM, Ence,daniel <d.ence at ufl.edu<mailto:d.ence at ufl.edu>> wrote:
Hi Emmanuel, In order for anyone to help you, you need to post to the mailing list the command and output (including errors) of the step that didn't work.

Thanks,
Daniel Ence


On Aug 27, 2017, at 10:16 AM, Emmanuel Nnadi <eennadi at gmail.com<mailto:eennadi at gmail.com>> wrote:

Hello all,

I downloaded Maker and tried to install it. I succeeded in installing all prerequisites however running maker ./build install, it showed that maker installed.

However trying to run maker it wouldn't run.

Please how do I install maker to run on local computer?

Thanks
Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications


_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com<mailto:maker-devel at box290.bluehost.com>
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org




From tfallon at mit.edu  Tue Sep 26 11:40:21 2017
From: tfallon at mit.edu (Tim Fallon)
Date: Tue, 26 Sep 2017 13:40:21 -0400
Subject: [maker-devel] MAKER changelog?
Message-ID: <DF018D85-759C-4C02-8D97-C45AADACF9A0@mit.edu>

Hi there,

I recently noticed the MAKER 3.0 beta version incremented from 3.0.0 to 3.01.1.  Is there a changelog which describes the updates between the two releases?

All the best,
-Tim

Timothy R. Fallon
PhD candidate
Laboratory of Jing-Ke Weng
Department of Biology
MIT

tfallon at mit.edu


From carsonhh at gmail.com  Tue Sep 26 12:34:16 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 26 Sep 2017 12:34:16 -0600
Subject: [maker-devel] MAKER changelog?
In-Reply-To: <DF018D85-759C-4C02-8D97-C45AADACF9A0@mit.edu>
References: <DF018D85-759C-4C02-8D97-C45AADACF9A0@mit.edu>
Message-ID: <C32D3C31-125B-4D3D-8E0B-CD4ED629E541@gmail.com>

Here you go.

*updated the locations for repbase and augustus
*make library install more portable for newer perl versions
*fix for cdna2genome single exon strand
*updates for better hints in augustus (exact rather than partial intron match)
*added allow_overlap for UTR in fungi and prokaryotes
*uri escape snap name in zff conversion
*fix for BioPerl-live related error (also submitted fix to BioPerl)
*jaccard cluster and bug fixes for cigar string
*Added zff2genebank script for training augustus (adapted from Jason Stajich's zff2augustus_gbk.pl)

--Carson


> On Sep 26, 2017, at 11:40 AM, Tim Fallon <tfallon at mit.edu> wrote:
> 
> Hi there,
> 
> I recently noticed the MAKER 3.0 beta version incremented from 3.0.0 to 3.01.1.  Is there a changelog which describes the updates between the two releases?
> 
> All the best,
> -Tim
> 
> Timothy R. Fallon
> PhD candidate
> Laboratory of Jing-Ke Weng
> Department of Biology
> MIT
> 
> tfallon at mit.edu <mailto:tfallon at mit.edu>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org


From qwzhang0601 at gmail.com  Wed Sep 27 08:30:28 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 27 Sep 2017 10:30:28 -0400
Subject: [maker-devel] Suggestions if too many predicted genes
Message-ID: <CAOW6FSLpSEfgyutvK=3oTn=71GsX=TUGXvM4mMbeZj2sV+TuuQ@mail.gmail.com>

Hello:

Thank you for all your previous comments and suggestions. We annotated a
new rodent species using the maker2 pipeline. The assembly is about 3.2Gb
with N50 24.3Mb. I included all scaffolds longer than 300bp for gene
annotation (about 250k scaffolds).

For repeat masking, we also built a species-specific library. We used both
transcriptome and protein sequences as evidence (including 10k reviewed
mammalian and 340k predicted rodent protein sequences from UniProt). We
predicted 28800 genes with AED<1 (the "default" gene set).

For the 28800 predicted proteins, about 90% have AED value less than 0.5,
and 74% have domains by "InterProScan". It seems the genome was well
annotated, but I still feel  28800 protein coding genes are too many for a
rodent species. Do you think this gene set is good for downstream analysis
(e.g., gene family expansion analysis, positive selection analysis)? Or can
I do further filtering to make the number of genes closer to estimated
number (e.g., 22,000)?

Thanks

Best
Quanwei

From dandence at gmail.com  Wed Sep 27 08:54:30 2017
From: dandence at gmail.com (Daniel Ence)
Date: Wed, 27 Sep 2017 10:54:30 -0400
Subject: [maker-devel] Suggestions if too many predicted genes
In-Reply-To: <CAOW6FSLpSEfgyutvK=3oTn=71GsX=TUGXvM4mMbeZj2sV+TuuQ@mail.gmail.com>
References: <CAOW6FSLpSEfgyutvK=3oTn=71GsX=TUGXvM4mMbeZj2sV+TuuQ@mail.gmail.com>
Message-ID: <42F0CCB8-20DB-4E6F-85CB-CBE939F8208E@gmail.com>

Hi Quanwei, I think that your genome assembly probably contains many contigs that are too small to contain full gene sequences. Rather than 300bp, a minimum scaffold length of 5kbp or 10kbp is a more useful threshold. This is mentioned in the maker_opts.ctl file with the min_contig parameter: "skip genome contigs below this length (under 10kbp are often useless)".

I don't know how many genes are annotated on small (<10kbp) scaffolds and contigs, but excluding those contigs would probably reduce your gene count. These may be fragments or duplicates of genes present on these sequences that weren't assembled properly.
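To see how much of the assembly a given cutoff would set aside before rerunning anything, something like the Python sketch below could help. It is only an illustration: the function name is made up, the 10 kbp default mirrors the threshold discussed above, and it treats the cutoff the way the min_contig comment reads (sequences below the length are skipped).

```python
def count_by_min_length(lengths, min_len=10_000):
    """Summarize how many sequences and bases fall on each side of a cutoff."""
    kept = [n for n in lengths if n >= min_len]
    dropped = [n for n in lengths if n < min_len]
    return {
        "kept": len(kept),
        "kept_bp": sum(kept),
        "dropped": len(dropped),
        "dropped_bp": sum(dropped),
    }


# Example with made-up scaffold lengths; real lengths would come from the assembly.
print(count_by_min_length([300, 5_000, 10_000, 24_300_000]))
```

With an N50 in the tens of megabases, it is plausible that most of the 250k scaffolds hold only a small share of the total bases, but that is something to confirm on the real lengths.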

Also, using predicted protein sequences from UniProt as evidence in your annotation is probably not advisable, since those sequences are not from genes with experimental evidence. This is the TrEMBL vs. Swiss-Prot issue that you asked about earlier.

Additionally requiring a minimum protein length as you asked about earlier could also reduce the gene count. 

Ultimately, you may do whatever filtering you find necessary and justifiable for your annotation depending on the biology of your organism and the methods that generated your assembly, and your annotation. 

Hope this helps, 
Daniel


From michael.s.campbell1 at gmail.com  Wed Sep 27 09:34:11 2017
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Wed, 27 Sep 2017 11:34:11 -0400
Subject: [maker-devel] Suggestions if too many predicted genes
In-Reply-To: <42F0CCB8-20DB-4E6F-85CB-CBE939F8208E@gmail.com>
References: <CAOW6FSLpSEfgyutvK=3oTn=71GsX=TUGXvM4mMbeZj2sV+TuuQ@mail.gmail.com>
	<42F0CCB8-20DB-4E6F-85CB-CBE939F8208E@gmail.com>
Message-ID: <546F5254-0C46-4BB6-93E8-80B5EC7E40D1@gmail.com>

Hi Quanwei,

The first thing that comes to mind with too many genes is undermasked repeats. You could check the Pfam domains for things like integrase, GAG proteins, and other transposon-related domains. I would also look a bit closer at the genes with AEDs greater than 0.5. Looking at things like average number of exons per transcript and average gene and transcript lengths can help pick out dodgy genes. You could also do some filtering on the QI values output by MAKER. It is defensible to create a "higher quality" set by limiting it to genes with AEDs less than 0.5 and putting some requirement on the fraction of splice sites confirmed by EST/mRNA-seq alignments.
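
A minimal Python sketch of the AED plus splice-site filter suggested above might look like the following. It assumes the mRNA lines of the MAKER GFF3 carry `_AED` and `_QI` attributes and that the second `|`-separated `_QI` field is the fraction of splice sites confirmed by EST alignments; check this against your own output first. The function names, thresholds, and demo lines are made up for illustration.

```python
def parse_attributes(column9):
    """Turn a GFF3 attribute string (column 9) into a dict."""
    fields = (f.split("=", 1) for f in column9.strip().split(";") if "=" in f)
    return {key: value for key, value in fields}

def passes(gff_line, max_aed=0.5, min_splice_frac=0.5):
    """True for an mRNA line with AED below max_aed and an EST-confirmed
    splice-site fraction (assumed: 2nd _QI field) of at least min_splice_frac."""
    cols = gff_line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] != "mRNA":
        return False
    attrs = parse_attributes(cols[8])
    if "_AED" not in attrs or "_QI" not in attrs:
        return False
    qi = attrs["_QI"].split("|")
    return float(attrs["_AED"]) < max_aed and float(qi[1]) >= min_splice_frac

# Made-up mRNA lines, only to exercise the filter.
demo = [
    "scaf1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=m1;Parent=g1;_AED=0.10;_QI=12|1|1|1|0.5|0.6|4|30|210",
    "scaf1\tmaker\tmRNA\t2000\t2600\t.\t-\t.\tID=m2;Parent=g2;_AED=0.80;_QI=0|0|0|0.5|1|1|2|0|95",
    "scaf2\tmaker\tmRNA\t50\t1500\t.\t+\t.\tID=m3;Parent=g3;_AED=0.30;_QI=0|0.25|0.5|1|1|1|4|0|300",
]
kept_ids = [
    parse_attributes(line.split("\t")[8])["ID"]
    for line in demo
    if passes(line, max_aed=0.5, min_splice_frac=0.5)
]
print(kept_ids)
```

Single-exon transcripts have no splice sites to confirm, so a requirement like this needs a separate decision for them.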

Take care,
Mike

From xvazquezc at gmail.com  Wed Sep 27 18:32:30 2017
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Thu, 28 Sep 2017 10:32:30 +1000
Subject: [maker-devel] Suggestions if too many predicted genes
In-Reply-To: <546F5254-0C46-4BB6-93E8-80B5EC7E40D1@gmail.com>
References: <CAOW6FSLpSEfgyutvK=3oTn=71GsX=TUGXvM4mMbeZj2sV+TuuQ@mail.gmail.com>
	<42F0CCB8-20DB-4E6F-85CB-CBE939F8208E@gmail.com>
	<546F5254-0C46-4BB6-93E8-80B5EC7E40D1@gmail.com>
Message-ID: <CAL0hg4EXwhZBSBE23X4auu8UW0hHW+=Vghwt8dMttfSbgriF=w@mail.gmail.com>

Hi Quanwei,
Following Michael's comment, even if you use Swissprot, there are over 2700
transposases in it. If there is some undermasking, they will show up as
evidence.
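
One way to look for such cases is a keyword scan over domain descriptions, sketched below in Python. It assumes InterProScan-style tab-separated output with the protein ID in column 1 and the signature description in column 6; the keyword list, function name, and demo rows are made up for illustration and are certainly not exhaustive.

```python
import re

# Illustrative keyword list only; extend it for your own organism.
TE_PATTERN = re.compile(
    r"integrase|transposase|reverse transcriptase|retrotransposon|\bgag\b",
    re.IGNORECASE,
)

def flag_te_like(tsv_lines):
    """Return the IDs of proteins with a transposon-like domain description."""
    flagged = set()
    for line in tsv_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 6:
            continue  # skip malformed or truncated rows
        protein_id, description = cols[0], cols[5]
        if TE_PATTERN.search(description):
            flagged.add(protein_id)
    return flagged

# Made-up rows in the assumed column layout.
demo_rows = [
    "prot1\tmd5a\t350\tPfam\tPF00665\tIntegrase core domain\t10\t120\t1e-20\tT\t01-01-2017",
    "prot2\tmd5b\t410\tPfam\tPF00069\tProtein kinase domain\t30\t290\t3e-50\tT\t01-01-2017",
    "prot3\tmd5c\t220\tPfam\tPF03732\tRetrotransposon gag protein\t5\t100\t2e-15\tT\t01-01-2017",
]
print(sorted(flag_te_like(demo_rows)))
```

Flagged proteins are candidates for manual review, not automatic removal, since a keyword match alone does not prove a gene is a transposon.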
Cheers,
Xabi

-- 
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From qwzhang0601 at gmail.com  Wed Sep 27 20:04:43 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 27 Sep 2017 22:04:43 -0400
Subject: [maker-devel] Suggestions if too many predicted genes
In-Reply-To: <CAL0hg4EXwhZBSBE23X4auu8UW0hHW+=Vghwt8dMttfSbgriF=w@mail.gmail.com>
References: <CAOW6FSLpSEfgyutvK=3oTn=71GsX=TUGXvM4mMbeZj2sV+TuuQ@mail.gmail.com>
	<42F0CCB8-20DB-4E6F-85CB-CBE939F8208E@gmail.com>
	<546F5254-0C46-4BB6-93E8-80B5EC7E40D1@gmail.com>
	<CAL0hg4EXwhZBSBE23X4auu8UW0hHW+=Vghwt8dMttfSbgriF=w@mail.gmail.com>
Message-ID: <CAOW6FSJPZBiriKh9L5knuGp_ZCSEVxw4+eftyddk+o3kFwTTCw@mail.gmail.com>

Thank you all for your comments and suggestions. Yes, even when I only use
Swissprot I still have 26.5k protein coding genes. As you mentioned, one
reason may be related to repeat masking, and another may be the inclusion
of short scaffolds, which leads to protein fragments.

About the repeat masking, I used the latest RepeatMasker and Repbase
(selected Mammalian), and I also built species-specific repeat libraries
following
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic.
About transposases, I know the MAKER pipeline already provides
"transposable element proteins". I do not know what else I can do.

About the short scaffolds, in fact among the 26.5k genes only about 400
genes are predicted from scaffolds shorter than 10kb. Besides, I know there
are some very short proteins (e.g., the mouse protein RL41 (60S ribosomal
protein) has length 25). I think short scaffolds may also include some short
proteins.

Now I plan to start from the 26.5k protein coding genes. I think the less
reliable ones will be filtered out in downstream analysis. For example,
when we construct the gene families, those fragments or falsely predicted
proteins will more likely be excluded from gene families.

Thank you all for your suggestions.

Best
Quanwei



From qwzhang0601 at gmail.com  Thu Sep 28 06:05:19 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Thu, 28 Sep 2017 08:05:19 -0400
Subject: [maker-devel] gene annotation for a better genome
Message-ID: <CAOW6FSKn__0Gcco+7V+rtv9of4hbTp=LE4J+V51LPFZ1gnm7OQ@mail.gmail.com>

Hello:

Recently, we got a new version of the NMR genome, which had been
assembled and annotated a few years ago. We can download the gene
annotation from NCBI.

Now we want to annotate the new genome using the Maker2 pipeline. I wonder
how I can make full use of the existing annotations. On the other hand, since
the previous genome was not very well assembled, some gene annotations may be
false positives. I hope those false positive genes in the previous annotation
won't mislead Maker2 in the current gene annotation.

Do you have any suggestions? Thanks

Best
Quanwei

From carsonhh at gmail.com  Fri Sep 29 10:36:09 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 29 Sep 2017 10:36:09 -0600
Subject: [maker-devel] gene annotation for a better genome
In-Reply-To: <CAOW6FSKn__0Gcco+7V+rtv9of4hbTp=LE4J+V51LPFZ1gnm7OQ@mail.gmail.com>
References: <CAOW6FSKn__0Gcco+7V+rtv9of4hbTp=LE4J+V51LPFZ1gnm7OQ@mail.gmail.com>
Message-ID: <5AFEDD05-DF02-463F-A6EE-1619A9BB968D@gmail.com>

You can try using the est2genome=1 option to map the old models forward onto the new assembly as if they were ESTs (add a line that says est_forward=1 to the control file to maintain old naming, and set est= to the old model transcript file). Then provide the final models as a pred_gff for a subsequent run (i.e. a traditional MAKER run where you are annotating the new assembly with transcript and protein evidence and ab initio predictors). Don't supply the old models to est= on that run.

The idea behind doing it this way is:
1. You need to get old models onto the new assembly so coordinates will change. So by doing it this way, you will at least be able to move many models forward based on homology.
2. By providing the models to pred_gff on a subsequent MAKER run, you are just letting old models compete against new annotations. They will be rejected if they have no evidence support, or can be kept if they score better than alternate models from SNAP/Augustus. That way you have the chance to integrate old models while at the same time rejecting some old models that have no evidence overlap.
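
The two runs described above could be written down as control-file settings along these lines. This is only a sketch in Python that prints maker_opts.ctl-style lines: the option names are the ones mentioned in this message plus standard maker_opts.ctl entries, while every file name, the species name, and the `render_ctl` helper are hypothetical placeholders.

```python
def render_ctl(settings):
    """Format option/value pairs as maker_opts.ctl-style lines."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())

# Run 1: map the old models onto the new assembly as if they were ESTs.
pass1 = {
    "genome": "new_assembly.fasta",          # hypothetical file name
    "est": "old_model_transcripts.fasta",    # hypothetical file name
    "est2genome": 1,
    "est_forward": 1,                        # keeps the old model names
}

# Run 2: a traditional run where the mapped models compete as pred_gff.
pass2 = {
    "genome": "new_assembly.fasta",
    "est": "rnaseq_transcripts.fasta",       # real transcript evidence, not the old models
    "protein": "proteins.fasta",
    "pred_gff": "pass1_models.gff",          # models produced by run 1
    "snaphmm": "species.hmm",
    "augustus_species": "my_species",
    "est2genome": 0,
}

print("# run 1\n" + render_ctl(pass1))
print("# run 2\n" + render_ctl(pass2))
```

The point of keeping the two dictionaries separate is that the old models appear under est= only in run 1 and under pred_gff= only in run 2.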

--Carson



From willett4 at email.unc.edu  Fri Sep 29 11:20:46 2017
From: willett4 at email.unc.edu (Willett, Christopher S)
Date: Fri, 29 Sep 2017 17:20:46 +0000
Subject: [maker-devel] question on gene numbers with quality_filter.pl
Message-ID: <16C1890A-2042-4BE1-93CE-8A8DC0C18151@ad.unc.edu>

Hello-

We are getting to the final stages (hopefully) of a reannotation of a new assembly of a copepod genome using MAKER, and we had some questions about which set of genes to use. Our latest runs used Pfam domains to define the default vs. standard sets with the quality_filter.pl script, and I had a question about the stringency of the filters in this script. It appears that the default set is more stringent than the output that we get from MAKER without using this script (all with AED max set to 1). Are there additional filters in this script beyond AED that would cause this?

Here is what we are seeing, if more details would be helpful. Our final MAKER run gives ~21500 predicted genes with keep_preds turned on and ~15200 with it turned off. What I was wondering about was why this 15200 is higher than the default set (which gives ~14500 genes) after we filter the gff using the -d setting in quality_filter.pl. For completeness, the standard set (-s setting) retains ~14800 genes, and filtering the 15200-gene gff file with the default parameters yields ~14100 genes. So I was curious what else is going on in the filter script beyond AED that would trim out genes.
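
For reference, a count like the ones quoted above can be reproduced with a small Python sketch that tallies `gene` features in a GFF3 file and stops at the embedded `##FASTA` section. The function name and demo lines are made up; this only counts features and says nothing about what quality_filter.pl does internally.

```python
def count_genes(gff_lines):
    """Count features of type 'gene', ignoring comments and any FASTA section."""
    total = 0
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # sequence data follows; no more features
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) >= 3 and cols[2] == "gene":
            total += 1
    return total

# Made-up GFF3 content with two gene features.
demo = [
    "##gff-version 3",
    "scaf1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1",
    "scaf1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=m1;Parent=g1",
    "scaf1\tmaker\tgene\t2000\t2600\t.\t-\t.\tID=g2",
    "##FASTA",
    ">scaf1",
    "ACGTACGT",
]
print(count_genes(demo))
```

Running the same count on each filtered and unfiltered GFF3 keeps the comparison between sets on the same footing.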

The gene sets look pretty good overall and seem like reasonable numbers, so we were debating which set to use as our final set. I am also trying a few other analyses in InterProScan to see if that identifies additional genes beyond Pfam for retention, but that seems a bit independent from the question above.

Thanks for your help,

Best,

Chris Willett


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Research Associate Professor
Department of Biology
CB#3280 Coker Hall
University of North Carolina, Chapel Hill
Chapel Hill, NC, 27599-3280

Office: 2252 Genome Science Building
phone: 919-843-8663
fax: 919-962-1625


http://labs.bio.unc.edu/Willett/


From willett4 at email.unc.edu  Fri Sep  1 09:22:34 2017
From: willett4 at email.unc.edu (Willett, Christopher S)
Date: Fri, 1 Sep 2017 15:22:34 +0000
Subject: [maker-devel] ERROR: Can't call method "start" on an undefined
 value at ../lib/maker/join.pm line 535
Message-ID: <FC5A69D8-3FE8-45F7-B902-2847E8DC802A@ad.unc.edu>

Hi Everyone-

I was wondering if anyone had any ideas about an error that I am seeing, which causes some of our contigs to fail but not others. Here is the error message:

"Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535."

This comes up in either the augustus or the snap portion of the analysis. See below for compiled errors from 4 different contigs. They are from two different runs in which I tried excluding a couple of different parts to see if I could isolate the problem. The errors look slightly different, but the same contigs fail in both.

We recently improved our genome and are trying to do a reannotation on it. I am using a very similar MAKER setup that I used previously with an earlier version of the genome and did not have any contigs fail in that case but now 8 of the 12 large contigs are failing (sizes ~18Mb). The MAKER version is 2.31.8.

If anyone has any thoughts on what the issue might be and how I could resolve it I would appreciate it (or if I should provide more information).

Thanks,

Best,

Chris Willett


error 48600

#--------- command -------------#
Widget::augustus:
/nas02/apps/maker-2.31.8/src/augustus-3.2.2/bin/augustus --species=Tigriopus_californicus --strand=forward --UTR=off --hintsfile=/tmp/maker_H_uEHb/0/19_0.1142953-1147539.Tigriopus_californicus.auto_annotator.xdef.augustus --extrinsicCfgFile=/nas02/apps/maker-2.31.8/src/augustus-3.2.2/config/extrinsic/extrinsic.MPE.cfg /tmp/maker_H_uEHb/0/19_0.1142953-1147539.Tigriopus_californicus.auto_annotator.augustus.fasta > /tmp/maker_H_uEHb/0/19_0.1142953-1147539.Tigriopus_californicus.auto_annotator.augustus
#-------------------------------#
deleted:0 genes
 ...processing 0 of 5
 ...processing 1 of 5
 ...processing 2 of 5
 ...processing 3 of 5
 ...processing 4 of 5
Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
--> rank=NA, hostname=c-195-51.kd.unc.edu
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Chromosome_3

ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Chromosome_3

error 48599

Widget::augustus:
/nas02/apps/maker-2.31.8/src/augustus-3.2.2/bin/augustus --species=Tigriopus_californicus --strand=forward --UTR=off --hintsfile=/tmp/maker_TQWBu_/0/49_0.7061374-7062552.Tigriopus_californicus.auto_annotator.xdef.augustus --extrinsicCfgFile=/nas02/apps/maker-2.31.8/src/augustus-3.2.2/config/extrinsic/extrinsic.MPE.cfg /tmp/maker_TQWBu_/0/49_0.7061374-7062552.Tigriopus_californicus.auto_annotator.augustus.fasta > /tmp/maker_TQWBu_/0/49_0.7061374-7062552.Tigriopus_californicus.auto_annotator.augustus
#-------------------------------#
deleted:0 genes
 ...processing 0 of 10
 ...processing 1 of 10
 ...processing 2 of 10
 ...processing 3 of 10
 ...processing 4 of 10
 ...processing 5 of 10
 ...processing 6 of 10
 ...processing 7 of 10
 ...processing 8 of 10
 ...processing 9 of 10
Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
--> rank=NA, hostname=c-195-51.kd.unc.edu
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Chromosome_11

ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Chromosome_11

error 48592

#--------- command -------------#
Widget::snap:
/nas02/apps/maker-2.31.8/src/maker/exe/snap/snap -plus /proj/willetlb/users/cwillett/MAKER_analyses/SD_full/PB11_12_eukscafs_sp3_sum_hm1.hmm -xdef /tmp/maker_WADb1j/0/38_0.1656654-1673307.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.xdef.snap  /tmp/maker_WADb1j/0/38_0.1656654-1673307.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap.fasta > /tmp/maker_WADb1j/0/38_0.1656654-1673307.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap
#-------------------------------#
scoring....decoding.10.20.30.40.50.60.70.80.90.100 done
deleted:0 genes
 ...processing 0 of 5
 ...processing 1 of 5
 ...processing 2 of 5
 ...processing 3 of 5
 ...processing 4 of 5
Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
--> rank=NA, hostname=c-193-25.kd.unc.edu
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Chromosome_5

ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Chromosome_5

error 47069

#--------- command -------------#
Widget::snap:
/nas02/apps/maker-2.31.8/src/maker/exe/snap/snap -plus /proj/willetlb/users/cwillett/MAKER_analyses/SD_full/PB11_12_eukscafs_sp3_sum_hm1.hmm -xdef /tmp/maker__PSNX1/0/53_0.6154539-6169567.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.xdef.snap  /tmp/maker__PSNX1/0/53_0.6154539-6169567.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap.fasta > /tmp/maker__PSNX1/0/53_0.6154539-6169567.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap
#-------------------------------#
scoring....decoding.10.20.30.40.50.60.70.80.90.100 done
deleted:0 genes
 ...processing 0 of 10
 ...processing 1 of 10
 ...processing 2 of 10
 ...processing 3 of 10
 ...processing 4 of 10
 ...processing 5 of 10
 ...processing 6 of 10
 ...processing 7 of 10
 ...processing 8 of 10
 ...processing 9 of 10
Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
--> rank=NA, hostname=c-183-35.kd.unc.edu
ERROR: Failed while annotating transcripts
ERROR: Chunk failed at level:1, tier_type:4
FAILED CONTIG:Chromosome_12

ERROR: Chunk failed at level:6, tier_type:0
FAILED CONTIG:Chromosome_12


Can't call method "start" on an undefined value at ../lib/maker/join.pm line 535.
 

From chzelin at gmail.com  Tue Sep  5 07:59:09 2017
From: chzelin at gmail.com (zl c)
Date: Tue, 5 Sep 2017 09:59:09 -0400
Subject: [maker-devel] MSG: Can't get HSPs: data not collected.
Message-ID: <CAO_vRvZyMrfhL=nN79xnJeoXwFbEqTPwF8siioHJ_DFQTUpL2Q@mail.gmail.com>

Hello,

MAKER runs successfully for most sequences but fails on some long
sequences. The error is:

Widget::tblastx:

/usr/local/apps/blast/ncbi-blast-2.5.0+/bin/tblastx -db
db.778415-832259.for_tblastx.fasta -query ...778415.832259.0
-num_alignments 10000 -num_descriptions 10000 -evalue 1e-10 -dbsize 1000
-searchsp 500000000 -num_threads 16 -lcase_masking -seg yes -soft_masking
true -show_gis -out   OUT.tblastx

#-------------------------------#


------------- EXCEPTION: Bio::Root::Exception -------------

MSG: Can't get HSPs: data not collected.

STACK: Error::throw

STACK: Bio::Root::Root::throw
/usr/local/Perl/5.18.2/lib/perl5/site_perl/5.18.2/Bio/Root/Root.pm:486

STACK: Bio::Search::Hit::PhatHit::Base::hsps
/spin1/home/linux/chenz11/program/maker/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm:552

STACK: Widget::tblastx::keepers
/spin1/home/linux/chenz11/program/maker/bin/../lib/Widget/tblastx.pm:192

STACK: Widget::tblastx::parse
/spin1/home/linux/chenz11/program/maker/bin/../lib/Widget/tblastx.pm:133

STACK: GI::tblastx
/spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:3251

STACK: GI::tblastx
/spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:3260

STACK: GI::reblast_merged_hits
/spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:471

STACK: GI::merge_resolve_hits
/spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:291

STACK: Process::MpiChunk::_go
/spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:2320

STACK: Process::MpiChunk::run
/spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:340

STACK: Process::MpiChunk::run_all
/spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:356

STACK: Process::MpiTiers::run_all
/spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiTiers.pm:287

STACK: Process::MpiTiers::run_all
/spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiTiers.pm:287

STACK: /home/chenz11/program/maker/bin/maker:695

-----------------------------------------------------------

--> rank=NA, hostname=cn3544

--> rank=NA, hostname=cn3544

--> rank=NA, hostname=cn3544

--> rank=NA, hostname=cn3544

ERROR: Failed while collecting tblastx reports

ERROR: Chunk failed at level:5, tier_type:3

FAILED CONTIG:tig00011625_arrow


ERROR: Chunk failed at level:4, tier_type:0

FAILED CONTIG:tig00011625_arrow


examining contents of the fasta file and run log

I've read a related thread on the Google group and checked my tblastx
output. The number of HSPs should be larger than 1,000,000, but only
1,000,000 were written, which leaves some alignments with no HSPs. Is there
any setting that could solve the problem?

Thanks,
Zelin

--------------------------------------------
Zelin Chen [chzelin at gmail.com]


NIH/NHGRI
Building 50, Room 5531
50 SOUTH DR, MSC 8004
BETHESDA, MD 20892-8004

From qwzhang0601 at gmail.com  Tue Sep  5 14:24:34 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Tue, 5 Sep 2017 16:24:34 -0400
Subject: [maker-devel] Some errors reported by Maker2
Message-ID: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>

Hello:

We are annotating the genome of a new rodent species. We successfully
finished training the ab initio gene predictors with the following settings:
split_hit=40000, max_dna_len=1000000, and 99k mammalian Swiss-Prot protein
sequences as evidence.

But when I used the trained model to do the genome annotation, I got the
following kinds of errors (shown in red). I used the same parameters as
those for training, except for addition of 340k rodent TrEMBL protein
sequences for protein evidences (i.e., I use both 99k mammalian Swiss
protein sequences and 340k rodent TrEMBL protein sequences).

I am doing the annotation on a cluster and started multiple MAKER instances
in the same directory (I tried to use MPI but ran into problems).

Do you have any suggestions? Many thanks
#some kinds of errors
open3: fork failed: Cannot allocate memory at
/gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
--> rank=NA, hostname=n520
ERROR: Failed while doing blastx of proteins
ERROR: Chunk failed at level:8, tier_type:3
FAILED CONTIG:Contig2


setting up GFF3 output and fasta chunks
doing repeat masking
Can't kill a non-numeric process ID at
/gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
--> rank=NA, hostname=n513
ERROR: Failed while doing repeat masking
ERROR: Chunk failed at level:0, tier_type:1
FAILED CONTIG:Contig12378


Best
Quanwei

From carsonhh at gmail.com  Tue Sep  5 14:56:01 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 5 Sep 2017 14:56:01 -0600
Subject: [maker-devel] ERROR: Can't call method "start" on an undefined
 value at ../lib/maker/join.pm line 535
In-Reply-To: <FC5A69D8-3FE8-45F7-B902-2847E8DC802A@ad.unc.edu>
References: <FC5A69D8-3FE8-45F7-B902-2847E8DC802A@ad.unc.edu>
Message-ID: <7DCB519E-9AFA-4D10-8046-72DE99C5E4FF@gmail.com>

Did you use gff3 input to MAKER for any steps (example pred_gff or est_gff)?

--Carson

> On Sep 1, 2017, at 9:22 AM, Willett, Christopher S <willett4 at email.unc.edu> wrote:
> 
> Hi Everyone-
> 
> I was wondering if anyone had any ideas about this error that I am seeing for some of our contigs that is causing them to fail but not others. Here is the error message:
> 
> "Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535."
> 
> This is either coming up in the augustus or snap portions of the analysis. See below for errors from 4 different contigs compiled. They are from two different runs where I was trying to exclude a couple different parts in each to see if I could isolate the problem. It seems like slightly different errors but the same contigs are failing in both. 
> 
> We recently improved our genome and are trying to do a reannotation on it. I am using a very similar MAKER setup that I used previously with an earlier version of the genome and did not have any contigs fail in that case but now 8 of the 12 large contigs are failing (sizes ~18Mb). The MAKER version is 2.31.8.
> 
> If anyone has any thoughts on what the issue might be and how I could resolve it I would appreciate it (or if I should provide more information).
> 
> Thanks,
> 
> Best,
> 
> Chris Willett
> 
> 
> 
> error 48600
> 
> #--------- command -------------#
> Widget::augustus:
> /nas02/apps/maker-2.31.8/src/augustus-3.2.2/bin/augustus --species=Tigriopus_californicus --strand=forward --UTR=off --hintsfile=/tmp/maker_H_uEHb/0/19_0.1142953-1147539.Tigriopus_californicus.auto_annotator.xdef.augustus --extrinsicCfgFile=/nas02/apps/maker-2.31.8/src/augustus-3.2.2/config/extrinsic/extrinsic.MPE.cfg /tmp/maker_H_uEHb/0/19_0.1142953-1147539.Tigriopus_californicus.auto_annotator.augustus.fasta > /tmp/maker_H_uEHb/0/19_0.1142953-1147539.Tigriopus_californicus.auto_annotator.augustus
> #-------------------------------#
> deleted:0 genes
> ...processing 0 of 5
> ...processing 1 of 5
> ...processing 2 of 5
> ...processing 3 of 5
> ...processing 4 of 5
> Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
> --> rank=NA, hostname=c-195-51.kd.unc.edu
> ERROR: Failed while annotating transcripts
> ERROR: Chunk failed at level:1, tier_type:4
> FAILED CONTIG:Chromosome_3
> 
> ERROR: Chunk failed at level:6, tier_type:0
> FAILED CONTIG:Chromosome_3
> 
> error 48599
> 
> Widget::augustus:
> /nas02/apps/maker-2.31.8/src/augustus-3.2.2/bin/augustus --species=Tigriopus_californicus --strand=forward --UTR=off --hintsfile=/tmp/maker_TQWBu_/0/49_0.7061374-7062552.Tigriopus_californicus.auto_annotator.xdef.augustus --extrinsicCfgFile=/nas02/apps/maker-2.31.8/src/augustus-3.2.2/config/extrinsic/extrinsic.MPE.cfg /tmp/maker_TQWBu_/0/49_0.7061374-7062552.Tigriopus_californicus.auto_annotator.augustus.fasta > /tmp/maker_TQWBu_/0/49_0.7061374-7062552.Tigriopus_californicus.auto_annotator.augustus
> #-------------------------------#
> deleted:0 genes
> ...processing 0 of 10
> ...processing 1 of 10
> ...processing 2 of 10
> ...processing 3 of 10
> ...processing 4 of 10
> ...processing 5 of 10
> ...processing 6 of 10
> ...processing 7 of 10
> ...processing 8 of 10
> ...processing 9 of 10
> Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
> --> rank=NA, hostname=c-195-51.kd.unc.edu
> ERROR: Failed while annotating transcripts
> ERROR: Chunk failed at level:1, tier_type:4
> FAILED CONTIG:Chromosome_11
> 
> ERROR: Chunk failed at level:6, tier_type:0
> FAILED CONTIG:Chromosome_11
> 
> error 48592
> 
> #--------- command -------------#
> Widget::snap:
> /nas02/apps/maker-2.31.8/src/maker/exe/snap/snap -plus /proj/willetlb/users/cwillett/MAKER_analyses/SD_full/PB11_12_eukscafs_sp3_sum_hm1.hmm -xdef /tmp/maker_WADb1j/0/38_0.1656654-1673307.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.x
> def.snap  /tmp/maker_WADb1j/0/38_0.1656654-1673307.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap.fasta > /tmp/maker_WADb1j/0/38_0.1656654-1673307.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap
> #-------------------------------#
> scoring....decoding.10.20.30.40.50.60.70.80.90.100 done
> deleted:0 genes
> ...processing 0 of 5
> ...processing 1 of 5
> ...processing 2 of 5
> ...processing 3 of 5
> ...processing 4 of 5
> Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
> --> rank=NA, hostname=c-193-25.kd.unc.edu
> ERROR: Failed while annotating transcripts
> ERROR: Chunk failed at level:1, tier_type:4
> FAILED CONTIG:Chromosome_5
> 
> ERROR: Chunk failed at level:6, tier_type:0
> FAILED CONTIG:Chromosome_5
> 
> error 47069
> 
> #--------- command -------------#
> Widget::snap:
> /nas02/apps/maker-2.31.8/src/maker/exe/snap/snap -plus /proj/willetlb/users/cwillett/MAKER_analyses/SD_full/PB11_12_eukscafs_sp3_sum_hm1.hmm -xdef /tmp/maker__PSNX1/0/53_0.6154539-6169567.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.x
> def.snap  /tmp/maker__PSNX1/0/53_0.6154539-6169567.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap.fasta > /tmp/maker__PSNX1/0/53_0.6154539-6169567.PB11_12_eukscafs_sp3_sum_hm1.hmm.auto_annotator.snap
> #-------------------------------#
> scoring....decoding.10.20.30.40.50.60.70.80.90.100 done
> deleted:0 genes
> ...processing 0 of 10
> ...processing 1 of 10
> ...processing 2 of 10
> ...processing 3 of 10
> ...processing 4 of 10
> ...processing 5 of 10
> ...processing 6 of 10
> ...processing 7 of 10
> ...processing 8 of 10
> ...processing 9 of 10
> Can't call method "start" on an undefined value at /nas02/apps/maker-2.31.8/src/maker/bin/../lib/maker/join.pm line 535.
> --> rank=NA, hostname=c-183-35.kd.unc.edu
> ERROR: Failed while annotating transcripts
> ERROR: Chunk failed at level:1, tier_type:4
> FAILED CONTIG:Chromosome_12
> 
> ERROR: Chunk failed at level:6, tier_type:0
> FAILED CONTIG:Chromosome_12
> 
> 
> Can't call method "start" on an undefined value at ../lib/maker/join.pm line 535.
> 


From carsonhh at gmail.com  Tue Sep  5 15:48:56 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 5 Sep 2017 15:48:56 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
Message-ID: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>

You ran out of memory. You probably set max_dna_len too high for the machines you are using. There is a note in the maker_opts.ctl file that tells you that this value affects memory usage.

So you can either set it lower, or if running under MPI, use fewer CPUs per node (how you do this is MPI flavor dependent, but some flavors let you do this by setting process count lower combined with the round robin option).
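
As a rough illustration of the trade-off described above, the sketch below estimates how many MAKER processes fit on one node when memory, rather than core count, is the limit. The node size and the per-process memory figures are hypothetical, not measurements from this thread. Per-process memory grows with max_dna_len, but the actual relationship has to be measured on your own data.

```python
def max_processes_per_node(node_mem_gb, cores, mem_per_process_gb):
    """Return how many processes fit on a node without exceeding RAM or cores."""
    if mem_per_process_gb <= 0:
        raise ValueError("mem_per_process_gb must be positive")
    by_memory = int(node_mem_gb // mem_per_process_gb)
    return max(0, min(cores, by_memory))


# Hypothetical node: 64 GB RAM, 16 cores.
for mem_gb in (2, 4, 8, 16):  # assumed memory use per process
    n = max_processes_per_node(64, 16, mem_gb)
    print(f"{mem_gb} GB per process -> run at most {n} processes per node")
```

If the result is smaller than the number of cores, either request fewer processes per node from your MPI launcher or lower max_dna_len so that each process needs less memory.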

--Carson


> On Sep 5, 2017, at 2:24 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Hello:
> 
> We are doing genome annotation for a new rodent species. We have finished the training of the ab initio gene predictors successful by setting the following parameters (split_hit=40000, max_dna_len=1000000, and 99k mammalian Swiss protein sequences as evidences. 
> 
> But when I used the trained model to do the genome annotation, I got the following kinds of errors (shown in red). I used the same parameters as those for training, except for addition of 340k rodent TrEMBL protein sequences for protein evidences (i.e., I use both 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences). 
> 
> I am doing the annotation on a cluster and started multiple Maker in the same directory (I had tried to use MPI but met some problems).  
> 
> Do you have any suggestions? Many thanks
> #some kinds of errors
> open3: fork failed: Cannot allocate memory at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
> --> rank=NA, hostname=n520
> ERROR: Failed while doing blastx of proteins
> ERROR: Chunk failed at level:8, tier_type:3
> FAILED CONTIG:Contig2
> 
> 
> setting up GFF3 output and fasta chunks
> doing repeat masking
> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
> --> rank=NA, hostname=n513
> ERROR: Failed while doing repeat masking
> ERROR: Chunk failed at level:0, tier_type:1
> FAILED CONTIG:Contig12378
> 
> 
> Best
> Quanwei


From carsonhh at gmail.com  Tue Sep  5 16:04:00 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 5 Sep 2017 16:04:00 -0600
Subject: [maker-devel] MSG: Can't get HSPs: data not collected.
In-Reply-To: <CAO_vRvZyMrfhL=nN79xnJeoXwFbEqTPwF8siioHJ_DFQTUpL2Q@mail.gmail.com>
References: <CAO_vRvZyMrfhL=nN79xnJeoXwFbEqTPwF8siioHJ_DFQTUpL2Q@mail.gmail.com>
Message-ID: <846D5971-E6EE-40F3-AF26-3124AA370029@gmail.com>

The last time I saw this error, it was with blast-2.5.1+. Not sure if the failure you are seeing is the same one. It was caused by a truncated BLAST report (so some results have no alignments). I have an open ticket with the BLAST development group. I never received confirmation that it is fixed, but you can try updating to 2.6 and see if that fixes it.  If not, switch to legacy BLAST (not blast plus) and see if it goes away.

--Carson


> On Sep 5, 2017, at 7:59 AM, zl c <chzelin at gmail.com> wrote:
> 
> Hello,
> 
> I run maker for most sequences successfully but fail some long sequences. The error is: 
> 
> Widget::tblastx:
> /usr/local/apps/blast/ncbi-blast-2.5.0+/bin/tblastx -db db.778415-832259.for_tblastx.fasta -query ...778415.832259.0 -num_alignments 10000 -num_descriptions 10000 -evalue 1e-10 -dbsize 1000 -searchsp 500000000 -num_threads 16 -lcase_masking -seg yes -soft_masking true -show_gis -out   OUT.tblastx
> #-------------------------------#
> 
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: Can't get HSPs: data not collected.
> STACK: Error::throw
> STACK: Bio::Root::Root::throw /usr/local/Perl/5.18.2/lib/perl5/site_perl/5.18.2/Bio/Root/Root.pm:486
> STACK: Bio::Search::Hit::PhatHit::Base::hsps /spin1/home/linux/chenz11/program/maker/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm:552
> STACK: Widget::tblastx::keepers /spin1/home/linux/chenz11/program/maker/bin/../lib/Widget/tblastx.pm:192
> STACK: Widget::tblastx::parse /spin1/home/linux/chenz11/program/maker/bin/../lib/Widget/tblastx.pm:133
> STACK: GI::tblastx /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:3251
> STACK: GI::tblastx /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:3260
> STACK: GI::reblast_merged_hits /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:471
> STACK: GI::merge_resolve_hits /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:291
> STACK: Process::MpiChunk::_go /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:2320
> STACK: Process::MpiChunk::run /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:340
> STACK: Process::MpiChunk::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:356
> STACK: Process::MpiTiers::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiTiers.pm:287
> STACK: Process::MpiTiers::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiTiers.pm:287
> STACK: /home/chenz11/program/maker/bin/maker:695
> -----------------------------------------------------------
> --> rank=NA, hostname=cn3544
> --> rank=NA, hostname=cn3544
> --> rank=NA, hostname=cn3544
> --> rank=NA, hostname=cn3544
> ERROR: Failed while collecting tblastx reports
> ERROR: Chunk failed at level:5, tier_type:3
> FAILED CONTIG:tig00011625_arrow
> 
> ERROR: Chunk failed at level:4, tier_type:0
> FAILED CONTIG:tig00011625_arrow
> 
> examining contents of the fasta file and run log
> 
> I've read a relative thread on the google group and checked my tblastx output. I found that the number of HSPs should be larger than 1000,000, but only output 1000,000, which make some alignments have no HSPs. Is there any setting that could solve the problem?
>  
> Thanks,
> Zelin
> 
> --------------------------------------------
> Zelin Chen [chzelin at gmail.com]
> 
> 
> NIH/NHGRI
> Building 50, Room 5531
> 50 SOUTH DR, MSC 8004 
> BETHESDA, MD 20892-8004


From qwzhang0601 at gmail.com  Tue Sep  5 16:04:23 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Tue, 5 Sep 2017 18:04:23 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
Message-ID: <CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>

Dear Carson:

Thanks. I wonder whether a smaller "max_dna_len" will split longer scaffolds.
I set max_dna_len to 1Mb because there are quite a few long scaffolds
(e.g., the longest one is about 100Mb). Could you explain whether a smaller
"max_dna_len" will decrease the quality of the annotation (e.g., by splitting
genes on the same scaffold)?


Best
Quanwei

2017-09-05 17:48 GMT-04:00 Carson Holt <carsonhh at gmail.com>:

> You ran out of memory. You probably set max_dna_len too high for the
> machines you are using. There is a note in the maker_opts.ctl file that
> tells you that this value affects memory usage.
>
> So you can either set it lower, or if running under MPI, use fewer CPUs
> per node (how you do this is MPI flavor dependent, but some flavors let you
> do this by setting process count lower combined with the round robin
> option).
>
> --Carson
>
>
>
> On Sep 5, 2017, at 2:24 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>
> Hello:
>
> We are doing genome annotation for a new rodent species. We have finished
> the training of the ab initio gene predictors successful by setting the
> following parameters (split_hit=40000, max_dna_len=1000000, and 99k
> mammalian Swiss protein sequences as evidences.
>
> But when I used the trained model to do the genome annotation, I got the
> following kinds of errors (shown in red). I used the same parameters as
> those for training, except for addition of 340k rodent TrEMBL protein
> sequences for protein evidences (i.e., I use both 99k mammalian Swiss
> protein sequences and 340k rodent TrEMBL protein sequences).
>
> I am doing the annotation on a cluster and started multiple Maker in the
> same directory (I had tried to use MPI but met some problems).
>
> Do you have any suggestions? Many thanks
> #some kinds of errors
> open3: fork failed: Cannot allocate memory at
> /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
> --> rank=NA, hostname=n520
> ERROR: Failed while doing blastx of proteins
> ERROR: Chunk failed at level:8, tier_type:3
> FAILED CONTIG:Contig2
>
>
> setting up GFF3 output and fasta chunks
> doing repeat masking
> Can't kill a non-numeric process ID at
> /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
> --> rank=NA, hostname=n513
> ERROR: Failed while doing repeat masking
> ERROR: Chunk failed at level:0, tier_type:1
> FAILED CONTIG:Contig12378
>
>
> Best
> Quanwei
>
>
>

From carsonhh at gmail.com  Tue Sep  5 16:08:28 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 5 Sep 2017 16:08:28 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
Message-ID: <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>

max_dna_len is the window size for keeping data in RAM. Smaller values do not split genes. But values lower than 100kb can create issues (if a single gene model spans 3 or more windows, it creates a weird failure).
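
To make the window arithmetic concrete, here is a minimal sketch. It assumes fixed, non-overlapping windows that start at position 1 of the sequence, which is a simplification for illustration and not necessarily how MAKER lays out its chunks internally. The gene coordinates are hypothetical.

```python
def window_count(seq_len, max_dna_len):
    """Number of fixed-size windows needed to cover a sequence (integer ceiling)."""
    return -(-seq_len // max_dna_len)


def windows_spanned(start, end, max_dna_len):
    """Number of windows touched by a feature at 1-based inclusive [start, end]."""
    first = (start - 1) // max_dna_len
    last = (end - 1) // max_dna_len
    return last - first + 1


scaffold_len = 100_000_000  # about 100Mb, as for the longest scaffold in this thread
for size in (1_000_000, 400_000, 300_000):
    print(f"max_dna_len={size}: {window_count(scaffold_len, size)} windows")

# Hypothetical 120kb gene model at positions 40,001-160,000.
for size in (50_000, 400_000):
    n = windows_spanned(40_001, 160_000, size)
    print(f"max_dna_len={size}: gene touches {n} window(s)")
```

Under these assumptions, a gene model can only touch three or more windows when it is longer than one window, which is why very small max_dna_len values are the risky ones for long genes.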

--Carson


> On Sep 5, 2017, at 4:04 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> Thanks. I wonder whether smaller "max_dna_len" will split longer scaffolds. I set max_dna_len as 1Mb, because there are quite many long scaffolds (e.g., the longest one is about 100Mb). Would you explain whether smaller "max_dna_len" will decrease the quality of annotation (e.g., split some genes in the same scaffold)? 
> 
> 
> Best
> Quanwei  
> 
> 2017-09-05 17:48 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
> You ran out of memory. You probably set max_dna_len too high for the machines you are using. There is a note in the maker_opts.ctl file that tells you that this value affects memory usage.
> 
> So you can either set it lower, or if running under MPI, use fewer CPUs per node (how you do this is MPI flavor dependent, but some flavors let you do this by setting process count lower combined with the round robin option).
> 
> --Carson
> 
> 
> 
>> On Sep 5, 2017, at 2:24 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>> 
>> Hello:
>> 
>> We are doing genome annotation for a new rodent species. We have finished the training of the ab initio gene predictors successful by setting the following parameters (split_hit=40000, max_dna_len=1000000, and 99k mammalian Swiss protein sequences as evidences. 
>> 
>> But when I used the trained model to do the genome annotation, I got the following kinds of errors (shown in red). I used the same parameters as those for training, except for addition of 340k rodent TrEMBL protein sequences for protein evidences (i.e., I use both 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences). 
>> 
>> I am doing the annotation on a cluster and started multiple Maker in the same directory (I had tried to use MPI but met some problems).  
>> 
>> Do you have any suggestions? Many thanks
>> #some kinds of errors
>> open3: fork failed: Cannot allocate memory at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
>> --> rank=NA, hostname=n520
>> ERROR: Failed while doing blastx of proteins
>> ERROR: Chunk failed at level:8, tier_type:3
>> FAILED CONTIG:Contig2
>> 
>> 
>> setting up GFF3 output and fasta chunks
>> doing repeat masking
>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
>> --> rank=NA, hostname=n513
>> ERROR: Failed while doing repeat masking
>> ERROR: Chunk failed at level:0, tier_type:1
>> FAILED CONTIG:Contig12378
>> 
>> 
>> Best
>> Quanwei
> 
> 


From qwzhang0601 at gmail.com  Wed Sep  6 09:51:54 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 6 Sep 2017 11:51:54 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
Message-ID: <CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>

Dear Carson:

(1) Thank you for your explanation. I will try setting max_dna_len to 400kb
for our rodent species, which is a little higher than the value suggested
for large vertebrate genomes (the MAKER manual says "300,000 is a good
max_dna_len on large vertebrate genomes if memory is not a limiting
factor").

(2) From some of your replies in the MAKER Google group, I noticed that
setting the depth_blast* parameters to a specific number can reduce memory
use and annotation time. So I changed the following parameters. But I wonder
whether this will decrease the quality of the annotation. If it won't
affect the quality, can I use an even smaller number (e.g., 20) to save more
memory and time?

depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking

(3) I also have some concerns about speed, especially for the long
scaffolds (around 100Mb). Which part of genome annotation is the most
time-consuming (repeat masking, BLAST, or polishing)? In particular, I
wonder whether the blastx of protein evidence will take the majority of the
time. I have prepared 99k mammalian Swiss-Prot protein sequences and 340k
rodent TrEMBL protein sequences as protein evidence, and I am considering
whether I can save much time by using only the 99k mammalian Swiss-Prot
sequences.

(4) For some reason, I cannot run MAKER through MPI on our cluster, so I
can only start multiple MAKER instances. Is it possible to have multiple
instances annotate the same long scaffold (i.e., start multiple MAKER
processes for a single sequence, without splitting the long sequence into
shorter ones)?

(5) Still about speed. I read some of your comments about the "cpus"
parameter in the maker_opts file (
http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html).
I understand that it sets the number of CPUs for a single chunk. So if I set
"cpus=2" in the maker_opts file, I can use the following script to
submit the job, right?

**************** the bash file used to submit the maker job
#!/bin/bash

#$ -cwd
#$ -S /bin/bash
#$ -j y
#$ -N makerT2
#$ -l h_vmem=8g
#$ -pe smp 2

module load MAKER/2.31.9/perl.5.22.1

maker --q 2> maker_test.error


Many thanks

Best
Quanwei


2017-09-05 18:08 GMT-04:00 Carson Holt <carsonhh at gmail.com>:

> max_dna_len is the window size for keeping data in RAM. Smaller values do
> not split genes. But values lower than 100kb can create issues (if a single
> gene models spans 3 or more windows, it creates a weird failure).
>
> --Carson
>
>
>
>
> On Sep 5, 2017, at 4:04 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>
> Dear Carson:
>
> Thanks. I wonder whether smaller "max_dna_len" will split longer
> scaffolds. I set max_dna_len as 1Mb, because there are quite many long
> scaffolds (e.g., the longest one is about 100Mb). Would you explain whether
> smaller "max_dna_len" will decrease the quality of annotation (e.g., split
> some genes in the same scaffold)?
>
>
> Best
> Quanwei
>
> 2017-09-05 17:48 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>
>> You ran out of memory. You probably set max_dna_len too high for the
>> machines you are using. There is a note in the maker_opts.ctl file that
>> tells you that this value affects memory usage.
>>
>> So you can either set it lower, or if running under MPI, use fewer CPUs
>> per node (how you do this is MPI flavor dependent, but some flavors let you
>> do this by setting process count lower combined with the round robin
>> option).
>>
>> --Carson
>>
>>
>>
>> On Sep 5, 2017, at 2:24 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>
>> Hello:
>>
>> We are doing genome annotation for a new rodent species. We have finished
>> the training of the ab initio gene predictors successful by setting the
>> following parameters (split_hit=40000, max_dna_len=1000000, and 99k
>> mammalian Swiss protein sequences as evidences.
>>
>> But when I used the trained model to do the genome annotation, I got the
>> following kinds of errors (shown in red). I used the same parameters as
>> those for training, except for addition of 340k rodent TrEMBL protein
>> sequences for protein evidences (i.e., I use both 99k mammalian Swiss
>> protein sequences and 340k rodent TrEMBL protein sequences).
>>
>> I am doing the annotation on a cluster and started multiple Maker in the
>> same directory (I had tried to use MPI but met some problems).
>>
>> Do you have any suggestions? Many thanks
>> #some kinds of errors
>> open3: fork failed: Cannot allocate memory at
>> /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
>> --> rank=NA, hostname=n520
>> ERROR: Failed while doing blastx of proteins
>> ERROR: Chunk failed at level:8, tier_type:3
>> FAILED CONTIG:Contig2
>>
>>
>> setting up GFF3 output and fasta chunks
>> doing repeat masking
>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm
>> line 1050.
>> --> rank=NA, hostname=n513
>> ERROR: Failed while doing repeat masking
>> ERROR: Chunk failed at level:0, tier_type:1
>> FAILED CONTIG:Contig12378
>>
>>
>> Best
>> Quanwei
>>
>>
>>
>
>

From carsonhh at gmail.com  Wed Sep  6 10:06:46 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 6 Sep 2017 10:06:46 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
Message-ID: <9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>


> (2) By reading some of your replies in the maker google group, and I noticed that it can reduce memory and save time for annotation if I set depth_blast to a certain number. So I changed the following parameters. But I wonder, whether it will decrease the quality of annotation? If it won't affect the quality, can I even use a smaller number (e.g., 20) to save more memory and time?
> 
> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking

These values really only affect the final evidence kept in the GFF3 when you look at it in a browser. They have no effect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. By setting it lower, you are just letting MAKER know it can throw things away even sooner since you don't want them in the GFF3. It provides a minor improvement for memory use, but max_dna_len is the one that has the greatest effect.


> (3) I also have some concerns about the speed, especially for the long scaffolds (around 100Mb). I wonder which part is the most time consuming for genome annotation (repeat masking, blast, or polishing?).  Particularly, I wonder whether the blastx of protein evidence will take majority of time. Now, I have prepared 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences as protein evidences. I am considering whether I can save much time if I only use the 99k mammalian Swiss protein sequences as evidences.

BLASTN (ESTs) -> fastest as it is searching nucleotide space
BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN
TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX

Also, doubling the dataset size doubles the runtime. Larger window sizes via max_dna_len will also increase runtimes.
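
The rules of thumb above can be turned into a back-of-the-envelope comparison. The sketch below assumes that runtime scales linearly with the number of evidence sequences (used here as a stand-in for dataset size) and uses the reading-frame factors quoted above as lower bounds. The results are relative units, not wall-clock predictions.

```python
# Relative cost factors per search type, taken from the lower bounds above.
FRAME_FACTOR = {"blastn": 1, "blastx": 6, "tblastx": 12}


def relative_cost(program, n_sequences):
    """Relative search cost, assuming runtime is linear in dataset size."""
    return FRAME_FACTOR[program] * n_sequences


# Evidence set sizes mentioned in this thread.
swiss_only = relative_cost("blastx", 99_000)
swiss_plus_trembl = relative_cost("blastx", 99_000 + 340_000)

ratio = swiss_plus_trembl / swiss_only
print(f"Adding the TrEMBL set costs roughly {ratio:.1f}x the blastx work")
```

Under these assumptions, dropping the 340k TrEMBL sequences would cut the blastx work to a bit under a quarter of the combined run, at the price of less protein evidence.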


> (4) For some reasons, I can not run maker though MPI on our cluster. So I can only start multiple maker. I wonder if it is possible to let multiple maker to annotate the same long scaffold (i.e., for a single sequence I start multiple maker, without splitting the long sequence into shorter ones).

Without MPI you won't be able to split up large contigs. At the very least you can try running on a single node and setting MPI to use all CPUs on that node. That is less difficult to set up than cross-node jobs via MPI.


> (5) Still about the speed issue. I read some of your comments about "cpus" parameters in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html). And I know it indicate the number of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?

The cpus parameter only affects how many CPUs are given to the BLAST command line, so only the BLAST step will speed up. I recommend using MPI so that all steps speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
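
A minimal sketch of the single-node case, assuming an MPI-enabled MAKER build and an `mpiexec` on the PATH: the Python helper below (a hypothetical name, `single_node_maker_command`) only assembles the command line, defaulting to the CPU count of the current machine. The exact `mpiexec` options may differ between MPI implementations and schedulers.

```python
import os
import shlex


def single_node_maker_command(n_procs=None):
    """Build an `mpiexec -n N maker` command line for one node."""
    if n_procs is None:
        n_procs = os.cpu_count() or 1  # cpu_count() can return None
    return ["mpiexec", "-n", str(n_procs), "maker"]


cmd = single_node_maker_command(16)
print(shlex.join(cmd))  # mpiexec -n 16 maker
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory holding the control files. As far as the maker_opts comments indicate, `cpus` is normally left at 1 when running under MPI, so that the MPI processes do not oversubscribe the node.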


--Carson

From carsonhh at gmail.com  Thu Sep  7 09:12:46 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Thu, 7 Sep 2017 09:12:46 -0600
Subject: [maker-devel] MSG: Can't get HSPs: data not collected.
In-Reply-To: <CAO_vRvbQLQfsRUfXMC1f8=U9L2kJ=PnFSuePjwHpRJE39zdV1Q@mail.gmail.com>
References: <CAO_vRvZyMrfhL=nN79xnJeoXwFbEqTPwF8siioHJ_DFQTUpL2Q@mail.gmail.com>
	<846D5971-E6EE-40F3-AF26-3124AA370029@gmail.com>
	<CAO_vRvbQLQfsRUfXMC1f8=U9L2kJ=PnFSuePjwHpRJE39zdV1Q@mail.gmail.com>
Message-ID: <2B046506-1E32-4840-B3B6-6DABB4A5D4C2@gmail.com>

I'm glad it fixed it.

--Carson

> On Sep 6, 2017, at 8:27 PM, zl c <chzelin at gmail.com> wrote:
> 
> Hi Carson,
> 
> I tried blast-2.6.0+ and it works. Thank you very much.
> 
> Thanks
> Zelin Chen
> 
> --------------------------------------------
> Zelin Chen [chzelin at gmail.com]
> 
> NIH/NHGRI
> Building 50, Room 5531
> 50 SOUTH DR, MSC 8004 
> BETHESDA, MD 20892-8004
> 
> On Tue, Sep 5, 2017 at 6:04 PM, Carson Holt <carsonhh at gmail.com> wrote:
> The last time I saw this error, it was with blast-2.5.1+. Not sure if the failure you are seeing is the same one. It was caused by a truncated BLAST report (so some results have no alignments). I have an open ticket with the BLAST development group. I never received confirmation that it is fixed, but you can try updating to 2.6 and see if that fixes it.  If not, switch to legacy BLAST (not blast plus) and see if it goes away.
> 
> --Carson
> 
> 
>> On Sep 5, 2017, at 7:59 AM, zl c <chzelin at gmail.com> wrote:
>> 
>> Hello,
>> 
>> MAKER runs successfully for most sequences but fails on some long ones. The error is:
>> 
>> Widget::tblastx:
>> /usr/local/apps/blast/ncbi-blast-2.5.0+/bin/tblastx -db db.778415-832259.for_tblastx.fasta -query ...778415.832259.0 -num_alignments 10000 -num_descriptions 10000 -evalue 1e-10 -dbsize 1000 -searchsp 500000000 -num_threads 16 -lcase_masking -seg yes -soft_masking true -show_gis -out   OUT.tblastx
>> #-------------------------------#
>> 
>> ------------- EXCEPTION: Bio::Root::Exception -------------
>> MSG: Can't get HSPs: data not collected.
>> STACK: Error::throw
>> STACK: Bio::Root::Root::throw /usr/local/Perl/5.18.2/lib/perl5/site_perl/5.18.2/Bio/Root/Root.pm:486
>> STACK: Bio::Search::Hit::PhatHit::Base::hsps /spin1/home/linux/chenz11/program/maker/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm:552
>> STACK: Widget::tblastx::keepers /spin1/home/linux/chenz11/program/maker/bin/../lib/Widget/tblastx.pm:192
>> STACK: Widget::tblastx::parse /spin1/home/linux/chenz11/program/maker/bin/../lib/Widget/tblastx.pm:133
>> STACK: GI::tblastx /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:3251
>> STACK: GI::tblastx /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:3260
>> STACK: GI::reblast_merged_hits /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:471
>> STACK: GI::merge_resolve_hits /spin1/home/linux/chenz11/program/maker/bin/../lib/GI.pm:291
>> STACK: Process::MpiChunk::_go /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:2320
>> STACK: Process::MpiChunk::run /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:340
>> STACK: Process::MpiChunk::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiChunk.pm:356
>> STACK: Process::MpiTiers::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiTiers.pm:287
>> STACK: Process::MpiTiers::run_all /spin1/home/linux/chenz11/program/maker/bin/../lib/Process/MpiTiers.pm:287
>> STACK: /home/chenz11/program/maker/bin/maker:695
>> -----------------------------------------------------------
>> --> rank=NA, hostname=cn3544
>> --> rank=NA, hostname=cn3544
>> --> rank=NA, hostname=cn3544
>> --> rank=NA, hostname=cn3544
>> ERROR: Failed while collecting tblastx reports
>> ERROR: Chunk failed at level:5, tier_type:3
>> FAILED CONTIG:tig00011625_arrow
>> 
>> ERROR: Chunk failed at level:4, tier_type:0
>> FAILED CONTIG:tig00011625_arrow
>> 
>> examining contents of the fasta file and run log
>> 
>> I've read a related thread on the Google group and checked my tblastx output. I found that there should be more than 1,000,000 HSPs, but only 1,000,000 were written, which leaves some alignments with no HSPs. Is there any setting that could solve the problem?
>>  
>> Thanks,
>> Zelin
>> 
>> --------------------------------------------
>> Zelin Chen [chzelin at gmail.com]
>> 
>> 
>> NIH/NHGRI
>> Building 50, Room 5531
>> 50 SOUTH DR, MSC 8004 
>> BETHESDA, MD 20892-8004
> 
> 


From qwzhang0601 at gmail.com  Fri Sep  8 21:25:29 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Fri, 8 Sep 2017 23:25:29 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
Message-ID: <CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>

Dear Carson:

I got the following error again. Is this still related to memory issues, or
could there be other causes? This time the error occurred while training the
SNAP model. Previously, even with max_dna_len set to 1 Mb, I could train the
model successfully. In the current training run (where I get the error
below), I have decreased max_dna_len to 300 kb and requested the same amount
of memory as before. The only difference is that I am now using both the
mammalian repeat library and a species-specific repeat library, whereas
previously I used only the mammalian one. Does using both repeat libraries
greatly increase the memory requirement (even with max_dna_len decreased
from 1 Mb to 300 kb)? I have also set the depth_blast parameters to 30 in
the current run.

Thank you! Have a nice weekend!


#---------------------------------------------------------------------
Now starting the contig!!
SeqID: Contig10
Length: 18773588
#---------------------------------------------------------------------


setting up GFF3 output and fasta chunks
doing repeat masking
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
doing blastx repeats
collecting blastx repeatmasking
processing all repeats
doing repeat masking
Can't kill a non-numeric process ID at
/gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
--> rank=NA, hostname=n224
ERROR: Failed while doing repeat masking
ERROR: Chunk failed at level:0, tier_type:1
FAILED CONTIG:Contig10

ERROR: Chunk failed at level:2, tier_type:0
FAILED CONTIG:Contig10

Best
Quanwei


From xvazquezc at gmail.com  Sun Sep 10 19:03:11 2017
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Mon, 11 Sep 2017 11:03:11 +1000
Subject: [maker-devel] augustus underpredicting
Message-ID: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>

Hi,
I have been annotating a fungal genome as usual, using BUSCO-trained
Augustus (in addition to GeneMark and SNAP), but for some reason Augustus
is predicting a mere 207 genes, compared to 15-20k from the other two.
I've never had this problem. The genome has an unusual repeat content, close
to 50%; I'm not sure whether that might pose a problem.
Has anybody come across a similar issue?
I also asked the BUSCO developers whether they have any idea:
https://gitlab.com/ezlab/busco/issues/49
Cheers,
Xabi

-- 
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From qwzhang0601 at gmail.com  Mon Sep 11 10:19:50 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 11 Sep 2017 12:19:50 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
Message-ID: <CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>

Dear Carson:

Regarding the error in my previous email, I found that the contig was
annotated correctly on the second RETRY, so please ignore that email. Now,
however, for a small number of scaffolds I have problems processing the
repeats (as shown below). I used both the Mammalia repeat library and a
species-specific repeat library (generated with your pipeline at
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic).
There were no such problems when I used only the Mammalia repeat library. Do
you have any idea what the reason could be, or any suggestions for how I
could track it down? Many thanks.

Here are some parameters I used

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe

max_dna_len=300000
split_hit=40000
depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking


Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
--> rank=NA, hostname=n409
ERROR: Failed while processing all repeats
ERROR: Chunk failed at level:3, tier_type:1
FAILED CONTIG:Contig31


Best
Quanwei


From carsonhh at gmail.com  Mon Sep 11 10:48:16 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 10:48:16 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
Message-ID: <5C2477A3-CDBA-458A-95CA-E6DC912417B3@gmail.com>

It may be a memory issue or an IO issue. Some resource is being taxed and creating a non-responsive bottleneck. If you are running MAKER multiple times in the same directory, you may have to run fewer processes. Also, if you are running without MPI, run with MPI instead, as it will better manage the parallelization and use fewer resources than multiple individual processes.

--Carson


> On Sep 8, 2017, at 9:25 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> I got the following error again. Is this still related to memory issues, or could there be other causes? This time the error occurred while training the SNAP model. Previously, even with max_dna_len set to 1 Mb, I could train the model successfully. In the current training run (where I get the error below), I have decreased max_dna_len to 300 kb and requested the same amount of memory as before. The only difference is that I am now using both the mammalian repeat library and a species-specific repeat library, whereas previously I used only the mammalian one. Does using both repeat libraries greatly increase the memory requirement (even with max_dna_len decreased from 1 Mb to 300 kb)? I have also set the depth_blast parameters to 30 in the current run.
> 
> Thank you! Have a nice weekend! 
> 
> 
> 
> #---------------------------------------------------------------------
> Now starting the contig!!
> SeqID: Contig10
> Length: 18773588
> #---------------------------------------------------------------------
> 
> 
> setting up GFF3 output and fasta chunks
> doing repeat masking
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> collecting blastx repeatmasking
> processing all repeats
> doing repeat masking
> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
> --> rank=NA, hostname=n224
> ERROR: Failed while doing repeat masking
> ERROR: Chunk failed at level:0, tier_type:1
> FAILED CONTIG:Contig10
> 
> ERROR: Chunk failed at level:2, tier_type:0
> FAILED CONTIG:Contig10
> 
> Best
> Quanwei
> 


From carsonhh at gmail.com  Mon Sep 11 10:50:41 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 10:50:41 -0600
Subject: [maker-devel] augustus underpredicting
In-Reply-To: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>
References: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>
Message-ID: <E0AFDA01-3142-4A14-A4F0-E621E5BD4C2F@gmail.com>

BUSCO may be generating too few models. BUSCO also identifies classes of conserved short genes that may not represent enough training diversity for your organism. Try running MAKER in protein2genome or est2genome mode, and then train with those results.

--Carson


> On Sep 10, 2017, at 7:03 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com> wrote:
> 
> Hi,
> I have been annotating a fungal genome as usual, using BUSCO-trained Augustus (in addition to GeneMark and SNAP), but for some reason Augustus is predicting a mere 207 genes, compared to 15-20k from the other two.
> I've never had this problem. The genome has an unusual repeat content, close to 50%; I'm not sure whether that might pose a problem.
> Has anybody come across a similar issue?
> I also asked the BUSCO developers whether they have any idea: https://gitlab.com/ezlab/busco/issues/49
> Cheers,
> Xabi
> 
> -- 
> Xabier Vázquez-Campos, PhD
> Research Associate
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA


From carsonhh at gmail.com  Mon Sep 11 11:07:12 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 11:07:12 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
Message-ID: <92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>

I think the cause of the error may have been a little further upstream from what you pasted in the e-mail. One thing that may be happening is that you are taxing resources (like IO) by running MAKER multiple times or on too many CPUs. That can lead to failures because of truncated BLAST reports, etc. In that case you can just retry, which will get around those types of IO-derived errors. MAKER can generate a lot of IO, and if you are working on network-mounted locations (i.e. the storage being used is actually across the network), they can be less robust than local storage (under heavy load, NFS can falsely report success on read/write operations that actually failed). That's the reason we built the retry capabilities into MAKER.

For contigs that continuously fail, you may need to set clean_try=1. That will cause failures to start from scratch (i.e. delete all old reports on failure rather than just those suspected of being truncated).
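
As a rough way to spot which contigs keep failing, the sketch below scans a saved MAKER log for the "FAILED CONTIG:" lines quoted in this thread. The function names and the min_failures threshold are hypothetical, not part of MAKER. A single failure can print the line more than once (once per level), so the counts are only a rough signal.

```python
import re
from collections import Counter

# Matches the "FAILED CONTIG:<id>" lines MAKER prints, as quoted in this thread.
FAILED_RE = re.compile(r"FAILED CONTIG:\s*(\S+)")


def count_failed_contigs(log_text):
    """Return a Counter mapping contig ID -> number of FAILED CONTIG lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def repeat_offenders(log_text, min_failures=2):
    """Contigs reported as failed at least min_failures times (hypothetical threshold)."""
    counts = count_failed_contigs(log_text)
    return sorted(cid for cid, n in counts.items() if n >= min_failures)


if __name__ == "__main__":
    # Sample assembled from the log excerpts in this thread.
    sample = """ERROR: Failed while doing repeat masking
ERROR: Chunk failed at level:0, tier_type:1
FAILED CONTIG:Contig10

ERROR: Chunk failed at level:2, tier_type:0
FAILED CONTIG:Contig10
33711 FAILED CONTIG:Contig31
"""
    print(repeat_offenders(sample))
```

Contigs that show up here across several runs would be the candidates for a rerun with clean_try=1.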

--Carson


> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> About the error in my above email, I found the contig was correctly annotated at the second time RETRY. So please ignore my last email. But now, for a few number of scaffolds, I met problems to process the repeats (as shown below in red). I used both Mammalia repeat library and species specific repeat library (which is generated by your pipeline "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic"). There were no such problems when I only used Mammalia repeat library. Do you have any ideas about this? What could be the reason? Or do you have any suggestions for me to find the reason? Many thanks
> 
> Here are some parameters I used
> 
> #-----Repeat Masking (leave values blank to skip repeat masking)
> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
> 
> max_dna_len=300000
> split_hit=40000
> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
> 
> 
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
> 33708 --> rank=NA, hostname=n409
> 33709 ERROR: Failed while processing all repeats
> 33710 ERROR: Chunk failed at level:3, tier_type:1
> 33711 FAILED CONTIG:Contig31
> 
> 
> Best
> Quanwei
> 
> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
> Dear Carson:
> 
> I got the following error again. Is this still related to memory issues? I wonder whether there can be other reasons lead to this error? This time, I got this error during training of the SNAP model. Before, even I set  max_dna_len=1Mb, I can train the model successfully.  And in the current training (where I get the following error),  I have decreased the max_dna_len to 300kb. I required the same amount memory as before. The only difference is that I am using both mammalian repeat library and species specific repeat library, while previously I only use the mammalian repeat library. Will it greatly increases the requirement of memory to use both repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? I have also set the depth_blast as 30 in current training.
> 
> Thank you! Have a nice weekend! 
> 
> 
> 
> #---------------------------------------------------------------------
> Now starting the contig!!
> SeqID: Contig10
> Length: 18773588
> #---------------------------------------------------------------------
> 
> 
> setting up GFF3 output and fasta chunks
> doing repeat masking
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> doing blastx repeats
> collecting blastx repeatmasking
> processing all repeats
> doing repeat masking
> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
> --> rank=NA, hostname=n224
> ERROR: Failed while doing repeat masking
> ERROR: Chunk failed at level:0, tier_type:1
> FAILED CONTIG:Contig10
> 
> ERROR: Chunk failed at level:2, tier_type:0
> FAILED CONTIG:Contig10
> 
> Best
> Quanwei
> 
> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
> 
>> (2) By reading some of your replies in the maker google group, and I noticed that it can reduce memory and save time for annotation if I set depth_blast to a certain number. So I changed the following parameters. But I wonder, whether it will decrease the quality of annotation? If it won't affect the quality, can I even use a smaller number (e.g., 20) to save more memory and time?
>> 
>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
> 
> These values really only affect the final evidence kept in the GFF3 when you look at it in a browser. They have no effect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. By setting it lower, you are just letting MAKER know it can throw things away even sooner since you don't want them in the GFF3. It provides a minor improvement for memory use, but max_dna_len is the big one that has the greatest effect.
> 
> 
>> (3) I also have some concerns about the speed, especially for the long scaffolds (around 100Mb). I wonder which part is the most time consuming for genome annotation (repeat masking, blast, or polishing?).  Particularly, I wonder whether the blastx of protein evidence will take majority of time. Now, I have prepared 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences as protein evidences. I am considering whether I can save much time if I only use the 99k mammalian Swiss protein sequences as evidences.
> 
> BLASTN (ESTs) -> fastest as it is searching nucleotide space
> BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN
> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX
> 
> Also double the dataset size, double the runtime. Larger window sizes via max_dna_len will also increase runtimes.
> 
> 
>> (4) For some reasons, I can not run maker though MPI on our cluster. So I can only start multiple maker. I wonder if it is possible to let multiple maker to annotate the same long scaffold (i.e., for a single sequence I start multiple maker, without splitting the long sequence into shorter ones).
> 
> Without MPI you won't be able to split up large contigs. At the very least you can try to run on a single node and set MPI to use all CPUs on that node. It's less difficult to set up compared to cross-node jobs via MPI.
> 
> 
>> (5) Still about the speed issue. I read some of your comments about "cpus" parameters in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html). And I know it indicate the number of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?
> 
> The cpus parameter only affects how many CPUs are given to the BLAST command line, so only the BLAST step will speed up. I recommend using MPI to get all steps to speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
> 
> 
> --Carson
> 
> 


From qwzhang0601 at gmail.com  Mon Sep 11 11:12:29 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 11 Sep 2017 13:12:29 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
Message-ID: <CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>

Dear Carson:

I am only running 5 MAKER instances in each directory (with cpus=2). If it
is a memory or IO issue, I am not sure why scaffolds much longer than the
failed ones were all annotated successfully, while the relatively shorter
ones failed.

I have set "tries=5" (#number of times to try a contig if there is a
failure for some reason). I will try "clean_try=1" and test on the failed
scaffolds individually with larger memory to see whether they can be
annotated.
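
A minimal sketch of scripting that change, assuming the maker_opts file uses the "key=value #comment" lines quoted in this thread. The helper name set_option and the sample starting values are hypothetical, chosen only for illustration.

```python
import re


def set_option(ctl_text, key, value):
    """Return ctl_text with `key` set to `value`, keeping any trailing #comment."""
    pattern = re.compile(
        r"^(" + re.escape(key) + r")=([^#\n]*?)([ \t]*#.*)?$", re.MULTILINE
    )

    def repl(match):
        # Keep the key and the original comment; swap only the value.
        return f"{match.group(1)}={value}{match.group(3) or ''}"

    new_text, n = pattern.subn(repl, ctl_text)
    if n == 0:
        # Key not present: append it as a new line.
        if new_text and not new_text.endswith("\n"):
            new_text += "\n"
        new_text += f"{key}={value}\n"
    return new_text


if __name__ == "__main__":
    # Illustrative starting values, not taken from a real control file.
    sample = (
        "tries=2 #number of times to try a contig if there is a failure for some reason\n"
        "clean_try=0\n"
    )
    updated = set_option(set_option(sample, "tries", 5), "clean_try", 1)
    print(updated, end="")
```

The function works on text rather than on a file path, so the result can be checked before anything is written back.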

Thank you!

Best
Quanwei

2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:

> I think the cause of the error may have been a little further upstream
> from what you pasted in the e-mail. One thing that may be happening is that
> you are taxing resources (like IO) if running MAKER multiple times or on
> too many CPUs. That can lead to failures because of truncated BLAST reports
> etc. In which case you can just retry and that will get around those types
> of IO derived errors. MAKER can generate a lot of IO, and if you are
> working on network mounted locations (i.e. the storage being used is
> actually across the network), then they can be less robust than local
> storage (when under heavy load NFS can falsely report success on read/write
> operations that actually failed). It's the reason we built in the retry
> capabilities of MAKER.
>
> For contigs that continuously fail, you may need to set clean_try=1. That
> will cause failures to start from scratch (i.e. delete all old reports on
> failure rather than just those suspected of being truncated).
>
> --Carson
>
>
>
> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>
> Dear Carson:
>
> About the error in my above email, I found the contig was correctly
> annotated at the second time RETRY. So please ignore my last email. But
> now, for a few number of scaffolds, I met problems to process the repeats
> (as shown below in red). I used both Mammalia repeat library and species
> specific repeat library (which is generated by your pipeline
> "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic").
> There were no such problems when I
> only used Mammalia repeat library. Do you have any ideas about this? What
> could be the reason? Or do you have any suggestions for me to find the
> reason? Many thanks
>
> Here are some parameters I used
>
> #-----Repeat Masking (leave values blank to skip repeat masking)
> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
>
> max_dna_len=300000
> split_hit=40000
> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>
>
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
> 33708 --> rank=NA, hostname=n409
> 33709 ERROR: Failed while processing all repeats
> 33710 ERROR: Chunk failed at level:3, tier_type:1
> 33711 FAILED CONTIG:Contig31
>
>
> Best
> Quanwei
>
> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>
>> Dear Carson:
>>
>> I got the following error again. Is this still related to memory issues?
>> I wonder whether there can be other reasons lead to this error? This time,
>> I got this error during training of the SNAP model. Before, even I set
>> max_dna_len=1Mb, I can train the model successfully.  And in the current
>> training (where I get the following error),  I have decreased the
>> max_dna_len to 300kb. I required the same amount memory as before. The only
>> difference is that I am using both mammalian repeat library and species
>> specific repeat library, while previously I only use the mammalian repeat
>> library. Will it greatly increases the requirement of memory to use both
>> repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? I
>> have also set the depth_blast as 30 in current training.
>>
>> Thank you! Have a nice weekend!
>>
>>
>>
>> #---------------------------------------------------------------------
>> Now starting the contig!!
>> SeqID: Contig10
>> Length: 18773588
>> #---------------------------------------------------------------------
>>
>>
>> setting up GFF3 output and fasta chunks
>> doing repeat masking
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> collecting blastx repeatmasking
>> processing all repeats
>> doing repeat masking
>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
>> --> rank=NA, hostname=n224
>> ERROR: Failed while doing repeat masking
>> ERROR: Chunk failed at level:0, tier_type:1
>> FAILED CONTIG:Contig10
>>
>> ERROR: Chunk failed at level:2, tier_type:0
>> FAILED CONTIG:Contig10
>>
>> Best
>> Quanwei
>>
>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>
>>>
>>> (2) By reading some of your replies in the maker google group, and I
>>> noticed that it can reduce memory and save time for annotation if I set
>>> depth_blast to a certain number. So I changed the following parameters. But
>>> I wonder, whether it will decrease the quality of annotation? If it won't
>>> affect the quality, can I even use a smaller number (e.g., 20) to save more
>>> memory and time?
>>>
>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>
>>>
>>> These values really only affect the final evidence kept in the GFF3 when
>>> you look at it in a browser. They have no effect on the annotation. This is
>>> because internally MAKER already collapses evidence down to the 10 best
>>> non-redundant features per evidence set per locus. The rest are put in the
>>> GFF3 just for reference. By setting it lower, you are just letting MAKER
>>> know it can throw things away even sooner since you don't want them in
>>> the GFF3. It provides a minor improvement for memory use, but
>>> max_dna_len is the big one that has the greatest effect.
>>>
>>>
>>> (3) I also have some concerns about the speed, especially for the long
>>> scaffolds (around 100Mb). I wonder which part is the most time consuming
>>> for genome annotation (repeat masking, blast, or polishing?).
>>> Particularly, I wonder whether the blastx of protein evidence will take
>>> majority of time. Now, I have prepared 99k mammalian Swiss protein
>>> sequences and 340k rodent TrEMBL protein sequences as protein evidences. I
>>> am considering whether I can save much time if I only use the 99k mammalian
>>> Swiss protein sequences as evidences.
>>>
>>>
>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>> BLASTX (proteins) -> must search 6 reading frames so will be at least 6
>>> times slower than BLASTN
>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least
>>> 12 times slower than BLASTN and twice as slow as BLASTX
>>>
>>> Also double the dataset size, double the runtime. Larger window sizes
>>> via max_dna_len will also increase runtimes.
>>>
>>>
>>> (4) For some reasons, I can not run maker though MPI on our cluster. So
>>> I can only start multiple maker. I wonder if it is possible to let multiple
>>> maker to annotate the same long scaffold (i.e., for a single sequence I
>>> start multiple maker, without splitting the long sequence into shorter
>>> ones).
>>>
>>>
>>> Without MPI you won't be able to split up large contigs. At the very
>>> least you can try to run on a single node and set MPI to use all CPUs on
>>> that node. It's less difficult to set up compared to cross-node jobs via
>>> MPI.
>>>
>>>
>>> (5) Still about the speed issue. I read some of your comments about
>>> "cpus" parameters in the maker_opts file
>>> (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html).
>>> And I know it indicate the number of cpus for a single chunk. So if I set
>>> "cpus=2" in the maker_opts file, then I can use the following command to
>>> submit the job, right?
>>>
>>>
>>> The cpus parameter only affects how many CPUs are given to the BLAST
>>> command line, so only the BLAST step will speed up. I recommend using
>>> MPI to get all steps to speed up. Even if you are only running on a single
>>> node, you can give all CPUs to the mpiexec command.
>>>
>>>
>>> --Carson
>>>
>>
>>
>
>

From carsonhh at gmail.com  Mon Sep 11 11:14:11 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 11:14:11 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
Message-ID: <488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>

It could be either. Please use MPI instead of starting multiple instances. It will greatly reduce both IO and RAM usage.

--Carson


> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> I only run 5 Maker instances in each directory (and set cpus=2). If it is related to memory issue or an IO issue, I am not sure why the much longer scaffolds (than the failed ones) were all annotated successfully, but the relatively shorter ones failed.  
> 
> I have set "tries=5" (#number of times to try a contig if there is a failure for some reason). I will try "clean_try=1" and test on the failed scaffolds individually with larger memory to see whether they can be annotated. 
> 
> Thank you!
> 
> Best
> Quanwei
> 
> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
> I think the cause of the error may have been a little further upstream from what you pasted in the e-mail. One thing that may be happening is that you are taxing resources (like IO) by running MAKER multiple times or on too many CPUs. That can lead to failures because of truncated BLAST reports, etc., in which case you can just retry, and that will get around those types of IO-derived errors. MAKER can generate a lot of IO, and if you are working on network-mounted locations (i.e. the storage being used is actually across the network), they can be less robust than local storage (under heavy load, NFS can falsely report success on read/write operations that actually failed). That is the reason we built retry capabilities into MAKER.
> 
> For contigs that continuously fail, you may need to set clean_try=1. That will cause failures to start from scratch (i.e. delete all old reports on failure rather than just those suspected of being truncated).
> 
> --Carson
> 
> 
> 
>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>> 
>> Dear Carson:
>> 
>> About the error in my above email, I found the contig was correctly annotated at the second time RETRY. So please ignore my last email. But now, for a few number of scaffolds, I met problems to process the repeats (as shown below in red). I used both Mammalia repeat library and species specific repeat library (which is generated by your pipeline "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic"). There were no such problems when I only used Mammalia repeat library. Do you have any ideas about this? What could be the reason? Or do you have any suggestions for me to find the reason? Many thanks
>> 
>> Here are some parameters I used
>> 
>> #-----Repeat Masking (leave values blank to skip repeat masking)
>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
>> 
>> max_dna_len=300000
>> split_hit=40000
>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>> 
>> 
>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>> 33708 --> rank=NA, hostname=n409
>> 33709 ERROR: Failed while processing all repeats
>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>> 33711 FAILED CONTIG:Contig31
>> 
>> 
>> Best
>> Quanwei
>> 
>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>> Dear Carson:
>> 
>> I got the following error again. Is this still related to memory issues? I wonder whether there can be other reasons lead to this error? This time, I got this error during training of the SNAP model. Before, even I set  max_dna_len=1Mb, I can train the model successfully.  And in the current training (where I get the following error),  I have decreased the max_dna_len to 300kb. I required the same amount memory as before. The only difference is that I am using both mammalian repeat library and species specific repeat library, while previously I only use the mammalian repeat library. Will it greatly increases the requirement of memory to use both repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? I have also set the depth_blast as 30 in current training.
>> 
>> Thank you! Have a nice weekend! 
>> 
>> 
>> 
>> #---------------------------------------------------------------------
>> Now starting the contig!!
>> SeqID: Contig10
>> Length: 18773588
>> #---------------------------------------------------------------------
>> 
>> 
>> setting up GFF3 output and fasta chunks
>> doing repeat masking
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> doing blastx repeats
>> collecting blastx repeatmasking
>> processing all repeats
>> doing repeat masking
>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
>> --> rank=NA, hostname=n224
>> ERROR: Failed while doing repeat masking
>> ERROR: Chunk failed at level:0, tier_type:1
>> FAILED CONTIG:Contig10
>> 
>> ERROR: Chunk failed at level:2, tier_type:0
>> FAILED CONTIG:Contig10
>> 
>> Best
>> Quanwei
>> 
>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>> 
>>> (2) By reading some of your replies in the maker google group, and I noticed that it can reduce memory and save time for annotation if I set depth_blast to a certain number. So I changed the following parameters. But I wonder, whether it will decrease the quality of annotation? If it won't affect the quality, can I even use a smaller number (e.g., 20) to save more memory and time?
>>> 
>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>> 
>> These values really only affect the final evidence kept in the GFF3 when you look at it in a browser. They have no effect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. By setting it lower, you are just letting MAKER know it can throw things away even sooner since you don't want them in the GFF3. It provides a minor improvement for memory use, but max_dna_len is the big one that has the greatest effect.
>> 
>> 
>>> (3) I also have some concerns about the speed, especially for the long scaffolds (around 100Mb). I wonder which part is the most time consuming for genome annotation (repeat masking, blast, or polishing?).  Particularly, I wonder whether the blastx of protein evidence will take majority of time. Now, I have prepared 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences as protein evidences. I am considering whether I can save much time if I only use the 99k mammalian Swiss protein sequences as evidences.
>> 
>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>> BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN
>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX
>> 
>> Also double the dataset size, double the runtime. Larger window sizes via max_dna_len will also increase runtimes.
>> 
>> 
>>> (4) For some reasons, I can not run maker though MPI on our cluster. So I can only start multiple maker. I wonder if it is possible to let multiple maker to annotate the same long scaffold (i.e., for a single sequence I start multiple maker, without splitting the long sequence into shorter ones).
>> 
>> Without MPI you won't be able to split up large contigs. At the very least you can try to run on a single node and set MPI to use all CPUs on that node. It's less difficult to set up compared to cross-node jobs via MPI.
>> 
>> 
>>> (5) Still about the speed issue. I read some of your comments about "cpus" parameters in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html). And I know it indicate the number of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?
>> 
>> The cpus parameter only affects how many CPUs are given to the BLAST command line, so only the BLAST step will speed up. I recommend using MPI to get all steps to speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
>> 
>> 
>> --Carson
>> 
>> 
> 
> 


From qwzhang0601 at gmail.com  Mon Sep 11 11:16:49 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 11 Sep 2017 13:16:49 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
Message-ID: <CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>

Dear Carson:

I ran into some problems using MPI. I will give it another try.
Thank you!

Best
Quanwei

2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:

> It could be either. Please use MPI instead of starting multiple instances.
> It will greatly reduce both IO and RAM usage.
>
> --Carson
>
>
>
> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>
> Dear Carson:
>
> I only run 5 Maker instances in each directory (and set cpus=2). If it is
> related to memory issue or an IO issue, I am not sure why the much longer
> scaffolds (than the failed ones) were all annotated successfully, but the
> relatively shorter ones failed.
>
> I have set "tries=5" (#number of times to try a contig if there is a
> failure for some reason). I will try "clean_try=1" and test on the failed
> scaffolds individually with larger memory to see whether they can be
> annotated.
>
> Thank you!
>
> Best
> Quanwei
>
> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>
>> I think the cause of the error may have been a little further upstream
>> from what you pasted in the e-mail. One thing that may be happening is that
>> you are taxing resources (like IO) if running MAKER multiple times or on
>> too many CPUs. That can lead to failures because of truncated BLAST reports
>> etc. In which case you can just retry and that will get around those types
>> of IO derived errors. MAKER can generate a lot of IO, and if you are
>> working on network mounted locations (i.e. the storage being used is
>> actually across the network), then they can be less robust than local
>> storage (when under heavy load NFS can falsely report success on read/write
>> operations that actually failed). It's the reason we built in the retry
>> capabilities of MAKER.
>>
>> For contigs that continuously fail, you may need to set clean_try=1. That
>> will cause failures to start from scratch (i.e. delete all old reports on
>> failure rather than just those suspected of being truncated).
>>
>> --Carson
>>
>>
>>
>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>> wrote:
>>
>> Dear Carson:
>>
>> About the error in my above email, I found the contig was correctly
>> annotated at the second time RETRY. So please ignore my last email. But
>> now, for a few number of scaffolds, I met problems to process the repeats
>> (as shown below in red). I used both Mammalia repeat library and species
>> specific repeat library (which is generated by your pipeline
>> "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic").
>> There were no such problems when I
>> only used Mammalia repeat library. Do you have any ideas about this? What
>> could be the reason? Or do you have any suggestions for me to find the
>> reason? Many thanks
>>
>> Here are some parameters I used
>>
>> #-----Repeat Masking (leave values blank to skip repeat masking)
>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
>>
>> max_dna_len=300000
>> split_hit=40000
>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>
>>
>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>> --> rank=NA, hostname=n409
>> ERROR: Failed while processing all repeats
>> ERROR: Chunk failed at level:3, tier_type:1
>> FAILED CONTIG:Contig31
>>
>>
>> Best
>> Quanwei
>>
>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>
>>> Dear Carson:
>>>
>>> I got the following error again. Is this still related to memory issues,
>>> or could there be other reasons for this error? This time I got the
>>> error during training of the SNAP model. Before, even with max_dna_len
>>> set to 1Mb, I could train the model successfully. In the current
>>> training (where I get the following error), I have decreased
>>> max_dna_len to 300kb and requested the same amount of memory as before.
>>> The only difference is that I am now using both the mammalian repeat
>>> library and a species-specific repeat library, while previously I only
>>> used the mammalian repeat library. Does using both repeat libraries
>>> greatly increase the memory requirement (even when I decrease
>>> max_dna_len from 1Mb to 300kb)? I have also set the depth_blast
>>> parameters to 30 in the current training.
>>>
>>> Thank you! Have a nice weekend!
>>>
>>>
>>>
>>> #---------------------------------------------------------------------
>>> Now starting the contig!!
>>> SeqID: Contig10
>>> Length: 18773588
>>> #---------------------------------------------------------------------
>>>
>>>
>>> setting up GFF3 output and fasta chunks
>>> doing repeat masking
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> doing blastx repeats
>>> collecting blastx repeatmasking
>>> processing all repeats
>>> doing repeat masking
>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
>>> --> rank=NA, hostname=n224
>>> ERROR: Failed while doing repeat masking
>>> ERROR: Chunk failed at level:0, tier_type:1
>>> FAILED CONTIG:Contig10
>>>
>>> ERROR: Chunk failed at level:2, tier_type:0
>>> FAILED CONTIG:Contig10
>>>
>>> Best
>>> Quanwei
>>>
>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>
>>>>
>>>> (2) From reading some of your replies in the MAKER Google group, I
>>>> noticed that setting depth_blast to a certain number can reduce memory use
>>>> and save annotation time, so I changed the following parameters. But
>>>> I wonder whether this will decrease the quality of the annotation. If it
>>>> won't affect the quality, can I use an even smaller number (e.g., 20) to
>>>> save more memory and time?
>>>>
>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>
>>>>
>>>> These values really only affect the final evidence kept in the GFF3
>>>> when you look at it in a browser. They have no effect on the annotation. This
>>>> is because internally MAKER already collapses evidence down to the 10 best
>>>> non-redundant features per evidence set per locus. The rest are put in the
>>>> GFF3 just for reference. By setting it lower, you are just letting MAKER
>>>> know it can throw things away even sooner since you don't want them in
>>>> the GFF3. It provides a minor improvement for memory use, but
>>>> max_dna_len is the big one that has the greatest effect.
>>>>
>>>>
>>>> (3) I also have some concerns about speed, especially for the long
>>>> scaffolds (around 100Mb). Which part of genome annotation is the most
>>>> time consuming (repeat masking, blast, or polishing)?
>>>> In particular, I wonder whether the blastx of protein evidence will take
>>>> the majority of the time. I have prepared 99k mammalian Swiss-Prot protein
>>>> sequences and 340k rodent TrEMBL protein sequences as protein evidence. I
>>>> am considering whether I can save much time if I only use the 99k mammalian
>>>> Swiss-Prot sequences as evidence.
>>>>
>>>>
>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>> BLASTX (proteins) -> must search 6 reading frames so will be at least 6
>>>> times slower than BLASTN
>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least
>>>> 12 times slower than BLASTN and twice as slow as BLASTX
>>>>
>>>> Also, doubling the dataset size doubles the runtime. Larger window sizes
>>>> via max_dna_len will also increase runtimes.
>>>>
>>>>
>>>> (4) For some reason, I cannot run MAKER through MPI on our cluster, so
>>>> I can only start multiple MAKER instances. Is it possible to let multiple
>>>> MAKER instances annotate the same long scaffold (i.e., start multiple
>>>> instances for a single sequence, without splitting the long sequence into
>>>> shorter ones)?
>>>>
>>>>
>>>> Without MPI you won't be able to split up large contigs. At the very
>>>> least you can try and run on a single node and set MPI to use all CPUs on
>>>> that node. It's less difficult to set up compared to cross node jobs via
>>>> MPI.
>>>>
>>>>
>>>> (5) Still about the speed issue. I read some of your comments about the
>>>> "cpus" parameter in the maker_opts file
>>>> (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html),
>>>> and I understand it indicates the number of CPUs for a single chunk. So if I
>>>> set "cpus=2" in the maker_opts file, then I can use the following command to
>>>> submit the job, right?
>>>>
>>>>
>>>> The cpus parameter only affects how many CPUs are given to the BLAST
>>>> command line, so only the BLAST step will speed up. I recommend using
>>>> MPI to get all steps to speed up. Even if you are only running on a single
>>>> node, you can give all CPUs to the mpiexec command.
>>>>
>>>>
>>>> --Carson
>>>>
>>>
>>>
>>
>>
>
>

From carsonhh at gmail.com  Mon Sep 11 11:18:14 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 11:18:14 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
Message-ID: <3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>

If you are just using a single machine (and not cross machine MPI), use MPICH3 -> https://www.mpich.org

It's easy to install yourself, and tends to be very robust to failure.

--Carson


> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> I ran into some problems using MPI. I will give it another try.
> Thank you!
> 
> Best
> Quanwei
> 
> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
> It could be either. Please use MPI instead of starting multiple instances. It will greatly reduce both IO and RAM usage.
> 
> --Carson
> 
> 
> 
>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>> 
>> Dear Carson:
>> 
>> I only run 5 MAKER instances in each directory (and set cpus=2). If it is related to a memory issue or an IO issue, I am not sure why the much longer scaffolds (than the failed ones) were all annotated successfully, but the relatively shorter ones failed.
>> 
>> I have set "tries=5" (#number of times to try a contig if there is a failure for some reason). I will try "clean_try=1" and test the failed scaffolds individually with more memory to see whether they can be annotated.
>> 
>> Thank you!
>> 
>> Best
>> Quanwei
>> 
>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>> I think the cause of the error may have been a little further upstream from what you pasted in the e-mail. One thing that may be happening is that you are taxing resources (like IO) if running MAKER multiple times or on too many CPUs. That can lead to failures because of truncated BLAST reports etc. In which case you can just retry and that will get around those types of IO derived errors. MAKER can generate a lot of IO, and if you are working on network mounted locations (i.e. the storage being used is actually across the network), then they can be less robust than local storage (when under heavy load NFS can falsely report success on read/write operations that actually failed). It's the reason we built in the retry capabilities of MAKER.
>> 
>> For contigs that continuously fail, you may need to set clean_try=1. That will cause failures to start from scratch (i.e. delete all old reports on failure rather than just those suspected of being truncated).
>> 
>> --Carson
>> 
>> 
>> 


From qwzhang0601 at gmail.com  Mon Sep 11 11:27:22 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 11 Sep 2017 13:27:22 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
Message-ID: <CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>

Dear Carson:

Could you please explain what you mean by "a single machine"? I am
running maker2 on our high performance cluster. The cluster has more than
1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine is used
as the scheduler. Can I use MPICH3?

Thanks

Best
Quanwei

2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:

> If you are just using a single machine (and not cross machine MPI), use
> MPICH3 -> https://www.mpich.org
>
> It's easy to install yourself, and tends to be very robust to failure.
>
> --Carson
>
>
>

From carsonhh at gmail.com  Mon Sep 11 11:46:39 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 11 Sep 2017 11:46:39 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
Message-ID: <A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>

Each node is a single machine. Because you currently run without MPI, each MAKER job you submit runs on a single machine. So you are either running multiple times on the same node, or you submitted 5 separate batch jobs in which case you may have a single maker process on each of 5 nodes.

MPI can parallelize on the same node or across nodes. If you request 10 nodes, then it can communicate across nodes to run the job on all hardware. Or you can run MPI on a single node and ask for all CPUs on that node. In that case it will split up work within a single node and use all resources just on that node. So if you can't get MPI to work across nodes, you can just submit a job that goes to a single node and ask for all CPUs on that node (multinode jobs may be hard to configure, but single node jobs are very easy). Just set the -n parameter of mpiexec to the CPU count of that node, and it will parallelize within the node.

Example command for a 20 CPU node ->  mpiexec -n 20 maker
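
As a rough sketch (not something shipped with MAKER), the single-node launch line above could be built from whatever slot count the scheduler grants instead of a hard-coded 20. The function name maker_mpi_cmd is made up for illustration. It assumes NSLOTS is the slot-count variable Grid Engine sets inside a job, and that getconf _NPROCESSORS_ONLN is available as a fallback for the node's CPU count.

```shell
#!/bin/sh
# Print the single-node MPI launch line for MAKER (it does not run it).
# Rank count: first argument if given, else Grid Engine's NSLOTS,
# else the number of online CPUs reported by getconf.
maker_mpi_cmd() {
    ranks="${1:-${NSLOTS:-$(getconf _NPROCESSORS_ONLN)}}"
    printf 'mpiexec -n %s maker\n' "$ranks"
}

# Same case as the 20 CPU node mentioned above.
maker_mpi_cmd 20
```

Inside a job that requested a whole node, calling maker_mpi_cmd with no argument and running the printed line should have the same effect as the 20 CPU example.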

--Carson


> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson: 
> 
> Could you please explain what you mean by "a single machine"? I am running maker2 on our high performance cluster. The cluster has more than 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine is used as the scheduler. Can I use MPICH3?
> 
> Thanks
> 
> Best
> Quanwei
> 
> 2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>>:
> If you are just using a single machine (and not cross machine MPI), use MPICH3 --> https://www.mpich.org
> 
> It's easy to install yourself, and tends to be very robust to failure.
> 
> --Carson
> 
> 
> 
>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com <mailto:qwzhang0601 at gmail.com>> wrote:
>> 
>> Dear Carson:
>> 
>> I met some problems to use MPI. I will give it another try.
>> Thank you!
>> 
>> Best
>> Quanwei
>> 
>> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>>:
>> It could be either. Please use MPI instead of starting multiple instances. It will greatly reduce both IO and RAM usage.
>> 
>> --Carson
>> 
>> 
>> 
>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com <mailto:qwzhang0601 at gmail.com>> wrote:
>>> 
>>> Dear Carson:
>>> 
>>> I only run 5 Maker instances in each directory (and set cpus=2). If it is related to memory issue or an IO issue, I am not sure why the much longer scaffolds (than the failed ones) were all annotated successfully, but the relatively shorter ones failed.  
>>> 
>>> I have set "tries=5" (#number of times to try a contig if there is a failure for some reason). I will try "clean_try=1" and test on the failed scaffolds individually with larger memory to see whether they can be annotated. 
>>> 
>>> Thank you!
>>> 
>>> Best
>>> Quanwei
>>> 
>>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>>:
>>> I think the cause of the error may have been a little further upstream from what you pasted in the e-mail. One thing that may be happening is that you are taxing resources (like IO) if running MAKER multiple times or on too many CPUs. That can lead to failures because of truncated BLAST reports etc. In which case you can just retry and that will get around those types of IO derived errors. MAKER can generate a lot of IO, and if you are working on network mounted locations (i.e. the storage being used is actually across the network), then they can be less robust than local storage (when under heavy load NFS can falsely report success on read/write operations that actually failed). It's the reason we built in the retry capabilities of MAKER.
>>> 
>>> For contigs that continuously fail, you may need to set clean_try=1. That will cause failures to start from scratch (i.e. delete all old reports on failure rather than just those suspected of being truncated).
>>> 
>>> --Carson
>>> 
>>> 
>>> 
>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com <mailto:qwzhang0601 at gmail.com>> wrote:
>>>> 
>>>> Dear Carson:
>>>> 
>>>> About the error in my above email, I found the contig was correctly annotated at the second time RETRY. So please ignore my last email. But now, for a few number of scaffolds, I met problems to process the repeats (as shown below in red). I used both Mammalia repeat library and species specific repeat library (which is generated by your pipeline "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic <http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic>"). There were no such problems when I only used Mammalia repeat library. Do you have any ideas about this? What could be the reason? Or do you have any suggestions for me to find the reason? Many thanks  
>>>> 
>>>> Here are some parameters I used
>>>> 
>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
>>>> 
>>>> max_dna_len=300000
>>>> split_hit=40000
>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>> 
>>>> 
>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>>>> 33708 --> rank=NA, hostname=n409
>>>> 33709 ERROR: Failed while processing all repeats
>>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>>> 33711 FAILED CONTIG:Contig31
>>>> 
>>>> 
>>>> Best
>>>> Quanwei
>>>> 
>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com <mailto:qwzhang0601 at gmail.com>>:
>>>> Dear Carson:
>>>> 
>>>> I got the following error again. Is this still related to memory issues? I wonder whether there can be other reasons lead to this error? This time, I got this error during training of the SNAP model. Before, even I set  max_dna_len=1Mb, I can train the model successfully.  And in the current training (where I get the following error),  I have decreased the max_dna_len to 300kb. I required the same amount memory as before. The only difference is that I am using both mammalian repeat library and species specific repeat library, while previously I only use the mammalian repeat library. Will it greatly increases the requirement of memory to use both repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? I have also set the depth_blast as 30 in current training.
>>>> 
>>>> Thank you! Have a nice weekend! 
>>>> 
>>>> 
>>>> 
>>>> #---------------------------------------------------------------------
>>>> Now starting the contig!!
>>>> SeqID: Contig10
>>>> Length: 18773588
>>>> #---------------------------------------------------------------------
>>>> 
>>>> 
>>>> setting up GFF3 output and fasta chunks
>>>> doing repeat masking
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> doing blastx repeats
>>>> collecting blastx repeatmasking
>>>> processing all repeats
>>>> doing repeat masking
>>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
>>>> --> rank=NA, hostname=n224
>>>> ERROR: Failed while doing repeat masking
>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>> FAILED CONTIG:Contig10
>>>> 
>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>> FAILED CONTIG:Contig10
>>>> 
>>>> Best
>>>> Quanwei
>>>> 
>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>>:
>>>> 
>>>>> (2) By reading some of your replies in the maker google group, and I noticed that it can reduce memory and save time for annotation if I set depth_blast to a certain number. So I changed the following parameters. But I wonder, whether it will decrease the quality of annotation? If it won't affect the quality, can I even use a smaller number (e.g., 20) to save more memory and time?
>>>>> 
>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>> 
>>>> These values really only affect the final evidence kept in the GFF3 when you look at it in a browser. They have no effect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. By setting it lower, you are just letting MAKER know it can throw things away even sooner since you don't want them in the GFF3. It provides a minor improvement for memory use, but max_dna_len is the big one that has the greatest effect.
>>>> 
>>>> 
>>>>> (3) I also have some concerns about the speed, especially for the long scaffolds (around 100Mb). I wonder which part is the most time consuming for genome annotation (repeat masking, blast, or polishing?).  Particularly, I wonder whether the blastx of protein evidence will take majority of time. Now, I have prepared 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences as protein evidences. I am considering whether I can save much time if I only use the 99k mammalian Swiss protein sequences as evidences.
>>>> 
>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>> BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN
>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX
>>>> 
>>>> Also double the dataset size, double the runtime. Larger window sizes via max_dna_len will also increase runtimes.
>>>> 
>>>> 
>>>>> (4) For some reasons, I can not run maker though MPI on our cluster. So I can only start multiple maker. I wonder if it is possible to let multiple maker to annotate the same long scaffold (i.e., for a single sequence I start multiple maker, without splitting the long sequence into shorter ones).
>>>> 
>>>> Without MPI you won't be able to split up large contigs. At the very least you can try and run on a single node and set MPI to use all CPUs on that node. It's less difficult to set up compared to cross node jobs via MPI.
>>>> 
>>>> 
>>>>> (5) Still about the speed issue. I read some of your comments about "cpus" parameters in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html <http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html>). And I know it indicate the number of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?  
>>>> 
>>>> The cpus parameter only affects how many CPUs are given to the blast command line, so only the BLAST step will speed up. I recommend using MPI to get all steps to speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
>>>> 
>>>> 
>>>> --Carson
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 


From qwzhang0601 at gmail.com  Mon Sep 11 12:33:51 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 11 Sep 2017 14:33:51 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
	<A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
Message-ID: <CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>

Dear Carson:

I see. Thank you. I will try it.

Best
Quanwei

2017-09-11 13:46 GMT-04:00 Carson Holt <carsonhh at gmail.com>:

> Each node is a single machine. Because you currently run without MPI, each
> MAKER job you submit runs on a single machine. So you are either running
> multiple times on the same node, or you submitted 5 separate batch jobs in
> which case you may have a single maker process on each of 5 nodes.
>
> MPI can parallelize on the same node or across nodes. If you request 10
> nodes, then it can communicate across nodes to run the job on all hardware.
> Or you can run MPI on a single node and ask for all CPUs on that node. In
> that case it will split up work within a single node and use all resources
> just on that node. So if you can't get MPI to work across nodes, you can
> just submit a job that goes to a single node and ask for all CPUs on that
> node (multinode jobs may be hard to configure, but single node jobs are
> very easy). Just set the -n parameter of mpiexec to the CPU count of that
> node, and it will parallelize within the node.
>
> Example command for a 20 CPU node -->  mpiexec -n 20 maker
>
> --Carson
>
>
>
>
>
> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>
> Dear Carson:
>
> Would you please explain what do you mean by "a single machine"? I am
> running maker2 on our high performance cluster. The cluster has more than
> 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine was used
> as the scheduler. Can I use MPICH3?
>
> Thanks
>
> Best
> Quanwei
>
> 2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>
>> If you are just using a single machine (and not cross machine MPI), use
>> MPICH3 --> https://www.mpich.org
>>
>> It's easy to install yourself, and tends to be very robust to failure.
>>
>> --Carson
>>
>>
>>
>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>> wrote:
>>
>> Dear Carson:
>>
>> I met some problems to use MPI. I will give it another try.
>> Thank you!
>>
>> Best
>> Quanwei
>>
>> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>
>>> It could be either. Please use MPI instead of starting multiple
>>> instances. It will greatly reduce both IO and RAM usage.
>>>
>>> --Carson
>>>
>>>
>>>
>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>> wrote:
>>>
>>> Dear Carson:
>>>
>>> I only run 5 Maker instances in each directory (and set cpus=2). If it
>>> is related to memory issue or an IO issue, I am not sure why the much
>>> longer scaffolds (than the failed ones) were all annotated successfully,
>>> but the relatively shorter ones failed.
>>>
>>> I have set "tries=5" (#number of times to try a contig if there is a
>>> failure for some reason). I will try "clean_try=1" and test on the failed
>>> scaffolds individually with larger memory to see whether they can be
>>> annotated.
>>>
>>> Thank you!
>>>
>>> Best
>>> Quanwei
>>>
>>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>
>>>> I think the cause of the error may have been a little further upstream
>>>> from what you pasted in the e-mail. One thing that may be happening is that
>>>> you are taxing resources (like IO) if running MAKER multiple times or on
>>>> too many CPUs. That can lead to failures because of truncated BLAST reports
>>>> etc. In which case you can just retry and that will get around those types
>>>> of IO derived errors. MAKER can generate a lot of IO, and if you are
>>>> working on network mounted locations (i.e. the storage being used is
>>>> actually across the network), then they can be less robust than local
>>>> storage (when under heavy load NFS can falsely report success on read/write
>>>> operations that actually failed). It's the reason we built in the retry
>>>> capabilities of MAKER.
>>>>
>>>> For contigs that continuously fail, you may need to set clean_try=1.
>>>> That will cause failures to start from scratch (i.e. delete all old reports
>>>> on failure rather than just those suspected of being truncated).
>>>>
>>>> --Carson
>>>>
>>>>
>>>>
>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>> wrote:
>>>>
>>>> Dear Carson:
>>>>
>>>> About the error in my above email, I found the contig was correctly
>>>> annotated at the second time RETRY. So please ignore my last email. But
>>>> now, for a few number of scaffolds, I met problems to process the repeats
>>>> (as shown below in red). I used both Mammalia repeat library and species
>>>> specific repeat library (which is generated by your pipeline
>>>> "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic").
>>>> There were no such problems when I
>>>> only used Mammalia repeat library. Do you have any ideas about this? What
>>>> could be the reason? Or do you have any suggestions for me to find the
>>>> reason? Many thanks
>>>>
>>>> Here are some parameters I used
>>>>
>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>> model_org=Mammalia #select a model organism for RepBase masking in
>>>> RepeatMasker
>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism
>>>> specific repeat library in fasta format for Repe
>>>>
>>>> max_dna_len=300000
>>>> split_hit=40000
>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>
>>>>
>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm
>>>> line 188.
>>>> 33708 --> rank=NA, hostname=n409
>>>> 33709 ERROR: Failed while processing all repeats
>>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>>> 33711 FAILED CONTIG:Contig31
>>>>
>>>>
>>>> Best
>>>> Quanwei
>>>>
>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>>>
>>>>> Dear Carson:
>>>>>
>>>>> I got the following error again. Is this still related to memory
>>>>> issues? I wonder whether there can be other reasons lead to this error?
>>>>> This time, I got this error during training of the SNAP model. Before, even
>>>>> I set  max_dna_len=1Mb, I can train the model successfully.  And in the
>>>>> current training (where I get the following error),  I have decreased the
>>>>> max_dna_len to 300kb. I required the same amount memory as before. The only
>>>>> difference is that I am using both mammalian repeat library and species
>>>>> specific repeat library, while previously I only use the mammalian repeat
>>>>> library. Will it greatly increases the requirement of memory to use both
>>>>> repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? I
>>>>> have also set the depth_blast as 30 in current training.
>>>>>
>>>>> Thank you! Have a nice weekend!
>>>>>
>>>>>
>>>>>
>>>>> #---------------------------------------------------------------------
>>>>> Now starting the contig!!
>>>>> SeqID: Contig10
>>>>> Length: 18773588
>>>>> #---------------------------------------------------------------------
>>>>>
>>>>>
>>>>> setting up GFF3 output and fasta chunks
>>>>> doing repeat masking
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> doing blastx repeats
>>>>> collecting blastx repeatmasking
>>>>> processing all repeats
>>>>> doing repeat masking
>>>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm
>>>>> line 1050.
>>>>> --> rank=NA, hostname=n224
>>>>> ERROR: Failed while doing repeat masking
>>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>>> FAILED CONTIG:Contig10
>>>>>
>>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>>> FAILED CONTIG:Contig10
>>>>>
>>>>> Best
>>>>> Quanwei
>>>>>
>>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>
>>>>>>
>>>>>> (2) By reading some of your replies in the maker google group, and I
>>>>>> noticed that it can reduce memory and save time for annotation if I set
>>>>>> depth_blast to a certain number. So I changed the following parameters. But
>>>>>> I wonder, whether it will decrease the quality of annotation? If it won't
>>>>>> affect the quality, can I even use a smaller number (e.g., 20) to save more
>>>>>> memory and time?
>>>>>>
>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>>
>>>>>>
>>>>>> These values really only affect the final evidence kept in the GFF3
>>>>>> when you look at it in a browser. They have no effect on the annotation. This
>>>>>> is because internally MAKER already collapses evidence down to the 10 best
>>>>>> non-redundant features per evidence set per locus. The rest are put in the
>>>>>> GFF3 just for reference. By setting it lower, you are just letting MAKER
>>>>>> know it can throw things away even sooner since you don't want them in
>>>>>> the GFF3. It provides a minor improvement for memory use, but
>>>>>> max_dna_len is the big one that has the greatest effect.
>>>>>>
>>>>>>
>>>>>> (3) I also have some concerns about the speed, especially for the
>>>>>> long scaffolds (around 100Mb). I wonder which part is the most time
>>>>>> consuming for genome annotation (repeat masking, blast, or polishing?).
>>>>>> Particularly, I wonder whether the blastx of protein evidence will take
>>>>>> majority of time. Now, I have prepared 99k mammalian Swiss protein
>>>>>> sequences and 340k rodent TrEMBL protein sequences as protein evidences. I
>>>>>> am considering whether I can save much time if I only use the 99k mammalian
>>>>>> Swiss protein sequences as evidences.
>>>>>>
>>>>>>
>>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at least
>>>>>> 6 times slower than BLASTN
>>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at
>>>>>> least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>>>
>>>>>> Also double the dataset size, double the runtime. Larger window sizes
>>>>>> via max_dna_len will also increase runtimes.
>>>>>>
>>>>>>
>>>>>> (4) For some reasons, I can not run maker though MPI on our cluster.
>>>>>> So I can only start multiple maker. I wonder if it is possible to let
>>>>>> multiple maker to annotate the same long scaffold (i.e., for a single
>>>>>> sequence I start multiple maker, without splitting the long sequence into
>>>>>> shorter ones).
>>>>>>
>>>>>>
>>>>>> Without MPI you won't be able to split up large contigs. At the very
>>>>>> least you can try and run on a single node and set MPI to use all CPUs on
>>>>>> that node. It's less difficult to set up compared to cross node jobs via
>>>>>> MPI.
>>>>>>
>>>>>>
>>>>>> (5) Still about the speed issue. I read some of your comments about
>>>>>> "cpus" parameters in the maker_opts file (
>>>>>> http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html).
>>>>>> And I know it indicate the number of
>>>>>> cpus for a single chunk. So if I set "cpus=2" in the maker_opts file, then
>>>>>> I can use the following command to submit the job, right?
>>>>>>
>>>>>>
>>>>>> The cpus parameter only affects how many CPUs are given to the blast
>>>>>> command line, so only the BLAST step will speed up. I recommend using
>>>>>> MPI to get all steps to speed up. Even if you are only running on a single
>>>>>> node, you can give all CPUs to the mpiexec command.
>>>>>>
>>>>>>
>>>>>> --Carson
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>

From qwzhang0601 at gmail.com  Wed Sep 13 08:51:32 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 13 Sep 2017 10:51:32 -0400
Subject: [maker-devel] Repeats annotation
Message-ID: <CAOW6FS+YBobDCpEAZ=8YYoV+LKrTV7qJhmKYktm15pAh84i3Kw@mail.gmail.com>

Dear Carson:

We have generated a species-specific repeat library following your pipeline (
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic)
and annotated the genome with maker2 using both the species-specific repeat
library and the mammalian repeat library.

Now we want to compare the repeat content among different species. So I
want to generate species-specific repeat libraries for the other species
and again use both their species-specific repeat library and the mammalian
repeat library. But I found that I can only provide either the
species-specific repeat library or the mammalian repeat library to
RepeatMasker (not both). I wonder whether I can run maker2 on those genomes
for repeat masking only.

BTW, by running RepeatMasker we can get a summary report (as below). I
wonder whether there is any script in maker2 to analyze repeat elements
(or other tools to process the output of maker2).

Many thanks


file name: test_scaffold31.fasta
sequences:             1
total length:     863590 bp  (858757 bp excl N/X-runs)
GC level:         37.02 %
bases masked:     301634 bp ( 34.93 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:               134        14362 bp    1.66 %
      Alu/B1          28         2183 bp    0.25 %
      MIRs            21         2860 bp    0.33 %

LINEs:               188       129104 bp   14.95 %
      LINE1          168       124633 bp   14.43 %
      LINE2           16         4266 bp    0.49 %
      L3/CR1           4          205 bp    0.02 %
      RTE              0            0 bp    0.00 %

LTR elements:        127       101129 bp   11.71 %
      ERVL            10         3057 bp    0.35 %
      ERVL-MaLRs      22         6902 bp    0.80 %
      ERV_classI      66        80258 bp    9.29 %
      ERV_classII     29        10912 bp    1.26 %

DNA elements:         27         4402 bp    0.51 %
      hAT-Charlie     13         1836 bp    0.21 %
      TcMar-Tigger     8         1651 bp    0.19 %

Unclassified:          4         1590 bp    0.18 %

Total interspersed repeats:    250587 bp   29.02 %


Small RNA:             9          616 bp    0.07 %

Satellites:           66        40820 bp    4.73 %
Simple repeats:      159         7235 bp    0.84 %
Low complexity:       50         2766 bp    0.32 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element


The query species was assumed to be mammalia
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127

run with rmblastn version 2.2.27+
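
On the question of summarising repeat classes, here is a minimal Python sketch, assuming the RepeatMasker summary layout shown above. It pulls the top-level class rows (those with a "Name:" label followed by a count, a length in bp and a percentage) into a dictionary. `parse_top_level_classes`, `ROW` and `SAMPLE` are hypothetical names used for illustration; this is not a script shipped with MAKER or RepeatMasker.

```python
import re

# Matches rows such as "SINEs:               134        14362 bp    1.66 %".
ROW = re.compile(
    r"^\s*(?P<name>[^:]+):\s+(?P<count>\d+)\s+(?P<bp>\d+) bp\s+(?P<pct>[\d.]+) %\s*$"
)

# A few rows copied from the report above, used only as demonstration input.
SAMPLE = """\
SINEs:               134        14362 bp    1.66 %
      Alu/B1          28         2183 bp    0.25 %
LINEs:               188       129104 bp   14.95 %
Total interspersed repeats:    250587 bp   29.02 %
"""


def parse_top_level_classes(summary_text):
    """Return {class name: (element count, bp occupied, percent of sequence)}.

    Subclass rows (no colon) and rows without an element count, such as
    "Total interspersed repeats", do not match the pattern and are skipped.
    """
    classes = {}
    for line in summary_text.splitlines():
        match = ROW.match(line)
        if match:
            classes[match["name"].strip()] = (
                int(match["count"]),
                int(match["bp"]),
                float(match["pct"]),
            )
    return classes


if __name__ == "__main__":
    for name, (count, bp, pct) in parse_top_level_classes(SAMPLE).items():
        print(f"{name}\t{count}\t{bp}\t{pct}")
```

Running the same function over the summaries from several species would give dictionaries that can be compared class by class.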

From qwzhang0601 at gmail.com  Wed Sep 13 08:32:34 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 13 Sep 2017 10:32:34 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
	<A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
	<CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>
Message-ID: <CAOW6FSJcdMTuLbN8-O-Vb7OxEXvU=F3Nwz7Rhx6qzXJ30ubV1w@mail.gmail.com>

Dear Carson:

I did more tests on one of the contigs (length 863 kb) that failed during
repeat masking. I found it only fails when I add the species-specific
repeat library; it is annotated successfully when only the mammalian repeat
library is used. For the test I picked only this contig and ran maker with
64 GB of memory. So I think the failure should not be a memory or IO
problem, because even contigs of length 98 Mb can be annotated with 32 GB
of memory.

I also ran RepeatMasker on this contig with the mammalian and the
species-specific repeat libraries separately. With the mammalian repeat
library, about 35% was masked as repeats, while it is 65% with the
species-specific repeat library (as shown below). I wonder whether the high
level of repeats can lead to the failure of this contig. Do you have any
ideas about this? Thanks


file name: test_scaffold31.fasta
sequences:             1
total length:     863590 bp  (858757 bp excl N/X-runs)
GC level:         37.02 %
bases masked:     562909 bp ( 65.18 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:              113        16134 bp    1.87 %
      ALUs           71        12479 bp    1.45 %
      MIRs            1          133 bp    0.02 %

LINEs:              251       380142 bp   44.02 %
      LINE1         211       210623 bp   24.39 %
      LINE2           1           86 bp    0.01 %
      L3/CR1          0            0 bp    0.00 %

LTR elements:       246       101221 bp   11.72 %
      ERVL            5         1037 bp    0.12 %
      ERVL-MaLRs     18         2744 bp    0.32 %
      ERV_classI    201        90942 bp   10.53 %
      ERV_classII    18         5964 bp    0.69 %

DNA elements:        39        14177 bp    1.64 %
     hAT-Charlie      7         3864 bp    0.45 %
     TcMar-Tigger     7         1706 bp    0.20 %

Unclassified:       196        45831 bp    5.31 %

Total interspersed repeats:   557505 bp   64.56 %


Small RNA:            3          823 bp    0.10 %

Satellites:           2          237 bp    0.03 %
Simple repeats:      94         4472 bp    0.52 %
Low complexity:      18          766 bp    0.09 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element


The query species was assumed to be homo
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127

run with rmblastn version 2.2.27+
The query was compared to classified sequences in
".../consensi.fa.classifiednoProtFinal"


Best
Quanwei
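
As a quick check of the two percentages quoted in this message, here is a small Python sketch. The 562909 bp and 863590 bp figures come from the report above; the 301634 bp figure is taken from the RepeatMasker summary for the same test_scaffold31.fasta posted in the "Repeats annotation" message, which is assumed here to be the Mammalia-only run. `masked_percent` is a hypothetical helper name.

```python
def masked_percent(bases_masked, total_length):
    """Percent of the sequence reported as masked."""
    if total_length <= 0:
        raise ValueError("total_length must be positive")
    return 100.0 * bases_masked / total_length


TOTAL_LENGTH = 863590  # bp, "total length" line of both reports

# Assumed Mammalia-only run (figure from the other report for this scaffold).
mammalia_only = masked_percent(301634, TOTAL_LENGTH)
# Run with the species-specific library (report above).
species_specific = masked_percent(562909, TOTAL_LENGTH)

print(f"Mammalia library:         {mammalia_only:.2f} %")
print(f"Species-specific library: {species_specific:.2f} %")
print(f"Difference:               {species_specific - mammalia_only:.2f} points")
```

Both values agree with the "bases masked" lines of the respective reports, so the roughly 30-point gap is a property of the libraries rather than a rounding artefact.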

2017-09-11 14:33 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:

> Dear Carson:
>
> I see. Thank you. I will try it.
>
> Best
> Quanwei
>
> 2017-09-11 13:46 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>
>> Each node is a single machine. Because you currently run without MPI,
>> each MAKER job you submit runs on a single machine. So you are either
>> running multiple times on the same node, or you submitted 5 separate batch
>> jobs in which case you may have a single maker process on each of 5 nodes.
>>
>> MPI can parallelize on the same node or across nodes. If you request 10
>> nodes, then it can communicate across nodes to run the job on all hardware.
>> Or you can run MPI on a single node and ask for all CPUs on that node. In
>> that case it will split up work within a single node and use all resources
>> just on that node. So if you can't get MPI to work across nodes, you can
>> just submit a job that goes to a single node and ask for all CPUs on that
>> node (multinode jobs may be hard to configure, but single node jobs are
>> very easy). Just set the -n parameter of mpiexec to the CPU count of that
>> node, and it will parallelize within the node.
>>
>> Example command for a 20 CPU node ->  mpiexec -n 20 maker
>>
>> --Carson
>>
>>
>>
>>
>>
>> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>> wrote:
>>
>> Dear Carson:
>>
>> Would you please explain what do you mean by "a single machine"? I am
>> running maker2 on our high performance cluster. The cluster has more than
>> 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine was used
>> as the scheduler. Can I use MPICH3?
>>
>> Thanks
>>
>> Best
>> Quanwei
>>
>> 2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>
>>> If you are just using a single machine (and not cross machine MPI), use
>>> MPICH3 -> https://www.mpich.org
>>>
>>> It's easy to install yourself, and tends to be very robust to failure.
>>>
>>> --Carson
>>>
>>>
>>>
>>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>> wrote:
>>>
>>> Dear Carson:
>>>
>>> I met some problems to use MPI. I will give it another try.
>>> Thank you!
>>>
>>> Best
>>> Quanwei
>>>
>>> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>
>>>> It could be either. Please use MPI instead of starting multiple
>>>> instances. It will greatly reduce both IO and RAM usage.
>>>>
>>>> --Carson
>>>>
>>>>
>>>>
>>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>> wrote:
>>>>
>>>> Dear Carson:
>>>>
>>>> I only run 5 Maker instances in each directory (and set cpus=2). If it
>>>> is related to memory issue or an IO issue, I am not sure why the much
>>>> longer scaffolds (than the failed ones) were all annotated successfully,
>>>> but the relatively shorter ones failed.
>>>>
>>>> I have set "tries=5" (#number of times to try a contig if there is a
>>>> failure for some reason). I will try "clean_try=1" and test on the failed
>>>> scaffolds individually with larger memory to see whether they can be
>>>> annotated.
>>>>
>>>> Thank you!
>>>>
>>>> Best
>>>> Quanwei
>>>>
>>>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>
>>>>> I think the cause of the error may have been a little further upstream
>>>>> from what you pasted in the e-mail. One thing that may be happening is that
>>>>> you are taxing resources (like IO) if running MAKER multiple times or on
>>>>> too many CPUs. That can lead to failures because of truncated BLAST reports
>>>>> etc. In which case you can just retry and that will get around those types
>>>>> of IO derived errors. MAKER can generate a lot of IO, and if you are
>>>>> working on network mounted locations (i.e. the storage being used is
>>>>> actually across the network), then they can be less robust than local
>>>>> storage (when under heavy load NFS can falsely report success on read/write
>>>>> operations that actually failed). It's the reason we built in the retry
>>>>> capabilities of MAKER.
>>>>>
>>>>> For contigs that continuously fail, you may need to set clean_try=1.
>>>>> That will cause failures to start from scratch (i.e. delete all old reports
>>>>> on failure rather than just those suspected of being truncated).
>>>>>
>>>>> --Carson
>>>>>
>>>>>
>>>>>
>>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>>> wrote:
>>>>>
>>>>> Dear Carson:
>>>>>
>>>>> About the error in my above email, I found the contig was correctly
>>>>> annotated at the second time RETRY. So please ignore my last email. But
>>>>> now, for a few number of scaffolds, I met problems to process the repeats
>>>>> (as shown below in red). I used both Mammalia repeat library and species
>>>>> specific repeat library (which is generated by your pipeline
>>>>> "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic").
>>>>> There were no such problems when I
>>>>> only used Mammalia repeat library. Do you have any ideas about this? What
>>>>> could be the reason? Or do you have any suggestions for me to find the
>>>>> reason? Many thanks
>>>>>
>>>>> Here are some parameters I used
>>>>>
>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>> model_org=Mammalia #select a model organism for RepBase masking in
>>>>> RepeatMasker
>>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism
>>>>> specific repeat library in fasta format for Repe
>>>>>
>>>>> max_dna_len=300000
>>>>> split_hit=40000
>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>
>>>>>
>>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>>>>> --> rank=NA, hostname=n409
>>>>> ERROR: Failed while processing all repeats
>>>>> ERROR: Chunk failed at level:3, tier_type:1
>>>>> FAILED CONTIG:Contig31
>>>>>
>>>>>
>>>>> Best
>>>>> Quanwei
>>>>>
>>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>>>>
>>>>>> Dear Carson:
>>>>>>
>>>>>> I got the following error again. Is this still related to memory
>>>>>> issues? I wonder whether there can be other reasons lead to this error?
>>>>>> This time, I got this error during training of the SNAP model. Before, even
>>>>>> I set  max_dna_len=1Mb, I can train the model successfully.  And in the
>>>>>> current training (where I get the following error),  I have decreased the
>>>>>> max_dna_len to 300kb. I required the same amount memory as before. The only
>>>>>> difference is that I am using both mammalian repeat library and species
>>>>>> specific repeat library, while previously I only use the mammalian repeat
>>>>>> library. Will it greatly increases the requirement of memory to use both
>>>>>> repeat libraries (even when I decrease max_dna_len from 1Mb to 300kb)? I
>>>>>> have also set the depth_blast as 30 in current training.
>>>>>>
>>>>>> Thank you! Have a nice weekend!
>>>>>>
>>>>>>
>>>>>>
>>>>>> #---------------------------------------------------------------------
>>>>>> Now starting the contig!!
>>>>>> SeqID: Contig10
>>>>>> Length: 18773588
>>>>>> #---------------------------------------------------------------------
>>>>>>
>>>>>>
>>>>>> setting up GFF3 output and fasta chunks
>>>>>> doing repeat masking
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> collecting blastx repeatmasking
>>>>>> processing all repeats
>>>>>> doing repeat masking
>>>>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
>>>>>> --> rank=NA, hostname=n224
>>>>>> ERROR: Failed while doing repeat masking
>>>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>>>> FAILED CONTIG:Contig10
>>>>>>
>>>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>>>> FAILED CONTIG:Contig10
>>>>>>
>>>>>> Best
>>>>>> Quanwei
>>>>>>
>>>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>>
>>>>>>>
>>>>>>> (2) By reading some of your replies in the maker google group, and I
>>>>>>> noticed that it can reduce memory and save time for annotation if I set
>>>>>>> depth_blast to a certain number. So I changed the following parameters. But
>>>>>>> I wonder, whether it will decrease the quality of annotation? If it won't
>>>>>>> affect the quality, can I even use a smaller number (e.g., 20) to save more
>>>>>>> memory and time?
>>>>>>>
>>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>>>
>>>>>>>
>>>>>>> These values really only affect the final evidence kept in the GFF3
>>>>>>> when you look at it in a browser. They have no effect on the annotation. This
>>>>>>> is because internally MAKER already collapses evidence down to the 10 best
>>>>>>> non-redundant features per evidence set per locus. The rest are put in the
>>>>>>> GFF3 just for reference. By setting it lower, you are just letting MAKER
>>>>>>> know it can throw things away even sooner since you don't want them in
>>>>>>> the GFF3. It provides a minor improvement for memory use, but
>>>>>>> max_dna_len is the big one that has the greatest effect.
>>>>>>>
>>>>>>>
>>>>>>> (3) I also have some concerns about the speed, especially for the
>>>>>>> long scaffolds (around 100Mb). I wonder which part is the most time
>>>>>>> consuming for genome annotation (repeat masking, blast, or polishing?).
>>>>>>> Particularly, I wonder whether the blastx of protein evidence will take
>>>>>>> majority of time. Now, I have prepared 99k mammalian Swiss protein
>>>>>>> sequences and 340k rodent TrEMBL protein sequences as protein evidences. I
>>>>>>> am considering whether I can save much time if I only use the 99k mammalian
>>>>>>> Swiss protein sequences as evidences.
>>>>>>>
>>>>>>>
>>>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at
>>>>>>> least 6 times slower than BLASTN
>>>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at
>>>>>>> least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>>>>
>>>>>>> Also, doubling the dataset size doubles the runtime. Larger window
>>>>>>> sizes via max_dna_len will also increase runtimes.
>>>>>>>
>>>>>>>
>>>>>>> (4) For some reason, I cannot run maker through MPI on our cluster.
>>>>>>> So I can only start multiple maker instances. I wonder if it is possible to
>>>>>>> let multiple maker instances annotate the same long scaffold (i.e., for a
>>>>>>> single sequence I start multiple maker instances, without splitting the
>>>>>>> long sequence into shorter ones).
>>>>>>>
>>>>>>>
>>>>>>> Without MPI you won't be able to split up large contigs. At the very
>>>>>>> least you can try and run on a single node and set MPI to use all CPUs on
>>>>>>> that node. It's less difficult to set up compared to cross node jobs via
>>>>>>> MPI.
>>>>>>>
>>>>>>>
>>>>>>> (5) Still about the speed issue. I read some of your comments about
>>>>>>> the "cpus" parameter in the maker_opts file
>>>>>>> (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html).
>>>>>>> And I know it indicates the number
>>>>>>> of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file,
>>>>>>> then I can use the following command to submit the job, right?
>>>>>>>
>>>>>>>
>>>>>>> The cpus parameter only affects how many CPUs are given to the blast
>>>>>>> command line, so only the BLAST step will speed up. I recommend using
>>>>>>> MPI to get all steps to speed up. Even if you are only running on a single
>>>>>>> node, you can give all CPUs to the mpiexec command.
>>>>>>>
>>>>>>>
>>>>>>> --Carson
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>

From mathog at caltech.edu  Wed Sep 13 12:01:11 2017
From: mathog at caltech.edu (mathog)
Date: Wed, 13 Sep 2017 11:01:11 -0700
Subject: [maker-devel] OpenMPI issues,
 no response in two attempts to subscribe to list
Message-ID: <a1a5313e90b4f2043f9910f2624986eb@saf.bio.caltech.edu>

Greetings,

I'm trying to run maker 2.31.9 with OpenMPI on a Centos 6.7 system.  It 
just won't start.  OpenMPI works fine with a small test program, it just 
doesn't work with maker.  It fails in exactly the same way on a second 
Centos system with minor software differences (Centos 6.9 and perl 5.20 
compiled without thread support, the perl on the first machine had 
thread support.) The gory details were posted already in a Centos forum 
so rather than repeat it all here, this is a link to that thread:

    https://www.centos.org/forums/viewtopic.php?f=14&t=64099

maker was unpacked from the maker-2.31.9.tgz a second time (after moving 
the original) after adding "module add openmpi-x86_64" to my 
.bash_profile and logging in cleanly.  It was rebuilt.  The build messages were 
identical to the previous ones and when a run was attempted it also 
failed in exactly the same way.

I also tried to subscribe to the list here

   
https://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

once yesterday, and once today, but no email ever came back.  Hopefully 
this message gets through!

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From carsonhh at gmail.com  Wed Sep 13 12:23:11 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Sep 2017 12:23:11 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FSJcdMTuLbN8-O-Vb7OxEXvU=F3Nwz7Rhx6qzXJ30ubV1w@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
	<A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
	<CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>
	<CAOW6FSJcdMTuLbN8-O-Vb7OxEXvU=F3Nwz7Rhx6qzXJ30ubV1w@mail.gmail.com>
Message-ID: <6B4E0F47-7186-4B12-97A1-EB63D2EB7079@gmail.com>

These are the 3 errors you have shown in your e-mails ->
open3: fork failed: Cannot allocate memory at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.

The first two are memory related. The second occurs because MAKER cannot kill a lock maintainer thread that it was never able to start for lack of memory.

The third one is IO related. It comes from a truncated file, and according to the e-mail you sent the contig succeeded on the second try.
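
To make the mapping above concrete, here is a minimal Python sketch that tags log lines with the likely cause named in this e-mail. The substrings come from the three messages quoted above; the function names and the cause labels ("memory", "io") are illustrative assumptions, and the mapping only reflects the diagnosis given for this particular run, not a general rule for MAKER logs.

```python
# Substrings taken from the three error messages quoted above. The cause
# labels are assumptions that follow the explanation given in this e-mail.
KNOWN_ERRORS = [
    ("open3: fork failed: Cannot allocate memory", "memory"),
    ("Can't kill a non-numeric process ID", "memory"),
    ("Bio/Search/Hit/PhatHit/Base.pm", "io"),
]


def triage(log_line):
    """Return the likely cause for one log line, or None if it is not recognized."""
    for needle, cause in KNOWN_ERRORS:
        if needle in log_line:
            return cause
    return None


def count_causes(log_lines):
    """Count how many recognized errors of each cause appear in a log."""
    counts = {}
    for line in log_lines:
        cause = triage(line)
        if cause is not None:
            counts[cause] = counts.get(cause, 0) + 1
    return counts
```

Running count_causes over the lines of a saved MAKER log gives a quick picture of whether failures cluster around memory or IO before deciding what to change.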


IO errors are quite common with NFS (network mounted file systems). It's one of the most frequent issues submitted to the devel list. MAKER can hit IO limits long before it hits CPU limits. One of the most frequent causes of these issues is that the user set TMP= in the control files to a manual location that is not suitable for high IO (note TMP= defaults to /tmp). The location should always be a true locally mounted disk. Sometimes what looks like a local directory is actually a virtual location (not really local disk, but network mounted disk or an in-memory location). With the former you will get frequent IO failures, and with the latter you will also get out-of-memory issues.

Note that when you supply more data files you will also use more memory (to hold analysis results). According to your e-mail the last error you got was "Can't kill a non-numeric process ID". Correct? So getting the error with two input files but not with a single input file further suggests you are running low on RAM.

Some things to check:
1. Make sure TMP= is not being set to a network mounted location.
2. Make sure your temporary directory is not a virtual in memory directory on the node being used.
3. If nodes are shared, you may run out of memory because of other users or because you failed to request enough RAM during job submission.
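
The mount checks in the list above can be sketched in Python. This is a minimal example, assuming a Linux node where /proc/mounts is readable; the function names and the sets of "network" and "in-memory" filesystem types are illustrative assumptions, not anything defined by MAKER. It reports which mount point holds a directory and what filesystem type backs it.

```python
import os

# Filesystem types that suggest a network mount or an in-memory directory.
# These sets are illustrative assumptions and are certainly not exhaustive.
NETWORK_FS = {"nfs", "nfs4", "cifs", "lustre", "gpfs", "beegfs"}
MEMORY_FS = {"tmpfs", "ramfs"}


def fs_type_for(path, mounts_text):
    """Return (mount_point, fs_type) for the longest mount point containing path."""
    path = os.path.abspath(path)
    best = ("", "unknown")
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix) and len(mount_point) > len(best[0]):
            best = (mount_point, fs_type)
    return best


def classify(fs_type):
    """Label a filesystem type as 'network', 'memory', or 'other'."""
    if fs_type in NETWORK_FS:
        return "network"
    if fs_type in MEMORY_FS:
        return "memory"
    return "other"


if __name__ == "__main__" and os.path.exists("/proc/mounts"):
    tmp_dir = "/tmp"  # replace with whatever TMP= is set to in the control file
    with open("/proc/mounts") as handle:
        mount_point, fs_type = fs_type_for(tmp_dir, handle.read())
    print(tmp_dir, mount_point, fs_type, classify(fs_type))
```

A result of "network" or "memory" for the temporary directory would point at items 1 or 2 in the list; "other" usually means an ordinary disk filesystem, but it is worth confirming by eye on the node itself.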

Finally, try running interactively so you can see what the memory and directory locations look like on the node you get assigned for the job (check space and mount points; is /tmp, or wherever you set TMP=, in fact a local disk?). Also run with MPI rather than starting multiple MAKER instances. It uses resources better.
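
As a minimal sketch of the single-node MPI launch recommended in this thread (mpiexec -n <CPU count> maker), the Python wrapper below builds that command from the CPU count of the node. The helper name and the use of os.cpu_count() are illustrative assumptions; under a scheduler, the number of slots actually granted to the job may be smaller than the total on the node, in which case that number should be passed instead.

```python
import os
import shutil
import subprocess


def mpiexec_command(n_procs, program="maker"):
    """Build the 'mpiexec -n N maker' command line as an argument list."""
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    return ["mpiexec", "-n", str(n_procs), program]


if __name__ == "__main__":
    # Assumption: use every CPU the operating system reports on this node.
    n_cpus = os.cpu_count() or 1
    cmd = mpiexec_command(n_cpus)
    print(" ".join(cmd))
    # Only launch when both executables are actually on PATH.
    if shutil.which("mpiexec") and shutil.which("maker"):
        subprocess.run(cmd, check=True)
```

On a 20 CPU node this produces the same command given earlier in the thread, mpiexec -n 20 maker, and it should be run from the directory that holds the control files.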

Thanks,
Carson


> On Sep 13, 2017, at 8:32 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> I did more tests on one of the contigs (length 863kb) that failed during repeat masking. I found it only fails when I add the species specific repeat library; it is annotated successfully when only the mammalian repeat library is used. For this test I picked only this contig and ran maker with 64G memory. So I think the failure should not be a memory or IO problem, because even contigs of length 98Mb can be annotated with 32G memory. 
> 
> I also ran RepeatMasker on this contig with the mammalian and species specific repeat libraries separately. With the mammalian repeat library about 35% was masked as repeats, while with the species specific repeat library it is 65% (as shown below). I wonder whether the high level of repeats can lead to the failure of this contig. Do you have any ideas about this? Thanks
> 
> 
> 
> file name: test_scaffold31.fasta    
> sequences:             1
> total length:     863590 bp  (858757 bp excl N/X-runs)
> GC level:         37.02 %
> bases masked:     562909 bp ( 65.18 %)
> ==================================================
>                number of      length   percentage
>                elements*    occupied  of sequence
> --------------------------------------------------
> SINEs:              113        16134 bp    1.87 %
>       ALUs           71        12479 bp    1.45 %
>       MIRs            1          133 bp    0.02 %
> 
> LINEs:              251       380142 bp   44.02 %
>       LINE1         211       210623 bp   24.39 %
>       LINE2           1           86 bp    0.01 %
>       L3/CR1          0            0 bp    0.00 %
> 
> LTR elements:       246       101221 bp   11.72 %
>       ERVL            5         1037 bp    0.12 %
>       ERVL-MaLRs     18         2744 bp    0.32 %
>       ERV_classI    201        90942 bp   10.53 %
>       ERV_classII    18         5964 bp    0.69 %
> 
> DNA elements:        39        14177 bp    1.64 %
>      hAT-Charlie      7         3864 bp    0.45 %
>      TcMar-Tigger     7         1706 bp    0.20 %
> 
> Unclassified:       196        45831 bp    5.31 %
> 
> Total interspersed repeats:   557505 bp   64.56 %
> 
> 
> Small RNA:            3          823 bp    0.10 %
> 
> Satellites:           2          237 bp    0.03 %
> Simple repeats:      94         4472 bp    0.52 %
> Low complexity:      18          766 bp    0.09 %
> ==================================================
> 
> * most repeats fragmented by insertions or deletions
>   have been counted as one element
>                                                       
> 
> The query species was assumed to be homo          
> RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
>         
> run with rmblastn version 2.2.27+
> The query was compared to classified sequences in ".../consensi.fa.classifiednoProtFinal"  
> 
> 
> Best
> Quanwei
> 


From carsonhh at gmail.com  Wed Sep 13 12:26:08 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Sep 2017 12:26:08 -0600
Subject: [maker-devel] Repeats annotation
In-Reply-To: <CAOW6FS+YBobDCpEAZ=8YYoV+LKrTV7qJhmKYktm15pAh84i3Kw@mail.gmail.com>
References: <CAOW6FS+YBobDCpEAZ=8YYoV+LKrTV7qJhmKYktm15pAh84i3Kw@mail.gmail.com>
Message-ID: <40F80C42-836A-41FF-9C9F-1F45C5816283@gmail.com>

I don't know of any tool to analyze the repeat info. MAKER really only focuses on getting the masking done for the gene prediction, and while it does keep the repeats as features in the GFF3, it does not do any kind of analysis. You would have to do that outside of MAKER.
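
As a rough starting point for that kind of outside analysis, here is a minimal shell sketch that tallies features from a GFF3 file. It assumes the repeat features are written with type "match" in column 3 and with the name of the masking program in column 2; check a few lines of your own GFF3 before relying on that. The function name summarize_matches is hypothetical and is not part of MAKER.

```shell
# Hypothetical helper (not part of MAKER): for every GFF3 feature of type
# "match", print source (column 2), feature count, and summed feature
# length in bp, one source per line, tab separated.
summarize_matches() {
    awk -F'\t' '
        # Skip comment/pragma lines; keep only features typed "match".
        !/^#/ && $3 == "match" {
            n[$2]++                  # number of features for this source
            bp[$2] += $5 - $4 + 1    # GFF3 coordinates are 1-based, inclusive
        }
        END {
            for (s in n) printf "%s\t%d\t%d\n", s, n[s], bp[s]
        }
    ' "$1" | sort
}
```

Call it as `summarize_matches your_genome.gff` (file name assumed). Overlapping features are counted more than once, so the bp totals are an upper bound and will not match RepeatMasker's own "bases masked" figure exactly.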

--Carson


> On Sep 13, 2017, at 8:51 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> We have generated a species-specific repeat library following your pipeline (http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic), and annotated the genome with MAKER2 using both the species-specific repeat library and the mammalian repeat library.
> 
> Now we want to compare the repeat content among different species. So I want to generate species-specific repeat libraries for the other species and again use both their species-specific repeat library and the mammalian repeat library. But I found that I can only provide either the species-specific repeat library or the mammalian repeat library to RepeatMasker (not both). I wonder whether I can run MAKER2 on those genomes for repeat masking only.
> 
> BTW, by running RepeatMasker we can get a summary report (as below). I wonder whether there is any script in MAKER2 to analyze repeat elements (or other tools to process the output of MAKER2).
> 
> Many thanks
> 
> 
> file name: test_scaffold31.fasta    
> sequences:             1
> total length:     863590 bp  (858757 bp excl N/X-runs)
> GC level:         37.02 %
> bases masked:     301634 bp ( 34.93 %)
> ==================================================
>                number of      length   percentage
>                elements*    occupied  of sequence
> --------------------------------------------------
> SINEs:               134        14362 bp    1.66 %
>       Alu/B1          28         2183 bp    0.25 %
>       MIRs            21         2860 bp    0.33 %
> 
> LINEs:               188       129104 bp   14.95 %
>       LINE1          168       124633 bp   14.43 %
>       LINE2           16         4266 bp    0.49 %
>       L3/CR1           4          205 bp    0.02 %
>       RTE              0            0 bp    0.00 %
> 
> LTR elements:        127       101129 bp   11.71 %
>       ERVL            10         3057 bp    0.35 %
>       ERVL-MaLRs      22         6902 bp    0.80 %
>       ERV_classI      66        80258 bp    9.29 %
>       ERV_classII     29        10912 bp    1.26 %
> 
> DNA elements:         27         4402 bp    0.51 %
>       hAT-Charlie     13         1836 bp    0.21 %
>       TcMar-Tigger     8         1651 bp    0.19 %
> 
> Unclassified:          4         1590 bp    0.18 %
> 
> Total interspersed repeats:    250587 bp   29.02 %
> 
> 
> Small RNA:             9          616 bp    0.07 %
> 
> Satellites:           66        40820 bp    4.73 %
> Simple repeats:      159         7235 bp    0.84 %
> Low complexity:       50         2766 bp    0.32 %
> ==================================================
> 
> * most repeats fragmented by insertions or deletions
>   have been counted as one element
>                                                       
> 
> The query species was assumed to be mammalia      
> RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
>         
> run with rmblastn version 2.2.27+ 


From carsonhh at gmail.com  Wed Sep 13 12:41:24 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Sep 2017 12:41:24 -0600
Subject: [maker-devel] OpenMPI issues,
 no response in two attempts to subscribe to list
In-Reply-To: <a1a5313e90b4f2043f9910f2624986eb@saf.bio.caltech.edu>
References: <a1a5313e90b4f2043f9910f2624986eb@saf.bio.caltech.edu>
Message-ID: <BA16E294-BE01-47DC-8113-C018C38480FC@gmail.com>

Hi David,

First thing: MAKER binds shared C libraries using Perl, so you have to tell MAKER where to find the needed files before you install it. It then compiles the bindings and saves them for MAKER to use. If you have two MPI installations, you may have MAKER set up to use one of them while you are trying to call it with the other. That would break the compiled bindings.

Also make sure you did the following (info from the .../maker/INSTALL instructions file) -->

"make sure to set LD_PRELOAD to the location of libmpi.so before even trying to install MAKER. It must also be set before running MAKER (or any program that binds OpenMPI's shared libraries), so it's best just to add it to your ~/.bash_profile. (i.e. export LD_PRELOAD=/usr/local/openmpi/lib/libmpi.so)."

Remember to replace '/usr/local/openmpi/lib/libmpi.so' with the actual location of the file.
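
A minimal sketch of that step, assuming a bash-like shell. The LIBMPI variable and the check_libmpi helper are hypothetical names used only for illustration, and the default path is the example path from the INSTALL text, not necessarily where your libmpi.so lives.

```shell
# Assumed location; replace with the actual path to libmpi.so on your system.
LIBMPI=${LIBMPI:-/usr/local/openmpi/lib/libmpi.so}

# Hypothetical helper: succeed only if the given path is an existing file.
check_libmpi() {
    [ -f "$1" ]
}

if check_libmpi "$LIBMPI"; then
    # Must be set both before installing MAKER and before running it.
    export LD_PRELOAD="$LIBMPI"
    echo "LD_PRELOAD=$LD_PRELOAD"
else
    echo "libmpi.so not found at $LIBMPI; LD_PRELOAD left unchanged" >&2
fi
```

Putting the export line in ~/.bash_profile, as the INSTALL file suggests, makes the same value visible at both install time and run time.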

Second, once you can get MAKER to start under OpenMPI, you may get freezes or failures partway into a run, because the OpenFabrics libraries use registered memory in a weird way that can cause system calls in a program to fail with a snowballing error effect. Adding this to the mpiexec options can stop this from occurring --> '-mca btl ^openib'

That option has the side effect of disabling InfiniBand and using the Ethernet adapter instead. However, if you need to use the InfiniBand adapter, you can use these flags instead --> '--mca btl vader,tcp,self --mca btl_tcp_if_include ib0'

Those flags use IP over InfiniBand rather than native InfiniBand, which has the same effect of disabling the OpenFabrics libraries.
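
For reference, the two variants written out as full command lines. This is only a sketch: it builds and prints the commands rather than running them, the process count of 20 is an example value, and ib0 is the interface name used above, which may differ on your nodes.

```shell
# Number of MPI processes; 20 is only an example value.
NP=${NP:-20}

# Variant 1: exclude the openib BTL (traffic falls back to Ethernet).
cmd_no_openib="mpiexec -mca btl ^openib -n $NP maker"

# Variant 2: keep the InfiniBand adapter but use it via IP over InfiniBand.
cmd_ipoib="mpiexec --mca btl vader,tcp,self --mca btl_tcp_if_include ib0 -n $NP maker"

# Print the commands for review instead of running them.
echo "$cmd_no_openib"
echo "$cmd_ipoib"
```

Copy whichever line fits your cluster into the job script once the printed command looks right.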

Thanks,
Carson


> On Sep 13, 2017, at 12:01 PM, mathog <mathog at caltech.edu> wrote:
> 
> Greetings,
> 
> I'm trying to run maker 2.31.9 with OpenMPI on a Centos 6.7 system.  It just won't start.  OpenMPI works fine with a small test program, it just doesn't work with maker.  It fails in exactly the same way on a second Centos system with minor software differences (Centos 6.9 and perl 5.20 compiled without thread support, the perl on the first machine had thread support.) The gory details were posted already in a Centos forum so rather than repeat it all here, this is a link to that thread:
> 
>   https://www.centos.org/forums/viewtopic.php?f=14&t=64099
> 
> maker was unpacked from the maker-2.31.9.tgz a second time (after moving the original) after setting up the "module add openmpi-x86_64" to my .bash_profile
> and logging in cleanly.  It was rebuilt.  The build messages were identical to the previous ones and when a run was attempted it also failed in exactly the same way.
> 
> I also tried to subscribe to the list here
> 
>  https://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
> 
> once yesterday, and once today, but no email ever came back.  Hopefully this message gets through!
> 
> Regards,
> 
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> 
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org


From qwzhang0601 at gmail.com  Wed Sep 13 13:42:01 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 13 Sep 2017 15:42:01 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <6B4E0F47-7186-4B12-97A1-EB63D2EB7079@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
	<A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
	<CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>
	<CAOW6FSJcdMTuLbN8-O-Vb7OxEXvU=F3Nwz7Rhx6qzXJ30ubV1w@mail.gmail.com>
	<6B4E0F47-7186-4B12-97A1-EB63D2EB7079@gmail.com>
Message-ID: <CAOW6FS+JddppLEgbNjS3ipBUc67tbC1FeUfi9=8TzN+NB35X8g@mail.gmail.com>

Dear Carson:

Thank you for your explanation, and sorry for not describing my problem
clearly. The first two errors disappeared after I changed the parameters
you suggested (e.g., max_dna_len, depth_blast). Now I get only the
following error, for two contigs out of thousands. One of the two failed
contigs is 863 kb long, and I have run more tests on it individually.
Running RepeatMasker on this contig masks 65% of it with the
species-specific repeat library but only 35% with the mammalian repeat
library. Since much longer contigs (even 98 Mb) are all annotated
successfully, I do not understand why this much shorter one would fail
because of IO.

I did not set "TMP", and I am running on a high-performance cluster. I am
not sure whether the temporary directory is a virtual in-memory location or
not. I will check it later. Many thanks

Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm
line 188.
33708 --> rank=NA, hostname=n409
33709 ERROR: Failed while processing all repeats
33710 ERROR: Chunk failed at level:3, tier_type:1
33711 FAILED CONTIG:Contig31

Best
Quanwei

2017-09-13 14:23 GMT-04:00 Carson Holt <carsonhh at gmail.com>:

> These are the 3 errors you have shown in your e-mails -->
> open3: fork failed: Cannot allocate memory at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>
> The first two are memory related with the second being because it cannot
> kill a lock maintainer thread that it was not able to start because of lack
> of memory.
>
> The third one is IO related. It is a truncated file that succeeded on the
> second try according to the e-mail you sent.
>
>
> IO errors are quite common with NFS (network mounted file systems). It's
> one of the most frequent issues submitted to the devel list. MAKER can hit
> IO limits long before it hits CPU limits. One of the most frequent causes
> of these issues is that the user set TMP= in the control files to a manual
> location that is not suitable for high IO (note TMP= defaults to /tmp). The
> location should always be a true locally mounted disk. Sometimes the
> location is virtual (not really a local disk but a network mounted disk or
> an in-memory location). With the former you will get frequent IO failures,
> and with the latter you will also get out-of-memory issues.
>
> Note that when you supply more data files you will also use more memory
> (to hold analysis results). According to your e-mail the last error you got
> was 'Can't kill a non-numeric process ID'. Correct? So getting the error
> with two input files but not when you supply a single input file further
> suggests you are running low on RAM.
>
> 1. Some things to check. Make sure TMP= is not being set to a network
> mounted location.
> 2. Make sure your temporary directory is not a virtual in memory directory
> on the node being used.
> 3. If nodes are shared, you may run out of memory because of other users
> or because you failed to request enough RAM during job submission.
>
> Finally, try running interactively so you can see what the memory and
> directory locations look like on the node you get assigned for the job
> (check space and mount points. Is /tmp or wherever you set TMP= in fact a
> local disk?). Also run with MPI rather than starting multiple MAKER
> instances. It uses resources better.
>
> Thanks,
> Carson
>
>
>
>
>
>
> On Sep 13, 2017, at 8:32 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>
> Dear Carson:
>
> I ran more tests on one of the contigs (length 863 kb) that failed during
> repeat masking. It fails only when I add the species-specific repeat
> library, and it is annotated successfully when I use only the mammalian
> repeat library. For this test I ran MAKER on this contig alone with 64 GB
> of memory. So I do not think the failure is a memory or IO problem, because
> even contigs of 98 Mb can be annotated with 32 GB of memory.
>
> I also ran RepeatMasker on this contig with the mammalian and the
> species-specific repeat libraries separately. With the mammalian repeat
> library about 35% was masked as repeats, while with the species-specific
> repeat library it was 65% (as shown below). I wonder whether the high
> level of repeats can lead to the failure of this contig. Do you have any
> ideas about this? Thanks
>
>
>
> file name: test_scaffold31.fasta
> sequences:             1
> total length:     863590 bp  (858757 bp excl N/X-runs)
> GC level:         37.02 %
> bases masked:     562909 bp ( 65.18 %)
> ==================================================
>                number of      length   percentage
>                elements*    occupied  of sequence
> --------------------------------------------------
> SINEs:              113        16134 bp    1.87 %
>       ALUs           71        12479 bp    1.45 %
>       MIRs            1          133 bp    0.02 %
>
> LINEs:              251       380142 bp   44.02 %
>       LINE1         211       210623 bp   24.39 %
>       LINE2           1           86 bp    0.01 %
>       L3/CR1          0            0 bp    0.00 %
>
> LTR elements:       246       101221 bp   11.72 %
>       ERVL            5         1037 bp    0.12 %
>       ERVL-MaLRs     18         2744 bp    0.32 %
>       ERV_classI    201        90942 bp   10.53 %
>       ERV_classII    18         5964 bp    0.69 %
>
> DNA elements:        39        14177 bp    1.64 %
>      hAT-Charlie      7         3864 bp    0.45 %
>      TcMar-Tigger     7         1706 bp    0.20 %
>
> Unclassified:       196        45831 bp    5.31 %
>
> Total interspersed repeats:   557505 bp   64.56 %
>
>
> Small RNA:            3          823 bp    0.10 %
>
> Satellites:           2          237 bp    0.03 %
> Simple repeats:      94         4472 bp    0.52 %
> Low complexity:      18          766 bp    0.09 %
> ==================================================
>
> * most repeats fragmented by insertions or deletions
>   have been counted as one element
>
>
> The query species was assumed to be homo
> RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
>
> run with rmblastn version 2.2.27+
> The query was compared to classified sequences in ".../consensi.fa.classifiednoProtFinal"
>
>
>
> Best
> Quanwei
>
> 2017-09-11 14:33 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>
>> Dear Carson:
>>
>> I see. Thank you. I will try it.
>>
>> Best
>> Quanwei
>>
>> 2017-09-11 13:46 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>
>>> Each node is a single machine. Because you currently run without MPI,
>>> each MAKER job you submit runs on a single machine. So you are either
>>> running multiple times on the same node, or you submitted 5 separate batch
>>> jobs in which case you may have a single maker process on each of 5 nodes.
>>>
>>> MPI can parallelize on the same node or across nodes. If you request 10
>>> nodes, then it can communicate across nodes to run the job on all hardware.
>>> Or you can run MPI on a single node and ask for all CPUs on that node. In
>>> that case it will split up work within a single node and use all resources
>>> just on that node. So if you can't get MPI to work across nodes, you can
>>> just submit a job that goes to a single node and ask for all CPUs on that
>>> node (multinode jobs may be hard to configure, but single node jobs are
>>> very easy). Just set the -n parameter of mpiexec to the CPU count of that
>>> node, and it will parallelize within the node.
>>>
>>> Example command for a 20 CPU node -->  mpiexec -n 20 maker
>>>
>>> --Carson
>>>
>>>
>>>
>>>
>>>
>>> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>> wrote:
>>>
>>> Dear Carson:
>>>
>>> Would you please explain what do you mean by "a single machine"? I am
>>> running maker2 on our high performance cluster. The cluster has more than
>>> 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine was used
>>> as the scheduler. Can I use MPICH3?
>>>
>>> Thanks
>>>
>>> Best
>>> Quanwei
>>>
>>> 2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>
>>>> If you are just using a single machine (and not cross machine MPI), use
>>>> MPICH3 --> https://www.mpich.org
>>>>
>>>> It's easy to install yourself, and tends to be very robust to failure.
>>>>
>>>> --Carson
>>>>
>>>>
>>>>
>>>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>> wrote:
>>>>
>>>> Dear Carson:
>>>>
>>>> I ran into some problems using MPI. I will give it another try.
>>>> Thank you!
>>>>
>>>> Best
>>>> Quanwei
>>>>
>>>> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>
>>>>> It could be either. Please use MPI instead of starting multiple
>>>>> instances. It will greatly reduce both IO and RAM usage.
>>>>>
>>>>> --Carson
>>>>>
>>>>>
>>>>>
>>>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>>> wrote:
>>>>>
>>>>> Dear Carson:
>>>>>
>>>>> I only run 5 Maker instances in each directory (and set cpus=2). If it
>>>>> is related to memory issue or an IO issue, I am not sure why the much
>>>>> longer scaffolds (than the failed ones) were all annotated successfully,
>>>>> but the relatively shorter ones failed.
>>>>>
>>>>> I have set "tries=5" (#number of times to try a contig if there is a
>>>>> failure for some reason). I will try "clean_try=1" and test on the failed
>>>>> scaffolds individually with larger memory to see whether they can be
>>>>> annotated.
>>>>>
>>>>> Thank you!
>>>>>
>>>>> Best
>>>>> Quanwei
>>>>>
>>>>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>
>>>>>> I think the cause of the error may have been a little further
>>>>>> upstream from what you pasted in the e-mail. One thing that may be
>>>>>> happening is that you are taxing resources (like IO) by running MAKER
>>>>>> multiple times or on too many CPUs. That can lead to failures because of
>>>>>> truncated BLAST reports etc. In that case you can just retry, which will
>>>>>> get around those types of IO-derived errors. MAKER can generate a lot of
>>>>>> IO, and if you are working on network mounted locations (i.e. the storage
>>>>>> being used is actually across the network), they can be less robust
>>>>>> than local storage (under heavy load NFS can falsely report success on
>>>>>> read/write operations that actually failed). It's the reason we built
>>>>>> the retry capabilities into MAKER.
>>>>>>
>>>>>> For contigs that continuously fail, you may need to set clean_try=1.
>>>>>> That will cause failures to start from scratch (i.e. delete all old reports
>>>>>> on failure rather than just those suspected of being truncated).
>>>>>>
>>>>>> --Carson
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Dear Carson:
>>>>>>
>>>>>> About the error in my email above: the contig was correctly
>>>>>> annotated on the second RETRY, so please ignore my last email. But
>>>>>> now, for a few scaffolds, I have problems processing the repeats
>>>>>> (as shown below). I used both the Mammalia repeat library and a
>>>>>> species-specific repeat library (generated with your pipeline
>>>>>> "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic").
>>>>>> There were no such problems when I used only the Mammalia repeat
>>>>>> library. Do you have any ideas about this? What could be the reason?
>>>>>> Or do you have any suggestions for how to find the reason? Many thanks
>>>>>>
>>>>>> Here are some parameters I used
>>>>>>
>>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>>> model_org=Mammalia #select a model organism for RepBase masking in
>>>>>> RepeatMasker
>>>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism
>>>>>> specific repeat library in fasta format for Repe
>>>>>>
>>>>>> max_dna_len=300000
>>>>>> split_hit=40000
>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>>
>>>>>>
>>>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm
>>>>>> line 188.
>>>>>> 33708 --> rank=NA, hostname=n409
>>>>>> 33709 ERROR: Failed while processing all repeats
>>>>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>>>>> 33711 FAILED CONTIG:Contig31
>>>>>>
>>>>>>
>>>>>> Best
>>>>>> Quanwei
>>>>>>
>>>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>>>>>
>>>>>>> Dear Carson:
>>>>>>>
>>>>>>> I got the following error again. Is this still related to memory
>>>>>>> issues, or could there be other reasons for this error? This time I
>>>>>>> got the error while training the SNAP model. Before, even with
>>>>>>> max_dna_len=1Mb, I could train the model successfully, and in the
>>>>>>> current training (where I get the following error) I have decreased
>>>>>>> max_dna_len to 300kb. I requested the same amount of memory as before.
>>>>>>> The only difference is that I am now using both the mammalian repeat
>>>>>>> library and the species-specific repeat library, while previously I
>>>>>>> used only the mammalian repeat library. Does using both repeat
>>>>>>> libraries greatly increase the memory requirement (even when I decrease
>>>>>>> max_dna_len from 1Mb to 300kb)? I have also set depth_blast to 30 in
>>>>>>> the current training.
>>>>>>>
>>>>>>> Thank you! Have a nice weekend!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> #-----------------------------------------------------------
>>>>>>> ----------
>>>>>>> Now starting the contig!!
>>>>>>> SeqID: Contig10
>>>>>>> Length: 18773588
>>>>>>> #-----------------------------------------------------------
>>>>>>> ----------
>>>>>>>
>>>>>>>
>>>>>>> setting up GFF3 output and fasta chunks
>>>>>>> doing repeat masking
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> doing blastx repeats
>>>>>>> collecting blastx repeatmasking
>>>>>>> processing all repeats
>>>>>>> doing repeat masking
>>>>>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm
>>>>>>> line 1050.
>>>>>>> --> rank=NA, hostname=n224
>>>>>>> ERROR: Failed while doing repeat masking
>>>>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>>>>> FAILED CONTIG:Contig10
>>>>>>>
>>>>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>>>>> FAILED CONTIG:Contig10
>>>>>>>
>>>>>>> Best
>>>>>>> Quanwei
>>>>>>>
>>>>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>>>
>>>>>>>>
>>>>>>>> (2) By reading some of your replies in the maker google group, and
>>>>>>>> I noticed that it can reduce memory and save time for annotation if I set
>>>>>>>> depth_blast to a certain number. So I changed the following parameters. But
>>>>>>>> I wonder, whether it will decrease the quality of annotation? If it won't
>>>>>>>> affect the quality, can I even use a smaller number (e.g., 20) to save more
>>>>>>>> memory and time?
>>>>>>>>
>>>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>>>>
>>>>>>>>
>>>>>>>> These values really only affect the final evidence kept in the GFF3
>>>>>>>> when you look at it in a browser. They have no effect on the annotation.
>>>>>>>> This is because internally MAKER already collapses evidence down to the
>>>>>>>> 10 best non-redundant features per evidence set per locus. The rest are
>>>>>>>> put in the GFF3 just for reference. By setting them lower, you are just
>>>>>>>> letting MAKER know it can throw things away even sooner since you don't
>>>>>>>> want them in the GFF3. It provides a minor improvement in memory use, but
>>>>>>>> max_dna_len is the big one that has the greatest effect.
>>>>>>>>
>>>>>>>>
>>>>>>>> (3) I also have some concerns about the speed, especially for the
>>>>>>>> long scaffolds (around 100Mb). I wonder which part is the most time
>>>>>>>> consuming for genome annotation (repeat masking, blast, or polishing?).
>>>>>>>> Particularly, I wonder whether the blastx of protein evidence will take
>>>>>>>> majority of time. Now, I have prepared 99k mammalian Swiss protein
>>>>>>>> sequences and 340k rodent TrEMBL protein sequences as protein evidences. I
>>>>>>>> am considering whether I can save much time if I only use the 99k mammalian
>>>>>>>> Swiss protein sequences as evidences.
>>>>>>>>
>>>>>>>>
>>>>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at
>>>>>>>> least 6 times slower than BLASTN
>>>>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at
>>>>>>>> least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>>>>>
>>>>>>>> Also double the dataset size, double the runtime. Larger window
>>>>>>>> sizes via max_dna_len will also increase runtimes.
>>>>>>>>
>>>>>>>>
>>>>>>>> (4) For some reason, I cannot run MAKER through MPI on our
>>>>>>>> cluster, so I can only start multiple MAKER instances. I wonder if it
>>>>>>>> is possible to let multiple MAKER instances annotate the same long
>>>>>>>> scaffold (i.e., for a single sequence I start multiple MAKER instances,
>>>>>>>> without splitting the long sequence into shorter ones).
>>>>>>>>
>>>>>>>>
>>>>>>>> Without MPI you won't be able to split up large contigs. At the
>>>>>>>> very least you can try to run on a single node and set MPI to use all CPUs
>>>>>>>> on that node. It's less difficult to set up than cross-node jobs via
>>>>>>>> MPI.
>>>>>>>>
>>>>>>>>
>>>>>>>> (5) Still about the speed issue. I read some of your comments about
>>>>>>>> the "cpus" parameter in the maker_opts file (
>>>>>>>> http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html).
>>>>>>>> And I know it indicates the number of cpus for a single chunk. So if I
>>>>>>>> set "cpus=2" in the maker_opts file, then I can use the following
>>>>>>>> command to submit the job, right?
>>>>>>>>
>>>>>>>>
>>>>>>>> The cpus parameter only affects how many CPUs are given to the BLAST
>>>>>>>> command line, so only the BLAST step will speed up. I recommend using
>>>>>>>> MPI to speed up all steps. Even if you are only running on a single
>>>>>>>> node, you can give all CPUs to the mpiexec command.
>>>>>>>>
>>>>>>>>
>>>>>>>> --Carson
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
>

From carsonhh at gmail.com  Wed Sep 13 14:21:14 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Wed, 13 Sep 2017 14:21:14 -0600
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <CAOW6FS+JddppLEgbNjS3ipBUc67tbC1FeUfi9=8TzN+NB35X8g@mail.gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
	<A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
	<CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>
	<CAOW6FSJcdMTuLbN8-O-Vb7OxEXvU=F3Nwz7Rhx6qzXJ30ubV1w@mail.gmail.com>
	<6B4E0F47-7186-4B12-97A1-EB63D2EB7079@gmail.com>
	<CAOW6FS+JddppLEgbNjS3ipBUc67tbC1FeUfi9=8TzN+NB35X8g@mail.gmail.com>
Message-ID: <55597FCD-B589-4FFB-A8FA-37659A48BF58@gmail.com>

One final thought. If you are using rmblast as part of the RepeatMasker installation, it may be suffering from a bug found in some BLAST versions that can sometimes truncate a BLAST report (example of a separate error related to BLAST report truncation here) --> https://groups.google.com/forum/#!topic/maker-devel/96KGgiQMcxQ

As a result there is a special update to rmblast --> http://www.repeatmasker.org/RMBlast.html

So if you are not using the update, try it; but if you are using the update and it is giving the error, roll back to rmblast 2.2.28 (i.e. the update may be the cause or the cure of RepeatMasker errors).
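
To see which rmblast you are currently running, a quick check along these lines may help. It is a sketch that assumes the binary is called rmblastn (the name shown in the RepeatMasker report earlier in this thread) and that, like other BLAST+ programs, it accepts -version; report_rmblastn_version is a hypothetical helper name.

```shell
# Hypothetical helper: print the version reported by an rmblastn binary,
# or a note if that binary is not on PATH. The binary name defaults to
# "rmblastn" and can be overridden with the first argument.
report_rmblastn_version() {
    bin=${1:-rmblastn}
    if command -v "$bin" >/dev/null 2>&1; then
        "$bin" -version
    else
        echo "$bin not found on PATH"
    fi
}

report_rmblastn_version
```

If RepeatMasker was configured with a full path to rmblastn, pass that path as the argument so you check the same binary RepeatMasker actually calls.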

--Carson


> On Sep 13, 2017, at 1:42 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Dear Carson:
> 
> Thank you for your explanation, and sorry for not describing my problem clearly. The first two errors disappeared after I changed the parameters you suggested (e.g., max_dna_len, depth_blast). Now I get only the following error, for two contigs out of thousands. One of the two failed contigs is 863 kb long, and I have run more tests on it individually. Running RepeatMasker on this contig masks 65% of it with the species-specific repeat library but only 35% with the mammalian repeat library. Since much longer contigs (even 98 Mb) are all annotated successfully, I do not understand why this much shorter one would fail because of IO.
> 
> I did not set "TMP", and I am running on a high performance cluster. I am not sure whether it is a virtual memory or not. I will check it later. Many thanks
> 
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
> 33708 --> rank=NA, hostname=n409
> 33709 ERROR: Failed while processing all repeats
> 33710 ERROR: Chunk failed at level:3, tier_type:1
> 33711 FAILED CONTIG:Contig31
> 
> Best
> Quanwei
> 
> 2017-09-13 14:23 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
> These are the 3 errors you have shown in your e-mails -->
> open3: fork failed: Cannot allocate memory at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
> 
> The first two are memory related; the second occurs because MAKER cannot kill a lock maintainer thread that it was never able to start for lack of memory.
> 
> The third one is IO related. It is a truncated file that succeeded on the second try according to the e-mail you sent.
> 
> 
> IO errors are quite common with NFS (network mounted file systems). It's one of the most frequent issues submitted to the devel list. MAKER can hit IO limits long before it hits CPU limits. One of the most frequent causes of these issues is that the user set TMP= in the control files to a manual location that is not suitable for high IO (note TMP= defaults to /tmp). The location should always be a true locally mounted disk. Sometimes this is a virtual location (not really local disk but network mounted disk or an in-memory location). With the former you will get frequent IO failures and with the latter you will also get out of memory issues.
> 
> Note that when you supply more data files you will also use more memory (to hold analysis results). According to your e-mail the last error you got was 'Can't kill a non-numeric process ID'. Correct? So getting the error with two input files but not when you supply a single input file further suggests you are running low on RAM.
> 
> Some things to check:
> 1. Make sure TMP= is not being set to a network mounted location.
> 2. Make sure your temporary directory is not a virtual in memory directory on the node being used.
> 3. If nodes are shared, you may run out of memory because of other users or because you failed to request enough RAM during job submission.
> 
> Finally, try running interactively so you can see what the memory and directory locations look like on the node you get assigned for the job (check space and mount points: is /tmp, or wherever you set TMP=, in fact a local disk?). Also run with MPI rather than starting multiple MAKER instances. It uses resources better.
> 
> Thanks,
> Carson
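
The mount-point check suggested in the quoted message above can be scripted on the compute node. A rough Python sketch, assuming a Linux node where /proc/mounts is readable; the helper names and the lists of file system types are illustrative assumptions, not an exhaustive classification and not part of MAKER:

```python
import os
import sys

# Illustrative, incomplete lists of file system type names (assumptions).
NETWORK_FS = {"nfs", "nfs4", "cifs", "smbfs", "lustre", "gpfs", "beegfs"}
MEMORY_FS = {"tmpfs", "ramfs"}


def fs_type_for(path, mounts_text):
    """Return (mount_point, fs_type) for the longest mount point holding path."""
    path = os.path.abspath(path)
    best = ("", "unknown")
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # not a "device mount_point fs_type ..." line
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix) and len(mount_point) > len(best[0]):
            best = (mount_point, fs_type)
    return best


def classify(fs_type):
    """Map a file system type to the categories discussed in the thread."""
    if fs_type in NETWORK_FS:
        return "network"
    if fs_type in MEMORY_FS:
        return "memory"
    return "other (possibly local disk)"


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "/tmp"
    if os.path.exists("/proc/mounts"):
        with open("/proc/mounts") as handle:
            mount_point, fs_type = fs_type_for(target, handle.read())
        print(target, "is under", mount_point, "of type", fs_type,
              "->", classify(fs_type))
```

Run it inside an interactive job on the node itself, passing the directory that TMP= points to (or nothing, to check /tmp).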
> 
> 
> 
> 
> 
> 
>> On Sep 13, 2017, at 8:32 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>> 
>> Dear Carson:
>> 
>> I did more tests on one of the contigs (length 863 kb) that failed during repeat masking. I found it only fails when I add the species-specific repeat library; it is annotated successfully when only the mammalian repeat library is used. For this test I picked only this contig and ran MAKER with 64 GB of memory. So I think the failure should not be a memory or IO problem, because even contigs 98 Mb long can be annotated with 32 GB of memory.
>> 
>> I also ran RepeatMasker on this contig with the mammalian and the species-specific repeat libraries separately. With the mammalian repeat library about 35% was masked as repeats, while with the species-specific repeat library it was 65% (as shown below). I wonder whether the high level of repeats can lead to the failure of this contig. Do you have any ideas about this? Thanks
>> 
>> 
>> 
>> file name: test_scaffold31.fasta    
>> sequences:             1
>> total length:     863590 bp  (858757 bp excl N/X-runs)
>> GC level:         37.02 %
>> bases masked:     562909 bp ( 65.18 %)
>> ==================================================
>>                number of      length   percentage
>>                elements*    occupied  of sequence
>> --------------------------------------------------
>> SINEs:              113        16134 bp    1.87 %
>>       ALUs           71        12479 bp    1.45 %
>>       MIRs            1          133 bp    0.02 %
>> 
>> LINEs:              251       380142 bp   44.02 %
>>       LINE1         211       210623 bp   24.39 %
>>       LINE2           1           86 bp    0.01 %
>>       L3/CR1          0            0 bp    0.00 %
>> 
>> LTR elements:       246       101221 bp   11.72 %
>>       ERVL            5         1037 bp    0.12 %
>>       ERVL-MaLRs     18         2744 bp    0.32 %
>>       ERV_classI    201        90942 bp   10.53 %
>>       ERV_classII    18         5964 bp    0.69 %
>> 
>> DNA elements:        39        14177 bp    1.64 %
>>      hAT-Charlie      7         3864 bp    0.45 %
>>      TcMar-Tigger     7         1706 bp    0.20 %
>> 
>> Unclassified:       196        45831 bp    5.31 %
>> 
>> Total interspersed repeats:   557505 bp   64.56 %
>> 
>> 
>> Small RNA:            3          823 bp    0.10 %
>> 
>> Satellites:           2          237 bp    0.03 %
>> Simple repeats:      94         4472 bp    0.52 %
>> Low complexity:      18          766 bp    0.09 %
>> ==================================================
>> 
>> * most repeats fragmented by insertions or deletions
>>   have been counted as one element
>>                                                       
>> 
>> The query species was assumed to be homo          
>> RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
>>         
>> run with rmblastn version 2.2.27+
>> The query was compared to classified sequences in ".../consensi.fa.classifiednoProtFinal"  
>> 
>> 
>> Best
>> Quanwei
>> 
>> 2017-09-11 14:33 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>> Dear Carson:
>> 
>> I see. Thank you. I will try it.
>> 
>> Best
>> Quanwei
>> 
>> 2017-09-11 13:46 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>> Each node is a single machine. Because you currently run without MPI, each MAKER job you submit runs on a single machine. So you are either running multiple times on the same node, or you submitted 5 separate batch jobs in which case you may have a single maker process on each of 5 nodes.
>> 
>> MPI can parallelize on the same node or across nodes. If you request 10 nodes, then it can communicate across nodes to run the job on all hardware. Or you can run MPI on a single node and ask for all CPUs on that node. In that case it will split up work within a single node and use all resources just on that node. So if you can't get MPI to work across nodes, you can just submit a job that goes to a single node and ask for all CPUs on that node (multinode jobs may be hard to configure, but single node jobs are very easy). Just set the -n parameter of mpiexec to the CPU count of that node, and it will parallelize within the node.
>> 
>> Example command for a 20 CPU node -->  mpiexec -n 20 maker
>> 
>> --Carson
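
The single-node recipe above amounts to passing the node's CPU count to mpiexec -n. A small Python sketch that only builds and prints the command line; the helper name is hypothetical, and os.cpu_count() reports the CPUs visible on the machine, which on a shared node may be more than the slots the scheduler actually granted:

```python
import os
import shlex


def mpiexec_command(cpu_count, program="maker"):
    """Build the argument list for 'mpiexec -n <cpu_count> <program>'."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return ["mpiexec", "-n", str(cpu_count), program]


if __name__ == "__main__":
    # Assumption: the job owns the whole node, so every visible CPU is usable.
    cpus = os.cpu_count() or 1
    print(shlex.join(mpiexec_command(cpus)))
```

For the 20 CPU node in the example, mpiexec_command(20) gives the same command as the one quoted above.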
>> 
>> 
>> 
>> 
>> 
>>> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>> 
>>> Dear Carson: 
>>> 
>>> Would you please explain what you mean by "a single machine"? I am running maker2 on our high performance cluster. The cluster has more than 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine is used as the scheduler. Can I use MPICH3?
>>> 
>>> Thanks
>>> 
>>> Best
>>> Quanwei
>>> 
>>> 2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>> If you are just using a single machine (and not cross machine MPI), use MPICH3 --> https://www.mpich.org
>>> 
>>> It's easy to install yourself, and tends to be very robust to failure.
>>> 
>>> --Carson
>>> 
>>> 
>>> 
>>>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>>> 
>>>> Dear Carson:
>>>> 
>>>> I ran into some problems using MPI. I will give it another try.
>>>> Thank you!
>>>> 
>>>> Best
>>>> Quanwei
>>>> 
>>>> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>> It could be either. Please use MPI instead of starting multiple instances. It will greatly reduce both IO and RAM usage.
>>>> 
>>>> --Carson
>>>> 
>>>> 
>>>> 
>>>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>>>> 
>>>>> Dear Carson:
>>>>> 
>>>>> I only run 5 MAKER instances in each directory (and set cpus=2). If it is related to a memory issue or an IO issue, I am not sure why the much longer scaffolds (than the failed ones) were all annotated successfully, but the relatively shorter ones failed.
>>>>> 
>>>>> I have set "tries=5" (#number of times to try a contig if there is a failure for some reason). I will try "clean_try=1" and test on the failed scaffolds individually with larger memory to see whether they can be annotated. 
>>>>> 
>>>>> Thank you!
>>>>> 
>>>>> Best
>>>>> Quanwei
>>>>> 
>>>>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>> I think the cause of the error may have been a little further upstream from what you pasted in the e-mail. One thing that may be happening is that you are taxing resources (like IO) by running MAKER multiple times or on too many CPUs. That can lead to failures because of truncated BLAST reports etc. In that case you can just retry, and that will get around those types of IO-derived errors. MAKER can generate a lot of IO, and if you are working on network mounted locations (i.e. the storage being used is actually across the network), they can be less robust than local storage (under heavy load, NFS can falsely report success on read/write operations that actually failed). It's the reason we built the retry capabilities into MAKER.
>>>>> 
>>>>> For contigs that continuously fail, you may need to set clean_try=1. That will cause failures to start from scratch (i.e. delete all old reports on failure rather than just those suspected of being truncated).
>>>>> 
>>>>> --Carson
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>>>>> 
>>>>>> Dear Carson:
>>>>>> 
>>>>>> About the error in my previous email: I found the contig was correctly annotated on the second retry, so please ignore that email. But now, for a small number of scaffolds, I have problems processing the repeats (as shown below). I used both the Mammalia repeat library and a species-specific repeat library (generated with your pipeline "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic"). There were no such problems when I used only the Mammalia repeat library. Do you have any ideas about this? What could be the reason? Or do you have any suggestions for how I can find the reason? Many thanks
>>>>>> 
>>>>>> Here are some parameters I used
>>>>>> 
>>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
>>>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for Repe
>>>>>> 
>>>>>> max_dna_len=300000
>>>>>> split_hit=40000
>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>> 
>>>>>> 
>>>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm line 188.
>>>>>> 33708 --> rank=NA, hostname=n409
>>>>>> 33709 ERROR: Failed while processing all repeats
>>>>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>>>>> 33711 FAILED CONTIG:Contig31
>>>>>> 
>>>>>> 
>>>>>> Best
>>>>>> Quanwei
>>>>>> 
>>>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>>>>> Dear Carson:
>>>>>> 
>>>>>> I got the following error again. Is this still related to memory issues, or could something else lead to this error? This time I got the error while training the SNAP model. Previously, even with max_dna_len set to 1 Mb, I could train the model successfully, and in the current training (where I get the following error) I have decreased max_dna_len to 300 kb. I requested the same amount of memory as before. The only difference is that I am now using both the mammalian repeat library and the species-specific repeat library, whereas previously I used only the mammalian one. Does using both repeat libraries greatly increase the memory requirement (even when I decrease max_dna_len from 1 Mb to 300 kb)? I have also set depth_blast to 30 in the current training.
>>>>>> 
>>>>>> Thank you! Have a nice weekend! 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> #---------------------------------------------------------------------
>>>>>> Now starting the contig!!
>>>>>> SeqID: Contig10
>>>>>> Length: 18773588
>>>>>> #---------------------------------------------------------------------
>>>>>> 
>>>>>> 
>>>>>> setting up GFF3 output and fasta chunks
>>>>>> doing repeat masking
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> doing blastx repeats
>>>>>> collecting blastx repeatmasking
>>>>>> processing all repeats
>>>>>> doing repeat masking
>>>>>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line 1050.
>>>>>> --> rank=NA, hostname=n224
>>>>>> ERROR: Failed while doing repeat masking
>>>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>>>> FAILED CONTIG:Contig10
>>>>>> 
>>>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>>>> FAILED CONTIG:Contig10
>>>>>> 
>>>>>> Best
>>>>>> Quanwei
>>>>>> 
>>>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>> 
>>>>>>> (2) Reading some of your replies in the MAKER Google group, I noticed that setting depth_blast to a fixed number can reduce memory use and save annotation time, so I changed the following parameters. But I wonder whether this will decrease the quality of the annotation. If it does not affect quality, can I use an even smaller number (e.g., 20) to save more memory and time?
>>>>>>> 
>>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>> 
>>>>>> These values really only affect the final evidence kept in the GFF3 when you look at it in a browser. They have no effect on the annotation. This is because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus. The rest are put in the GFF3 just for reference. By setting the values lower, you are just letting MAKER know it can throw things away even sooner since you don't want them in the GFF3. It provides a minor improvement in memory use, but max_dna_len is the setting with the greatest effect.
>>>>>> 
>>>>>> 
>>>>>>> (3) I also have some concerns about the speed, especially for the long scaffolds (around 100Mb). I wonder which part is the most time consuming for genome annotation (repeat masking, blast, or polishing?).  Particularly, I wonder whether the blastx of protein evidence will take majority of time. Now, I have prepared 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences as protein evidences. I am considering whether I can save much time if I only use the 99k mammalian Swiss protein sequences as evidences.
>>>>>> 
>>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN
>>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>>> 
>>>>>> Also, doubling the dataset size doubles the runtime. Larger window sizes via max_dna_len will also increase runtimes.
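
As a worked example of these rules of thumb, applied to the protein sets mentioned in question (3): a rough Python estimate that assumes runtime scales linearly with the number of sequences and uses the "at least" frame factors from the message above as if they were exact. Both assumptions are simplifications for illustration (actual BLAST time also depends on sequence lengths and hit counts), and the names are made up:

```python
# Frame factors relative to BLASTN, taken from the rules of thumb above.
FRAME_FACTOR = {"blastn": 1, "blastx": 6, "tblastx": 12}


def relative_cost(n_sequences, program):
    """Relative search cost: dataset size times the reading-frame factor."""
    return n_sequences * FRAME_FACTOR[program]


swissprot = 99_000   # mammalian Swiss-Prot sequences (from the question)
trembl = 340_000     # rodent TrEMBL sequences (from the question)

both = relative_cost(swissprot + trembl, "blastx")
swissprot_only = relative_cost(swissprot, "blastx")
fraction = swissprot_only / both

print(f"Swiss-Prot only needs about {fraction:.1%} of the BLASTX work")
```

Under these assumptions, dropping the TrEMBL set would cut the protein BLASTX work to roughly 99/439 of the original, i.e. a little under a quarter.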
>>>>>> 
>>>>>> 
>>>>>>> (4) For some reason, I cannot run MAKER through MPI on our cluster, so I can only start multiple MAKER instances. I wonder whether it is possible to let multiple MAKER instances annotate the same long scaffold (i.e., start multiple instances for a single sequence without splitting the long sequence into shorter ones).
>>>>>> 
>>>>>> Without MPI you won't be able to split up large contigs. At the very least you can try to run on a single node and set MPI to use all CPUs on that node. It's less difficult to set up than cross-node jobs via MPI.
>>>>>> 
>>>>>> 
>>>>>>> (5) Still about the speed issue. I read some of your comments about the "cpus" parameter in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html), and I understand that it indicates the number of CPUs for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?
>>>>>> 
>>>>>> The cpus parameter only affects how many CPUs are given to the BLAST command line, so only the BLAST step will speed up. I recommend using MPI so that all steps speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
>>>>>> 
>>>>>> 
>>>>>> --Carson
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
> 
> 


From qwzhang0601 at gmail.com  Wed Sep 13 14:26:11 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 13 Sep 2017 16:26:11 -0400
Subject: [maker-devel] Some errors reported by Maker2
In-Reply-To: <55597FCD-B589-4FFB-A8FA-37659A48BF58@gmail.com>
References: <CAOW6FS+UYF4wNDoEvJs8iXrDh4gv7770G9t=tc+2KyUFGb6OMA@mail.gmail.com>
	<816C7F69-617B-4329-8F06-ED29468E5244@gmail.com>
	<CAOW6FSKv5hChsm8YDkKp4sU9SXb6i=6teJ0egdz3sNWXcH+PDw@mail.gmail.com>
	<995CFFD1-BDC3-4959-AEF5-1E7F4637939C@gmail.com>
	<CAOW6FSLFbuiih8mggAmQ6zHRezmL7JaPDx7DLfBM6Fq_z3_D6w@mail.gmail.com>
	<9B5E5414-961A-4817-9A4D-07BC5CE71187@gmail.com>
	<CAOW6FSKXO6c9WOHkrPsAUMhM7awbndiFmpsYmTOzpi3evbfEvg@mail.gmail.com>
	<CAOW6FS+0nO7dzSYotb8=KQRjvWhh5PNcFTtBLkqV6MiXEaXQpA@mail.gmail.com>
	<92F322F3-CC62-46D7-8F2F-777D8E131AB0@gmail.com>
	<CAOW6FSJWHDpF-guc6BV0ABSyLZNoK6rJtkd4uxXgYrH3ASaHCw@mail.gmail.com>
	<488D639E-AD22-401A-93F9-5242DAD3ABF1@gmail.com>
	<CAOW6FSKE9Mi_ZLy-LOATvaofRqbUhR_1+K6wmRrrfF=UW6Ye2g@mail.gmail.com>
	<3E72E710-6805-4276-B5A4-22BDB4198903@gmail.com>
	<CAOW6FSKeAh4QWr0Zs=XFfH1862LiMfXtZp9veekTP1bEsFrGiw@mail.gmail.com>
	<A6B66F0B-2D98-4ECB-8DCB-16FDFE002DBD@gmail.com>
	<CAOW6FSKECD3MaqsV=FNZ+2UncywdsPO0vptSuojUkRdkz16k-g@mail.gmail.com>
	<CAOW6FSJcdMTuLbN8-O-Vb7OxEXvU=F3Nwz7Rhx6qzXJ30ubV1w@mail.gmail.com>
	<6B4E0F47-7186-4B12-97A1-EB63D2EB7079@gmail.com>
	<CAOW6FS+JddppLEgbNjS3ipBUc67tbC1FeUfi9=8TzN+NB35X8g@mail.gmail.com>
	<55597FCD-B589-4FFB-A8FA-37659A48BF58@gmail.com>
Message-ID: <CAOW6FSKU9Tn6HN3fZAnXquVU0OrdsxZuHB8GCG76BNQAZ_kdKg@mail.gmail.com>

Dear Carson:

I will take a look and try it. Thank you.

Best
Quanwei

2017-09-13 16:21 GMT-04:00 Carson Holt <carsonhh at gmail.com>:

> One final thought. If you are using rmblast as part of the RepeatMasker
> installation, it may be suffering from a bug, present in some BLAST
> versions, that can sometimes truncate a BLAST report (example of a
> separate error related to BLAST report truncation here) -->
> https://groups.google.com/forum/#!topic/maker-devel/96KGgiQMcxQ
>
> As a result there is a special update to rmblast -->
> http://www.repeatmasker.org/RMBlast.html
>
> So if you are not using the update, try it; but if you are using the
> update and it is giving the error, roll back to rmblast 2.2.28 (i.e. the
> update may be either the cause or the cure of RepeatMasker errors).
>
> --Carson
>
>
>
> On Sep 13, 2017, at 1:42 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>
> Dear Carson:
>
> Thank you for your explanation, and sorry for not describing my problem
> clearly. The first two errors disappeared after I changed the parameters
> you suggested (e.g., max_dna_len, depth_blast). Now I only get the
> following error, for two contigs among thousands. One of the two failed
> contigs is 863 kb long, and I have run more tests on this contig
> individually. When I run RepeatMasker on it, 65% is masked with the
> species-specific repeat library but only 35% with the mammalian repeat
> library. Since longer contigs (even 98 Mb) can all be annotated, I do not
> understand why this much shorter one would fail because of IO.
>
> I did not set "TMP", and I am running on a high performance cluster. I am
> not sure whether the temporary directory is a virtual (in-memory) location
> or not. I will check it later. Many thanks
>
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm
> line 188.
> 33708 --> rank=NA, hostname=n409
> 33709 ERROR: Failed while processing all repeats
> 33710 ERROR: Chunk failed at level:3, tier_type:1
> 33711 FAILED CONTIG:Contig31
>
> Best
> Quanwei
>
> 2017-09-13 14:23 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>
>> These are the 3 errors you have shown in your e-mails -->
>> open3: fork failed: Cannot allocate memory at
>> /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm
>> line 1050.
>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm
>> line 188.
>>
>> The first two are memory related; the second occurs because MAKER cannot
>> kill a lock maintainer thread that it was never able to start for lack of
>> memory.
>>
>> The third one is IO related. It is a truncated file that succeeded on the
>> second try according to the e-mail you sent.
>>
>>
>> IO errors are quite common with NFS (network mounted file systems). It's
>> one of the most frequent issues submitted to the devel list. MAKER can hit
>> IO limits long before it hits CPU limits. One of the most frequent causes
>> of these issues is that the user set TMP= in the control files to a manual
>> location that is not suitable for high IO (note TMP= defaults to /tmp). The
>> location should always be a true locally mounted disk. Sometimes this is a
>> virtual location (not really local disk but network mounted disk or an
>> in-memory location). With the former you will get frequent IO failures and
>> with the latter you will also get out of memory issues.
>>
>> Note that when you supply more data files you will also use more memory
>> (to hold analysis results). According to your e-mail the last error you got
>> was 'Can't kill a non-numeric process ID'. Correct? So getting the error
>> with two input files but not when you supply a single input file further
>> suggests you are running low on RAM.
>>
>> Some things to check:
>> 1. Make sure TMP= is not being set to a network mounted location.
>> 2. Make sure your temporary directory is not a virtual in memory
>> directory on the node being used.
>> 3. If nodes are shared, you may run out of memory because of other users
>> or because you failed to request enough RAM during job submission.
>>
>> Finally, try running interactively so you can see what the memory and
>> directory locations look like on the node you get assigned for the job
>> (check space and mount points. Is /tmp or wherever you set TMP= in fact a
>> local disk?). Also run with MPI rather than starting multiple MAKER
>> instances. It uses resources better.
>>
>> Thanks,
>> Carson
>>
>>
>>
>>
>>
>>
>> On Sep 13, 2017, at 8:32 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>
>> Dear Carson:
>>
>> I did more tests on one of the contigs (length 863 kb) that failed
>> during repeat masking. I found it only fails when I add the
>> species-specific repeat library; it is annotated successfully when only
>> the mammalian repeat library is used. For this test I picked only this
>> contig and ran MAKER with 64 GB of memory. So I think the failure should
>> not be a memory or IO problem, because even contigs 98 Mb long can be
>> annotated with 32 GB of memory.
>>
>> I also ran RepeatMasker on this contig with the mammalian and the
>> species-specific repeat libraries separately. With the mammalian repeat
>> library about 35% was masked as repeats, while with the species-specific
>> repeat library it was 65% (as shown below). I wonder whether the high
>> level of repeats can lead to the failure of this contig. Do you have any
>> ideas about this? Thanks
>>
>>
>>
>> file name: test_scaffold31.fasta
>> sequences:             1
>> total length:     863590 bp  (858757 bp excl N/X-runs)
>> GC level:         37.02 %
>> bases masked:     562909 bp ( 65.18 %)
>> ==================================================
>>                number of      length   percentage
>>                elements*    occupied  of sequence
>> --------------------------------------------------
>> SINEs:              113        16134 bp    1.87 %
>>       ALUs           71        12479 bp    1.45 %
>>       MIRs            1          133 bp    0.02 %
>>
>> LINEs:              251       380142 bp   44.02 %
>>       LINE1         211       210623 bp   24.39 %
>>       LINE2           1           86 bp    0.01 %
>>       L3/CR1          0            0 bp    0.00 %
>>
>> LTR elements:       246       101221 bp   11.72 %
>>       ERVL            5         1037 bp    0.12 %
>>       ERVL-MaLRs     18         2744 bp    0.32 %
>>       ERV_classI    201        90942 bp   10.53 %
>>       ERV_classII    18         5964 bp    0.69 %
>>
>> DNA elements:        39        14177 bp    1.64 %
>>      hAT-Charlie      7         3864 bp    0.45 %
>>      TcMar-Tigger     7         1706 bp    0.20 %
>>
>> Unclassified:       196        45831 bp    5.31 %
>>
>> Total interspersed repeats:   557505 bp   64.56 %
>>
>>
>> Small RNA:            3          823 bp    0.10 %
>>
>> Satellites:           2          237 bp    0.03 %
>> Simple repeats:      94         4472 bp    0.52 %
>> Low complexity:      18          766 bp    0.09 %
>> ==================================================
>>
>> * most repeats fragmented by insertions or deletions
>>   have been counted as one element
>>
>>
>> The query species was assumed to be homo
>> RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
>>
>> run with rmblastn version 2.2.27+
>> The query was compared to classified sequences in
>> ".../consensi.fa.classifiednoProtFinal"
>>
>>
>> Best
>> Quanwei
>>
>> 2017-09-11 14:33 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>
>>> Dear Carson:
>>>
>>> I see. Thank you. I will try it.
>>>
>>> Best
>>> Quanwei
>>>
>>> 2017-09-11 13:46 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>
>>>> Each node is a single machine. Because you currently run without MPI,
>>>> each MAKER job you submit runs on a single machine. So you are either
>>>> running multiple times on the same node, or you submitted 5 separate batch
>>>> jobs in which case you may have a single maker process on each of 5 nodes.
>>>>
>>>> MPI can parallelize on the same node or across nodes. If you request 10
>>>> nodes, then it can communicate across nodes to run the job on all hardware.
>>>> Or you can run MPI on a single node and ask for all CPUs on that node. In
>>>> that case it will split up work within a single node and use all resources
>>>> just on that node. So if you can't get MPI to work across nodes, you can
>>>> just submit a job that goes to a single node and ask for all CPUs on that
>>>> node (multinode jobs may be hard to configure, but single node jobs are
>>>> very easy). Just set the -n parameter of mpiexec to the CPU count of that
>>>> node, and it will parallelize within the node.
>>>>
>>>> Example command for a 20 CPU node -->  mpiexec -n 20 maker
>>>>
>>>> --Carson
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>> wrote:
>>>>
>>>> Dear Carson:
>>>>
>>>> Would you please explain what you mean by "a single machine"? I am
>>>> running maker2 on our high performance cluster. The cluster has more than
>>>> 1,620-core compute nodes with 128 GB RAM each. Univa Grid Engine was used
>>>> as the scheduler. Can I use MPICH3?
>>>>
>>>> Thanks
>>>>
>>>> Best
>>>> Quanwei
>>>>
>>>> 2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>
>>>>> If you are just using a single machine (and not cross machine MPI),
>>>>> use MPICH3 --> https://www.mpich.org
>>>>>
>>>>> It's easy to install yourself, and tends to be very robust to failure.
>>>>>
>>>>> --Carson
>>>>>
>>>>>
>>>>>
>>>>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>>> wrote:
>>>>>
>>>>> Dear Carson:
>>>>>
>>>>> I ran into some problems using MPI. I will give it another try.
>>>>> Thank you!
>>>>>
>>>>> Best
>>>>> Quanwei
>>>>>
>>>>> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>
>>>>>> It could be either. Please use MPI instead of starting multiple
>>>>>> instances. It will greatly reduce both IO and RAM usage.
>>>>>>
>>>>>> --Carson
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Dear Carson:
>>>>>>
>>>>>> I only run 5 Maker instances in each directory (and set cpus=2). If
>>>>>> it is related to a memory issue or an IO issue, I am not sure why the much
>>>>>> longer scaffolds (than the failed ones) were all annotated successfully,
>>>>>> but the relatively shorter ones failed.
>>>>>>
>>>>>> I have set "tries=5" (#number of times to try a contig if there is a
>>>>>> failure for some reason). I will try "clean_try=1" and test on the failed
>>>>>> scaffolds individually with larger memory to see whether they can be
>>>>>> annotated.
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> Best
>>>>>> Quanwei
>>>>>>
>>>>>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>>
>>>>>>> I think the cause of the error may have been a little further
>>>>>>> upstream from what you pasted in the e-mail. One thing that may be
>>>>>>> happening is that you are taxing resources (like IO) if running MAKER
>>>>>>> multiple times or on too many CPUs. That can lead to failures because of
>>>>>>> truncated BLAST reports etc. In which case you can just retry and that will
>>>>>>> get around those types of IO derived errors. MAKER can generate a lot of
>>>>>>> IO, and if you are working on network mounted locations (i.e. the storage
>>>>>>> being used is actually across the network), then they can be less robust
>>>>>>> than local storage (when under heavy load NFS can falsely report success on
>>>>>>> read/write operations that actually failed). It's the reason we built in
>>>>>>> the retry capabilities of MAKER.
>>>>>>>
>>>>>>> For contigs that continuously fail, you may need to set clean_try=1.
>>>>>>> That will cause failures to start from scratch (i.e. delete all old reports
>>>>>>> on failure rather than just those suspected of being truncated).
>>>>>>>
>>>>>>> --Carson
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Dear Carson:
>>>>>>>
>>>>>>> About the error in my previous email: I found the contig was correctly
>>>>>>> annotated on the second retry, so please ignore that email. But now,
>>>>>>> for a small number of scaffolds, I have problems processing the repeats
>>>>>>> (as shown below). I used both the Mammalia repeat library and a
>>>>>>> species-specific repeat library (generated with your pipeline
>>>>>>> "http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic").
>>>>>>> There were no such problems when I used only the Mammalia repeat
>>>>>>> library. Do you have any ideas about this? What could be the reason? Or
>>>>>>> do you have any suggestions for how I can find the reason? Many thanks
>>>>>>>
>>>>>>> Here are some parameters I used
>>>>>>>
>>>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>>>> model_org=Mammalia #select a model organism for RepBase masking in
>>>>>>> RepeatMasker
>>>>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism
>>>>>>> specific repeat library in fasta format for Repe
>>>>>>>
>>>>>>> max_dna_len=300000
>>>>>>> split_hit=40000
>>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>>>
>>>>>>>
>>>>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm
>>>>>>> line 188.
>>>>>>> 33708 --> rank=NA, hostname=n409
>>>>>>> 33709 ERROR: Failed while processing all repeats
>>>>>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>>>>>> 33711 FAILED CONTIG:Contig31
>>>>>>>
>>>>>>>
>>>>>>> Best
>>>>>>> Quanwei
>>>>>>>
>>>>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>>>>>>
>>>>>>>> Dear Carson:
>>>>>>>>
>>>>>>>> I got the following error again. Is this still related to memory
>>>>>>>> issues? I wonder whether there could be other reasons for this error.
>>>>>>>> This time, I got the error while training the SNAP model. Before, even
>>>>>>>> with max_dna_len set to 1Mb, I could train the model successfully. In the
>>>>>>>> current training (where I get the following error), I have decreased
>>>>>>>> max_dna_len to 300kb. I requested the same amount of memory as before. The only
>>>>>>>> difference is that I am now using both the mammalian repeat library and a
>>>>>>>> species-specific repeat library, while previously I only used the mammalian repeat
>>>>>>>> library. Does using both repeat libraries greatly increase the memory
>>>>>>>> requirement (even when I decrease max_dna_len from 1Mb to 300kb)? I
>>>>>>>> have also set the depth_blast parameters to 30 in the current training.
>>>>>>>>
>>>>>>>> Thank you! Have a nice weekend!
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> #-----------------------------------------------------------
>>>>>>>> ----------
>>>>>>>> Now starting the contig!!
>>>>>>>> SeqID: Contig10
>>>>>>>> Length: 18773588
>>>>>>>> #-----------------------------------------------------------
>>>>>>>> ----------
>>>>>>>>
>>>>>>>>
>>>>>>>> setting up GFF3 output and fasta chunks
>>>>>>>> doing repeat masking
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> collecting blastx repeatmasking
>>>>>>>> processing all repeats
>>>>>>>> doing repeat masking
>>>>>>>> Can't kill a non-numeric process ID at
>>>>>>>> /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line
>>>>>>>> 1050.
>>>>>>>> --> rank=NA, hostname=n224
>>>>>>>> ERROR: Failed while doing repeat masking
>>>>>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>>>>>> FAILED CONTIG:Contig10
>>>>>>>>
>>>>>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>>>>>> FAILED CONTIG:Contig10
>>>>>>>>
>>>>>>>> Best
>>>>>>>> Quanwei
>>>>>>>>
>>>>>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> (2) Reading some of your replies in the maker google group,
>>>>>>>>> I noticed that setting depth_blast to a certain number can reduce memory
>>>>>>>>> use and save annotation time. So I changed the following parameters. But
>>>>>>>>> I wonder whether it will decrease the quality of the annotation? If it won't
>>>>>>>>> affect the quality, can I use an even smaller number (e.g., 20) to save more
>>>>>>>>> memory and time?
>>>>>>>>>
>>>>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element
>>>>>>>>> masking
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> These values really only affect the final evidence kept in the
>>>>>>>>> GFF3 when you look at it in a browser. They have no effect on the annotation.
>>>>>>>>> This is because internally MAKER already collapses evidence down to the 10
>>>>>>>>> best non-redundant features per evidence set per locus. The rest are put in
>>>>>>>>> the GFF3 just for reference. By setting it lower, you are just letting
>>>>>>>>> MAKER know it can throw things away even sooner since you don't want them
>>>>>>>>> in the GFF3. It provides a minor improvement for memory use, but
>>>>>>>>> max_dna_len is the big one that has the greatest effect.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> (3) I also have some concerns about the speed, especially for the
>>>>>>>>> long scaffolds (around 100Mb). I wonder which part of genome annotation
>>>>>>>>> is the most time consuming (repeat masking, blast, or polishing?).
>>>>>>>>> In particular, I wonder whether the blastx of protein evidence takes
>>>>>>>>> the majority of the time. I have prepared 99k mammalian Swiss protein
>>>>>>>>> sequences and 340k rodent TrEMBL protein sequences as protein evidence. I
>>>>>>>>> am considering whether I can save much time if I only use the 99k mammalian
>>>>>>>>> Swiss protein sequences as evidence.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>>>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at
>>>>>>>>> least 6 times slower than BLASTN
>>>>>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at
>>>>>>>>> least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>>>>>>
>>>>>>>>> Also, if you double the dataset size, you double the runtime. Larger window
>>>>>>>>> sizes via max_dna_len will also increase runtimes.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> (4) For some reason, I cannot run maker through MPI on our
>>>>>>>>> cluster, so I can only start multiple maker instances. I wonder if it is
>>>>>>>>> possible to let multiple maker instances annotate the same long scaffold
>>>>>>>>> (i.e., for a single sequence I start multiple maker instances, without
>>>>>>>>> splitting the long sequence into shorter ones).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Without MPI you won't be able to split up large contigs. At the
>>>>>>>>> very least you can try to run on a single node and set MPI to use all CPUs
>>>>>>>>> on that node. It's less difficult to set up compared to cross-node jobs via
>>>>>>>>> MPI.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> (5) Still about the speed issue. I read some of your comments
>>>>>>>>> about the "cpus" parameter in the maker_opts file (
>>>>>>>>> http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html).
>>>>>>>>> I know it indicates the number
>>>>>>>>> of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file,
>>>>>>>>> then I can use the following command to submit the job, right?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The cpus parameter only affects how many CPUs are given to the
>>>>>>>>> BLAST command line, so only the BLAST step will speed up. I recommend
>>>>>>>>> using MPI to get all steps to speed up. Even if you are only running on a
>>>>>>>>> single node, you can give all CPUs to the mpiexec command.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --Carson
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>
>

From xvazquezc at gmail.com  Sun Sep 17 19:12:56 2017
From: xvazquezc at gmail.com (=?UTF-8?Q?Xabier_V=C3=A1zquez=2DCampos?=)
Date: Mon, 18 Sep 2017 11:12:56 +1000
Subject: [maker-devel] augustus underpredicting
In-Reply-To: <E0AFDA01-3142-4A14-A4F0-E621E5BD4C2F@gmail.com>
References: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>
	<E0AFDA01-3142-4A14-A4F0-E621E5BD4C2F@gmail.com>
Message-ID: <CAL0hg4HWHjmEJ680+c4XiajWaE5WKqfbVK4oeNYVDWPuz_Oh+Q@mail.gmail.com>

I did it that way and AUGUSTUS is predicting a more reasonable number of
genes, about 12500 in Maker, but about 19000 in the model assessment step.
In comparison, SNAP gives 16000 and GeneMark 19000.

I haven't found any reference about it, but would it be a good idea to train
Augustus on the masked genome instead?
Thanks,


On 12 September 2017 at 02:50, Carson Holt <carsonhh at gmail.com> wrote:

> BUSCO may be generating too few models. BUSCO also identifies classes of
> conserved short genes that may not represent enough training diversity for
> your organism. Try running MAKER in protein2genome or est2genome mode, and
> then train with those results.
>
> --Carson
>
>
> On Sep 10, 2017, at 7:03 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com>
> wrote:
>
> Hi,
> I have been annotating a fungal genome as usual, using BUSCO-trained
> Augustus (in addition to GeneMark and SNAP), but for some reason, Augustus
> is predicting a mere 207 genes compared to 15-20k from the other two.
> I've never had this problem. The genome has an unusual repeat content of
> close to 50%; not sure if that might pose a problem.
> Has anybody come across a similar issue?
> I also asked the BUSCO developers whether they have any ideas:
> https://gitlab.com/ezlab/busco/issues/49
> Cheers,
> Xabi
>
> --
> Xabier Vázquez-Campos, *PhD*
> *Research Associate*
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA
>
>
>


-- 
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From qwzhang0601 at gmail.com  Mon Sep 18 21:07:25 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Mon, 18 Sep 2017 23:07:25 -0400
Subject: [maker-devel] Question about "maker-", "augustus_masked",
 "snap_masked" gene model
Message-ID: <CAOW6FS+MkBOsHct0Ph+mdHD-VTeyN2RbHS2_1Neye6qZtUDWrA@mail.gmail.com>

Hello:

Would you please explain the difference between the
"maker-...-augustus..." and "augustus_masked..." gene models?

I know the "augustus_masked..." gene models are raw Augustus predictions, while
"maker-...-augustus..." are hint-derived gene models. But by default, MAKER2
reports gene models with evidence support (protein sequences or
transcripts). Then why are some gene models hint-derived while other models
(with evidence support) are raw Augustus predictions (even though there is
protein or transcript evidence)?

BTW, is it true that the "maker-...-augustus..." gene models are generally
more reliable than the "augustus_masked..." gene models?

Thanks

Best
Quanwei

From qwzhang0601 at gmail.com  Mon Sep 18 22:14:38 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Tue, 19 Sep 2017 00:14:38 -0400
Subject: [maker-devel] about min_protein
Message-ID: <CAOW6FSK5PcBbqnOt2s-ObUoEGFhOpXNYhcj=wxzGLWRqJpjusg@mail.gmail.com>

Hello:

I am working on a rodent species and got 28k annotated genes. I wonder
whether you have any suggestions about the "min_protein" parameter?

I did not change the parameter in my current annotation, and I get several very
short predicted proteins (some with only 1 amino acid).

min_protein=0 #require at least this many amino acids in predicted proteins

Thanks

Best
Quanwei

From qwzhang0601 at gmail.com  Tue Sep 19 06:47:00 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Tue, 19 Sep 2017 08:47:00 -0400
Subject: [maker-devel] about min_protein
In-Reply-To: <CADE62ED-68F2-4E48-9013-C9BF66E56CAD@gmail.com>
References: <CAOW6FSK5PcBbqnOt2s-ObUoEGFhOpXNYhcj=wxzGLWRqJpjusg@mail.gmail.com>
	<CADE62ED-68F2-4E48-9013-C9BF66E56CAD@gmail.com>
Message-ID: <CAOW6FSLxd=oEHcpq73yGjDQjsdTbU=AE89u6HT3m8md=uiceSQ@mail.gmail.com>

Thank you Daniel. I wonder whether there is a suggested value for the
"min_protein" parameter (e.g., 20 amino acids, 50 amino acids?) that people
often use. I am studying a rodent species.

Thank you.

Best
Quanwei

2017-09-19 8:29 GMT-04:00 Daniel Ence <dandence at gmail.com>:

> Hi Quanwei,
>
> Increasing the "min_protein" parameter should get rid of those very short
> predicted proteins.
>
>
>
> > On Sep 19, 2017, at 12:14 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
> wrote:
> >
> > Hello:
> >
> > I am working on a rodent species and get 28k annotated genes, I wonder
> whether you have any suggestions about the "min_protein" parameter?
> >
> > I did not change the parameter in my current annotation. I get several
> very short predicted proteins (even those with only 1 amino acid).
> >
> > min_protein=0 #require at least this many amino acids in predicted
> proteins
> >
> > Thanks
> >
> > Best
> > Quanwei
> > _______________________________________________
> > maker-devel mailing list
> > maker-devel at box290.bluehost.com
> > http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
>

From dandence at gmail.com  Tue Sep 19 06:29:35 2017
From: dandence at gmail.com (Daniel Ence)
Date: Tue, 19 Sep 2017 08:29:35 -0400
Subject: [maker-devel] about min_protein
In-Reply-To: <CAOW6FSK5PcBbqnOt2s-ObUoEGFhOpXNYhcj=wxzGLWRqJpjusg@mail.gmail.com>
References: <CAOW6FSK5PcBbqnOt2s-ObUoEGFhOpXNYhcj=wxzGLWRqJpjusg@mail.gmail.com>
Message-ID: <CADE62ED-68F2-4E48-9013-C9BF66E56CAD@gmail.com>

Hi Quanwei, 

Increasing the "min_protein" parameter should get rid of those very short predicted proteins.


> On Sep 19, 2017, at 12:14 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Hello:
> 
> I am working on a rodent species and get 28k annotated genes, I wonder whether you have any suggestions about the "min_protein" parameter? 
> 
> I did not change the parameter in my current annotation. I get several very short predicted proteins (even those with only 1 amino acid). 
>  
> min_protein=0 #require at least this many amino acids in predicted proteins
> 
> Thanks
> 
> Best
> Quanwei


From tuanduonganh at gmail.com  Tue Sep 19 11:23:39 2017
From: tuanduonganh at gmail.com (Tuan Duong Anh)
Date: Tue, 19 Sep 2017 19:23:39 +0200
Subject: [maker-devel] MAKER3 beta - EVM under predicting
Message-ID: <CAPpYYpyY_4kz4L10qMPU9RuRgmW+viFU8wVwOUaYV=WZnV=wHQ@mail.gmail.com>

Dear MAKER-devel group

I have been testing the MAKER3 beta version and found that EVM always
returns far fewer models. Has anyone experienced this before? I
do expect EVM to return fewer models compared to the others, but not
to this extent (only 20% of the expected gene models). Any suggestions would
be much appreciated.

## Number of models obtained by each gene predictor:
HLIG.all.maker.augustus_masked.proteins.fasta:11224
HLIG.all.maker.evm.proteins.fasta:1974
HLIG.all.maker.genemark.proteins.fasta:11352
HLIG.all.maker.proteins.fasta:13672
HLIG.all.maker.snap_masked.proteins.fasta:13404
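
As a rough illustration of how counts like these can be tallied and compared, here is a minimal Python sketch. It assumes one ">" header line per model in each protein FASTA file; the function names (count_fasta_records, share_of) and the dictionary keys are made up for this example and are not part of MAKER or EVM. The numbers are the ones listed above.

```python
from typing import Dict, Iterable


def count_fasta_records(lines: Iterable[str]) -> int:
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def share_of(counts: Dict[str, int], reference: str) -> Dict[str, float]:
    """Express every count as a fraction of the chosen reference set."""
    ref = counts[reference]
    return {name: n / ref for name, n in counts.items()}


# Model counts copied from the list above (keys are shortened labels).
counts = {
    "augustus_masked": 11224,
    "evm": 1974,
    "genemark": 11352,
    "maker": 13672,
    "snap_masked": 13404,
}

shares = share_of(counts, "maker")
for name in sorted(shares):
    print(f"{name}: {counts[name]} models, {shares[name]:.1%} of the final MAKER set")
```

To count a real file, something like count_fasta_records(open(path)) should work, where path points to one of the *.proteins.fasta files.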

## maker_evm.ctl
#-----Transcript weights
evmtrans=10 #default weight for source unspecified est/alt_est alignments
evmtrans:blastn=0 #weight for blastn sourced alignments
evmtrans:est2genome=10 #weight for est2genome sourced alignments
evmtrans:tblastx=0 #weight for tblastx sourced alignments
evmtrans:cdna2genome=7 #weight for cdna2genome sourced alignments

#-----Protein weights
evmprot=10 #default weight for source unspecified protein alignments
evmprot:blastx=2 #weight for blastx sourced alignments
evmprot:protein2genome=10 #weight for protein2genome sourced alignments

#-----Abinitio Prediction weights
evmab=10 #default weight for source unspecified ab initio predictions
evmab:snap=7 #weight for snap sourced predictions
evmab:augustus=10 #weight for augustus sourced predictions
evmab:fgenesh=10 #weight for fgenesh sourced predictions
evmab:genemark=10 #weight for genemark sourced predictions


Regards,

Tuan

From carsonhh at gmail.com  Tue Sep 19 15:34:40 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 Sep 2017 15:34:40 -0600
Subject: [maker-devel] augustus underpredicting
In-Reply-To: <CAL0hg4HWHjmEJ680+c4XiajWaE5WKqfbVK4oeNYVDWPuz_Oh+Q@mail.gmail.com>
References: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>
	<E0AFDA01-3142-4A14-A4F0-E621E5BD4C2F@gmail.com>
	<CAL0hg4HWHjmEJ680+c4XiajWaE5WKqfbVK4oeNYVDWPuz_Oh+Q@mail.gmail.com>
Message-ID: <3B1C5B03-58F9-4427-815A-1D725C740A04@gmail.com>

Gene predictors tend to overpredict, so I would not take the high numbers given by SNAP and GeneMark as true counts. You will probably end up with something like 7-10k in the final results. But now that Augustus is giving a higher count, you should be good to start running MAKER.

--Carson


> On Sep 17, 2017, at 7:12 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com> wrote:
> 
> I did it that way and AUGUSTUS is predicting a more reasonable number of genes, about 12500 in Maker, but about 19000 in the model assessment step.
> In comparison, SNAP gives 16000 and GeneMark 19000.
> 
> I haven't found any reference about it, but would it be a good idea to train Augustus on the masked genome instead?
> Thanks,
> 
> 
> 
> On 12 September 2017 at 02:50, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
> BUSCO may be generating too few models. BUSCO also identifies classes of conserved short genes that may not represent enough training diversity for your organism. Try running MAKER in protein2genome or est2genome mode, and then train with those results.
> 
> --Carson
> 
> 
>> On Sep 10, 2017, at 7:03 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com <mailto:xvazquezc at gmail.com>> wrote:
>> 
>> Hi,
>> I have been annotating a fungal genome as usual, using BUSCO-trained Augustus (in addition to GeneMark and SNAP), but for some reason, Augustus is predicting a mere 207 genes compared to 15-20k from the other two.
>> I've never had this problem. The genome has an unusual repeat content of close to 50%; not sure if that might pose a problem.
>> Has anybody come across a similar issue?
>> I also asked the BUSCO developers whether they have any ideas: https://gitlab.com/ezlab/busco/issues/49 <https://gitlab.com/ezlab/busco/issues/49>
>> Cheers,
>> Xabi
>> 
>> -- 
>> Xabier Vázquez-Campos, PhD
>> Research Associate
>> NSW Systems Biology Initiative
>> School of Biotechnology and Biomolecular Sciences
>> The University of New South Wales
>> Sydney NSW 2052 AUSTRALIA
> 
> 
> 
> 
> -- 
> Xabier Vázquez-Campos, PhD
> Research Associate
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA


From carsonhh at gmail.com  Tue Sep 19 15:40:27 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 Sep 2017 15:40:27 -0600
Subject: [maker-devel] Question about "maker-", "augustus_masked",
 "snap_masked" gene model
In-Reply-To: <CAOW6FS+MkBOsHct0Ph+mdHD-VTeyN2RbHS2_1Neye6qZtUDWrA@mail.gmail.com>
References: <CAOW6FS+MkBOsHct0Ph+mdHD-VTeyN2RbHS2_1Neye6qZtUDWrA@mail.gmail.com>
Message-ID: <56CC4BEB-083E-4DE6-99F3-CB34A1735AB4@gmail.com>

MAKER uses all derived models as a pool of alternate models for a given locus. The one that best matches the aligned evidence is then selected using the AED calculation described in the MAKER2 publication. Overall, hint-based models tend to perform better than the raw models because they get extra information about observed intron/exon structure from alignments. There is also a discussion of this in the MAKER2 paper.
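
For illustration only, here is a minimal Python sketch of an AED-style selection. It assumes the nucleotide-level definition AED = 1 - (SN + SP)/2, where SN is the fraction of evidence bases covered by the model and SP is the fraction of model bases covered by evidence. The function names (covered_bases, aed, best_model), the model labels and the coordinates are hypothetical; this is not MAKER's own code, which handles more cases than this.

```python
from typing import Dict, Iterable, List, Set, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def covered_bases(intervals: Iterable[Interval]) -> Set[int]:
    """Return the set of base positions covered by the given intervals."""
    bases: Set[int] = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def aed(model_exons: Iterable[Interval], evidence_blocks: Iterable[Interval]) -> float:
    """Simplified annotation edit distance: 0 = perfect agreement, 1 = none."""
    model = covered_bases(model_exons)
    evidence = covered_bases(evidence_blocks)
    if not model or not evidence:
        return 1.0  # nothing to compare, treat as no support
    overlap = len(model & evidence)
    sensitivity = overlap / len(evidence)  # evidence bases explained by the model
    specificity = overlap / len(model)     # model bases backed by evidence
    return 1.0 - (sensitivity + specificity) / 2.0


def best_model(candidates: Dict[str, List[Interval]],
               evidence_blocks: List[Interval]) -> str:
    """Pick the candidate model with the lowest AED against the evidence."""
    return min(candidates, key=lambda name: aed(candidates[name], evidence_blocks))


# Hypothetical locus: two aligned evidence blocks and two competing models.
evidence = [(101, 200), (301, 400)]
candidates = {
    "raw": [(101, 200), (301, 450)],         # last exon runs past the evidence
    "hint_based": [(101, 200), (301, 400)],  # exon ends follow the alignment
}
print(best_model(candidates, evidence))
```

In this toy case the hint-based model matches the evidence exactly, so it gets the lower AED and is the one selected.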

--Carson


> On Sep 18, 2017, at 9:07 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Hello:
> 
> Would you please explain the difference between the "maker-...-augustus..." and "augustus_masked..." gene models?
> 
> I know the "augustus_masked..." gene models are raw Augustus predictions, while "maker-...-augustus..." are hint-derived gene models. But by default, MAKER2 reports gene models with evidence support (protein sequences or transcripts). Then why are some gene models hint-derived while other models (with evidence support) are raw Augustus predictions (even though there is protein or transcript evidence)?
> 
> BTW, is it true that the "maker-...-augustus..." gene models are generally more reliable than the "augustus_masked..." gene models?
> 
> Thanks
> 
> Best
> Quanwei


From carsonhh at gmail.com  Tue Sep 19 15:41:40 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 Sep 2017 15:41:40 -0600
Subject: [maker-devel] about min_protein
In-Reply-To: <CAOW6FSLxd=oEHcpq73yGjDQjsdTbU=AE89u6HT3m8md=uiceSQ@mail.gmail.com>
References: <CAOW6FSK5PcBbqnOt2s-ObUoEGFhOpXNYhcj=wxzGLWRqJpjusg@mail.gmail.com>
	<CADE62ED-68F2-4E48-9013-C9BF66E56CAD@gmail.com>
	<CAOW6FSLxd=oEHcpq73yGjDQjsdTbU=AE89u6HT3m8md=uiceSQ@mail.gmail.com>
Message-ID: <FFA05628-32ED-4036-9FDC-E6C7BC4EAE4C@gmail.com>

The value is arbitrary, but some submission databases like NCBI will flag entries under ~20-30 amino acids as errors if you try to submit them (I can't remember the exact number).
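
If it helps to pick a cutoff, a small Python sketch like the one below could list which predicted proteins would fall under a given min_protein before you rerun. The function names (protein_lengths, shorter_than), the toy sequences and the threshold of 30 are assumptions for illustration; this is not MAKER code and not an official recommendation.

```python
from typing import Dict, Iterable, List


def protein_lengths(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Map each FASTA record ID to its sequence length in amino acids."""
    lengths: Dict[str, int] = {}
    current = None
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            # Use the first word of the header as the ID; fall back to a
            # placeholder name if the header is empty.
            current = parts[0] if parts else f"unnamed_{len(lengths)}"
            lengths[current] = 0
        elif current is not None:
            # A trailing '*' (stop symbol), if present, is not counted.
            lengths[current] += len(line.rstrip("*"))
    return lengths


def shorter_than(lengths: Dict[str, int], min_protein: int) -> List[str]:
    """IDs of proteins with fewer than min_protein amino acids."""
    return sorted(name for name, n in lengths.items() if n < min_protein)


# Toy input standing in for a predicted-protein FASTA file.
demo = [">gene1 example", "MKT", "LLV*", ">gene2", "M", ">gene3", "M" * 30]
lengths = protein_lengths(demo)
print(shorter_than(lengths, 30))
```

For a real run, protein_lengths(open(path)) should work on a protein FASTA file, where path is wherever your merged proteins were written.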

--Carson


> On Sep 19, 2017, at 6:47 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Thank you Daniel. I wonder whether there is a suggested value for the "min_protein" parameter (e.g., 20 amino acids, 50 amino acids?) that people often use. I am studying a rodent species.
> 
> Thank you.
> 
> Best
> Quanwei
> 
> 2017-09-19 8:29 GMT-04:00 Daniel Ence <dandence at gmail.com <mailto:dandence at gmail.com>>:
> Hi Quanwei,
> 
> Increasing the "min_protein" parameter should get rid of those very short predicted proteins.
> 
> 
> 
> > On Sep 19, 2017, at 12:14 AM, Quanwei Zhang <qwzhang0601 at gmail.com <mailto:qwzhang0601 at gmail.com>> wrote:
> >
> > Hello:
> >
> > I am working on a rodent species and get 28k annotated genes, I wonder whether you have any suggestions about the "min_protein" parameter?
> >
> > I did not change the parameter in my current annotation. I get several very short predicted proteins (even those with only 1 amino acid).
> >
> > min_protein=0 #require at least this many amino acids in predicted proteins
> >
> > Thanks
> >
> > Best
> > Quanwei
> 
> 


From carsonhh at gmail.com  Tue Sep 19 15:47:42 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 19 Sep 2017 15:47:42 -0600
Subject: [maker-devel] MAKER3 beta - EVM under predicting
In-Reply-To: <CAPpYYpyY_4kz4L10qMPU9RuRgmW+viFU8wVwOUaYV=WZnV=wHQ@mail.gmail.com>
References: <CAPpYYpyY_4kz4L10qMPU9RuRgmW+viFU8wVwOUaYV=WZnV=wHQ@mail.gmail.com>
Message-ID: <12FE3318-F0DE-485B-B43A-25A4A6EC9390@gmail.com>

If ab initio predictors and evidence alignments aren't in high concordance, then EVM won't produce results. This often indicates minor sequencing errors in the assembly (this is very common in draft assemblies). Ab initio predictors will slightly alter splicing and extend introns/exons to make a model work around these variations, but doing this does not always concord well with the alignment, so EVM produces nothing. In these cases it is often better just to train the predictors as well as you can, and then take the standard MAKER results.

--Carson


> On Sep 19, 2017, at 11:23 AM, Tuan Duong Anh <tuanduonganh at gmail.com> wrote:
> 
> Dear MAKER-devel group
> 
> I have been testing the MAKER3 beta version and found that EVM always returns far fewer models. Has anyone experienced this before? I do expect EVM to return fewer models compared to the others, but not to this extent (only 20% of the expected gene models). Any suggestions would be much appreciated.
> 
> ## Number of models obtained by each gene predictor:
> HLIG.all.maker.augustus_masked.proteins.fasta:11224
> HLIG.all.maker.evm.proteins.fasta:1974
> HLIG.all.maker.genemark.proteins.fasta:11352
> HLIG.all.maker.proteins.fasta:13672
> HLIG.all.maker.snap_masked.proteins.fasta:13404
> 
> ## maker_evm.ctl
> #-----Transcript weights
> evmtrans=10 #default weight for source unspecified est/alt_est alignments
> evmtrans:blastn=0 #weight for blastn sourced alignments
> evmtrans:est2genome=10 #weight for est2genome sourced alignments
> evmtrans:tblastx=0 #weight for tblastx sourced alignments
> evmtrans:cdna2genome=7 #weight for cdna2genome sourced alignments
> 
> #-----Protein weights
> evmprot=10 #default weight for source unspecified protein alignments
> evmprot:blastx=2 #weight for blastx sourced alignments
> evmprot:protein2genome=10 #weight for protein2genome sourced alignments
> 
> #-----Abinitio Prediction weights
> evmab=10 #default weight for source unspecified ab initio predictions
> evmab:snap=7 #weight for snap sourced predictions
> evmab:augustus=10 #weight for augustus sourced predictions
> evmab:fgenesh=10 #weight for fgenesh sourced predictions
> evmab:genemark=10 #weight for genemark sourced predictions
> 
> 
> Regards,
> Tuan
> 
> 


From xvazquezc at gmail.com  Tue Sep 19 18:02:04 2017
From: xvazquezc at gmail.com (=?UTF-8?Q?Xabier_V=C3=A1zquez=2DCampos?=)
Date: Wed, 20 Sep 2017 10:02:04 +1000
Subject: [maker-devel] augustus underpredicting
In-Reply-To: <3B1C5B03-58F9-4427-815A-1D725C740A04@gmail.com>
References: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>
	<E0AFDA01-3142-4A14-A4F0-E621E5BD4C2F@gmail.com>
	<CAL0hg4HWHjmEJ680+c4XiajWaE5WKqfbVK4oeNYVDWPuz_Oh+Q@mail.gmail.com>
	<3B1C5B03-58F9-4427-815A-1D725C740A04@gmail.com>
Message-ID: <CAL0hg4FDD8JMXCauxuSjqsY0z=H92utgqMfmoH+s_Los6_c39g@mail.gmail.com>

Thanks Carson.

One last quick question. After the first run (before using the gene predictors)
I ran fasta_merge to get an idea of the numbers I should be looking for.
In summary, I got 14000 genes, using only Swissprot and a closely related, highly
curated reference genome to avoid any "fake" or partial proteins
from draft annotations, plus assembled RNA-seq from my genome.
How should I use this as a guide (if I can do so)? Is this a
number I should be aiming for as a minimum number of genes? A maximum? Something
around that?

PS: my genome (fungus) is 80+ Mbp in just 150 contigs, so I expect very few
fragmented genes due to assembly (sequencing errors aside).

On 20 September 2017 at 07:34, Carson Holt <carsonhh at gmail.com> wrote:

> Gene predictors tend to overpredict, so I would not take the high numbers
> given by SNAP and GeneMark as true counts. You will probably end up with
> something like 7-10k in the final results. But now that Augustus is giving a
> higher count, you should be good to start running MAKER.
>
> --Carson
>
>
>
>
> On Sep 17, 2017, at 7:12 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com>
> wrote:
>
> I did it that way and AUGUSTUS is predicting a more reasonable number of
> genes, about 12500 in Maker, but about 19000 in the model assessment step.
> In comparison, SNAP gives 16000 and GeneMark 19000.
>
> I haven't found any reference about it, but would it be a good idea to train
> Augustus on the masked genome instead?
> Thanks,
>
>
>
> On 12 September 2017 at 02:50, Carson Holt <carsonhh at gmail.com> wrote:
>
>> BUSCO may be generating too few models. BUSCO also identifies classes of
>> conserved short genes that may not represent enough training diversity for
>> your organism. Try running MAKER in protein2genome or est2genome mode, and
>> then train with those results.
>>
>> --Carson
>>
>>
>> On Sep 10, 2017, at 7:03 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com>
>> wrote:
>>
>> Hi,
>> I have been annotating a fungal genome as usual, using BUSCO-trained
>> Augustus (in addition to GeneMark and SNAP), but for some reason, Augustus
>> is predicting a mere 207 genes compared to 15-20k from the other two.
>> I've never had this problem. The genome has an unusual repeat content of
>> close to 50%; not sure if that might pose a problem.
>> Has anybody come across a similar issue?
>> I also asked the BUSCO developers whether they have any ideas:
>> https://gitlab.com/ezlab/busco/issues/49
>> Cheers,
>> Xabi
>>
>> --
>> Xabier Vázquez-Campos, *PhD*
>> *Research Associate*
>> NSW Systems Biology Initiative
>> School of Biotechnology and Biomolecular Sciences
>> The University of New South Wales
>> Sydney NSW 2052 AUSTRALIA
>>
>>
>>
>
>
> --
> Xabier Vázquez-Campos, *PhD*
> *Research Associate*
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA
>
>
>


-- 
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From himanimalhotra89 at gmail.com  Tue Sep 19 22:56:55 2017
From: himanimalhotra89 at gmail.com (himani malhotra)
Date: Wed, 20 Sep 2017 10:26:55 +0530
Subject: [maker-devel] Fwd: maker error
In-Reply-To: <CACOCz95arkZmq+F_cVjdcCH1GZap6UYvSsTY3oLJZY1+gJsY1w@mail.gmail.com>
References: <CACOCz95arkZmq+F_cVjdcCH1GZap6UYvSsTY3oLJZY1+gJsY1w@mail.gmail.com>
Message-ID: <CACOCz94NFJoo2GmohrCnvNGi3YDtK8TV5dX-0En7XMHRfLiSdA@mail.gmail.com>

---------- Forwarded message ----------
From: himani malhotra <himanimalhotra89 at gmail.com>
Date: Wed, Sep 20, 2017 at 10:24 AM
Subject: maker error
To: maker-devel-request at box290.bluehost.com


hello
I am using MAKER for gene prediction. I am getting an error in the Repbase
installation. I am sending you the error as well; please help me. I have
installed Repbase manually and unpacked its libraries in the RepeatMasker
Library, but I am still getting the error.
Please help me.


Thanks

Himani
-------------- next part --------------
A non-text attachment was scrubbed...
Name: makererror.png
Type: image/png
Size: 212522 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20170920/b8709d7b/attachment-0003.png>

From munholl at uwindsor.ca  Wed Sep 20 08:53:04 2017
From: munholl at uwindsor.ca (Seth Munholland)
Date: Wed, 20 Sep 2017 10:53:04 -0400
Subject: [maker-devel] Fwd: maker error
In-Reply-To: <CACOCz94NFJoo2GmohrCnvNGi3YDtK8TV5dX-0En7XMHRfLiSdA@mail.gmail.com>
References: <CACOCz95arkZmq+F_cVjdcCH1GZap6UYvSsTY3oLJZY1+gJsY1w@mail.gmail.com>
	<CACOCz94NFJoo2GmohrCnvNGi3YDtK8TV5dX-0En7XMHRfLiSdA@mail.gmail.com>
Message-ID: <CAL=sJwrjccQC0GdDa3Km1TojWMdN1aYoujntVsjdMjJ9ha2YUw@mail.gmail.com>

Hello,

When this happened to me, it was caused by a faulty path I had set when I
configured RepeatMasker (which I had also installed manually).

Seth Munholland, B.Sc., Ph.D. Candidate
Department of Biological Sciences
Rm. 304 Biology Building
University of Windsor
401 Sunset Ave. N9B 3P4
T: (519) 253-3000 Ext: 4755

On Wed, Sep 20, 2017 at 12:56 AM, himani malhotra <
himanimalhotra89 at gmail.com> wrote:

>
> ---------- Forwarded message ----------
> From: himani malhotra <himanimalhotra89 at gmail.com>
> Date: Wed, Sep 20, 2017 at 10:24 AM
> Subject: maker error
> To: maker-devel-request at box290.bluehost.com
>
>
> hello
> I am using MAKER for gene prediction. I am getting an error in the Repbase
> installation. I am sending you the error as well; please help me. I have
> installed Repbase manually and unpacked its libraries in the RepeatMasker
> Library, but I am still getting the error.
> Please help me.
>
>
>
> Thanks
>
> Himani
>
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
>

From Jimmy.Cross at uea.ac.uk  Wed Sep 20 08:02:53 2017
From: Jimmy.Cross at uea.ac.uk (James Cross (ITCS - Staff))
Date: Wed, 20 Sep 2017 14:02:53 +0000
Subject: [maker-devel] Maker MPI across nodes
Message-ID: <DBXPR04MB413DD222F9446CDC493A376BD610@DBXPR04MB413.eurprd04.prod.outlook.com>

Hi Maker Developers,

We are trying to run Maker with Open MPI on our HPC cluster across two nodes (each node has 28 cores, so 56 cores in total). While Maker seems to be running correctly, it runs slower when split across two nodes (56 cores) than when run on a single node (28 cores). We are trying to reduce the time Maker takes to complete its run. Do you know of any reason why Maker might slow down when split across two nodes?

Our cluster OS is CentOS 6.7 and the HPC scheduler is LSF. We are running Open MPI on a Mellanox InfiniBand network.

The genome we wish to annotate comprises 1948 scaffolds with an average length of 324890 bp (longest scaffold 6948830 bp).

The command in batch mode we are using is: mpiexec -mca btl ^openib -n 56 maker

Any help or advice you could give would be greatly appreciated.

Best Wishes
Jimmy
----------------------------------------------------------------------
Mr  James Cross
HPC Systems Developer
University of East Anglia
Norwich Research Park
ITCS
Norwich, Norfolk
NR4 7TJ

Information Services


From patrick.tranvan at unil.ch  Thu Sep 21 03:26:52 2017
From: patrick.tranvan at unil.ch (Patrick Tran Van)
Date: Thu, 21 Sep 2017 09:26:52 +0000
Subject: [maker-devel] Advice on my pipeline
In-Reply-To: <58E904BF-9AB8-4AC7-B10B-C902F414E03D@gmail.com>
References: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch>
	<A39F4213-70A8-4E07-AB13-00C427E4F244@gmail.com>
	<1498470630221.84642@unil.ch>
	<696C51C6-5606-4ECB-A8B8-9C077182FFFA@gmail.com>
	<1498908228256.16549@unil.ch>,
	<58E904BF-9AB8-4AC7-B10B-C902F414E03D@gmail.com>
Message-ID: <1505986013492.52354@unil.ch>

Hi Carson,


I have a question about round 2. In a previous reply you said:


" Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can't be run in the same directory). "


Does this mean that I don't need to modify the section:


#-----Re-annotation Using MAKER Derived GFF3


?


If I leave everything at the defaults, such as:


altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no


Will it then not look again for repeats and for the protein and transcriptome alignments?

Patrick Tran Van

Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206

________________________________
From: Carson Holt <carsonhh at gmail.com>
Sent: Monday, July 3, 2017 10:50 PM
To: Patrick Tran Van
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] Advice on my pipeline

maker2zff is just for SNAP training and not for gene filtering (please do not use it for filtering, it does not do what you think).

So the final annotation set after maker with correct_est_fusion is 16,850. To decide which set is better, look at them in a browser (gene counts are not useful for gauging results). A well annotated genome will have evidence clusters that closely match the final models. A poorly annotated genome will have evidence clusters that are split or merged by the models.

The correct_est_fusion option does two things. It trims long overlapping UTR fragments, and it stops evidence clusters from being merged on BLASTP evidence alone (so gene predictors will get unmerged hint regions if clusters are split).

You may also find that using jaccard_clip with Trinity has reduced sensitivity for the transcript data (you may lose things that were there before, but now have better specificity, i.e. fewer false positives). Make sure you provided protein data from at least two related species to help maintain sensitivity lost from the transcript data. You can also add rejected gene models back in after the fact by using iprscan to identify unsupported models with identifiable protein domains -> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4286374/

Thanks,
Carson


On Jul 1, 2017, at 5:21 AM, Patrick Tran Van <Patrick.TranVan at unil.ch<mailto:Patrick.TranVan at unil.ch>> wrote:

So I have assembled my transcriptome with Trinity using the jaccard clip option and I have run maker with and without corrected_est_fusion.

I then used SNAP to train/filter it with:

maker2zff  specie.all.gff

Here are my results:

Number of gene after maker -> Number of gene after maker2zff

- Without corrected_est_fusion: 21621 -> 13875
- With corrected_est_fusion: 16850 -> 9098

1) If I understand correctly how corrected_est_fusion works, it prevents gene merging, so shouldn't it be the other way around?
Normally I should find more genes with corrected_est_fusion, right?

2) I think I should find something like 13000-14000 genes for my species. Should I go with the "Without corrected_est_fusion" set for the 2nd iteration of maker?

 Thanks for your help


Patrick Tran Van

Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206

________________________________
From: Carson Holt <carsonhh at gmail.com<mailto:carsonhh at gmail.com>>
Sent: Monday, June 26, 2017 11:38 PM
To: Patrick Tran Van
Cc: maker-devel at yandell-lab.org<mailto:maker-devel at yandell-lab.org>
Subject: Re: [maker-devel] Advice on my pipeline

Sorry, the option is -> correct_est_fusion

It is in the maker_opts.ctl file.

I would use both SNAP and Augustus on a few large contigs then review the results manually. If one of them is not behaving well, then drop it. If both behave well (i.e. correlate well with evidence alignments) then keep them both.

--Carson


On Jun 26, 2017, at 3:48 AM, Patrick Tran Van <Patrick.TranVan at unil.ch<mailto:Patrick.TranVan at unil.ch>> wrote:

Thanks for your answer.

1) Do you think that adding Augustus training in addition to SNAP at steps 3 and 5 will add more confidence (instead of adding Augustus only for the final round)?
I ask because I am using autoAug for this and it takes a while to compute.

2) I don't see this option : 'avoid_est_fusion=1' . I have tried to add it but I got this error:

WARNING: Invalid option 'avoid_est_fusion' in control file maker_opts.ctl

(I am using v 2.31.8 )


Patrick Tran Van

Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206

________________________________
From: Carson Holt <carsonhh at gmail.com<mailto:carsonhh at gmail.com>>
Sent: Monday, June 5, 2017 8:29 PM
To: Patrick Tran Van
Cc: maker-devel at yandell-lab.org<mailto:maker-devel at yandell-lab.org>
Subject: Re: [maker-devel] Advice on my pipeline

Your plan sounds good. A couple of related notes.

Insect genomes tend to have high gene density, so gene merging will be the primary difficulty. You can avoid merging of mRNA-seq evidence by using options like jaccard_clip in Trinity. Then use avoid_est_fusion=1 inside of MAKER.

Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can't be run in the same directory).

--Carson


On Jun 2, 2017, at 3:56 AM, Patrick Tran Van <Patrick.TranVan at unil.ch<mailto:Patrick.TranVan at unil.ch>> wrote:

Hello,

This is my first time running Maker for an insect genome annotation.

I have found various resources and tried to build a consensus from them. I am looking for your thoughts and advice on my pipeline: whether I can improve something or am doing anything useless.


What I have:
- RNA evidence: transcriptome
- Protein evidence: swissprot/uniprot + busco protein set of insect
- Cegma and busco results of my genome


1) Train SNAP with CEGMA

2) Run (run A) maker with repeat masking with transcript, protein, the new SNAP file (from step 1) and augustus file (from busco).

3) Create SNAP model from run A.

4) Run (run B ) with the new SNAP (done at step 3) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_A.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).

5) Create SNAP model from run B.

6) Run (run C) with the new SNAP (done at step 5) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_B.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1).

7)  Create SNAP model from run C AND Create Augustus gene model from run C

8) Run (run D) with the new SNAP (done at step 7) + AUGUSTUS file (step 7) with options turned off (est2genome=0) and (protein2genome=0) data, provide gff file (maker_gff=run_C.gff), turn off repeat masking (rm_pass=1), and use previous mapping results (altest_pass=1 and protein_pass=1). + Use keep_preds=1


Does this seem coherent?

Cheers,

Patrick Tran Van

Groups Chapuisat, Robinson-Rechavi & Schwander
Department of Ecology and Evolution
University of Lausanne
Le Biophore
CH-1015 Lausanne
Switzerland
Office 3206


From carsonhh at gmail.com  Fri Sep 22 11:57:56 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 22 Sep 2017 11:57:56 -0600
Subject: [maker-devel] augustus underpredicting
In-Reply-To: <CAL0hg4FDD8JMXCauxuSjqsY0z=H92utgqMfmoH+s_Los6_c39g@mail.gmail.com>
References: <CAL0hg4HOSy=S7hJDAtQ15WOjQ3UzkBncSdMcB64uDOmpQUGmvg@mail.gmail.com>
	<E0AFDA01-3142-4A14-A4F0-E621E5BD4C2F@gmail.com>
	<CAL0hg4HWHjmEJ680+c4XiajWaE5WKqfbVK4oeNYVDWPuz_Oh+Q@mail.gmail.com>
	<3B1C5B03-58F9-4427-815A-1D725C740A04@gmail.com>
	<CAL0hg4FDD8JMXCauxuSjqsY0z=H92utgqMfmoH+s_Los6_c39g@mail.gmail.com>
Message-ID: <06E8D6C3-B278-4820-B309-5CF61186FDCB@gmail.com>

I don't think you can use the protein2genome option to estimate gene count. It will turn any alignment that matches at least 50% into a gene model. So you can get a lot of partial models, which will inflate the gene count. It's good enough for training but not so much for annotation.

--Carson


> On Sep 19, 2017, at 6:02 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com> wrote:
> 
> Thanks Carson.
> 
> Last quick question. After the first run (before using the gene predictors) I ran fasta_merge to get an idea of the numbers I should be looking for.
> In summary, I got 14000 genes, only using Swissprot and a close highly curated reference genome to avoid any "fake" protein or partial proteins from draft annotations, plus assembled RNA-seq from my genome. 
> How should I consider this as a guide? (if I can do so) ... Is this a number I should be aiming as a minimum number of genes? maximum? something around that?
> 
> PS my genome (fungus) is 80+ Mbp and just 150 contigs so I expect very few possible fragments due to assembly (seq errors aside)
> 
> On 20 September 2017 at 07:34, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
> Gene predictors tend to overpredict, so I would not take the high numbers given by SNAP and GeneMark as true counts. You will probably end up with something like 7-10k in the final results. But now that Augustus is giving a higher count, you should be good to start running MAKER.
> 
> --Carson
> 
> 
> 
> 
>> On Sep 17, 2017, at 7:12 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com <mailto:xvazquezc at gmail.com>> wrote:
>> 
>> I did it that way and AUGUSTUS is predicting a more reasonable number of genes, about 12500 in Maker, but about 19000 in the model assessment step.
>> In comparison, SNAP gives 16000 and GeneMark 19000.
>> 
>> I haven't found any reference on this, but would it be a good idea to train Augustus on the masked genome instead?
>> Thanks,
>> 
>> 
>> 
>> On 12 September 2017 at 02:50, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>> BUSCO may be generating too few models. BUSCO also identifies classes of conserved short genes that may not represent enough training diversity for your organism. Try running MAKER in protein2genome or est2genome mode, and then train with those results.
>> 
>> --Carson
>> 
>> 
>>> On Sep 10, 2017, at 7:03 PM, Xabier Vázquez-Campos <xvazquezc at gmail.com <mailto:xvazquezc at gmail.com>> wrote:
>>> 
>>> Hi,
>>> I have been annotating a fungal genome as usual, using Busco-trained Augustus (in addition to GeneMark and SNAP), but for some reason, Augustus is predicting a mere 207 genes compared to 15-20k from the other two.
>>> I've never had this problem. The genome has an unusual repeat content close to 50%, not sure if that might pose a problem.
>>> Has anybody come across a similar issue?
>>> I also asked the BUSCO developers if they have any idea: https://gitlab.com/ezlab/busco/issues/49 <https://gitlab.com/ezlab/busco/issues/49>
>>> Cheers,
>>> Xabi
>>> 
>>> -- 
>>> Xabier Vázquez-Campos, PhD
>>> Research Associate
>>> NSW Systems Biology Initiative
>>> School of Biotechnology and Biomolecular Sciences
>>> The University of New South Wales
>>> Sydney NSW 2052 AUSTRALIA
>> 
>> 
>> 
>> 
>> -- 
>> Xabier Vázquez-Campos, PhD
>> Research Associate
>> NSW Systems Biology Initiative
>> School of Biotechnology and Biomolecular Sciences
>> The University of New South Wales
>> Sydney NSW 2052 AUSTRALIA
> 
> 
> 
> 
> -- 
> Xabier Vázquez-Campos, PhD
> Research Associate
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA


From carsonhh at gmail.com  Fri Sep 22 13:47:36 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 22 Sep 2017 13:47:36 -0600
Subject: [maker-devel] Fwd: maker error
In-Reply-To: <CACOCz94NFJoo2GmohrCnvNGi3YDtK8TV5dX-0En7XMHRfLiSdA@mail.gmail.com>
References: <CACOCz95arkZmq+F_cVjdcCH1GZap6UYvSsTY3oLJZY1+gJsY1w@mail.gmail.com>
	<CACOCz94NFJoo2GmohrCnvNGi3YDtK8TV5dX-0En7XMHRfLiSdA@mail.gmail.com>
Message-ID: <5196E0C2-9FDC-4B6A-9D14-CA8514E002EF@gmail.com>

You have a couple of errors at the start indicating that you may have an issue with the perl forks module as well as the RepeatMasker installation. I'd recommend redoing both installations. Also, the screenshot you show is not the failure; it is MAKER giving up after failing 2 times. To capture the actual failure, set the try count to 3, then rerun and see what comes up in STDERR. Redirect STDERR to a file using '&>'.
Example:
maker &> err.log
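
A slightly fuller sketch of the same idea, in case it helps. It assumes that the command-line flag -tries is what sets the "try count" mentioned above (please check maker -h on your installation), and the helper name show_failures and its search pattern are only illustrative, not part of MAKER:

```shell
#!/bin/sh
# Hypothetical helper: print the first log lines that mention an error or failure.
show_failures() {
    grep -n -i -E 'error|fail' "$1" | head -n 20
}

# Only attempt the run when maker is on PATH and a control file is present.
# Assumption: -tries sets the number of attempts per contig.
if command -v maker >/dev/null 2>&1 && [ -f maker_opts.ctl ]; then
    # "> file 2>&1" captures both STDOUT and STDERR, like bash's "&>".
    maker -tries 3 > err.log 2>&1
    show_failures err.log
fi
```

The "> err.log 2>&1" form is used because "&>" is a bash extension and may not work if the job script runs under a plain sh.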

Thanks,
Carson


On Sep 19, 2017, at 10:56 PM, himani malhotra <himanimalhotra89 at gmail.com <mailto:himanimalhotra89 at gmail.com>> wrote:

> 
> ---------- Forwarded message ----------
> From: himani malhotra <himanimalhotra89 at gmail.com <mailto:himanimalhotra89 at gmail.com>>
> Date: Wed, Sep 20, 2017 at 10:24 AM
> Subject: maker error
> To: maker-devel-request at box290.bluehost.com <mailto:maker-devel-request at box290.bluehost.com>
> 
> 
> hello
> I am using MAKER for gene prediction. I am getting an error in the Repbase installation. I am sending you the error as well; please help me. I have installed Repbase manually and unpacked its libraries in the RepeatMasker Library, but I am still getting the error.
> Please help me.
> 
> 
> 
> Thanks 
> 
> Himani
> 
> <makererror.png>

From carsonhh at gmail.com  Fri Sep 22 13:59:17 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 22 Sep 2017 13:59:17 -0600
Subject: [maker-devel] Maker MPI across nodes
In-Reply-To: <DBXPR04MB413DD222F9446CDC493A376BD610@DBXPR04MB413.eurprd04.prod.outlook.com>
References: <DBXPR04MB413DD222F9446CDC493A376BD610@DBXPR04MB413.eurprd04.prod.outlook.com>
Message-ID: <BD2A6E4D-280B-4B38-AA1C-05C03503848C@gmail.com>

The "-mca btl ^openib" flag has the side effect of bypassing InfiniBand and using Ethernet. If the alternate communicators are too slow, you can switch back to indirect InfiniBand by using '--mca btl vader,tcp,self --mca btl_tcp_if_include ib0'. That option forces IP over InfiniBand instead of direct InfiniBand. The OpenFabrics libraries used by InfiniBand have a known issue because of how they use registered memory (they generate seg faults whenever a program makes system calls, i.e. MAKER calling BLAST). So you can't use direct InfiniBand with MAKER. Try this instead ->  '--mca btl vader,tcp,self --mca btl_tcp_if_include ib0'
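
Put together with the rest of the original command, that would look roughly like the sketch below. The function name maker_mpi_cmd is made up for illustration, and it assumes the IP-over-InfiniBand interface on the nodes is really called ib0 and that 56 ranks are wanted, as in the original post:

```shell
#!/bin/sh
# Hypothetical helper: build the mpiexec command line with the BTL settings above.
# $1 = number of MPI ranks, $2 = IPoIB interface name (assumed to be ib0).
maker_mpi_cmd() {
    nprocs=$1
    iface=${2:-ib0}
    echo "mpiexec --mca btl vader,tcp,self --mca btl_tcp_if_include $iface -n $nprocs maker"
}

# Print the command for review before putting it in the LSF job script.
maker_mpi_cmd 56
```

Printing the command first makes it easy to confirm the interface name on a compute node before submitting the job.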

Also, if it stays slow, it likely means you are hitting IO limits. If that is the case, make sure you are not setting TMP= to a network-mounted disk location, and that whatever temp space exists on your cluster is real per-node, locally mounted disk and not network-mounted disk.
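
One rough way to check this on a compute node is sketched below. The function names and the list of network filesystem types are assumptions and certainly not exhaustive, and "df -P -T" is the GNU form (as found on CentOS). It checks $TMPDIR or /tmp; if TMP= is set in maker_opts.ctl, check that directory instead:

```shell
#!/bin/sh
# Return success if the filesystem type looks network-mounted.
# The list of types is an assumption, not a complete catalogue.
is_network_fs() {
    case "$1" in
        nfs*|lustre|gpfs|beegfs|cifs|smb*|panfs|afs|ceph*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the filesystem type behind a directory (GNU df).
fs_type_of() {
    df -P -T "$1" 2>/dev/null | awk 'NR == 2 { print $2 }'
}

tmpdir=${TMPDIR:-/tmp}
fstype=$(fs_type_of "$tmpdir")
if [ -z "$fstype" ]; then
    echo "could not determine the filesystem type of $tmpdir"
elif is_network_fs "$fstype"; then
    echo "$tmpdir is on $fstype: network storage, likely slow for MAKER temp files"
else
    echo "$tmpdir is on $fstype: looks node-local"
fi
```

Run it inside a job on each node rather than on the login node, since the mounts can differ.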

--Carson


> On Sep 20, 2017, at 8:02 AM, James Cross (ITCS - Staff) <Jimmy.Cross at uea.ac.uk> wrote:
> 
> Hi Maker Developers,
>  
> We are trying to run Maker with Open MPI on our HPC cluster across two nodes (each node has 28 cores, so 56 cores in total). While Maker seems to be running correctly, it runs slower when split across two nodes (56 cores) than when run on a single node (28 cores). We are trying to reduce the time Maker takes to complete its run. Do you know of any reason why Maker might slow down when split across two nodes?
>  
> Our cluster OS is CentOS 6.7 and the HPC scheduler is LSF. We are running Open MPI on a Mellanox InfiniBand network.
>  
> The genome data we wish to annotate is comprised of 1948 scaffolds with an average length of 324890bp (longest scaffold 6948830bp). 
>  
> The command in batch mode we are using is: mpiexec -mca btl ^openib -n 56 maker
>  
> Any help or advice you could give would be greatly appreciated.
>  
> Best Wishes
> Jimmy
> ----------------------------------------------------------------------
> Mr  James Cross
> HPC Systems Developer
> University of East Anglia
> Norwich Research Park
> ITCS
> Norwich, Norfolk
> NR4 7TJ
>  
> Information Services
>  

From carson.holt at genetics.utah.edu  Fri Sep 22 14:04:10 2017
From: carson.holt at genetics.utah.edu (Carson Hinton Holt)
Date: Fri, 22 Sep 2017 20:04:10 +0000
Subject: [maker-devel] MAKER
In-Reply-To: <f9b68e0588dd4ce6a65dcd63cb4f7b4e@JKI-EXPF03.braunschweig.bba.intern>
References: <80e44920c4d64d6e82d18e5535656a32@JKI-EXPF03.braunschweig.bba.intern>
	<D2EB4BE8.3069C%myandell@genetics.utah.edu>
	<3BE2DF2C-9443-4314-9524-F0109AAE42E8@genetics.utah.edu>
	<ddad51a1aab5452b816be741d2969d81@JKI-EXPF03.braunschweig.bba.intern>
	<9867D99C-E4AD-4673-BB0B-804A1551F9F3@genetics.utah.edu>
	<f9b68e0588dd4ce6a65dcd63cb4f7b4e@JKI-EXPF03.braunschweig.bba.intern>
Message-ID: <B11ADE34-41A4-4EA7-A0AC-D4DFD649991D@genetics.utah.edu>

MAKER won't produce est2genome results for est_gff. This is partially because est2genome results are only used for training gene predictors. So you are essentially just getting protein2genome results from your runs. Once you get a gene predictor trained you will see a difference, as MAKER will use the intron/exon structure of the alignments as hints to improve gene predictor performance.
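
A small sketch of how one might spot this situation in a control file. The helper names ctl_value and check_est_source are made up for illustration; only the option names est, est_gff and est2genome come from the thread:

```shell
#!/bin/sh
# Hypothetical helper: print the value assigned to a key in a MAKER
# control file (empty if nothing is assigned). $1 = file, $2 = key.
ctl_value() {
    sed -n "s/^$2=\([^[:space:]#]*\).*/\1/p" "$1" | head -n 1
}

# Warn when est2genome=1 is set but transcripts are supplied only via est_gff.
check_est_source() {
    ctl=$1
    if [ "$(ctl_value "$ctl" est2genome)" = "1" ] &&
       [ -z "$(ctl_value "$ctl" est)" ] &&
       [ -n "$(ctl_value "$ctl" est_gff)" ]; then
        echo "est2genome=1 but only est_gff is set: expect no est2genome models"
    else
        echo "ok"
    fi
}

if [ -f maker_opts.ctl ]; then
    check_est_source maker_opts.ctl
fi
```

If the warning appears, supplying an assembled transcript fasta to est= (as suggested earlier in this thread) is the alternative to est_gff.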

--Carson


> On Sep 21, 2017, at 1:57 AM, Keilwagen, Jens <jens.keilwagen at julius-kuehn.de> wrote:
> 
> Hi Carson,
> 
> I have tried the proposed options for a small example (yeast).
> 
> I had 
> - proteins (fasta) from another yeast and 
> - transcript annotation (gff) from cufflinks and StringTie
> 
> I'd like to compare the maker results for 
> - proteins and StringTie
> Vs.
> - proteins and cufflinks
> 
> I used the default options, except:
> genome=<genome fasta>
> 
> protein=<protein fasta>
> est_gff=<transcript gff>
> 
> est2genome=1
> protein2genome=1
> 
> (An example is attached.)
> 
> Then I ran maker:
> 
> maker -RM_off -c 24
> find . -type f -name '*.gff' -exec cat {} + | grep maker > filtered-maker-prediction.gff
> 
> (The run seems to be okay. There were no FAILED, ... in the log. Cf. attachment)
> 
> Each maker run was started in a separate subdirectory.
> However, I realized that both maker runs yielded almost the same result (just one minor edit). This made me curious. 
> As far as I understood the files, I received the (filtered?) exonerate predictions for the proteins (from the other yeast). Is this correct? Why did I not receive any predictions (purely) based on the RNA-seq data? Did I do something wrong?
> 
> I'm looking forward to your reply.
> 
> Best regards, Jens
> 
> 
>> -----Original Message-----
>> From: Carson Hinton Holt [mailto:carson.holt at genetics.utah.edu]
>> Sent: Tuesday, September 19, 2017 23:37
>> To: Keilwagen, Jens
>> Subject: Re: MAKER
>> 
>> MAKER cannot use the BAM directly, but you can use something like
>> stringtie or trinity to assemble a transcript fasta that can be given
>> to the est= option.
>> 
>> Ab initio gene prediction is only enabled if you specify an hmm or
>> species file to use.  If all you want is homology based annotation, you
>> can try the est2genome and protein2genome options. Note the final
>> models may be partial if the alignments do not cover the gene end to
>> end.
>> 
>> --Carson
>> 
>> 
>> 
>>> On Sep 18, 2017, at 4:02 AM, Keilwagen, Jens <jens.keilwagen at julius-
>> kuehn.de> wrote:
>>> 
>>> Hi Carson,
>>> 
>>> thanks a lot for your last email that .
>>> 
>>> I was asked to do homology-based gene prediction using RNA-seq and
>> Maker was proposed as one option.
>>> Hence I'd like to ask how to do that in the best possible way.
>>> I have mapped RNA-seq data (SAM/BAM) and a fasta of proteins from a
>> related species. How can I integrate the RNA-seq data?
>>> 
>>> Is it possible to deactivate ab-initio gene prediction by Augustus or
>> SNAP?
>>> 
>>> Thanks a lot in advance.
>>> 
>>> Best regards, Jens
>>> 
>>>> -----Original Message-----
>>>> From: Carson Holt [mailto:carson.holt at genetics.utah.edu]
>>>> Sent: Thursday, February 18, 2016 19:03
>>>> To: Keilwagen, Jens
>>>> Cc: Mark Yandell
>>>> Subject: Re: MAKER
>>>> 
>>>> GeMoMa sounds like an interesting tool.  If it produces GFF3, you
>>>> could give the GFF3 results to the pred_gff= option in MAKER (comma
>>>> separated lists accepted). The GFF3 file of predictions must be in
>>>> the same coordinate space as the assembly being annotated (genome=
>> option).
>>>> Whatever you give to pred_gff will be treated as raw predictions
>> by
>>>> MAKER and will only be accepted as a final model if there are
>>>> evidence alignments (protein/EST) that support the model, and if
>>>> there are multiple alternate models at the same locus, only the
>> model
>>>> that is best supported by the protein/transcript evidence is kept.
>>>> 
>>>> You can also set the keep_preds=1 option when using pred_gff. This
>>>> will cause even raw predictions with no evidence support to be
>> maintained.
>>>> In the event of multiple models with no evidence support, the model
>>>> best matching the consensus of alternate models will be maintained.
>>>> 
>>>> Alternatively you can use the model_gff= options (comma separated
>>>> list
>>>> ok) to input the GFF3 file.  model_gff features are given higher
>>>> confidence than pred_gff. At least one model will always be kept
>>>> regardless of evidence support (same rules as pred_gff selection for
>>>> which model to keep when there are multiple). But model_gff will
>> also
>>>> affect how evidence clusters are determined compared to pred_gff
>>>> (model_gff features are allowed to merge bridging evidence
>> clusters).
>>>> MAKER will also go to extra lengths to pull forward existing names
>>>> and other data in the GFF3 for model_gff features.
>>>> 
>>>> If you do not have GFF3 files in the right coordinate space, but do
>>>> have protein fasta or transcript fasta for the GeMoMa predictions,
>>>> you can supply these to the protein= and transcript= options in
>> MAKER
>>>> together with est2genome=1 or protein2genome=1. This will cause
>> MAKER
>>>> to place the models using exonerate. You would probably also need to
>>>> add est_forward=1 to the control files to have MAKER try and derive
>>>> model names from the name of evidence alignments they were derived
>>>> from if you go this route.
>>>> 
>>>> You can also try treating the GFF3 predictions as hints to
>>>> traditional ab initio gene finders like SNAP or Augustus by giving
>>>> them to the est_gff= or protein_gff= options (i.e. make GeMoMa
>>>> predictions inform the behavior of predictors like SNAP and
>>>> Augustus). Might be interesting. You would have to alter results to
>>>> be match/match_part
>>>> GFF3 features to give them to the est_gff or protein_gff options.
>>>> 
>>>> Let me know if you have any more questions, and I'll do my best to
>>>> help.
>>>> 
>>>> Thanks,
>>>> Carson
>>>> 
>>>> 
>>>> 
>>>>> On Feb 18, 2016, at 10:22 AM, Mark Yandell
>>>> <myandell at genetics.utah.edu> wrote:
>>>>> 
>>>>> 
>>>>> Mark Yandell
>>>>> Professor of Human Genetics
>>>>> H.A. & Edna Benning Presidential Endowed Chair Co-director USTAR
>>>>> Center for Genetic Discovery Eccles Institute of Human Genetics
>>>>> University of Utah
>>>>> 15 North 2030 East, Room 2100
>>>>> Salt Lake City, UT 84112-5330
>>>>> ph:801-587-7707
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On 2/18/16, 8:34 AM, "Keilwagen, Jens" <jens.keilwagen at jki.bund.de>
>>>> wrote:
>>>>> 
>>>>>> Dear Prof. Yandell,
>>>>>> 
>>>>>> we have published a homology-based gene prediction program today:
>>>>>> https://nar.oxfordjournals.org/content/early/2016/02/17/nar.gkw092
>>>>>> and I'd like to ask how we can use MAKER to combine predictions of
>>>>>> GeMoMa using different reference organisms, i.e. we try to predict
>>>>>> the genes of a target organism (e.g. wheat) using the annotated
>>>>>> genes of other reference organisms (e.g. grasses). GeMoMa returns
>>>> for
>>>>>> each reference organism a GFF with the predicted gene models in
>> the
>>>> target organism.
>>>>>> 
>>>>>> It would be great if you or someone from your team could give us
>>>> some
>>>>>> hints or point us to correct paragraph in the documentation.
>>>>>> 
>>>>>> Thanks a lot and best regards, Jens
>>>>>> 
>>>>>> ---
>>>>>> 
>>>>>> Dr. Jens Keilwagen
>>>>>> 
>>>>>> Julius Kühn-Institut (JKI) - Federal Research Centre for
>> Cultivated
>>>>>> Plants
>>>>>> 	Institute for Biosafety in Plant Biotechnology
>>>>>> 
>>>>>> Erwin-Baur-Straße 27
>>>>>> 06484 Quedlinburg
>>>>>> Germany
>>>>>> 
>>>>>> Phone: ++49 (0)3946 47 510
>>>>>> EMail: jens.keilwagen at jki.bund.de
>>>>>> 
>>>>>> 
>>>>> 
>>> 
> 
> <maker_opts.ctl><slurm-278767.out>


From eennadi at gmail.com  Fri Sep 22 13:27:37 2017
From: eennadi at gmail.com (Emmanuel Nnadi)
Date: Fri, 22 Sep 2017 20:27:37 +0100
Subject: [maker-devel] Maker not installing
In-Reply-To: <CAOXM=CUPj9EGj_LNEhBpOw0oCiQKoVw8JuPv8-WASo2paSXkug@mail.gmail.com>
References: <CAOXM=CXp_YU=4kaNCFPVnoWKTzAP5we5XTFX6uFTWJSY_11fng@mail.gmail.com>
	<8576F599-08C0-455A-B0D7-8A1110C6F18F@mail.ufl.edu>
	<CAOXM=CWWENFHJUhYPawM5+F4571HYOyp3oWD1xju9H9LABq9VQ@mail.gmail.com>
	<113031DF-E733-4EEF-BDB7-405C15F0CA24@mail.ufl.edu>
	<CAOXM=CVrYeNV9i6TxYDn1B-dCA-OYh+VqsEmc-PsA2rP14UZ9A@mail.gmail.com>
	<546610AC-0B7B-4D02-BBF3-E847B95D7F0D@mail.ufl.edu>
	<C3F7BAD9-C134-4EC9-9792-F2C27A78FBFD@gmail.com>
	<CAOXM=CXeT-zmFubptPnWsAygnv8o+EQnCWD60iLf5eFFBkb6rg@mail.gmail.com>
	<8EAFE412-9EF7-4DB7-85A3-632BAC3372FD@gmail.com>
	<CAOXM=CVbJonkz=O5=N_NmB1L+524D6Uk9Fjh-ZBoJmu+9AsMFQ@mail.gmail.com>
	<C9D0377D-1C89-4910-A6BE-FA7DD3D8CDCB@gmail.com>
	<CAOXM=CX7yZTTbBALhjBzz0qtMxQGxBNVO0UEnkis3aDvVf=GLA@mail.gmail.com>
	<C06B6277-1CFA-4B8D-8E80-D08910EBD77C@gmail.com>
	<CAOXM=CXSBG+h1DYoOBrFHYr_JXZX6YCFyo415HRGHozMcmHicg@mail.gmail.com>
	<EC4577AD-A3C5-45CD-ADE7-5D19ED089833@gmail.com>
	<CAOXM=CVby_i3vfF_H46RTZWoqBLMCBGNqJoywxG_2KQJPc5Ttw@mail.gmail.com>
	<7440971C-8A18-4A07-91B6-AD16D17F0766@gmail.com>
	<CAOXM=CXUBcvbqZdT2nviYnFyC3X4Xbj3oREVt650n0c01HESyg@mail.gmail.com>
	<426E63C1-8C82-4809-B2A1-EF7A909E6712@gmail.com>
	<CAOXM=CUPj9EGj_LNEhBpOw0oCiQKoVw8JuPv8-WASo2paSXkug@mail.gmail.com>
Message-ID: <CAOXM=CUxE-MSWsjZwMYeZ-aOu9Jc9bP6k1uzDLrEyfyqwR-B_Q@mail.gmail.com>

Hello all,
Please how can I determine the following in maker:
1. The total number of chromosomes
2. The size of my genome


Thanks

Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:
https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Fri, Sep 1, 2017 at 10:52 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:

> Ok, thanks.
> Nnadi Nnaemeka Emmanuel
> Department of Microbiology,
> Faculty of Natural and Applied Science,
> Plateau State University, Bokkos, Plateau State, Nigeria.
> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/
> publications
>
>
>
> On Sep 1, 2017 10:50 PM, "Carson Holt" <carsonhh at gmail.com> wrote:
>
>> It would need to be a new run. You won't be able to use the updated
>> contig names with the old run.
>>
>> --Carson
>>
>> Sent from my iPhone
>>
>> On Sep 1, 2017, at 3:41 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>>
>> Hi carson
>> Thanks for the tip
>> perl -ane 's/1_S7_R1_001_\(paired\)_trimmed_\(paired\)_//g; print'
>> genome.fasta
>>
>> It worked well; when I ran it, it removed
>> 1_S7_R1_001_\(paired\)_trimmed_\(paired\)_.
>>
>> However, I have already run maker with
>> 1_S7_R1_001_\(paired\)_trimmed_\(paired\)_ in the contig names.
>>
>> 1. How can I apply the change when maker has already produced some files
>> from the old sequence?
>>
>> I have spent more than 24 hours running maker and it has produced some
>> folders already.
>>
>> How can I make this change?
>>
>> Thanks
>>
>>
>>
>>
>> Nnadi Nnaemeka Emmanuel
>> Department of Microbiology,
>> Faculty of Natural and Applied Science,
>> Plateau State University, Bokkos, Plateau State, Nigeria.
>> Publications:  https://www.researchgate.net/
>> profile/Emmanuel_Nnadi/publications
>>
>> On Fri, Sep 1, 2017 at 4:54 PM, Carson Holt <carsonhh at gmail.com> wrote:
>>
>>> BLAST, which is used by MAKER, cannot handle really long contig names.
>>> MAKER tries to get around this by adding a secondary tag to the fasta
>>> header when long names are detected. Even then it would be better to change
>>> the IDs of your contigs to avoid downstream failures.
>>>
>>> I would recommend removing '1_S7_R1_001_(paired)_trimmed_(paired)_'
>>> from each contig name.
>>>
>>> Example command to do that -->
>>> perl -ane 's/1_S7_R1_001_\(paired\)_trimmed_\(paired\)_//g; print'
>>> genome.fasta
>>>
>>> --Carson
>>>
>>>
>>> On Aug 30, 2017, at 3:54 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>>>
>>> Hi Carson
>>> Thanks for your response, it's been helpful.
>>>
>>> Please bear with me as I work through this.
>>>
>>> 1. How do I generate ESTs for my novel sequences?
>>> 2. I am currently running maker without EST and protein sequences. Is that
>>> wrong? Can it predict properly?
>>> 3. One contig just returned this error:
>>> FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence
>>> identifier which is too long ( max id length = 50 )
>>>  at /usr/local/bin/RepeatMasker line 1464.
>>> FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence
>>> identifier which is too long ( max id length = 50 )
>>>  at /usr/local/bin/RepeatMasker line 1464.
>>> FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence
>>> identifier which is too long ( max id length = 50 )
>>>  at /usr/local/bin/RepeatMasker line 1464.
>>> ERROR: RepeatMasker failed
>>> --> rank=NA, hostname=emmannaemekas-MacBook-Pro.local
>>> ERROR: Failed while doing repeat masking
>>> ERROR: Chunk failed at level:0, tier_type:1
>>> FAILED CONTIG:1_S7_R1_001_(paired)_trimmed_(paired)_contig_2
>>>
>>> ERROR: Chunk failed at level:2, tier_type:0
>>> FAILED CONTIG:1_S7_R1_001_(paired)_trimmed_(paired)_contig_2
>>>
>>> examining contents of the fasta file and run log
>>>
>>>
>>> Nnadi Nnaemeka Emmanuel
>>> Department of Microbiology,
>>> Faculty of Natural and Applied Science,
>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>> Publications:  https://www.researchgate.net/
>>> profile/Emmanuel_Nnadi/publications
>>>
>>> On Wed, Aug 30, 2017 at 4:12 PM, Carson Holt <carsonhh at gmail.com> wrote:
>>>
>>>> You can query valid species names using the queryTaxonomyDatabase.pl
>>>> script that comes with RepeatMasker. Try not to be too specific. In general
>>>> you should use the genus rather than the species for example (or even use
>>>> all of RepBase).
>>>>
>>>> Example -->
>>>> perl .../RepeatMasker/util/queryTaxonomyDatabase.pl -species "drosophila"
>>>>
>>>> --Carson
>>>>
>>>>
>>>>
>>>> On Aug 30, 2017, at 9:05 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>>>>
>>>> Hi Carson,
>>>>
>>>>  Thanks
>>>> I was able to start using maker.
>>>>
>>>> However, I am working with a novel plant genome. I had set the
>>>> repeat masking to
>>>> 1. Dcotrep, a name from the RepBase release, but maker reported it
>>>> as not known to RepeatMasker.
>>>>
>>>> How can I use specific known genomes for repeat masking?
>>>> Thanks
>>>>
>>>> Nnadi Nnaemeka Emmanuel
>>>> Department of Microbiology,
>>>> Faculty of Natural and Applied Science,
>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>> Publications:  https://www.researchgate.net/
>>>> profile/Emmanuel_Nnadi/publications
>>>>
>>>>
>>>>
>>>> On Aug 29, 2017 4:26 PM, "Carson Holt" <carsonhh at gmail.com> wrote:
>>>>
>>>>> MAKER will read the genome= options from the maker_opts.ctl file in
>>>>> your current directory or the maker_opts.ctl you specified on the command
>>>>> line. The error means you have left the value empty. Perhaps you did not
>>>>> save the changes you made or you did not specify the location of
>>>>> the maker_opts.ctl file to use.
>>>>>
>>>>> You can check the contents of the file using cat. Example -->
>>>>> cat maker_opts.ctl
>>>>>
>>>>> --Carson
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Aug 29, 2017, at 5:11 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>>>>>
>>>>> Hi Carson,
>>>>> Thanks a lot for yesterday. I was able to resolve the issue of running
>>>>> maker and i followed the commands in the tutorial.
>>>>> I however encountered another problem
>>>>>
>>>>> when I ran the command nano -c maker_opts.ctl
>>>>>
>>>>> It showed the following. I specified 1_S7_assembly.fa as the name of
>>>>> the genome, but when I ran maker in another tab it gave an error.
>>>>>
>>>>> #-----Genome (these are always required)
>>>>> genome=1_S7_assembly.fa #genome sequence (fasta file or fasta embeded in GFF3 file)
>>>>> organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic
>>>>>
>>>>> #-----Re-annotation Using MAKER Derived GFF3
>>>>> maker_gff= #MAKER derived GFF3 file
>>>>> est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
>>>>> altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
>>>>> protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
>>>>> rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
>>>>> model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
>>>>> pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
>>>>> other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no
>>>>>
>>>>> #-----EST Evidence (for best results provide a file for at least one)
>>>>> est= #set of ESTs or assembled mRNA-seq in fasta format
>>>>> altest= #EST/cDNA sequence file in fasta format from an alternate organism
>>>>> est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
>>>>> altest_gff= #aligned ESTs from a closly relate species in GFF3 format
>>>>>
>>>>> #-----Protein Homology Evidence (for best results provide a file for at least one)
>>>>> protein=  #protein sequence file in fasta format (i.e. from mutiple oransisms)
>>>>> protein_gff=  #aligned protein homology evidence from an external GFF3 file
>>>>>
>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>> model_org=all #select a model organism for RepBase masking in RepeatMasker
>>>>> rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
>>>>> repeat_protein=/Users/emmannaemeka/Desktop/Gpm/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner
>>>>> rm_gff= #pre-identified repeat elements from an external GFF3 file
>>>>> prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
>>>>> softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
>>>>>
>>>>>
>>>>> I ran the maker command in another tab and it returned the following
>>>>> STATUS: Parsing control files...
>>>>> ERROR: You have failed to provide a value for 'genome' in the control
>>>>> files.
>>>>>
>>>>> --> rank=NA, hostname=emmannamekasMBP
>>>>>
>>>>>
>>>>> Questions
>>>>> 1. After specifying the genome location, do I need to run maker in the
>>>>> same tab or open another bash tab?
>>>>> 2. My genome is novel and I do not have proteins. How do I generate
>>>>> protein FASTA and ESTs for the de novo sequence?
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>> Nnadi Nnaemeka Emmanuel
>>>>> Department of Microbiology,
>>>>> Faculty of Natural and Applied Science,
>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>> Publications:  https://www.researchgate.net/
>>>>> profile/Emmanuel_Nnadi/publications
>>>>>
>>>>> On Mon, Aug 28, 2017 at 6:47 PM, Carson Holt <carsonhh at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Here is a class on how to use MAKER taught a couple of years back -->
>>>>>> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014
>>>>>>
>>>>>> There is also a linked video as well as an amazon image of the class
>>>>>> material where you can run the image in the cloud and follow along.
>>>>>>
>>>>>> Thanks,
>>>>>> Carson
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Aug 28, 2017, at 11:43 AM, Emmanuel Nnadi <eennadi at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hi Carson,
>>>>>> Thanks a lot
>>>>>>
>>>>>> I ran the command maker -h and it returned the following.
>>>>>>
>>>>>> The last thing I wish to ask you: how can I load my genome file and
>>>>>> begin annotation?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> emmannamekasMBP:maker emmannaemeka$ maker -h
>>>>>>
>>>>>> MAKER version 2.31.9
>>>>>>
>>>>>> Usage:
>>>>>>
>>>>>>      maker [options] <maker_opts> <maker_bopts> <maker_exe>
>>>>>>
>>>>>>
>>>>>> Description:
>>>>>>
>>>>>>      MAKER is a program that produces gene annotations in GFF3 format using
>>>>>>      evidence such as EST alignments and protein homology. MAKER can be used to
>>>>>>      produce gene annotations for new genomes as well as update annotations
>>>>>>      from existing genome databases.
>>>>>>
>>>>>>      The three input arguments are control files that specify how MAKER should
>>>>>>      behave. All options for MAKER should be set in the control files, but a
>>>>>>      few can also be set on the command line. Command line options provide a
>>>>>>      convenient machanism to override commonly altered control file values.
>>>>>>      MAKER will automatically search for the control files in the current
>>>>>>      working directory if they are not specified on the command line.
>>>>>>
>>>>>>      Input files listed in the control options files must be in fasta format
>>>>>>      unless otherwise specified. Please see MAKER documentation to learn more
>>>>>>      about control file  configuration.  MAKER will automatically try and
>>>>>>      locate the user control files in the current working directory if these
>>>>>>      arguments are not supplied when initializing MAKER.
>>>>>>
>>>>>>      It is important to note that MAKER does not try and recalculated data that
>>>>>>      it has already calculated.  For example, if you run an analysis twice on
>>>>>>      the same dataset you will notice that MAKER does not rerun any of the
>>>>>>      BLAST analyses, but instead uses the blast analyses stored from the
>>>>>>      previous run. To force MAKER to rerun all analyses, use the -f flag.
>>>>>>
>>>>>>      MAKER also supports parallelization via MPI on computer clusters. Just
>>>>>>      launch MAKER via mpiexec (i.e. mpiexec -n 40 maker). MPI support must be
>>>>>>      configured during the MAKER installation process for this to work though
>>>>>>
>>>>>>
>>>>>> Options:
>>>>>>
>>>>>>      -genome|g <file>    Overrides the genome file path in the control files
>>>>>>
>>>>>>      -RM_off|R           Turns all repeat masking options off.
>>>>>>
>>>>>>      -datastore/         Forcably turn on/off MAKER's two deep directory
>>>>>>       nodatastore        structure for output.  Always on by default.
>>>>>>
>>>>>>      -old_struct         Use the old directory styles (MAKER 2.26 and lower)
>>>>>>
>>>>>>      -base    <string>   Set the base name MAKER uses to save output files.
>>>>>>                          MAKER uses the input genome file name by default.
>>>>>>
>>>>>>      -tries|t <integer>  Run contigs up to the specified number of tries.
>>>>>>
>>>>>>      -cpus|c  <integer>  Tells how many cpus to use for BLAST analysis.
>>>>>>                          Note: this is for BLAST and not for MPI!
>>>>>>
>>>>>>      -force|f            Forces MAKER to delete old files before running again.
>>>>>>                          This will require all blast analyses to be rerun.
>>>>>>
>>>>>>      -again|a            recaculate all annotations and output files even if no
>>>>>>                          settings have changed. Does not delete old analyses.
>>>>>>
>>>>>>      -quiet|q            Regular quiet. Only a handlful of status messages.
>>>>>>
>>>>>>      -qq                 Even more quiet. There are no status messages.
>>>>>>
>>>>>>      -dsindex            Quickly generate datastore index file. Note that this
>>>>>>                          will not check if run settings have changed on contigs
>>>>>>
>>>>>>      -nolock             Turn off file locks. May be usful on some file systems,
>>>>>>                          but can cause race conditions if running in parallel.
>>>>>>
>>>>>>      -TMP                Specify temporary directory to use.
>>>>>>
>>>>>>      -CTL                Generate empty control files in the current directory.
>>>>>>
>>>>>>      -OPTS               Generates just the maker_opts.ctl file.
>>>>>>
>>>>>>      -BOPTS              Generates just the maker_bopts.ctl file.
>>>>>>
>>>>>>      -EXE                Generates just the maker_exe.ctl file.
>>>>>>
>>>>>>      -MWAS    <option>   Easy way to control mwas_server for web-based GUI
>>>>>>
>>>>>>                               options:  STOP
>>>>>>                                         START
>>>>>>                                         RESTART
>>>>>>
>>>>>>      -version            Prints the MAKER version.
>>>>>>
>>>>>>      -help|?             Prints this usage statement.
>>>>>>
>>>>>>
>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>> Department of Microbiology,
>>>>>> Faculty of Natural and Applied Science,
>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>> Publications:  https://www.researchgate.net/
>>>>>> profile/Emmanuel_Nnadi/publications
>>>>>>
>>>>>> On Mon, Aug 28, 2017 at 6:36 PM, Carson Holt <carsonhh at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Path needs to be a list of directories to search (you specified an
>>>>>>> executable location).
>>>>>>>
>>>>>>> So not this --> /Users/emmannaemeka/Desktop/Gpm/maker/bin/maker
>>>>>>>
>>>>>>> Instead it needs to be this --> /Users/emmannaemeka/Desktop/Gpm/maker/bin
>>>>>>>
>>>>>>> --Carson
>>>>>>>
>>>>>>>
>>>>>>> On Aug 28, 2017, at 11:32 AM, Emmanuel Nnadi <eennadi at gmail.com>
>>>>>>>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> I tried to export PATH
>>>>>>>
>>>>>>> running
>>>>>>> echo $PATH in the maker directory this returned
>>>>>>>
>>>>>>> /usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:/Users/emmannaemeka/Desktop/Gpm/maker/bin/maker
>>>>>>>
>>>>>>> 1. Does it mean that PATH has been exported?
>>>>>>>
>>>>>>>
>>>>>>> secondly,
>>>>>>>
>>>>>>> I tried to run
>>>>>>> the command maker -h, which maker, maker -CTL
>>>>>>>
>>>>>>> nothing returned.
>>>>>>>
>>>>>>> 2. how do i start up maker?
>>>>>>> 3. Do I need to be in maker directory to start maker?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>>> Department of Microbiology,
>>>>>>> Faculty of Natural and Applied Science,
>>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>>> Publications:  https://www.researchgate.net/
>>>>>>> profile/Emmanuel_Nnadi/publications
>>>>>>>
>>>>>>> On Mon, Aug 28, 2017 at 4:49 PM, Carson Holt <carsonhh at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> After install the executables will be in the .../maker/bin directory.
>>>>>>>> Example (if you did the install in your home directory) --> ~/maker/bin/maker
>>>>>>>>
>>>>>>>> You need to add the .../maker/bin directory to your PATH for it to be
>>>>>>>> found just by typing 'maker'
>>>>>>>>
>>>>>>>> Explanation of the Linux PATH --> http://www.linfo.org/path_env_var.html
>>>>>>>>
>>>>>>>> --Carson
>>>>>>>>
>>>>>>>>
>>>>>>>> On Aug 28, 2017, at 8:07 AM, Ence,daniel <d.ence at ufl.edu> wrote:
>>>>>>>>
>>>>>>>> Sorry I should have typed "maker -CTL". If that doesn't work, what
>>>>>>>> is the result of "which maker"?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Aug 28, 2017, at 10:00 AM, Emmanuel Nnadi <eennadi at gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Daniel
>>>>>>>> The reply is
>>>>>>>> emmannamekasMBP:maker emmannaemeka$ MAKER -ctl
>>>>>>>> -bash: MAKER: command not found
>>>>>>>>
>>>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>>>> Department of Microbiology,
>>>>>>>> Faculty of Natural and Applied Science,
>>>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>>>> Publications:  https://www.researchgate.net/
>>>>>>>> profile/Emmanuel_Nnadi/publications
>>>>>>>>
>>>>>>>> On Mon, Aug 28, 2017 at 2:57 PM, Ence,daniel <d.ence at ufl.edu>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi, It looks like MAKER installed ok. What is the command that you
>>>>>>>>> used to try to run MAKER? Can you show the result of running "MAKER -ctl"?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Daniel Ence
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Aug 28, 2017, at 9:24 AM, Emmanuel Nnadi <eennadi at gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Ence,
>>>>>>>>> Thanks for your reply,
>>>>>>>>>
>>>>>>>>> This is the step and error received
>>>>>>>>>
>>>>>>>>> emmannamekasMBP:src emmannaemeka$ ./build install
>>>>>>>>> Installing MAKER...
>>>>>>>>> Building MAKER
>>>>>>>>> Skip /Users/emmannaemeka/desktop/Gpm/maker/src/../perl/config-darwin-thread-multi-2level-5.018002 (unchanged)
>>>>>>>>>
>>>>>>>>> The build status is
>>>>>>>>> ==============================================================================
>>>>>>>>> STATUS MAKER v2.31.9
>>>>>>>>> ==============================================================================
>>>>>>>>> PERL Dependencies:  VERIFIED
>>>>>>>>> External Programs:  VERIFIED
>>>>>>>>> External C Libraries:   VERIFIED
>>>>>>>>> MPI SUPPORT:        DISABLED
>>>>>>>>> MWAS Web Interface: DISABLED
>>>>>>>>> MAKER PACKAGE:      CONFIGURATION OK
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>>>>> Department of Microbiology,
>>>>>>>>> Faculty of Natural and Applied Science,
>>>>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>>>>> Publications:  https://www.researchgate.net/
>>>>>>>>> profile/Emmanuel_Nnadi/publications
>>>>>>>>>
>>>>>>>>> On Mon, Aug 28, 2017 at 2:00 PM, Ence,daniel <d.ence at ufl.edu>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Emmanuel, In order for anyone to help you, you need to post to
>>>>>>>>>> the mailing list the command and output (including errors) of the step that
>>>>>>>>>> didn't work.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Daniel Ence
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Aug 27, 2017, at 10:16 AM, Emmanuel Nnadi <eennadi at gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hello all,
>>>>>>>>>>
>>>>>>>>>> I downloaded Maker and tried to install it. I succeeded in
>>>>>>>>>> installing all prerequisites however running maker ./build install, it
>>>>>>>>>> showed that maker installed.
>>>>>>>>>>
>>>>>>>>>> However trying to run maker it wouldn't run.
>>>>>>>>>>
>>>>>>>>>> Please how do I install maker to run on local computer?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>>>>>> Department of Microbiology,
>>>>>>>>>> Faculty of Natural and Applied Science,
>>>>>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>>>>>> Publications:  https://www.researchgate.net/
>>>>>>>>>> profile/Emmanuel_Nnadi/publications
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> maker-devel mailing list
>>>>>>>>>> maker-devel at box290.bluehost.com
>>>>>>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>

From carsonhh at gmail.com  Fri Sep 22 14:06:06 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 22 Sep 2017 14:06:06 -0600
Subject: [maker-devel] Maker not installing
In-Reply-To: <CAOXM=CUxE-MSWsjZwMYeZ-aOu9Jc9bP6k1uzDLrEyfyqwR-B_Q@mail.gmail.com>
References: <CAOXM=CXp_YU=4kaNCFPVnoWKTzAP5we5XTFX6uFTWJSY_11fng@mail.gmail.com>
	<8576F599-08C0-455A-B0D7-8A1110C6F18F@mail.ufl.edu>
	<CAOXM=CWWENFHJUhYPawM5+F4571HYOyp3oWD1xju9H9LABq9VQ@mail.gmail.com>
	<113031DF-E733-4EEF-BDB7-405C15F0CA24@mail.ufl.edu>
	<CAOXM=CVrYeNV9i6TxYDn1B-dCA-OYh+VqsEmc-PsA2rP14UZ9A@mail.gmail.com>
	<546610AC-0B7B-4D02-BBF3-E847B95D7F0D@mail.ufl.edu>
	<C3F7BAD9-C134-4EC9-9792-F2C27A78FBFD@gmail.com>
	<CAOXM=CXeT-zmFubptPnWsAygnv8o+EQnCWD60iLf5eFFBkb6rg@mail.gmail.com>
	<8EAFE412-9EF7-4DB7-85A3-632BAC3372FD@gmail.com>
	<CAOXM=CVbJonkz=O5=N_NmB1L+524D6Uk9Fjh-ZBoJmu+9AsMFQ@mail.gmail.com>
	<C9D0377D-1C89-4910-A6BE-FA7DD3D8CDCB@gmail.com>
	<CAOXM=CX7yZTTbBALhjBzz0qtMxQGxBNVO0UEnkis3aDvVf=GLA@mail.gmail.com>
	<C06B6277-1CFA-4B8D-8E80-D08910EBD77C@gmail.com>
	<CAOXM=CXSBG+h1DYoOBrFHYr_JXZX6YCFyo415HRGHozMcmHicg@mail.gmail.com>
	<EC4577AD-A3C5-45CD-ADE7-5D19ED089833@gmail.com>
	<CAOXM=CVby_i3vfF_H46RTZWoqBLMCBGNqJoywxG_2KQJPc5Ttw@mail.gmail.com>
	<7440971C-8A18-4A07-91B6-AD16D17F0766@gmail.com>
	<CAOXM=CXUBcvbqZdT2nviYnFyC3X4Xbj3oREVt650n0c01HESyg@mail.gmail.com>
	<426E63C1-8C82-4809-B2A1-EF7A909E6712@gmail.com>
	<CAOXM=CUPj9EGj_LNEhBpOw0oCiQKoVw8JuPv8-WASo2paSXkug@mail.gmail.com>
	<CAOXM=CUxE-MSWsjZwMYeZ-aOu9Jc9bP6k1uzDLrEyfyqwR-B_Q@mail.gmail.com>
Message-ID: <78A8137E-42B4-4766-9E4A-1B3C2F4FC578@gmail.com>

MAKER can't give you those details. All MAKER does is try and identify gene models against the assembly you provide.
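
Since MAKER does not report these figures, they have to be read from the assembly FASTA itself (the file given as genome= in maker_opts.ctl). The following is only a minimal sketch, assuming a plain, uncompressed FASTA file; fasta_stats is a hypothetical helper name and is not part of MAKER. It counts assembled sequences (contigs or scaffolds), which is not the same as the number of chromosomes, and the total base count is the assembly size, which only approximates the true genome size.

```shell
# Hypothetical helper (not part of MAKER): summarise an assembly FASTA.
# Prints the number of sequences (header lines) and the total number of
# bases as two tab-separated lines.
fasta_stats() {
    awk '
        # Header line: count one more sequence and skip to the next line.
        /^>/ { nseq++; next }
        # Sequence line: strip whitespace and carriage returns, then add its length.
        { gsub(/[ \t\r]/, ""); bases += length($0) }
        END { printf "sequences\t%d\nbases\t%d\n", nseq, bases }
    ' "$1"
}
```

For example, running fasta_stats genome.fasta prints one line starting with "sequences" and one starting with "bases". Counting header lines rather than sequence lines keeps the result independent of how the sequence is wrapped in the file.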

--Carson

> On Sep 22, 2017, at 1:27 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
> 
> Hello all,
> Please how can I determine the following in maker:
> 1. The total number of chromosomes
> 2. The size of my genome
> 
> 
> Thanks
> 
> Nnadi Nnaemeka Emmanuel
> Department of Microbiology,
> Faculty of Natural and Applied Science,
> Plateau State University, Bokkos, Plateau State, Nigeria.
> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications <https://www.researchgate.net/profile/Emmanuel_Nnadi/publications>
> On Fri, Sep 1, 2017 at 10:52 PM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
> Ok, thanks. 
> Nnadi Nnaemeka Emmanuel
> Department of Microbiology,
> Faculty of Natural and Applied Science,
> Plateau State University, Bokkos, Plateau State, Nigeria.
> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications <https://www.researchgate.net/profile/Emmanuel_Nnadi/publications>
> 
>    
> 
> On Sep 1, 2017 10:50 PM, "Carson Holt" <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
> It would need to be a new run. You won't be able to use the updated contig names with the old run. 
> 
> --Carson
> 
> Sent from my iPhone
> 
> On Sep 1, 2017, at 3:41 PM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
> 
>> Hi carson
>> Thanks for the tip
>> perl -ane 's/1_S7_R1_001_\(paired\)_trimmed_\(paired\)_//g; print' genome.fasta
>> 
>> It worked well; when I ran it, it removed 1_S7_R1_001_\(paired\)_trimmed_\(paired\)_.
>> 
>> However, I have already run maker with 1_S7_R1_001_\(paired\)_trimmed_\(paired\)_ in the contig names.
>> 
>> 1. How can I apply the change when maker has already produced some files from the old sequence?
>> 
>> I have spent more than 24 hours running maker and it has produced some folders already.
>> 
>> How can I make this change?
>> 
>> Thanks
>> 
>> 
>> 
>> 
>> Nnadi Nnaemeka Emmanuel
>> Department of Microbiology,
>> Faculty of Natural and Applied Science,
>> Plateau State University, Bokkos, Plateau State, Nigeria.
>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications <https://www.researchgate.net/profile/Emmanuel_Nnadi/publications>
>> On Fri, Sep 1, 2017 at 4:54 PM, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>> BLAST, which is used by MAKER, cannot handle really long contig names. MAKER tries to get around this by adding a secondary tag to the fasta header when long names are detected. Even then it would be better to change the IDs of your contigs to avoid downstream failures.
>> 
>> I would recommend removing '1_S7_R1_001_(paired)_trimmed_(paired)_' from each contig name.
>> 
>> Example command to do that -->
>> perl -ane 's/1_S7_R1_001_\(paired\)_trimmed_\(paired\)_//g; print' genome.fasta
>> 
>> --Carson
>> 
>> 
>>> On Aug 30, 2017, at 3:54 PM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
>>> 
>>> Hi Carson
>>> Thanks for your response, it's been helpful.
>>> 
>>> Please bear with me as I work through this.
>>> 
>>> 1. How do I generate ESTs for my novel sequences?
>>> 2. I am currently running maker without EST and protein sequences. Is that wrong? Can it predict properly?
>>> 3. One contig just returned this error:
>>> FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 )
>>>  at /usr/local/bin/RepeatMasker line 1464.
>>> FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 )
>>>  at /usr/local/bin/RepeatMasker line 1464.
>>> FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 )
>>>  at /usr/local/bin/RepeatMasker line 1464.
>>> ERROR: RepeatMasker failed
>>> --> rank=NA, hostname=emmannaemekas-MacBook-Pro.local
>>> ERROR: Failed while doing repeat masking
>>> ERROR: Chunk failed at level:0, tier_type:1
>>> FAILED CONTIG:1_S7_R1_001_(paired)_trimmed_(paired)_contig_2
>>> 
>>> ERROR: Chunk failed at level:2, tier_type:0
>>> FAILED CONTIG:1_S7_R1_001_(paired)_trimmed_(paired)_contig_2
>>> 
>>> examining contents of the fasta file and run log
>>> 
>>> 
>>> Nnadi Nnaemeka Emmanuel
>>> Department of Microbiology,
>>> Faculty of Natural and Applied Science,
>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications <https://www.researchgate.net/profile/Emmanuel_Nnadi/publications>
>>> On Wed, Aug 30, 2017 at 4:12 PM, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>>> You can query valid species names using the queryTaxonomyDatabase.pl script that comes with RepeatMasker. Try not to be too specific. In general you should use the genus rather than the species for example (or even use all of RepBase).
>>> 
>>> Example -->
>>> perl .../RepeatMasker/util/queryTaxonomyDatabase.pl -species "drosophila"
>>> 
>>> --Carson
>>> 
>>> 
>>> 
>>>> On Aug 30, 2017, at 9:05 AM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
>>>> 
>>>> Hi Carson,
>>>> 
>>>>  Thanks
>>>> I was able to start using maker.
>>>> 
>>>> However, I am working with a novel plant genome. I had set the repeat masking to
>>>> 1. Dcotrep, a name from the RepBase release, but maker reported it as not known to RepeatMasker.
>>>> 
>>>> How can I use specific known genomes for repeat masking?
>>>> Thanks
>>>> 
>>>> Nnadi Nnaemeka Emmanuel
>>>> Department of Microbiology,
>>>> Faculty of Natural and Applied Science,
>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications <https://www.researchgate.net/profile/Emmanuel_Nnadi/publications>
>>>> 
>>>>    
>>>> 
>>>> On Aug 29, 2017 4:26 PM, "Carson Holt" <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>>>> MAKER will read the genome= options from the maker_opts.ctl file in your current directory or the maker_opts.ctl you specified on the command line. The error means you have left the value empty. Perhaps you did not save the changes you made or you did not specify the location of the maker_opts.ctl file to use.
>>>> 
>>>> You can check the contents of the file using cat. Example --> cat maker_opts.ctl
>>>> 
>>>> --Carson
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On Aug 29, 2017, at 5:11 AM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
>>>>> 
>>>>> Hi Carson,
>>>>> Thanks a lot for yesterday. I was able to resolve the issue of running maker and i followed the commands in the tutorial.
>>>>> I however encountered another problem
>>>>> 
>>>>> when I ran the command nano -c maker_opts.ctl
>>>>> 
>>>>> It showed the following. I specified 1_S7_assembly.fa as the name of the genome, but when I ran maker in another tab it gave an error.
>>>>> 
>>>>> #-----Genome (these are always required)
>>>>> genome=1_S7_assembly.fa #genome sequence (fasta file or fasta embeded in GFF3 file)
>>>>> organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic
>>>>> 
>>>>> #-----Re-annotation Using MAKER Derived GFF3
>>>>> maker_gff= #MAKER derived GFF3 file
>>>>> est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
>>>>> altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
>>>>> protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
>>>>> rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
>>>>> model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
>>>>> pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
>>>>> other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no
>>>>> 
>>>>> #-----EST Evidence (for best results provide a file for at least one)
>>>>> est= #set of ESTs or assembled mRNA-seq in fasta format
>>>>> altest= #EST/cDNA sequence file in fasta format from an alternate organism
>>>>> est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
>>>>> altest_gff= #aligned ESTs from a closly relate species in GFF3 format
>>>>> 
>>>>> #-----Protein Homology Evidence (for best results provide a file for at least one)
>>>>> protein=  #protein sequence file in fasta format (i.e. from mutiple oransisms)
>>>>> protein_gff=  #aligned protein homology evidence from an external GFF3 file
>>>>> 
>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>> model_org=all #select a model organism for RepBase masking in RepeatMasker
>>>>> rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
>>>>> repeat_protein=/Users/emmannaemeka/Desktop/Gpm/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner
>>>>> rm_gff= #pre-identified repeat elements from an external GFF3 file
>>>>> prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
>>>>> softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
>>>>> 
>>>>> 
>>>>> I ran maker command on another tab and it returned the following
>>>>> STATUS: Parsing control files...
>>>>> ERROR: You have failed to provide a value for 'genome' in the control files.
>>>>> 
>>>>> --> rank=NA, hostname=emmannamekasMBP
>>>>> 
>>>>> 
>>>>> Questions
>>>>> 1. After specifying the genome location, do I need to run maker in the same tab or open another bash tab?
>>>>> 2. My genome is novel and I do not have proteins for it. How do I generate protein fasta and EST files for the de novo sequence?
>>>>> 
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> Nnadi Nnaemeka Emmanuel
>>>>> Department of Microbiology,
>>>>> Faculty of Natural and Applied Science,
>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications <https://www.researchgate.net/profile/Emmanuel_Nnadi/publications>
>>>>> On Mon, Aug 28, 2017 at 6:47 PM, Carson Holt <carsonhh at gmail.com> wrote:
>>>>> Here is a class on how to use MAKER taught a couple of years back -> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014
>>>>> 
>>>>> There is also a linked video as well as an Amazon image of the class material where you can run the image in the cloud and follow along.
>>>>> 
>>>>> Thanks,
>>>>> Carson
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Aug 28, 2017, at 11:43 AM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
>>>>>> 
>>>>>> Hi Carson,
>>>>>> Thanks a lot.
>>>>>> 
>>>>>> I ran the command maker -h and it returned the output below.
>>>>>> 
>>>>>> The last thing I wish to ask you: how can I load my genome file and begin annotation?
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>> emmannamekasMBP:maker emmannaemeka$ maker -h
>>>>>> 
>>>>>> MAKER version 2.31.9
>>>>>> 
>>>>>> Usage:
>>>>>> 
>>>>>>      maker [options] <maker_opts> <maker_bopts> <maker_exe>
>>>>>> 
>>>>>> 
>>>>>> Description:
>>>>>> 
>>>>>>      MAKER is a program that produces gene annotations in GFF3 format using
>>>>>>      evidence such as EST alignments and protein homology. MAKER can be used to
>>>>>>      produce gene annotations for new genomes as well as update annotations
>>>>>>      from existing genome databases.
>>>>>> 
>>>>>>      The three input arguments are control files that specify how MAKER should
>>>>>>      behave. All options for MAKER should be set in the control files, but a
>>>>>>      few can also be set on the command line. Command line options provide a
>>>>>>      convenient machanism to override commonly altered control file values.
>>>>>>      MAKER will automatically search for the control files in the current
>>>>>>      working directory if they are not specified on the command line.
>>>>>> 
>>>>>>      Input files listed in the control options files must be in fasta format
>>>>>>      unless otherwise specified. Please see MAKER documentation to learn more
>>>>>>      about control file  configuration.  MAKER will automatically try and
>>>>>>      locate the user control files in the current working directory if these
>>>>>>      arguments are not supplied when initializing MAKER.
>>>>>> 
>>>>>>      It is important to note that MAKER does not try and recalculated data that
>>>>>>      it has already calculated.  For example, if you run an analysis twice on
>>>>>>      the same dataset you will notice that MAKER does not rerun any of the
>>>>>>      BLAST analyses, but instead uses the blast analyses stored from the
>>>>>>      previous run. To force MAKER to rerun all analyses, use the -f flag.
>>>>>> 
>>>>>>      MAKER also supports parallelization via MPI on computer clusters. Just
>>>>>>      launch MAKER via mpiexec (i.e. mpiexec -n 40 maker). MPI support must be
>>>>>>      configured during the MAKER installation process for this to work though
>>>>>>      
>>>>>> 
>>>>>> Options:
>>>>>> 
>>>>>>      -genome|g <file>    Overrides the genome file path in the control files
>>>>>> 
>>>>>>      -RM_off|R           Turns all repeat masking options off.
>>>>>> 
>>>>>>      -datastore/         Forcably turn on/off MAKER's two deep directory
>>>>>>       nodatastore        structure for output.  Always on by default.
>>>>>> 
>>>>>>      -old_struct         Use the old directory styles (MAKER 2.26 and lower)
>>>>>> 
>>>>>>      -base    <string>   Set the base name MAKER uses to save output files.
>>>>>>                          MAKER uses the input genome file name by default.
>>>>>> 
>>>>>>      -tries|t <integer>  Run contigs up to the specified number of tries.
>>>>>> 
>>>>>>      -cpus|c  <integer>  Tells how many cpus to use for BLAST analysis.
>>>>>>                          Note: this is for BLAST and not for MPI!
>>>>>> 
>>>>>>      -force|f            Forces MAKER to delete old files before running again.
>>>>>> 			 This will require all blast analyses to be rerun.
>>>>>> 
>>>>>>      -again|a            recaculate all annotations and output files even if no
>>>>>> 			 settings have changed. Does not delete old analyses.
>>>>>> 
>>>>>>      -quiet|q            Regular quiet. Only a handlful of status messages.
>>>>>> 
>>>>>>      -qq                 Even more quiet. There are no status messages.
>>>>>> 
>>>>>>      -dsindex            Quickly generate datastore index file. Note that this
>>>>>>                          will not check if run settings have changed on contigs
>>>>>> 
>>>>>>      -nolock             Turn off file locks. May be usful on some file systems,
>>>>>>                          but can cause race conditions if running in parallel.
>>>>>> 
>>>>>>      -TMP                Specify temporary directory to use.
>>>>>> 
>>>>>>      -CTL                Generate empty control files in the current directory.
>>>>>> 
>>>>>>      -OPTS               Generates just the maker_opts.ctl file.
>>>>>> 
>>>>>>      -BOPTS              Generates just the maker_bopts.ctl file.
>>>>>> 
>>>>>>      -EXE                Generates just the maker_exe.ctl file.
>>>>>> 
>>>>>>      -MWAS    <option>   Easy way to control mwas_server for web-based GUI
>>>>>> 
>>>>>>                               options:  STOP
>>>>>>                                         START
>>>>>>                                         RESTART
>>>>>> 
>>>>>>      -version            Prints the MAKER version.
>>>>>> 
>>>>>>      -help|?             Prints this usage statement.
>>>>>> 
>>>>>> 
>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>> Department of Microbiology,
>>>>>> Faculty of Natural and Applied Science,
>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications <https://www.researchgate.net/profile/Emmanuel_Nnadi/publications>
>>>>>> On Mon, Aug 28, 2017 at 6:36 PM, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>>>>>> Path needs to be a list of directories to search (you specified an executable location).
>>>>>> 
>>>>>> So not this -> /Users/emmannaemeka/Desktop/Gpm/maker/bin/maker
>>>>>> 
>>>>>> Instead it needs to be this -> /Users/emmannaemeka/Desktop/Gpm/maker/bin
>>>>>> 
>>>>>> --Carson
>>>>>> 
>>>>>> 
>>>>>>> On Aug 28, 2017, at 11:32 AM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>>
>>>>>>> wrote:
>>>>>>> 
>>>>>>> Thanks 
>>>>>>> 
>>>>>>> I tried to export PATH
>>>>>>> 
>>>>>>> running 
>>>>>>> echo $PATH in the maker directory this returned
>>>>>>> 
>>>>>>> /usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:/Users/emmannaemeka/Desktop/Gpm/maker/bin/maker
>>>>>>> 
>>>>>>> 1. Does it mean that PATH has been exported?
>>>>>>> 
>>>>>>> 
>>>>>>> secondly,
>>>>>>> 
>>>>>>> I tried to run the commands maker -h, which maker, and maker -CTL.
>>>>>>> 
>>>>>>> Nothing was returned.
>>>>>>> 
>>>>>>> 2. How do I start up maker?
>>>>>>> 3. Do I need to be in the maker directory to start maker?
>>>>>>> 
>>>>>>> Thanks
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>>> Department of Microbiology,
>>>>>>> Faculty of Natural and Applied Science,
>>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications <https://www.researchgate.net/profile/Emmanuel_Nnadi/publications>
>>>>>>> On Mon, Aug 28, 2017 at 4:49 PM, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>>>>>>> After installation, the executables will be in the .../maker/bin directory. Example (if you did the install in your home directory) -> ~/maker/bin/maker
>>>>>>> 
>>>>>>> You need to add the .../maker/bin directory to your PATH for it to be found just by typing 'maker'.
>>>>>>> 
>>>>>>> Explanation of the Linux PATH -> http://www.linfo.org/path_env_var.html
>>>>>>> 
>>>>>>> --Carson
>>>>>>> 
>>>>>>> 
>>>>>>>> On Aug 28, 2017, at 8:07 AM, Ence,daniel <d.ence at ufl.edu <mailto:d.ence at ufl.edu>> wrote:
>>>>>>>> 
>>>>>>>> Sorry, I should have typed "maker -CTL". If that doesn't work, what is the result of "which maker"?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Aug 28, 2017, at 10:00 AM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Daniel
>>>>>>>>> The reply is 
>>>>>>>>> emmannamekasMBP:maker emmannaemeka$ MAKER -ctl
>>>>>>>>> -bash: MAKER: command not found
>>>>>>>>> 
>>>>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>>>>> Department of Microbiology,
>>>>>>>>> Faculty of Natural and Applied Science,
>>>>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>>>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications <https://www.researchgate.net/profile/Emmanuel_Nnadi/publications>
>>>>>>>>> On Mon, Aug 28, 2017 at 2:57 PM, Ence,daniel <d.ence at ufl.edu <mailto:d.ence at ufl.edu>> wrote:
>>>>>>>>> Hi, it looks like MAKER installed OK. What is the command that you used to try to run MAKER? Can you show the result of running "MAKER -ctl"?
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Daniel Ence
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Aug 28, 2017, at 9:24 AM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi Ence,
>>>>>>>>>> Thanks for your reply,
>>>>>>>>>> 
>>>>>>>>>> This is the step and error received
>>>>>>>>>> emmannamekasMBP:src emmannaemeka$ ./build install
>>>>>>>>>> Installing MAKER...
>>>>>>>>>> Building MAKER
>>>>>>>>>> Skip /Users/emmannaemeka/desktop/Gpm/maker/src/../perl/config-darwin-thread-multi-2level-5.018002 (unchanged)
>>>>>>>>>> 
>>>>>>>>>> The build status is
>>>>>>>>>> 
>>>>>>>>>> =============================================================================
>>>>>>>>>> STATUS MAKER v2.31.9
>>>>>>>>>> ==============================================================================
>>>>>>>>>> PERL Dependencies:  VERIFIED
>>>>>>>>>> External Programs:  VERIFIED
>>>>>>>>>> External C Libraries:   VERIFIED
>>>>>>>>>> MPI SUPPORT:        DISABLED
>>>>>>>>>> MWAS Web Interface: DISABLED
>>>>>>>>>> MAKER PACKAGE:      CONFIGURATION OK
>>>>>>>>>> 
>>>>>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>>>>>> Department of Microbiology,
>>>>>>>>>> Faculty of Natural and Applied Science,
>>>>>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>>>>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications <https://www.researchgate.net/profile/Emmanuel_Nnadi/publications>
>>>>>>>>>> On Mon, Aug 28, 2017 at 2:00 PM, Ence,daniel <d.ence at ufl.edu <mailto:d.ence at ufl.edu>> wrote:
>>>>>>>>>> Hi Emmanuel, in order for anyone to help you, you need to post to the mailing list the command and output (including errors) of the step that didn't work.
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Daniel Ence
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Aug 27, 2017, at 10:16 AM, Emmanuel Nnadi <eennadi at gmail.com <mailto:eennadi at gmail.com>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hello all,
>>>>>>>>>>> 
>>>>>>>>>>> I downloaded MAKER and tried to install it. I succeeded in installing all prerequisites, and running ./build install showed that MAKER installed.
>>>>>>>>>>> 
>>>>>>>>>>> However, when I try to run maker it won't run.
>>>>>>>>>>> 
>>>>>>>>>>> How do I install MAKER so that it runs on my local computer?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks
>>>>>>>>>>> 
>>>>>>>>>>> Nnadi Nnaemeka Emmanuel
>>>>>>>>>>> Department of Microbiology,
>>>>>>>>>>> Faculty of Natural and Applied Science,
>>>>>>>>>>> Plateau State University, Bokkos, Plateau State, Nigeria.
>>>>>>>>>>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications <https://www.researchgate.net/profile/Emmanuel_Nnadi/publications>
>>>>>>>>>>> 
>>>>>>>>>>>    
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> maker-devel mailing list
>>>>>>>>>>> maker-devel at box290.bluehost.com <mailto:maker-devel at box290.bluehost.com>
>>>>>>>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org <http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org>

From carsonhh at gmail.com  Fri Sep 22 14:08:36 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 22 Sep 2017 14:08:36 -0600
Subject: [maker-devel] Advice on my pipeline
In-Reply-To: <1505986013492.52354@unil.ch>
References: <6b029690bace4d3fbae77c0bb1bddce8@prdexch02.ad.unil.ch>
	<A39F4213-70A8-4E07-AB13-00C427E4F244@gmail.com>
	<1498470630221.84642@unil.ch>
	<696C51C6-5606-4ECB-A8B8-9C077182FFFA@gmail.com>
	<1498908228256.16549@unil.ch>
	<58E904BF-9AB8-4AC7-B10B-C902F414E03D@gmail.com>
	<1505986013492.52354@unil.ch>
Message-ID: <651D4267-0FD7-4A92-B778-8976B47353BB@gmail.com>

The GFF3 passthrough options are there to help users get old data into MAKER when they have lost access to the original files. For iterative runs of the pipeline, however, it is more effective to rerun in place so that MAKER can access the raw alignment reports. Those reports contain more detail than what is stored in the GFF3, and that detail is lost when the GFF3 is used as input.
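
As a rough illustration of what rerunning in place could look like in maker_opts.ctl (a sketch under assumptions, not a tested configuration): the inputs stay the same as in the first round and the GFF3 passthrough section is left at its defaults. The option names are the ones in the control file generated by maker -CTL; the file names genome.fa, transcripts.fa and proteins.fa are hypothetical placeholders.

```
#-----Genome and evidence inputs (unchanged from the first round)
genome=genome.fa #hypothetical file name
est=transcripts.fa #hypothetical file name
protein=proteins.fa #hypothetical file name

#-----Re-annotation Using MAKER Derived GFF3 (left at defaults)
maker_gff= #left blank, nothing is passed through from a GFF3 file
est_pass=0
altest_pass=0
protein_pass=0
rm_pass=0
model_pass=0
pred_pass=0
other_pass=0
```

MAKER would then be launched again from the same working directory as the first round, so that the results archived there can be reused instead of being read back from GFF3.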

--Carson


> On Sep 21, 2017, at 3:26 AM, Patrick Tran Van <Patrick.TranVan at unil.ch> wrote:
> 
> Hi Carson,
> 
> I have a question about round 2. In a previous reply you said:
> 
> "Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can't be run in the same directory)."
>  
> Does this mean that I don't need to modify the section:
> 
> #-----Re-annotation Using MAKER Derived GFF3
> 
> ?
> 
> If I leave everything at the defaults, such as:
> 
> altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
> protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
> rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no 
> 
> 
> will it not redo the repeat masking and the protein + transcriptome alignments?
> 
> Patrick Tran Van
> 
> Groups Chapuisat, Robinson-Rechavi & Schwander
> Department of Ecology and Evolution
> University of Lausanne
> Le Biophore
> CH-1015 Lausanne
> Switzerland
> Office 3206
> 
> From: Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>>
> Sent: Monday, July 3, 2017 10:50 PM
> To: Patrick Tran Van
> Cc: maker-devel at yandell-lab.org <mailto:maker-devel at yandell-lab.org>
> Subject: Re: [maker-devel] Advice on my pipeline
>  
> maker2zff is just for SNAP training and not for gene filtering (please do not use it for filtering; it does not do what you think).
> 
> So the final annotation set after maker with correct_est_fusion is 16,850. To decide which set is better, look at them in a browser (gene counts are not useful for gauging results). A well annotated genome will have evidence clusters that closely match the final models. A poorly annotated genome will have evidence clusters that are split or merged by the models.
> 
> The correct_est_fusion option does two things. It trims long overlapping UTR fragments, and it stops evidence clusters from being merged on BLASTP evidence alone (so gene predictors will get unmerged hint regions if clusters are split).
> 
> You may also find that using jaccard_clip with Trinity has reduced sensitivity for the transcript data (you may lose things that were there before, but now have better specificity, i.e. fewer false positives). Make sure you provide protein data from at least two related species to help maintain the sensitivity lost from the transcript data. You can also add rejected gene models back in after the fact by using iprscan to identify unsupported models with identifiable protein domains -> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4286374/
> 
> Thanks,
> Carson
> 
> 
> 
> 
>> On Jul 1, 2017, at 5:21 AM, Patrick Tran Van <Patrick.TranVan at unil.ch <mailto:Patrick.TranVan at unil.ch>> wrote:
>> 
>> So I have assembled my transcriptome with Trinity using the jaccard clip option, and I have run maker with and without correct_est_fusion.
>> 
>> I then used SNAP to train/filter it with:
>> 
>> maker2zff  specie.all.gff
>> 
>> Here are my results:
>> 
>> Number of genes after maker -> Number of genes after maker2zff
>> 
>> - Without correct_est_fusion: 21621 -> 13875
>> - With correct_est_fusion: 16850 -> 9098
>> 
>> 1) If I understand correctly how correct_est_fusion works, it prevents gene merging, so shouldn't the result be the opposite?
>> Normally I should find more genes with correct_est_fusion, right?
>> 
>> 2) I think I should find something like 13000-14000 genes for my species. Should I go with the "Without correct_est_fusion" set for the 2nd iteration of maker?
>> 
>>  Thanks for your help 
>> 
>> 
>> 
>> Patrick Tran Van
>> 
>> Groups Chapuisat, Robinson-Rechavi & Schwander
>> Department of Ecology and Evolution
>> University of Lausanne
>> Le Biophore
>> CH-1015 Lausanne
>> Switzerland
>> Office 3206
>> 
>> From: Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>>
>> Sent: Monday, June 26, 2017 11:38 PM
>> To: Patrick Tran Van
>> Cc: maker-devel at yandell-lab.org <mailto:maker-devel at yandell-lab.org>
>> Subject: Re: [maker-devel] Advice on my pipeline
>>  
>> Sorry, the option is -> correct_est_fusion
>> 
>> It is in the maker_opts.ctl file.
>> 
>> I would use both SNAP and Augustus on a few large contigs, then review the results manually. If one of them is not behaving well, then drop it. If both behave well (i.e. correlate well with evidence alignments) then keep them both.
>> 
>> --Carson
>> 
>> 
>> 
>>> On Jun 26, 2017, at 3:48 AM, Patrick Tran Van <Patrick.TranVan at unil.ch <mailto:Patrick.TranVan at unil.ch>> wrote:
>>> 
>>> Thanks for your answer.
>>> 
>>> 1) Do you think that adding Augustus training in addition to SNAP at steps 3 and 5 will add more confidence (instead of adding Augustus only for the final round)?
>>> I am using autoAug for this, and it takes a while to compute.
>>> 
>>> 2) I don't see the option 'avoid_est_fusion=1'. I tried to add it but I got this error:
>>> 
>>> WARNING: Invalid option 'avoid_est_fusion' in control file maker_opts.ctl
>>> 
>>> (I am using v 2.31.8 )
>>> 
>>> 
>>> Patrick Tran Van
>>> 
>>> Groups Chapuisat, Robinson-Rechavi & Schwander
>>> Department of Ecology and Evolution
>>> University of Lausanne
>>> Le Biophore
>>> CH-1015 Lausanne
>>> Switzerland
>>> Office 3206
>>> 
>>> From: Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>>
>>> Sent: Monday, June 5, 2017 8:29 PM
>>> To: Patrick Tran Van
>>> Cc: maker-devel at yandell-lab.org <mailto:maker-devel at yandell-lab.org>
>>> Subject: Re: [maker-devel] Advice on my pipeline
>>>  
>>> Your plan sounds good. A couple of related notes.
>>> 
>>> Insect genomes tend to have high gene density, so gene merging will be the primary difficulty. You can avoid merging of mRNA-seq evidence by using options like jaccard_clip in Trinity. Then use avoid_est_fusion=1 inside of MAKER.
>>> 
>>> Also it is more convenient to do each run in the same directory rather than supplying the previous run as GFF3 input. MAKER will automatically recycle previous results archived in the run directory when you do this. Using the maker_gff option is really more for getting data into the run from jobs performed a long time ago (so they can't be run in the same directory).
>>> 
>>> --Carson
>>> 
>>> 
>>>> On Jun 2, 2017, at 3:56 AM, Patrick Tran Van <Patrick.TranVan at unil.ch <mailto:Patrick.TranVan at unil.ch>> wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> This is my first time running Maker for an insect genome annotation. 
>>>> 
>>>> I have found various resources and tried to build a consensus from them. I would like your thoughts and advice on my pipeline: whether I can improve something or am doing anything useless.
>>>> 
>>>> 
>>>> What I have:
>>>> - RNA evidence: transcriptome
>>>> - Protein evidence: swissprot/uniprot + insect busco protein set
>>>> - Cegma and busco results of my genome
>>>> 
>>>> 
>>>> 1) Train SNAP with CEGMA.
>>>> 
>>>> 2) Run maker (run A) with repeat masking, using the transcripts, the proteins, the new SNAP file (from step 1) and the augustus file (from busco).
>>>> 
>>>> 3) Create a SNAP model from run A.
>>>> 
>>>> 4) Run maker (run B) with the new SNAP model (from step 3), with est2genome=0 and protein2genome=0, providing the gff file (maker_gff=run_A.gff), turning off repeat masking (rm_pass=1), and using the previous mapping results (altest_pass=1 and protein_pass=1).
>>>> 
>>>> 5) Create a SNAP model from run B.
>>>> 
>>>> 6) Run maker (run C) with the new SNAP model (from step 5), with est2genome=0 and protein2genome=0, providing the gff file (maker_gff=run_B.gff), turning off repeat masking (rm_pass=1), and using the previous mapping results (altest_pass=1 and protein_pass=1).
>>>> 
>>>> 7) Create a SNAP model from run C AND create an Augustus gene model from run C.
>>>> 
>>>> 8) Run maker (run D) with the new SNAP model (from step 7) + the AUGUSTUS file (from step 7), with est2genome=0 and protein2genome=0, providing the gff file (maker_gff=run_C.gff), turning off repeat masking (rm_pass=1), and using the previous mapping results (altest_pass=1 and protein_pass=1). Also use keep_preds=1.
>>>> 
>>>> 
>>>> 
>>>> Does it seem coherent?
>>>> 
>>>> Cheers,
>>>> 
>>>> Patrick Tran Van
>>>> 
>>>> Groups Chapuisat, Robinson-Rechavi & Schwander
>>>> Department of Ecology and Evolution
>>>> University of Lausanne
>>>> Le Biophore
>>>> CH-1015 Lausanne
>>>> Switzerland
>>>> Office 3206
>>>> 

From carson.holt at genetics.utah.edu  Fri Sep 22 14:19:22 2017
From: carson.holt at genetics.utah.edu (Carson Hinton Holt)
Date: Fri, 22 Sep 2017 20:19:22 +0000
Subject: [maker-devel] MAKER
In-Reply-To: <1f033c3dcf414407a888bbef8201d469@JKI-EXPF03.braunschweig.bba.intern>
References: <80e44920c4d64d6e82d18e5535656a32@JKI-EXPF03.braunschweig.bba.intern>
	<D2EB4BE8.3069C%myandell@genetics.utah.edu>
	<3BE2DF2C-9443-4314-9524-F0109AAE42E8@genetics.utah.edu>
	<ddad51a1aab5452b816be741d2969d81@JKI-EXPF03.braunschweig.bba.intern>
	<9867D99C-E4AD-4673-BB0B-804A1551F9F3@genetics.utah.edu>
	<f9b68e0588dd4ce6a65dcd63cb4f7b4e@JKI-EXPF03.braunschweig.bba.intern>
	<B11ADE34-41A4-4EA7-A0AC-D4DFD649991D@genetics.utah.edu>
	<1f033c3dcf414407a888bbef8201d469@JKI-EXPF03.braunschweig.bba.intern>
Message-ID: <ADB216BF-2828-4906-A32F-58CC3989102F@genetics.utah.edu>

All est2genome and protein2genome do is take exonerate alignments of the fasta inputs and translate the longest ORF to get a rough base model that can be used to train a gene predictor. That is why the documentation says that they should be turned off once the predictor is trained.

Once you get the gene predictor trained, MAKER will feed hints to the gene predictor derived from alignments and input GFF3. These hints greatly improve the performance of the gene predictors. MAKER will also use the alignments to filter out predictions that do not match the evidence alignments.
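
To make the two stages concrete, here is a hedged sketch of how the gene prediction section of maker_opts.ctl might differ between a training round and a later round. The two blocks are alternatives for separate runs, not one file. The option names are taken from the standard control file; trained_snap.hmm is a hypothetical file name for whatever HMM the training step produces.

```
#-----Gene Prediction, training round (no trained predictor yet)
snaphmm= #blank, no SNAP HMM available yet
augustus_species= #blank, no Augustus species model available yet
est2genome=1 #rough base models taken directly from transcript alignments
protein2genome=1 #rough base models taken directly from protein alignments

#-----Gene Prediction, later rounds (after training)
snaphmm=trained_snap.hmm #hypothetical file name
est2genome=0 #turned off once the predictor is trained
protein2genome=0 #turned off once the predictor is trained
```

In the later rounds the alignments are no longer turned into models directly; they serve as hints to the predictor and as evidence for filtering its predictions.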

--Carson


> On Sep 22, 2017, at 2:15 PM, Keilwagen, Jens <jens.keilwagen at julius-kuehn.de> wrote:
> 
> Hi Carson,
> 
> Thanks a lot for the information.
> 
> Just to be sure that I understand you right: It is impossible to obtain MAKER results based on RNA-seq and homology that differ from purely homology-based MAKER results?
> 
> Could you confirm that?
> 
> Thanks a lot and best regards, Jens
> 
>> -----Original Message-----
>> From: Carson Hinton Holt [mailto:carson.holt at genetics.utah.edu]
>> Sent: Friday, September 22, 2017 22:04
>> To: Keilwagen, Jens
>> Cc: Maker Mailing List
>> Subject: Re: MAKER
>> 
>> MAKER won't produce est2genome results for est_gff. This is partially
>> because est2genome results are only used for training gene predictors.
>> So you are essentially just getting protein2genome results from your
>> runs. Once you get a gene predictor trained you will see a difference,
>> as it will use the intron/exon structure of alignments as hints to
>> improve gene predictor performance.
>> 
>> --Carson
>> 
>> 
>>> On Sep 21, 2017, at 1:57 AM, Keilwagen, Jens <jens.keilwagen at julius-kuehn.de> wrote:
>>> 
>>> Hi Carson,
>>> 
>>> I have tried the proposed options for a small example (yeast).
>>> 
>>> I had
>>> - proteins (fasta) from another yeast and
>>> - transcript annotation (gff) from cufflinks and StringTie
>>> 
>>> I'd like to compare the maker results for
>>> - proteins and StringTie
>>> Vs.
>>> - proteins and cufflinks
>>> 
>>> I used the default options, except:
>>> genome=<genome fasta>
>>> 
>>> protein=<protein fasta>
>>> est_gff=<transcript gff>
>>> 
>>> est2genome=1
>>> protein2genome=1
>>> 
>>> (An example is attached.)
>>> 
>>> Then I ran maker:
>>> 
>>> maker -RM_off -c 24
>>> find . -type f -name '*.gff' -exec cat {} + | grep maker > filtered-maker-prediction.gff
>>> 
>>> (The run seems to be okay. There were no FAILED, ... in the log. Cf. attachment)
>>> 
>>> Each maker run was started in a separate subdirectory.
>>> However, I realized that both maker runs yielded almost the same result (just one minor edit). This made me curious.
>>> As far as I understood the files, I received the (filtered?) exonerate predictions for the proteins (from the other yeast). Is this correct? Why did I not receive any predictions (purely) based on the RNA-seq data? Did I do something wrong?
>>> 
>>> I'm looking forward to your reply.
>>> 
>>> Best regards, Jens
>>> 
>>> 
>>>> -----Original Message-----
>>>> From: Carson Hinton Holt [mailto:carson.holt at genetics.utah.edu]
>>>> Sent: Tuesday, September 19, 2017 23:37
>>>> To: Keilwagen, Jens
>>>> Subject: Re: MAKER
>>>> 
>>>> MAKER cannot use the BAM directly, but you can use something like stringtie or trinity to assemble a transcript fasta that can be given to the est= option.
>>>> 
>>>> Ab initio gene prediction is only enabled if you specify an hmm or species file to use. If all you want is homology based annotation, you can try the est2genome and protein2genome options. Note the final models may be partial if the alignments do not cover the gene end to end.
>>>> 
>>>> --Carson
>>>> 
>>>> 
>>>> 
>>>>> On Sep 18, 2017, at 4:02 AM, Keilwagen, Jens <jens.keilwagen at julius-kuehn.de> wrote:
>>>>> 
>>>>> Hi Carson,
>>>>> 
>>>>> thanks a lot for your last email that .
>>>>> 
>>>>> I was asked to do homology-based gene prediction using RNA-seq, and Maker was proposed as one option.
>>>>> Hence I'd like to ask how to do that in the best possible way.
>>>>> I have mapped RNA-seq data (SAM/BAM) and a fasta of proteins from a related species. How can I integrate the RNA-seq data?
>>>>> 
>>>>> Is it possible to deactivate ab-initio gene prediction by Augustus or SNAP?
>>>>> 
>>>>> Thanks a lot in advance.
>>>>> 
>>>>> Best regards, Jens
>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Carson Holt [mailto:carson.holt at genetics.utah.edu]
>>>>>> Sent: Thursday, February 18, 2016 19:03
>>>>>> To: Keilwagen, Jens
>>>>>> Cc: Mark Yandell
>>>>>> Subject: Re: MAKER
>>>>>> 
>>>>>> GeMoMa sounds like an interesting tool. If it produces GFF3, you could give the GFF3 results to the pred_gff= option in MAKER (comma separated lists accepted). The GFF3 file of predictions must be in the same coordinate space as the assembly being annotated (genome= option). Whatever you give to pred_gff will be treated as raw predictions by MAKER and will only be accepted as a final model if there are evidence alignments (protein/EST) that support the model, and if there are multiple alternate models at the same locus, only the model that is best supported by the protein/transcript evidence is kept.
>>>>>> 
>>>>>> You can also set the keep_preds=1 option when using pred_gff. This will cause even raw predictions with no evidence support to be maintained. In the event of multiple models with no evidence support, the model best matching the consensus of alternate models will be maintained.
>>>>>> 
>>>>>> Alternatively you can use the model_gff= option (comma separated list ok) to input the GFF3 file. model_gff features are given higher confidence than pred_gff. At least one model will always be kept regardless of evidence support (same rules as pred_gff selection for which model to keep when there are multiple). But model_gff will also affect how evidence clusters are determined compared to pred_gff (model_gff features are allowed to merge bridging evidence clusters). MAKER will also go to extra lengths to pull forward existing names and other data in the GFF3 for model_gff features.
>>>>>> 
>>>>>> If you do not have GFF3 files in the right coordinate space, but
>> do
>>>>>> have protein fasta or transcript fasta for the GeMoMa predictions,
>>>>>> you can supply these to the protein= and transcript= options in
>>>> MAKER
>>>>>> together with est2genome=1 or protein2genome=1. This will cause
>>>> MAKER
>>>>>> to place the models using exonerate. You would probably also need
>>>>>> to add est_forward=1 to the control files to have MAKER try and
>>>>>> derive model names from the name of evidence alignments they were
>>>>>> derived from if you go this route.
>>>>>> 
>>>>>> You can also try treating the GFF3 predictions as hints to
>>>>>> traditional ab initio gene finders like SNAP or Augustus by giving
>>>>>> them to the est_gff= or protein_gff= options (i.e. make GeMoMa
>>>>>> predictions inform the behavior of predictors like SNAP and
>>>>>> Augustus). Might be interesting. You would have to alter results
>> to
>>>>>> be match/match_part
>>>>>> GFF3 features to give them to the est_gff or protein_gff options.
>>>>>> 
>>>>>> Let me know if you have any more questions, and I?ll do my best to
>>>>>> help.
>>>>>> 
>>>>>> Thanks,
>>>>>> Carson
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Feb 18, 2016, at 10:22 AM, Mark Yandell
>>>>>> <myandell at genetics.utah.edu> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> Mark Yandell
>>>>>>> Professor of Human Genetics
>>>>>>> H.A. & Edna Benning Presidential Endowed Chair Co-director USTAR
>>>>>>> Center for Genetic Discovery Eccles Institute of Human Genetics
>>>>>>> University of Utah
>>>>>>> 15 North 2030 East, Room 2100
>>>>>>> Salt Lake City, UT 84112-5330
>>>>>>> ph:801-587-7707
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On 2/18/16, 8:34 AM, "Keilwagen, Jens" <jens.keilwagen at jki.bund.de> wrote:
>>>>>>> 
>>>>>>>> Dear Prof. Yandell,
>>>>>>>> 
>>>>>>>> we have published a homology-based gene prediction program today:
>>>>>>>> https://nar.oxfordjournals.org/content/early/2016/02/17/nar.gkw092
>>>>>>>> and I'd like to ask how we can use MAKER to combine predictions
>>>>>>>> of GeMoMa using different reference organisms, i.e. we try to
>>>>>>>> predict the genes of a target organism (e.g. wheat) using the
>>>>>>>> annotated genes of other reference organisms (e.g. grasses).
>>>>>>>> GeMoMa returns for each reference organism a GFF with the
>>>>>>>> predicted gene models in the target organism.
>>>>>>>> 
>>>>>>>> It would be great if you or someone from your team could give us
>>>>>>>> some hints or point us to the correct paragraph in the documentation.
>>>>>>>> 
>>>>>>>> Thanks a lot and best regards, Jens
>>>>>>>> 
>>>>>>>> ---
>>>>>>>> 
>>>>>>>> Dr. Jens Keilwagen
>>>>>>>> 
>>>>>>>> Julius Kühn-Institut (JKI) - Federal Research Centre for
>>>>>>>> Cultivated Plants
>>>>>>>> 	Institute for Biosafety in Plant Biotechnology
>>>>>>>> 
>>>>>>>> Erwin-Baur-Straße 27
>>>>>>>> 06484 Quedlinburg
>>>>>>>> Germany
>>>>>>>> 
>>>>>>>> Phone: ++49 (0)3946 47 510
>>>>>>>> EMail: jens.keilwagen at jki.bund.de
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>> 
>>> <maker_opts.ctl><slurm-278767.out>
> 


From jens.keilwagen at julius-kuehn.de  Fri Sep 22 14:15:23 2017
From: jens.keilwagen at julius-kuehn.de (Keilwagen, Jens)
Date: Fri, 22 Sep 2017 20:15:23 +0000
Subject: [maker-devel] MAKER
In-Reply-To: <B11ADE34-41A4-4EA7-A0AC-D4DFD649991D@genetics.utah.edu>
References: <80e44920c4d64d6e82d18e5535656a32@JKI-EXPF03.braunschweig.bba.intern>
	<D2EB4BE8.3069C%myandell@genetics.utah.edu>
	<3BE2DF2C-9443-4314-9524-F0109AAE42E8@genetics.utah.edu>
	<ddad51a1aab5452b816be741d2969d81@JKI-EXPF03.braunschweig.bba.intern>
	<9867D99C-E4AD-4673-BB0B-804A1551F9F3@genetics.utah.edu>
	<f9b68e0588dd4ce6a65dcd63cb4f7b4e@JKI-EXPF03.braunschweig.bba.intern>
	<B11ADE34-41A4-4EA7-A0AC-D4DFD649991D@genetics.utah.edu>
Message-ID: <1f033c3dcf414407a888bbef8201d469@JKI-EXPF03.braunschweig.bba.intern>

Hi Carson,

Thanks a lot for the information.

Just to be sure that I understand you right: It is impossible to obtain MAKER results based on RNA-seq and homology that differ from purely homology-based MAKER results?

Could you confirm that?

Thanks a lot and best regards, Jens

> -----Original Message-----
> From: Carson Hinton Holt [mailto:carson.holt at genetics.utah.edu]
> Sent: Friday, 22 September 2017 22:04
> To: Keilwagen, Jens
> Cc: Maker Mailing List
> Subject: Re: MAKER
> 
> MAKER won't produce est2genome results for est_gff. This is partially
> because est2genome results are only used for training gene predictors.
> So you are essentially just getting protein2genome results from your
> runs. Once you get a gene predictor trained you will see a difference,
> as it will use the intron/exon structure of alignments as hints to
> improve gene predictor performance.
> 
> --Carson
> 
> 
> > On Sep 21, 2017, at 1:57 AM, Keilwagen, Jens <jens.keilwagen at julius-kuehn.de> wrote:
> >
> > Hi Carson,
> >
> > I have tried the proposed options for a small example (yeast).
> >
> > I had
> > - proteins (fasta) from another yeast and
> > - transcript annotation (gff) from cufflinks and StringTie
> >
> > I'd like to compare the maker results for
> > - proteins and StringTie
> > Vs.
> > - proteins and cufflinks
> >
> > I used the default options, except:
> > genome=<genome fasta>
> >
> > protein=<protein fasta>
> > est_gff=<transcript gff>
> >
> > est2genome=1
> > protein2genome=1
> >
> > (An example is attached.)
> >
> > Then I ran maker:
> >
> > maker -RM_off -c 24
> > find . -type f -name '*.gff' -exec cat {} + | grep maker > filtered-maker-prediction.gff
> >
> > (The run seems to be okay. There were no FAILED, ... in the log. Cf.
> > attachment)
> >
> > Each maker run was started in a separate subdirectory.
> > However, I realized that both maker runs yielded almost the same
> > result (just one minor edit). This made me curious.
> > As far as I understood the files, I received the (filtered?)
> > exonerate predictions for the proteins (from the other yeast). Is this
> > correct? Why did I not receive any predictions (purely) based on the
> > RNA-seq data? Did I do something wrong?
> >
> > I'm looking forward to your reply.
> >
> > Best regards, Jens
> >
> >
> >> -----Original Message-----
> >> From: Carson Hinton Holt [mailto:carson.holt at genetics.utah.edu]
> >> Sent: Tuesday, 19 September 2017 23:37
> >> To: Keilwagen, Jens
> >> Subject: Re: MAKER
> >>
> >> MAKER cannot use the BAM directly, but you can use something like
> >> stringtie or trinity to assemble a transcript fasta that can be given
> >> to the est= option.
> >>
> >> Ab initio gene prediction is only enabled if you specify an hmm or
> >> species file to use.  If all you want is homology based annotation,
> >> you can try the est2genome and protein2genome options. Note the final
> >> models may be partial if the alignments do not cover the gene end to
> >> end.
> >>
> >> --Carson
> >
> > <maker_opts.ctl><slurm-278767.out>


From venyao at qq.com  Sun Sep 24 03:08:43 2017
From: venyao at qq.com (=?ISO-8859-1?B?V2VuIFlhbw==?=)
Date: Sun, 24 Sep 2017 17:08:43 +0800
Subject: [maker-devel] integrate gmap into Maker
Message-ID: <tencent_C1B73CB316EAE11D6C6509793B207C5AE908@qq.com>

Dear Guys,


I am using Maker to annotate my genome sequence. However, it takes too much time.

By default, Maker uses Blastn and Exonerate to align ESTs or assembled transcripts to the genome.

I am wondering if I can use GMAP to align the assembled transcripts to the genome and then provide the alignment to Maker. If so, this may save a lot of time, as GMAP is very fast.


Thanks!


Best regards,


Wen Yao

From eennadi at gmail.com  Sun Sep 24 15:24:10 2017
From: eennadi at gmail.com (Emmanuel Nnadi)
Date: Sun, 24 Sep 2017 22:24:10 +0100
Subject: [maker-devel] Maker not installing
In-Reply-To: <8576F599-08C0-455A-B0D7-8A1110C6F18F@mail.ufl.edu>
References: <CAOXM=CXp_YU=4kaNCFPVnoWKTzAP5we5XTFX6uFTWJSY_11fng@mail.gmail.com>
	<8576F599-08C0-455A-B0D7-8A1110C6F18F@mail.ufl.edu>
Message-ID: <CAOXM=CUH4NjOev7cW2-d8EUKJPmQ1Ob_hLQX-mFG_jTSVfzD3Q@mail.gmail.com>

Hello,

Good day,

I am trying to assign putative gene function to the maker generated fasta.
I am using NCBI

I keep getting this error
  Command line argument error: Argument "query". File is not accessible:
`muc1_genome_snap2.all.maker.snap_masked.proteins.fasta'

What do I do?

Can I use Blast2GO in place of the NCBI command-line software?

Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:
https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Mon, Aug 28, 2017 at 2:00 PM, Ence,daniel <d.ence at ufl.edu> wrote:

> Hi Emmanuel, in order for anyone to help you, you need to post to the
> mailing list the command and output (including errors) of the step that
> didn't work.
>
> Thanks,
> Daniel Ence
>
>
> On Aug 27, 2017, at 10:16 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>
> Hello all,
>
> I downloaded Maker and tried to install it. I succeeded in installing all
> prerequisites however running maker ./build install, it showed that maker
> installed.
>
> However trying to run maker it wouldn't run.
>
> Please how do I install maker to run on local computer?
>
> Thanks
>
> Nnadi Nnaemeka Emmanuel
> Department of Microbiology,
> Faculty of Natural and Applied Science,
> Plateau State University, Bokkos, Plateau State, Nigeria.
> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
>
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
>
>

From dandence at gmail.com  Mon Sep 25 08:11:31 2017
From: dandence at gmail.com (Daniel Ence)
Date: Mon, 25 Sep 2017 10:11:31 -0400
Subject: [maker-devel] integrate gmap into Maker
In-Reply-To: <tencent_C1B73CB316EAE11D6C6509793B207C5AE908@qq.com>
References: <tencent_C1B73CB316EAE11D6C6509793B207C5AE908@qq.com>
Message-ID: <7E5F06C8-05B2-447F-A695-DDE7673BDEFF@gmail.com>

Without commenting on the merits of GMAP vs Blastn or Exonerate, you can provide evidence alignments from any source in GFF format in the MAKER control files. I think for GMAP this would mean converting the SAM/BAM outputs to GFF3 format, but I don't know those steps off the top of my head.
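
As a rough illustration of what such a conversion could produce, here is a minimal Python sketch that turns one spliced alignment into GFF3 match/match_part lines, the feature types mentioned earlier in this archive for the est_gff= option. The function name, the "gmap" source label, and the ID/Parent naming scheme are assumptions made for the example. It does not parse SAM/BAM itself and only shows the target layout.

```python
def alignment_to_gff3(seqid, aln_id, strand, blocks, source="gmap"):
    """Return GFF3 lines for one alignment.

    blocks is a list of 1-based inclusive (start, end) pairs on the contig,
    one pair per aligned exon. The output has one 'match' line spanning all
    blocks and one 'match_part' line per block, linked through Parent=.
    The source label and ID scheme are illustrative assumptions.
    """
    if not blocks:
        raise ValueError("an alignment needs at least one block")
    blocks = sorted(blocks)
    start = blocks[0][0]
    end = max(block_end for _, block_end in blocks)
    lines = [
        "\t".join([seqid, source, "match", str(start), str(end),
                   ".", strand, ".", f"ID={aln_id};Name={aln_id}"])
    ]
    for number, (block_start, block_end) in enumerate(blocks, 1):
        attributes = f"ID={aln_id}:part{number};Parent={aln_id}"
        lines.append(
            "\t".join([seqid, source, "match_part", str(block_start),
                       str(block_end), ".", strand, ".", attributes])
        )
    return lines
```

Calling alignment_to_gff3("contig_1", "tx1", "+", [(100, 250), (500, 650)]) gives three tab-separated lines that can be written to a file after a "##gff-version 3" header. Whether MAKER needs further attributes on these features is not stated in this thread, so check the result on a small contig first.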

~Daniel 


> On Sep 24, 2017, at 5:08 AM, Wen Yao <venyao at qq.com> wrote:
> 
> Dear Guys,
> 
>  
> 
> I am using Maker to annotate my genome sequence. However, it takes too much time.
> 
> By default, Maker uses Blastn and Exonerate to align ESTs or assembled transcripts to the genome.
> 
> I am wondering if I can use GMAP to align the assembled transcripts to the genome and then provide the alignment to Maker. If so, this may save a lot of time, as GMAP is very fast.
> 
> 
> 
> Thanks!
> 
>  
> 
> Best regards,
> 
>  
> 
> Wen Yao
> 


From carsonhh at gmail.com  Mon Sep 25 10:07:39 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Mon, 25 Sep 2017 10:07:39 -0600
Subject: [maker-devel] Maker not installing
In-Reply-To: <CAOXM=CUH4NjOev7cW2-d8EUKJPmQ1Ob_hLQX-mFG_jTSVfzD3Q@mail.gmail.com>
References: <CAOXM=CXp_YU=4kaNCFPVnoWKTzAP5we5XTFX6uFTWJSY_11fng@mail.gmail.com>
	<8576F599-08C0-455A-B0D7-8A1110C6F18F@mail.ufl.edu>
	<CAOXM=CUH4NjOev7cW2-d8EUKJPmQ1Ob_hLQX-mFG_jTSVfzD3Q@mail.gmail.com>
Message-ID: <07342091-897A-46C2-B000-76A283FE5FB1@gmail.com>

I'm not sure what you mean by NCBI. Do you mean BLAST? If so, you probably did not format and index your input database before running BLAST. See the BLAST documentation.

Also, the file you are using --> muc1_genome_snap2.all.maker.snap_masked.proteins.fasta

That is not the MAKER result file. That is a reference fasta of raw SNAP results. The MAKER result file will have a name like this (see the MAKER documentation) --> muc1_genome_snap2.all.maker.proteins.fasta

--Carson


> On Sep 24, 2017, at 3:24 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
> 
> Hello,
> 
> Good day,
> 
> I am trying to assign putative gene function to the maker generated fasta. I am using NCBI
> 
> I keep getting this error
>   Command line argument error: Argument "query". File is not accessible:  `muc1_genome_snap2.all.maker.snap_masked.proteins.fasta'
> 
> What do I do?
> 
> Can I use Blast2GO in place of the NCBI command-line software?
> 
> Nnadi Nnaemeka Emmanuel
> Department of Microbiology,
> Faculty of Natural and Applied Science,
> Plateau State University, Bokkos, Plateau State, Nigeria.
> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
> On Mon, Aug 28, 2017 at 2:00 PM, Ence,daniel <d.ence at ufl.edu> wrote:
> Hi Emmanuel, in order for anyone to help you, you need to post to the mailing list the command and output (including errors) of the step that didn't work.
> 
> Thanks,
> Daniel Ence
> 
> 
>> On Aug 27, 2017, at 10:16 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
>> 
>> Hello all,
>> 
>> I downloaded Maker and tried to install it. I succeeded in installing all prerequisites however running maker ./build install, it showed that maker installed.
>> 
>> However trying to run maker it wouldn't run.
>> 
>> Please how do I install maker to run on local computer?
>> 
>> Thanks
>> 
>> Nnadi Nnaemeka Emmanuel
>> Department of Microbiology,
>> Faculty of Natural and Applied Science,
>> Plateau State University, Bokkos, Plateau State, Nigeria.
>> Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications
>> 
>>    


From xvazquezc at gmail.com  Tue Sep 26 01:23:13 2017
From: xvazquezc at gmail.com (=?UTF-8?Q?Xabier_V=C3=A1zquez=2DCampos?=)
Date: Tue, 26 Sep 2017 17:23:13 +1000
Subject: [maker-devel] question about Maker-MPI
Message-ID: <CAL0hg4EemGn5F25U6xQjBpG0i_ZYzSa=E3E2OXgBu8Q_2qm87A@mail.gmail.com>

Hi Carson,
We finally got Maker working with MPI (mpich; openmpi was a dead end...)
and I have a question about how Maker distributes the computational load.
I know, correct me if I'm wrong, that with MPI, Maker runs blast in
parallel (1 instance per thread) for protein2genome and est2genome. This
indeed improves the speed of the initial run enormously.
But does it take advantage of this when running the gene
predictors? I think there is no benefit from multiple CPUs in non-MPI mode,
but I have no idea about MPI mode.
Thank you in advance,
Xabi

-- 
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From carsonhh at gmail.com  Tue Sep 26 09:28:58 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 26 Sep 2017 09:28:58 -0600
Subject: [maker-devel] question about Maker-MPI
In-Reply-To: <CAL0hg4EemGn5F25U6xQjBpG0i_ZYzSa=E3E2OXgBu8Q_2qm87A@mail.gmail.com>
References: <CAL0hg4EemGn5F25U6xQjBpG0i_ZYzSa=E3E2OXgBu8Q_2qm87A@mail.gmail.com>
Message-ID: <E29F4653-61A3-4E33-967A-4E1A9C8C4721@gmail.com>

MAKER parallelizes at multiple levels. For the ab initio predictors, it will run multiple contigs simultaneously (so each one will get its own ab initio predictor running). For large contigs it will further divide them into 10Mb chunks, and each chunk will run simultaneously.
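
To make the chunking idea concrete, here is a small Python sketch that splits a contig of a given length into consecutive 10Mb intervals. Only the 10Mb figure comes from the message above. The function name, the 1-based inclusive coordinates, and the absence of any overlap between chunks are assumptions for illustration and may not match what MAKER does internally.

```python
def chunk_intervals(contig_length, chunk_size=10_000_000):
    """Split a contig into consecutive 1-based inclusive (start, end) chunks.

    chunk_size defaults to the 10Mb figure quoted in the thread; the exact
    boundaries MAKER uses are not documented here, so this is a sketch only.
    """
    if contig_length < 0 or chunk_size <= 0:
        raise ValueError("contig_length must be >= 0 and chunk_size > 0")
    return [
        (offset + 1, min(offset + chunk_size, contig_length))
        for offset in range(0, contig_length, chunk_size)
    ]
```

Under these assumptions an 18Mb contig yields two chunks, so at most two predictor jobs could run in parallel for that contig, in addition to the jobs for other contigs.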

--Carson


> On Sep 26, 2017, at 1:23 AM, Xabier Vázquez-Campos <xvazquezc at gmail.com> wrote:
> 
> Hi Carson,
> We finally got Maker working with MPI (mpich; openmpi was a dead end...) and I have a question about how Maker distributes the computational load.
> I know, correct me if I'm wrong, that with MPI, Maker runs blast in parallel (1 instance per thread) for protein2genome and est2genome. This indeed improves the speed of the initial run enormously.
> But does it take advantage of this when running the gene predictors? I think there is no benefit from multiple CPUs in non-MPI mode, but I have no idea about MPI mode.
> Thank you in advance,
> Xabi
> 
> -- 
> Xabier Vázquez-Campos, PhD
> Research Associate
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA


From cjfields at illinois.edu  Mon Sep 25 08:53:39 2017
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 25 Sep 2017 14:53:39 +0000
Subject: [maker-devel] Maker not installing
In-Reply-To: <78A8137E-42B4-4766-9E4A-1B3C2F4FC578@gmail.com>
References: <CAOXM=CXp_YU=4kaNCFPVnoWKTzAP5we5XTFX6uFTWJSY_11fng@mail.gmail.com>
	<8576F599-08C0-455A-B0D7-8A1110C6F18F@mail.ufl.edu>
	<CAOXM=CWWENFHJUhYPawM5+F4571HYOyp3oWD1xju9H9LABq9VQ@mail.gmail.com>
	<113031DF-E733-4EEF-BDB7-405C15F0CA24@mail.ufl.edu>
	<CAOXM=CVrYeNV9i6TxYDn1B-dCA-OYh+VqsEmc-PsA2rP14UZ9A@mail.gmail.com>
	<546610AC-0B7B-4D02-BBF3-E847B95D7F0D@mail.ufl.edu>
	<C3F7BAD9-C134-4EC9-9792-F2C27A78FBFD@gmail.com>
	<CAOXM=CXeT-zmFubptPnWsAygnv8o+EQnCWD60iLf5eFFBkb6rg@mail.gmail.com>
	<8EAFE412-9EF7-4DB7-85A3-632BAC3372FD@gmail.com>
	<CAOXM=CVbJonkz=O5=N_NmB1L+524D6Uk9Fjh-ZBoJmu+9AsMFQ@mail.gmail.com>
	<C9D0377D-1C89-4910-A6BE-FA7DD3D8CDCB@gmail.com>
	<CAOXM=CX7yZTTbBALhjBzz0qtMxQGxBNVO0UEnkis3aDvVf=GLA@mail.gmail.com>
	<C06B6277-1CFA-4B8D-8E80-D08910EBD77C@gmail.com>
	<CAOXM=CXSBG+h1DYoOBrFHYr_JXZX6YCFyo415HRGHozMcmHicg@mail.gmail.com>
	<EC4577AD-A3C5-45CD-ADE7-5D19ED089833@gmail.com>
	<CAOXM=CVby_i3vfF_H46RTZWoqBLMCBGNqJoywxG_2KQJPc5Ttw@mail.gmail.com>
	<7440971C-8A18-4A07-91B6-AD16D17F0766@gmail.com>
	<CAOXM=CXUBcvbqZdT2nviYnFyC3X4Xbj3oREVt650n0c01HESyg@mail.gmail.com>
	<426E63C1-8C82-4809-B2A1-EF7A909E6712@gmail.com>
	<CAOXM=CUPj9EGj_LNEhBpOw0oCiQKoVw8JuPv8-WASo2paSXkug@mail.gmail.com>
	<CAOXM=CUxE-MSWsjZwMYeZ-aOu9Jc9bP6k1uzDLrEyfyqwR-B_Q@mail.gmail.com>
	<78A8137E-42B4-4766-9E4A-1B3C2F4FC578@gmail.com>
Message-ID: <ED8DB3BD-0981-4883-8CE0-E920BCEE0CC6@illinois.edu>

Emmanuel,

Look for anything that will help calculate basic assembly metrics, such as N50, NG50, L50, etc.; these almost always give overall assembly size, and total scaffolds/contigs.  For instance I've used this:

http://korflab.ucdavis.edu/datasets/Assemblathon/Assemblathon2/Basic_metrics/assemblathon_stats.pl

(it requires FAlite, which is here: http://korflab.ucdavis.edu/Unix_and_Perl/FAlite.pm )

The Broad also has GAEMR (http://software.broadinstitute.org/software/gaemr/ ), but I haven't tested it myself (I've heard it's a bit finicky).

Also, see this: https://www.biostars.org/p/237591/ , which has a few more options.
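
If none of those tools is at hand, the two numbers asked about (sequence count and total size) plus N50 can be computed with a few lines of Python. This is a minimal sketch, not a replacement for the scripts above. The function names are made up for the example, and it counts every non-header character as a base, so gaps written as N are included in the size.

```python
def read_fasta_lengths(lines):
    """Return the length of each sequence in an iterable of FASTA lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def assembly_summary(lines):
    """Sequence count, total size and N50 for an iterable of FASTA lines."""
    lengths = read_fasta_lengths(lines)
    return {"sequences": len(lengths), "total_size": sum(lengths), "n50": n50(lengths)}
```

For a real assembly, pass an open file handle, e.g. assembly_summary(open("genome.fasta")). Note that the sequence count is the number of contigs or scaffolds in the file, which is only the chromosome count if the assembly is chromosome-level.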

chris

From: maker-devel <maker-devel-bounces at yandell-lab.org> on behalf of Carson Holt <carsonhh at gmail.com>
Date: Friday, September 22, 2017 at 3:09 PM
To: Emmanuel Nnadi <eennadi at gmail.com>
Cc: "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
Subject: Re: [maker-devel] Maker not installing

MAKER can't give you those details. All MAKER does is try to identify gene models against the assembly you provide.

--Carson

On Sep 22, 2017, at 1:27 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:

Hello all,
Please how can I determine the following in maker:
1. The total number of chromosomes
2. The size of my genome


Thanks

Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Fri, Sep 1, 2017 at 10:52 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
Ok, thanks.
Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications


On Sep 1, 2017 10:50 PM, "Carson Holt" <carsonhh at gmail.com> wrote:
It would need to be a new run. You won't be able to use the updated contig names with the old run.
--Carson

Sent from my iPhone

On Sep 1, 2017, at 3:41 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:
Hi Carson,
Thanks for the tip:
perl -ane 's/1_S7_R1_001_\(paired\)_trimmed_\(paired\)_//g; print' genome.fasta

It worked well; when I ran it, it removed 1_S7_R1_001_(paired)_trimmed_(paired)_ from the contig names.

However, I have already run maker with the old names that contain 1_S7_R1_001_(paired)_trimmed_(paired)_.

1. How can I apply the change when maker has already produced some files from the old sequence?

I have spent more than 24 hours running maker and it has produced some folders already.

How can I make this change?

Thanks


Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Fri, Sep 1, 2017 at 4:54 PM, Carson Holt <carsonhh at gmail.com> wrote:
BLAST, which is used by MAKER, cannot handle really long contig names. MAKER tries to get around this by adding a secondary tag to the fasta header when long names are detected. Even then it would be better to change the IDs of your contigs to avoid downstream failures.

I would recommend removing '1_S7_R1_001_(paired)_trimmed_(paired)_' from each contig name.

Example command to do that -->
perl -ane 's/1_S7_R1_001_\(paired\)_trimmed_\(paired\)_//g; print' genome.fasta
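
Note that the one-liner above prints the renamed sequences to standard output, so its output has to be redirected to a new file. For anyone who prefers Python, the sketch below does the same header clean-up and additionally reports identifiers that still exceed the 50-character limit quoted in the RepeatMasker error further down this thread. The function names are made up for the example, and the prefix is removed only from header lines.

```python
MAX_ID_LENGTH = 50  # limit taken from the RepeatMasker error quoted in this thread


def strip_prefix(lines, prefix):
    """Yield FASTA lines with the given prefix removed from header lines only."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            line = ">" + line[1:].replace(prefix, "")
        yield line


def ids_over_limit(lines, max_length=MAX_ID_LENGTH):
    """Return sequence identifiers (first word of each header) that are too long."""
    too_long = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields and len(fields[0]) > max_length:
                too_long.append(fields[0])
    return too_long
```

A typical use would be to read genome.fasta, write the output of strip_prefix to a new file, confirm that ids_over_limit returns an empty list for it, and then start a fresh MAKER run on the renamed file.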

--Carson


On Aug 30, 2017, at 3:54 PM, Emmanuel Nnadi <eennadi at gmail.com> wrote:

Hi Carson,
Thanks for your response; it's been helpful.

Please bear with me as I work through this.

1. How do I generate ESTs for my novel sequences?
2. I am currently running maker without EST and protein sequences. Is that wrong? Can it predict properly?
3. One contig just returned this error:
FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 )
 at /usr/local/bin/RepeatMasker line 1464.
FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 )
 at /usr/local/bin/RepeatMasker line 1464.
FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 )
 at /usr/local/bin/RepeatMasker line 1464.
ERROR: RepeatMasker failed
--> rank=NA, hostname=emmannaemekas-MacBook-Pro.local
ERROR: Failed while doing repeat masking
ERROR: Chunk failed at level:0, tier_type:1
FAILED CONTIG:1_S7_R1_001_(paired)_trimmed_(paired)_contig_2

ERROR: Chunk failed at level:2, tier_type:0
FAILED CONTIG:1_S7_R1_001_(paired)_trimmed_(paired)_contig_2

examining contents of the fasta file and run log


Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Wed, Aug 30, 2017 at 4:12 PM, Carson Holt <carsonhh at gmail.com> wrote:
You can query valid species names using the queryTaxonomyDatabase.pl script that comes with RepeatMasker. Try not to be too specific. In general you should use the genus rather than the species, for example (or even use all of RepBase).

Example -->
perl .../RepeatMasker/util/queryTaxonomyDatabase.pl -species "drosophila"

--Carson


On Aug 30, 2017, at 9:05 AM, Emmanuel Nnadi <eennadi at gmail.com> wrote:

Hi Carson,

Thanks.
I was able to start using maker.

However, I am working with a novel plant genome. I had set the repeat masking to
1. Dcotrep, a name from the RepBase release, but maker reported it as not known to RepeatMasker.

How can I use specific known genomes for repeat masking?
Thanks
Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications


On Aug 29, 2017 4:26 PM, "Carson Holt" <carsonhh at gmail.com<mailto:carsonhh at gmail.com>> wrote:
MAKER will read the genome= options from the maker_opts.ctl file in your current directory or the maker_opts.ctl you specified on the command line. The error means you have left the value empty. Perhaps you did not save the changes you made or you did not specify the location of the maker_opts.ctl file to use.

You can check the contents of the file using cat. Example -> cat maker_opts.ctl
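
The same check can also be scripted. Below is a minimal Python sketch, assuming the control file follows the "key=value #comment" layout quoted in this thread. read_ctl is a hypothetical helper, not part of MAKER, and it would truncate any value that itself contains a "#".

```python
def read_ctl(text):
    """Parse MAKER-style control file text into a dict of option -> value."""
    options = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()
    return options

# Inline sample with an empty genome= value, as in the reported error.
sample = """#-----Genome (these are always required)
genome= #genome sequence (fasta file or fasta embeded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic
"""
opts = read_ctl(sample)
if not opts.get("genome"):
    print("genome is empty in this control file")
```

To check a real file, pass the result of open("maker_opts.ctl").read() to read_ctl, running from the directory where maker will be started.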

--Carson


On Aug 29, 2017, at 5:11 AM, Emmanuel Nnadi <eennadi at gmail.com<mailto:eennadi at gmail.com>> wrote:

Hi Carson,
Thanks a lot for yesterday. I was able to resolve the issue of running MAKER, and I followed the commands in the tutorial.
However, I encountered another problem.

I ran the command nano -c maker_opts.ctl and specified the name of the genome, 1_S7_assembly.fa. The file contains the following, but when I ran MAKER in another tab it gave the error shown further down:

#-----Genome (these are always required)
genome=1_S7_assembly.fa #genome sequence (fasta file or fasta embeded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #MAKER derived GFF3 file
est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no

#-----EST Evidence (for best results provide a file for at least one)
est= #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closly relate species in GFF3 format

#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=  #protein sequence file in fasta format (i.e. from mutiple oransisms)
protein_gff=  #aligned protein homology evidence from an external GFF3 file

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=all #select a model organism for RepBase masking in RepeatMasker
rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein=/Users/emmannaemeka/Desktop/Gpm/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)


I ran the maker command in another tab and it returned the following:
STATUS: Parsing control files...
ERROR: You have failed to provide a value for 'genome' in the control files.

--> rank=NA, hostname=emmannamekasMBP


Questions
1. When specifying the genome location, do I need to run MAKER in the same tab, or can I open another bash tab?
2. My genome is novel and has no protein set. How do I generate protein FASTA and EST evidence for a de novo sequence?


Thanks

Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Mon, Aug 28, 2017 at 6:47 PM, Carson Holt <carsonhh at gmail.com<mailto:carsonhh at gmail.com>> wrote:
Here is a class on how to use MAKER taught a couple of years back -> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014

There is also a linked video as well as an amazon image of the class material where you can run the image in the cloud and follow along.

Thanks,
Carson


On Aug 28, 2017, at 11:43 AM, Emmanuel Nnadi <eennadi at gmail.com<mailto:eennadi at gmail.com>> wrote:

Hi Carson,
Thanks a lot

I ran the command maker -h and it returned the output below.

The last thing I wish to ask you: how can I load my genome file and begin annotation?

Thanks

emmannamekasMBP:maker emmannaemeka$ maker -h

MAKER version 2.31.9

Usage:

     maker [options] <maker_opts> <maker_bopts> <maker_exe>


Description:

     MAKER is a program that produces gene annotations in GFF3 format using
     evidence such as EST alignments and protein homology. MAKER can be used to
     produce gene annotations for new genomes as well as update annotations
     from existing genome databases.

     The three input arguments are control files that specify how MAKER should
     behave. All options for MAKER should be set in the control files, but a
     few can also be set on the command line. Command line options provide a
     convenient machanism to override commonly altered control file values.
     MAKER will automatically search for the control files in the current
     working directory if they are not specified on the command line.

     Input files listed in the control options files must be in fasta format
     unless otherwise specified. Please see MAKER documentation to learn more
     about control file  configuration.  MAKER will automatically try and
     locate the user control files in the current working directory if these
     arguments are not supplied when initializing MAKER.

     It is important to note that MAKER does not try and recalculated data that
     it has already calculated.  For example, if you run an analysis twice on
     the same dataset you will notice that MAKER does not rerun any of the
     BLAST analyses, but instead uses the blast analyses stored from the
     previous run. To force MAKER to rerun all analyses, use the -f flag.

     MAKER also supports parallelization via MPI on computer clusters. Just
     launch MAKER via mpiexec (i.e. mpiexec -n 40 maker). MPI support must be
     configured during the MAKER installation process for this to work though


Options:

     -genome|g <file>    Overrides the genome file path in the control files

     -RM_off|R           Turns all repeat masking options off.

     -datastore/         Forcably turn on/off MAKER's two deep directory
      nodatastore        structure for output.  Always on by default.

     -old_struct         Use the old directory styles (MAKER 2.26 and lower)

     -base    <string>   Set the base name MAKER uses to save output files.
                         MAKER uses the input genome file name by default.

     -tries|t <integer>  Run contigs up to the specified number of tries.

     -cpus|c  <integer>  Tells how many cpus to use for BLAST analysis.
                         Note: this is for BLAST and not for MPI!

     -force|f            Forces MAKER to delete old files before running again.
                         This will require all blast analyses to be rerun.

     -again|a            recaculate all annotations and output files even if no
                         settings have changed. Does not delete old analyses.

     -quiet|q            Regular quiet. Only a handlful of status messages.

     -qq                 Even more quiet. There are no status messages.

     -dsindex            Quickly generate datastore index file. Note that this
                         will not check if run settings have changed on contigs

     -nolock             Turn off file locks. May be usful on some file systems,
                         but can cause race conditions if running in parallel.

     -TMP                Specify temporary directory to use.

     -CTL                Generate empty control files in the current directory.

     -OPTS               Generates just the maker_opts.ctl file.

     -BOPTS              Generates just the maker_bopts.ctl file.

     -EXE                Generates just the maker_exe.ctl file.

     -MWAS    <option>   Easy way to control mwas_server for web-based GUI

                              options:  STOP
                                        START
                                        RESTART

     -version            Prints the MAKER version.

     -help|?             Prints this usage statement.


Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Mon, Aug 28, 2017 at 6:36 PM, Carson Holt <carsonhh at gmail.com<mailto:carsonhh at gmail.com>> wrote:
Path needs to be a list of directories to search (you specified an executable location).

So not this -> /Users/emmannaemeka/Desktop/Gpm/maker/bin/maker

Instead it needs to be this -> /Users/emmannaemeka/Desktop/Gpm/maker/bin
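
A quick way to spot this kind of mistake is to list the PATH entries that are not directories. The sketch below uses only the Python standard library; non_directory_entries is a hypothetical helper name chosen for illustration.

```python
import os
import shutil

def non_directory_entries(path_value):
    """Return the PATH entries that are not existing directories."""
    return [p for p in path_value.split(os.pathsep) if p and not os.path.isdir(p)]

# Report entries such as .../maker/bin/maker, which names a file, not a directory.
path_value = os.environ.get("PATH", "")
for entry in non_directory_entries(path_value):
    print("not a directory:", entry)

# shutil.which returns the full path of the executable, or None if it is not found.
print("maker found at:", shutil.which("maker"))
```

If the maker line prints None, the shell will not find the command by name either, so the directory containing it still needs to be added to PATH.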

--Carson


On Aug 28, 2017, at 11:32 AM, Emmanuel Nnadi <eennadi at gmail.com<mailto:eennadi at gmail.com>>
wrote:

Thanks

I tried to export PATH.

Running echo $PATH in the maker directory returned:

/usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:/Users/emmannaemeka/Desktop/Gpm/maker/bin/maker


1. Does it mean that PATH has been exported?


Secondly, I tried to run the commands maker -h, which maker, and maker -CTL.

Nothing was returned.

2. How do I start up MAKER?
3. Do I need to be in the maker directory to start MAKER?

Thanks


Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Mon, Aug 28, 2017 at 4:49 PM, Carson Holt <carsonhh at gmail.com<mailto:carsonhh at gmail.com>> wrote:
After install the executables will be in the .../maker/bin directory. Example (if you did the install in your home directory) -> ~/maker/bin/maker

You need to add the .../maker/bin directory to your PATH for it to be found just by typing 'maker'

Explanation of the Linux PATH -> http://www.linfo.org/path_env_var.html

--Carson


On Aug 28, 2017, at 8:07 AM, Ence,daniel <d.ence at ufl.edu<mailto:d.ence at ufl.edu>> wrote:

Sorry I should have typed "maker -CTL". If that doesn't work, what is the result of "which maker"?


On Aug 28, 2017, at 10:00 AM, Emmanuel Nnadi <eennadi at gmail.com<mailto:eennadi at gmail.com>> wrote:

Hi Daniel
The reply is
emmannamekasMBP:maker emmannaemeka$ MAKER -ctl
-bash: MAKER: command not found

Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Mon, Aug 28, 2017 at 2:57 PM, Ence,daniel <d.ence at ufl.edu<mailto:d.ence at ufl.edu>> wrote:
Hi, It looks like MAKER installed ok. What is the command that you used to try to run MAKER? Can you show the result of running "MAKER -ctl"?

Thanks,
Daniel Ence


On Aug 28, 2017, at 9:24 AM, Emmanuel Nnadi <eennadi at gmail.com<mailto:eennadi at gmail.com>> wrote:

Hi Ence,
Thanks for your reply,

This is the step and error received

emmannamekasMBP:src emmannaemeka$ ./build install

Installing MAKER...

Building MAKER

Skip /Users/emmannaemeka/desktop/Gpm/maker/src/../perl/config-darwin-thread-multi-2level-5.018002 (unchanged)


The build status is


=============================================================================

STATUS MAKER v2.31.9

==============================================================================

PERL Dependencies:  VERIFIED

External Programs:  VERIFIED

External C Libraries:   VERIFIED

MPI SUPPORT:        DISABLED

MWAS Web Interface: DISABLED

MAKER PACKAGE:      CONFIGURATION OK

Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications

On Mon, Aug 28, 2017 at 2:00 PM, Ence,daniel <d.ence at ufl.edu<mailto:d.ence at ufl.edu>> wrote:
Hi Emmanuel, In order for anyone to help you, you need to post to the mailing list the command and output (including errors) of the step that didn't work.

Thanks,
Daniel Ence


On Aug 27, 2017, at 10:16 AM, Emmanuel Nnadi <eennadi at gmail.com<mailto:eennadi at gmail.com>> wrote:

Hello all,

I downloaded MAKER and tried to install it. I succeeded in installing all prerequisites, and running ./build install showed that MAKER installed.

However, when I try to run MAKER, it won't run.

How do I install MAKER so that it runs on a local computer?

Thanks
Nnadi Nnaemeka Emmanuel
Department of Microbiology,
Faculty of Natural and Applied Science,
Plateau State University, Bokkos, Plateau State, Nigeria.
Publications:  https://www.researchgate.net/profile/Emmanuel_Nnadi/publications


_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com<mailto:maker-devel at box290.bluehost.com>
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org




From tfallon at mit.edu  Tue Sep 26 11:40:21 2017
From: tfallon at mit.edu (Tim Fallon)
Date: Tue, 26 Sep 2017 13:40:21 -0400
Subject: [maker-devel] MAKER changelog?
Message-ID: <DF018D85-759C-4C02-8D97-C45AADACF9A0@mit.edu>

Hi there,

I recently noticed the MAKER 3.0 beta version incremented from 3.0.0 to 3.01.1. Is there a changelog which describes the updates between the two releases?

All the best,
-Tim

Timothy R. Fallon
PhD candidate
Laboratory of Jing-Ke Weng
Department of Biology
MIT

tfallon at mit.edu


From carsonhh at gmail.com  Tue Sep 26 12:34:16 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Tue, 26 Sep 2017 12:34:16 -0600
Subject: [maker-devel] MAKER changelog?
In-Reply-To: <DF018D85-759C-4C02-8D97-C45AADACF9A0@mit.edu>
References: <DF018D85-759C-4C02-8D97-C45AADACF9A0@mit.edu>
Message-ID: <C32D3C31-125B-4D3D-8E0B-CD4ED629E541@gmail.com>

Here you go.

*updated the locations for repbase and augustus
*make library install more portable for newer perl versions
*fix for cdna2genome single exon strand
*updates for better hints in augustus (exact rather than partial intron match)
*added allow_overlap for UTR in fungi and prokaryotes
*uri escape snap name in zff conversion
*fix for BioPerl-live related error (also submitted fix to BioPerl)
*jaccard cluster and bug fixes for cigar string
*Added zff2genebank script for training augustus (adapted from Jason Stajich's zff2augustus_gbk.pl)

--Carson




From qwzhang0601 at gmail.com  Wed Sep 27 08:30:28 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 27 Sep 2017 10:30:28 -0400
Subject: [maker-devel] Suggestions if too many predicted genes
Message-ID: <CAOW6FSLpSEfgyutvK=3oTn=71GsX=TUGXvM4mMbeZj2sV+TuuQ@mail.gmail.com>

Hello:

Thank you for all your previous comments and suggestions. We annotated a
new rodent species using the MAKER2 pipeline. The assembly is about 3.2Gb
with N50 24.3Mb. I included all scaffolds longer than 300bp for gene
annotation (about 250k scaffolds).

For repeat masking, we also built a species-specific library. We used both
transcriptome and protein sequences as evidence (including 10k reviewed
mammalian and 340k predicted rodent protein sequences from UniProt). We
predicted 28800 genes with AED<1 (the "default" gene set).

Of the 28800 predicted proteins, about 90% have an AED value less than 0.5,
and 74% have domains found by InterProScan. It seems the genome was well
annotated, but I still feel 28800 protein-coding genes are too many for a
rodent species. Do you think this gene set is good for downstream analysis
(e.g., gene family expansion analysis, positive selection analysis)? Or can
I do further filtering to bring the number of genes closer to the estimated
number (e.g., 22,000)?

Thanks

Best
Quanwei

From dandence at gmail.com  Wed Sep 27 08:54:30 2017
From: dandence at gmail.com (Daniel Ence)
Date: Wed, 27 Sep 2017 10:54:30 -0400
Subject: [maker-devel] Suggestions if too many predicted genes
In-Reply-To: <CAOW6FSLpSEfgyutvK=3oTn=71GsX=TUGXvM4mMbeZj2sV+TuuQ@mail.gmail.com>
References: <CAOW6FSLpSEfgyutvK=3oTn=71GsX=TUGXvM4mMbeZj2sV+TuuQ@mail.gmail.com>
Message-ID: <42F0CCB8-20DB-4E6F-85CB-CBE939F8208E@gmail.com>

Hi Quanwei, I think that your genome assembly probably contains many contigs that are too small to contain full gene sequences. Rather than 300bp, a minimum scaffold length of 5kbp or 10kbp is a more useful threshold. This is mentioned in the maker_opts.ctl file with the min_contig parameter: "skip genome contigs below this length (under 10kbp are often useless)".

I don't know how many genes are annotated on small (<10kbp) scaffolds and contigs, but excluding those contigs would probably reduce your gene count. These may be fragments or duplicates of genes present on these sequences that weren't assembled properly.
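
To see how much of an assembly falls below a length cutoff before changing min_contig, the sequence lengths can be tallied directly from the FASTA file. This is a rough Python sketch, not MAKER code. The helper names sequence_lengths and split_by_length are hypothetical, and the 10000 default simply mirrors the 10kbp figure mentioned above.

```python
import io

def sequence_lengths(handle):
    """Return a dict of sequence ID -> length for a FASTA file handle."""
    lengths = {}
    current = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths

def split_by_length(lengths, min_len=10000):
    """Partition IDs into (kept, skipped) using a minimum length cutoff."""
    kept = [sid for sid, n in lengths.items() if n >= min_len]
    skipped = [sid for sid, n in lengths.items() if n < min_len]
    return kept, skipped

# In-memory demonstration: one 12 kbp scaffold and one 8 bp scaffold.
demo = io.StringIO(">scaf1\n" + "A" * 12000 + "\n>scaf2\nACGT\nACGT\n")
kept, skipped = split_by_length(sequence_lengths(demo))
print(len(kept), "kept;", len(skipped), "below cutoff")
```

Cross-referencing the skipped IDs against the scaffold column of the annotation would show how many predicted genes sit on short scaffolds.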

Also, using predicted protein sequences from UniProt as evidence in your annotation is probably not advisable, since those sequences are not from genes with experimental evidence. This is the TrEMBL vs Swiss-Prot issue that you asked about earlier.

Additionally requiring a minimum protein length as you asked about earlier could also reduce the gene count. 

Ultimately, you may do whatever filtering you find necessary and justifiable for your annotation depending on the biology of your organism and the methods that generated your assembly, and your annotation. 

Hope this helps, 
Daniel



From michael.s.campbell1 at gmail.com  Wed Sep 27 09:34:11 2017
From: michael.s.campbell1 at gmail.com (Michael Campbell)
Date: Wed, 27 Sep 2017 11:34:11 -0400
Subject: [maker-devel] Suggestions if too many predicted genes
In-Reply-To: <42F0CCB8-20DB-4E6F-85CB-CBE939F8208E@gmail.com>
References: <CAOW6FSLpSEfgyutvK=3oTn=71GsX=TUGXvM4mMbeZj2sV+TuuQ@mail.gmail.com>
	<42F0CCB8-20DB-4E6F-85CB-CBE939F8208E@gmail.com>
Message-ID: <546F5254-0C46-4BB6-93E8-80B5EC7E40D1@gmail.com>

Hi Quanwei,

The first thing that comes to mind with too many genes is undermasked repeats. You could check the Pfam domains for things like integrase, GAG proteins, and other transposon-related domains. I would also look a bit closer at the genes with AEDs greater than 0.5. Looking at things like average number of exons per transcript and average gene and transcript lengths can help pick out dodgy genes. You could also do some filtering on the QI values output by MAKER. It is defensible to create a "higher quality" set by limiting it to genes with AEDs less than 0.5 and putting some requirement on the fraction of splice sites confirmed by EST/mRNA-seq alignments.
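
As a rough illustration of the AED cutoff described above, the Python sketch below selects mRNA features from GFF3 lines. It assumes the mRNA rows carry an _AED=<value> attribute in column 9, which should be confirmed against your own MAKER output. The function names are hypothetical, and no QI or splice-site filter is attempted here.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs

def mrna_ids_below_aed(gff_lines, max_aed=0.5):
    """Return IDs of mRNA features whose _AED attribute is below max_aed."""
    selected = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Skip anything that is not a 9-column mRNA feature row.
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        if "_AED" in attrs and float(attrs["_AED"]) < max_aed:
            selected.append(attrs.get("ID", ""))
    return selected

# Made-up example rows, for illustration only.
demo = [
    "##gff-version 3\n",
    "scaf1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1;_AED=0.12\n",
    "scaf1\tmaker\tmRNA\t2000\t2900\t.\t-\t.\tID=mrna2;Parent=gene2;_AED=0.80\n",
]
print(mrna_ids_below_aed(demo))
```

The selected IDs can then be used to subset the GFF3 and the protein/transcript FASTA files into a stricter gene set.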

Take care,
Mike


From xvazquezc at gmail.com  Wed Sep 27 18:32:30 2017
From: xvazquezc at gmail.com (Xabier Vázquez-Campos)
Date: Thu, 28 Sep 2017 10:32:30 +1000
Subject: [maker-devel] Suggestions if too many predicted genes
In-Reply-To: <546F5254-0C46-4BB6-93E8-80B5EC7E40D1@gmail.com>
References: <CAOW6FSLpSEfgyutvK=3oTn=71GsX=TUGXvM4mMbeZj2sV+TuuQ@mail.gmail.com>
	<42F0CCB8-20DB-4E6F-85CB-CBE939F8208E@gmail.com>
	<546F5254-0C46-4BB6-93E8-80B5EC7E40D1@gmail.com>
Message-ID: <CAL0hg4EXwhZBSBE23X4auu8UW0hHW+=Vghwt8dMttfSbgriF=w@mail.gmail.com>

Hi Quanwei,
Following Michael's comment: even if you use Swiss-Prot, there are over 2700
transposases in it. If there is some undermasking, they will show up as
evidence.
Cheers,
Xabi



-- 
Xabier Vázquez-Campos, *PhD*
*Research Associate*
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA

From qwzhang0601 at gmail.com  Wed Sep 27 20:04:43 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Wed, 27 Sep 2017 22:04:43 -0400
Subject: [maker-devel] Suggestions if too many predicted genes
In-Reply-To: <CAL0hg4EXwhZBSBE23X4auu8UW0hHW+=Vghwt8dMttfSbgriF=w@mail.gmail.com>
References: <CAOW6FSLpSEfgyutvK=3oTn=71GsX=TUGXvM4mMbeZj2sV+TuuQ@mail.gmail.com>
	<42F0CCB8-20DB-4E6F-85CB-CBE939F8208E@gmail.com>
	<546F5254-0C46-4BB6-93E8-80B5EC7E40D1@gmail.com>
	<CAL0hg4EXwhZBSBE23X4auu8UW0hHW+=Vghwt8dMttfSbgriF=w@mail.gmail.com>
Message-ID: <CAOW6FSJPZBiriKh9L5knuGp_ZCSEVxw4+eftyddk+o3kFwTTCw@mail.gmail.com>

Thank you all for your comments and suggestions. Yes, even when I only use
Swissprot I still have 26.5k protein coding genes. As you mentioned one
reason may be related to repeat masking, and another one may be because of
inclusion of short scaffolds, which further lead to protein fragments.

About the repeat masking, I use the latest Repeatmaker and Repbase
(selected Mammalian), I also build species specific repeat libraries
following
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic.
About transposases I know the Maker pipe line already provided
"transposable element proteins". I do not know what else I can do.

About the short scaffolds, in fact among the 26.5k genes, only about 400
genes are predicted from scaffolds shorter than 10kb. Besides, I know there
are some very short proteins (e.g., the mouse protein RL41, a 60S ribosomal
protein, has length 25). I think short scaffolds may also include some short
proteins.
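
For anyone who wants to repeat the short-scaffold count on their own data, here is a minimal sketch in Python. It assumes a genome FASTA and a MAKER GFF3 in which gene models are rows with "gene" in the third column. The function names and the 10kb default are my own choices for illustration, not part of MAKER.

```python
def scaffold_lengths(fasta_lines):
    """Return {sequence_id: length} from an iterable of FASTA text lines."""
    lengths = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            # The ID is the first whitespace-delimited token of the header.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def genes_on_short_scaffolds(gff_lines, lengths, min_len=10_000):
    """Return (genes on scaffolds shorter than min_len, genes counted)."""
    short = total = 0
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # anything after this directive is sequence, not features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue
        if cols[0] not in lengths:
            continue  # scaffold not in the FASTA; skip rather than guess
        total += 1
        if lengths[cols[0]] < min_len:
            short += 1
    return short, total
```

Both functions accept open file handles, so they can be called as scaffold_lengths(open("genome.fasta")) and genes_on_short_scaffolds(open("annotation.gff"), lengths), where both file names are placeholders.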

Now, I plan to start from the 26.5k protein coding genes. I think the less
reliable ones will be filtered out in downstream analysis. For example,
when we construct the gene families, those fragments or falsely predicted
proteins are more likely to be excluded from gene families.

Thank you all for your suggestions.

Best
Quanwei


2017-09-27 20:32 GMT-04:00 Xabier Vázquez-Campos <xvazquezc at gmail.com>:

> Hi Quanwei,
> Following on Michael's comment, even if you use Swissprot, there are over
> 2700 transposases in it. If there is some undermasking, they will show up
> as evidence.
> Cheers,
> Xabi
>
> On 28 September 2017 at 01:34, Michael Campbell <
> michael.s.campbell1 at gmail.com> wrote:
>
>> Hi Quanwei,
>>
>> The first thing that comes to mind with too many genes is undermasked
>> repeats. You could check the Pfam domains for things like integrase, GAG
>> proteins, and other transposon related domains. I would also look a bit
>> closer at the genes with AEDs greater than 0.5. Looking at things like
>> average number of exons per transcript and average gene and transcript
>> lengths can help pick out dodgy genes. You could also do some filtering on
>> the QI values output by MAKER. It is defensible to create a "higher
>> quality" set by limiting it to genes with AEDs less than 0.5 and putting
>> some requirement on the fractions of splice sites confirmed by EST/mRNA-seq
>> alignments.
>>
>> Take care,
>> Mike
>>
>> On Sep 27, 2017, at 10:54 AM, Daniel Ence <dandence at gmail.com> wrote:
>>
>> Hi Quanwei, I think that your genome assembly probably contains many
>> contigs that are too small to contain full gene sequences. Rather than
>> 300bp, a minimum scaffold length of 5kbp or 10kbp is a more useful
>> threshold. This is mentioned in the maker_opts.ctl file with the min_contig
>> parameter: "skip genome contigs below this length (under 10kbp are often
>> useless)".
>>
>> I don't know how many genes are annotated on small (<10kbp) scaffolds and
>> contigs but excluding those contigs would probably reduce your gene count.
>> These may be fragments or duplicates of genes present on these sequences
>> that weren't assembled properly.
>>
>> Also, using predicted protein sequences from uniprot as evidence in your
>> annotation is probably not advisable since those sequences are not from
>> genes with experimental evidence. This is the TrEMBL vs Swiss-Prot issue
>> that you asked about earlier.
>>
>> Additionally requiring a minimum protein length as you asked about
>> earlier could also reduce the gene count.
>>
>> Ultimately, you may do whatever filtering you find necessary and
>> justifiable for your annotation depending on the biology of your organism
>> and the methods that generated your assembly, and your annotation.
>>
>> Hope this helps,
>> Daniel
>>
>> On Sep 27, 2017, at 10:30 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>> wrote:
>>
>> Hello:
>>
>> Thank you for all your previous comments and suggestions. We annotated a
>> new rodent species using the maker2 pipeline. The assembly is about 3.2Gb
>> with N50 24.3Mb. I included all scaffolds longer than 300bp for gene
>> annotation (about 250k scaffolds).
>>
>> For repeat masking, we also built a species-specific library. We used
>> both transcriptome and protein sequences as evidence (including 10k
>> reviewed Mammalian and 340k predicted rodent protein sequences from
>> uniprot). We predicted 28800 genes with AED<1 (the "default" gene set).
>>
>> For the 28800 predicted proteins, about 90% have an AED value less than
>> 0.5, and 74% have domains by "InterProScan". It seems the genome was well
>> annotated, but I still feel 28800 protein coding genes are too many for a
>> rodent species. Do you think this gene set is good for downstream analysis
>> (e.g., gene family expansion analysis, positive selection analysis)? Or can
>> I do further filtering to make the number of genes closer to the estimated
>> number (e.g., 22,000)?
>>
>> Thanks
>>
>> Best
>> Quanwei
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>
>
>
> --
> Xabier Vázquez-Campos, *PhD*
> *Research Associate*
> NSW Systems Biology Initiative
> School of Biotechnology and Biomolecular Sciences
> The University of New South Wales
> Sydney NSW 2052 AUSTRALIA
>

From qwzhang0601 at gmail.com  Thu Sep 28 06:05:19 2017
From: qwzhang0601 at gmail.com (Quanwei Zhang)
Date: Thu, 28 Sep 2017 08:05:19 -0400
Subject: [maker-devel] gene annotation for a better genome
Message-ID: <CAOW6FSKn__0Gcco+7V+rtv9of4hbTp=LE4J+V51LPFZ1gnm7OQ@mail.gmail.com>

Hello:

Recently, we obtained a new version of the NMR genome; the previous version
was assembled and annotated a few years ago, and its gene annotation can be
downloaded from NCBI.

Now we want to annotate the new genome using the Maker2 pipeline. I wonder how
I can make full use of the existing annotations. On the other hand, since the
previous genome was not very well assembled, some of its gene annotations may
be false positives. I hope those false positive genes in the previous
annotation won't mislead Maker2 in the current gene annotation.

Do you have any suggestions? Thanks

Best
Quanwei

From carsonhh at gmail.com  Fri Sep 29 10:36:09 2017
From: carsonhh at gmail.com (Carson Holt)
Date: Fri, 29 Sep 2017 10:36:09 -0600
Subject: [maker-devel] gene annotation for a better genome
In-Reply-To: <CAOW6FSKn__0Gcco+7V+rtv9of4hbTp=LE4J+V51LPFZ1gnm7OQ@mail.gmail.com>
References: <CAOW6FSKn__0Gcco+7V+rtv9of4hbTp=LE4J+V51LPFZ1gnm7OQ@mail.gmail.com>
Message-ID: <5AFEDD05-DF02-463F-A6EE-1619A9BB968D@gmail.com>

You can try using the est2genome=1 option to map the old models forward onto the new assembly as if they were ESTs (add a line that says est_forward=1 to the control file to maintain the old naming, and set est= to the old model transcript file). Then provide the resulting models as pred_gff for a subsequent run (i.e. a traditional MAKER run where you are annotating the new assembly with transcript and protein evidence and ab initio predictors). Don't supply the old models to est= on that run.

The idea behind doing it this way is:
1. You need to get old models onto the new assembly so coordinates will change. So by doing it this way, you will at least be able to move many models forward based on homology.
2. By providing the models to pred_gff on a subsequent MAKER run, you are just letting old models compete against new annotations. They will be rejected if they have no evidence support, or can be kept if they score better than alternate models from SNAP/Augustus. That way you have the chance to integrate old models while at the same time rejecting some old models that have no evidence overlap.
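
As a rough illustration of the two runs described above, here is a minimal Python sketch that edits a maker_opts.ctl file for each run. The helper set_ctl_options and every file name in RUN1 and RUN2 are hypothetical placeholders; only the option names (genome, est, est2genome, est_forward, pred_gff) come from this message and the control file. Extracting the mapped models from the first run's output into a GFF file is not shown.

```python
import re


def set_ctl_options(ctl_text, overrides):
    """Return ctl_text with 'key=value' lines updated from overrides.

    Trailing '#' comments are kept. Keys that do not occur in the file
    (est_forward, for example) are appended as new lines at the end.
    """
    remaining = dict(overrides)
    out = []
    for line in ctl_text.splitlines():
        m = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if m and m.group(1) in remaining:
            key = m.group(1)
            comment = " " + m.group(3) if m.group(3) else ""
            line = f"{key}={remaining.pop(key)}{comment}"
        out.append(line)
    out.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"


# Run 1: map the old transcripts onto the new assembly as if they were ESTs.
RUN1 = {
    "genome": "new_assembly.fasta",          # placeholder file name
    "est": "old_model_transcripts.fasta",    # placeholder file name
    "est2genome": "1",
    "est_forward": "1",
}

# Run 2: ordinary evidence-based run; the mapped models compete via pred_gff.
# The old transcripts are deliberately not given to est= here.
RUN2 = {
    "genome": "new_assembly.fasta",          # placeholder file name
    "est": "new_transcript_evidence.fasta",  # placeholder file name
    "pred_gff": "mapped_old_models.gff",     # placeholder file name
}
```

Calling set_ctl_options(open("maker_opts.ctl").read(), RUN1) returns the edited text for the first run, which can then be written to that run's own control file.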

--Carson


> On Sep 28, 2017, at 6:05 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
> 
> Hello:
> 
> Recently, we obtained a new version of the NMR genome; the previous version was assembled and annotated a few years ago, and its gene annotation can be downloaded from NCBI.
> 
> Now we want to annotate the new genome using the Maker2 pipeline. I wonder how I can make full use of the existing annotations. On the other hand, since the previous genome was not very well assembled, some of its gene annotations may be false positives. I hope those false positive genes in the previous annotation won't mislead Maker2 in the current gene annotation.
> 
> Do you have any suggestions? Thanks
> 
> Best
> Quanwei  


From willett4 at email.unc.edu  Fri Sep 29 11:20:46 2017
From: willett4 at email.unc.edu (Willett, Christopher S)
Date: Fri, 29 Sep 2017 17:20:46 +0000
Subject: [maker-devel] question on gene numbers with quality_filter.pl
Message-ID: <16C1890A-2042-4BE1-93CE-8A8DC0C18151@ad.unc.edu>

Hello-

We are getting to the final stages (hopefully) of a reannotation of a new assembly of a copepod genome using MAKER, and we had some questions about which set of genes to use. Our latest runs used Pfam domains to define the default vs. standard gene sets with the quality_filter.pl script, and I had a question about the stringency of the filters in this script. It appears that the default set is more stringent than the output that we get from MAKER without using this script (all with AED max set to 1). Are there additional filters in this script beyond AED that would cause this?

Here is what we are seeing, in case more details would be helpful. Our final MAKER run gives ~21500 predicted genes with keep_preds turned on and ~15200 without it. What I was wondering is why this 15200 is higher than the default set (~14500 genes) that we get after we filter the gff using the -d setting in quality_filter.pl. For completeness, the standard set (-s setting) retains ~14800 genes, and filtering the 15200-gene gff file with the default parameters yields ~14100 genes. So I was curious what else is going on in the filter script beyond AED that would trim out genes.
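
One way to see which models the filter removes is to list the transcripts present in the unfiltered GFF but missing from the filtered one, together with their AED. This is a minimal Python sketch, assuming mRNA rows carry ID and _AED attributes in the ninth column as in our MAKER output. The function names are mine and are not part of MAKER or quality_filter.pl.

```python
def mrna_aed(gff_lines):
    """Map mRNA ID -> AED (float) from an iterable of GFF3 lines."""
    aed = {}
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # sequence section follows; no more features
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "mRNA":
            continue
        # Column 9 is a ';'-separated list of key=value attributes.
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        if "ID" in attrs and "_AED" in attrs:
            aed[attrs["ID"]] = float(attrs["_AED"])
    return aed


def dropped_by_filter(unfiltered, filtered):
    """Return {ID: AED} for transcripts in unfiltered but not in filtered."""
    return {tid: a for tid, a in unfiltered.items() if tid not in filtered}
```

If every dropped transcript turns out to have an AED of 1, the difference is explained by AED alone; anything else would point to a further criterion in the script.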

The gene sets look pretty good overall and the numbers seem reasonable, so we were debating which set to use as our final set. I am also trying a few other analyses in InterProScan to see if that identifies additional genes beyond Pfam for retention, but that seems a bit independent from the question above.

Thanks for your help,

Best,

Chris Willett


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Research Associate Professor
Department of Biology
CB#3280 Coker Hall
University of North Carolina, Chapel Hill
Chapel Hill, NC, 27599-3280

Office: 2252 Genome Science Building
phone: 919-843-8663
fax: 919-962-1625


http://labs.bio.unc.edu/Willett/
