[maker-devel] Maker on Amazon EC2 Using Starcluster
Carson Holt
carsonhh at gmail.com
Thu Jan 29 11:47:11 MST 2015
I believe this may be caused by the latency of asynchronous operations on your network shared drive (there can be a lot of lag between operations when running in the cloud). To isolate it, first run your test on a single AWS instance, using the local drive as the working directory. Next, try two instances, where one is the NFS server and you run MAKER on the other instance but on the network-mounted drive. Then gradually increase the number of instances hitting the network shared drive.
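
Before scaling up, a quick sanity check of the working directory can be sketched in shell. This is only an illustration and not part of MAKER: the function name probe_dir and the probe_* directory names are made up for this example. It creates a batch of directories and counts how many fail to be created or are not visible right after mkdir returns. It only checks from the node it runs on, so running it on several nodes at once against the shared drive is what approximates the multi-instance case.

```shell
#!/bin/sh
# probe_dir DIR [COUNT]
# Hypothetical helper (not part of MAKER): create COUNT directories under DIR
# and check that each one is visible immediately after mkdir returns.
# Prints the number of failures (mkdir errors or directories not visible).
probe_dir() {
    dir=$1
    count=${2:-100}
    failures=0
    i=0
    while [ "$i" -lt "$count" ]; do
        path="$dir/probe_$$_$i"
        if ! mkdir "$path" 2>/dev/null || [ ! -d "$path" ]; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    # Remove whatever this run created.
    rm -rf "$dir"/probe_$$_*
    echo "$failures"
}

# Example (paths are assumptions; substitute your own):
#   probe_dir /tmp 1000        # local disk
#   probe_dir /mnt/data 1000   # NFS-mounted shared drive
```

A nonzero count on the shared drive but not on the local disk would point toward the storage layer rather than MAKER or MPI. A count of zero does not rule out latency problems between nodes.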
—Carson
> On Jan 27, 2015, at 2:30 PM, Jason Gallant <jgallant at msu.edu> wrote:
>
> Carson,
>
> Thanks for the input and the test script— I was successfully able to run Maker using OpenMPI on Starcluster. However, I am still receiving error messages fairly commonly— this is the error I described earlier in this thread. It seems to appear regardless of whether I use OpenMPI or MPICH2.
>
> Essentially, there seems to be an error collapsing BLAST reports. This error causes maker to stop accepting new contigs on that machine (in this case node060), and maker continues to report every contig following this error as “failed”. The other nodes seem to be working normally, but the same error can happen on other nodes as well, so the issue can compound.
>
> [1,15]<stderr>:deleted:-60 hits
> [1,15]<stderr>:collecting blastx reports
> [1,15]<stderr>:ERROR: Could not colapse BLAST reports
> [1,15]<stderr>: at /root/maker/bin/../lib/GI.pm line 2524 thread 1.
> [1,15]<stderr>: GI::combine_blast_report(FastaChunk=HASH(0x1781acd8), ARRAY(0xc1e4fa8), ARRAY(0x15ab20d0), runlog=HASH(0xb87f878)) called at /root/maker/bin/../lib/Process/MpiChunk.pm line 2760 thread 1
> [1,15]<stderr>: Process::MpiChunk::__ANON__() called at /root/maker/bin/../lib/Error.pm line 415 thread 1
> [1,15]<stderr>: eval {...} called at /root/maker/bin/../lib/Error.pm line 407 thread 1
> [1,15]<stderr>: Error::subs::try(CODE(0x198e22f8), HASH(0x9c9b65c0)) called at /root/maker/bin/../lib/Process/MpiChunk.pm line 4224 thread 1
> [1,15]<stderr>: Process::MpiChunk::_go(Process::MpiChunk=HASH(0x1b8a7cd0), "run", HASH(0x15e3e1a0), 9, 3) called at /root/maker/bin/../lib/Process/MpiChunk.pm line 341 thread 1
> [1,15]<stderr>: Process::MpiChunk::run(Process::MpiChunk=HASH(0x1b8a7cd0), 15) called at /root/maker/bin/maker line 1457 thread 1
> [1,15]<stderr>: main::node_thread("/mnt/data/paramormyrops_new_annotation/supercontigs.maker.out"...) called at /usr/local/lib/perl/5.14.2/forks.pm line 799 thread 1
> [1,15]<stderr>: eval {...} called at /usr/local/lib/perl/5.14.2/forks.pm line 799 thread 1
> [1,15]<stderr>: threads::new("threads", CODE(0x36c9a98), "/mnt/data/paramormyrops_new_annotation/supercontigs.maker.out"...) called at /root/maker/bin/maker line 917 thread 1
> [1,15]<stderr>:--> rank=15, hostname=node015
> [1,15]<stderr>:ERROR: Failed while collecting blastx reports
> [1,15]<stderr>:ERROR: Chunk failed at level:9, tier_type:3
> [1,15]<stderr>:FAILED CONTIG:Scaffold66
> [1,15]<stderr>:
> [1,15]<stderr>:ERROR: Chunk failed at level:4, tier_type:0
> [1,15]<stderr>:FAILED CONTIG:Scaffold66
> [1,15]<stderr>:
> [1,15]<stderr>:examining contents of the fasta file and run log
> [1,15]<stderr>:ERROR: could not make datastore directory
> [1,15]<stderr>:--> rank=15, hostname=node015
> [1,15]<stderr>:ERROR: Failed while examining contents of the fasta file and run log
> [1,15]<stderr>:ERROR: Chunk failed at level:0, tier_type:0
> [1,15]<stderr>:FAILED CONTIG:Scaffold483
>
> —
> Dr. Jason R. Gallant
> Assistant Professor
> Room 38 Natural Sciences
> Department of Zoology
> Michigan State University
> East Lansing, MI 48824
> jgallant at msu.edu
> office: 517-884-7756
>
>
> On Fri, Jan 23, 2015 at 3:25 PM, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>
> The complaining is because there is more than one MAKER process running and they are not connected via MPI, so the problem is OpenMPI. Try installing a small MPI script (like the one attached) and using it to test OpenMPI. Once it is configured correctly, the separate processes will communicate with each other (pay attention to the comm size and rank messages).
>
> —Carson
>
> <mpi_test>
>
>
>
>> On Jan 23, 2015, at 1:15 PM, Jason Gallant <jgallant at msu.edu <mailto:jgallant at msu.edu>> wrote:
>>
>> Hi Carson,
>>
>> Yes, I’ve tried that and still have the issue of maker complaining about multiple processes in the same directory. Other ideas?
>>
>> Best,
>> Jason
>>
>>
>>
>> On Fri, Jan 23, 2015 at 3:14 PM, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>>
>> If using OpenMPI, make sure to set LD_PRELOAD to the location of libmpi.so before even trying to install MAKER. It must also be set before running MAKER (or any program that uses OpenMPI's shared libraries), so it's best just to add it to your ~/.bash_profile. (i.e. export LD_PRELOAD=/usr/local/openmpi/lib/libmpi.so).
>>
>>
>> For OpenMPI you may also want to set OMPI_MCA_mpi_warn_on_fork=0 in your ~/.bash_profile to turn off certain nonfatal warnings. Also, if jobs hang or freeze when using mpiexec under OpenMPI, try adding the '-mca btl ^openib' flag to the mpiexec command when running MAKER.
>>
>> Example: mpiexec -mca btl ^openib -n 20 maker
>>
>> —Carson
>>
>>
>>
>>> On Jan 23, 2015, at 1:08 PM, Jason Gallant <jgallant at msu.edu <mailto:jgallant at msu.edu>> wrote:
>>>
>>> Hi Carson,
>>>
>>> Yes, STARCLUSTER enables a global storage space, which is via NFS to an EBS drive that I’ve created.
>>>
>>> I’m using the local disk space on each instance for the /tmp directory, however.
>>>
>>> It occurred to me on reading the forums that MPICH2 doesn’t scale as well as OPENMPI, however when I try to configure Maker for openmpi and run it, I get complaints from maker that multiple makers are running in the same directory?
>>>
>>> Thanks for your advice!
>>>
>>> Best,
>>> Jason
>>>
>>>
>>>
>>> On Fri, Jan 23, 2015 at 3:01 PM, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>>>
>>> MAKER needs a global storage location. You probably need to set up one of your instances to act as a shared storage server. AWS has Lustre implementations for the cloud; perhaps you can try that. Also use OpenMPI instead of MPICH2. It’s more stable.
>>>
>>> I look forward to seeing how your experiment with AWS, MPI, and MAKER works out.
>>>
>>> —Carson
>>>
>>>
>>>
>>> > On Jan 21, 2015, at 6:56 AM, Jason Gallant <jgallant at msu.edu <mailto:jgallant at msu.edu>> wrote:
>>> >
>>> > Hi Everyone,
>>> >
>>> > I’m attempting to run Maker on Amazon EC2 using MIT’s starcluster— I’ve started a 200 node cluster, and enabled MPICH2 (Starcluster by default uses OpenMPI). I plan on documenting this setup once I’ve figured out how to run things reliably.
>>> >
>>> > I’m having a persistent issue where something fails on one of the nodes, and std error is flooded with:
>>> >
>>> > examining contents of the fasta file and run log
>>> > [67] ERROR: could not make datastore directory
>>> > [67] --> rank=67, hostname=node067
>>> > [67] ERROR: Failed while examining contents of the fasta file and run log
>>> > [67] ERROR: Chunk failed at level:0, tier_type:0
>>> > [67] FAILED CONTIG:Scaffold261
>>> >
>>> > This error repeats for each “next” scaffold for some time. When I go back to find the “source” of the error in the log, the following is the first error message on that node:
>>> >
>>> > [67] #-------------------------------#
>>> > [67] deleted:-60 hits
>>> > [67] collecting blastx reports
>>> > [67] ERROR: Could not colapse BLAST reports
>>> > [67] at /root/maker/bin/../lib/GI.pm line 2524 thread 1.
>>> > [67] GI::combine_blast_report(FastaChunk=HASH(0x108e1a90), ARRAY(0x1b874938), ARRAY(0xf127ad8), runlog=HASH(0x4d54ed8)) called at /root/maker/bin/../lib/Process/MpiChunk.pm line 2760 thread 1
>>> > [67] Process::MpiChunk::__ANON__() called at /root/maker/bin/../lib/Error.pm line 415 thread 1
>>> > [67] eval {...} called at /root/maker/bin/../lib/Error.pm line 407 thread 1
>>> > [67] Error::subs::try(CODE(0x1514eb00), HASH(0x9cbeb568)) called at /root/maker/bin/../lib/Process/MpiChunk.pm line 4215 thread 1
>>> > [67] Process::MpiChunk::_go(Process::MpiChunk=HASH(0x13976308), "run", HASH(0x12e04268), 9, 3) called at /root/maker/bin/../lib/Process/MpiChunk.pm line 341 thread 1
>>> > [67] Process::MpiChunk::run(Process::MpiChunk=HASH(0x13976308), 67) called at /root/maker/bin/maker line 1457 thread 1
>>> > [67] main::node_thread("/mnt/data/paramormyrops_new_annotation/supercontigs.maker.out"...) called at /usr/local/lib/perl/5.14.2/forks.pm line 799 thread 1
>>> > [67] eval {...} called at /usr/local/lib/perl/5.14.2/forks.pm line 799 thread 1
>>> > [67] threads::new("threads", CODE(0x3dc5b38), "/mnt/data/paramormyrops_new_annotation/supercontigs.maker.out"...) called at /root/maker/bin/maker line 917 thread 1
>>> > [67] --> rank=67, hostname=node067
>>> > [67] ERROR: Failed while collecting blastx reports
>>> > [67] ERROR: Chunk failed at level:9, tier_type:3
>>> > [67] FAILED CONTIG:Scaffold66
>>> > [67]
>>> > [67] ERROR: Chunk failed at level:4, tier_type:0
>>> > [67] FAILED CONTIG:Scaffold66
>>> >
>>> >
>>> > I’ve attempted to ignore the error to see if things will proceed on the other 199 processors. When I returned to the “master” node after the evening, Maker keeps repeating the same error code over and over (same scaffold):
>>> > [67] examining contents of the fasta file and run log
>>> > [67] ERROR: could not make datastore directory
>>> > [67] --> rank=67, hostname=node067
>>> > [67] ERROR: Failed while examining contents of the fasta file and run log
>>> > [67] ERROR: Chunk failed at level:0, tier_type:0
>>> > [67] FAILED CONTIG:Scaffold1589
>>> >
>>> > I stop the job, and restart, and after only a few minutes of running, the same error is reported, this time on a new scaffold. Strangely here, the error is reported in the MPI tag of node001, but the error originates at node137:
>>> >
>>> > ERROR: Could not colapse BLAST reports
>>> > [1] at /root/maker/bin/../lib/GI.pm line 2524.
>>> > [1] GI::combine_blast_report(FastaChunk=HASH(0xf4aa9b8), ARRAY(0xf628f90), ARRAY(0x325fea78), runlog=HASH(0x133cc8e8)) called at /root/maker/bin/../lib/Process/MpiChunk.pm line 2760
>>> > [1] Process::MpiChunk::__ANON__() called at /root/maker/bin/../lib/Error.pm line 415
>>> > [1] eval {...} called at /root/maker/bin/../lib/Error.pm line 407
>>> > [1] Error::subs::try(CODE(0x352c9b8), HASH(0xdab3b690)) called at /root/maker/bin/../lib/Process/MpiChunk.pm line 4215
>>> > [1] Process::MpiChunk::_go(Process::MpiChunk=HASH(0x3545d90), "run", HASH(0x30aa710), 9, 3) called at /root/maker/bin/../lib/Process/MpiChunk.pm line 341
>>> > [1] Process::MpiChunk::run(Process::MpiChunk=HASH(0x3545d90), 137) called at /root/maker/bin/maker line 979
>>> > [1] --> rank=137, hostname=node137
>>> > [1] ERROR: Failed while collecting blastx reports
>>> > [1] ERROR: Chunk failed at level:9, tier_type:3
>>> > [1] FAILED CONTIG:Scaffold249
>>> > [1]
>>> > [1] ERROR: Chunk failed at level:4, tier_type:0
>>> > [1] FAILED CONTIG:Scaffold249
>>> > [1]
>>> > [1] examining contents of the fasta file and run log
>>> > [1] ERROR: could not make datastore directory
>>> > [1] --> rank=1, hostname=node001
>>> > [1] ERROR: Failed while examining contents of the fasta file and run log
>>> > [1] ERROR: Chunk failed at level:0, tier_type:0
>>> > [1] FAILED CONTIG:Scaffold249
>>> > [1]
>>> > [the six-line block above repeats 18 more times, each for Scaffold249 on node001]
>>> >
>>> > I’d appreciate any guidance as how best to diagnose this error!
>>> >
>>> > Many thanks,
>>> > Jason Gallant
>>> >
>>> >
>>> >
>>> >
>>>
>>>
>>>
>>
>>
>
>