[maker-devel] (no subject)
Mark Ebbert
me.mark at gmail.com
Wed Sep 14 12:11:46 MDT 2016
Hey Carson!
I’m getting a new issue. I think I need to recompile MAKER with MPICH instead of OpenMPI. I get the following errors when I try to run “mpiexec -n 10 maker -help”. I tried running “./Build clean” followed by “./Build install” after updating LD_PRELOAD with the path to MPICH, but I’m still getting the error. I was also trying to access the MAKER documentation at
http://weatherby.genetics.utah.edu/MAKER/wiki/index.php
to review detailed installation instructions (I think it’s there), but the website is down.
I appreciate your help.
“Fatal error in PMPI_Comm_rank: Invalid communicator, error stack:
PMPI_Comm_rank(110): MPI_Comm_rank(comm=0xa0a5d620, rank=0x7ffd20bb8d9c) failed
PMPI_Comm_rank(68).: Invalid communicator
Fatal error in PMPI_Comm_rank: Invalid communicator, error stack:
PMPI_Comm_rank(110): MPI_Comm_rank(comm=0x51f83620, rank=0x7ffc6023b7fc) failed
PMPI_Comm_rank(68).: Invalid communicator
Fatal error in PMPI_Comm_rank: Invalid communicator, error stack:
PMPI_Comm_rank(110): MPI_Comm_rank(comm=0x8b342620, rank=0x7ffde14f02fc) failed
PMPI_Comm_rank(68).: Invalid communicator
Fatal error in PMPI_Comm_rank: Invalid communicator, error stack:
PMPI_Comm_rank(110): MPI_Comm_rank(comm=0xf8f24620, rank=0x7ffe71c9a5bc) failed
PMPI_Comm_rank(68).: Invalid communicator
Fatal error in PMPI_Comm_rank: Invalid communicator, error stack:
PMPI_Comm_rank(110): MPI_Comm_rank(comm=0x8c074620, rank=0x7ffc70e50b6c) failed
PMPI_Comm_rank(68).: Invalid communicator
Fatal error in PMPI_Comm_rank: Invalid communicator, error stack:
PMPI_Comm_rank(110): MPI_Comm_rank(comm=0xdac15620, rank=0x7ffc67bf0e2c) failed
PMPI_Comm_rank(68).: Invalid communicator
Fatal error in PMPI_Comm_rank: Invalid communicator, error stack:
PMPI_Comm_rank(110): MPI_Comm_rank(comm=0xbb65620, rank=0x7ffc17a1d1bc) failed
PMPI_Comm_rank(68).: Invalid communicator
Fatal error in PMPI_Comm_rank: Invalid communicator, error stack:
PMPI_Comm_rank(110): MPI_Comm_rank(comm=0x2aa3b620, rank=0x7fff551201dc) failed
PMPI_Comm_rank(68).: Invalid communicator
Fatal error in PMPI_Comm_rank: Invalid communicator, error stack:
PMPI_Comm_rank(110): MPI_Comm_rank(comm=0xd2453620, rank=0x7fffaebe21cc) failed
PMPI_Comm_rank(68).: Invalid communicator
Fatal error in PMPI_Comm_rank: Invalid communicator, error stack:
PMPI_Comm_rank(110): MPI_Comm_rank(comm=0xb24e8620, rank=0x7ffdd838bbfc) failed
PMPI_Comm_rank(68).: Invalid communicator
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 2462 RUNNING AT m7int02
= EXIT CODE: 1
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
==================================================================================="
Mark T. W. Ebbert
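"Invalid communicator" failures like the ones above often appear when a program built against one MPI implementation is launched by another implementation's mpiexec, which matches the suspicion raised in this message. That cause is not confirmed anywhere in the thread. As a first check, it can help to see which MPI family the mpiexec on the PATH belongs to. The sketch below is only a rough classifier: the function name classify_mpi is made up for illustration, and the banner strings it looks for are assumptions based on typical `mpiexec --version` output, so they may not match every release.

```shell
# classify_mpi: read the text of `mpiexec --version` on stdin and print a
# best-guess family name. The matched banner strings are assumptions.
classify_mpi() {
    text=$(cat)
    case $text in
        *"Intel(R) MPI"*)        echo intel-mpi ;;
        *"Open MPI"*|*OpenRTE*)  echo openmpi ;;
        *HYDRA*|*MPICH*)         echo mpich ;;
        *)                       echo unknown ;;
    esac
}

# Typical use (not run here, since it needs an MPI installation):
#   command -v mpiexec
#   mpiexec --version 2>&1 | classify_mpi
```

If the family reported here differs from the one MAKER was compiled against, a clean rebuild against the matching MPI is the likely next step.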
On Thu, Sep 01, 2016 at 10:03 AM Carson Holt <carsonhh at gmail.com> wrote:
MAKER will use locks to divide up work between simultaneously running jobs, so submitting five 200-CPU jobs will give you the same throughput and will be more stable. The jobs will probably move through the queue faster as well.
—Carson
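To make the arithmetic concrete: 1000 requested processes at 200 per job comes to five submissions, all started from the same MAKER directory so the locks can divide the work. The sketch below only prints the commands it would run rather than submitting anything. It assumes a Slurm cluster (the thread's error log mentions a slurm_script), and the job script name run_maker.sh is hypothetical.

```shell
# print_submit_commands TOTAL PER_JOB SCRIPT
# Print one sbatch command per job needed to cover TOTAL processes when each
# job runs PER_JOB processes. Dry run only: nothing is submitted.
print_submit_commands() {
    total=$1
    per_job=$2
    script=$3
    # Round up so a remainder still gets its own job.
    njobs=$(( (total + per_job - 1) / per_job ))
    i=1
    while [ "$i" -le "$njobs" ]; do
        echo "sbatch --ntasks=$per_job $script"
        i=$((i + 1))
    done
}

# Example: print_submit_commands 1000 200 run_maker.sh
```

Here run_maker.sh would be expected to contain something like "mpiexec -n 200 maker", launched from the directory that holds the MAKER control files.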
On Sep 1, 2016, at 10:00 AM, Mark Ebbert <me.mark at gmail.com> wrote:
Bummer. It worked at 720 the first time. Thanks again!
Mark T. W. Ebbert
On Thu, Sep 01, 2016 at 9:57 AM Carson Holt wrote:
-n 1000 is probably too high for MPICH3. Its communication manager is not that robust. You can go that high with OpenMPI or MVAPICH2, but I’ve found that MPICH3 tops out at 100-200. Just submit multiple jobs at the lower count.
—Carson
On Sep 1, 2016, at 8:47 AM, Mark Ebbert <me.mark at gmail.com> wrote:
Thanks Carson! The help message only printed once, so everything seemed fine. I deleted all of the lock files with the following command: “find . -name *.NFSLock* -exec rm {} \;”
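One caveat about the command above: because the pattern is not quoted, the shell may expand *.NFSLock* against the current directory before find ever sees it, and the search can then match the wrong names or fail outright. A slightly more defensive sketch quotes the pattern and separates listing from deleting. The function names here are made up for illustration, and it is probably wise to run the delete step only when no MAKER job is still active in that directory.

```shell
# list_lock_files DIR: print every entry under DIR whose name contains ".NFSLock".
list_lock_files() {
    find "$1" -name '*.NFSLock*' -print
}

# remove_lock_files DIR: delete the same set of entries.
# The quoted pattern is passed to find unchanged, so the shell cannot expand it.
remove_lock_files() {
    find "$1" -name '*.NFSLock*' -exec rm -f -- {} +
}

# Typical use: review the list first, then delete.
#   list_lock_files .
#   remove_lock_files .
```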
I restarted the job and got the following segfault:
“Module mpi/mpich-3.1.4_intel-15.0.3 requires compiler_intel/15.0.3. Loading it now.
Module compiler_intel/15.0.3 requires mkl/11.2.0. Loading it now.
mpdboot_m7-1-2 (handle_mpd_output 1000): from mpd on m7-1-2, invalid port info:
mpd_uncaught_except_tb handling:
<type 'exceptions.IndexError'>: list index out of range
/apps/intel_parallel_studio_xe/2015_update3/mpirt/bin/intel64/mpd.py 264 pin_Uni_num
if list.index(list[i]) == i:
/apps/intel_parallel_studio_xe/2015_update3/mpirt/bin/intel64/mpd.py 1449 pin_Cpuinfo
info['cache1'] = pin_Uni_num(info['cache1_id'], info['lcpu'])
/apps/intel_parallel_studio_xe/2015_update3/mpirt/bin/intel64/mpd.py 1658 run
self.CpuInfo = pin_Cpuinfo(self.PinCase,self.Arch)
/apps/intel_parallel_studio_xe/2015_update3/mpirt/bin/intel64/mpd.py 3676 <module>
mpd.run()
/var/spool/slurmd/job11326444/slurm_script: line 27: 29365 Segmentation fault mpiexec -n 1000 maker”
Any ideas?
Mark T. W. Ebbert
On Tue, Aug 30, 2016 at 10:54 AM Carson Holt wrote:
Run 'maker -help' with mpiexec.
Example:
mpiexec -n 10 maker -help
If the MPI communication ring is working correctly, then it will print the help message only once (from the root process). If it is not working, it will print the help message 10 times, because each of the 10 MPI processes will think it is the root process. It is a simple test that can identify whether it is an MPI issue or not.
If it is not an MPI issue, you can just search for the NFSLock files using find and delete them.
—Carson
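The check described above can also be scripted rather than judged by eye. The sketch below is one possible way to do it, not part of MAKER: it takes the first non-blank line of a single-process "maker -help" as a marker and counts how many times that exact line appears in the mpiexec output. The function names are made up for illustration, and the approach assumes that output lines from different MPI processes do not get interleaved mid-line.

```shell
# first_nonblank_line: print the first line on stdin that is not empty or blank.
first_nonblank_line() {
    awk 'NF { print; exit }'
}

# count_copies MARKER: count lines on stdin that equal MARKER exactly.
# grep exits non-zero when the count is 0, so "|| true" keeps the status clean.
count_copies() {
    grep -c -x -F -- "$1" || true
}

# Typical use (not run here, since it needs maker and an MPI installation):
#   marker=$(maker -help 2>&1 | first_nonblank_line)
#   copies=$(mpiexec -n 10 maker -help 2>&1 | count_copies "$marker")
#   echo "help message printed $copies time(s)"
```

A count of 1 suggests the processes agree on a single root; a count equal to the number of processes points to an MPI problem.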
On Aug 30, 2016, at 10:10 AM, Mark Ebbert <me.mark at gmail.com> wrote:
Good day everyone!
I’m getting the error stating: “WARNING: Multiple MAKER processes have been started in the same directory.” Everything I’ve seen mentions version issues with MPICH. The difference in my situation is that my initial run ran just fine, but died because of the cluster time constraints. We’re only allowed 3 days.
There are a bunch of .NFSLock files in the output directory. I’m guessing Maker wasn’t able to clear the locks when the jobs died? Can I safely delete those lock files? What’s the best way to handle this going forward since I can only run jobs for 3 days at a time?
Thanks!
Mark T. W. Ebbert