[maker-devel] (no subject)

Carson Holt carsonhh at gmail.com
Thu Sep 1 09:57:58 MDT 2016


-n 1000 is probably too high for MPICH3. Its communication manager is not that robust. You can go that high with OpenMPI or MVAPICH2, but I’ve found that MPICH3 tops out at around 100-200 processes. Just submit multiple jobs at the lower count.
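
As a rough sketch of the "multiple jobs at the lower count" idea, the helper below only prints the submission commands so they can be checked before anything is queued. The name plan_jobs, the job size of 100, and the use of sbatch --wrap are assumptions for illustration; a real job would more likely reuse the existing batch script with its module loads.

```shell
#!/bin/sh
# plan_jobs TOTAL PER_JOB (hypothetical helper)
# Prints one sbatch command per job so that the jobs together cover
# TOTAL processes with at most PER_JOB processes each. Dry run only:
# nothing is submitted.
plan_jobs() {
    total=$1
    per_job=$2
    remaining=$total
    while [ "$remaining" -gt 0 ]; do
        n=$per_job
        if [ "$remaining" -lt "$per_job" ]; then
            n=$remaining
        fi
        echo "sbatch --ntasks=$n --wrap='mpiexec -n $n maker'"
        remaining=$((remaining - n))
    done
}

# Example: 1000 processes split into jobs of 100.
plan_jobs 1000 100
```

Review the printed commands first; piping them to sh would actually submit them.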

—Carson



> On Sep 1, 2016, at 8:47 AM, Mark Ebbert <me.mark at gmail.com> wrote:
> 
> 
> Thanks Carson! The help message only printed once, so everything seemed fine. I deleted all of the lock files with the following command: “find . -name *.NFSLock* -exec rm {} \;”
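
One caveat with the find command as typed above: an unquoted *.NFSLock* can be expanded by the shell before find sees it if a matching file sits in the current directory. A minimal sketch of a quoted variant follows. The function name clear_nfs_locks is hypothetical, and restricting the match to regular files assumes the lock files are plain files.

```shell
#!/bin/sh
# clear_nfs_locks DIR (hypothetical helper)
# Prints and removes regular files whose names contain ".NFSLock"
# anywhere below DIR (default: current directory). The pattern is quoted
# so that find, not the shell, expands the wildcard.
# Assumption: the lock files are regular files, hence -type f.
clear_nfs_locks() {
    dir=${1:-.}
    find "$dir" -type f -name '*.NFSLock*' -print -exec rm -f {} +
}

# Usage (not run here): clear_nfs_locks /path/to/maker/output
```

This should only be run when no MAKER process is still working in that directory.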
> 
> I restarted the job and got the following segfault:
> 
> “Module mpi/mpich-3.1.4_intel-15.0.3 requires compiler_intel/15.0.3. Loading it now.
> Module compiler_intel/15.0.3 requires mkl/11.2.0. Loading it now.
> mpdboot_m7-1-2 (handle_mpd_output 1000): from mpd on m7-1-2, invalid port info:
> mpd_uncaught_except_tb handling:
> <type 'exceptions.IndexError'>: list index out of range
> /apps/intel_parallel_studio_xe/2015_update3/mpirt/bin/intel64/mpd.py  264  pin_Uni_num
> if list.index(list[i]) == i:
> /apps/intel_parallel_studio_xe/2015_update3/mpirt/bin/intel64/mpd.py  1449  pin_Cpuinfo
> info['cache1'] = pin_Uni_num(info['cache1_id'], info['lcpu'])
> /apps/intel_parallel_studio_xe/2015_update3/mpirt/bin/intel64/mpd.py  1658  run
> self.CpuInfo = pin_Cpuinfo(self.PinCase,self.Arch)
> /apps/intel_parallel_studio_xe/2015_update3/mpirt/bin/intel64/mpd.py  3676  <module>
> mpd.run()
> /var/spool/slurmd/job11326444/slurm_script: line 27: 29365 Segmentation fault      mpiexec -n 1000 maker”
> 
> Any ideas?
> 
> Mark T. W. Ebbert
> 
> On Tue, Aug 30, 2016 at 10:54 AM Carson Holt wrote:
> Run 'maker -help' with mpiexec.
> 
> Example:
> mpiexec -n 10 maker -help
> 
> If the MPI communication ring is working correctly, then it will print the help message only once (from the root process). If it is not working, it will print the help message 10 times, because each of the 10 MPI processes will think it is the root process. It is a simple test that can identify whether it is an MPI issue or not.
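
The once-versus-ten-times check above can also be done by counting instead of reading the output by eye. This is only a sketch: check_ring is a hypothetical helper, and the marker string passed to it (shown as "Usage:" in the usage line) is an assumption, so pick any text that appears exactly once in a plain 'maker -help' run.

```shell
#!/bin/sh
# check_ring MARKER (hypothetical helper)
# Reads captured `mpiexec ... maker -help` output on stdin and counts the
# lines containing MARKER, a string assumed to occur exactly once in a
# single copy of the help message.
check_ring() {
    marker=$1
    count=$(grep -c -F -e "$marker")
    if [ "$count" -eq 1 ]; then
        echo "ok: help message printed once"
    else
        echo "suspect: marker seen $count times"
    fi
}

# Usage (not run here):
#   mpiexec -n 10 maker -help 2>&1 | check_ring "Usage:"
```

A count of 0 most likely means the marker string does not match the help text, not that MPI is broken.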
> 
> If it is not an MPI issue, you can just search for the NFSLock files using find and delete them.
> 
> —Carson
> 
> 
>> On Aug 30, 2016, at 10:10 AM, Mark Ebbert <me.mark at gmail.com> wrote:
>> 
>> 
>> Good day everyone!
>> 
>> I’m getting the error stating: “WARNING: Multiple MAKER processes have been started in the same directory.” Everything I’ve seen mentions version issues with MPICH. The difference in my situation is that my initial run ran just fine, but died because of the cluster time constraints. We’re only allowed 3 days. 
>> 
>> There are a bunch of .NFSLock files in the output directory. I’m guessing Maker wasn’t able to clear the locks when the jobs died? Can I safely delete those lock files? What’s the best way to handle this going forward since I can only run jobs for 3 days at a time?
>> 
>> Thanks!
>> 
>> Mark T. W. Ebbert
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org