[maker-devel] (no subject)
Carson Holt
carsonhh at gmail.com
Thu Sep 1 10:03:21 MDT 2016
MAKER uses locks to divide up work between simultaneously running jobs, so submitting five 200-CPU jobs will give you the same throughput and will be more stable. The jobs will probably move through the queue faster as well.
—Carson
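
As a rough illustration of the "several smaller jobs" approach, the sketch below submits five 200-task jobs through SLURM (the scheduler named in the error log further down). The batch script name maker_job.sh is hypothetical and not from this thread; it is assumed to run something like "mpiexec -n 200 maker" from the same MAKER working directory. By default the function only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: submit several smaller MAKER jobs instead of one large one.
# Assumptions (not from the thread): a SLURM batch script called
# maker_job.sh exists and launches MAKER under mpiexec.
JOBS=5                    # number of jobs to submit
TASKS=200                 # MPI processes per job
SCRIPT=maker_job.sh       # hypothetical batch script name
DRY_RUN=${DRY_RUN:-1}     # 1 = print the commands, 0 = really submit

submit_jobs() {
    i=1
    while [ "$i" -le "$JOBS" ]; do
        if [ "$DRY_RUN" = 1 ]; then
            # Show what would be submitted without touching the queue.
            echo "sbatch --ntasks=$TASKS $SCRIPT"
        else
            sbatch --ntasks="$TASKS" "$SCRIPT"
        fi
        i=$((i + 1))
    done
}
```

Calling submit_jobs with DRY_RUN=0 would queue the jobs for real; each one then picks up whatever work the others have not locked.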
> On Sep 1, 2016, at 10:00 AM, Mark Ebbert <me.mark at gmail.com> wrote:
>
>
> Bummer. It worked at 720 the first time. Thanks again!
>
> Mark T. W. Ebbert
>
> On Thu, Sep 01, 2016 at 9:57 AM Carson Holt <> wrote:
> -n 1000 is probably too high for MPICH3. Its communication manager is not that robust. You can go that high with OpenMPI or MVAPICH2, but I’ve found that MPICH3 tops out at 100-200. Just submit multiple jobs at the lower count.
>
> —Carson
>
>
>
>> On Sep 1, 2016, at 8:47 AM, Mark Ebbert <me.mark at gmail.com> wrote:
>>
>>
>> Thanks Carson! The help message only printed once, so everything seemed fine. I deleted all of the lock files with the following command: “find . -name *.NFSLock* -exec rm {} \;”
>>
>> I restarted the job and got the following segfault:
>>
>> “Module mpi/mpich-3.1.4_intel-15.0.3 requires compiler_intel/15.0.3. Loading it now.
>> Module compiler_intel/15.0.3 requires mkl/11.2.0. Loading it now.
>> mpdboot_m7-1-2 (handle_mpd_output 1000): from mpd on m7-1-2, invalid port info:
>> mpd_uncaught_except_tb handling:
>> <type 'exceptions.IndexError'>: list index out of range
>> /apps/intel_parallel_studio_xe/2015_update3/mpirt/bin/intel64/mpd.py 264 pin_Uni_num
>> if list.index(list[i]) == i:
>> /apps/intel_parallel_studio_xe/2015_update3/mpirt/bin/intel64/mpd.py 1449 pin_Cpuinfo
>> info['cache1'] = pin_Uni_num(info['cache1_id'], info['lcpu'])
>> /apps/intel_parallel_studio_xe/2015_update3/mpirt/bin/intel64/mpd.py 1658 run
>> self.CpuInfo = pin_Cpuinfo(self.PinCase,self.Arch)
>> /apps/intel_parallel_studio_xe/2015_update3/mpirt/bin/intel64/mpd.py 3676 <module>
>> mpd.run()
>> /var/spool/slurmd/job11326444/slurm_script: line 27: 29365 Segmentation fault mpiexec -n 1000 maker”
>>
>> Any ideas?
>>
>> Mark T. W. Ebbert
>>
>> On Tue, Aug 30, 2016 at 10:54 AM Carson Holt <> wrote:
>> Run 'maker -help' with mpiexec.
>>
>> Example:
>> mpiexec -n 10 maker -help
>>
>> If the MPI communication ring is working correctly, it will print the help message only once (from the root process). If it is not working, it will print the help message 10 times, because each of the 10 MPI processes will think it is the root process. It is a simple test that shows whether or not the problem is an MPI issue.
>>
>> If it is not an MPI issue, you can just search for the NFSLock files using find and delete them.
>>
>> —Carson
>>
>>
>>> On Aug 30, 2016, at 10:10 AM, Mark Ebbert <me.mark at gmail.com> wrote:
>>>
>>>
>>> Good day everyone!
>>>
>>> I’m getting the error stating: “WARNING: Multiple MAKER processes have been started in the same directory.” Everything I’ve seen mentions version issues with MPICH. The difference in my situation is that my initial run ran just fine, but died because of the cluster time constraints. We’re only allowed 3 days.
>>>
>>> There are a bunch of .NFSLock files in the output directory. I’m guessing MAKER wasn’t able to clear the locks when the jobs died? Can I safely delete those lock files? What’s the best way to handle this going forward since I can only run jobs for 3 days at a time?
>>>
>>> Thanks!
>>>
>>> Mark T. W. Ebbert
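
The MPI sanity check from the 30 August reply (run 'maker -help' under mpiexec and see whether the help text appears once or once per process) could be scripted roughly as below. This is a sketch under assumptions: the function name ring_ok and the file name help.out are made up for illustration, the help text is assumed to start with a line that does not repeat later in the same message, and grep is assumed to support -m (GNU and BSD grep do).

```shell
#!/bin/sh
# Sketch only: decide whether captured 'maker -help' output was printed once.
# ring_ok FILE succeeds if the first non-empty line of FILE occurs exactly
# once in FILE, i.e. only one process behaved as the root process.
ring_ok() {
    first=$(grep -m 1 . "$1")               # first non-empty line
    [ "$(grep -c -F -x -- "$first" "$1")" -eq 1 ]
}

# Intended use (commented out so nothing is launched by accident):
#   mpiexec -n 10 maker -help > help.out 2>&1
#   ring_ok help.out && echo "help printed once; MPI ring looks fine"
```

If the check fails, the help text was printed more than once, which points at the MPI setup and not at leftover lock files.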
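
For the stale-lock cleanup discussed in the thread, a slightly more defensive variant of the find command is sketched below. The pattern is quoted so the shell passes it to find unexpanded, and each path is printed before it is removed. The function name clean_locks is hypothetical, and the sketch assumes no MAKER process is still running against the directory.

```shell
#!/bin/sh
# Sketch only: remove leftover MAKER lock files after a job was killed.
# Assumption: nothing is still running in the target directory.
clean_locks() {
    dir=${1:-.}    # directory to clean, defaults to the current one
    # Quoting '*.NFSLock*' keeps the shell from expanding the glob itself;
    # -print lists each match before rm removes it.
    find "$dir" -name '*.NFSLock*' -print -exec rm -f -- {} +
}
```

Running "clean_locks /path/to/output" (the path here is a placeholder) lists and deletes the matching files; dropping the -exec part first gives a preview of what would be removed.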