[maker-devel] maker MPI problem

Carson Holt carsonhh at gmail.com
Thu Aug 17 09:36:20 MDT 2017


This is the root-cause error —> can't open /lscratch/47455932/mpiavG_z

It kills one process and causes everything else to die in an ugly way.

There are several possible causes:

1.  /lscratch/47455932/ is not actually locally mounted. It may be a virtual directory created at run time that exists on the network but not as a true locally mounted disk. If this is the case, a slight IO delay under heavy IO load (common on NFS) can cause directories and files to appear not to exist. This is one of the reasons TMP= must be set to a true locally mounted disk. The IO load MAKER can produce can swamp network-mounted disks, creating strange errors.

2.  /lscratch/47455932/ may only exist on the head node and not on the other nodes of the job. True local temporary storage is not visible across nodes; it is only available on the node it is attached to. So if you are creating the location as part of your job, it may only exist on the head node and not on the other nodes. Usually this value is set to /tmp because each machine should have its own independent /tmp location.

3. /lscratch/47455932/ exists on all nodes, but is full on one of them. (A quick way to check all three possibilities across the job's nodes is sketched below.)
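
If useful, here is a minimal sketch of a check. It assumes the job runs under Slurm (so SLURM_JOB_ID is set in the job environment) and that srun can launch one task per allocated node:

#df fails where the path does not exist, and otherwise shows the filesystem
#type (local ext4/xfs vs. nfs) and how full it is on each node
srun --ntasks-per-node=1 bash -c 'echo "$(hostname): $(df -hT /lscratch/$SLURM_JOB_ID | tail -1)"'

If the path turns out to be a true local disk on every node, TMP= can point at it; otherwise fall back to a per-node location such as /tmp.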

—Carson

> On Aug 17, 2017, at 7:39 AM, zl c <chzelin at gmail.com> wrote:
> 
> I use '--mca btl ^openib' and it runs on multiple nodes. It works, and I see that some sequences are done for the test run. 
> 
> Then I made another run using the large nr database and local scratch space on the computer cluster, which fails. 
> Submit CMD:
> sbatch --gres=lscratch:100 --time=168:00:00 --partition=multinode --constraint=x2680 --mem-per-cpu=64g --ntasks=8 --ntasks-per-core=1 --job-name run05.mpi -o log.mpi.00/run05.mpi.o%A run05.mpi.sh
> Error message:
> #--------- command -------------#
> Widget::tblastx:
> /usr/local/apps/blast/ncbi-blast-2.5.0+/bin/tblastx -db /lscratch/47455932/maker_BLLXNq/rna%2Efasta.mpi.10.21 -query /lscratch/47455932/maker_BLLXNq/50/tig00017383_arrow.0 -num_alignments 10000 -num_descriptions 10000 -evalue 1e-10 -dbsize 1000 -searchsp 500000000 -num_threads 1 -lcase_masking -seg yes -soft_masking true -show_gis -out /gpfs/gsfs6/users/chenz11/goldfish/11549472/sergey_canu70x/arrow/maker5/goldfish.arrow.renamed.maker.output/goldfish.arrow.renamed_datastore/5A/65/tig00017383_arrow//theVoid.tig00017383_arrow/0/tig00017383_arrow.0.rna%2Efasta.tblastx.temp_dir/rna%2Efasta.mpi.10.21.tblastx
> #-------------------------------#
> Thread 1 terminated abnormally: can't open /lscratch/47455932/mpiavG_z: No such file or directory at /home/chenz11/program/maker_mpi/bin/maker line 1460 thread 1.
> --> rank=37, hostname=cn4120
> FATAL: Thread terminated, causing all processes to fail
> --> rank=37, hostname=cn4120
> -------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> SIGTERM received
> SIGTERM received
> SIGTERM received
> running  blast search.
> #--------- command -------------#
> Widget::tblastx:
> /usr/local/apps/blast/ncbi-blast-2.5.0+/bin/tblastx -db /lscratch/47455932/maker_Zsg_Gg/rna%2Efasta.mpi.10.8 -query /lscratch/47455932/maker_Zsg_Gg/62/tig00001111_arrow.0 -num_alignments 10000 -num_descriptions 10000 -evalue 1e-10 -dbsize 1000 -searchsp 500000000 -num_threads 1 -lcase_masking -seg yes -soft_masking true -show_gis -out /gpfs/gsfs6/users/chenz11/goldfish/11549472/sergey_canu70x/arrow/maker5/goldfish.arrow.renamed.maker.output/goldfish.arrow.renamed_datastore/B8/A5/tig00001111_arrow//theVoid.tig00001111_arrow/0/tig00001111_arrow.0.rna%2Efasta.tblastx.temp_dir/rna%2Efasta.mpi.10.8.tblastx
> #-------------------------------#
> Perl exited with active threads:
>     1 running and unjoined
>     0 finished and unjoined
>     0 running and detached
> Perl exited with active threads:
>     1 running and unjoined
>     0 finished and unjoined
>     0 running and detached
> --------------------------------------------------------------------------
> An MPI communication peer process has unexpectedly disconnected.  This
> usually indicates a failure in the peer process (e.g., a crash or
> otherwise exiting without calling MPI_FINALIZE first).
> 
> Although this local MPI process will likely now behave unpredictably
> (it may even hang or crash), the root cause of this problem is the
> failure of the peer -- that is what you need to investigate.  For
> example, there may be a core file that you can examine.  More
> generally: such peer hangups are frequently caused by application bugs
> or other external events.
> 
>   Local host: cn4130
>   Local PID:  18831
>   Peer host:  cn3683
> --------------------------------------------------------------------------
> SIGTERM received
> SIGTERM received
> SIGTERM received
> SIGTERM received
> SIGTERM received
> SIGTERM received
> SIGTERM received
> SIGTERM received
> SIGTERM received
> SIGTERM received
> SIGTERM received
> SIGTERM received
> SIGTERM received
> SIGTERM received
> SIGTERM received
> SIGTERM received
> SIGTERM received
> SIGTERM received
> SIGTERM received
> formating database...
> #--------- command -------------#
> Widget::formater:
> /usr/local/apps/blast/ncbi-blast-2.5.0+/bin/makeblastdb -dbtype prot -in /lscratch/47455932/maker_rNzO3X/27/blastprep/protein2%2Efasta.mpi.10.25
> #-------------------------------#
> SIGTERM received
> SIGTERM received
> SIGTERM received
> SIGTERM received
> ...
> SIGTERM received
> SIGTERM received
> SIGTERM received
> Perl exited with active threads:
>     1 running and unjoined
>     0 finished and unjoined
>     0 running and detached
> Perl exited with active threads:
>     1 running and unjoined
>     0 finished and unjoined
>     0 running and detached
> Perl exited with active threads:
>     1 running and unjoined
>     0 finished and unjoined
>     0 running and detached
> 
> ...
> [cn3683:36010] 59 more processes have sent help message help-mpi-btl-tcp.txt / peer hung up
> [cn3683:36010] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> --------------------------------------------------------------------------
> mpiexec detected that one or more processes exited with non-zero status, thus causing
> the job to be terminated. The first process to do so was:
> 
>   Process name: [[352,1],37]
>   Exit code:    255
> --------------------------------------------------------------------------
>  
> 
> I rebuilt the mpi_blast and reran it, and again got the error:
> #--------- command -------------#
> Widget::formater:
> /usr/local/apps/blast/ncbi-blast-2.5.0+/bin/makeblastdb -dbtype nucl -in /lscratch/47559740/maker_k6a7Hy/32/blastprep/rna%2Efasta.mpi.10.3
> #-------------------------------#
> Thread 1 terminated abnormally: can't open /lscratch/47559740/mpiS84Ju: No such file or directory at /home/chenz11/program/maker_mpi/bin/maker line 1460 thread 1.
> --> rank=27, hostname=cn4115
> FATAL: Thread terminated, causing all processes to fail
> --> rank=27, hostname=cn4115
> deleted:276 hits
> doing tblastx of alt-ESTs
> formating database...
> #--------- command -------------#
> Widget::formater:
> /usr/local/apps/blast/ncbi-blast-2.5.0+/bin/makeblastdb -dbtype nucl -in /lscratch/47559740/maker_nCKTgE/2/blastprep/rna%2Efasta.mpi.10.11
> #-------------------------------#
> running  blast search.
> #--------- command -------------#
> Widget::tblastx:
> /usr/local/apps/blast/ncbi-blast-2.5.0+/bin/tblastx -db /lscratch/47559740/maker_0kWZTA/rna%2Efasta.mpi.10.20 -query /lscratch/47559740/maker_0kWZTA/35/tig00027947_arrow.0 -num_alignments 10000 -num_descriptions 10000 -evalue 1e-10 -dbsize 1000 -searchsp 500000000 -num_threads 1 -lcase_masking -seg yes -soft_masking true -show_gis -out /gpfs/gsfs6/users/chenz11/goldfish/11549472/sergey_canu70x/arrow/maker5/goldfish.arrow.renamed.maker.output/goldfish.arrow.renamed_datastore/86/7F/tig00027947_arrow//theVoid.tig00027947_arrow/0/tig00027947_arrow.0.rna%2Efasta.tblastx.temp_dir/rna%2Efasta.mpi.10.20.tblastx
> #-------------------------------#
> -------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> SIGTERM received
> SIGTERM received
> SIGTERM received
> Perl exited with active threads:
>     1 running and unjoined
>     0 finished and unjoined
>     0 running and detached
> Perl exited with active threads:
>     1 running and unjoined
>     0 finished and unjoined
>     0 running and detached
> 
> Thanks,
> Zelin
> 
> --------------------------------------------
> Zelin Chen [chzelin at gmail.com <mailto:chzelin at gmail.com>]
> 
> NIH/NHGRI
> Building 50, Room 5531
> 50 SOUTH DR, MSC 8004 
> BETHESDA, MD 20892-8004
> 
> On Tue, Aug 15, 2017 at 5:13 PM, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
> Some notes:
> 
> First, the mpiexec command still needs the --mca parameters (either '--mca btl ^openib' or '--mca btl vader,tcp,self --mca btl_tcp_if_include ib0'). Otherwise, if you have InfiniBand on the nodes, it will try to use OpenFabrics-compatible libraries, which will kill code that makes system calls (as MAKER does).
> 
> Second, try using a higher count than 2 in your batch. MAKER always dedicates one process solely to message management among the others, so with -n 2 you have one process working and one managing data, and only one contig will run at a time. If you set it to a higher number, the issue will go away. The message manager process starts to get saturated at ~200 CPUs, so anything above that processor count becomes less beneficial to the job. (An illustrative submission is sketched below.)
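> 
> For example, something along these lines (a sketch only; the resource requests are placeholders to adjust for your cluster):
> 
> #e.g. --ntasks=8 gives 7 worker processes plus 1 message-manager process
> sbatch --gres=lscratch:100 --ntasks=8 --ntasks-per-core=1 --mem-per-cpu=8g run06.maker.mpi.sh
> #and inside the script, keep the --mca flag:
> mpiexec --mca btl ^openib -n $SLURM_NTASKS maker -c 1 -base genome -g genome.fasta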
> 
> Thanks,
> Carson
> 
> 
> 
> 
>> On Aug 15, 2017, at 3:05 PM, zl c <chzelin at gmail.com <mailto:chzelin at gmail.com>> wrote:
>> 
>> I submit a job:
>> sbatch --gres=lscratch:100 --time=8:00:00 --mem-per-cpu=8g -N 1-1 --ntasks=2 --ntasks-per-core=1 --job-name run06.mpi -o log/run06.mpi.o%A run06.maker.mpi.sh
>> 
>> CMD in run06.maker.mpi.sh:
>> mpiexec -n $SLURM_NTASKS maker -c 1 -base genome -g genome.fasta
>> 
>> Another question:
>> How much temporary space and memory should I use for ~10 Mb of sequence and large databases like nr and uniref90?
>> 
>> Thanks,
>> zelin
>> 
>> --------------------------------------------
>> Zelin Chen [chzelin at gmail.com <mailto:chzelin at gmail.com>]
>> 
>> 
>> On Tue, Aug 15, 2017 at 4:50 PM, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>> What is your command line? Are you running interactively or as a submitted batch? If it's a batch job what options did you give it?
>> 
>> --Carson
>> 
>> Sent from my iPhone
>> 
>> On Aug 15, 2017, at 2:47 PM, zl c <chzelin at gmail.com <mailto:chzelin at gmail.com>> wrote:
>> 
>>> Hi Carson,  Christopher, Daniel,
>>> 
>>> Thank you for your kind help.
>>> 
>>> Now it works without any other options on one node with 4 CPUs. I set the number of tasks to 2, but there is only one contig running. Shouldn't two contigs be running at the same time?
>>> 
>>> Zelin
>>> 
>>> --------------------------------------------
>>> Zelin Chen [chzelin at gmail.com <mailto:chzelin at gmail.com>]
>>> 
>>> 
>>> NIH/NHGRI
>>> Building 50, Room 5531
>>> 50 SOUTH DR, MSC 8004 
>>> BETHESDA, MD 20892-8004
>>> 
>>> On Tue, Aug 15, 2017 at 11:47 AM, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>>> Did it die or did you just get a warning?
>>> 
>>> Here is a list of flags to add that suppress warnings and other issues with OpenMPI. You can add them all, or one at a time, depending on the issues you get. (A combined example command follows the list.)
>>> 
>>> #add if MPI not using all CPU given
>>> --oversubscribe --bind-to none
>>> 
>>> #workaround for InfiniBand (use instead of '--mca btl ^openib')
>>> --mca btl vader,tcp,self --mca btl_tcp_if_include ib0
>>> 
>>> #add to stop certain other warnings
>>> --mca orte_base_help_aggregate 0
>>> 
>>> #stop fork warnings
>>> --mca btl_openib_want_fork_support 1 --mca mpi_warn_on_fork 0
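>>> 
>>> For reference, combined into a single command it might look like this (a sketch only; adjust the task count and the maker arguments to your run):
>>> 
>>> mpiexec --oversubscribe --bind-to none --mca btl vader,tcp,self --mca btl_tcp_if_include ib0 --mca orte_base_help_aggregate 0 --mca btl_openib_want_fork_support 1 --mca mpi_warn_on_fork 0 -n $SLURM_NTASKS maker -c 1 -base genome -g genome.fasta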
>>> 
>>> —Carson
>>> 
>>> 
>>> 
>>>> On Aug 15, 2017, at 9:34 AM, zl c <chzelin at gmail.com <mailto:chzelin at gmail.com>> wrote:
>>>> 
>>>> Here are some latest message:
>>>> 
>>>> [cn3360:57176] 1 more process has sent help message help-opal-runtime.txt / opal_init:warn-fork
>>>> [cn3360:57176] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>> 
>>>> --------------------------------------------
>>>> Zelin Chen [chzelin at gmail.com <mailto:chzelin at gmail.com>]
>>>> 
>>>> 
>>>> 
>>>> On Tue, Aug 15, 2017 at 10:39 AM, Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>>>> You may need to delete the .../maker/perl directory before doing the reinstall if you are not doing a brand-new installation. Otherwise you can ignore the subroutine-redefined warnings during compile.
>>>> 
>>>> Have you been able to test the alternate flags on the command line for MPI? How about an alternate perl without threads?
>>>> 
>>>> --Carson
>>>> 
>>>> Sent from my iPhone
>>>> 
>>>> On Aug 15, 2017, at 8:27 AM, zl c <chzelin at gmail.com <mailto:chzelin at gmail.com>> wrote:
>>>> 
>>>>> When I installed with './Build install', I got the following messages:
>>>>> Configuring MAKER with MPI support
>>>>> Installing MAKER...
>>>>> Configuring MAKER with MPI support
>>>>> Subroutine dl_load_flags redefined at (eval 125) line 8.
>>>>> Subroutine Parallel::Application::MPI::C_MPI_ANY_SOURCE redefined at (eval 125) line 9.
>>>>> Subroutine Parallel::Application::MPI::C_MPI_ANY_TAG redefined at (eval 125) line 9.
>>>>> Subroutine Parallel::Application::MPI::C_MPI_SUCCESS redefined at (eval 125) line 9.
>>>>> Subroutine Parallel::Application::MPI::C_MPI_Init redefined at (eval 125) line 9.
>>>>> Subroutine Parallel::Application::MPI::C_MPI_Finalize redefined at (eval 125) line 9.
>>>>> Subroutine Parallel::Application::MPI::C_MPI_Comm_rank redefined at (eval 125) line 9.
>>>>> Subroutine Parallel::Application::MPI::C_MPI_Comm_size redefined at (eval 125) line 9.
>>>>> Subroutine Parallel::Application::MPI::C_MPI_Send redefined at (eval 125) line 9.
>>>>> Subroutine Parallel::Application::MPI::C_MPI_Recv redefined at (eval 125) line 9.
>>>>> Subroutine Parallel::Application::MPI::_comment redefined at (eval 125) line 9.
>>>>> 
>>>>> I'm not sure whether it's correctly installed.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> --------------------------------------------
>>>>> Zelin Chen [chzelin at gmail.com <mailto:chzelin at gmail.com>]
>>>>> 
>>>>> NIH/NHGRI
>>>>> Building 50, Room 5531
>>>>> 50 SOUTH DR, MSC 8004 
>>>>> BETHESDA, MD 20892-8004
>>>>> 
>>>>> On Mon, Aug 14, 2017 at 9:23 PM, Fields, Christopher J <cjfields at illinois.edu <mailto:cjfields at illinois.edu>> wrote:
>>>>> Carson,
>>>>> 
>>>>>  
>>>>> 
>>>>> It was attached to the initial message (named ‘run05.mpi.o47346077’). It looks like a Perl issue with threads, though I don’t see why this would crash a cluster. The fact that there is a log file suggests it just ended the job.
>>>>> 
>>>>>  
>>>>> 
>>>>> chris
>>>>> 
>>>>>  
>>>>> 
>>>>> From: maker-devel <maker-devel-bounces at yandell-lab.org <mailto:maker-devel-bounces at yandell-lab.org>> on behalf of Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>>
>>>>> Date: Monday, August 14, 2017 at 2:18 PM
>>>>> To: zl c <chzelin at gmail.com <mailto:chzelin at gmail.com>>
>>>>> Cc: "maker-devel at yandell-lab.org <mailto:maker-devel at yandell-lab.org>" <maker-devel at yandell-lab.org <mailto:maker-devel at yandell-lab.org>>
>>>>> Subject: Re: [maker-devel] maker MPI problem
>>>>> 
>>>>>  
>>>>> 
>>>>> This is rather vague —> “crashed the computer cluster”
>>>>> 
>>>>>  
>>>>> 
>>>>> Do you have a specific error?
>>>>> 
>>>>>  
>>>>> 
>>>>> —Carson
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>> On Aug 14, 2017, at 12:59 PM, zl c <chzelin at gmail.com <mailto:chzelin at gmail.com>> wrote:
>>>>> 
>>>>>  
>>>>> 
>>>>> Hello,
>>>>> 
>>>>>  
>>>>> 
>>>>> I ran maker 3.0 with OpenMPI 2.0.2 and it crashed the computer cluster. I have attached the log file. Could you help me solve the problem?
>>>>> 
>>>>>  
>>>>> 
>>>>> CMD:
>>>>> 
>>>>> export LD_PRELOAD=/usr/local/OpenMPI/2.0.2/gcc-6.3.0/lib/libmpi.so
>>>>> 
>>>>> export OMPI_MCA_mpi_warn_on_fork=0
>>>>> 
>>>>> mpiexec -mca btl ^openib -n $SLURM_NTASKS maker -c 1 -base genome -g genome.fasta
>>>>> 
>>>>>  
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Zelin Chen
>>>>> 
>>>>>  
>>>>> 
>>>>> --------------------------------------------
>>>>> 
>>>>> Zelin Chen [chzelin at gmail.com <mailto:chzelin at gmail.com>]  Ph.D.
>>>>> 
>>>>>  
>>>>> 
>>>>> NIH/NHGRI
>>>>> 
>>>>> Building 50, Room 5531
>>>>> 50 SOUTH DR, MSC 8004 
>>>>> BETHESDA, MD 20892-8004
>>>>> 
>>>>> <run05.mpi.o47346077>_______________________________________________
>>>>> maker-devel mailing list
>>>>> maker-devel at box290.bluehost.com <mailto:maker-devel at box290.bluehost.com>
>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org <http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org>
>>>>>  
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>> 
> 
> 
