[maker-devel] maker MPI problem

zl c chzelin at gmail.com
Thu Aug 17 07:39:29 MDT 2017


I use '--mca btl ^openib' and it runs on multiple nodes. It works, and I can
see that some sequences are finished in the test run.
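
For reference, the working invocation had this shape (a sketch; $SLURM_NTASKS comes from the batch submission, and the maker arguments mirror the command quoted later in this thread):

mpiexec --mca btl ^openib -n $SLURM_NTASKS maker -c 1 -base genome -g genome.fasta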

Then I made another run using the large nr database and local scratch space
on the compute cluster, and it fails.
Submit CMD:

sbatch --gres=lscratch:100 --time=168:00:00 --partition=multinode
--constraint=x2680 --mem-per-cpu=64g --ntasks=8 --ntasks-per-core=1
--job-name run05.mpi -o log.mpi.00/run05.mpi.o%A run05.mpi.sh
Error message:

#--------- command -------------#

Widget::tblastx:

/usr/local/apps/blast/ncbi-blast-2.5.0+/bin/tblastx -db
/lscratch/47455932/maker_BLLXNq/rna%2Efasta.mpi.10.21 -query
/lscratch/47455932/maker_BLLXNq/50/tig00017383_arrow.0 -num_alignments
10000 -num_descriptions 10000 -evalue 1e-10 -dbsize 1000 -searchsp
500000000 -num_threads 1 -lcase_masking -seg yes -soft_masking true
-show_gis -out /gpfs/gsfs6/users/chenz11/goldfish/11549472/sergey_canu70x/arrow/maker5/goldfish.arrow.renamed.maker.output/goldfish.arrow.renamed_datastore/5A/65/tig00017383_arrow//theVoid.tig00017383_arrow/0/tig00017383_arrow.0.rna%2Efasta.tblastx.temp_dir/rna%2Efasta.mpi.10.21.tblastx

#-------------------------------#

Thread 1 terminated abnormally: can't open /lscratch/47455932/mpiavG_z: No
such file or directory at /home/chenz11/program/maker_mpi/bin/maker line
1460 thread 1.

--> rank=37, hostname=cn4120

FATAL: Thread terminated, causing all processes to fail

--> rank=37, hostname=cn4120

-------------------------------------------------------

Primary job  terminated normally, but 1 process returned

a non-zero exit code.. Per user-direction, the job has been aborted.

-------------------------------------------------------

SIGTERM received

SIGTERM received

SIGTERM received

running  blast search.

#--------- command -------------#

Widget::tblastx:

/usr/local/apps/blast/ncbi-blast-2.5.0+/bin/tblastx -db
/lscratch/47455932/maker_Zsg_Gg/rna%2Efasta.mpi.10.8 -query
/lscratch/47455932/maker_Zsg_Gg/62/tig00001111_arrow.0 -num_alignments
10000 -num_descriptions 10000 -evalue 1e-10 -dbsize 1000 -searchsp
500000000 -num_threads 1 -lcase_masking -seg yes -soft_masking true
-show_gis -out /gpfs/gsfs6/users/chenz11/goldfish/11549472/sergey_canu70x/arrow/maker5/goldfish.arrow.renamed.maker.output/goldfish.arrow.renamed_datastore/B8/A5/tig00001111_arrow//theVoid.tig00001111_arrow/0/tig00001111_arrow.0.rna%2Efasta.tblastx.temp_dir/rna%2Efasta.mpi.10.8.tblastx

#-------------------------------#

Perl exited with active threads:

    1 running and unjoined

    0 finished and unjoined

    0 running and detached

Perl exited with active threads:

    1 running and unjoined

    0 finished and unjoined

    0 running and detached

--------------------------------------------------------------------------

An MPI communication peer process has unexpectedly disconnected.  This

usually indicates a failure in the peer process (e.g., a crash or

otherwise exiting without calling MPI_FINALIZE first).


Although this local MPI process will likely now behave unpredictably

(it may even hang or crash), the root cause of this problem is the

failure of the peer -- that is what you need to investigate.  For

example, there may be a core file that you can examine.  More

generally: such peer hangups are frequently caused by application bugs

or other external events.


  Local host: cn4130

  Local PID:  18831

  Peer host:  cn3683

--------------------------------------------------------------------------

SIGTERM received

SIGTERM received

...

SIGTERM received

formating database...

#--------- command -------------#

Widget::formater:

/usr/local/apps/blast/ncbi-blast-2.5.0+/bin/makeblastdb -dbtype prot -in
/lscratch/47455932/maker_rNzO3X/27/blastprep/protein2%2Efasta.mpi.10.25

#-------------------------------#

SIGTERM received

SIGTERM received

SIGTERM received

SIGTERM received

...

SIGTERM received

SIGTERM received

SIGTERM received

Perl exited with active threads:

    1 running and unjoined

    0 finished and unjoined

    0 running and detached

Perl exited with active threads:

    1 running and unjoined

    0 finished and unjoined

    0 running and detached

Perl exited with active threads:

    1 running and unjoined

    0 finished and unjoined

    0 running and detached


...

[cn3683:36010] 59 more processes have sent help message
help-mpi-btl-tcp.txt / peer hung up

[cn3683:36010] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
help / error messages

--------------------------------------------------------------------------

mpiexec detected that one or more processes exited with non-zero status,
thus causing

the job to be terminated. The first process to do so was:


  Process name: [[352,1],37]

  Exit code:    255

--------------------------------------------------------------------------


I rebuilt mpi_blast and reran it, and got the error again:

#--------- command -------------#

Widget::formater:

/usr/local/apps/blast/ncbi-blast-2.5.0+/bin/makeblastdb -dbtype nucl -in
/lscratch/47559740/maker_k6a7Hy/32/blastprep/rna%2Efasta.mpi.10.3

#-------------------------------#

Thread 1 terminated abnormally: can't open /lscratch/47559740/mpiS84Ju: No
such file or directory at /home/chenz11/program/maker_mpi/bin/maker line
1460 thread 1.

--> rank=27, hostname=cn4115

FATAL: Thread terminated, causing all processes to fail

--> rank=27, hostname=cn4115

deleted:276 hits

doing tblastx of alt-ESTs

formating database...

#--------- command -------------#

Widget::formater:

/usr/local/apps/blast/ncbi-blast-2.5.0+/bin/makeblastdb -dbtype nucl -in
/lscratch/47559740/maker_nCKTgE/2/blastprep/rna%2Efasta.mpi.10.11

#-------------------------------#

running  blast search.

#--------- command -------------#

Widget::tblastx:

/usr/local/apps/blast/ncbi-blast-2.5.0+/bin/tblastx -db
/lscratch/47559740/maker_0kWZTA/rna%2Efasta.mpi.10.20 -query
/lscratch/47559740/maker_0kWZTA/35/tig00027947_arrow.0 -num_alignments
10000 -num_descriptions 10000 -evalue 1e-10 -dbsize 1000 -searchsp
500000000 -num_threads 1 -lcase_masking -seg yes -soft_masking true
-show_gis -out
/gpfs/gsfs6/users/chenz11/goldfish/11549472/sergey_canu70x/arrow/maker5/goldfish.arrow.renamed.maker.output/goldfish.arrow.renamed_datastore/86/7F/tig00027947_arrow//theVoid.tig00027947_arrow/0/tig00027947_arrow.0.rna%2Efasta.tblastx.temp_dir/rna%2Efasta.mpi.10.20.tblastx

#-------------------------------#

-------------------------------------------------------

Primary job  terminated normally, but 1 process returned

a non-zero exit code.. Per user-direction, the job has been aborted.

-------------------------------------------------------

SIGTERM received

SIGTERM received

SIGTERM received

Perl exited with active threads:

    1 running and unjoined

    0 finished and unjoined

    0 running and detached

Perl exited with active threads:

    1 running and unjoined

    0 finished and unjoined

    0 running and detached

Thanks,
Zelin

--------------------------------------------
Zelin Chen [chzelin at gmail.com]

NIH/NHGRI
Building 50, Room 5531
50 SOUTH DR, MSC 8004
BETHESDA, MD 20892-8004

On Tue, Aug 15, 2017 at 5:13 PM, Carson Holt <carsonhh at gmail.com> wrote:

> Some notes:
>
> First, the mpiexec command still needs the --mca parameters (either
>  '--mca btl ^openib' or '--mca btl vader,tcp,self --mca btl_tcp_if_include
> ib0’). Otherwise, if you have InfiniBand on the nodes, it will try to use
> OpenFabrics-compatible libraries, which will kill code that makes system
> calls (as MAKER does).
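>
> For example, either of these shapes works on the mpiexec line (a sketch;
> the maker arguments mirror the command from your earlier message):
>
> mpiexec --mca btl ^openib -n $SLURM_NTASKS maker -c 1 -base genome -g genome.fasta
> mpiexec --mca btl vader,tcp,self --mca btl_tcp_if_include ib0 -n $SLURM_NTASKS maker -c 1 -base genome -g genome.fasta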
>
> Second, try using a higher count than 2 in your batch. One process is
> always sacrificed by maker to act only for message management among
> processes, so with -n 2, you have one process working and one managing
> data. So only one contig will run at a time. If you set it to a higher
> number the issue will go away. The message manager process starts to get
> saturated at ~200 CPUs, so anything above that processor count becomes less
> beneficial to the job.
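>
> For instance, with --ntasks=8 one rank only manages messages and the other
> seven run contigs, so up to seven contigs are processed at once. A sketch,
> reusing the sbatch line from your earlier message:
>
> sbatch --gres=lscratch:100 --mem-per-cpu=8g --ntasks=8 --ntasks-per-core=1 run06.maker.mpi.sh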
>
> Thanks,
> Carson
>
> On Aug 15, 2017, at 3:05 PM, zl c <chzelin at gmail.com> wrote:
>
> I submitted a job:
> sbatch --gres=lscratch:100 --time=8:00:00 --mem-per-cpu=8g -N 1-1
> --ntasks=2 --ntasks-per-core=1 --job-name run06.mpi -o log/run06.mpi.o%A
> run06.maker.mpi.sh
>
> CMD in run06.maker.mpi.sh
> mpiexec -n $SLURM_NTASKS maker -c 1 -base genome -g genome.fasta
>
> Another question:
> How much temporary space and memory should I use for ~10 Mb of sequence and
> large databases like nr and uniref90?
>
> Thanks,
> zelin
>
> --------------------------------------------
> Zelin Chen [chzelin at gmail.com]
>
>
> On Tue, Aug 15, 2017 at 4:50 PM, Carson Holt <carsonhh at gmail.com> wrote:
>
>> What is your command line? Are you running interactively or as a
>> submitted batch? If it's a batch job what options did you give it?
>>
>> --Carson
>>
>> Sent from my iPhone
>>
>> On Aug 15, 2017, at 2:47 PM, zl c <chzelin at gmail.com> wrote:
>>
>> Hi Carson,  Christopher, Daniel,
>>
>> Thank you for your kind help.
>>
>> Now it works without any other options on one node with 4 CPUs. I set
>> the number of tasks to 2, but only one contig is running. Should two
>> contigs be running at the same time?
>>
>> Zelin
>>
>> --------------------------------------------
>> Zelin Chen [chzelin at gmail.com]
>>
>>
>> NIH/NHGRI
>> Building 50, Room 5531
>> 50 SOUTH DR, MSC 8004
>> BETHESDA, MD 20892-8004
>>
>> On Tue, Aug 15, 2017 at 11:47 AM, Carson Holt <carsonhh at gmail.com> wrote:
>>
>>> Did it die or did you just get a warning?
>>>
>>> Here is a list of flags to add that suppress warnings and other issues
>>> with OpenMPI. You can add them all or one at a time depending on issues you
>>> get.
>>>
>>> #add if MPI not using all CPU given
>>> --oversubscribe --bind-to none
>>>
>>> #workaround for InfiniBand (use instead of '--mca btl ^openib')
>>> --mca btl vader,tcp,self --mca btl_tcp_if_include ib0
>>>
>>> #add to stop certain other warnings
>>> --mca orte_base_help_aggregate 0
>>>
>>> #stop fork warnings
>>> --mca btl_openib_want_fork_support 1 --mca mpi_warn_on_fork 0
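>>>
>>> #e.g., everything combined into one invocation (a sketch; trim to what you need)
>>> mpiexec --oversubscribe --bind-to none \
>>>   --mca btl vader,tcp,self --mca btl_tcp_if_include ib0 \
>>>   --mca orte_base_help_aggregate 0 \
>>>   --mca btl_openib_want_fork_support 1 --mca mpi_warn_on_fork 0 \
>>>   -n $SLURM_NTASKS maker -c 1 -base genome -g genome.fasta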
>>>
>>> —Carson
>>>
>>>
>>>
>>> On Aug 15, 2017, at 9:34 AM, zl c <chzelin at gmail.com> wrote:
>>>
>>> Here are the latest messages:
>>>
>>> [cn3360:57176] 1 more process has sent help message
>>> help-opal-runtime.txt / opal_init:warn-fork
>>> [cn3360:57176] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>>> all help / error messages
>>>
>>> --------------------------------------------
>>> Zelin Chen [chzelin at gmail.com]
>>>
>>>
>>>
>>> On Tue, Aug 15, 2017 at 10:39 AM, Carson Holt <carsonhh at gmail.com>
>>> wrote:
>>>
>>>> You may need to delete the .../maker/perl directory before doing the
>>>> reinstall if not doing a brand new installation. Otherwise you can ignore
>>>> the subroutine redefined warnings during compile.
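>>>>
>>>> For example (a sketch; the prefix /home/chenz11/program/maker_mpi is
>>>> inferred from the bin/maker path in your log, and src is the standard
>>>> MAKER source directory -- adjust both to your install):
>>>>
>>>> rm -rf /home/chenz11/program/maker_mpi/perl   # remove stale perl build artifacts
>>>> cd /home/chenz11/program/maker_mpi/src && ./Build install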
>>>>
>>>> Have you been able to test the alternate flags on the command line for
>>>> MPI? How about an alternate perl without threads?
>>>>
>>>> --Carson
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Aug 15, 2017, at 8:27 AM, zl c <chzelin at gmail.com> wrote:
>>>>
>>>> When I installed with './Build install', I got the following messages:
>>>> Configuring MAKER with MPI support
>>>> Installing MAKER...
>>>> Configuring MAKER with MPI support
>>>> Subroutine dl_load_flags redefined at (eval 125) line 8.
>>>> Subroutine Parallel::Application::MPI::C_MPI_ANY_SOURCE redefined at
>>>> (eval 125) line 9.
>>>> Subroutine Parallel::Application::MPI::C_MPI_ANY_TAG redefined at
>>>> (eval 125) line 9.
>>>> Subroutine Parallel::Application::MPI::C_MPI_SUCCESS redefined at
>>>> (eval 125) line 9.
>>>> Subroutine Parallel::Application::MPI::C_MPI_Init redefined at (eval
>>>> 125) line 9.
>>>> Subroutine Parallel::Application::MPI::C_MPI_Finalize redefined at
>>>> (eval 125) line 9.
>>>> Subroutine Parallel::Application::MPI::C_MPI_Comm_rank redefined at
>>>> (eval 125) line 9.
>>>> Subroutine Parallel::Application::MPI::C_MPI_Comm_size redefined at
>>>> (eval 125) line 9.
>>>> Subroutine Parallel::Application::MPI::C_MPI_Send redefined at (eval
>>>> 125) line 9.
>>>> Subroutine Parallel::Application::MPI::C_MPI_Recv redefined at (eval
>>>> 125) line 9.
>>>> Subroutine Parallel::Application::MPI::_comment redefined at (eval
>>>> 125) line 9.
>>>>
>>>> I'm not sure whether it's correctly installed.
>>>>
>>>> Thanks,
>>>>
>>>> --------------------------------------------
>>>> Zelin Chen [chzelin at gmail.com]
>>>>
>>>> NIH/NHGRI
>>>> Building 50, Room 5531
>>>> 50 SOUTH DR, MSC 8004
>>>> BETHESDA, MD 20892-8004
>>>>
>>>> On Mon, Aug 14, 2017 at 9:23 PM, Fields, Christopher J <
>>>> cjfields at illinois.edu> wrote:
>>>>
>>>>> Carson,
>>>>>
>>>>>
>>>>>
>>>>> It was attached to the initial message (named ‘run05.mpi.o47346077’).
>>>>> It looks like a Perl issue with threads, though I don’t see why this would
>>>>> crash a cluster. The fact that there is a log file suggests it just ended
>>>>> the job.
>>>>>
>>>>>
>>>>>
>>>>> chris
>>>>>
>>>>>
>>>>>
>>>>> From: maker-devel <maker-devel-bounces at yandell-lab.org> on behalf
>>>>> of Carson Holt <carsonhh at gmail.com>
>>>>> Date: Monday, August 14, 2017 at 2:18 PM
>>>>> To: zl c <chzelin at gmail.com>
>>>>> Cc: "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
>>>>> Subject: Re: [maker-devel] maker MPI problem
>>>>>
>>>>>
>>>>>
>>>>> This is rather vague —> “crashed the computer cluster”
>>>>>
>>>>>
>>>>>
>>>>> Do you have a specific error?
>>>>>
>>>>>
>>>>>
>>>>> —Carson
>>>>>
>>>>> On Aug 14, 2017, at 12:59 PM, zl c <chzelin at gmail.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> Hello,
>>>>>
>>>>>
>>>>>
>>>>> I ran maker 3.0 with openmpi 2.0.2 and it crashed the computer
>>>>> cluster. I have attached the log file. Could you help me solve the problem?
>>>>>
>>>>>
>>>>>
>>>>> CMD:
>>>>>
>>>>> export LD_PRELOAD=/usr/local/OpenMPI/2.0.2/gcc-6.3.0/lib/libmpi.so
>>>>>
>>>>> export OMPI_MCA_mpi_warn_on_fork=0
>>>>>
>>>>> mpiexec -mca btl ^openib -n $SLURM_NTASKS maker -c 1 -base genome -g
>>>>> genome.fasta
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Zelin Chen
>>>>>
>>>>>
>>>>>
>>>>> --------------------------------------------
>>>>>
>>>>> Zelin Chen [chzelin at gmail.com]  Ph.D.
>>>>>
>>>>>
>>>>>
>>>>> NIH/NHGRI
>>>>>
>>>>> Building 50, Room 5531
>>>>> 50 SOUTH DR, MSC 8004
>>>>> BETHESDA, MD 20892-8004
>>>>>
>>>>> <run05.mpi.o47346077>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
>