[maker-devel] MAKER mpi running wrong
Carson Holt
carsonhh at gmail.com
Thu Jan 8 08:47:29 MST 2015
When running large jobs under MPI, semi-random failures can arise, as can tuning issues in which hardware configuration, IO performance, buffer sizes, etc. all play a role. On one of the NIH flagship clusters from XSEDE, for example, I can run on over 2000 CPUs without issue. But the IT specialists with XSEDE have also spent a lot of time tuning MPI by enabling and disabling certain options for their hardware and network configuration (they are the developers of many of the available MPI flavors, so MPI was written to work really well on that specific cluster). On other clusters I can't go over 200 CPUs on a single job. On another XSEDE cluster I can run on exactly 1424 CPUs; if I increase by a single CPU, the job always fails.

For these kinds of issues you would have to delve into some of the more obscure parameters of OpenMPI via trial and error (http://www.open-mpi.org/doc/). Under the hood, OpenMPI triggers different buffer sizes and network communication strategies as the number of nodes increases, so you can often identify a specific CPU count that is stable, where going one over that number causes a failure. You then look in the documentation for a parameter that matches that trigger value and adjust it higher or lower. Or, once you have identified the stable CPU count, just submit multiple jobs at exactly that CPU count.
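As a rough illustration of the trial-and-error search for a stable CPU count, here is a minimal shell sketch that bisects between a count known to work and a count known to fail. The function names max_stable_cpus and probe_mpi are hypothetical helpers, not part of MAKER or OpenMPI. The sketch assumes failures are monotonic (every count above the threshold fails), and that the probe reproduces the failure quickly, which may not hold for a job that only dies after hours of running.

```shell
#!/bin/sh
# Hypothetical helper: print the largest CPU count in [LO, HI] for which
# PROBE succeeds. Assumes LO is known to work and that every count above
# the stable threshold fails (monotonic failure).
# usage: max_stable_cpus LO HI PROBE   (PROBE is called with the count)
max_stable_cpus() {
    lo=$1
    hi=$2
    probe=$3
    while [ "$lo" -lt "$hi" ]; do
        # Round up so the loop always makes progress when hi = lo + 1.
        mid=$(( (lo + hi + 1) / 2 ))
        if "$probe" "$mid" >/dev/null 2>&1; then
            lo=$mid               # mid works: threshold is at or above mid
        else
            hi=$(( mid - 1 ))     # mid fails: threshold is below mid
        fi
    done
    echo "$lo"
}

# Example probe (an assumption): a short MPI run at N ranks using the same
# mpiexec options as the job in this thread. Swap in whatever short test
# job actually reproduces the failure on your cluster.
probe_mpi() {
    mpiexec -mca btl ^openib -n "$1" hostname
}
```

Inside an allocation large enough for the upper bound, something like `max_stable_cpus 16 512 probe_mpi` would print the largest count that passed. To see which TCP transport parameters exist before changing any of them, `ompi_info --param btl tcp` lists them for the installed OpenMPI.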
—Carson
> On Jan 8, 2015, at 8:27 AM, 赵越 <jerryzhaosjtu at gmail.com> wrote:
>
> Hi Carson,
>
> After using the flag in your example, the warning after running MAKER was gone, yet after running with MPI on 512 threads for 2 hours, MAKER 'Exited with exit code 1'. The stdout info is as follows:
>
> [node206][[7968,1],269][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node206][[7968,1],269][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> SIGTERM received
> Perl exited with active threads:
> 1 running and unjoined
> 0 finished and unjoined
> 0 running and detached
>
> Also, my job submission script looks like this:
>
> #BSUB -J maker_mpi
> #BSUB -n 512
> #BSUB -R "span[ptile=16]"
> module purge && module load gcc/4.9.1 openmpi/gcc/1.6.5
> mpiexec -mca btl ^openib -n 512 perl /lustre/home/clswcc/yzhao/MAKER/maker/bin/maker -fix_nucleotides
>
>
> Could you help me find out what is going wrong? The stdout at first is normal, as follows:
> STATUS: Parsing control files...
> STATUS: Processing and indexing input FASTA files...
> STATUS: Setting up database for any GFF3 input...
> A data structure will be created for you at:
> /lustre/home/clswcc/SOP_1Krice/gene_prediction/mpi/unaln.maker.output/unaln_datastore
>
> To access files for individual sequences use the datastore index:
> /lustre/home/clswcc/SOP_1Krice/gene_prediction/mpi/unaln.maker.output/unaln_master_datastore_index.log
>
> STATUS: Now running MAKER...
>
>
>
>
> Regards,
> yue
>
> --
> Yue Zhao (Jerry)
> Bachelor Candidate of Plant Biotechnology
> Researcher in UCLA-CSST program
> Shanghai Jiao Tong University, Shanghai
> jerryzhaosjtu at gmail.com