[maker-devel] Maker crash on increasingly small contigs
Carson Holt
carsonhh at gmail.com
Thu Jan 29 08:22:57 MST 2015
In my experience NFS is the most likely cause. A lot of very small contigs means that MAKER produces a lot of very small files very quickly, which stresses NFS far more than high read/write bandwidth does. There can then be several seconds of lag between a file being created and the file being available for reading, because in asynchronous mode the system can report an IO operation as successful even though it has not yet completed and is only buffered on the NFS server. So when a process tries to read the file it supposedly just created, the file doesn’t exist.
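To illustrate the lag described above, here is a minimal Python sketch of a reader that tolerates a file that has been "created" but is not yet visible. It is not taken from MAKER's code; the function name `read_when_visible` and its `timeout` and `interval` parameters are hypothetical and chosen only for this example.

```python
import os
import time


def read_when_visible(path, timeout=30.0, interval=0.5):
    """Return the bytes of `path`, polling until it can be opened.

    Hypothetical helper: re-raises FileNotFoundError once `timeout`
    seconds have passed without the file becoming visible.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(path, "rb") as handle:
                return handle.read()
        except FileNotFoundError:
            # The writer may already have reported success while this
            # client still cannot see the file, so wait and try again.
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)
```

Polling only hides the delay. It does not guarantee that a file that has become visible is already complete, so a writer would normally write to a temporary name and rename it into place.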
MAKER tries to offload most small file creation operations that can result in this condition to a temporary directory (indicated by TMP= in the maker_opts.ctl file), so it is critical that this location be set to a local drive and not an NFS location. But running a lot of very small contigs would still result in more frequent file creation on the NFS mount.
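One way to check that the TMP= location really is local is to look up which mount contains it. The sketch below is Linux-only and assumes the standard `/proc/mounts` layout (device, mount point, filesystem type, ...). The helper names `filesystem_type` and `is_nfs` are hypothetical, and mount points containing spaces (which the kernel escapes in that file) are not handled.

```python
import os


def filesystem_type(path, mounts_file="/proc/mounts"):
    """Return (mount_point, fs_type) for the mount that contains `path`."""
    target = os.path.realpath(path) + "/"
    best = ("", "")
    with open(mounts_file) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 3:
                continue
            mount_point, fs_type = fields[1], fields[2]
            prefix = mount_point.rstrip("/") + "/"
            # Keep the longest matching mount point; on a tie the later
            # entry wins, since later mounts shadow earlier ones.
            if target.startswith(prefix) and len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best


def is_nfs(path, mounts_file="/proc/mounts"):
    """True if the filesystem type of `path` starts with 'nfs' (nfs, nfs4)."""
    return filesystem_type(path, mounts_file)[1].startswith("nfs")
```

Calling `is_nfs()` on the directory given in TMP= on each compute node, before starting a run, would show whether the temporary files end up on the NFS mount after all.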
There are only three ways around this type of NFS issue: run on fewer nodes to reduce file creation frequency, turn off asynchronous mode for NFS (which seriously degrades IO performance), or just let MAKER retry until it works (brute force), which is the default and in my experience the most effective approach. NFS issues were in fact the reason we put retry and restart capabilities into MAKER in the first place.
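If the whole run dies rather than a single contig, the same brute-force idea can be applied from outside by re-launching the job until it exits cleanly. This is a generic sketch, not part of MAKER: the function `run_until_success` and its parameters are hypothetical, and the command to run depends on your cluster setup.

```python
import subprocess
import sys
import time


def run_until_success(command, max_attempts=5, pause=10.0):
    """Run `command` until it exits with status 0.

    Returns the number of attempts used. Raises RuntimeError if the
    command still fails after `max_attempts` tries.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(command)
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt} exited with {result.returncode}", file=sys.stderr)
        if attempt < max_attempts:
            # Give the file server a moment to settle before re-launching.
            time.sleep(pause)
    raise RuntimeError(f"{command!r} still failing after {max_attempts} attempts")
```

The command is passed as a list, for example whatever MPI launch line you already use for MAKER in the run directory. Since a restarted run resumes where the previous one stopped, each attempt should only have to process the contigs that are still unfinished.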
—Carson
> On Jan 29, 2015, at 2:37 AM, Mikael Brandström Durling <mikael.durling at slu.se> wrote:
>
> Hi,
>
> are you running the NFS servers in synchronous or asynchronous mode? I have seen cases where maker fails with the NFS server in async mode, but the failures are random and I can’t really reproduce them. In the end, I have continued running maker on NFS in async mode, since the speed gains are significant, at the cost of occasional reruns. (And yes, nfsstat shows no signs of errors).
>
> Mikael
>
>
>> On 29 Jan 2015, at 08:34, Marc P. Hoeppner <marc.hoeppner at imbim.uu.se> wrote:
>>
>> Hi,
>>
>> thanks for the feedback. If I resume maker enough times, it will eventually run through and complete all contigs. The question is whether there is any way to debug why it drops at random times, most commonly when running on small contigs (which is probably due more to the increasing frequency of starting/finishing jobs than to their size). I guess Maker has no debug mode or any other way to find out why it dies? Any idea what could make Maker drop like that? I was thinking NFS, but the nfsstat output looks fine, there is nothing in the log, and NFS function is generally good - so I can't identify a good place to look for the problem.
>>
>> Regards,
>>
>> Marc
>>
>> On 2015-01-28 17:22, Daniel Ence wrote:
>>> Hi Marc, so a few things on the maker side to check out.
>>>
>>> Did you have the min_contig set to 1000, to set the lower limit on contig size?
>>> Did maker do anything with the 1kb contigs? Or did it just skip them?
>>> You can check that in the master_datastore_index.log or in the void directories for the small contigs.
>>> That will tell us whether maker is functioning correctly, even though it’s giving those messages.
>>>
>>> With the newer versions of maker, I get messages identical to what you sent as part of the normal thread termination, even when maker is functioning normally.
>>>
>>> Thanks,
>>> Daniel
>>>
>>>
>>>
>>>> On Jan 28, 2015, at 12:01 AM, Marc Höppner <marc.hoeppner at imbim.uu.se> wrote:
>>>>
>>>> Hi,
>>>>
>>>> this is probably a long shot, but I was hoping that someone on the list may have some advice as to how to debug an error that has been popping up when running Maker on our 10 node cluster. So, what is the issue?
>>>>
>>>> Maker runs fine on several assemblies that we have processed in the past, but I recently started on a fairly fragmented (low N50) mammalian assembly and the collaborator was keen to have all contigs annotated, down to 1kb (I guess it is more about the repeats and blast matches in those small bits). Anyway, as the contigs get smaller, Maker starts crashing in MPI mode with the following error (no other message given prior to that):
>>>>
>>>> perl:13424 terminated with signal 11 at PC=3d47095012 SP=7f8ac076e530. Backtrace:
>>>> /usr/lib64/perl5/CORE/libperl.so(Perl_csighandler+0x22)[0x3d47095012]
>>>> /lib64/libpthread.so.0[0x358ae0f710]
>>>> /usr/lib64/perl5/CORE/libperl.so(Perl_csighandler+0x0)[0x3d47094ff0]
>>>> /lib64/libpthread.so.0[0x358ae0f710]
>>>> /lib64/libc.so.6(__poll+0x53)[0x358aadf343]
>>>> /sw/openmpi/1.8.3/lib/libopen-pal.so.6(+0x6af4a)[0x7f8ac0a29f4a]
>>>> /sw/openmpi/1.8.3/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0x221)[0x7f8ac0a21961]
>>>> /sw/openmpi/1.8.3/lib/libopen-rte.so.7(+0x52f8e)[0x7f8ac0ce5f8e]
>>>> /lib64/libpthread.so.0[0x358ae079d1]
>>>> /lib64/libc.so.6(clone+0x6d)[0x358aae8b6d]
>>>> SIGTERM received
>>>>
>>>> A few words about the setup:
>>>>
>>>> We have 10 nodes, 160 cores and the shared file system is exported via Infiniband from a ‘standard’ NFS server. As OS we run Scientific Linux 6.5. Tests so far don’t point to congestion issues or anything like that, the bandwidth usage is actually fairly low.
>>>>
>>>> So far I tried:
>>>>
>>>> - running the MPI processes through both the ethernet network as well as over IPoIB, same problem.
>>>> - installing a more recent version of perl through perlbrew, with all the required modules, and re-compiled Maker
>>>> - ran some (albeit simple) network checks for retransmissions, lost packets, etc. - nothing popped up
>>>> - running Maker in a subset of nodes to eliminate the possibility of a bad node
>>>>
>>>> The error message is a bit cryptic to me and it would be very helpful to know if Maker has a problem with accessing a file, or whether OpenMPI has a communication problem etc - but I am not able to tell from the information I have been able to extract so far. Any ideas?
>>>>
>>>> Cheers,
>>>>
>>>> Marc
>>>>
>>>>
>>>> Marc P. Hoeppner, PhD
>>>> Team Leader
>>>> BILS Genome Annotation Platform
>>>> Department for Medical Biochemistry and Microbiology
>>>> Uppsala University, Sweden
>>>> marc.hoeppner at imbim.uu.se
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> maker-devel mailing list
>>>> maker-devel at box290.bluehost.com
>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org