[maker-devel] short scaffolds finish, long scaffolds (almost always) fail

Carson Holt carsonhh at gmail.com
Wed Feb 26 13:18:34 MST 2020


For Intel MPI, export an environment variable right before running MAKER —> "export I_MPI_FABRICS=shm:tcp"
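
For example, near the top of your batch script (the maker path and process count below are just the ones from your earlier example, not requirements):

```shell
# Force Intel MPI onto shared memory + TCP, bypassing the InfiniBand fabric
# that triggers the segfault.
export I_MPI_FABRICS=shm:tcp
echo "I_MPI_FABRICS=$I_MPI_FABRICS"

# Then launch MAKER as usual (path and -n are illustrative):
# mpiexec -n 28 /packages/maker/3.01.02-beta/bin/maker -base lu -fix_nucleotides
```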

Intel MPI has an InfiniBand segfault issue similar to OpenMPI's when running Perl scripts, but the workaround is different.

—Carson


> On Feb 26, 2020, at 1:15 PM, Devon O'Rourke <devon.orourke at gmail.com> wrote:
> 
> Much appreciated Carson,
> I've submitted a job using the parameters you've suggested and will post the outcome. We definitely have two of three MPI options you've described on our cluster (OpenMPI and MPICH2); I'll check on Intel MPI. Happy to advise my cluster admins to use whichever software you prefer (should there be one).
> Thanks,
> Devon
> 
> On Wed, Feb 26, 2020 at 2:54 PM Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
> Try adding a few options right after ‘mpiexec’ in your batch script (this will fix InfiniBand-related segfaults as well as some fork-related segfaults) —> --mca btl vader,tcp,self --mca btl_tcp_if_include ib0 --mca orte_base_help_aggregate 0 --mca btl_openib_want_fork_support 1 --mca mpi_warn_on_fork 0
> 
> Also remove the -q in the maker command to get full command lines for subprocesses in the STDERR (this lets you run those commands outside of MAKER to track down the source of failures, for example if BLAST or Exonerate is causing the segfault).
> 
> Example —>
> mpiexec --mca btl vader,tcp,self --mca btl_tcp_if_include ib0 --mca orte_base_help_aggregate 0 --mca btl_openib_want_fork_support 1 --mca mpi_warn_on_fork 0 -n 28 /packages/maker/3.01.02-beta/bin/maker -base lu -fix_nucleotides 
> 
> 
> One alternate possibility is that OpenMPI itself is the problem. I’ve seen a few systems where it has an issue with Perl, and the only way around it is to install your own version of Perl with threads disabled and reinstall MAKER with that Perl (then OpenMPI seems to be ok again). If that’s the case, it is often easier to switch to MPICH2 or Intel MPI as the MPI launcher, if either is available, and then reinstall MAKER with that MPI flavor.
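> 
> A quick way to check whether your current Perl was built with threads (the problematic configuration) —>
> 
> ```shell
> # Prints usethreads='define' for a threaded perl, usethreads='undef' otherwise.
> perl -V:usethreads
> 
> # Building a non-threaded perl from source looks roughly like this
> # (the install prefix is just a placeholder):
> #   ./Configure -des -Dprefix=$HOME/perl-nothread -Uusethreads
> #   make && make install
> ```
> 
> -Uusethreads is the Configure flag that disables perl threads; you would then reinstall MAKER using that perl binary.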
> 
> —Carson 
> 
> 
> 
>> On Feb 26, 2020, at 12:36 PM, Devon O'Rourke <devon.orourke at gmail.com <mailto:devon.orourke at gmail.com>> wrote:
>> 
>> Thanks very much for the reply Carson,
>> I've attached a few files from the most recently failed run: the shell script submitted to Slurm, the _opts.ctl file, and the pair of log files generated from the job. There are 1a and 1b versions of the log files because I had initially set the number of CPUs in the _opts.ctl file to "60", then re-ran after setting it to "28"; both runs end the same way.
>> I certainly have access to more memory if needed. I'm using a pretty typical (I think?) cluster that schedules jobs with Slurm on a Lustre file system - it's the main high-performance computing center at our university. I have access to plenty of nodes with about 120-150 GB of RAM and 24-28 CPUs each, as well as a handful of higher-memory nodes with about 1.5 TB of RAM. As I write this email, I've submitted a similar Maker job (i.e. same fasta/gff inputs) requesting 200 GB of RAM over 32 CPUs; if that fails, I can certainly run again with even more memory.
>> Appreciate your insights; hope the weather in UT is filled with sun or snow or both.
>> Devon
>> 
>> On Wed, Feb 26, 2020 at 2:10 PM Carson Holt <carsonhh at gmail.com <mailto:carsonhh at gmail.com>> wrote:
>> If running under MPI, the reason for a failure may be further back in the STDERR (failures tend to snowball into other failures, so the initial cause is often far back). If you can capture the STDERR and send it, that would be the most informative. If it's memory, you can also set all the blast depth parameters in maker_bopts.ctl to a value like 20.
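>> 
>> For example, in maker_bopts.ctl (parameter names as they appear in recent MAKER versions; check them against your own file) —>
>> 
>> ```
>> # Limit how deeply BLAST evidence is stacked per locus, which caps memory use:
>> depth_blastn=20
>> depth_blastx=20
>> depth_tblastx=20
>> ```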
>> 
>> —Carson
>> 
>> 
>> 
>>> On Feb 19, 2020, at 1:54 PM, Devon O'Rourke <devon.orourke at gmail.com <mailto:devon.orourke at gmail.com>> wrote:
>>> 
>>> Hello,
>>> 
>>> I apologize for not posting directly to the archived forum, but it appears that the option to enter new posts is disabled - perhaps by design, so that emails go directly to this address. I hope this is what you are looking for.
>>> 
>>> Thank you for your continued support of Maker and your responses to the forum posts. I have been running Maker (v3.01.02-beta) to annotate a mammalian genome that consists of 22 chromosome-length scaffolds (roughly 20-200 Mb each) and about 10,000 smaller fragments from 10 kb to 1 Mb in length. In my various test runs, the vast majority of the smaller fragments are annotated successfully, but nearly all the large scaffolds fail with the same error code in the 'run.log.child.0' file:
>>> ```
>>> DIED	RANK	0:6:0:0
>>> DIED	COUNT	2
>>> ```
>>> (the master 'run.log' file just shows "DIED	COUNT	2")
>>> 
>>> I struggled to find this exact error code anywhere on the forum and was hoping you might help me determine where to start troubleshooting. I thought perhaps it was a memory issue, so I increased the chunk size from the default to a few larger sequence lengths (I've tried 1e6, 1e7, and 999,999,999 - all produce the same outcome). I've tried running the program with parallel support using either OpenMPI or MPICH. I've tried running on a single node with 24 CPUs and 120 GB of RAM. It always stalls at the same step.
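>>> 
>>> (For reference, the chunk size I was adjusting is what I believe is the max_dna_len setting in maker_opts.ctl —>)
>>> 
>>> ```
>>> # maker_opts.ctl fragment; each value produced the same failure
>>> max_dna_len=1000000   # also tried 10000000 and 999999999
>>> ```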
>>> 
>>> Interestingly, one of the 22 large scaffolds always finishes and produces the .maker.proteins.fasta, .maker.transcripts.fasta, and .gff files, but the other 21 of 22 large scaffolds fail. This makes me think perhaps it's not a memory issue?
>>> 
>>> For both the completed and failed scaffolds, the "theVoid.scaffoldX" subdirectories contain the .rb.cat.gz, .rb.out, .specific.ori.out, .specific.cat.gz, .specific.out, te_proteins*fasta.repeatrunner, est *fasta.blastn, altest *fasta.tblastx, and protein *fasta.blastx files, and all appear finished from what I can tell.
>>> However, the contents of the parent directory of the "theVoid.scaffoldX" folder differ. For the failed scaffolds, the contents always look something like this (that is, they stall having produced the same kinds of files):
>>> ```
>>> 0
>>> evidence_0.gff
>>> query.fasta
>>> query.masked.fasta
>>> query.masked.fasta.index
>>> query.masked.gff
>>> run.log.child.0
>>> scaffold22.0.final.section
>>> scaffold22.0.pred.raw.section
>>> scaffold22.0.raw.section
>>> scaffold22.gff.ann
>>> scaffold22.gff.def
>>> scaffold22.gff.seq
>>> ```
>>> 
>>> For the completed scaffold, there are many more files created:
>>> ```
>>> 0
>>> 10
>>> 100
>>> 20
>>> 30
>>> 40
>>> 50
>>> 60
>>> 70
>>> 80
>>> 90
>>> evidence_0.gff
>>> evidence_10.gff
>>> evidence_1.gff
>>> evidence_2.gff
>>> evidence_3.gff
>>> evidence_4.gff
>>> evidence_5.gff
>>> evidence_6.gff
>>> evidence_7.gff
>>> evidence_8.gff
>>> evidence_9.gff
>>> query.fasta
>>> query.masked.fasta
>>> query.masked.fasta.index
>>> query.masked.gff
>>> run.log.child.0
>>> run.log.child.1
>>> run.log.child.10
>>> run.log.child.2
>>> run.log.child.3
>>> run.log.child.4
>>> run.log.child.5
>>> run.log.child.6
>>> run.log.child.7
>>> run.log.child.8
>>> run.log.child.9
>>> scaffold4.0-1.raw.section
>>> scaffold4.0.final.section
>>> scaffold4.0.pred.raw.section
>>> scaffold4.0.raw.section
>>> scaffold4.10.final.section
>>> scaffold4.10.pred.raw.section
>>> scaffold4.10.raw.section
>>> scaffold4.1-2.raw.section
>>> scaffold4.1.final.section
>>> scaffold4.1.pred.raw.section
>>> scaffold4.1.raw.section
>>> scaffold4.2-3.raw.section
>>> scaffold4.2.final.section
>>> scaffold4.2.pred.raw.section
>>> scaffold4.2.raw.section
>>> scaffold4.3-4.raw.section
>>> scaffold4.3.final.section
>>> scaffold4.3.pred.raw.section
>>> scaffold4.3.raw.section
>>> scaffold4.4-5.raw.section
>>> scaffold4.4.final.section
>>> scaffold4.4.pred.raw.section
>>> scaffold4.4.raw.section
>>> scaffold4.5-6.raw.section
>>> scaffold4.5.final.section
>>> scaffold4.5.pred.raw.section
>>> scaffold4.5.raw.section
>>> scaffold4.6-7.raw.section
>>> scaffold4.6.final.section
>>> scaffold4.6.pred.raw.section
>>> scaffold4.6.raw.section
>>> scaffold4.7-8.raw.section
>>> scaffold4.7.final.section
>>> scaffold4.7.pred.raw.section
>>> scaffold4.7.raw.section
>>> scaffold4.8-9.raw.section
>>> scaffold4.8.final.section
>>> scaffold4.8.pred.raw.section
>>> scaffold4.8.raw.section
>>> scaffold4.9-10.raw.section
>>> scaffold4.9.final.section
>>> scaffold4.9.pred.raw.section
>>> scaffold4.9.raw.section
>>> ```
>>> 
>>> Thanks for any troubleshooting tips you can offer. 
>>> 
>>> Cheers,
>>> Devon
>>> 
>>> -- 
>>> Devon O'Rourke
>>> Postdoctoral researcher, Northern Arizona University
>>> Lab of Jeffrey T. Foster - https://fozlab.weebly.com/ <https://fozlab.weebly.com/>
>>> twitter: @thesciencedork
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at yandell-lab.org <mailto:maker-devel at yandell-lab.org>
>>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org <http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org>
>> 
>> 
>> 
>> -- 
>> Devon O'Rourke
>> Postdoctoral researcher, Northern Arizona University
>> Lab of Jeffrey T. Foster - https://fozlab.weebly.com/ <https://fozlab.weebly.com/>
>> twitter: @thesciencedork
>> <fail-1a.log.gz><fail-1b.log.gz><run1_maker_opts.ctl><run1_slurm.sh>
> 
> 
> 
> -- 
> Devon O'Rourke
> Postdoctoral researcher, Northern Arizona University
> Lab of Jeffrey T. Foster - https://fozlab.weebly.com/ <https://fozlab.weebly.com/>
> twitter: @thesciencedork


