[maker-devel] short scaffolds finish, long scaffolds (almost always) fail

Devon O'Rourke devon.orourke at gmail.com
Sat Feb 29 10:27:16 MST 2020


Hi once again Carson,
Our administrators reinstalled MAKER against a different version of
OpenMPI, and that change allowed the job to complete normally. Specifically,
we downgraded from a newer version (3.1.3) to an older one (1.6.5). After the
downgrade I needed to make one tweak to the MPI arguments you provided, since
v1.6.5 does not include the vader BTL. Other than that, the options you
suggested allowed the job to run to completion.
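
For anyone hitting the same issue later, the launch line we ended up with
looked roughly like the sketch below (the vader entry is simply dropped since
OpenMPI 1.6.5 predates it; the process count and install path are just what we
happened to use):

```
# adjusted mpiexec line for OpenMPI 1.6.5 (sketch; vader BTL removed)
mpiexec --mca btl tcp,self --mca btl_tcp_if_include ib0 \
        --mca orte_base_help_aggregate 0 --mca btl_openib_want_fork_support 1 \
        --mca mpi_warn_on_fork 0 \
        -n 28 /packages/maker/3.01.02-beta/bin/maker -base lu -fix_nucleotides
```
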
Thanks for your assistance,
Devon

On Fri, Feb 28, 2020 at 7:50 AM Devon O'Rourke <devon.orourke at gmail.com>
wrote:

> Hi Carson,
> I tried sending this email yesterday but received a notification that the
> message body was too large, perhaps because of the attached log file. The
> same file is available here: https://osf.io/cuxg8/download.
> Thanks!
>
> (previous message below)
>
> ....
>
> Two steps forward, one step back, I suppose?
> After incorporating the additional MPI-related parameters, the job progressed
> further than in previous attempts, but it still failed before finishing. It
> appears that all but the six longest scaffolds were annotated (aside from a
> few short scaffolds that simply hadn't finished by the time the error stopped
> the entire run).
> I've attached the .log file in hopes that you might find additional clues to
> help diagnose the problem. Very much appreciate your help.
> Devon
>
> On Wed, Feb 26, 2020 at 3:18 PM Carson Holt <carsonhh at gmail.com> wrote:
>
>> For Intel MPI, export an environment variable right before running
>> MAKER —> "export I_MPI_FABRICS=shm:tcp"
>>
>> Intel MPI has an infiniband segfault issue similar to OpenMPI's when running
>> Perl scripts, but the workaround is different.
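>>
>> For example (a sketch only, reusing the install path and process count from
>> the OpenMPI example further down the thread; adjust both for your system):
>>
>> ```
>> # Intel MPI workaround: force shared-memory/TCP fabrics before launching MAKER
>> export I_MPI_FABRICS=shm:tcp
>> mpiexec -n 28 /packages/maker/3.01.02-beta/bin/maker -base lu -fix_nucleotides
>> ```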
>>
>> —Carson
>>
>>
>> On Feb 26, 2020, at 1:15 PM, Devon O'Rourke <devon.orourke at gmail.com>
>> wrote:
>>
>> Much appreciated Carson,
>> I've submitted a job using the parameters you suggested and will post
>> the outcome. We definitely have two of the three MPI options you described
>> on our cluster (OpenMPI and MPICH2); I'll check on Intel MPI. Happy to
>> advise my cluster admins to use whichever software you prefer (should there
>> be one).
>> Thanks,
>> Devon
>>
>> On Wed, Feb 26, 2020 at 2:54 PM Carson Holt <carsonhh at gmail.com> wrote:
>>
>>> Try adding a few options right after ‘mpiexec’ in your batch
>>> script (this will fix infiniband-related segfaults as well as some
>>> fork-related segfaults) —> --mca btl vader,tcp,self --mca btl_tcp_if_include
>>> ib0 --mca orte_base_help_aggregate 0 --mca btl_openib_want_fork_support 1
>>> --mca mpi_warn_on_fork 0
>>>
>>> Also remove the -q in the maker command to get full command lines for
>>> subprocesses in the STDERR (this lets you run commands outside of MAKER to
>>> test the source of failures if, for example, BLAST or Exonerate is causing
>>> the segfault).
>>>
>>> Example —>
>>> mpiexec --mca btl vader,tcp,self --mca btl_tcp_if_include ib0 --mca
>>> orte_base_help_aggregate 0 --mca btl_openib_want_fork_support 1 --mca
>>> mpi_warn_on_fork 0 -n 28 /packages/maker/3.01.02-beta/bin/maker -base
>>> lu -fix_nucleotides
>>>
>>>
>>> One alternate possibility is that OpenMPI is the problem. I’ve seen a
>>> few systems where it has an issue with Perl itself, and the only way around
>>> it is to install your own version of Perl without threads enabled and
>>> install MAKER with that version of Perl (then OpenMPI seems to be OK
>>> again). If that’s the case, it is often easier to switch to MPICH2 or
>>> Intel MPI as the MPI launcher, if they are available, and then reinstall
>>> MAKER with that MPI flavor.
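>>>
>>> If you do go the non-threaded Perl route, the build is roughly the sketch
>>> below (the Perl version and install prefix are placeholders; the key part
>>> is -Uusethreads):
>>>
>>> ```
>>> # build a Perl without ithreads, then reinstall MAKER against it
>>> wget https://www.cpan.org/src/5.0/perl-5.30.1.tar.gz
>>> tar xzf perl-5.30.1.tar.gz && cd perl-5.30.1
>>> ./Configure -des -Dprefix=$HOME/perl-nothreads -Uusethreads
>>> make && make install
>>> ```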
>>>
>>> —Carson
>>>
>>>
>>>
>>> On Feb 26, 2020, at 12:36 PM, Devon O'Rourke <devon.orourke at gmail.com>
>>> wrote:
>>>
>>> Thanks very much for the reply Carson,
>>> I've attached a few files from the most recently failed run: the shell
>>> script submitted to Slurm, the _opts.ctl file, and the pair of log files
>>> generated from the job. There is a 1a/1b pair of log files because I had
>>> initially set the number of cpus in the _opts.ctl file to "60", then tried
>>> re-running after setting it to "28". Both runs seem to have the same result.
>>> I certainly have access to more memory if needed. I'm using a pretty
>>> typical (I think?) cluster that schedules jobs with Slurm on a Lustre file
>>> system - it's the main high-performance computing center at our university.
>>> I have access to plenty of nodes with about 120-150 GB of RAM and 24-28
>>> cpus each, as well as a handful of higher-memory nodes with about 1.5 TB of
>>> RAM. As I'm writing this email, I've submitted a similar Maker job (i.e.
>>> same fasta/gff inputs) requesting 200 GB of RAM over 32 cpus; if that
>>> fails, I could certainly run again with even more memory.
>>> Appreciate your insights; hope the weather in UT is filled with sun or
>>> snow or both.
>>> Devon
>>>
>>> On Wed, Feb 26, 2020 at 2:10 PM Carson Holt <carsonhh at gmail.com> wrote:
>>>
>>>> If running under MPI, the reason for a failure may be further back in
>>>> the STDERR (failures tend to snowball into other failures, so the initial
>>>> cause is often way back). If you can capture the STDERR and send it, that
>>>> would be the most informative. If it's memory, you can also set all the
>>>> blast depth parameters in maker_bopts.ctl to a value like 20.
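>>>>
>>>> A minimal sketch of both suggestions (the launcher line is abbreviated,
>>>> and the depth_* names are the ones I believe maker_bopts.ctl uses):
>>>>
>>>> ```
>>>> # redirect STDERR to a file you can send back (sketch)
>>>> mpiexec -n 28 maker -base lu -fix_nucleotides 2> maker_run.stderr
>>>>
>>>> # in maker_bopts.ctl, cap the BLAST evidence depth
>>>> depth_blastn=20
>>>> depth_blastx=20
>>>> depth_tblastx=20
>>>> ```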
>>>>
>>>> —Carson
>>>>
>>>>
>>>>
>>>> On Feb 19, 2020, at 1:54 PM, Devon O'Rourke <devon.orourke at gmail.com>
>>>> wrote:
>>>>
>>>> Hello,
>>>>
>>>> I apologize for not posting directly to the archived forum, but it
>>>> appears that the option to enter new posts is disabled. Perhaps that is by
>>>> design so emails go directly to this address. I hope this is what you are
>>>> looking for.
>>>>
>>>> Thank you for your continued support of Maker and your responses to the
>>>> forum posts. I have been running Maker (v3.01.02-beta) to annotate a
>>>> mammalian genome that consists of 22 chromosome-length scaffolds (between
>>>> ~20 and ~200 Mb) and about 10,000 smaller fragments from 10 kb to 1 Mb in
>>>> length. In my various test runs of Maker, the vast majority of the smaller
>>>> fragments are annotated successfully, but nearly all of the large scaffolds
>>>> fail with the same error code in the 'run.log.child.0' file:
>>>> ```
>>>> DIED RANK 0:6:0:0
>>>> DIED COUNT 2
>>>> ```
>>>> (the master 'run.log' file just shows "DIED COUNT 2")
>>>>
>>>> I struggled to find this exact error code anywhere on the forum and was
>>>> hoping you might be able to help me determine where I should start
>>>> troubleshooting. I thought perhaps it was an error concerning memory
>>>> requirements, so I increased the chunk size from the default to a few
>>>> larger sequence lengths (I've tried 1e6, 1e7, and 999,999,999 - all produce
>>>> the same outcome). I've tried running the program with parallel support
>>>> using either OpenMPI or MPICH. I've tried running on a single node using
>>>> 24 cpus and 120 GB of RAM. It always stalls at the same step.
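>>>>
>>>> For concreteness, the chunk setting I've been changing is (I believe) the
>>>> max_dna_len value in maker_opts.ctl, e.g.:
>>>>
>>>> ```
>>>> # maker_opts.ctl (1e6 shown; I also tried 1e7 and 999,999,999)
>>>> max_dna_len=1000000
>>>> ```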
>>>>
>>>> Interestingly, one of the 22 large scaffolds always finishes and
>>>> produces the .maker.proteins.fasta, .maker.transcripts.fasta, and .gff
>>>> files, but the other 21 large scaffolds fail. This makes me think perhaps
>>>> it's not a memory issue?
>>>>
>>>> For both the completed and failed scaffolds, the
>>>> "theVoid.scaffoldX" subdirectories containing the .rb.cat.gz, .rb.out,
>>>> .specific.ori.out, .specific.cat.gz, .specific.out,
>>>> te_proteins*fasta.repeatrunner, est *fasta.blastn, altest
>>>> *fasta.tblastx, and protein *fasta.blastx files are all present (and appear
>>>> finished from what I can tell).
>>>> However, the contents of the parent directory of each
>>>> "theVoid.scaffold" folder differ. For the failed scaffolds, the contents
>>>> generally look something like this (that is, they stall with the same kind
>>>> of files produced):
>>>> ```
>>>> 0
>>>> evidence_0.gff
>>>> query.fasta
>>>> query.masked.fasta
>>>> query.masked.fasta.index
>>>> query.masked.gff
>>>> run.log.child.0
>>>> scaffold22.0.final.section
>>>> scaffold22.0.pred.raw.section
>>>> scaffold22.0.raw.section
>>>> scaffold22.gff.ann
>>>> scaffold22.gff.def
>>>> scaffold22.gff.seq
>>>> ```
>>>>
>>>> For the completed scaffold, there are many more files created:
>>>> ```
>>>> 0
>>>> 10
>>>> 100
>>>> 20
>>>> 30
>>>> 40
>>>> 50
>>>> 60
>>>> 70
>>>> 80
>>>> 90
>>>> evidence_0.gff
>>>> evidence_10.gff
>>>> evidence_1.gff
>>>> evidence_2.gff
>>>> evidence_3.gff
>>>> evidence_4.gff
>>>> evidence_5.gff
>>>> evidence_6.gff
>>>> evidence_7.gff
>>>> evidence_8.gff
>>>> evidence_9.gff
>>>> query.fasta
>>>> query.masked.fasta
>>>> query.masked.fasta.index
>>>> query.masked.gff
>>>> run.log.child.0
>>>> run.log.child.1
>>>> run.log.child.10
>>>> run.log.child.2
>>>> run.log.child.3
>>>> run.log.child.4
>>>> run.log.child.5
>>>> run.log.child.6
>>>> run.log.child.7
>>>> run.log.child.8
>>>> run.log.child.9
>>>> scaffold4.0-1.raw.section
>>>> scaffold4.0.final.section
>>>> scaffold4.0.pred.raw.section
>>>> scaffold4.0.raw.section
>>>> scaffold4.10.final.section
>>>> scaffold4.10.pred.raw.section
>>>> scaffold4.10.raw.section
>>>> scaffold4.1-2.raw.section
>>>> scaffold4.1.final.section
>>>> scaffold4.1.pred.raw.section
>>>> scaffold4.1.raw.section
>>>> scaffold4.2-3.raw.section
>>>> scaffold4.2.final.section
>>>> scaffold4.2.pred.raw.section
>>>> scaffold4.2.raw.section
>>>> scaffold4.3-4.raw.section
>>>> scaffold4.3.final.section
>>>> scaffold4.3.pred.raw.section
>>>> scaffold4.3.raw.section
>>>> scaffold4.4-5.raw.section
>>>> scaffold4.4.final.section
>>>> scaffold4.4.pred.raw.section
>>>> scaffold4.4.raw.section
>>>> scaffold4.5-6.raw.section
>>>> scaffold4.5.final.section
>>>> scaffold4.5.pred.raw.section
>>>> scaffold4.5.raw.section
>>>> scaffold4.6-7.raw.section
>>>> scaffold4.6.final.section
>>>> scaffold4.6.pred.raw.section
>>>> scaffold4.6.raw.section
>>>> scaffold4.7-8.raw.section
>>>> scaffold4.7.final.section
>>>> scaffold4.7.pred.raw.section
>>>> scaffold4.7.raw.section
>>>> scaffold4.8-9.raw.section
>>>> scaffold4.8.final.section
>>>> scaffold4.8.pred.raw.section
>>>> scaffold4.8.raw.section
>>>> scaffold4.9-10.raw.section
>>>> scaffold4.9.final.section
>>>> scaffold4.9.pred.raw.section
>>>> scaffold4.9.raw.section
>>>> ```
>>>>
>>>> Thanks for any troubleshooting tips you can offer.
>>>>
>>>> Cheers,
>>>> Devon
>>>>
>>>> --
>>>> Devon O'Rourke
>>>> Postdoctoral researcher, Northern Arizona University
>>>> Lab of Jeffrey T. Foster - https://fozlab.weebly.com/
>>>> twitter: @thesciencedork
>>>> _______________________________________________
>>>> maker-devel mailing list
>>>> maker-devel at yandell-lab.org
>>>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
>>>>
>>>>
>>>>
>>>
>>> --
>>> Devon O'Rourke
>>> Postdoctoral researcher, Northern Arizona University
>>> Lab of Jeffrey T. Foster - https://fozlab.weebly.com/
>>> twitter: @thesciencedork
>>> <fail-1a.log.gz><fail-1b.log.gz><run1_maker_opts.ctl><run1_slurm.sh>
>>>
>>>
>>>
>>
>> --
>> Devon O'Rourke
>> Postdoctoral researcher, Northern Arizona University
>> Lab of Jeffrey T. Foster - https://fozlab.weebly.com/
>> twitter: @thesciencedork
>>
>>
>>
>
> --
> Devon O'Rourke
> Postdoctoral researcher, Northern Arizona University
> Lab of Jeffrey T. Foster - https://fozlab.weebly.com/
> twitter: @thesciencedork
>


-- 
Devon O'Rourke
Postdoctoral researcher, Northern Arizona University
Lab of Jeffrey T. Foster - https://fozlab.weebly.com/
twitter: @thesciencedork