[maker-devel] short scaffolds finish, long scaffolds (almost always) fail
Devon O'Rourke
devon.orourke at gmail.com
Wed Feb 26 13:15:08 MST 2020
Much appreciated Carson,
I've submitted a job using the parameters you've suggested and will post
the outcome. We definitely have two of the three MPI options you've described
on our cluster (OpenMPI and MPICH2); I'll check on Intel MPI. Happy to
advise my cluster admins to use whichever software you prefer (should there
be a preference).
Thanks,
Devon
On Wed, Feb 26, 2020 at 2:54 PM Carson Holt <carsonhh at gmail.com> wrote:
> Try adding a few options right after ‘mpiexec’ in your batch script
> (this will fix infiniband-related segfaults as well as some fork-related
> segfaults) —> --mca btl vader,tcp,self --mca btl_tcp_if_include ib0 --mca
> orte_base_help_aggregate 0 --mca btl_openib_want_fork_support 1 --mca
> mpi_warn_on_fork 0
>
> Also remove the -q in the maker command to get full command lines for
> subprocesses in the STDERR (this lets you run those commands outside of
> MAKER to test the source of failures if, for example, BLAST or Exonerate is
> causing the segfault).
>
> Example —>
> mpiexec --mca btl vader,tcp,self --mca btl_tcp_if_include ib0 --mca
> orte_base_help_aggregate 0 --mca btl_openib_want_fork_support 1 --mca
> mpi_warn_on_fork 0 -n 28 /packages/maker/3.01.02-beta/bin/maker -base lu
> -fix_nucleotides
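>
> For instance, a hypothetical rerun of one of those subprocess commands
> (copied verbatim from the un-quieted STDERR; the paths below are only
> placeholders) might look like:
>
> ```
> # rerun a single subprocess command outside MAKER to see whether it
> # segfaults on its own; substitute the exact command MAKER printed
> /path/to/blastn -query /path/to/query_chunk.fasta \
>                 -db /path/to/est_db -num_threads 1 -out /tmp/test.blastn
> echo "exit status: $?"   # a 'Segmentation fault' or non-zero status implicates the tool, not MAKER
> ```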
>
>
> One alternate possibility is that OpenMPI itself is the problem. I’ve seen a
> few systems where it has an issue with Perl, and the only way to get
> around it is to install your own version of Perl without Perl threads
> enabled and install MAKER with that version of Perl (then OpenMPI seems to
> be ok again). If that’s the case, it is often easier to switch to MPICH2 or
> Intel MPI as the MPI launcher, if either is available, and then reinstall
> MAKER with that MPI flavor.
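>
> As a rough sketch of that workaround (assuming perlbrew is available and
> your cluster exposes MPICH2 through a module named ‘mpich2’; versions,
> module names, and paths are placeholders):
>
> ```
> # build a Perl without ithreads (perlbrew's default unless --thread is passed)
> perlbrew install perl-5.26.2
> perlbrew use perl-5.26.2
>
> # put the MPICH2 (or Intel MPI) wrappers on PATH before configuring MAKER
> module load mpich2
>
> # rebuild MAKER against the new Perl and MPI flavor
> cd maker/src
> perl Build.PL          # say yes when asked about configuring MAKER with MPI support
> ./Build installdeps
> ./Build install
> ```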
>
> —Carson
>
>
>
> On Feb 26, 2020, at 12:36 PM, Devon O'Rourke <devon.orourke at gmail.com>
> wrote:
>
> Thanks very much for the reply Carson,
> I've attached a few files from the most recently failed run: the shell
> script submitted to Slurm, the _opts.ctl file, and the pair of log files
> generated from the job. The reason there is a 1a and 1b pair of files is
> that I had initially set the number of cpus in the _opts.ctl file to "60",
> but then tried re-running it after setting it to "28". Both runs ended with
> the same result.
> I certainly have access to more memory if needed. I'm using a pretty
> typical (I think?) cluster that controls jobs with Slurm on a Lustre
> file system - it's the main high performance computing center at our
> university. I have access to plenty of nodes with about 120-150 GB of
> RAM and 24-28 cpus each, as well as a handful of higher-memory
> nodes with about 1.5 TB of RAM. As I'm writing this email, I've submitted a
> similar Maker job (i.e. same fasta/gff inputs) requesting 200 GB of RAM over
> 32 cpus; if that fails, I could certainly run again with even more memory.
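>
> For reference, the resource request in that batch script roughly follows
> the shape below (a sketch only; the partition, module name, and walltime
> are placeholders for our site-specific values, and the maker call mirrors
> the one quoted earlier in the thread):
>
> ```
> #!/bin/bash
> #SBATCH --job-name=maker_lu
> #SBATCH --ntasks=32              # one MPI rank per task
> #SBATCH --mem-per-cpu=6400M      # 32 x 6.4 GB = ~200 GB in aggregate
> #SBATCH --time=7-00:00:00        # placeholder walltime
> #SBATCH --partition=standard     # placeholder partition name
>
> module load openmpi              # placeholder module name
> mpiexec -n $SLURM_NTASKS /packages/maker/3.01.02-beta/bin/maker -base lu -fix_nucleotides
> ```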
> Appreciate your insights; hope the weather in UT is filled with sun or
> snow or both.
> Devon
>
> On Wed, Feb 26, 2020 at 2:10 PM Carson Holt <carsonhh at gmail.com> wrote:
>
>> If running under MPI, the reason for a failure may be further back in the
>> STDERR (failures tend to snowball into other failures, so the initial cause
>> is often way back). If you can capture the STDERR and send it, that would be
>> the most informative. If it's memory, you can also set all the blast depth
>> parameters in maker_bopts.ctl to a value like 20.
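>>
>> For example, the depth settings in maker_bopts.ctl would end up looking
>> roughly like this (a sketch; the comments paraphrase the stock control
>> file, and if your file carries additional depth_* lines, set them the same
>> way):
>>
>> ```
>> depth_blastn=20   #Blastn depth cutoff (0 to disable cutoff)
>> depth_blastx=20   #Blastx depth cutoff (0 to disable cutoff)
>> depth_tblastx=20  #tBlastx depth cutoff (0 to disable cutoff)
>> ```
>>
>> Capturing STDERR is just a matter of redirecting it in the batch script
>> (e.g. `mpiexec ... maker ... 2> maker_run.err`) or pointing Slurm at a
>> file with `#SBATCH --error=`.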
>>
>> —Carson
>>
>>
>>
>> On Feb 19, 2020, at 1:54 PM, Devon O'Rourke <devon.orourke at gmail.com>
>> wrote:
>>
>> Hello,
>>
>> I apologize for not posting directly to the archived forum, but it appears
>> that the option to enter new posts is disabled. Perhaps this is by design,
>> so that emails go directly to this address. I hope this is what you are
>> looking for.
>>
>> Thank you for your continued support of Maker and your responses to the
>> forum posts. I have been running Maker (V3.01.02-beta) to annotate a
>> mammalian genome that consists of 22 chromosome-length scaffolds (roughly
>> 20-200 Mb each) and about 10,000 smaller fragments from 10 kb to 1 Mb in
>> length. In my various test runs of Maker, the vast majority of the smaller
>> fragments are annotated successfully, but nearly all the large scaffolds
>> fail with the same error code when I look at the 'run.log.child.0' file:
>> ```
>> DIED RANK 0:6:0:0
>> DIED COUNT 2
>> ```
>> (the master 'run.log' file just shows "DIED COUNT 2")
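>>
>> (A quick way to tally which contigs hit this - assuming the default
>> datastore layout and a placeholder base name of "lu" - is a recursive grep
>> for the DIED marker:)
>>
>> ```
>> # list every run.log that recorded a DIED marker; the path is a placeholder
>> # for the actual <base>.maker.output/<base>_datastore directory
>> grep -rl "^DIED" lu.maker.output/lu_datastore/ | sort
>> ```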
>>
>> I struggled to find this exact error code anywhere on the forum and was
>> hoping you might be able to help me determine where I should start
>> troubleshooting. I thought perhaps it was an error concerning memory
>> requirements, so I altered the chunk size from the default to a few larger
>> values (I've tried 1e6, 1e7, and 999,999,999 - all produce the
>> same outcome). I've tried running the program with parallel support using
>> either OpenMPI or MPICH. I've tried running on a single node using 24 cpus
>> and 120 GB of RAM. It always stalls at the same step.
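>>
>> (By "chunk size" above I mean the max_dna_len setting in maker_opts.ctl -
>> assuming that is indeed the parameter that controls how contigs are divided
>> up - i.e. values like:)
>>
>> ```
>> max_dna_len=1000000   #length for dividing up contigs into chunks (affects memory usage)
>> ```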
>>
>> Interestingly, one of the 22 large scaffolds always finishes and produces
>> the .maker.proteins.fasta, .maker.transcripts.fasta, and .gff files, but
>> the other 21 of 22 large scaffolds fail. This makes me think perhaps it's
>> not a memory issue?
>>
>> In the case of both the completed and failed scaffolds, the
>> "theVoid.scaffoldX" subdirectories contain the .rb.cat.gz, .rb.out,
>> .specific.ori.out, .specific.cat.gz, .specific.out,
>> te_proteins*fasta.repeatrunner, est *fasta.blastn, altest
>> *fasta.tblastx, and protein *fasta.blastx files (and all appear
>> finished from what I can tell).
>> However, the contents of the parent directory of each
>> "theVoid.scaffold" folder differ. For the failed scaffolds, the contents
>> generally look something like this (that is, they stall with the
>> same set of files produced):
>> ```
>> 0
>> evidence_0.gff
>> query.fasta
>> query.masked.fasta
>> query.masked.fasta.index
>> query.masked.gff
>> run.log.child.0
>> scaffold22.0.final.section
>> scaffold22.0.pred.raw.section
>> scaffold22.0.raw.section
>> scaffold22.gff.ann
>> scaffold22.gff.def
>> scaffold22.gff.seq
>> ```
>>
>> For the completed scaffold, there are many more files created:
>> ```
>> 0
>> 10
>> 100
>> 20
>> 30
>> 40
>> 50
>> 60
>> 70
>> 80
>> 90
>> evidence_0.gff
>> evidence_10.gff
>> evidence_1.gff
>> evidence_2.gff
>> evidence_3.gff
>> evidence_4.gff
>> evidence_5.gff
>> evidence_6.gff
>> evidence_7.gff
>> evidence_8.gff
>> evidence_9.gff
>> query.fasta
>> query.masked.fasta
>> query.masked.fasta.index
>> query.masked.gff
>> run.log.child.0
>> run.log.child.1
>> run.log.child.10
>> run.log.child.2
>> run.log.child.3
>> run.log.child.4
>> run.log.child.5
>> run.log.child.6
>> run.log.child.7
>> run.log.child.8
>> run.log.child.9
>> scaffold4.0-1.raw.section
>> scaffold4.0.final.section
>> scaffold4.0.pred.raw.section
>> scaffold4.0.raw.section
>> scaffold4.10.final.section
>> scaffold4.10.pred.raw.section
>> scaffold4.10.raw.section
>> scaffold4.1-2.raw.section
>> scaffold4.1.final.section
>> scaffold4.1.pred.raw.section
>> scaffold4.1.raw.section
>> scaffold4.2-3.raw.section
>> scaffold4.2.final.section
>> scaffold4.2.pred.raw.section
>> scaffold4.2.raw.section
>> scaffold4.3-4.raw.section
>> scaffold4.3.final.section
>> scaffold4.3.pred.raw.section
>> scaffold4.3.raw.section
>> scaffold4.4-5.raw.section
>> scaffold4.4.final.section
>> scaffold4.4.pred.raw.section
>> scaffold4.4.raw.section
>> scaffold4.5-6.raw.section
>> scaffold4.5.final.section
>> scaffold4.5.pred.raw.section
>> scaffold4.5.raw.section
>> scaffold4.6-7.raw.section
>> scaffold4.6.final.section
>> scaffold4.6.pred.raw.section
>> scaffold4.6.raw.section
>> scaffold4.7-8.raw.section
>> scaffold4.7.final.section
>> scaffold4.7.pred.raw.section
>> scaffold4.7.raw.section
>> scaffold4.8-9.raw.section
>> scaffold4.8.final.section
>> scaffold4.8.pred.raw.section
>> scaffold4.8.raw.section
>> scaffold4.9-10.raw.section
>> scaffold4.9.final.section
>> scaffold4.9.pred.raw.section
>> scaffold4.9.raw.section
>> ```
>>
>> Thanks for any troubleshooting tips you can offer.
>>
>> Cheers,
>> Devon
>>
>> --
>> Devon O'Rourke
>> Postdoctoral researcher, Northern Arizona University
>> Lab of Jeffrey T. Foster - https://fozlab.weebly.com/
>> twitter: @thesciencedork
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at yandell-lab.org
>> http://yandell-lab.org/mailman/listinfo/maker-devel_yandell-lab.org
>>
>>
>>
>
> --
> Devon O'Rourke
> Postdoctoral researcher, Northern Arizona University
> Lab of Jeffrey T. Foster - https://fozlab.weebly.com/
> twitter: @thesciencedork
> <fail-1a.log.gz><fail-1b.log.gz><run1_maker_opts.ctl><run1_slurm.sh>
>
>
>
--
Devon O'Rourke
Postdoctoral researcher, Northern Arizona University
Lab of Jeffrey T. Foster - https://fozlab.weebly.com/
twitter: @thesciencedork