[maker-devel] MAKER processing time in a 2Gb genome

Ganko Eric USRE eric.ganko at syngenta.com
Thu Jul 9 14:36:59 MDT 2015


Tuesday I ran the same option files, this time with 480 cores, and the annotation completed in ~6 hours. Perhaps I was attempting too many simultaneous writes at the higher core counts, or there was too much MPI communication as you mentioned. Thanks for the input on the RAM disk.

-eric


From: Carson Holt [mailto:carsonhh at gmail.com]
Sent: Thursday, July 09, 2015 3:01 PM
To: Ganko Eric USRE
Cc: maker-devel at yandell-lab.org
Subject: Re: [maker-devel] MAKER processing time in a 2Gb genome

Runtimes are the result of gene density, evidence dataset size, and evidence dataset type. For example, protein data takes ~10 times longer to process than EST data, and alt-EST data takes ~10 times longer than protein data. If you double the size of the input datasets, you double the runtime. Assembly size itself doesn’t seem to have a large effect on runtime; it tends to be gene density that matters most, so a 2Gb assembly runs only somewhat slower than a 300Mb assembly containing the same number of genes.
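
As a rough back-of-envelope sketch of that scaling (hypothetical evidence sizes of my own; only the 1x/10x/100x relative weights come from the rule of thumb above):

  # Back-of-envelope only: hypothetical evidence sizes in Mb; weights follow the
  # rule of thumb above (EST = 1x, protein ~10x EST, alt-EST ~10x protein).
  awk 'BEGIN {
    est = 50; protein = 40; altest = 20;      # hypothetical dataset sizes (Mb)
    work = est*1 + protein*10 + altest*100;   # relative work units
    printf "relative work: %d units; doubling any one input doubles its term\n", work
  }'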

For best MPI performance, you can submit multiple jobs with 200 CPUs or less. Over 200 CPUs per job tends to give limited throughput gains due to MPI communication overhead. I never use a RAM disk; in general MAKER produces too many temporary files to fit in RAM.
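
A minimal submission sketch, assuming an MPICH environment, a MAKER build with MPI support, and the three default control files already in the working directory (the project path and -base name are placeholders, and scheduler headers are omitted since they vary by site):

  #!/bin/bash
  # One of several independent ~200-CPU jobs instead of a single 720-core job;
  # mpiexec -n sets the CPU count for this job.
  cd /path/to/maker_run                        # placeholder project directory
  mpiexec -n 200 maker -base corn_run maker_opts.ctl maker_bopts.ctl maker_exe.ctl

Several such jobs can be submitted side by side to work through the contig set; treat the exact flags as a sketch rather than a drop-in script.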

—Carson



On Jul 6, 2015, at 9:37 AM, Ganko Eric USRE <eric.ganko at syngenta.com> wrote:

I’m hoping for some advice on an unexpectedly long processing time for a 2Gb genome. I’m currently using an install of MAKER-P on the iForge system at NCSA, and I’ve successfully run ~1Gb genomes in 2-3 hours across 20 nodes (24 Intel "Haswell" cores and 64 GB of RAM per node) via MPICH.

I recently ran some tests on 50Mb of corn that took ~2 hours on 2 nodes (48 cores). Based on that, I was surprised when the full 2Gb corn genome run timed out at >24h with 30 nodes (720 cores); in that time it hadn’t processed many sequences, based on the master_datastore_index.log (a quick way to tally these statuses is sketched below the counts):

TOTAL: 25000 seqs
STARTED: 3594
FINISHED: 2979
FAILED: 10
RETRY: 9
DIED_SKIPPED_PERMANENT: 0
SKIPPED_SMALL: 7635
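
For reference, counts like these can be tallied straight from the log with something along these lines (a sketch only; it assumes the status tag is the last tab-delimited field on each line, and the log path shown is a placeholder for the <base>.maker.output location):

  # Count each status tag in the master index log.
  awk '{ print $NF }' corn.maker.output/corn_master_datastore_index.log | sort | uniq -c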

While I can set a longer wall clock, these results are several times slower than what was reported in the MAKER-P paper, i.e. running the corn B73 genome in less than 4 hours; here it is not close to done after 24h. I don’t have an enormous amount of supporting data: this trial run has ~100k transcripts and another ~100k proteins. Corn has a very high repeat content, so my suspicion is RepeatMasker I/O.

In discussions with the iForge admins I learned that the temp space is network attached (GPFS), and they’ve suggested using a RAM disk (i.e. /dev/shm) as the temp directory. In tests on a smaller sequence that actually ran a little slower, so I’m not sure MAKER is meant to run that way. I’d appreciate input from anyone with experience using a RAM disk this way, or any alternative thoughts or suggestions.
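
For what it’s worth, the change I have in mind is small; a sketch, assuming maker_opts.ctl exposes a TMP= setting and that the scheduler provides a job ID variable (the directory name and $SLURM_JOB_ID here are placeholders):

  # Stage a per-job temp area on the RAM disk and point MAKER's TMP= at it.
  # Note /dev/shm counts against node RAM, so it competes with MAKER's own memory use.
  mkdir -p /dev/shm/maker_tmp_$SLURM_JOB_ID
  sed -i "s|^TMP=.*|TMP=/dev/shm/maker_tmp_$SLURM_JOB_ID|" maker_opts.ctl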

Thanks,
Eric
