[maker-devel] Some errors reported by Maker2
Carson Holt
carsonhh at gmail.com
Wed Sep 6 10:06:46 MDT 2017
> (2) By reading some of your replies in the maker google group, and I noticed that it can reduce memory and save time for annotation if I set depth_blast to a certain number. So I changed the following parameters. But I wonder, whether it will decrease the quality of annotation? If it won't affect the quality, can I even use a smaller number (e.g., 20) to save more memory and time?
>
> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
These values really only affect the final evidence kept in the GFF3 when you view it in a browser. They have no effect on the annotation itself, because internally MAKER already collapses evidence down to the 10 best non-redundant features per evidence set per locus; the rest are put in the GFF3 just for reference. By setting the cutoffs lower, you are just letting MAKER know it can throw things away even sooner since you don’t want them in the GFF3. It provides a minor improvement for memory use, but max_dna_length is the parameter with the greatest effect.
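For reference, max_dna_length is also set in maker_opts.ctl; a minimal sketch (the value here is just illustrative, and the comment is my paraphrase):

```
max_dna_length=100000  #window size for dividing contigs into chunks; lower it to reduce peak memory
```

Smaller windows mean smaller chunks held in memory at once, at the cost of more chunks to process.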
> (3) I also have some concerns about the speed, especially for the long scaffolds (around 100Mb). I wonder which part is the most time consuming for genome annotation (repeat masking, blast, or polishing?). Particularly, I wonder whether the blastx of protein evidence will take majority of time. Now, I have prepared 99k mammalian Swiss protein sequences and 340k rodent TrEMBL protein sequences as protein evidences. I am considering whether I can save much time if I only use the 99k mammalian Swiss protein sequences as evidences.
BLASTN (ESTs) -> fastest as it is searching nucleotide space
BLASTX (proteins) -> must search 6 reading frames so will be at least 6 times slower than BLASTN
TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at least 12 times slower than BLASTN and twice as slow as BLASTX
Also, doubling the dataset size doubles the runtime. Larger window sizes via max_dna_length will also increase runtimes.
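To put rough numbers on your Swiss-Prot vs. Swiss-Prot+TrEMBL question: if BLASTX runtime scales roughly linearly with database size (an assumption on my part), dropping the 340k TrEMBL sequences cuts the protein-search workload by a bit more than a factor of 4. A quick back-of-envelope check:

```shell
# Ratio of (Swiss-Prot + TrEMBL) to Swiss-Prot-only database sizes;
# under linear scaling this approximates the BLASTX speedup from
# dropping TrEMBL.  99k and 340k are the counts from the question.
awk 'BEGIN { printf "%.1f\n", (99000 + 340000) / 99000 }'   # prints 4.4
```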
> (4) For some reasons, I can not run maker though MPI on our cluster. So I can only start multiple maker. I wonder if it is possible to let multiple maker to annotate the same long scaffold (i.e., for a single sequence I start multiple maker, without splitting the long sequence into shorter ones).
Without MPI you won’t be able to split up large contigs. At the very least you can run on a single node and set MPI to use all CPUs on that node; that is less difficult to set up than cross-node jobs via MPI.
> (5) Still about the speed issue. I read some of your comments about the "cpus" parameter in the maker_opts file (http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html). And I know it indicates the number of cpus for a single chunk. So if I set "cpus=2" in the maker_opts file, then I can use the following command to submit the job, right?
The cpus parameter only affects how many CPUs are given to the BLAST command line, so only the BLAST step will speed up. I recommend using MPI instead so that all steps speed up. Even if you are only running on a single node, you can give all CPUs to the mpiexec command.
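For example, a single-node MPI launch might look like the following (the core count is a placeholder for whatever your node has, and whether you use mpiexec or mpirun depends on your MPI installation):

```
# Run MAKER under MPI on one node, giving all cores to mpiexec
mpiexec -n 32 maker maker_opts.ctl maker_bopts.ctl maker_exe.ctl
```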
—Carson