[maker-devel] Some errors reported by Maker2
Quanwei Zhang
qwzhang0601 at gmail.com
Wed Sep 13 14:26:11 MDT 2017
Dear Carson:
I will take a look and try it. Thank you.
Best
Quanwei
2017-09-13 16:21 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
> One final thought. If you are using rmblast as part of the RepeatMasker
> installation, it may be suffering from a bug that some BLAST versions have,
> one that can sometimes lead to truncation of a BLAST report (an example of a
> separate error related to BLAST report truncation is here) —>
> https://groups.google.com/forum/#!topic/maker-devel/96KGgiQMcxQ
>
> As a result there is a special update to rmblast —>
> http://www.repeatmasker.org/RMBlast.html
>
> So if you are not using the update, try it; but if you are using the update
> and it is giving the error, roll back to rmblast 2.2.28 (i.e. the update
> may be the cause or the cure of RepeatMasker errors).
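To see which rmblastn build is actually in use before deciding whether to update or roll back, a small version-parsing sketch like this may help (the helper name and the sample version strings are illustrative, not from this thread):

```shell
# Extract the x.y.z version number from `rmblastn -version` style output.
# parse_rmblast_version is a hypothetical helper shown for illustration.
parse_rmblast_version() {
  printf '%s\n' "$1" | sed -n 's/.*\([0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# On a real system you would feed it: rmblastn -version | head -n1
parse_rmblast_version "rmblastn: 2.2.27+"   # -> 2.2.27
parse_rmblast_version "rmblastn: 2.2.28+"   # -> 2.2.28
```

Comparing the printed version against 2.2.27 / 2.2.28 tells you whether you are on the updated RMBlast or the rollback candidate.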
>
> —Carson
>
>
>
> On Sep 13, 2017, at 1:42 PM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>
> Dear Carson:
>
> Thank you for your explanation, and sorry for not describing my problem
> clearly. The first two errors were all gone after I changed the parameters
> you suggested (e.g., max_dna_len, depth_blast). Now I only get the
> following error for two contigs out of thousands. One of the two failed
> contigs is 863 kb long, and I have done more tests on it individually.
> When I ran RepeatMasker on this contig, 65% was masked using the
> species-specific repeat library, but only 35% using the mammalian repeat
> library. Since much longer contigs (even 98 Mb) can all be annotated, I am
> not sure why this much shorter one would fail due to IO.
>
> I did not set "TMP", and I am running on a high-performance cluster. I am
> not sure whether the temporary directory is a virtual location or not; I
> will check it later. Many thanks
>
> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm
> line 188.
> 33708 --> rank=NA, hostname=n409
> 33709 ERROR: Failed while processing all repeats
> 33710 ERROR: Chunk failed at level:3, tier_type:1
> 33711 FAILED CONTIG:Contig31
>
> Best
> Quanwei
>
> 2017-09-13 14:23 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>
>> These are the 3 errors you have shown in your e-mails —>
>> open3: fork failed: Cannot allocate memory at
>> /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Widget/blastx.pm line 40.
>> Can't kill a non-numeric process ID at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm
>> line 1050.
>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm
>> line 188.
>>
>> The first two are memory related; the second occurs because MAKER cannot
>> kill a lock-maintainer thread that it was never able to start due to lack
>> of memory.
>>
>> The third one is IO related. It is a truncated file that succeeded on the
>> second try according to the e-mail you sent.
>>
>>
>> IO errors are quite common with NFS (network-mounted file systems); they
>> are among the most frequent issues submitted to the devel list. MAKER can
>> hit IO limits long before it hits CPU limits. One of the most frequent
>> causes is that the user set TMP= in the control files to a manual location
>> that is not suitable for high IO (note that TMP= defaults to /tmp). The
>> location should always be a true locally mounted disk. Sometimes it is a
>> virtual location (not really a local disk but a network-mounted disk or an
>> in-memory location). With a network mount you will get frequent IO
>> failures, and with an in-memory location you will also get out-of-memory
>> issues.
>>
>> Note that when you supply more data files you will also use more memory
>> (to hold analysis results). According to your e-mail the last error you got
>> was 'Can't kill a non-numeric process ID’. Correct? So getting the error
>> with two input files but not when you supply a single input file further
>> suggests you are running low on RAM.
>>
>> Some things to check:
>> 1. Make sure TMP= is not being set to a network-mounted location.
>> 2. Make sure your temporary directory is not a virtual in-memory
>> directory on the node being used.
>> 3. If nodes are shared, you may run out of memory because of other users
>> or because you failed to request enough RAM during job submission.
>>
>> Finally, try running interactively so you can see what the memory and
>> directory locations look like on the node you get assigned for the job
>> (check space and mount points: is /tmp, or wherever you set TMP=, in fact
>> a local disk?). Also run with MPI rather than starting multiple MAKER
>> instances; it uses resources better.
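A rough sketch of those interactive checks on the assigned node (the fallback to /tmp matches MAKER's default when TMP= is unset; the warning patterns are my assumptions, not from this thread):

```shell
# Check whether the MAKER temp location is a real local disk.
# Falls back to /tmp, MAKER's default when TMP= is unset.
tmp="${TMPDIR:-/tmp}"

# Filesystem type of the mount backing $tmp (column 2 of `df -PT`).
fstype=$(df -PT "$tmp" | awk 'NR==2 {print $2}')
echo "$tmp is on filesystem type: $fstype"

case "$fstype" in
  nfs*|cifs*|lustre*) echo "WARNING: network mount - expect IO failures" ;;
  tmpfs|ramfs)        echo "WARNING: in-memory - expect out-of-memory errors" ;;
esac

# Free space on the same mount.
df -Ph "$tmp" | awk 'NR==2 {print "free space:", $4}'
```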
>>
>> Thanks,
>> Carson
>>
>>
>>
>>
>>
>>
>> On Sep 13, 2017, at 8:32 AM, Quanwei Zhang <qwzhang0601 at gmail.com> wrote:
>>
>> Dear Carson:
>>
>> I did more tests on one of the contigs (863 kb long) that failed during
>> repeat masking. It only fails when I add the species-specific repeat
>> library; it is annotated successfully when only the mammalian repeat
>> library is used. For the test I picked only this contig and ran MAKER
>> with 64 GB of memory, so I do not think the failure is a memory or IO
>> problem, because even contigs of length 98 Mb can be annotated with 32 GB.
>>
>> I also ran RepeatMasker on this contig with the mammalian and the
>> species-specific repeat libraries separately. With the mammalian repeat
>> library about 35% was masked as repeats, while with the species-specific
>> repeat library it is 65% (as shown below). I wonder whether the high
>> level of repeats can lead to the failure of this contig. Do you have any
>> ideas about this? Thanks
>>
>>
>>
>> file name: test_scaffold31.fasta
>> sequences: 1
>> total length: 863590 bp (858757 bp excl N/X-runs)
>> GC level: 37.02 %
>> bases masked: 562909 bp ( 65.18 %)
>> ==================================================
>> number of length percentage
>> elements* occupied of sequence
>> --------------------------------------------------
>> SINEs: 113 16134 bp 1.87 %
>> ALUs 71 12479 bp 1.45 %
>> MIRs 1 133 bp 0.02 %
>>
>> LINEs: 251 380142 bp 44.02 %
>> LINE1 211 210623 bp 24.39 %
>> LINE2 1 86 bp 0.01 %
>> L3/CR1 0 0 bp 0.00 %
>>
>> LTR elements: 246 101221 bp 11.72 %
>> ERVL 5 1037 bp 0.12 %
>> ERVL-MaLRs 18 2744 bp 0.32 %
>> ERV_classI 201 90942 bp 10.53 %
>> ERV_classII 18 5964 bp 0.69 %
>>
>> DNA elements: 39 14177 bp 1.64 %
>> hAT-Charlie 7 3864 bp 0.45 %
>> TcMar-Tigger 7 1706 bp 0.20 %
>>
>> Unclassified: 196 45831 bp 5.31 %
>>
>> Total interspersed repeats: 557505 bp 64.56 %
>>
>>
>> Small RNA: 3 823 bp 0.10 %
>>
>> Satellites: 2 237 bp 0.03 %
>> Simple repeats: 94 4472 bp 0.52 %
>> Low complexity: 18 766 bp 0.09 %
>> ==================================================
>>
>> * most repeats fragmented by insertions or deletions
>> have been counted as one element
>>
>>
>> The query species was assumed to be homo
>> RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
>>
>> run with rmblastn version 2.2.27+
>> The query was compared to classified sequences in
>> ".../consensi.fa.classifiednoProtFinal"
>>
>>
>> Best
>> Quanwei
>>
>> 2017-09-11 14:33 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>
>>> Dear Carson:
>>>
>>> I see. Thank you. I will try it.
>>>
>>> Best
>>> Quanwei
>>>
>>> 2017-09-11 13:46 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>
>>>> Each node is a single machine. Because you currently run without MPI,
>>>> each MAKER job you submit runs on a single machine. So you are either
>>>> running multiple times on the same node, or you submitted 5 separate batch
>>>> jobs in which case you may have a single maker process on each of 5 nodes.
>>>>
>>>> MPI can parallelize on the same node or across nodes. If you request 10
>>>> nodes, then it can communicate across nodes to run the job on all hardware.
>>>> Or you can run MPI on a single node and ask for all CPUs on that node. In
>>>> that case it will split up work within a single node and use all resources
>>>> just on that node. So if you can’t get MPI to work across nodes, you can
>>>> just submit a job that goes to a single node and ask for all CPUs on that
>>>> node (multinode jobs may be hard to configure, but single node jobs are
>>>> very easy). Just set the -n parameter of mpiexec to the CPU count of that
>>>> node, and it will parallelize within the node.
>>>>
>>>> Example command for a 20 CPU node —> mpiexec -n 20 maker
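On a Univa Grid Engine cluster like the one described earlier in the thread, that single-node command might be wrapped in a submission script along these lines (the parallel-environment name, slot count, and memory request are site-specific placeholders, not values from this thread):

```shell
#!/bin/bash
# Hypothetical UGE/SGE submission script for a single-node MPI MAKER run.
#$ -N maker_mpi
#$ -pe smp 20        # reserve all 20 slots (CPUs) on one node; PE name varies by site
#$ -l h_vmem=6G      # per-slot memory request; adjust to your data
#$ -cwd

# -n matches the slot count above so MAKER parallelizes within the node.
mpiexec -n 20 maker
```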
>>>>
>>>> —Carson
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sep 11, 2017, at 11:27 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>> wrote:
>>>>
>>>> Dear Carson:
>>>>
>>>> Would you please explain what you mean by "a single machine"? I am
>>>> running MAKER2 on our high-performance cluster. The cluster has more than
>>>> 1,620-core compute nodes with 128 GB RAM each, and Univa Grid Engine is
>>>> used as the scheduler. Can I use MPICH3?
>>>>
>>>> Thanks
>>>>
>>>> Best
>>>> Quanwei
>>>>
>>>> 2017-09-11 13:18 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>
>>>>> If you are just using a single machine (and not cross machine MPI),
>>>>> use MPICH3 —> https://www.mpich.org
>>>>>
>>>>> It’s easy to install yourself, and tends to be very robust to failure.
>>>>>
>>>>> —Carson
>>>>>
>>>>>
>>>>>
>>>>> On Sep 11, 2017, at 11:16 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>>> wrote:
>>>>>
>>>>> Dear Carson:
>>>>>
>>>>> I ran into some problems using MPI. I will give it another try.
>>>>> Thank you!
>>>>>
>>>>> Best
>>>>> Quanwei
>>>>>
>>>>> 2017-09-11 13:14 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>
>>>>>> It could be either. Please use MPI instead of starting multiple
>>>>>> instances. It will greatly reduce both IO and RAM usage.
>>>>>>
>>>>>> —Carson
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sep 11, 2017, at 11:12 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Dear Carson:
>>>>>>
>>>>>> I only run 5 MAKER instances in each directory (and set cpus=2). If
>>>>>> this is a memory issue or an IO issue, I am not sure why scaffolds
>>>>>> much longer than the failed ones were all annotated successfully while
>>>>>> the relatively shorter ones failed.
>>>>>>
>>>>>> I have set "tries=5" (the number of times to try a contig if there is
>>>>>> a failure for some reason). I will try "clean_try=1" and test the
>>>>>> failed scaffolds individually with more memory to see whether they can
>>>>>> be annotated.
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> Best
>>>>>> Quanwei
>>>>>>
>>>>>> 2017-09-11 13:07 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>>
>>>>>>> I think the cause of the error may have been a little further
>>>>>>> upstream from what you pasted in the e-mail. One thing that may be
>>>>>>> happening is that you are taxing resources (like IO) by running MAKER
>>>>>>> multiple times or on too many CPUs. That can lead to failures such as
>>>>>>> truncated BLAST reports, in which case you can just retry to get
>>>>>>> around those types of IO-derived errors. MAKER can generate a lot of
>>>>>>> IO, and network-mounted locations (i.e. storage that is actually
>>>>>>> accessed across the network) can be less robust than local storage
>>>>>>> (when under heavy load, NFS can falsely report success on read/write
>>>>>>> operations that actually failed). It’s the reason we built the retry
>>>>>>> capabilities into MAKER.
>>>>>>>
>>>>>>> For contigs that continuously fail, you may need to set clean_try=1.
>>>>>>> That makes a retry start from scratch (i.e. delete all old reports on
>>>>>>> failure rather than just those suspected of being truncated).
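In maker_opts.ctl, that combination would look roughly like the fragment below (tries=5 and clean_try=1 are the values discussed in this thread; the comment wording is approximate, not copied from a control file):

```shell
# maker_opts.ctl fragment (the control file uses shell-style # comments)
tries=5       # number of times to retry a contig if there is a failure
clean_try=1   # on failure, delete all previous results for the contig, not just suspect files
```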
>>>>>>>
>>>>>>> —Carson
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sep 11, 2017, at 10:19 AM, Quanwei Zhang <qwzhang0601 at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Dear Carson:
>>>>>>>
>>>>>>> About the error in my email above: I found the contig was annotated
>>>>>>> correctly on the second RETRY, so please ignore my last email. But
>>>>>>> now, for a small number of scaffolds, I have problems processing the
>>>>>>> repeats (as shown below). I used both the Mammalia repeat library and
>>>>>>> a species-specific repeat library (generated by your pipeline
>>>>>>> http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic).
>>>>>>> There were no such problems when I only used the Mammalia repeat
>>>>>>> library. Do you have any ideas about what the reason could be, or any
>>>>>>> suggestions for how to find it? Many thanks
>>>>>>>
>>>>>>> Here are some parameters I used
>>>>>>>
>>>>>>> #-----Repeat Masking (leave values blank to skip repeat masking)
>>>>>>> model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
>>>>>>> rmlib=../consensi.fa.classifiednoProtFinal #provide an organism specific repeat library in fasta format for RepeatMasker
>>>>>>>
>>>>>>> max_dna_len=300000
>>>>>>> split_hit=40000
>>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
>>>>>>>
>>>>>>>
>>>>>>> Died at /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/Bio/Search/Hit/PhatHit/Base.pm
>>>>>>> line 188.
>>>>>>> 33708 --> rank=NA, hostname=n409
>>>>>>> 33709 ERROR: Failed while processing all repeats
>>>>>>> 33710 ERROR: Chunk failed at level:3, tier_type:1
>>>>>>> 33711 FAILED CONTIG:Contig31
>>>>>>>
>>>>>>>
>>>>>>> Best
>>>>>>> Quanwei
>>>>>>>
>>>>>>> 2017-09-08 23:25 GMT-04:00 Quanwei Zhang <qwzhang0601 at gmail.com>:
>>>>>>>
>>>>>>>> Dear Carson:
>>>>>>>>
>>>>>>>> I got the following error again. Is this still related to memory
>>>>>>>> issues, or can there be other reasons for this error? This time I
>>>>>>>> got it during training of the SNAP model. Previously, even with
>>>>>>>> max_dna_len=1 Mb, I could train the model successfully, and in the
>>>>>>>> current training (where I get the following error) I have decreased
>>>>>>>> max_dna_len to 300 kb. I requested the same amount of memory as
>>>>>>>> before. The only difference is that I am now using both the
>>>>>>>> mammalian repeat library and a species-specific repeat library,
>>>>>>>> whereas previously I only used the mammalian one. Will using both
>>>>>>>> repeat libraries greatly increase the memory requirement (even with
>>>>>>>> max_dna_len decreased from 1 Mb to 300 kb)? I have also set
>>>>>>>> depth_blast to 30 in the current training.
>>>>>>>>
>>>>>>>> Thank you! Have a nice weekend!
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> #-----------------------------------------------------------
>>>>>>>> ----------
>>>>>>>> Now starting the contig!!
>>>>>>>> SeqID: Contig10
>>>>>>>> Length: 18773588
>>>>>>>> #-----------------------------------------------------------
>>>>>>>> ----------
>>>>>>>>
>>>>>>>>
>>>>>>>> setting up GFF3 output and fasta chunks
>>>>>>>> doing repeat masking
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> doing blastx repeats
>>>>>>>> collecting blastx repeatmasking
>>>>>>>> processing all repeats
>>>>>>>> doing repeat masking
>>>>>>>> Can't kill a non-numeric process ID at
>>>>>>>> /gs/gsfs0/hpc01/apps/MAKER/2.31.9/bin/../lib/File/NFSLock.pm line
>>>>>>>> 1050.
>>>>>>>> --> rank=NA, hostname=n224
>>>>>>>> ERROR: Failed while doing repeat masking
>>>>>>>> ERROR: Chunk failed at level:0, tier_type:1
>>>>>>>> FAILED CONTIG:Contig10
>>>>>>>>
>>>>>>>> ERROR: Chunk failed at level:2, tier_type:0
>>>>>>>> FAILED CONTIG:Contig10
>>>>>>>>
>>>>>>>> Best
>>>>>>>> Quanwei
>>>>>>>>
>>>>>>>> 2017-09-06 12:06 GMT-04:00 Carson Holt <carsonhh at gmail.com>:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> (2) By reading some of your replies in the MAKER Google group, I
>>>>>>>>> noticed that setting depth_blast to a certain number can reduce
>>>>>>>>> memory use and save annotation time, so I changed the following
>>>>>>>>> parameters. But I wonder whether it will decrease the quality of
>>>>>>>>> the annotation. If it won't affect the quality, can I use an even
>>>>>>>>> smaller number (e.g., 20) to save more memory and time?
>>>>>>>>>
>>>>>>>>> depth_blastn=30 #Blastn depth cutoff (0 to disable cutoff)
>>>>>>>>> depth_blastx=30 #Blastx depth cutoff (0 to disable cutoff)
>>>>>>>>> depth_tblastx=30 #tBlastx depth cutoff (0 to disable cutoff)
>>>>>>>>> bit_rm_blastx=30 #Blastx bit cutoff for transposable element
>>>>>>>>> masking
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> These values really only affect the final evidence kept in the
>>>>>>>>> GFF3 when you look at it in a browser. They have no effect on the
>>>>>>>>> annotation, because internally MAKER already collapses evidence
>>>>>>>>> down to the 10 best non-redundant features per evidence set per
>>>>>>>>> locus. The rest are put in the GFF3 just for reference. By setting
>>>>>>>>> it lower, you are just letting MAKER know it can throw things away
>>>>>>>>> even sooner since you don’t want them in the GFF3. It provides a
>>>>>>>>> minor improvement for memory use, but max_dna_len is the big one
>>>>>>>>> with the greatest effect.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> (3) I also have some concerns about speed, especially for the long
>>>>>>>>> scaffolds (around 100 Mb). I wonder which part of genome annotation
>>>>>>>>> is the most time consuming (repeat masking, BLAST, or polishing?).
>>>>>>>>> In particular, I wonder whether the BLASTX of protein evidence will
>>>>>>>>> take the majority of the time. I have prepared 99k mammalian
>>>>>>>>> Swiss-Prot protein sequences and 340k rodent TrEMBL protein
>>>>>>>>> sequences as protein evidence, and I am considering whether I can
>>>>>>>>> save much time by only using the 99k mammalian Swiss-Prot sequences.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> BLASTN (ESTs) -> fastest as it is searching nucleotide space
>>>>>>>>> BLASTX (proteins) -> must search 6 reading frames so will be at
>>>>>>>>> least 6 times slower than BLASTN
>>>>>>>>> TBLASTX (alt-ESTs) -> must search 12 reading frames so will be at
>>>>>>>>> least 12 times slower than BLASTN and twice as slow as BLASTX
>>>>>>>>>
>>>>>>>>> Also, doubling the dataset size doubles the runtime. Larger window
>>>>>>>>> sizes via max_dna_len will also increase runtimes.
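Those rules of thumb can be combined into a quick back-of-envelope estimate (the cost factors 1/6/12 come straight from the relative speeds above; the helper itself is just illustrative arithmetic):

```shell
# Relative runtime ~= cost_factor * relative_dataset_size
# cost_factor: blastn=1, blastx>=6, tblastx>=12 (reading frames searched).
est() { echo $(( $1 * $2 )); }

echo "blastn,  1x data:  $(est 1 1)"    # -> 1
echo "blastx,  2x data:  $(est 6 2)"    # -> 12
echo "tblastx, 2x data:  $(est 12 2)"   # -> 24
```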
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> (4) For some reason, I cannot run MAKER through MPI on our
>>>>>>>>> cluster, so I can only start multiple MAKER instances. I wonder
>>>>>>>>> whether it is possible to have multiple MAKER instances annotate
>>>>>>>>> the same long scaffold (i.e., start multiple instances on a single
>>>>>>>>> sequence without splitting the long sequence into shorter ones).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Without MPI you won’t be able to split up large contigs. At the
>>>>>>>>> very least you can run on a single node and set MPI to use all CPUs
>>>>>>>>> on that node. It’s less difficult to set up than cross-node jobs
>>>>>>>>> via MPI.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> (5) Still about the speed issue. I read some of your comments
>>>>>>>>> about the "cpus" parameter in the maker_opts file (
>>>>>>>>> http://gmod.827538.n3.nabble.com/open3-fork-failed-Cannot-allocate-memory-td4025117.html),
>>>>>>>>> and I understand it indicates the number of CPUs for a single
>>>>>>>>> chunk. So if I set "cpus=2" in the maker_opts file, I can use the
>>>>>>>>> following command to submit the job, right?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The cpus parameter only affects how many CPUs are given to the
>>>>>>>>> BLAST command line, so only the BLAST step will speed up. I
>>>>>>>>> recommend using MPI to speed up all steps. Even if you are only
>>>>>>>>> running on a single node, you can give all CPUs to the mpiexec
>>>>>>>>> command.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> —Carson
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>
>