[maker-devel] MAKER/repeatmasker/TRF parsing of long file names

Liudmila Sergeevna Mainzer lmainzer at life.illinois.edu
Mon Jan 9 00:02:01 MST 2017


Hello, MAKER developers!

I tried submitting this bug report through the web form on the
RepeatMasker web page, but I got an "invalid submission" message,
so I decided to post here.

I found a weird bug that results in the notorious "index out of bounds" 
error reported by RepeatMasker. Significantly, this error only arises on 
very long file names generated by MAKER.

I traced this through the code and found that the error originates in
Tandem Repeats Finder (TRF). TRF sometimes splits its output into
separate files. When that happens, the pieces with index >1 do not
contain the sequence name. Compare the first few lines of these two files:

  head -n 20 output_folder/InputFileName_batch-1.masked.2.3.5.75.20.33.7.1.txt.html

<HTML><HEAD><TITLE>InputFileName_batch-1.masked.2.3.5.75.20.33.7.txt.html</TITLE></HEAD><BODY bgcolor="#FBF8BC"><PRE>
     File 1 of 2
     Tandem Repeats Finder Program written by:
                   Gary Benson
                   Program in Bioinformatics
                   Boston University
     Version 4.09
     Sequence: InputSequencefrag-1 CHUNK number:191 size:455659 offset:57300000
     Parameters: 2 3 5 75 20 33 7

etcetera
And here is the second chunk:

  head -n 20 output_folder/InputFileName_batch-1.masked.2.3.5.75.20.33.7.2.txt.html

<HTML><HEAD><TITLE>InputFileName_batch-1.masked.2.3.5.75.20.33.7.txt.html</TITLE></HEAD><BODY bgcolor="#FBF8BC"><PRE>
     File 2 of 2
     Found at i:56286 original size:1 final size:1
     <A NAME="56278--56322,1,45.0,1,1136"></A><A HREF="http://tandem.bu.edu/trf/trf.definitions.html#alignment" target="explanation">Alignment explanation</A><BR><BR>
        Indices: 56278--56322  Score: 55
        Period size: 1  Copynumber: 45.0  Consensus size: 1

etcetera


See how one file has the full header with the "Sequence:" statement and
the other does not? RepeatMasker uses this "Sequence:" statement to
name each piece of sequence that is masked later. When this variable is
empty (the name string is not defined), the setSubstr subroutine in the
main RepeatMasker code breaks: the length of an undefined string is of
course zero, and that subroutine has a check for sequences whose length
is shorter than the region that needs to be masked.
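
One way to check a whole output folder for pieces that lack the header is a small script like the sketch below. It is written in Python and is not part of RepeatMasker, MAKER, or TRF; the function names and the choice to inspect only the first 20 lines (mirroring the head -n 20 commands above) are assumptions made for illustration.

```python
import re
from pathlib import Path

# Matches the "Sequence: <name>" header line seen in the first TRF output file.
SEQUENCE_LINE = re.compile(r"^[ \t]*Sequence:[ \t]*(\S+)", re.MULTILINE)


def sequence_name(html_text, head_lines=20):
    """Return the name from the "Sequence:" header, or None if it is absent."""
    head = "\n".join(html_text.splitlines()[:head_lines])
    match = SEQUENCE_LINE.search(head)
    return match.group(1) if match else None


def files_missing_header(folder):
    """List *.txt.html files in `folder` whose first lines lack "Sequence:"."""
    missing = []
    for path in sorted(Path(folder).glob("*.txt.html")):
        text = path.read_text(errors="replace")
        if sequence_name(text) is None:
            missing.append(path.name)
    return missing
```

Calling files_missing_header("output_folder") would then list the pieces that RepeatMasker cannot associate with a sequence name.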

So it quits with the message "Error index out of bounds!", even though
the sequence has a finite length, does not contain any weird characters,
and is maskable.
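
The failure mode described above can be illustrated with a Python analogue. This is only a sketch of the logic as described in this report, not RepeatMasker's actual Perl code: the function names, the dictionary of sequences, and the "N" fill character are hypothetical.

```python
def set_substr(sequence, start, length, fill="N"):
    """Overwrite sequence[start:start+length] with `fill`.

    Regions that extend past the end of the sequence are rejected,
    which mimics the length check described in the report.
    """
    if start < 0 or start + length > len(sequence):
        raise IndexError("Error index out of bounds!")
    return sequence[:start] + fill * length + sequence[start + length:]


def mask_region(sequences, name, start, length):
    """Mask a region of the sequence stored under `name`."""
    # A missing or empty name finds no sequence, so an empty string
    # (length zero) stands in for it and the bounds check must fail.
    sequence = sequences.get(name, "")
    return set_substr(sequence, start, length)
```

With a valid name the region is masked; with an empty name the same call raises the out-of-bounds error, even though the underlying sequence is perfectly fine.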

Once again, this only arises on very long file names, and those seem to 
be created by MAKER. Example:
LocalTmp/JobName.maker.output/JobName_datastore/53/6E/10000001/theVoid.chr_number/57/chr_number.191.My_Species_Name_%2Erepeats%2Econsensi%2Efa%2Eclassified%2Ecleaned%2Empi%2E10%2E0.specific

Notice how the last part of the file name has a bunch of identifiers
separated by %2E (the URI encoding of a period)? I experimented with
that file name. The path does not matter. The % signs do not matter. It
is the length of the file name itself: if it is shorter than 108
characters, then RepeatMasker/TRF runs fine. If it is 108 characters or
longer, it breaks. Seems like maybe Perl is not handling that long a
name very well...
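
A quick way to flag affected inputs before running RepeatMasker is to measure the final path component only, since the directory part does not matter. The sketch below is in Python; the 108-character threshold is the one observed in this report, and the constant and function names are hypothetical.

```python
import os

# Longest file name observed to work in this report; 108 or more fails.
MAX_SAFE_NAME_LENGTH = 107


def name_length(path):
    """Length of the final path component; the directory part is ignored."""
    return len(os.path.basename(path))


def is_too_long(path, limit=MAX_SAFE_NAME_LENGTH):
    """True if the file name alone exceeds the observed safe length."""
    return name_length(path) > limit
```

For example, is_too_long applied to each file in a MAKER datastore directory would pick out the names likely to trigger the error.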

So the problem is three-fold: MAKER creates very long file names, TRF
fails to write the file headers properly for those names, and
RepeatMasker breaks as a result.

Could you suggest any fixes or patches for this problem? It is
forcing us to run RepeatMasker separately, outside the main MAKER
workflow, which really complicates the data management and analysis as a
whole.
We use RepeatMasker version open-4.0.6, maker-3.00.0-beta and perl
v5.10.1 built for x86_64-linux-thread-multi.

Many thanks in advance,
Liudmila Mainzer

----------------
Senior Research Scientist
National Center for Supercomputing Applications

Research Assistant Professor
Institute of Genomic Biology

University of Illinois
217-300-0568
1205 W. Clark St. Room 4026
Urbana, IL 61801



