[maker-devel] MAKER/repeatmasker/TRF parsing of long file names
Liudmila Sergeevna Mainzer
lmainzer at life.illinois.edu
Mon Jan 9 00:02:01 MST 2017
Hello, MAKER developers!
I tried submitting this bug report through the web form on the
RepeatMasker web page, but I am getting an "invalid submission" message,
so I decided to post here.
I found a weird bug that results in the notorious "index out of bounds"
error reported by RepeatMasker. Significantly, this error only arises on
very long file names generated by MAKER.
I traced this through the code and found that the error originates in
Tandem Repeats Finder (TRF). TRF sometimes splits its output into
separate files. When that happens, the pieces with index >1 do not
contain the sequence name. Compare the first few lines of these two files:
head -n 20 output_folder/InputFileName_batch-1.masked.2.3.5.75.20.33.7.1.txt.html
<HTML><HEAD><TITLE>InputFileName_batch-1.masked.2.3.5.75.20.33.7.txt.html</TITLE></HEAD><BODY
bgcolor="#File 1 of 2 FBF8BC"><PRE>
Tandem Repeats Finder Program written by:
Gary Benson
Program in Bioinformatics
Boston University
Version 4.09
Sequence: InputSequencefrag-1 CHUNK number:191 size:455659 offset:57300000
Parameters: 2 3 5 75 20 33 7
etcetera
Now the second piece:
head -n 20 output_folder/InputFileName_batch-1.masked.2.3.5.75.20.33.7.2.txt.html
<HTML><HEAD><TITLE>InputFileName_batch-1.masked.2.3.5.75.20.33.7.txt.html</TITLE></HEAD><BODY
bgcolor="#File 2 of 2 Found at i:56286 original size:1 final size:1
<A NAME="56278--56322,1,45.0,1,1136"></A><A
HREF="http://tandem.bu.edu/trf/trf.definitions.html#alignment" target
="explanation">Alignment explanation</A><BR><BR>
Indices: 56278--56322 Score: 55
Period size: 1 Copynumber: 45.0 Consensus size: 1
etcetera
See how one file has the full header with the "Sequence:" statement and
the other one does not? RepeatMasker uses this "Sequence:" statement to
name each piece of sequence that ends up being masked later. When this
variable is empty (the name string is not defined), the setSubstr
subroutine in the main RepeatMasker code breaks: the length of an
undefined string is of course zero, and that subroutine has a check for
sequences that are shorter than the region that needs to be masked.
So it quits with the statement "Error index out of bounds!", even though
the sequence has finite length, does not contain any weird characters,
and is maskable.
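A quick way to spot affected TRF output is to scan the pieces for the
missing header. The sketch below is only a minimal check, assuming the
pieces are the *.txt.html files shown above and that a healthy piece
contains a line beginning with "Sequence:". The names sequence_name and
pieces_missing_header are made up for this illustration; they are not part
of RepeatMasker or TRF, and the pattern mirrors the header shown above
rather than RepeatMasker's actual parser.

```python
import re
from pathlib import Path

# Hypothetical pattern: a line that starts with "Sequence:" followed by a name.
# It mirrors the header shown in this report, not RepeatMasker's own parser.
SEQUENCE_LINE = re.compile(r"^Sequence:[ \t]*(\S+)", re.MULTILINE)


def sequence_name(text):
    """Return the name after 'Sequence:' in a TRF output piece, or None if absent."""
    match = SEQUENCE_LINE.search(text)
    return match.group(1) if match else None


def pieces_missing_header(folder):
    """List the *.txt.html files in `folder` that have no 'Sequence:' line."""
    missing = []
    for path in sorted(Path(folder).glob("*.txt.html")):
        # errors="replace" keeps the scan going if a file has odd bytes.
        text = path.read_text(errors="replace")
        if sequence_name(text) is None:
            missing.append(path.name)
    return missing
```

Calling pieces_missing_header("output_folder") on a RepeatMasker working
directory would then list the pieces that trigger the undefined name.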
Once again, this only arises on very long file names, and those seem to
be created by MAKER. Example:
LocalTmp/JobName.maker.output/JobName_datastore/53/6E/10000001/theVoid.chr_number/57/chr_number.191.My_Species_Name_%2Erepeats%2Econsensi%2Efa%2Eclassified%2Ecleaned%2Empi%2E10%2E0.specific
Notice how the last part of the file name has a bunch of identifiers
separated by %2E (the URI percent-encoding of a period)? I experimented
with that file name. The path does not matter. The % signs do not
matter. What matters is the length of the file name itself: if it is
shorter than 108 characters, RepeatMasker/TRF runs fine. If it is 108 or
more, it breaks. Seems like maybe Perl is not handling that long a name
very well...
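To check whether a given MAKER run is at risk, one can measure the file
name lengths directly. Here is a small sketch, assuming the 108-character
threshold observed above applies to the file name alone and not to the
full path. NAME_LIMIT and names_at_risk are hypothetical names chosen for
this example, and the threshold is an empirical observation from this
report, not a documented limit.

```python
import os

# Threshold observed empirically in this report; an assumption, not a documented limit.
NAME_LIMIT = 108


def names_at_risk(paths, limit=NAME_LIMIT):
    """Return (file name, length) for every path whose file name has >= limit characters."""
    flagged = []
    for path in paths:
        # Only the last path component counts; the directory part is ignored.
        name = os.path.basename(path)
        if len(name) >= limit:
            flagged.append((name, len(name)))
    return flagged
```

Feeding it the file names from the MAKER datastore would show which inputs
are at or over the threshold before RepeatMasker is run on them.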
So the problem is three-fold: MAKER creates very long file names, TRF
fails to write the file headers properly for those names, and
RepeatMasker breaks as a result.
Could you suggest a fix or a patch for this problem? It is forcing us
to run RepeatMasker separately, outside the main MAKER workflow, which
really complicates the data management and analysis as a whole.
We use RepeatMasker version open-4.0.6, maker-3.00.0-beta and perl
v5.10.1 built for x86_64-linux-thread-multi.
Many thanks in advance,
Liudmila Mainzer
----------------
Senior Research Scientist
National Center for Supercomputing Applications
Research Assistant Professor
Institute of Genomic Biology
University of Illinois
217-300-0568
1205 W. Clark St. Room 4026
Urbana, IL 61801