[maker-devel] MAKER/repeatmasker/TRF parsing of long file names
Carson Holt
carsonhh at gmail.com
Mon Jan 9 09:30:09 MST 2017
The name used by MAKER is based on the input file name, so a quick fix would be to rename your input file to something shorter.
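As a rough illustration of that workaround, here is a minimal Python sketch that checks a file name against a length threshold and makes a copy under a shorter name. The 107-character default comes only from the observation in the quoted report below; it is not a documented RepeatMasker or TRF limit. The helper names `needs_short_name` and `copy_with_short_name` are hypothetical. Since MAKER adds its own pieces to the name, the input file name probably needs to be well under that threshold.

```python
import os
import shutil

# Assumption: threshold taken from the quoted report (names of 108+ characters
# failed); it is an observation, not a documented limit of any tool.
MAX_NAME_LEN = 107


def needs_short_name(path, limit=MAX_NAME_LEN):
    """Return True if the file name (not the full path) is longer than limit."""
    return len(os.path.basename(path)) > limit


def copy_with_short_name(path, short_name, dest_dir=None):
    """Copy path to dest_dir/short_name and return the new path.

    If dest_dir is not given, the copy is placed next to the original file.
    """
    dest_dir = dest_dir or os.path.dirname(os.path.abspath(path))
    target = os.path.join(dest_dir, short_name)
    shutil.copyfile(path, target)
    return target
```

The shortened copy would then be the file referenced in the MAKER control files; a symlink might work as well, but that is untested here.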
—Carson
> On Jan 9, 2017, at 12:02 AM, Liudmila Sergeevna Mainzer <lmainzer at life.illinois.edu> wrote:
>
> Hello, MAKER developers!
>
> I tried submitting this bug report through the web form on the RepeatMasker web page, but I am getting an "invalid submission" message, so I decided to post here.
>
> I found a weird bug that results in the notorious "index out of bounds" error reported by RepeatMasker. Significantly, this error only arises on very long file names generated by MAKER.
>
> I traced this through the code and found that the error originates in Tandem Repeats Finder (TRF). TRF sometimes splits its output into separate files. When that happens, the pieces with index >1 do not contain the sequence name. Compare the first few lines of these two files:
>
> head -n 20 output_folder/InputFileName_batch-1.masked.2.3.5.75.20.33.7.1.txt.html
> <HTML><HEAD><TITLE>InputFileName_batch-1.masked.2.3.5.75.20.33.7.txt.html</TITLE></HEAD><BODY
>
> bgcolor="#File 1 of 2 FBF8BC"><PRE>
> Tandem Repeats Finder Program written by:
> Gary Benson
> Program in Bioinformatics
> Boston University
> Version 4.09
> Sequence: InputSequencefrag-1 CHUNK number:191 size:455659 offset:57300000
> Parameters: 2 3 5 75 20 33 7
>
> etcetera
> But also the second chunk:
>
> head -n 20 output_folder/InputFileName_batch-1.masked.2.3.5.75.20.33.7.2.txt.html
> <HTML><HEAD><TITLE>InputFileName_batch-1.masked.2.3.5.75.20.33.7.txt.html</TITLE></HEAD><BODY
>
> bgcolor="#File 2 of 2 Found at i:56286 original size:1 final size:1
> <A NAME="56278--56322,1,45.0,1,1136"></A><A HREF="http://tandem.bu.edu/trf/trf.definitions.html#alignment" target="explanation">Alignment explanation</A><BR><BR>
> Indices: 56278--56322 Score: 55
> Period size: 1 Copynumber: 45.0 Consensus size: 1
>
> etcetera
>
>
> See how one file has the full header with the "Sequence:" statement and the other does not? RepeatMasker uses this "Sequence:" statement to name each piece of sequence that is masked later. When this variable is empty (the name string is not defined), the setSubstr subroutine in the main RepeatMasker code breaks: the length of an undefined string is of course zero, and that subroutine has a check for sequences whose length is shorter than the region that needs to be masked.
>
> So it quits with the statement "Error index out of bounds!", even though the sequence is finite length, does not have any weird characters, and is maskable.
>
> Once again, this only arises on very long file names, and those seem to be created by MAKER. Example:
> LocalTmp/JobName.maker.output/JobName_datastore/53/6E/10000001/theVoid.chr_number/57/chr_number.191.My_Species_Name_%2Erepeats%2Econsensi%2Efa%2Eclassified%2Ecleaned%2Empi%2E10%2E0.specific
>
> Notice how the last part of the file name has a bunch of identifiers separated by %2E (the URI encoding of ".")? I experimented with that file name. The path does not matter. The % signs do not matter. It is the length of the file name itself: if it is shorter than 108 characters, RepeatMasker/TRF runs fine. If it is 108 characters or longer, it breaks. Seems like maybe Perl is not handling that long a name very well...
>
> So the problem is three-fold: MAKER creates very long file names, TRF fails to write the file headers properly for those names, and RepeatMasker breaks as a result.
>
> Could you suggest any fixes or patches for this problem? It is forcing us to run RepeatMasker separately, outside the main MAKER workflow, which really complicates the data management and analysis as a whole.
> We use RepeatMasker version open-4.0.6, maker-3.00.0-beta and perl v5.10.1 built for x86_64-linux-thread-multi.
>
> Many thanks in advance,
> Liudmila Mainzer
>
> ----------------
> Senior Research Scientist
> National Center for Supercomputing Applications
>
> Research Assistant Professor
> Institute of Genomic Biology
>
> University of Illinois
> 217-300-0568
> 1205 W. Clark St. Room 4026
> Urbana, IL 61801
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org