<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
</head>
<body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; color: rgb(0, 0, 0); font-family: Calibri, sans-serif; ">
<div>
<div>
<div><font class="Apple-style-span" size="3">Repeats can legitimately occur within genes, so blocking them outright causes more errors than it avoids; a mixed approach of hard and soft masking is more appropriate. The masking step stops alignments from
seeding in repeat regions, but an alignment that seeds in a non-repeat region can still extend through a repeat region during the polishing steps (i.e., the EST evidence supports extending through the repeat and including the TE).</font></div>
<div><font class="Apple-style-span" size="3"><br>
</font></div>
<div><font class="Apple-style-span" size="3">--Carson</font></div>
<div><font class="Apple-style-span" size="3"><br>
</font></div>
</div>
</div>
<div style="font-size: 14px; "><br>
</div>
<span id="OLK_SRC_BODY_SECTION" style="font-size: 14px; ">
<div style="font-family:Calibri; font-size:11pt; text-align:left; color:black; BORDER-BOTTOM: medium none; BORDER-LEFT: medium none; PADDING-BOTTOM: 0in; PADDING-LEFT: 0in; PADDING-RIGHT: 0in; BORDER-TOP: #b5c4df 1pt solid; BORDER-RIGHT: medium none; PADDING-TOP: 3pt">
<span style="font-weight:bold">From: </span>Dario Copetti &lt;<a href="mailto:dcopetti@cals.arizona.edu">dcopetti@cals.arizona.edu</a>&gt;<br>
<span style="font-weight:bold">Organization: </span>AGI<br>
<span style="font-weight:bold">Date: </span>Monday, 6 May, 2013 5:19 PM<br>
<span style="font-weight:bold">To: </span>&lt;<a href="mailto:maker-devel@yandell-lab.org">maker-devel@yandell-lab.org</a>&gt;<br>
<span style="font-weight:bold">Cc: </span>"<a href="mailto:kapeel@cals.arizona.edu">kapeel@cals.arizona.edu</a>" &lt;<a href="mailto:kapeel@cals.arizona.edu">kapeel@cals.arizona.edu</a>&gt;, "Stein, Joshua" &lt;<a href="mailto:steinj@cshl.edu">steinj@cshl.edu</a>&gt;,
Rod Wing &lt;<a href="mailto:rwing@Ag.arizona.edu">rwing@Ag.arizona.edu</a>&gt;<br>
<span style="font-weight:bold">Subject: </span>gene models overlapping with TEs<br>
</div>
<div><br>
</div>
<div>
<div text="#000000" bgcolor="#FFFFFF">Carson,<br>
<br>
Analyzing the output of a MAKER run on a rice-sized genome, I noticed that some gene models (~10%) overlap with TE coding regions. As a QC step, I used BEDtools to intersect the "CDS" features with the "repeatmasker" or "repeatrunner" features, and about 2,400 genes overlap
a repeat over at least 30% of their length. I am wondering why these gene models still appear in the final output, since I thought the masking step gave us absolute assurance that our endogenous gene list does not include TE coding regions.
Below is an example of such a gene (picture attached too):<br>
<br>
<table height="551" width="1167" border="0" cellspacing="0" cols="9">
<colgroup span="5" width="85"></colgroup><colgroup width="35"></colgroup><colgroup span="2" width="31">
</colgroup><colgroup width="85"></colgroup>
<tbody>
<tr>
<td align="LEFT" height="16">ObracChr10</td>
<td align="LEFT">maker</td>
<td align="LEFT">mRNA</td>
<td sdval="355056" sdnum="1033;0;#,##0" align="RIGHT">355,056</td>
<td sdval="358075" sdnum="1033;0;#,##0" align="RIGHT">358,075</td>
<td align="LEFT">.</td>
<td align="LEFT">-</td>
<td align="LEFT">.</td>
<td align="LEFT">ID=Obrac10g00240.1;Parent=Obrac10g00240;Name=Obrac10g00240.1;_AED=0.24;_eAED=0.24;_QI=0|0.66|0.5|1|1|1|4|0|788</td>
</tr>
<tr>
<td align="LEFT" height="16">ObracChr10</td>
<td align="LEFT">maker</td>
<td align="LEFT">exon</td>
<td sdval="355056" sdnum="1033;0;#,##0" align="RIGHT">355,056</td>
<td sdval="356874" sdnum="1033;0;#,##0" align="RIGHT">356,874</td>
<td align="LEFT">.</td>
<td align="LEFT">-</td>
<td align="LEFT">.</td>
<td align="LEFT">ID=Obrac10g00240.1:exon:4;Parent=Obrac10g00240.1</td>
</tr>
<tr>
<td align="LEFT" height="16">ObracChr10</td>
<td align="LEFT">maker</td>
<td align="LEFT">exon</td>
<td sdval="356965" sdnum="1033;0;#,##0" align="RIGHT">356,965</td>
<td sdval="357081" sdnum="1033;0;#,##0" align="RIGHT">357,081</td>
<td align="LEFT">.</td>
<td align="LEFT">-</td>
<td align="LEFT">.</td>
<td align="LEFT">ID=Obrac10g00240.1:exon:3;Parent=Obrac10g00240.1</td>
</tr>
<tr>
<td align="LEFT" height="16">ObracChr10</td>
<td align="LEFT">maker</td>
<td align="LEFT">exon</td>
<td sdval="357209" sdnum="1033;0;#,##0" align="RIGHT">357,209</td>
<td sdval="357319" sdnum="1033;0;#,##0" align="RIGHT">357,319</td>
<td align="LEFT">.</td>
<td align="LEFT">-</td>
<td align="LEFT">.</td>
<td align="LEFT">ID=Obrac10g00240.1:exon:2;Parent=Obrac10g00240.1</td>
</tr>
<tr>
<td align="LEFT" height="16">ObracChr10</td>
<td align="LEFT">maker</td>
<td align="LEFT">exon</td>
<td sdval="357756" sdnum="1033;0;#,##0" align="RIGHT">357,756</td>
<td sdval="358075" sdnum="1033;0;#,##0" align="RIGHT">358,075</td>
<td align="LEFT">.</td>
<td align="LEFT">-</td>
<td align="LEFT">.</td>
<td align="LEFT">ID=Obrac10g00240.1:exon:1;Parent=Obrac10g00240.1</td>
</tr>
<tr>
<td align="LEFT" height="16">ObracChr10</td>
<td align="LEFT">maker</td>
<td align="LEFT">CDS</td>
<td sdval="357756" sdnum="1033;0;#,##0" align="RIGHT">357,756</td>
<td sdval="358075" sdnum="1033;0;#,##0" align="RIGHT">358,075</td>
<td align="LEFT">.</td>
<td align="LEFT">-</td>
<td sdval="2" sdnum="1033;" align="RIGHT">2</td>
<td align="LEFT">ID=Obrac10g00240.1:cds;Parent=Obrac10g00240.1</td>
</tr>
<tr>
<td align="LEFT" height="16">ObracChr10</td>
<td align="LEFT">maker</td>
<td align="LEFT">CDS</td>
<td sdval="357209" sdnum="1033;0;#,##0" align="RIGHT">357,209</td>
<td sdval="357319" sdnum="1033;0;#,##0" align="RIGHT">357,319</td>
<td align="LEFT">.</td>
<td align="LEFT">-</td>
<td sdval="2" sdnum="1033;" align="RIGHT">2</td>
<td align="LEFT">ID=Obrac10g00240.1:cds;Parent=Obrac10g00240.1</td>
</tr>
<tr>
<td align="LEFT" height="16">ObracChr10</td>
<td align="LEFT">maker</td>
<td align="LEFT">CDS</td>
<td sdval="356965" sdnum="1033;0;#,##0" align="RIGHT">356,965</td>
<td sdval="357081" sdnum="1033;0;#,##0" align="RIGHT">357,081</td>
<td align="LEFT">.</td>
<td align="LEFT">-</td>
<td sdval="2" sdnum="1033;" align="RIGHT">2</td>
<td align="LEFT">ID=Obrac10g00240.1:cds;Parent=Obrac10g00240.1</td>
</tr>
<tr>
<td align="LEFT" height="16">ObracChr10</td>
<td align="LEFT">maker</td>
<td align="LEFT">CDS</td>
<td sdval="355056" sdnum="1033;0;#,##0" align="RIGHT">355,056</td>
<td sdval="356874" sdnum="1033;0;#,##0" align="RIGHT">356,874</td>
<td align="LEFT">.</td>
<td align="LEFT">-</td>
<td sdval="0" sdnum="1033;" align="RIGHT">0</td>
<td align="LEFT">ID=Obrac10g00240.1:cds;Parent=Obrac10g00240.1</td>
</tr>
<tr>
<td align="LEFT" height="16"><br>
</td>
<td align="LEFT"><br>
</td>
<td align="LEFT"><br>
</td>
<td sdnum="1033;0;#,##0" align="LEFT"><br>
</td>
<td sdnum="1033;0;#,##0" align="LEFT"><br>
</td>
<td align="LEFT"><br>
</td>
<td align="LEFT"><br>
</td>
<td align="LEFT"><br>
</td>
<td align="LEFT"><br>
</td>
</tr>
<tr>
<td align="LEFT" height="16"><br>
</td>
<td align="LEFT"><br>
</td>
<td align="LEFT"><br>
</td>
<td sdnum="1033;0;#,##0" align="LEFT"><br>
</td>
<td sdnum="1033;0;#,##0" align="LEFT"><br>
</td>
<td align="LEFT"><br>
</td>
<td align="LEFT"><br>
</td>
<td align="LEFT"><br>
</td>
<td align="LEFT"><br>
</td>
</tr>
<tr>
<td align="LEFT" height="17">ObracChr10</td>
<td align="LEFT">repeatrunner</td>
<td align="LEFT">match_part</td>
<td sdval="357755" sdnum="1033;0;#,##0" align="RIGHT">357,755</td>
<td sdval="358084" sdnum="1033;0;#,##0" align="RIGHT">358,084</td>
<td sdval="566" sdnum="1033;" align="RIGHT">566</td>
<td align="LEFT">-</td>
<td align="LEFT">.</td>
<td align="LEFT">ID=ObracChr10:hsp:75:1.3.0.3;Parent=ObracChr10:hit:75:1.3.0.3;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 117 226 +320</td>
</tr>
<tr>
<td align="LEFT" height="17">ObracChr10</td>
<td align="LEFT">repeatrunner</td>
<td align="LEFT">protein_match</td>
<td sdval="357755" sdnum="1033;0;#,##0" align="RIGHT">357,755</td>
<td sdval="358084" sdnum="1033;0;#,##0" align="RIGHT">358,084</td>
<td sdval="566" sdnum="1033;" align="RIGHT">566</td>
<td align="LEFT">-</td>
<td align="LEFT">.</td>
<td align="LEFT">ID=ObracChr10:hit:75:1.3.0.3;Name=DTM_gi_125573769_gb_EAZ15053.1hypothetical;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 117 226 +320</td>
</tr>
<tr>
<td align="LEFT" height="17">ObracChr10</td>
<td align="LEFT">repeatrunner</td>
<td align="LEFT">match_part</td>
<td sdval="357202" sdnum="1033;0;#,##0" align="RIGHT">357,202</td>
<td sdval="357294" sdnum="1033;0;#,##0" align="RIGHT">357,294</td>
<td sdval="142" sdnum="1033;" align="RIGHT">142</td>
<td align="LEFT">-</td>
<td align="LEFT">.</td>
<td align="LEFT">ID=ObracChr10:hsp:74:1.3.0.3;Parent=ObracChr10:hit:74:1.3.0.3;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 264 294 +86</td>
</tr>
<tr>
<td align="LEFT" height="17">ObracChr10</td>
<td align="LEFT">repeatrunner</td>
<td align="LEFT">protein_match</td>
<td sdval="357202" sdnum="1033;0;#,##0" align="RIGHT">357,202</td>
<td sdval="357294" sdnum="1033;0;#,##0" align="RIGHT">357,294</td>
<td sdval="142" sdnum="1033;" align="RIGHT">142</td>
<td align="LEFT">-</td>
<td align="LEFT">.</td>
<td align="LEFT">ID=ObracChr10:hit:74:1.3.0.3;Name=DTM_gi_125573769_gb_EAZ15053.1hypothetical;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 264 294 +86</td>
</tr>
<tr>
<td align="LEFT" height="17">ObracChr10</td>
<td align="LEFT">repeatrunner</td>
<td align="LEFT">match_part</td>
<td sdval="355059" sdnum="1033;0;#,##0" align="RIGHT">355,059</td>
<td sdval="357092" sdnum="1033;0;#,##0" align="RIGHT">357,092</td>
<td sdval="3367" sdnum="1033;" align="RIGHT">3367</td>
<td align="LEFT">-</td>
<td align="LEFT">.</td>
<td align="LEFT">ID=ObracChr10:hsp:73:1.3.0.3;Parent=ObracChr10:hit:73:1.3.0.3;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 289 937 +1816</td>
</tr>
<tr>
<td align="LEFT" height="17">ObracChr10</td>
<td align="LEFT">repeatrunner</td>
<td align="LEFT">protein_match</td>
<td sdval="355059" sdnum="1033;0;#,##0" align="RIGHT">355,059</td>
<td sdval="357092" sdnum="1033;0;#,##0" align="RIGHT">357,092</td>
<td sdval="3367" sdnum="1033;" align="RIGHT">3367</td>
<td align="LEFT">-</td>
<td align="LEFT">.</td>
<td align="LEFT">ID=ObracChr10:hit:73:1.3.0.3;Name=DTM_gi_125573769_gb_EAZ15053.1hypothetical;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 289 937 +1816</td>
</tr>
</tbody>
</table>
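<br>
The overlap for this example can be recomputed directly from the coordinates in the table. What follows is only a minimal Python sketch, not the BEDtools command used for the QC step: the function names merge_intervals and overlap_bp are hypothetical, coordinates are treated as 1-based and inclusive (as in GFF3), and the CDS segments are assumed not to overlap each other.<br>
<pre>
```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and merged[-1][1] + 1 >= start:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def overlap_bp(features, repeats):
    """Count bases of `features` that fall inside any repeat interval."""
    merged_repeats = merge_intervals(repeats)
    total = 0
    for f_start, f_end in features:
        for r_start, r_end in merged_repeats:
            # Inclusive overlap length; zero when the intervals are disjoint.
            total += max(0, min(f_end, r_end) - max(f_start, r_start) + 1)
    return total


# CDS segments of Obrac10g00240.1, copied from the table.
cds = [(357756, 358075), (357209, 357319), (356965, 357081), (355056, 356874)]
# repeatrunner protein_match hits, copied from the table.
repeats = [(357755, 358084), (357202, 357294), (355059, 357092)]

cds_len = sum(end - start + 1 for start, end in cds)
shared = overlap_bp(cds, repeats)
fraction = shared / cds_len
print(f"{shared} of {cds_len} CDS bases overlap repeats ({fraction:.1%})")
```
</pre>
For this gene, 2,339 of the 2,367 CDS bases (about 98.8%) fall inside repeatrunner hits, well above the 30% cutoff mentioned above.<br>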
<br>
The same holds for both repeatmasker and repeatrunner features, and the affected gene models come from both FGENESH and SNAP predictions.<br>
How can this be explained?<br>
Thanks,<br>
<br>
Dario<br>
<br>
<br>
<br>
<br>
<pre class="moz-signature" cols="72">--
Dario Copetti, PhD
Research Associate
Arizona Genomics Institute
University of Arizona - BIO5
1657 E. Helen St.
Tucson, AZ 85721
<a class="moz-txt-link-abbreviated" href="http://www.genome.arizona.edu">www.genome.arizona.edu</a></pre>
</div>
</div>
</span>
</body>
</html>