[maker-devel] gene models overlapping with TEs

Carson Holt Carson.Holt at oicr.on.ca
Mon May 6 20:22:23 MDT 2013


Repeats can still occur within genes, so an outright block causes more errors than it avoids, and a mixed approach of hard and soft masking is more appropriate.  The masking step stops alignments from seeding in repeat regions, but alignments that seed in non-repeat regions can still extend through repeat regions during the polishing steps (i.e., the EST evidence supports extension through the repeat and inclusion of the TE).
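
To illustrate the seeding-versus-extension point, here is a toy sketch in Python. It is not how MAKER or its aligners are implemented; it assumes a soft-masked target in which lowercase letters mark repeats, exact k-mer seeds, and ungapped exact-match extension. The function name seed_and_extend and the sequences are made up for illustration.

```python
def seed_and_extend(query, target, k=4):
    """Toy model: return sorted (start, end) target spans (0-based, end-exclusive).

    Assumes an uppercase query and a soft-masked target (lowercase = repeat).
    """
    upper_target = target.upper()
    hits = set()
    for q in range(len(query) - k + 1):
        kmer = query[q:q + k]
        for t in range(len(target) - k + 1):
            window = target[t:t + k]
            # Seeding: the target k-mer must be fully unmasked (all uppercase).
            if not window.isupper() or window != kmer:
                continue
            # Extension: masking is ignored, so matches may run into repeats.
            qs, ts = q, t
            while qs > 0 and ts > 0 and query[qs - 1] == upper_target[ts - 1]:
                qs -= 1
                ts -= 1
            qe, te = q + k, t + k
            while (qe < len(query) and te < len(target)
                   and query[qe] == upper_target[te]):
                qe += 1
                te += 1
            hits.add((ts, te))
    return sorted(hits)


# Hypothetical soft-masked target: the lowercase stretch stands in for a TE.
target = "ACGTTGCAggatccaaTTAGC"

# Seeds in unmasked sequence, then extends into the repeat.
print(seed_and_extend("TTGCAGGATCC", target))  # [(3, 14)]

# Lies entirely inside the repeat, so it never seeds.
print(seed_and_extend("GGATCCAA", target))     # []
```

The first query mirrors the case described above: evidence that starts outside the repeat carries the alignment through it, while evidence confined to the repeat is never aligned.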

--Carson


From: Dario Copetti <dcopetti at cals.arizona.edu>
Organization: AGI
Date: Monday, 6 May, 2013 5:19 PM
To: <maker-devel at yandell-lab.org>
Cc: "kapeel at cals.arizona.edu" <kapeel at cals.arizona.edu>, "Stein, Joshua" <steinj at cshl.edu>, Rod Wing <rwing at Ag.arizona.edu>
Subject: gene models overlapping with TEs

Carson,

Analyzing the output of a MAKER run on a rice-sized genome, I noticed that some gene models (~10%) overlap with TE coding regions. As a QC step, I used BEDtools to intersect "CDS" features with "repeatmasker" or "repeatrunner" features, and some 2400 genes overlap for at least 30% of their respective length. I am wondering why these gene models still appear in the final output, since I thought the masking step gave us absolute confirmation that our endogenous gene list does not include TE coding regions. Below is an example of one such gene (picture attached as well):

ObracChr10      maker   mRNA    355,056 358,075 .       -       .       ID=Obrac10g00240.1;Parent=Obrac10g00240;Name=Obrac10g00240.1;_AED=0.24;_eAED=0.24;_QI=0|0.66|0.5|1|1|1|4|0|788
ObracChr10      maker   exon    355,056 356,874 .       -       .       ID=Obrac10g00240.1:exon:4;Parent=Obrac10g00240.1
ObracChr10      maker   exon    356,965 357,081 .       -       .       ID=Obrac10g00240.1:exon:3;Parent=Obrac10g00240.1
ObracChr10      maker   exon    357,209 357,319 .       -       .       ID=Obrac10g00240.1:exon:2;Parent=Obrac10g00240.1
ObracChr10      maker   exon    357,756 358,075 .       -       .       ID=Obrac10g00240.1:exon:1;Parent=Obrac10g00240.1
ObracChr10      maker   CDS     357,756 358,075 .       -       2       ID=Obrac10g00240.1:cds;Parent=Obrac10g00240.1
ObracChr10      maker   CDS     357,209 357,319 .       -       2       ID=Obrac10g00240.1:cds;Parent=Obrac10g00240.1
ObracChr10      maker   CDS     356,965 357,081 .       -       2       ID=Obrac10g00240.1:cds;Parent=Obrac10g00240.1
ObracChr10      maker   CDS     355,056 356,874 .       -       0       ID=Obrac10g00240.1:cds;Parent=Obrac10g00240.1

ObracChr10      repeatrunner    match_part      357,755 358,084 566     -       .       ID=ObracChr10:hsp:75:1.3.0.3;Parent=ObracChr10:hit:75:1.3.0.3;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 117 226 +320
ObracChr10      repeatrunner    protein_match   357,755 358,084 566     -       .       ID=ObracChr10:hit:75:1.3.0.3;Name=DTM_gi_125573769_gb_EAZ15053.1hypothetical;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 117 226 +320
ObracChr10      repeatrunner    match_part      357,202 357,294 142     -       .       ID=ObracChr10:hsp:74:1.3.0.3;Parent=ObracChr10:hit:74:1.3.0.3;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 264 294 +86
ObracChr10      repeatrunner    protein_match   357,202 357,294 142     -       .       ID=ObracChr10:hit:74:1.3.0.3;Name=DTM_gi_125573769_gb_EAZ15053.1hypothetical;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 264 294 +86
ObracChr10      repeatrunner    match_part      355,059 357,092 3367    -       .       ID=ObracChr10:hsp:73:1.3.0.3;Parent=ObracChr10:hit:73:1.3.0.3;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 289 937 +1816
ObracChr10      repeatrunner    protein_match   355,059 357,092 3367    -       .       ID=ObracChr10:hit:73:1.3.0.3;Name=DTM_gi_125573769_gb_EAZ15053.1hypothetical;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 289 937 +1816


This result holds for output lines from both repeatmasker and repeatrunner, and the overlapping gene models come from either FGENESH or SNAP predictions.
How can I explain this problem?
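
For reference, the 30% overlap check can be sketched in plain Python for the example gene above. This is a stand-in for the BEDtools step, not the actual command that was run; the coordinates are copied from the CDS and repeatrunner lines shown, and the helper names (merge, covered_bases) are hypothetical.

```python
def merge(intervals):
    """Merge overlapping or adjacent 1-based inclusive intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_bases(cds, repeats):
    """Count CDS bases that fall inside any repeat interval.

    Assumes the CDS segments themselves do not overlap each other.
    """
    repeat_blocks = merge(repeats)  # merge first so no base is counted twice
    count = 0
    for c_start, c_end in cds:
        for r_start, r_end in repeat_blocks:
            overlap = min(c_end, r_end) - max(c_start, r_start) + 1
            if overlap > 0:
                count += overlap
    return count


# CDS segments of Obrac10g00240.1 (from the GFF lines above).
cds = [(355056, 356874), (356965, 357081), (357209, 357319), (357756, 358075)]
# repeatrunner protein_match features overlapping the gene.
repeats = [(355059, 357092), (357202, 357294), (357755, 358084)]

total = sum(end - start + 1 for start, end in cds)
covered = covered_bases(cds, repeats)
fraction = covered / total

print(f"{covered} of {total} CDS bases overlap repeats ({fraction:.1%})")
print("flagged" if fraction >= 0.30 else "kept")  # 30% threshold from the QC step
```

Merging the repeat intervals before counting matters when repeatmasker and repeatrunner hits are pooled, since the two tracks often annotate the same bases.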
Thanks,

Dario

--
Dario Copetti, PhD
Research Associate
Arizona Genomics Institute
University of Arizona - BIO5

1657 E. Helen St.
Tucson, AZ  85721
www.genome.arizona.edu