[maker-devel] gene models overlapping with TEs
Dario Copetti
dcopetti at cals.arizona.edu
Tue May 7 10:24:26 MDT 2013
Yes, there was RNA-seq evidence as well. Still, I would like to have this
evidence annotated as TE and not as a gene (or at least to have it
tagged in some way).
As you suggested, a good solution could be to mask sequentially: soft mask
with the RepeatMasker output first, then hard mask with the RepeatRunner
result. This hides TE coding regions from all predictors and alignments
while leaving all other repeat types soft masked. It also meets Mark's
target of having MITEs and other non-autonomous TEs (as well as
simple/low-complexity repeats) annotated in UTRs or CDSs, if present. In my
opinion, this could be one of the few cases (or the only one?) where gene
and repeat annotation can overlap.
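
To make the masking order concrete, here is a minimal Python sketch of the
idea. It is not MAKER's internal code. The function name soft_then_hard_mask
is hypothetical, and the 1-based inclusive coordinates (as in GFF3) are an
assumption; the "soft" intervals stand in for RepeatMasker hits and the "hard"
intervals for RepeatRunner hits.

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (assumed, as in GFF3)


def soft_then_hard_mask(seq: str,
                        soft: Iterable[Interval],
                        hard: Iterable[Interval]) -> str:
    """Lowercase the soft intervals, then overwrite the hard intervals with N."""
    bases = list(seq)
    # Step 1: soft mask (lowercase) every base covered by a "soft" hit.
    for start, end in soft:
        for i in range(max(start, 1) - 1, min(end, len(bases))):
            bases[i] = bases[i].lower()
    # Step 2: hard mask (N) every base covered by a "hard" hit.
    # Running this second means hard masking wins where the two overlap.
    for start, end in hard:
        for i in range(max(start, 1) - 1, min(end, len(bases))):
            bases[i] = "N"
    return "".join(bases)


# Toy example: soft hit on bases 2-4, hard hit on bases 4-6.
masked = soft_then_hard_mask("ACGTACGTAC", soft=[(2, 4)], hard=[(4, 6)])
```

Because the hard mask is applied last, a base covered by both kinds of hit
ends up as N, which is what would hide TE coding sequence from the predictors
and aligners.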
For our genomes I will compile a list of the genes overlapping TE coding
regions, and we will likely remove them. Please let us know how you
intend to fix this problem and in which MAKER version the fix will appear.
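
A rough sketch of how such a list could be built, written in plain Python
rather than with BEDtools. The function names (merge_intervals,
te_overlap_fraction, flag_te_genes) are hypothetical, coordinates are assumed
to be 1-based and inclusive as in GFF3, and all features are assumed to lie on
the same sequence (strand is ignored). The 0.30 default follows the 30% cutoff
from the original report, and the example coordinates are taken from the CDS
and repeatrunner lines quoted below.

```python
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (assumed, as in GFF3)


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or adjacent intervals so no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_length(intervals: Iterable[Interval]) -> int:
    """Number of distinct bases covered by the intervals."""
    return sum(end - start + 1 for start, end in merge_intervals(intervals))


def overlap_length(cds: Iterable[Interval], repeats: Iterable[Interval]) -> int:
    """Number of CDS bases that fall inside at least one repeat interval."""
    merged_repeats = merge_intervals(repeats)
    total = 0
    for c_start, c_end in merge_intervals(cds):
        for r_start, r_end in merged_repeats:
            lo, hi = max(c_start, r_start), min(c_end, r_end)
            if lo <= hi:
                total += hi - lo + 1
    return total


def te_overlap_fraction(cds: List[Interval], repeats: List[Interval]) -> float:
    """Fraction of the CDS length covered by repeat annotation."""
    length = covered_length(cds)
    return overlap_length(cds, repeats) / length if length else 0.0


def flag_te_genes(genes: Dict[str, List[Interval]],
                  repeats: List[Interval],
                  min_fraction: float = 0.30) -> List[str]:
    """Return IDs of genes whose CDS overlap with repeats reaches the cutoff."""
    return sorted(gene_id for gene_id, cds in genes.items()
                  if te_overlap_fraction(cds, repeats) >= min_fraction)


# CDS and repeatrunner coordinates of the example gene on ObracChr10.
cds = [(355056, 356874), (356965, 357081), (357209, 357319), (357756, 358075)]
repeats = [(357755, 358084), (357202, 357294), (355059, 357092)]
flagged = flag_te_genes({"Obrac10g00240.1": cds}, repeats)
```

For the example gene, 2339 of the 2367 CDS bases fall inside repeatrunner
hits, so it is flagged well above the 30% cutoff.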
Thanks for the assistance and suggestions,
Dario
On 05/07/2013 04:39 AM, Carson Holt wrote:
> If I had to guess, I imagine the EST evidence includes assembled mRNA-seq
> reads. Is that correct?
>
> --Carson
>
>
>
> On 13-05-06 11:49 PM, "Mark Yandell" <myandell at genetics.utah.edu> wrote:
>
>> Hmm, eyeballing it then, it doesn't look like it's the UTRs.
>>
>> Mark Yandell
>> Professor of Human Genetics
>> H.A. & Edna Benning Presidential Endowed Chair
>> Eccles Institute of Human Genetics
>> University of Utah
>> 15 North 2030 East, Room 2100
>> Salt Lake City, UT 84112-5330
>> ph:801-587-7707
>>
>> ________________________________________
>> From: maker-devel-bounces at yandell-lab.org
>> [maker-devel-bounces at yandell-lab.org] on behalf of Dario Copetti
>> [dcopetti at cals.arizona.edu]
>> Sent: Monday, May 06, 2013 3:19 PM
>> To: maker-devel at yandell-lab.org
>> Cc: Stein, Joshua; Rod Wing; kapeel at cals.arizona.edu
>> Subject: [maker-devel] gene models overlapping with TEs
>>
>> Carson,
>>
>> Analyzing the output of a MAKER run on a rice-sized genome, I noticed that
>> some gene models (~10%) overlap with TE coding regions. As a QC step, I
>> used BEDtools to intersect the "CDS" features with the "repeatmasker" and
>> "repeatrunner" features: some 2400 genes overlap for at least 30% of their
>> respective length. I am wondering why these gene models still appear in the
>> final output, since I thought the masking step gave us absolute
>> confirmation that our endogenous gene list does not include TE coding
>> regions. Below is an example of such a gene (picture attached too):
>>
>> ObracChr10 maker mRNA 355,056 358,075 . - . ID=Obrac10g00240.1;Parent=Obrac10g00240;Name=Obrac10g00240.1;_AED=0.24;_eAED=0.24;_QI=0|0.66|0.5|1|1|1|4|0|788
>> ObracChr10 maker exon 355,056 356,874 . - . ID=Obrac10g00240.1:exon:4;Parent=Obrac10g00240.1
>> ObracChr10 maker exon 356,965 357,081 . - . ID=Obrac10g00240.1:exon:3;Parent=Obrac10g00240.1
>> ObracChr10 maker exon 357,209 357,319 . - . ID=Obrac10g00240.1:exon:2;Parent=Obrac10g00240.1
>> ObracChr10 maker exon 357,756 358,075 . - . ID=Obrac10g00240.1:exon:1;Parent=Obrac10g00240.1
>> ObracChr10 maker CDS 357,756 358,075 . - 2 ID=Obrac10g00240.1:cds;Parent=Obrac10g00240.1
>> ObracChr10 maker CDS 357,209 357,319 . - 2 ID=Obrac10g00240.1:cds;Parent=Obrac10g00240.1
>> ObracChr10 maker CDS 356,965 357,081 . - 2 ID=Obrac10g00240.1:cds;Parent=Obrac10g00240.1
>> ObracChr10 maker CDS 355,056 356,874 . - 0 ID=Obrac10g00240.1:cds;Parent=Obrac10g00240.1
>>
>> ObracChr10 repeatrunner match_part 357,755 358,084 566 - . ID=ObracChr10:hsp:75:1.3.0.3;Parent=ObracChr10:hit:75:1.3.0.3;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 117 226 +320
>> ObracChr10 repeatrunner protein_match 357,755 358,084 566 - . ID=ObracChr10:hit:75:1.3.0.3;Name=DTM_gi_125573769_gb_EAZ15053.1hypothetical;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 117 226 +320
>> ObracChr10 repeatrunner match_part 357,202 357,294 142 - . ID=ObracChr10:hsp:74:1.3.0.3;Parent=ObracChr10:hit:74:1.3.0.3;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 264 294 +86
>> ObracChr10 repeatrunner protein_match 357,202 357,294 142 - . ID=ObracChr10:hit:74:1.3.0.3;Name=DTM_gi_125573769_gb_EAZ15053.1hypothetical;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 264 294 +86
>> ObracChr10 repeatrunner match_part 355,059 357,092 3367 - . ID=ObracChr10:hsp:73:1.3.0.3;Parent=ObracChr10:hit:73:1.3.0.3;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 289 937 +1816
>> ObracChr10 repeatrunner protein_match 355,059 357,092 3367 - . ID=ObracChr10:hit:73:1.3.0.3;Name=DTM_gi_125573769_gb_EAZ15053.1hypothetical;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 289 937 +1816
>>
>>
>> This holds for output lines from both repeatmasker and repeatrunner, and
>> the gene models come from either FGENESH or SNAP predictions.
>> How can this be explained?
>> Thanks,
>>
>> Dario
>>
>>
>>
>>
>>
>> --
>> Dario Copetti, PhD
>> Research Associate
>> Arizona Genomics Institute
>> University of Arizona - BIO5
>>
>> 1657 E. Helen St.
>> Tucson, AZ 85721
>> www.genome.arizona.edu<http://www.genome.arizona.edu>
>>
>>
>>
>
>
--
Dario Copetti, PhD
Research Associate
Arizona Genomics Institute
University of Arizona - BIO5
1657 E. Helen St.
Tucson, AZ 85721
www.genome.arizona.edu