[maker-devel] gene models overlapping with TEs

Dario Copetti dcopetti at cals.arizona.edu
Tue May 7 10:24:26 MDT 2013


Yes, there was RNA-seq evidence as well. Still I would like to have this 
evidence annotated as TE, and not as a gene (or at least to have it 
tagged in some way).

As you suggested, a good solution could be to sequentially soft mask 
with the RepeatMasker output and then hard mask with the RepeatRunner 
result. In this way we hide TE coding regions from all predictors and 
alignments, leaving all the other types of repeats soft masked. This 
meets Mark's target of having MITEs and other non-autonomous TEs (as 
well as simple/low-complexity repeats) annotated in UTRs or CDSs, if 
present. In my opinion, this could be one of the few cases (or the only 
one?) where gene and repeat annotations can overlap.
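
To make the order of the two masking passes concrete, here is a minimal
sketch in Python. It is not MAKER's internal implementation: the function
names and the toy sequence and coordinates are hypothetical, and intervals
are assumed to be 1-based and inclusive, as in GFF.

```python
def soft_mask(seq, intervals):
    """Lowercase the bases inside each 1-based, inclusive interval."""
    chars = list(seq)
    for start, end in intervals:
        for i in range(max(start - 1, 0), min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)


def hard_mask(seq, intervals):
    """Replace the bases inside each 1-based, inclusive interval with N."""
    chars = list(seq)
    for start, end in intervals:
        for i in range(max(start - 1, 0), min(end, len(chars))):
            chars[i] = "N"
    return "".join(chars)


def sequential_mask(seq, soft_hits, hard_hits):
    """Soft mask first, then hard mask, so hard masking wins on overlaps."""
    return hard_mask(soft_mask(seq, soft_hits), hard_hits)


# Hypothetical toy input: one soft-masked repeat hit (e.g. from
# RepeatMasker) and one hard-masked TE coding hit (e.g. from RepeatRunner).
masked = sequential_mask("ACGTACGTACGTACGT", [(3, 6)], [(9, 12)])
print(masked)
```

Because the hard mask is applied last, any base hit by both tools ends up
as N, while bases hit only by the soft-mask set stay visible in lowercase.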

For our genomes I will compile a list of the genes overlapping TE coding 
regions, and we will likely remove them. Please let us know how you 
intend to fix this problem and in which MAKER version the fix will appear.
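
As an illustration of how such a list could be built, here is a minimal
Python sketch that computes the fraction of a gene's CDS covered by repeat
hits and applies the 30% cutoff from my QC step. It is a stand-in for the
BEDtools intersection, not the command I actually ran; the function and
variable names are hypothetical. The coordinates are copied from the
Obrac10g00240.1 example quoted below, written without thousands separators.

```python
MIN_FRACTION = 0.30  # cutoff used in the QC step: at least 30% of the CDS


def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based, inclusive intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_bases(cds, repeats):
    """Count CDS bases that fall inside any repeat interval."""
    total = 0
    merged = merge_intervals(repeats)
    for c_start, c_end in cds:
        for r_start, r_end in merged:
            overlap = min(c_end, r_end) - max(c_start, r_start) + 1
            if overlap > 0:
                total += overlap
    return total


# CDS segments of Obrac10g00240.1 and the repeatrunner hits on ObracChr10.
cds = [(357756, 358075), (357209, 357319), (356965, 357081), (355056, 356874)]
repeats = [(357755, 358084), (357202, 357294), (355059, 357092)]

cds_length = sum(end - start + 1 for start, end in cds)
covered = covered_bases(cds, repeats)
fraction = covered / cds_length
flagged = fraction >= MIN_FRACTION
print(f"{covered}/{cds_length} CDS bases covered ({fraction:.1%}), flagged={flagged}")
```

Merging the repeat hits first keeps a base from being counted twice when
repeatmasker and repeatrunner hits overlap each other.
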
Thanks for the assistance and suggestions,

Dario



On 05/07/2013 04:39 AM, Carson Holt wrote:
> If I had to guess, I imagine the EST evidence includes assembled mRNA-seq
> reads. Is that correct?
>
> --Carson
>
>
>
> On 13-05-06 11:49 PM, "Mark Yandell" <myandell at genetics.utah.edu> wrote:
>
>> Hmm, eyeballing it then, it doesn't look like it's the UTRs...
>>
>> Mark Yandell
>> Professor of Human Genetics
>> H.A. & Edna Benning Presidential Endowed Chair
>> Eccles Institute of Human Genetics
>> University of Utah
>> 15 North 2030 East, Room 2100
>> Salt Lake City, UT 84112-5330
>> ph:801-587-7707
>>
>> ________________________________________
>> From: maker-devel-bounces at yandell-lab.org
>> [maker-devel-bounces at yandell-lab.org] on behalf of Dario Copetti
>> [dcopetti at cals.arizona.edu]
>> Sent: Monday, May 06, 2013 3:19 PM
>> To: maker-devel at yandell-lab.org
>> Cc: Stein, Joshua; Rod Wing; kapeel at cals.arizona.edu
>> Subject: [maker-devel] gene models overlapping with TEs
>>
>> Carson,
>>
>> Analyzing the output of a MAKER run on a rice-sized genome, I noticed that
>> some gene models (~10%) overlap with TE coding regions. As a QC step, I
>> used BEDtools to intersect the "CDS" features with the "repeatmasker" or
>> "repeatrunner" features, and some 2400 genes overlap for at least 30% of
>> their respective length. I am wondering how these gene models still appear
>> in the final output, since I thought that the masking step gave us
>> absolute confirmation that our endogenous gene list does not include TE
>> coding regions. Below is an example of such a gene (picture attached too):
>>
>> ObracChr10      maker   mRNA    355,056 358,075 .       -       .       ID=Obrac10g00240.1;Parent=Obrac10g00240;Name=Obrac10g00240.1;_AED=0.24;_eAED=0.24;_QI=0|0.66|0.5|1|1|1|4|0|788
>> ObracChr10      maker   exon    355,056 356,874 .       -       .       ID=Obrac10g00240.1:exon:4;Parent=Obrac10g00240.1
>> ObracChr10      maker   exon    356,965 357,081 .       -       .       ID=Obrac10g00240.1:exon:3;Parent=Obrac10g00240.1
>> ObracChr10      maker   exon    357,209 357,319 .       -       .       ID=Obrac10g00240.1:exon:2;Parent=Obrac10g00240.1
>> ObracChr10      maker   exon    357,756 358,075 .       -       .       ID=Obrac10g00240.1:exon:1;Parent=Obrac10g00240.1
>> ObracChr10      maker   CDS     357,756 358,075 .       -       2       ID=Obrac10g00240.1:cds;Parent=Obrac10g00240.1
>> ObracChr10      maker   CDS     357,209 357,319 .       -       2       ID=Obrac10g00240.1:cds;Parent=Obrac10g00240.1
>> ObracChr10      maker   CDS     356,965 357,081 .       -       2       ID=Obrac10g00240.1:cds;Parent=Obrac10g00240.1
>> ObracChr10      maker   CDS     355,056 356,874 .       -       0       ID=Obrac10g00240.1:cds;Parent=Obrac10g00240.1
>>
>> ObracChr10      repeatrunner    match_part      357,755 358,084 566     -       .       ID=ObracChr10:hsp:75:1.3.0.3;Parent=ObracChr10:hit:75:1.3.0.3;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 117 226 +320
>> ObracChr10      repeatrunner    protein_match   357,755 358,084 566     -       .       ID=ObracChr10:hit:75:1.3.0.3;Name=DTM_gi_125573769_gb_EAZ15053.1hypothetical;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 117 226 +320
>> ObracChr10      repeatrunner    match_part      357,202 357,294 142     -       .       ID=ObracChr10:hsp:74:1.3.0.3;Parent=ObracChr10:hit:74:1.3.0.3;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 264 294 +86
>> ObracChr10      repeatrunner    protein_match   357,202 357,294 142     -       .       ID=ObracChr10:hit:74:1.3.0.3;Name=DTM_gi_125573769_gb_EAZ15053.1hypothetical;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 264 294 +86
>> ObracChr10      repeatrunner    match_part      355,059 357,092 3367    -       .       ID=ObracChr10:hsp:73:1.3.0.3;Parent=ObracChr10:hit:73:1.3.0.3;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 289 937 +1816
>> ObracChr10      repeatrunner    protein_match   355,059 357,092 3367    -       .       ID=ObracChr10:hit:73:1.3.0.3;Name=DTM_gi_125573769_gb_EAZ15053.1hypothetical;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 289 937 +1816
>>
>>
>> This holds for hits from both repeatmasker and repeatrunner, and the
>> affected gene models come from either FGENESH or SNAP predictions.
>> How can I explain this problem?
>> Thanks,
>>
>> Dario
>>
>> --
>> Dario Copetti, PhD
>> Research Associate
>> Arizona Genomics Institute
>> University of Arizona - BIO5
>>
>> 1657 E. Helen St.
>> Tucson, AZ  85721
>> www.genome.arizona.edu<http://www.genome.arizona.edu>
>>
>>
>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>
>

-- 
Dario Copetti, PhD
Research Associate
Arizona Genomics Institute
University of Arizona - BIO5

1657 E. Helen St.
Tucson, AZ  85721
www.genome.arizona.edu




