[maker-devel] gene models overlapping with TEs
Carson Holt
carsonhh at gmail.com
Tue May 7 12:20:19 MDT 2013
This is really more of an evidence issue. Because you are using assembled
mRNA-seq evidence, TE sequences are probably being improperly included in
the assembled ESTs, so MAKER just follows the evidence. It tries to mask
the repeat out, but the alignment of the longer EST heavily supports the
repeat's inclusion in the model during alignment polishing.
Solutions:
1. You can set softmask=0 instead of softmask=1 (1 is the default), to
make everything hard masked instead (it will be a hard 'N' so no alignment
can happen).
2. You can pre-mask the genome. Easiest way to do this would be to
collect the query.masked.fasta files inside each theVoid directory in the
datastore and use them as the input. Then none of the polishing steps can
ever extend the alignment.
3. You can filter the mRNA-seq data for TEs before assembly.
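
For option 2, here is a rough Python sketch of how the masked files could be
gathered into a single FASTA. It is only an illustration, not something
shipped with MAKER. The file name query.masked.fasta comes from the
description above; the function name, the datastore path you pass in, and
the check that the directory name begins with "theVoid" are assumptions
made for the example.

```python
import os


def collect_masked_fasta(datastore_dir, out_path, name="query.masked.fasta"):
    """Concatenate masked FASTA files found in theVoid directories.

    Assumption: each per-contig working directory has a name that
    starts with "theVoid" and holds a file called `name`.
    Returns the list of files that were concatenated, in order.
    """
    found = []
    for root, dirs, files in os.walk(datastore_dir):
        dirs.sort()  # visit subdirectories in a deterministic order
        in_void = os.path.basename(root).startswith("theVoid")
        if in_void and name in files:
            found.append(os.path.join(root, name))

    with open(out_path, "w") as out:
        for path in found:
            with open(path) as handle:
                data = handle.read()
            out.write(data)
            # Make sure the next record starts on its own line.
            if data and not data.endswith("\n"):
                out.write("\n")
    return found
```

The resulting file would then be given to MAKER as the genome input in place
of the unmasked assembly. Write the output somewhere outside the datastore
so it is not picked up on a later run.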
Thanks,
Carson
On 13-05-07 12:24 PM, "Dario Copetti" <dcopetti at cals.arizona.edu> wrote:
>Yes, there was RNA-seq evidence as well. Still I would like to have this
>evidence annotated as TE, and not as a gene (or at least to have it
>tagged in some way).
>
>As you suggested, a good solution could be to sequentially soft mask
>with the RMasker output and then hard mask with the RRunner result. In
>this way we hide TE coding regions from all predictors and alignments,
>leaving all the other types of repeats softmasked. This meets Mark's
>target of having MITEs and other non-autonomous TEs (as well as
>simple/low compl. repeats) annotated in UTRs or CDSs, if present. In my
>opinion, this case could be one of the few cases (or the only one?)
>where gene and repeat annotation can overlap.
>
>For our genomes I will have a list of these genes overlapping TE coding
>regions, and we will likely remove them. Please let us know how you
>intend to fix this problem and on which MAKER version it will appear.
>Thanks for the assistance and suggestions,
>
>Dario
>
>
>
>On 05/07/2013 04:39 AM, Carson Holt wrote:
>> If I had to guess, I imagine the EST evidence includes assembled
>> mRNA-seq reads? Is that correct?
>>
>> --Carson
>>
>>
>>
>> On 13-05-06 11:49 PM, "Mark Yandell" <myandell at genetics.utah.edu> wrote:
>>
>>> Hmm, eyeballing it, then, it doesn't look like it's the UTRs.
>>>
>>> Mark Yandell
>>> Professor of Human Genetics
>>> H.A. & Edna Benning Presidential Endowed Chair
>>> Eccles Institute of Human Genetics
>>> University of Utah
>>> 15 North 2030 East, Room 2100
>>> Salt Lake City, UT 84112-5330
>>> ph:801-587-7707
>>>
>>> ________________________________________
>>> From: maker-devel-bounces at yandell-lab.org
>>> [maker-devel-bounces at yandell-lab.org] on behalf of Dario Copetti
>>> [dcopetti at cals.arizona.edu]
>>> Sent: Monday, May 06, 2013 3:19 PM
>>> To: maker-devel at yandell-lab.org
>>> Cc: Stein, Joshua; Rod Wing; kapeel at cals.arizona.edu
>>> Subject: [maker-devel] gene models overlapping with TEs
>>>
>>> Carson,
>>>
>>> Analyzing the output of a MAKER run on a rice-sized genome, I noticed
>>> that some gene models (~10%) overlap with TE coding regions. As a QC
>>> step, I used BEDtools to determine the intersection of "CDS" and
>>> "repeatmasker" or "repeatrunner" features, and some 2400 genes overlap
>>> for at least 30% of their respective length. I am wondering how these
>>> gene models still appear in the final output, since I thought that the
>>> masking step gave us absolute confirmation that our endogenous gene
>>> list does not include TE coding regions. Below is an example of a gene
>>> (picture attached as well):
>>>
>>> ObracChr10 maker mRNA 355,056 358,075 . - . ID=Obrac10g00240.1;Parent=Obrac10g00240;Name=Obrac10g00240.1;_AED=0.24;_eAED=0.24;_QI=0|0.66|0.5|1|1|1|4|0|788
>>> ObracChr10 maker exon 355,056 356,874 . - . ID=Obrac10g00240.1:exon:4;Parent=Obrac10g00240.1
>>> ObracChr10 maker exon 356,965 357,081 . - . ID=Obrac10g00240.1:exon:3;Parent=Obrac10g00240.1
>>> ObracChr10 maker exon 357,209 357,319 . - . ID=Obrac10g00240.1:exon:2;Parent=Obrac10g00240.1
>>> ObracChr10 maker exon 357,756 358,075 . - . ID=Obrac10g00240.1:exon:1;Parent=Obrac10g00240.1
>>> ObracChr10 maker CDS 357,756 358,075 . - 2 ID=Obrac10g00240.1:cds;Parent=Obrac10g00240.1
>>> ObracChr10 maker CDS 357,209 357,319 . - 2 ID=Obrac10g00240.1:cds;Parent=Obrac10g00240.1
>>> ObracChr10 maker CDS 356,965 357,081 . - 2 ID=Obrac10g00240.1:cds;Parent=Obrac10g00240.1
>>> ObracChr10 maker CDS 355,056 356,874 . - 0 ID=Obrac10g00240.1:cds;Parent=Obrac10g00240.1
>>>
>>> ObracChr10 repeatrunner match_part 357,755 358,084 566 - . ID=ObracChr10:hsp:75:1.3.0.3;Parent=ObracChr10:hit:75:1.3.0.3;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 117 226 +320
>>> ObracChr10 repeatrunner protein_match 357,755 358,084 566 - . ID=ObracChr10:hit:75:1.3.0.3;Name=DTM_gi_125573769_gb_EAZ15053.1hypothetical;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 117 226 +320
>>> ObracChr10 repeatrunner match_part 357,202 357,294 142 - . ID=ObracChr10:hsp:74:1.3.0.3;Parent=ObracChr10:hit:74:1.3.0.3;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 264 294 +86
>>> ObracChr10 repeatrunner protein_match 357,202 357,294 142 - . ID=ObracChr10:hit:74:1.3.0.3;Name=DTM_gi_125573769_gb_EAZ15053.1hypothetical;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 264 294 +86
>>> ObracChr10 repeatrunner match_part 355,059 357,092 3367 - . ID=ObracChr10:hsp:73:1.3.0.3;Parent=ObracChr10:hit:73:1.3.0.3;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 289 937 +1816
>>> ObracChr10 repeatrunner protein_match 355,059 357,092 3367 - . ID=ObracChr10:hit:73:1.3.0.3;Name=DTM_gi_125573769_gb_EAZ15053.1hypothetical;Target=DTM_gi_125573769_gb_EAZ15053.1hypothetical 289 937 +1816
>>>
>>>
>>> This result is valid both for output lines from repeatmasker or
>>> repeatrunner, and the gene models come from either FGENESH or SNAP
>>> predictions.
>>> How can I explain this problem?
>>> Thanks,
>>>
>>> Dario
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Dario Copetti, PhD
>>> Research Associate
>>> Arizona Genomics Institute
>>> University of Arizona - BIO5
>>>
>>> 1657 E. Helen St.
>>> Tucson, AZ 85721
>>> www.genome.arizona.edu<http://www.genome.arizona.edu>
>>>
>>>
>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>
>>
>
>--
>Dario Copetti, PhD
>Research Associate
>Arizona Genomics Institute
>University of Arizona - BIO5
>
>1657 E. Helen St.
>Tucson, AZ 85721
>www.genome.arizona.edu
>