[maker-devel] Fragmented annotation

Mon Jul 11 11:29:14 MDT 2016

The most likely culprit is still that repeats have not been properly masked. Repeats encode real proteins (e.g. reverse transcriptase), so any that are left unmasked will be annotated as genes, complete with start and stop codons.

—Carson
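One rough way to sanity-check masking is to measure how much of a masked assembly FASTA is lowercase (soft-masked) or N (hard-masked). The sketch below is only an illustration, not part of MAKER. It assumes a masked copy of the assembly exists as a FASTA file, and the function name masked_fraction is made up for this example. A low fraction hints at under-masking but does not prove it.

```python
def masked_fraction(fasta_lines):
    """Return (soft_masked, hard_masked, total) base counts for FASTA lines.

    Soft-masked = lowercase bases other than n; hard-masked = N or n.
    """
    soft = hard = total = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and sequence headers
        total += len(line)
        hard += line.count("N") + line.count("n")
        soft += sum(1 for base in line if base.islower() and base != "n")
    return soft, hard, total


# Tiny in-memory example; for a real file, pass an open file handle instead.
demo = [">scaffold_1", "ACGTacgtNNNN", "acgtACGT"]
soft, hard, total = masked_fraction(demo)
print(f"soft-masked: {soft / total:.1%}, hard-masked: {hard / total:.1%}")
```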

> On Jul 9, 2016, at 2:50 PM, Ole Kristian Tørresen <o.k.torresen at ibv.uio.no> wrote:
> 
> Daniel, Carson,
> I haven't been able to spend much time on this yet, but a quick GTF conversion shows that of the 96576 genes, 91089 have a stop codon and 91984 have a start codon.
> 
> I guess the length of the genes/proteins could also be a factor I could look into.
> 
> If the annotation is fragmented, which options could I try tuning?
> 
> Thank you.
> 
> Ole
> ________________________________________
> From: Daniel Ence <dence at genetics.utah.edu>
> Sent: 08 July 2016 00:13
> To: Ole Kristian Tørresen
> Cc: maker-devel at yandell-lab.org
> Subject: Re: [maker-devel] Fragmented annotation
> 
> That’s what I was thinking. If you have the fasta files of the transcripts, then you can do a fast pattern match on the beginning and end of each sequence.
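A minimal sketch of the pattern match described above. It assumes the FASTA holds coding sequences only (no UTRs), so each record should start with ATG and end with TAA, TAG or TGA. The helper names read_fasta, classify and summarize are made up for this example.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].strip()
            name, chunks = (header.split()[0] if header else ""), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def classify(seq):
    """Return (has_start, has_stop) for one coding sequence."""
    seq = seq.upper()
    return seq.startswith(START), len(seq) >= 3 and seq[-3:] in STOPS


def summarize(lines):
    """Count records with a start codon, a stop codon, and both."""
    counts = {"total": 0, "with_start": 0, "with_stop": 0, "complete": 0}
    for _, seq in read_fasta(lines):
        has_start, has_stop = classify(seq)
        counts["total"] += 1
        counts["with_start"] += has_start
        counts["with_stop"] += has_stop
        counts["complete"] += has_start and has_stop
    return counts


# In-memory example; for a real file, pass an open file handle instead.
demo = [
    ">t1 complete", "ATGAAA", "TAA",
    ">t2 no_start", "AAACCCTGA",
    ">t3 no_stop", "ATGAAACCC",
]
print(summarize(demo))
```

If the sequences do include UTRs, this check will under-count complete genes, and the CDS coordinates in the GFF are the safer source.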
> 
> This is also the kind of thing that can quickly become clear in a browser like you already suggested. In the old desktop Apollo genome browser, incomplete genes were highlighted with orange arrows. I don’t know how other browsers handle that.
> 
> ~Daniel
> 
> Daniel Ence
> Graduate Student
> Eccles Institute of Human Genetics
> University of Utah
> 15 North 2030 East, Room 2100
> Salt Lake City, UT 84112-5330
> 
>> On Jul 7, 2016, at 3:56 PM, Ole Kristian Tørresen <o.k.torresen at ibv.uio.no> wrote:
>> 
>> Sure, but is there a quick way of doing this? With UTRs and such, I am unsure how to parse the GFF properly. Would it be the first three bases of the first CDS for each gene, and the last three bases of the last CDS, or something like that?
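One possible way to do this from the GFF, as a sketch under stated assumptions. It uses only CDS features, which sidesteps the UTRs. All CDS segments of a transcript are joined (and reverse-complemented on the minus strand) before the first and last three bases are checked, so a codon split across two exons is still handled. It assumes GFF3 with a Parent attribute on CDS lines, that the stop codon is included in the CDS, and that the genome is held in memory as a dict of sequences. All function names are made up for this example.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def cds_by_transcript(gff_lines):
    """Map each Parent ID to its CDS segments (seqid, start, end, strand)."""
    cds = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                cds[parent].append((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return cds


def coding_sequence(segments, genome):
    """Join CDS segments in genomic order; flip for minus-strand features."""
    segments = sorted(segments, key=lambda s: s[1])
    seqid, strand = segments[0][0], segments[0][3]
    # GFF coordinates are 1-based and inclusive.
    seq = "".join(genome[seqid][start - 1:end] for _, start, end, _ in segments)
    return reverse_complement(seq) if strand == "-" else seq


def codon_status(gff_lines, genome):
    """Return {transcript_id: (has_start, has_stop)}."""
    status = {}
    for tid, segments in cds_by_transcript(gff_lines).items():
        seq = coding_sequence(segments, genome).upper()
        status[tid] = (seq[:3] == "ATG", seq[-3:] in STOPS)
    return status


# Toy genome: a two-exon plus-strand gene and a one-exon minus-strand gene.
genome = {"chr1": "CCATGAAAGTAAGTCCCTAAGG", "chr2": "TTTCAGGGTTTAA"}
gff = [
    "\t".join(["chr1", "maker", "CDS", "3", "8", ".", "+", "0", "ID=c1;Parent=mRNA1"]),
    "\t".join(["chr1", "maker", "CDS", "15", "20", ".", "+", "0", "ID=c2;Parent=mRNA1"]),
    "\t".join(["chr2", "maker", "CDS", "3", "11", ".", "-", "0", "ID=c3;Parent=mRNA2"]),
]
status = codon_status(gff, genome)
print(status)
```

In GTF, the stop codon is conventionally a separate stop_codon feature rather than part of the CDS, so the stop check would need adjusting for a GTF file.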
>> 
>> Ole
>> ________________________________________
>> From: Daniel Ence <dence at genetics.utah.edu>
>> Sent: 07 July 2016 23:48
>> To: Ole Kristian Tørresen
>> Cc: maker-devel at yandell-lab.org
>> Subject: Re: [maker-devel] Fragmented annotation
>> 
>> To address your suspicion that your genes are fragmented, can you check how many of the protein or transcript sequences begin and end with canonical start and stop codons? That might tell you whether you have “gene-parts” rather than full genes.
>> 
>> ~Daniel
>> 
>>> On Jul 7, 2016, at 3:44 PM, Daniel Ence <dence at genetics.utah.edu> wrote:
>>> 
>>> Hi Ole, when I hear that a genome has too many genes annotated, one of the first things I think of is masking of repetitive elements in the genome. Unmasked repeats can contribute a large number of spurious gene annotations that originate from transposable elements. What did you use for repeat masking for your genome? Did you run MAKER on a pre-masked version of the assembly?
>>> 
>>> ~Daniel
>>> 
>>>> On Jul 7, 2016, at 3:29 PM, Ole Kristian Tørresen <o.k.torresen at ibv.uio.no> wrote:
>>>> 
>>>> Hi all,
>>>> I have annotated a fish genome (about 700 Mbp total, 90 kbp N50 contig, 270 kbp N50 scaffold), where I get 96576 gene models, 67917 with default filtering (quality_filter.pl -d) and 67917 with standard filtering (quality_filter.pl -s). I chose to report all genes with AED less than 0.5 (27437) as the high quality set.
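The AED cut-off described above can be reproduced with a few lines of Python, sketched here under the assumption that the GFF3 carries an _AED attribute on each mRNA line, as MAKER output does. The function names and the toy IDs are made up for this example.

```python
def transcript_aed(gff_lines):
    """Return {mRNA ID: AED} parsed from the _AED attribute of mRNA features."""
    aed = {}
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section follows; no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        if "ID" in attrs and "_AED" in attrs:
            aed[attrs["ID"]] = float(attrs["_AED"])
    return aed


def count_below(aed, threshold=0.5):
    """Count transcripts with AED strictly below the threshold."""
    return sum(1 for value in aed.values() if value < threshold)


# Toy example with three transcripts; IDs and scores are invented.
demo = [
    "##gff-version 3",
    "\t".join(["s1", "maker", "mRNA", "1", "900", ".", "+", ".", "ID=m1;Parent=g1;_AED=0.10"]),
    "\t".join(["s1", "maker", "mRNA", "2000", "2900", ".", "-", ".", "ID=m2;Parent=g2;_AED=0.50"]),
    "\t".join(["s2", "maker", "mRNA", "50", "700", ".", "+", ".", "ID=m3;Parent=g3;_AED=0.87"]),
]
scores = transcript_aed(demo)
print(count_below(scores), "of", len(scores), "transcripts have AED < 0.5")
```

This counts transcripts rather than genes, so a gene with several isoforms would need to be collapsed by its Parent ID before comparing with a gene count.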
>>>> 
>>>> However, I have some doubts. 70k genes cannot be correct for this species (it is not polyploid); the correct number should be a bit more than 20k, I think. I suspect that many of my genes are fragmented. How can I fix this? I have tried searching the forum, but cannot find any good answers. Are there any parameters I can adjust?
>>>> 
>>>> I have used SwissProt/UniProt and a Trinity assembly of reads from several stages of embryo development as evidence. In the first pass I used SNAP trained with CEGMA, AUGUSTUS trained with BUSCO Actinopterygii genes, and GeneMark-ES. In the second pass I used SNAP trained on the first-pass annotation, AUGUSTUS trained on the transcriptome and the first-pass annotation, and GeneMark.
>>>> 
>>>> Thank you.
>>>> 
>>>> Ole
>>>> _______________________________________________
>>>> maker-devel mailing list
>>>> maker-devel at box290.bluehost.com
>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org