[maker-devel] diff. numbers of genes on contigs vs. scaffolded genome

Mon Sep 22 18:10:38 MDT 2014

Also, are your numbers including the ab initio predictions without evidence that have Pfam domains?

cheers,

--mark

Mark Yandell
Professor of Human Genetics
H.A. & Edna Benning Presidential Endowed Chair
Co-director USTAR Center for Genetic Discovery
Eccles Institute of Human Genetics
University of Utah
15 North 2030 East, Room 2100
Salt Lake City, UT 84112-5330
ph:801-587-7707

________________________________________
From: maker-devel [maker-devel-bounces at yandell-lab.org] on behalf of Carson Holt [carson.holt at genetics.utah.edu]
Sent: Monday, September 22, 2014 2:17 PM
To: stefan.zoller at env.ethz.ch; maker-devel at yandell-lab.org
Subject: Re: [maker-devel] diff. numbers of genes on contigs vs. scaffolded genome

The contiged assembly is more likely to give spurious hits and alignments.
Contigs can also be harder to repeat mask, and gene predictors can behave
slightly differently on short sequences than on longer ones.  If you have
fewer gene models than you expect, your first step should be to process
the scaffolds with CEGMA.  It will give you an estimate of the genome's
"completeness".  If CEGMA gives a 60% completeness value, for example, then
you can expect to recover only 60% of the expected number of genes.  Next
you should run RepeatModeler or similar software to help generate a
species-specific repeat library.  Under-masked repeats can make predicting
genes on longer scaffolds far more difficult for ab initio predictors.
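
As a rough illustration of the completeness arithmetic above, here is a
minimal Python sketch.  The function name and the 20,000-gene figure are
made up for the example; they are not values from this thread or from
CEGMA itself.

```python
def recoverable_genes(expected_genes, completeness_pct):
    """Scale an expected gene count by an assembly completeness percentage."""
    if not 0 <= completeness_pct <= 100:
        raise ValueError("completeness_pct must be between 0 and 100")
    return round(expected_genes * completeness_pct / 100)


# Hypothetical numbers: 20,000 expected genes at 60% completeness.
print(recoverable_genes(20000, 60))  # prints 12000
```

This is only a ceiling on what to expect; it says nothing about why two
runs on the same sequence would differ.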

--Carson

On 9/19/14, 12:32 AM, "Stefan Zoller" <stefan.zoller at env.ethz.ch> wrote:

>Hi,
>
>I am working on the annotation of a plant genome (about 600 Mb) and we
>have a reasonable draft assembly, a fairly good transcriptome and quite
>a few proteins from related species. We have also extensively trained
>AUGUSTUS and are also feeding in GeneMark and SNAP predictions.
>
>Recently I noticed a behavior of MAKER that seems fairly odd and which I
>cannot explain at all. When I take the scaffolded genome (about 23'000
>scaffolds) I get roughly 9'000 MAKER-approved gene models, which is
>admittedly a bit on the low side and we have to work on this. However,
>when I break up the scaffolds into contigs at stretches of N longer than
>500 bp (about 60'000 contigs) I get about 17'000 MAKER gene models. Now
>obviously 17'000 is more in the range of what I would expect, so I am
>inclined to go with these. I have looked at both annotations and the
>evidence in WebApollo, and the evidence alignments are identical for both
>runs. The approved genes seem to be the same, except for the additional
>ones in the "contiged" genome version. The additional gene models are
>not necessarily at the ends of the contigs, so I think it has nothing to
>do with having the stretches of Ns nearby in the scaffolded genome. Do
>you have any idea why MAKER comes up with the additional gene models
>and how I could "convince" MAKER to give me the same gene models for
>the scaffolded assembly?
>
>Cheers,
>Stefan
>
>
>
>--
>Stefan Zoller, PhD
>Bioinformatics
>Genetic Diversity Centre
>ETH Zurich CHN E55.1
>Universitätsstrasse 16
>8092 Zurich
>Switzerland
>
>Phone: +41 44 632 66 85
>E-Mail:  stefan.zoller at env.ethz.ch
>Web: www.gdc.ethz.ch
>
>
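
For anyone who wants to reproduce the comparison, the splitting step
described in the quoted message (breaking scaffolds at stretches of N
longer than 500 bp) can be sketched in Python as below.  This is a minimal
sketch, not the script Stefan used: the function name, the contig naming
scheme and the returned offset are all assumptions made for illustration.

```python
import re


def split_scaffold(name, seq, min_gap=501):
    """Yield (contig_name, offset, contig_seq) for each piece of `seq`.

    Pieces are separated by runs of at least `min_gap` N characters, so the
    default of 501 splits only at runs longer than 500 bp. Shorter N runs
    are left inside the pieces. `offset` is the 0-based start of the piece
    on the original scaffold.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    start = 0
    index = 1
    for match in gap.finditer(seq):
        piece = seq[start:match.start()]
        if piece:  # skip empty pieces, e.g. a scaffold that starts with a gap
            yield (f"{name}_contig{index}", start, piece)
            index += 1
        start = match.end()
    piece = seq[start:]
    if piece:
        yield (f"{name}_contig{index}", start, piece)


# Toy scaffold: one 600 bp gap (split) and one 100 bp gap (kept).
toy = "ACGT" + "N" * 600 + "GGCC" + "N" * 100 + "TT"
for contig_name, offset, contig_seq in split_scaffold("scaffold1", toy):
    print(contig_name, offset, len(contig_seq))
```

Keeping the offset makes it possible to map gene models predicted on a
contig back to scaffold coordinates, so the extra models can be checked
against the scaffold-level run.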

_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org