[maker-devel] diff. numbers of geneson contigs vs. scaffolded genome

Thu Sep 25 12:17:19 MDT 2014

Sorry for the slow reply.  I was trying to locate a script that might be
useful for you.

I think a species specific repeat libary will be of most benefit here
(it's surprising just how influential this step is).  Also note that you
should train SNAP and Augustus on your species and are not just using
another related species as a stand in.

With respect to PFAM domains, on some organisms you may not get a lot of
cross species protein alignments because of divergence or assembly issues.
This of course makes it harder to support these models with direct protein
alignments.  However you can run InterProscan over the
non-overlapping.proteins.fasta file produced by MAKER (contains
non-redundant rejected models).  Because an HMM is used for domain
identification, it can pick up protein domains that would not produce a
significant BLAST alignment because of divergence. You can then add models
with positive hits for protein domains back into your gene set.

This ad hoc procedure usually can only increase gene counts by about 10%
though for organisms where it's required. I've attached a script that
makes generating results for these genes easier.

1. First you run InterProScan with just PFAM.
2. Then you take the IDs of all models that have a domain in the report
and create a list (1 ID per line).
3. Next use the fasta_tool script that comes with MAKER together with the
--select flag to separate just the positive hits (ID's in your list) from
the non-overlapping.proteins.fasta and non-overlapping.transscripts.fasta
files.
4. Use the attached script to separate just the positive hits (your ID
list) from the GFF3. The script will upgrade match/match_part results to
gene/mRNA/exon/CDS results and print them out for you.
5. Use the fasta_maerge and gff3_merge scripts that come with MAKER to
merge the selected/upgraded GFF3 entries and selected FASTA entries back
into the original MAKER results.

--Carson

On 9/23/14, 6:36 AM, "Stefan Zoller" <stefan.zoller at env.ethz.ch> wrote:

>Please forgive my ignorance, I am not entirely sure if I understand your
>question correctly, but I will try to answer.
>As evidence we use:
>1) our own transcriptome (trinity assembled RNAseq, filtering out the
>very low expression transcripts).
>2) all swissprot plant proteins, and several protein sets from closely
>related plant species downloaded from NCBI.
>I am not sure if the ab-initio predictions without evidence have pfamm
>domains. Honestly, I would not know how to tell and how to
>include/exclude.
>I was assuming that we should not have too many Maker approved
>predictions without evidence anyway, because we use "keeps_preds=0".
>The numbers of gene predictions I mentioned in my email are the
>predictions reported by the fasta_merge/gff3_merge scripts in the
>"*maker.proteins.fasta". There are of course many more predictions in
>e.g., "*maker.augustus_masked.proteins.fasta" (about 68'000 in this file).
>
>I hope I am not totally off with my answer.
>Cheers, Stefan
>
>
>
>On 23.09.14 02:10, Mark Yandell wrote:
>> Also are you numbers including the ab-inito predictions without
>>evidence that have pfamm domains?
>>
>> cheers,
>>
>>
>> --mark
>>
>>
>>
>> Mark Yandell
>> Professor of Human Genetics
>> H.A. & Edna Benning Presidential Endowed Chair
>> Co-director USTAR Center for Genetic Discovery
>> Eccles Institute of Human Genetics
>> University of Utah
>> 15 North 2030 East, Room 2100
>> Salt Lake City, UT 84112-5330
>> ph:801-587-7707
>>
>> ________________________________________
>> From: maker-devel [maker-devel-bounces at yandell-lab.org] on behalf of
>>Carson Holt [carson.holt at genetics.utah.edu]
>> Sent: Monday, September 22, 2014 2:17 PM
>> To: stefan.zoller at env.ethz.ch; maker-devel at yandell-lab.org
>> Subject: Re: [maker-devel] diff. numbers of geneson contigs vs.
>>scaffolded      genome
>>
>> The contiged assembly is more likely to give spurious hits and
>>alignments.
>>   They also can be harder to repeat mask.  Also gene predictors can
>>behave
>> slightly different on small sequences than on longer ones.  If you have
>> fewer gene models than you expect, your first step should be to process
>> the scaffolds with CEGMA.  It will give you an estimate of the genomes
>> "completeness".  If CEGMA gives a 60% completeness value for example
>>then
>> you can expect to only recover 60% of the expected number of genes. Next
>> you should run RepeatModeler of similar software to help generate a
>> species specific repeat library.  Under masked repeats can make
>>predicting
>> genes on longer scaffolds far more difficult for ab initio predictors.
>>
>> --Carson
>>
>>
>> On 9/19/14, 12:32 AM, "Stefan Zoller" <stefan.zoller at env.ethz.ch> wrote:
>>
>>> Hi,
>>>
>>> I am working on the annotation of a plant genome (about 600MB) and we
>>> have a reasonable draft assembly, a fairly good transcriptome and quite
>>> a few proteins from related species. We have also extensively trained
>>> augustus and are also feeding genmark and snap predictions.
>>>
>>> Recently I noticed a behavior of Maker that seems fairly odd and which
>>>I
>>> cannot explain at all. When I take the scaffolded genome (about 23000
>>> scaffolds) I get roughly 9'000 maker approved gene models. Which is
>>> admittedly a bit on the low side and we have to work on this. However,
>>> when I break up the scaffolds into contigs at stretches of N longer
>>> 500bp (about 60'000 contigs) I get about 17'000 maker gene models. Now
>>> obviously 17'000 is more in the range what I would expect, so I am
>>> inclined to go with these. I have looked at both annotations and the
>>> evidence in WebApollo and the evidence alignments are identical for
>>>both
>>> runs. The approved genes seem to be the same, except for the additional
>>> ones in the "contiged" genome version. The additional gene models are
>>> not necessarily at the ends of the contigs, so I think it has nothing
>>>to
>>> do with having the stretches of Ns nearby in the scaffolded genome. Do
>>> you have any idea why maker comes up with the additional numbers of
>>>gene
>>> models and how I could "convince" maker to give me the same gene models
>>> for the scaffolded assembly?
>>>
>>> Cheers,
>>> Stefan
>>>
>>>
>>>
>>> --
>>> Stefan Zoller, PhD
>>> Bioinformatics
>>> Genetic Diversity Centre
>>> ETH Zurich CHN E55.1
>>> Universitätsstrasse 16
>>> 8092 Zurich
>>> Switzerland
>>>
>>> Phone: +41 44 632 66 85
>>> E-Mail:  stefan.zoller at env.ethz.ch
>>> Web: www.gdc.ethz.ch
>>>
>>>
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

-------------- next part --------------
A non-text attachment was scrubbed...
Name: gff3_preds2models
Type: application/octet-stream
Size: 5523 bytes
Desc: gff3_preds2models
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20140925/8c8aca31/attachment-0003.obj>