[maker-devel] maker annotation with cufflinks output

Thu Jan 30 17:59:56 MST 2014

Hi All,

This is a problem I have been having for quite some time: MAKER predicts far fewer genes or proteins than there are in my RNA-seq transcript evidence. My genome is not repetitive and is at least 90% complete.

I tried setting est2genome to 1, but that still doesn't seem to increase the predicted gene set much. If I input ~13,000 genes (21,000 transcripts) as evidence, I get predictions of ~5,000 genes (6,000 transcripts).
I ran MAKER again with the transcripts that didn't have a gene model predicted in the first run, and this time MAKER predicted gene models for ~20-30% of those transcripts.

Is there anything that can be done to increase the predicted gene count?

Thanks,
Ranjani 
________________________________________
From: maker-devel <maker-devel-bounces at yandell-lab.org> on behalf of Carson Holt <carsonhh at gmail.com>
Sent: Thursday, January 30, 2014 4:14 PM
To: Daniel Ence; dhivya arasappan; maker-devel at yandell-lab.org
Subject: Re: [maker-devel] maker annotation with cufflinks output

What you get back from cufflinks should not necessarily be considered a
transcript count, and you should always expect the count given by
cufflinks to be high relative to assembly methods like trinity (especially
in plants).  This is because cufflinks is an alignment-based method, which
can be more sensitive but will also generate a lot of false positives:
repetitive elements, spurious alignments, and pseudogenes will all inflate
the count.  Fortunately, the false positives will mostly be single-exon
results and will be filtered out by MAKER. Also, your mRNA-seq data from
cufflinks will contribute hints that can generate genes in the absence
of an ab initio gene prediction, but if the gene finder doesn’t think the
hints make sense it will ignore them.  So a lot of cufflinks results that
don’t make sense with respect to ORF, etc., will fall into the category of
being ignored.
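
As a rough way to see how much of a cufflinks count might be single-exon
results, the sketch below tallies exon records per transcript in GTF
input. It assumes standard GTF columns with a transcript_id attribute;
the function names and the example records are made up for illustration,
and it says nothing about which transcripts MAKER will actually keep.

```python
import re
from collections import Counter

def exon_counts(gtf_lines):
    """Count 'exon' records per transcript_id in GTF-formatted lines."""
    counts = Counter()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon features are counted
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            counts[match.group(1)] += 1
    return counts

def summarize(counts):
    """Split transcripts into single-exon and multi-exon totals."""
    single = sum(1 for n in counts.values() if n == 1)
    return {"total": len(counts),
            "single_exon": single,
            "multi_exon": len(counts) - single}

# Hypothetical records, not real data.
example = [
    'chr1\tCufflinks\texon\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "G1.1";',
    'chr1\tCufflinks\texon\t600\t900\t.\t+\t.\tgene_id "G1"; transcript_id "G1.1";',
    'chr1\tCufflinks\texon\t2000\t2400\t.\t+\t.\tgene_id "G2"; transcript_id "G2.1";',
]
print(summarize(exon_counts(example)))
```

To use it on real output, pass an open file handle for the cufflinks GTF
(typically transcripts.gtf) instead of the example list.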

In addition, you should try running your genome assembly through CEGMA
(http://korflab.ucdavis.edu/datasets/cegma/) to estimate the expected
completeness of the genome. For example, if a genome is 70% complete,
then you should only expect to recover about 70% of the genes. I believe
CEGMA can also be run online from the iPlant Discovery Environment and
iPlant Atmosphere images.  Also, make sure you are including proteins with
your MAKER run, as not all genes will be expressed, so mRNA-seq will only
capture a portion of the genes, and that portion can be as low as 50%.
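
The arithmetic behind those two percentages can be sketched as below.
The gene count of 20,000 is purely hypothetical, the function name is
made up, and multiplying the two fractions assumes they are independent,
which is a simplification rather than anything MAKER or CEGMA reports.

```python
def expected_gene_ceiling(true_gene_count, genome_completeness,
                          expressed_fraction=1.0):
    """Rough upper bound on genes recoverable from an incomplete assembly.

    genome_completeness: fraction of the genome assembled (e.g. from CEGMA).
    expressed_fraction:  fraction of genes captured by mRNA-seq evidence.
    Assumes the two fractions are independent (a simplification).
    """
    for value in (genome_completeness, expressed_fraction):
        if not 0.0 <= value <= 1.0:
            raise ValueError("fractions must be between 0 and 1")
    return round(true_gene_count * genome_completeness * expressed_fraction)

# Hypothetical organism with 20,000 genes, assembly 70% complete.
with_proteins = expected_gene_ceiling(20000, 0.70)
# Same assembly, but relying only on mRNA-seq that captures half the genes.
rnaseq_only = expected_gene_ceiling(20000, 0.70, expressed_fraction=0.50)
print(with_proteins, rnaseq_only)
```

The gap between the two numbers is the reason for adding protein
evidence alongside the mRNA-seq data.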

Thanks,
Carson

On 1/30/14, 1:51 PM, "Daniel Ence" <dence at genetics.utah.edu> wrote:

>Hi Dhivya,
>
>I think there are a few numbers that could be helpful to understand what's
>happening here.
>
>How many transcripts did Trinity assemble the RNA-seq data into? Also,
>you had 29,000 transcripts from cufflinks, but fewer from MAKER when you
>gave it the cufflinks data. How many transcripts did MAKER identify with
>the cufflinks data? Did you still get more than the 10,000 transcripts
>that you found with just the Trinity data?
>
>A key part of MAKER's approach to genome annotation that might be
>affecting its performance is that it only annotates a gene where there
>is both evidence (like your RNA-seq data) and an ab initio prediction. If
>a prediction is unsupported by the evidence, then MAKER won't annotate a
>gene, and if evidence aligns where there's no prediction, MAKER won't
>annotate a gene either. What ab initio predictors are you using, and have
>they been trained for your specific genome?
>
>You can force MAKER to automatically promote evidence alignments to a
>gene model by setting the est2genome option to 1, but that will usually
>give you many false positives.
>
>Try rerunning it with either the Trinity data or the Cufflinks data and
>with est2genome set to 1, and let us know how that affects the MAKER
>results.
>
>Thanks,
>Daniel
>
>Daniel Ence
>Graduate Student
>Eccles Institute of Human Genetics
>University of Utah
>15 North 2030 East, Room 2100
>Salt Lake City, UT 84112-5330
>________________________________________
>From: maker-devel [maker-devel-bounces at yandell-lab.org] on behalf of
>dhivya arasappan [darasappan at gmail.com]
>Sent: Thursday, January 30, 2014 11:18 AM
>To: maker-devel at yandell-lab.org
>Subject: [maker-devel] maker annotation with cufflinks output
>
>Hello,
>
>I am trying to annotate a 200 Mb plant genome for which I have a very
>good assembly.
>
>I tried a de novo assembly of the RNA-seq data using trinity and ran maker
>using my genome assembly and the trinity results.  I did not get as
>many transcripts as expected, only around 10,000.
>
>So, I decided to try a different approach.  I did a genome-assisted
>assembly of the RNA-seq data using tophat/cufflinks. This pipeline
>generated 21,000 genes, 29,000 transcripts.  I then ran maker using my
>genome assembly and the cufflinks result.  I get far fewer
>transcripts as a result.
>
>If cufflinks found 29,000 transcripts by mapping to the genome, I'm
>confused as to why maker is not finding the same number.
>
>Any suggestions would be appreciated.
>
>Thanks
>Dhivya
>
>
>_______________________________________________
>maker-devel mailing list
>maker-devel at box290.bluehost.com
>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org