[maker-devel] maker annotation with cufflinks output

Fri Jan 31 01:39:26 MST 2014

Many thanks, Carson - for this fabulous post describing general
principles. You've hinted at some of these tips in other posts, but
it's great to have them in one place.

Thanks especially for the kinds of things that make a big difference
(point 3 below). More such tips always welcome!

Best wishes, and thanks for an amazing piece of kit.

- Sujai

On 31 January 2014 07:20, Carson Holt <carsonhh at gmail.com> wrote:
> So just a few suggestions. If you are getting fewer genes than you expect,
> that is usually an indication that the evidence provided is insufficient,
> the gene predictors need to be retrained, the repeat masking is
> insufficient, or the assembly has problems.
>
> Here is more explanation on each point:
>
> 1. In addition to any mRNA/EST data, you should provide full proteomes
> from a minimum of two species as closely related as possible, and perhaps
> a comprehensive database such as UniProt/Swiss-Prot.  Note that, based on
> experience, the comprehensive database cannot substitute for a related
> species proteome; it can complement it, but not replace it.  So
> you need to supply full proteomes from something. mRNA/EST data is not
> sufficient by itself, so make sure you have enough protein evidence.
>
> 2. All models are ultimately generated by the predictors (maker doesn't
> generate these), so care should be taken to train the predictors as best
> as possible. Also train at least two predictors (SNAP and Augustus are
> recommended).  If they are both well trained, then they will be in general
> concordance with one another. If they are not well trained, then each
> program will produce very different models.  So visually inspecting their
> concordance can give you an idea of whether they need to be retrained.
>
> 3. More often than not, poor predictor performance is actually the result
> of repeat related complications.  Many genomes that at first may seem
> repeat poor may actually contain novel repeats that can affect the
> performance of the gene predictors. If you are getting fewer genes than
> you expect or ab initio models are not in concordance from two independent
> predictors, run something like RepeatScout to generate species specific
> libraries.  This may seem minor, but I have seen predictions go from
> apparently random to textbook perfect just by producing a species specific
> library of novel repeats.
>
> 4. You can't have gene models if you don't have open reading frames to
> translate through.  Also gene predictors need sequence upstream and
> downstream of genes to work correctly, so if contigs are too short they
> won't be useful for prediction even if the sum of the contigs is large
> enough to encompass the whole genome.  In general any contig smaller than
> 10kb is not annotatable, so you should aim for as high an N50 value as
> possible.
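
A minimal sketch of the check described in point 4: it computes the N50 of
an assembly and the fraction of bases in contigs of at least 10 kb. The
contig lengths and the function names are hypothetical, chosen only for
illustration; in practice the lengths would come from your own assembly.

```python
def n50(lengths):
    """Return the N50: the contig length L such that contigs of length
    >= L together contain at least half of all assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def fraction_in_long_contigs(lengths, min_len=10_000):
    """Fraction of assembled bases that sit in contigs of at least min_len bp."""
    total = sum(lengths)
    if total == 0:
        return 0.0
    return sum(length for length in lengths if length >= min_len) / total


# Hypothetical contig lengths in bp (not from any real assembly).
contig_lengths = [50_000, 30_000, 12_000, 8_000, 5_000, 2_000]

print("N50:", n50(contig_lengths))
print("Fraction of bases in contigs >= 10 kb:",
      round(fraction_in_long_contigs(contig_lengths), 3))
```

A low fraction here would suggest that much of the assembly falls below the
10 kb rule of thumb given above, whatever the total assembly size is.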
>
>
> Annotating a new genome is sort of like a moving target.  No two organisms
> are alike, so you usually have to identify what deficiencies exist
> based on preliminary runs and then correct for them in subsequent runs.
>
> Thanks,
> Carson
>
>
>
> On 1/30/14, 5:59 PM, "Sivaranjani Namasivayam" <ranjani at uga.edu> wrote:
>
>>Hi All,
>>
>>This is a problem I have been having for quite some time: maker predicts
>>far fewer genes or proteins than are in my evidence RNA-seq
>>transcripts. My genome is not repetitive and is at least 90% complete.
>>
>>I tried setting est2genome to 1, but that still doesn't seem to increase
>>the predicted gene set much. If I input ~13000 genes (21000
>>transcripts) as evidence I get predictions of ~5000 genes (6000
>>transcripts).
>>I ran MAKER again with the transcripts that didn't have a gene model
>>predicted in the first run, and this time MAKER predicted gene models for
>>~20-30% of those transcripts.
>>
>>Is there anything that can be done to increase the predicted gene count?
>>
>>Thanks,
>>Ranjani
>>________________________________________
>>From: maker-devel <maker-devel-bounces at yandell-lab.org> on behalf of
>>Carson Holt <carsonhh at gmail.com>
>>Sent: Thursday, January 30, 2014 4:14 PM
>>To: Daniel Ence; dhivya arasappan; maker-devel at yandell-lab.org
>>Subject: Re: [maker-devel] maker annotation with cufflinks output
>>
>>What you get back from cufflinks should not necessarily be considered a
>>transcript count, and you should always expect the count given by
>>cufflinks to be high relative to assembly methods like trinity (especially
>>in plants).  This is because repetitive elements, spurious alignments, and
>>pseudogenes will all inflate the count because it is an alignment based
>>method which can be more sensitive but will also generate a lot of false
>>positives.  Fortunately the false positives will mostly be single exon
>>results and will be filtered out by maker. Also your mRNA-seq data from
>>cufflinks will contribute to hints that can generate genes in the absence
>>of an ab-initio gene prediction, but if the gene finder doesn't think the
>>hints make sense it will ignore them.  So a lot of cufflinks results that
>>don't make sense with respect to ORF etc., will fall into the category of
>>being ignored.
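
To get a feel for how much of a cufflinks result is single exon, the exons
per transcript can be counted directly from the GTF. This is only a rough
sketch: it assumes the usual cufflinks GTF layout (tab-separated columns,
"exon" features, a transcript_id attribute), the example records are made
up, and it does not reproduce MAKER's own filtering.

```python
import re
from collections import Counter

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def exon_counts(gtf_lines):
    """Count 'exon' features per transcript_id in GTF-formatted lines."""
    counts = Counter()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records are counted
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            counts[match.group(1)] += 1
    return counts


def split_by_exon_number(counts):
    """Return (single_exon_ids, multi_exon_ids), each sorted by id."""
    single = sorted(t for t, n in counts.items() if n == 1)
    multi = sorted(t for t, n in counts.items() if n > 1)
    return single, multi


# Made-up records for illustration; real input would be read from a file.
example_gtf = [
    'chr1\tCufflinks\ttranscript\t100\t900\t.\t+\t.\tgene_id "CUFF.1"; transcript_id "CUFF.1.1";',
    'chr1\tCufflinks\texon\t100\t300\t.\t+\t.\tgene_id "CUFF.1"; transcript_id "CUFF.1.1"; exon_number "1";',
    'chr1\tCufflinks\texon\t600\t900\t.\t+\t.\tgene_id "CUFF.1"; transcript_id "CUFF.1.1"; exon_number "2";',
    'chr1\tCufflinks\texon\t2000\t2400\t.\t-\t.\tgene_id "CUFF.2"; transcript_id "CUFF.2.1"; exon_number "1";',
]

single, multi = split_by_exon_number(exon_counts(example_gtf))
print("single-exon transcripts:", len(single))
print("multi-exon transcripts:", len(multi))
```

Comparing the multi-exon count, rather than the raw cufflinks total, with
the number of MAKER models is likely to be the fairer comparison.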
>>
>>In addition, you should try running your pipeline through CEGMA
>>(http://korflab.ucdavis.edu/datasets/cegma/) to identify the expected
>>completeness of the genome. For example, if a genome is 70% complete,
>>then you only expect to recover 70% of the genes. I believe CEGMA can also
>>be run online from the iPlant discovery environment and iPlant atmosphere
>>images.  Also make sure you are including proteins with your MAKER run,
>>as not all genes will be expressed, so mRNAseq will only capture a portion
>>of the genes and that portion can be as low as 50%.
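
The arithmetic above can be sketched as follows. The 70% and 50% figures
are the ones quoted in this message; the total of 20,000 genes is a
hypothetical number, and multiplying the two fractions together is a
simplifying assumption (it treats assembly gaps and expression as
independent), so read the result as a rough expectation only.

```python
def expected_recoverable(total_genes, genome_completeness, expressed_fraction=1.0):
    """Rough expected number of genes that can be recovered.

    genome_completeness: fraction of the gene space present in the assembly
                         (for example, a CEGMA completeness estimate).
    expressed_fraction:  fraction of genes captured by the mRNA-seq data;
                         leave at 1.0 when protein evidence is also supplied.
    """
    for name, value in (("genome_completeness", genome_completeness),
                        ("expressed_fraction", expressed_fraction)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return total_genes * genome_completeness * expressed_fraction


# Hypothetical true gene count; 0.70 and 0.50 are the figures quoted above.
true_genes = 20_000
with_all_evidence = expected_recoverable(true_genes, 0.70)
mrna_seq_only = expected_recoverable(true_genes, 0.70, 0.50)

print("70% complete genome:", round(with_all_evidence))
print("70% complete, mRNA-seq capturing 50%:", round(mrna_seq_only))
```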
>>
>>Thanks,
>>Carson
>>
>>
>>On 1/30/14, 1:51 PM, "Daniel Ence" <dence at genetics.utah.edu> wrote:
>>
>>>Hi Dhivya,
>>>
>>>I think there are a few numbers that could be helpful to understand what's
>>>happening here.
>>>
>>>How many transcripts did Trinity assemble the RNA-seq data into? Also,
>>>you had 29,000 transcripts from cufflinks, but fewer from MAKER when you
>>>gave it the cufflinks data. How many transcripts did MAKER identify with
>>>the cufflinks data? Did you still get more than the 10,000 transcripts
>>>that you found with just the Trinity data?
>>>
>>>A key part of MAKER's approach to genome annotation that might be
>>>affecting its performance is that it only annotates a gene where there
>>>is both evidence (like your RNA-seq data) and an ab-initio prediction. If
>>>a prediction is unsupported by the evidence, then MAKER won't annotate a
>>>gene, and if evidence aligns where there's no prediction, MAKER won't
>>>annotate a gene either. What ab-initio predictors are you using, and have
>>>they been trained for your specific genome?
>>>
>>>You can force MAKER to automatically promote evidence alignments to a
>>>gene model by setting the est2genome option to 1, but that will usually
>>>give you many false positives.
>>>
>>>Try rerunning it with either the Trinity data or the Cufflinks data and
>>>with est2genome set to 1, and let us know how that affects the MAKER
>>>results.
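
For rerunning with est2genome switched on, the option can be edited by hand
in the control file, or with a small helper like the sketch below. It
assumes control-file lines of the form "key=value #comment"; the two-line
sample and the file name transcripts.fasta are placeholders, not a real
MAKER control file, and set_ctl_option is a hypothetical helper.

```python
import re


def set_ctl_option(ctl_text, key, value):
    """Return ctl_text with the first 'key=...' line set to value.

    Any trailing '#comment' on that line is preserved. Raises KeyError
    if the option is not present.
    """
    pattern = re.compile(
        rf"^({re.escape(key)}=)([^#\n]*?)([ \t]*(?:#.*)?)$", re.MULTILINE
    )
    new_text, replaced = pattern.subn(
        lambda m: f"{m.group(1)}{value}{m.group(3)}", ctl_text, count=1
    )
    if replaced == 0:
        raise KeyError(f"option {key!r} not found in control file text")
    return new_text


# Placeholder control-file text; the comments only approximate real ones.
ctl_text = (
    "est=transcripts.fasta #assembled mRNA-seq transcripts in fasta format\n"
    "est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no\n"
)

updated = set_ctl_option(ctl_text, "est2genome", 1)
print(updated)
```

Keeping the edited text as a separate copy makes it easy to compare the run
with est2genome=1 against the original run, as suggested above.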
>>>
>>>Thanks,
>>>Daniel
>>>
>>>Daniel Ence
>>>Graduate Student
>>>Eccles Institute of Human Genetics
>>>University of Utah
>>>15 North 2030 East, Room 2100
>>>Salt Lake City, UT 84112-5330
>>>________________________________________
>>>From: maker-devel [maker-devel-bounces at yandell-lab.org] on behalf of
>>>dhivya arasappan [darasappan at gmail.com]
>>>Sent: Thursday, January 30, 2014 11:18 AM
>>>To: maker-devel at yandell-lab.org
>>>Subject: [maker-devel] maker annotation with cufflinks output
>>>
>>>Hello,
>>>
>>>I am trying to annotate a 200 mb plant genome for which I have a very
>>>good assembly.
>>>
>>>I tried to de novo assemble RNA-seq data using trinity and ran maker
>>>using my genome assembly and the trinity results.  I did not get as
>>>many transcripts as expected, around 10,000 transcripts.
>>>
>>>So, I decided to try a different approach.  I did a genome assisted
>>>assembly of the RNA-seq data using tophat/cufflinks. This pipeline
>>>generated 21,000 genes, 29,000 transcripts.  I then ran maker using my
>>>genome assembly and the cufflinks result.  I get far fewer
>>>transcripts as a result.
>>>
>>>If cufflinks found 29000 transcripts by mapping to the genome, I'm
>>>confused as to why maker is not finding the same.
>>>
>>>Any suggestions would be appreciated.
>>>
>>>Thanks
>>>Dhivya
>>>
>>>
>>>_______________________________________________
>>>maker-devel mailing list
>>>maker-devel at box290.bluehost.com
>>>http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org