[maker-devel] maker annotation with cufflinks output

Wed Feb 5 22:16:43 MST 2014

Thank you both for those explanations. I'll get back to you after I  
try rerunning maker.

Dhivya

On Feb 5, 2014, at 2:38 PM, Carson Holt wrote:

> Protein data doesn’t have to be from that closely related a
> species.  This is because genes maintain homology at the amino acid
> level across even very large evolutionary distances.  Having a more
> closely related species just ensures that genome contents are similar
> (fewer losses/gains relative to each other). Also, use the entire
> proteome of at least one related species (just using a database like
> swiss-prot is not sufficient).
>
> Using translated mRNA-seq data will not give you any new information  
> that was not already available from the untranslated sequence.  Plus  
> it will introduce the complicating artifacts that mRNA-seq generates  
> into the protein part of the pipeline (gene merging, incorrect  
> assembly, and false calls caused by background transcription).  A  
> big gotcha with mRNA-seq is that all of your genome gets transcribed  
> at a low level, not just the genes, so you will always have  
> contamination that does not represent real gene models.  Also in the  
> end you really only expect to capture about 50% of the genes with  
> mRNA-seq (maybe 70% if you are fortunate - and most of those will be  
> partial). So using the proteins from another species is important
> to improve sensitivity and to fix many of the issues that arise from
> the noisy nature of mRNA-seq.  In fact, if you were forced to use
> only one (either protein evidence or mRNA-seq), you would actually get
> better annotations from the protein evidence in most cases. You get
> better annotations when you use both, but if using only one of them,
> the proteins from another species are better, and noisy mRNA-seq
> will be the primary source of annotation error.
>
> Thanks,
> Carson
>
>
> From: dhivya arasappan <darasappan at gmail.com>
> Date: Wednesday, February 5, 2014 at 1:13 PM
> To: Daniel Ence <dence at genetics.utah.edu>
> Cc: Carson Holt <carsonhh at gmail.com>, "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
> Subject: Re: [maker-devel] maker annotation with cufflinks output
>
> Hello Daniel and Carson,
>
> Thanks for your replies.
>
> Yes, I used the protein sequences resulting from annotation of the
> trinity assembly (using trinotate).  I'll try using protein
> sequences from related species (though there aren't sequences from
> closely related organisms).  Could you tell me a little about why protein
> data from annotating my RNA-seq data would not work best here?
>
> Thanks
> Dhivya
>
> On Feb 5, 2014, at 1:28 PM, Daniel Ence wrote:
>
>> Hi Dhivya, Are the protein matches in your results coming from your  
>> annotations of the transcriptome? You should really use amino-acid  
>> sequences from related organisms and some kind of omnibus source  
>> like SwissProt.
>>
>> Thanks,
>> Daniel
>>
>> Daniel Ence
>> Graduate Student
>> Eccles Institute of Human Genetics
>> University of Utah
>> 15 North 2030 East, Room 2100
>> Salt Lake City, UT 84112-5330
>> From: Carson Holt [carsonhh at gmail.com]
>> Sent: Wednesday, February 05, 2014 11:38 AM
>> To: dhivya arasappan; Daniel Ence
>> Cc: maker-devel at yandell-lab.org
>> Subject: Re: [maker-devel] maker annotation with cufflinks output
>>
>> Do you have any features of type snap in your results from step 3?   
>> We’ve had a couple of recent posts where after training snap was  
>> giving no results, and as a result maker couldn’t give any genes.   
>> One cause of something like that may be your step 2.  Make sure the
>> ZFF you used to train with wasn’t empty.  The maker2zff script uses
>> filters to only put the best genes in the ZFF file, and if all your
>> genes fail the filtering then you are training with an empty ZFF.
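One way to check the ZFF before training is to count what it contains. The following is a minimal Python sketch, not part of MAKER or SNAP. It assumes the usual ZFF layout, in which each sequence starts with a ">" header line and each feature line ends with the name of the gene model it belongs to. The function name summarize_zff and the file name in the usage comment are illustrative assumptions.

```python
def summarize_zff(lines):
    """Count sequences, feature lines, and distinct gene models in ZFF text.

    Assumption: header lines start with '>' and the last whitespace-separated
    column of every feature line is the gene model (group) name.
    """
    sequences = 0
    features = 0
    models = set()
    current_seq = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            sequences += 1
            header = line[1:].split()
            current_seq = header[0] if header else ""
            continue
        fields = line.split()
        features += 1
        # Key models by sequence so identical names on different
        # sequences are not merged.
        models.add((current_seq, fields[-1]))
    return {"sequences": sequences, "features": features, "models": len(models)}


# Usage (the file name is an assumption; use whatever maker2zff wrote):
# with open("genome.ann") as handle:
#     print(summarize_zff(handle))
```

If "models" comes back as 0, every gene failed the maker2zff filters and SNAP was trained on nothing.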
>>
>> Also you should use proteins from a related species as your protein
>> file.  I see that your protein matches are varying wildly from run
>> to run, and so is your contig count.  Was the subset of contigs you
>> have results for long enough to contain genes?
>>
>> —Carson
>>
>> From: dhivya arasappan <darasappan at gmail.com>
>> Date: Monday, February 3, 2014 at 9:31 AM
>> To: Daniel Ence <dence at genetics.utah.edu>
>> Cc: "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
>> Subject: Re: [maker-devel] maker annotation with cufflinks output
>>
>> Hi Daniel,
>>
>> I was able to check on some of those questions.
>>
>> 1. From trinity assembly: I started with 102000 contigs. I used  
>> trinotate to annotate proteins in this.
>>
>> I ran maker on this data with est2genome set to 1. The output looks  
>> like this (most important parts on top):
>>
>>     6653 gene
>>    46675 exon
>>   280534 protein_match
>>    59934 CDS
>>      969 contig
>>   105388 expressed_sequence_match
>>    12584 five_prime_UTR
>>    78565 match
>>  1401369 match_part
>>    10180 mRNA
>>    11545 three_prime_UTR
>>
>> 2. From cufflinks assembly: I started with 133380 entries (out of  
>> which there are 29,000 transcripts).  I used the protein sequences  
>> from trinity assembly.
>>
>> I ran maker on this data with est2genome set to 1. The output looks  
>> like this:
>>       29 gene
>>       75 exon
>>   573659 protein_match
>>       67 CDS
>>     1099 contig
>>   269298 expressed_sequence_match
>>       23 five_prime_UTR
>>   173844 match
>>  2221846 match_part
>>       29 mRNA
>>       23 three_prime_UTR
>>
>> The number of genes annotated using the trinity assembly is lower than
>> expected, so I went the cufflinks route. I don't understand why, when
>> using the cufflinks transcripts, even fewer genes are being found.
>>
>> 3. Training SNAP:  I used the results of maker from 1 to train  
>> SNAP.  I then used that training set to rerun maker:
>> snaphmm=/scratch/01184/daras/jansen/RHA/allpaths/maker_mpi_withAlltrinity/snap/RHA.hmm
>> est2genome=0
>>
>> And again I got results with no entries for gene, exon, CDS etc.
>>      957 contig
>>    46555 expressed_sequence_match
>>    43651 match
>>   553633 match_part
>>   113738 protein_match
>>
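Feature tallies like the three above can be reproduced from a GFF3 file by counting column 3 (feature type). Counting column 2 (source) as well shows whether a predictor such as SNAP contributed any lines at all. This is a minimal Python sketch under those assumptions; the function name tally_gff3 is illustrative, and the exact source label a predictor uses in column 2 is not confirmed here, so the check simply looks for "snap" in the source name.

```python
from collections import Counter


def tally_gff3(lines):
    """Return (counts by feature type, counts by source) for GFF3 text."""
    by_type = Counter()
    by_source = Counter()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if line.startswith("#") or not line.strip():
            continue  # directives, comments, blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a well-formed 9-column feature line
        by_source[cols[1]] += 1
        by_type[cols[2]] += 1
    return by_type, by_source


def has_source_like(by_source, fragment):
    """True if any source name contains the given fragment (case-insensitive)."""
    return any(fragment.lower() in name.lower() for name in by_source)


# Usage (the file name is a placeholder):
# with open("merged.gff") as handle:
#     by_type, by_source = tally_gff3(handle)
# for feature, count in by_type.most_common():
#     print(f"{count:8d} {feature}")
# print("any snap lines:", has_source_like(by_source, "snap"))
```

A tally with no gene, exon, or CDS rows together with no SNAP-like source would point at the predictor producing nothing, rather than at the evidence alignments.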
>> As I mentioned in another email, cegma results indicated that the  
>> genome was more than 90% complete. Any suggestions would be helpful.
>>
>> Thank you
>> Dhivya
>>
>>
>>
>>
>> On Jan 30, 2014, at 2:51 PM, Daniel Ence wrote:
>>
>>> Hi Dhivya,
>>>
>>> I think there are a few numbers that could be helpful to understand
>>> what's happening here.
>>>
>>> How many transcripts did Trinity assemble the RNA-seq data into?
>>> Also, you had 29,000 transcripts from cufflinks, but fewer from  
>>> MAKER when you gave it the cufflinks data. How many transcripts  
>>> did MAKER identify with the cufflinks data? Did you still get more  
>>> than the 10,000 transcripts that you found with just the Trinity  
>>> data?
>>>
>>> A key part of MAKER's approach to genome annotation that might be
>>> affecting its performance is that it only annotates a gene where
>>> there is both evidence (like your RNA-seq data) and an ab-initio
>>> prediction. If a prediction is unsupported by the evidence, then
>>> MAKER won't annotate a gene, and if evidence aligns where there's
>>> no prediction, MAKER won't annotate a gene either. What ab-initio
>>> predictors are you using, and have they been trained for your specific genome?
>>>
>>> You can force MAKER to automatically promote evidence alignments  
>>> to a gene model by setting the est2genome option to 1, but that  
>>> will usually give you many false positives.
>>>
>>> Try rerunning it with either the Trinity data or the Cufflinks  
>>> data and with est2genome set to 1, and let us know how that  
>>> affects the MAKER results.
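Before rerunning, it can help to confirm what the control file actually sets. The sketch below is a rough Python check, not a MAKER tool. It assumes the control file uses key=value lines with "#" comments, and it encodes only the rule described above: gene models need either an ab initio predictor or evidence promotion. The option names est2genome, protein2genome, snaphmm, gmhmm, augustus_species, and protein are taken to be keys in the control file; the function names are illustrative.

```python
PREDICTOR_KEYS = ("snaphmm", "gmhmm", "augustus_species")


def parse_maker_ctl(lines):
    """Parse key=value lines, dropping '#' comments and section headers."""
    opts = {}
    for line in lines:
        body = line.split("#", 1)[0].strip()
        if "=" not in body:
            continue
        key, value = body.split("=", 1)
        opts[key.strip()] = value.strip()
    return opts


def check_options(opts):
    """Return a list of warnings about settings that tend to yield no genes."""
    warnings = []
    promote = opts.get("est2genome") == "1" or opts.get("protein2genome") == "1"
    has_predictor = any(opts.get(key) for key in PREDICTOR_KEYS)
    if not promote and not has_predictor:
        warnings.append(
            "no ab initio predictor and no evidence promotion: expect no gene models"
        )
    if not opts.get("protein"):
        warnings.append("no protein evidence file set")
    return warnings


# Usage (the file name is an assumption about a typical setup):
# with open("maker_opts.ctl") as handle:
#     for warning in check_options(parse_maker_ctl(handle)):
#         print("WARNING:", warning)
```

An empty warning list does not mean the run will succeed; it only rules out the two configuration gaps checked here.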
>>>
>>> Thanks,
>>> Daniel
>>>
>>> Daniel Ence
>>> Graduate Student
>>> Eccles Institute of Human Genetics
>>> University of Utah
>>> 15 North 2030 East, Room 2100
>>> Salt Lake City, UT 84112-5330
>>> ________________________________________
>>> From: maker-devel [maker-devel-bounces at yandell-lab.org] on behalf  
>>> of dhivya arasappan [darasappan at gmail.com]
>>> Sent: Thursday, January 30, 2014 11:18 AM
>>> To: maker-devel at yandell-lab.org
>>> Subject: [maker-devel] maker annotation with cufflinks output
>>>
>>> Hello,
>>>
>>> I am trying to annotate a 200 Mb plant genome for which I have a very
>>> good assembly.
>>>
>>> I tried to de novo assemble RNA-seq data using trinity and ran maker
>>> using my genome assembly and the trinity results.  I did not get as
>>> many transcripts as expected, around 10,000 transcripts.
>>>
>>> So, I decided to try a different approach.  I did a genome assisted
>>> assembly of the RNA-seq data using tophat/cufflinks. This pipeline
>>> generated 21,000 genes, 29,000 transcripts.  I then ran maker using my
>>> genome assembly and the cufflinks result.  I get far fewer
>>> transcripts as a result.
>>>
>>> If cufflinks found 29000 transcripts by mapping to the genome, I'm
>>> confused as to why maker is not finding the same.
>>>
>>> Any suggestions would be appreciated.
>>>
>>> Thanks
>>> Dhivya
>>>
>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>
>
