[maker-devel] maker annotation with cufflinks output
dhivya arasappan
darasappan at gmail.com
Wed Feb 5 22:16:43 MST 2014
Thank you both for those explanations. I'll get back to you after I
try rerunning maker.
Dhivya
On Feb 5, 2014, at 2:38 PM, Carson Holt wrote:
> Protein data doesn’t have to come from a very closely related
> species. This is because genes maintain homology at the amino acid
> level across even very large evolutionary distances. Having a
> more closely related species just ensures that genome contents are
> similar (fewer losses/gains relative to each other). Also, use the
> entire proteome of at least one related species (just using a
> database like Swiss-Prot is not sufficient).
>
> Using translated mRNA-seq data will not give you any new information
> that was not already available from the untranslated sequence. Plus
> it will introduce the complicating artifacts that mRNA-seq generates
> into the protein part of the pipeline (gene merging, incorrect
> assembly, and false calls caused by background transcription). A
> big gotcha with mRNA-seq is that all of your genome gets transcribed
> at a low level, not just the genes, so you will always have
> contamination that does not represent real gene models. Also in the
> end you really only expect to capture about 50% of the genes with
> mRNA-seq (maybe 70% if you are fortunate - and most of those will be
> partial). So using the proteins from another species is important
> to improve sensitivity and to fix many of the issues that arise from
> the noisy nature of mRNA-seq. In fact, if you were forced to use
> only one (either protein evidence or mRNA-seq), you would actually get
> better annotations from the protein evidence in most cases. You get
> the best annotations when you use both, but if using only one of them,
> the proteins from another species are better, and noisy mRNA-seq
> will be the primary source of annotation error.
>
> Thanks,
> Carson
>
>
> From: dhivya arasappan <darasappan at gmail.com>
> Date: Wednesday, February 5, 2014 at 1:13 PM
> To: Daniel Ence <dence at genetics.utah.edu>
> Cc: Carson Holt <carsonhh at gmail.com>, "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
> Subject: Re: [maker-devel] maker annotation with cufflinks output
>
> Hello Daniel and Carson,
>
> Thanks for your replies.
>
> Yes, I used the protein sequences resulting from annotation of
> the Trinity assembly (using Trinotate). I'll try using protein
> sequences from related species (though there aren't sequences from
> closely related organisms). Could you tell me a little about why
> protein data from annotating my RNA-seq data would not work best here?
>
> Thanks
> Dhivya
>
> On Feb 5, 2014, at 1:28 PM, Daniel Ence wrote:
>
>> Hi Dhivya, Are the protein matches in your results coming from your
>> annotations of the transcriptome? You should really use amino-acid
>> sequences from related organisms and some kind of omnibus source
>> like SwissProt.
>>
>> Thanks,
>> Daniel
>>
>> Daniel Ence
>> Graduate Student
>> Eccles Institute of Human Genetics
>> University of Utah
>> 15 North 2030 East, Room 2100
>> Salt Lake City, UT 84112-5330
>> From: Carson Holt [carsonhh at gmail.com]
>> Sent: Wednesday, February 05, 2014 11:38 AM
>> To: dhivya arasappan; Daniel Ence
>> Cc: maker-devel at yandell-lab.org
>> Subject: Re: [maker-devel] maker annotation with cufflinks output
>>
>> Do you have any features of type snap in your results from step 3?
>> We’ve had a couple of recent posts where, after training, SNAP was
>> giving no results, and as a result MAKER couldn’t produce any genes.
>> One cause of something like that may be your step 2. Make sure the
>> ZFF file you used for training wasn’t empty. The maker2zff script uses
>> filters to put only the best genes in the ZFF file, and if all your
>> genes fail the filtering then you are training with an empty ZFF.
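One way to run the empty-ZFF check Carson describes is to count the gene models in the annotation file that maker2zff wrote. The sketch below is only an illustration: it assumes the common ZFF layout in which each sequence starts with a `>name` header and each feature line ends with the name of the model it belongs to. The function name `count_zff_models` and the file name `genome.ann` in the usage note are assumptions, not part of MAKER or SNAP.

```python
import sys


def count_zff_models(lines):
    """Return (n_sequences, n_features, n_models) for ZFF-formatted lines.

    Assumption: header lines start with '>' and every other non-empty line
    is a feature whose last whitespace-separated column names its model.
    """
    n_sequences = 0
    n_features = 0
    models = set()
    current_seq = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            n_sequences += 1
            current_seq = line[1:].strip()
            continue
        n_features += 1
        # Key on (sequence, model name) so equal names on different
        # sequences are counted separately.
        models.add((current_seq, line.split()[-1]))
    return n_sequences, n_features, len(models)


# Only touches the filesystem when a path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        seqs, feats, n_models = count_zff_models(handle)
    print(f"{seqs} sequences, {feats} features, {n_models} gene models")
    if n_models == 0:
        print("No gene models found: SNAP would be trained on nothing.")
```

Run as, for example, `python count_zff.py genome.ann`. A model count of zero would match the failure mode described above, where every gene failed the maker2zff filters.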
>>
>> Also, you should use proteins from a related species as your protein
>> file. I see that your protein matches are varying wildly from run
>> to run, and so is your contig count. Was the subset of contigs you
>> have results for long enough to contain genes?
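To answer the contig-length question, a quick length summary of the genome FASTA is usually enough. The following is a minimal sketch, not something from the MAKER distribution. The function names are hypothetical, and the 10,000 bp default for `min_len` is an arbitrary illustrative threshold rather than a MAKER requirement.

```python
def fasta_lengths(lines):
    """Map each FASTA record name to its sequence length."""
    lengths = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record name.
            name = (line[1:].split() or [""])[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def summarize(lengths, min_len=10000):
    """Return count, total bases, N50, and number of contigs >= min_len.

    min_len is an illustrative cutoff (an assumption), chosen only to flag
    contigs that are probably too short to hold a complete gene.
    """
    values = sorted(lengths.values(), reverse=True)
    total = sum(values)
    running = 0
    n50 = 0
    for value in values:
        running += value
        if running * 2 >= total:
            n50 = value
            break
    return {
        "count": len(values),
        "total": total,
        "n50": n50,
        "at_least_min_len": sum(1 for v in values if v >= min_len),
    }
```

Calling `summarize(fasta_lengths(open("genome.fasta")))` (file name assumed) on the full assembly and again on only the contigs that appear in the MAKER output would show whether the annotated subset is dominated by short contigs.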
>>
>> —Carson
>>
>> From: dhivya arasappan <darasappan at gmail.com>
>> Date: Monday, February 3, 2014 at 9:31 AM
>> To: Daniel Ence <dence at genetics.utah.edu>
>> Cc: "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
>> Subject: Re: [maker-devel] maker annotation with cufflinks output
>>
>> Hi Daniel,
>>
>> I was able to check on some of those questions.
>>
>> 1. From trinity assembly: I started with 102000 contigs. I used
>> trinotate to annotate proteins in this.
>>
>> I ran maker on this data with est2genome set to 1. The output looks
>> like this (most important parts on top):
>>
>> 6653 gene
>> 46675 exon
>> 280534 protein_match
>> 59934 CDS
>> 969 contig
>> 105388 expressed_sequence_match
>> 12584 five_prime_UTR
>> 78565 match
>> 1401369 match_part
>> 10180 mRNA
>> 11545 three_prime_UTR
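Per-type counts like the ones listed above can be reproduced by tallying the third column of a GFF3 file. This is a hedged sketch of one way to do it; it is not necessarily how the numbers in this message were generated. The function name `count_feature_types` is hypothetical, and the code assumes standard tab-separated, nine-column GFF3 with an optional trailing `##FASTA` section.

```python
from collections import Counter


def count_feature_types(lines):
    """Count GFF3 features by type (column 3)."""
    counts = Counter()
    for raw in lines:
        line = raw.rstrip("\n")
        # Stop at the embedded sequence section, if there is one.
        if line.startswith("##FASTA"):
            break
        # Skip blank lines, directives, and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a well-formed feature line
        counts[fields[2]] += 1
    return counts
```

For example, `count_feature_types(open("genome.all.gff"))` (file name assumed) returns a `Counter` whose `most_common()` output can be compared between the Trinity-based and Cufflinks-based runs.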
>>
>> 2. From cufflinks assembly: I started with 133380 entries (out of
>> which there are 29,000 transcripts). I used the protein sequences
>> from trinity assembly.
>>
>> I ran maker on this data with est2genome set to 1. The output looks
>> like this:
>> 29 gene
>> 75 exon
>> 573659 protein_match
>> 67 CDS
>> 1099 contig
>> 269298 expressed_sequence_match
>> 23 five_prime_UTR
>> 173844 match
>> 2221846 match_part
>> 29 mRNA
>> 23 three_prime_UTR
>>
>> The number of genes annotated using the Trinity assembly is lower than
>> expected, so I went the Cufflinks route. I don't understand why, when
>> using the Cufflinks transcripts, even fewer genes are being found.
>>
>> 3. Training SNAP: I used the results of maker from 1 to train
>> SNAP. I then used that training set to rerun maker:
>> snaphmm=/scratch/01184/daras/jansen/RHA/allpaths/maker_mpi_withAlltrinity/snap/RHA.hmm
>> est2genome=0
>>
>> And again I got results with no entries for gene, exon, CDS etc.
>> 957 contig
>> 46555 expressed_sequence_match
>> 43651 match
>> 553633 match_part
>> 113738 protein_match
>>
>> As I mentioned in another email, cegma results indicated that the
>> genome was more than 90% complete. Any suggestions would be helpful.
>>
>> Thank you
>> Dhivya
>>
>>
>>
>>
>> On Jan 30, 2014, at 2:51 PM, Daniel Ence wrote:
>>
>>> Hi Dhivya,
>>>
>>> I think there are a few numbers that could be helpful to understand
>>> what's happening here.
>>>
>>> How many transcripts did Trinity assemble the RNA-seq data into?
>>> Also, you had 29,000 transcripts from cufflinks, but fewer from
>>> MAKER when you gave it the cufflinks data. How many transcripts
>>> did MAKER identify with the cufflinks data? Did you still get more
>>> than the 10,000 transcripts that you found with just the Trinity
>>> data?
>>>
>>> A key part of MAKER's approach to genome annotation that might be
>>> affecting its performance is that it only annotates a gene where
>>> there is both evidence (like your RNA-seq data) and an ab initio
>>> prediction. If a prediction is unsupported by the evidence, then
>>> MAKER won't annotate a gene, and if evidence aligns where there's
>>> no prediction, MAKER won't annotate a gene either. What ab initio
>>> predictors are you using, and have they been trained for your
>>> specific genome?
>>>
>>> You can force MAKER to automatically promote evidence alignments
>>> to a gene model by setting the est2genome option to 1, but that
>>> will usually give you many false positives.
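The rule Daniel describes can be pictured with a toy interval model. The sketch below is a deliberately simplified illustration of that description and not MAKER's actual implementation. The function names, the `(start, end)` tuple representation, and the `promote_evidence` flag (standing in loosely for est2genome=1) are all assumptions made for the example.

```python
def overlaps(a, b):
    """True if closed intervals a=(start, end) and b=(start, end) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]


def toy_annotate(predictions, evidence, promote_evidence=False):
    """Toy version of the rule: keep predictions that evidence supports.

    With promote_evidence=True (a loose stand-in for est2genome=1),
    evidence alignments that no prediction covers are also reported.
    """
    genes = [p for p in predictions if any(overlaps(p, e) for e in evidence)]
    if promote_evidence:
        genes += [
            e for e in evidence
            if not any(overlaps(e, p) for p in predictions)
        ]
    return sorted(genes)
```

In this toy model, an untrained or missing ab initio predictor corresponds to an empty `predictions` list, so nothing is annotated unless `promote_evidence` is switched on, whatever the amount of aligned evidence.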
>>>
>>> Try rerunning it with either the Trinity data or the Cufflinks
>>> data and with est2genome set to 1, and let us know how that
>>> affects the MAKER results.
>>>
>>> Thanks,
>>> Daniel
>>>
>>> Daniel Ence
>>> Graduate Student
>>> Eccles Institute of Human Genetics
>>> University of Utah
>>> 15 North 2030 East, Room 2100
>>> Salt Lake City, UT 84112-5330
>>> ________________________________________
>>> From: maker-devel [maker-devel-bounces at yandell-lab.org] on behalf
>>> of dhivya arasappan [darasappan at gmail.com]
>>> Sent: Thursday, January 30, 2014 11:18 AM
>>> To: maker-devel at yandell-lab.org
>>> Subject: [maker-devel] maker annotation with cufflinks output
>>>
>>> Hello,
>>>
>>> I am trying to annotate a 200 Mb plant genome for which I have a
>>> very good assembly.
>>>
>>> I tried to de novo assemble the RNA-seq data using Trinity and ran
>>> MAKER using my genome assembly and the Trinity results. I did not get
>>> as many transcripts as expected, only around 10,000.
>>>
>>> So I decided to try a different approach. I did a genome-assisted
>>> assembly of the RNA-seq data using TopHat/Cufflinks. This pipeline
>>> generated 21,000 genes and 29,000 transcripts. I then ran MAKER using
>>> my genome assembly and the Cufflinks result, and I got far fewer
>>> transcripts as a result.
>>>
>>> If cufflinks found 29000 transcripts by mapping to the genome, I'm
>>> confused as to why maker is not finding the same.
>>>
>>> Any suggestions would be appreciated.
>>>
>>> Thanks
>>> Dhivya
>>>
>>>
>>> _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>
>