[maker-devel] maker annotation with cufflinks output

Tue Feb 11 11:48:23 MST 2014

With your suggested changes (using a protein file not derived from the  
RNA-seq data and fixing the gff file for training SNAP), I was able to  
increase the number of genes from 6000+ to 18116.

I'm now trying to evaluate the quality of the annotation.  I have a  
question about the usage for mpi_evaluator.

In the maker tutorial,  the usage is given as:

  mpi_evaluator [options] <eval_opts> <eval_bopts> <eval_exe>

Which files do the input parameters eval_opts, eval_bopts and
eval_exe refer to?

Thanks
Dhivya

On Feb 6, 2014, at 11:47 AM, Carson Holt wrote:

> Ok.  Content looks good.  Just make sure to use gff3_merge to join  
> the GFF3’s without stripping out the fasta sequence at the end when  
> training SNAP.
>
> Thanks,
> Carson
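
A quick way to confirm that a merged GFF3 still carries its sequence is to look for the ##FASTA directive and count what follows it. The snippet below is only a minimal sketch and is not part of MAKER; the function name fasta_section_summary and the example records are made up for illustration.

```python
def fasta_section_summary(gff3_lines):
    """Return (n_records, n_bases) found after the ##FASTA directive.

    Assumption: the sequence is embedded using the explicit ##FASTA
    directive; (0, 0) means no embedded sequence was found.
    """
    in_fasta = False
    n_records = 0
    n_bases = 0
    for raw in gff3_lines:
        line = raw.strip()
        if not in_fasta:
            # Everything before the directive is annotation, not sequence.
            if line.upper().startswith("##FASTA"):
                in_fasta = True
            continue
        if not line:
            continue
        if line.startswith(">"):
            n_records += 1
        else:
            n_bases += len(line)
    return n_records, n_bases


# Hypothetical miniature GFF3 with one embedded sequence.
gff3_example = [
    "##gff-version 3",
    "contig1\tmaker\tgene\t1\t9\t.\t+\t.\tID=g1",
    "##FASTA",
    ">contig1",
    "ACGTACGTA",
]
print(fasta_section_summary(gff3_example))
```

On a real file the same function could be called as fasta_section_summary(open(path)); a result of (0, 0) would suggest the sequence was stripped before training.
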
>
>
> From: dhivya arasappan <darasappan at gmail.com>
> Date: Thursday, February 6, 2014 at 10:29 AM
> To: Carson Holt <carsonhh at gmail.com>
> Cc: Daniel Ence <dence at genetics.utah.edu>
> Subject: Re: [maker-devel] maker annotation with cufflinks output
>
> Sorry I was just trying to make it small enough to be approved by  
> the mailing list.
>
> Here is the whole file:
>
>
>  cat.formatted.gff.tgz
>
>
>
> On Thu, Feb 6, 2014 at 11:04 AM, Carson Holt <carsonhh at gmail.com>  
> wrote:
>> Could you give me the file without using 'head' to trim it? It’s
>> cutting it off before it reaches the part I’m interested in.
>>
>> —Carson
>>
>>
>> From: dhivya arasappan <darasappan at gmail.com>
>> Date: Thursday, February 6, 2014 at 10:01 AM
>>
>> To: Carson Holt <carsonhh at gmail.com>
>> Cc: Daniel Ence <dence at genetics.utah.edu>, "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
>> Subject: Re: [maker-devel] maker annotation with cufflinks output
>>
>> Oh yes, I did. I took just the non-sequence entries in the gff file
>> and used that as my input.  I will rerun snap with the gff file
>> containing the sequences as well.
>>
>> I'm attaching a snippet of the gff file that I used as input to  
>> maker2zff.
>>
>> Thanks for your help
>> Dhivya
>>
>>
>>
>>
>> On Feb 6, 2014, at 10:05 AM, Carson Holt wrote:
>>
>>> Your genome.dna file has no sequence?  Did you by any chance strip  
>>> the fasta sequence from the GFF3 you are using as input to  
>>> maker2zff?  There should be fasta sequence at the end of that  
>>> file.  Also can I see the GFF3 file you are using as input to  
>>> maker2zff.
>>>
>>> Thanks,
>>> Carson
>>>
>>> From: dhivya arasappan <darasappan at gmail.com>
>>> Date: Thursday, February 6, 2014 at 7:47 AM
>>> To: Carson Holt <carsonhh at gmail.com>
>>> Cc: Daniel Ence <dence at genetics.utah.edu>, "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
>>> Subject: Re: [maker-devel] maker annotation with cufflinks output
>>>
>>> Hello,
>>>
>>> It does appear that my genome.ann file from the maker2zff script
>>> has data in it. However, the SNAP steps after that have created
>>> empty files.  The following are all empty:
>>>
>>> alt.dna  err.dna  export.dna  genome.dna  olp.dna  uni.dna  wrn.dna
>>> alt.ann  err.ann  export.ann  genome.ann  olp.ann  uni.ann  wrn.ann
>>>
>>> When I tried to get gene stats or validate genome.ann, I get  
>>> errors like this for all of them:
>>>
>>> fathom genome.ann genome.dna -gene-stats |more
>>> MODEL5547 1 1 6 + errors(6): exon-1:out_of_bounds  
>>> exon-2:out_of_bounds exon-3:out_of_bounds exon-4:out_of_bounds  
>>> exon-5:out_of_bounds exon-6:out_of_bounds
>>> MODEL5568 1 1 6 - errors(6): exon-6:out_of_bounds  
>>> exon-5:out_of_bounds exon-4:out_of_bounds exon-3:out_of_bounds  
>>> exon-2:out_of_bounds exon-1:out_of_bounds
>>> MODEL5589 1 1 5 + errors(5): exon-1:out_of_bounds  
>>> exon-2:out_of_bounds exon-3:out_of_bounds exon-4:out_of_bounds  
>>> exon-5:out_of_bounds
>>> MODEL5195 1 1 21 + errors(21): exon-1:out_of_bounds  
>>> exon-2:out_of_bounds exon-3:out_of_bounds exon-4:out_of_bounds  
>>> exon-5:out_of_bounds exon-6:out_of_bounds exon-7:out_of_bounds  
>>> exon-8:out_of_bounds exon-9:out_of_bounds exon-10:out_of_bounds  
>>> exon-11:out_of_bounds exon-12:out_of_bounds exon-13:out_of_bounds  
>>> exon-14:out_of_bounds exon-15:out_of_bounds exon-16:out_of_bounds  
>>> exon-17:out_of_bounds exon-18:out_of_bounds exon-19:out_of_bounds  
>>> exon-20:out_of_bounds exon-21:out_of_bounds
>>>
>>> I'm not sure why the annotations I'm seeing in genome.ann are all
>>> showing up as errors. I realize this may be an issue with snap,
>>> but are you familiar with anything like this? My genome.ann file
>>> is attached for reference.
>>>
>>> Thanks
>>> Dhivya
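
The out_of_bounds errors above are what one would expect if the annotated coordinates run past the end of the sequence, which is always the case when the .dna file is empty. The following is a rough sketch of that comparison, not how fathom itself works. It assumes the short ZFF layout (a ">name" header per sequence, then "label begin end model" lines, with begin greater than end on the minus strand); the function names and example data are hypothetical.

```python
def read_fasta_lengths(fasta_lines):
    """Map each sequence name to its length in bases."""
    lengths = {}
    name = None
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def out_of_bounds_models(zff_lines, seq_lengths):
    """Return sorted model names with a feature outside its sequence."""
    bad = set()
    seq = None
    for raw in zff_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            seq = line[1:].split()[0]
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # not a "label begin end model" line
        begin, end, model = int(fields[1]), int(fields[2]), fields[3]
        # A sequence missing from the FASTA is treated as length 0.
        limit = seq_lengths.get(seq, 0)
        if min(begin, end) < 1 or max(begin, end) > limit:
            bad.add(model)
    return sorted(bad)


# Hypothetical annotation and sequence data.
zff_example = [
    ">contig1",
    "Einit 10 50 MODEL1",
    "Eterm 80 120 MODEL1",
    ">contig2",
    "Esngl 300 100 MODEL2",
]
dna_example = [">contig1", "A" * 200, ">contig2", "C" * 400]

print(out_of_bounds_models(zff_example, read_fasta_lengths(dna_example)))
print(out_of_bounds_models(zff_example, read_fasta_lengths([])))
```

With the sequences present no model is reported; with an empty sequence file every model is reported, which matches the pattern in the fathom output.
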
>>>
>>> On Feb 5, 2014, at 12:38 PM, Carson Holt wrote:
>>>
>>>> Do you have any features of type snap in your results from step
>>>> 3?  We’ve had a couple of recent posts where, after training, snap
>>>> was giving no results, and as a result maker couldn’t give any
>>>> genes.  One cause of something like that may be your step 2.
>>>> Make sure the ZFF you used to train with wasn’t empty.  The
>>>> maker2zff script uses filters to put only the best genes in the
>>>> ZFF file, and if all your genes fail the filtering then you are
>>>> training with an empty ZFF.
>>>>
>>>> Also, you should use proteins from a related species as your
>>>> protein file.  I see that your protein matches are varying wildly
>>>> from run to run, and so is your contig count.  Was the subset of
>>>> contigs you have results for long enough to contain genes?
>>>>
>>>> —Carson
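
One way to check for an empty training set is to count the gene models in the ZFF file before running the SNAP steps. This is a minimal sketch under the assumption that the file uses the short ZFF layout (">name" headers followed by "label begin end model" lines); count_zff_models is a made-up helper, not something shipped with MAKER or SNAP.

```python
def count_zff_models(zff_lines):
    """Return (n_sequences, n_models) for ZFF-style annotation lines."""
    sequences = set()
    models = set()
    current = None
    for raw in zff_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            sequences.add(current)
            continue
        fields = line.split()
        if len(fields) >= 4:
            # Assumption: the fourth column is the gene model name.
            models.add((current, fields[3]))
    return len(sequences), len(models)


# Hypothetical ZFF content: two sequences, two gene models.
training_example = [
    ">contig1",
    "Einit 10 50 MODEL1",
    "Eterm 80 120 MODEL1",
    ">contig2",
    "Esngl 300 100 MODEL2",
]
print(count_zff_models(training_example))
```

A result of (0, 0) would mean every gene failed the maker2zff filters and there is nothing to train on.
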
>>>>
>>>> From: dhivya arasappan <darasappan at gmail.com>
>>>> Date: Monday, February 3, 2014 at 9:31 AM
>>>> To: Daniel Ence <dence at genetics.utah.edu>
>>>> Cc: "maker-devel at yandell-lab.org" <maker-devel at yandell-lab.org>
>>>> Subject: Re: [maker-devel] maker annotation with cufflinks output
>>>>
>>>> Hi Daniel,
>>>>
>>>> I was able to check on some of those questions.
>>>>
>>>> 1. From trinity assembly: I started with 102000 contigs. I used  
>>>> trinotate to annotate proteins in this.
>>>>
>>>> I ran maker on this data with est2genome set to 1. The output  
>>>> looks like this (most important parts on top):
>>>>
>>>>    6653 gene
>>>>   46675 exon
>>>>  280534 protein_match
>>>>   59934 CDS
>>>>     969 contig
>>>>  105388 expressed_sequence_match
>>>>   12584 five_prime_UTR
>>>>   78565 match
>>>> 1401369 match_part
>>>>   10180 mRNA
>>>>   11545 three_prime_UTR
>>>>
>>>> 2. From cufflinks assembly: I started with 133380 entries (out of  
>>>> which there are 29,000 transcripts).  I used the protein  
>>>> sequences from trinity assembly.
>>>>
>>>> I ran maker on this data with est2genome set to 1. The output  
>>>> looks like this:
>>>>      29 gene
>>>>      75 exon
>>>>  573659 protein_match
>>>>      67 CDS
>>>>    1099 contig
>>>>  269298 expressed_sequence_match
>>>>      23 five_prime_UTR
>>>>  173844 match
>>>> 2221846 match_part
>>>>      29 mRNA
>>>>      23 three_prime_UTR
>>>>
>>>> The number of genes annotated using the trinity assembly is lower
>>>> than expected, so I went the cufflinks route. I don't understand
>>>> why even fewer genes are being found when using the cufflinks
>>>> transcripts.
>>>>
>>>> 3. Training SNAP:  I used the results of maker from 1 to train  
>>>> SNAP.  I then used that training set to rerun maker:
>>>> snaphmm=/scratch/01184/daras/jansen/RHA/allpaths/maker_mpi_withAlltrinity/snap/RHA.hmm
>>>> est2genome=0
>>>>
>>>> And again I got results with no entries for gene, exon, CDS etc.
>>>>     957 contig
>>>>   46555 expressed_sequence_match
>>>>   43651 match
>>>>  553633 match_part
>>>>  113738 protein_match
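
Feature counts like the ones listed above can be reproduced by tallying one column of the GFF3 output. The sketch below assumes the standard nine-column, tab-separated GFF3 layout and stops at the ##FASTA directive; the function name count_gff3_column and the example rows are invented for illustration.

```python
from collections import Counter


def count_gff3_column(gff3_lines, column=2):
    """Tally one GFF3 column (0-based): 2 is the type, 1 is the source."""
    counts = Counter()
    for raw in gff3_lines:
        line = raw.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence follows, no more features
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # skip anything that is not a full feature row
        counts[fields[column]] += 1
    return counts


# Hypothetical feature rows; sources and IDs are illustrative only.
rows_example = [
    "##gff-version 3",
    "contig1\tmaker\tgene\t1\t900\t.\t+\t.\tID=g1",
    "contig1\tmaker\texon\t1\t400\t.\t+\t.\tID=e1;Parent=m1",
    "contig1\tmaker\texon\t600\t900\t.\t+\t.\tID=e2;Parent=m1",
    "contig1\tsnap\tmatch\t1\t900\t.\t+\t.\tID=s1",
    "##FASTA",
    ">contig1",
    "ACGT",
]
print(count_gff3_column(rows_example))
print(count_gff3_column(rows_example, column=1))
```

Tallying column 1 instead of column 2 shows which programs contributed features, which is one way to see whether any rows come from the ab initio predictor at all.
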
>>>>
>>>> As I mentioned in another email, cegma results indicated that the  
>>>> genome was more than 90% complete. Any suggestions would be  
>>>> helpful.
>>>>
>>>> Thank you
>>>> Dhivya
>>>>
>>>>
>>>>
>>>>
>>>> On Jan 30, 2014, at 2:51 PM, Daniel Ence wrote:
>>>>
>>>>> Hi Dhivya,
>>>>>
>>>>> I think there are a few numbers that could be helpful to understand
>>>>> what's happening here.
>>>>>
>>>>> How many transcripts did Trinity assemble the RNA-seq data into?
>>>>> Also, you had 29,000 transcripts from cufflinks, but fewer from  
>>>>> MAKER when you gave it the cufflinks data. How many transcripts  
>>>>> did MAKER identify with the cufflinks data? Did you still get  
>>>>> more than the 10,000 transcripts that you found with just the  
>>>>> Trinity data?
>>>>>
>>>>> A key part of MAKER's approach to genome annotation that might
>>>>> be affecting its performance is that it only annotates a gene
>>>>> where there is both evidence (like your RNA-seq data) and an
>>>>> ab-initio prediction. If a prediction is unsupported by the
>>>>> evidence, then MAKER won't annotate a gene, and if evidence
>>>>> aligns where there's no prediction, MAKER won't annotate a gene
>>>>> either. Which ab-initio predictors are you using, and have they
>>>>> been trained for your specific genome?
>>>>>
>>>>> You can force MAKER to automatically promote evidence alignments  
>>>>> to a gene model by setting the est2genome option to 1, but that  
>>>>> will usually give you many false positives.
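
The rule described above can be pictured with a toy interval model. This is only an illustration of the idea that a prediction needs overlapping evidence; it is not MAKER's actual algorithm, which is considerably more involved. All names and coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if closed intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] <= b[1] and b[0] <= a[1]


def supported_predictions(predictions, evidence):
    """Predictions that overlap at least one evidence alignment."""
    return [p for p in predictions if any(overlaps(p, e) for e in evidence)]


def unclaimed_evidence(predictions, evidence):
    """Evidence alignments that overlap no prediction."""
    return [e for e in evidence if not any(overlaps(e, p) for p in predictions)]


# Hypothetical coordinates on a single contig.
predictions_example = [(100, 500), (2000, 2600)]
evidence_example = [(150, 450), (5000, 5400)]

# Only the first prediction has support, so only it would be kept.
print(supported_predictions(predictions_example, evidence_example))
# Evidence with no prediction is dropped unless it is promoted,
# which is loosely what setting est2genome to 1 asks for.
print(unclaimed_evidence(predictions_example, evidence_example))
```

In this picture, a run with no usable ab initio predictions leaves every evidence alignment unclaimed, so no genes would be reported without est2genome.
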
>>>>>
>>>>> Try rerunning it with either the Trinity data or the Cufflinks  
>>>>> data and with est2genome set to 1, and let us know how that  
>>>>> affects the MAKER results.
>>>>>
>>>>> Thanks,
>>>>> Daniel
>>>>>
>>>>> Daniel Ence
>>>>> Graduate Student
>>>>> Eccles Institute of Human Genetics
>>>>> University of Utah
>>>>> 15 North 2030 East, Room 2100
>>>>> Salt Lake City, UT 84112-5330
>>>>> ________________________________________
>>>>> From: maker-devel [maker-devel-bounces at yandell-lab.org] on  
>>>>> behalf of dhivya arasappan [darasappan at gmail.com]
>>>>> Sent: Thursday, January 30, 2014 11:18 AM
>>>>> To: maker-devel at yandell-lab.org
>>>>> Subject: [maker-devel] maker annotation with cufflinks output
>>>>>
>>>>> Hello,
>>>>>
>>>>> I am trying to annotate a 200 Mb plant genome for which I have a
>>>>> very good assembly.
>>>>>
>>>>> I tried to de novo assemble RNA-seq data using trinity and ran
>>>>> maker using my genome assembly and the trinity results.  I did not
>>>>> get as many transcripts as expected, only around 10,000.
>>>>>
>>>>> So, I decided to try a different approach.  I did a genome-assisted
>>>>> assembly of the RNA-seq data using tophat/cufflinks. This pipeline
>>>>> generated 21,000 genes and 29,000 transcripts.  I then ran maker
>>>>> using my genome assembly and the cufflinks result, but I get far
>>>>> fewer transcripts as a result.
>>>>>
>>>>> If cufflinks found 29,000 transcripts by mapping to the genome, I'm
>>>>> confused as to why maker is not finding the same.
>>>>>
>>>>> Any suggestions would be appreciated.
>>>>>
>>>>> Thanks
>>>>> Dhivya
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> maker-devel mailing list
>>>>> maker-devel at box290.bluehost.com
>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>>
>>>
>>
>
