[maker-devel] Unexpected results with correct_est_fusion

Carson Holt carsonhh at gmail.com
Tue Sep 17 21:57:12 MDT 2013


It does sound like this is likely the result of gene fusion from the Trinity
assemblies.  One thing to look at is the number of coding exons compared to
the other ant species.  See if the increase in exons is mostly in UTR,
coding sequence, or both.
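
One way to make that comparison is sketched below. It is an illustration only, not something MAKER provides: it assumes a GFF3 file in which exon and CDS features carry a Parent attribute pointing at their mRNA and in which the annotation lines have "maker" in the source column. The function names (parse_attributes, count_exon_classes) are made up for this example. An exon counts as coding if it overlaps any CDS of the same transcript, and as UTR-only otherwise.

```python
from collections import defaultdict


def parse_attributes(field):
    """Turn a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def count_exon_classes(gff_lines, source="maker"):
    """Count exons that overlap a CDS versus exons that are UTR only.

    `source` filters on GFF3 column 2; pass None to accept every source.
    """
    exons = defaultdict(list)  # transcript ID -> [(start, end), ...]
    cds = defaultdict(list)
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section ends the annotation part
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] not in ("exon", "CDS"):
            continue
        if source is not None and cols[1] != source:
            continue
        start, end = int(cols[3]), int(cols[4])
        parents = parse_attributes(cols[8]).get("Parent", "")
        for parent in parents.split(","):
            if not parent:
                continue
            target = exons if cols[2] == "exon" else cds
            target[parent].append((start, end))

    counts = {"coding": 0, "utr_only": 0}
    for parent, intervals in exons.items():
        for start, end in intervals:
            is_coding = any(
                start <= c_end and c_start <= end
                for c_start, c_end in cds[parent]
            )
            counts["coding" if is_coding else "utr_only"] += 1
    return counts
```

Running it on your annotation and on the published ant annotations (for those, probably with source=None) should show whether the extra exons sit in UTR, in coding sequence, or in both.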

One thing you could try is running MAKER without the EST evidence, just to
see how many genes you get with protein only support.  There are ways to use
multiple MAKER runs to tease out details of the data.

For example:
run1: protein evidence only plus ab initio predictors like SNAP and Augustus.
run2: protein and EST evidence.  Models from run1 passed in as pred_gff with
SNAP and Augustus turned off (this will force the addition of UTR, but not
the generation of new models).  Use the correct_est_fusion=1 option here to
clip UTR that runs into neighboring genes.
run3: protein and EST evidence plus Augustus and SNAP.
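
As a rough sketch, the control-file differences between the three runs could be written down like this. The option names (protein, est, snaphmm, augustus_species, pred_gff, correct_est_fusion) are the ones from maker_opts.ctl; every file name and the species name are placeholders, and options not mentioned above are simply left out rather than guessed at.

```python
# Hypothetical per-run overrides for maker_opts.ctl. All values are
# placeholders; an empty string means the option is left blank (turned off).
RUNS = {
    "run1": {
        "protein": "ant_proteins.fasta",
        "est": "",
        "snaphmm": "gracilis.hmm",
        "augustus_species": "gracilis",
        "pred_gff": "",
    },
    "run2": {
        "protein": "ant_proteins.fasta",
        "est": "trinity_assembly.fasta",
        "snaphmm": "",           # ab initio predictors off
        "augustus_species": "",
        "pred_gff": "run1_models.gff",
        "correct_est_fusion": "1",
    },
    "run3": {
        "protein": "ant_proteins.fasta",
        "est": "trinity_assembly.fasta",
        "snaphmm": "gracilis.hmm",
        "augustus_species": "gracilis",
        "pred_gff": "",
    },
}


def render_overrides(run_name):
    """Return the overrides for one run as key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in RUNS[run_name].items())


if __name__ == "__main__":
    for name in RUNS:
        print(f"# {name}")
        print(render_overrides(name))
        print()
```

The printed lines are meant to be copied by hand into a separate maker_opts.ctl for each run; the script does not edit any control file itself.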

Then take the models from run2, plus the models from run3 that do not overlap
run2, and add them all to your final set, along with any models that come from
InterProScan domain analysis of rejected models.
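
The "run3 models that do not overlap run2" step could be done with a small script like the one below. This is a minimal sketch under some assumptions: both runs are available as GFF3 with gene features, overlap is judged on gene coordinates on the same sequence regardless of strand, and the function names are invented for this example. It only returns the IDs of the run3 genes to keep; pulling the matching records out of the GFF3 is left to whatever tool you prefer.

```python
from collections import defaultdict


def gene_intervals(gff_lines):
    """Collect (seqid, start, end, gene_id) for every gene feature."""
    genes = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # stop at an embedded sequence section
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue
        gene_id = ""
        for pair in cols[8].split(";"):
            if pair.startswith("ID="):
                gene_id = pair[3:]
        genes.append((cols[0], int(cols[3]), int(cols[4]), gene_id))
    return genes


def run3_genes_to_add(run2_lines, run3_lines):
    """Return IDs of run3 genes that overlap no run2 gene on the same sequence."""
    occupied = defaultdict(list)  # seqid -> [(start, end), ...] from run2
    for seqid, start, end, _ in gene_intervals(run2_lines):
        occupied[seqid].append((start, end))

    keep = []
    for seqid, start, end, gene_id in gene_intervals(run3_lines):
        clashes = any(
            start <= o_end and o_start <= end
            for o_start, o_end in occupied[seqid]
        )
        if not clashes:
            keep.append(gene_id)
    return keep
```

The comparison is a simple all-against-all scan per scaffold, which should be fine for a few thousand scaffolds with tens of genes each. If you would rather treat genes on opposite strands as non-overlapping, the strand column would need to be added to the comparison.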

This solution is rather lengthy, but it may avoid many of the problems you seem
to be getting with gene merging even with jaccard_clip and
correct_est_fusion turned on, because your ESTs would only contribute to the
UTR and to models not found based solely on protein evidence (i.e., they
would be ignored in cases where you get enough evidence from other sources).

--Carson



From:  Benjamin Rubin <brubin at fieldmuseum.org>
Date:  Tuesday, September 17, 2013 10:08 AM
To:  Carson Holt <carsonhh at gmail.com>
Subject:  Re: [maker-devel] Unexpected results with correct_est_fusion

Hi Carson,

The new version is working great. Thanks for your help.

I do have another more general question. I am working on annotating a new
ant genome (Pseudomyrmex gracilis) and the results that I am getting from
MAKER are a bit unexpected. The number of genes produced by MAKER is ~14,300
while, as you may know, the seven published ant genomes have at least 16,000
genes (this number was improved by several hundred by turning on
correct_est_fusion). Running the ab initio predictions through InterProScan
yields ~900 additional genes for P. gracilis so there are still
substantially fewer genes found for this species. This difference on its own
is not that unexpected; Pseudomyrmex likely diverged from the other
sequenced ants by over 100 million years and the genome sequence itself is
rather fragmented and incomplete. However, what is bothering me is that,
despite having fewer genes, I am seeing substantially larger numbers of
exons (~92,000 as opposed to 78-85,000) and the total length of all proteins
is more than a million amino acids longer in P. gracilis. It does not have
unexpectedly long genes; the average gene length is just a bit higher. I
have looked at the annotations of some conserved genes and found some
apparently spurious exons merged with these genes. I say that they are
spurious because they go beyond the end of the gene sequence in other
species (ants and Drosophila). Unfortunately, it appears that many of these
spurious calls are primarily the result of blast hits to my EST data. The
ESTs generally seem to blast to the genome a bit more often than expected.

Partly as a result of the relatively high repeat content of my genome (~50%
complex repeats) and partly because we only used two Illumina libraries, my
genome sequence is quite fragmented (~280Mb in ~6,500 scaffolds). Note that
the total genome length is estimated at 387Mb, so I am missing a fair amount
but almost all CEGMA genes are present in the assembly so I have concluded
that the missing sequence is predominantly repeats. I have no prior reason
to expect that my EST library has anything wrong with it. I did a single
Illumina lane of RNA-seq and assembled in Trinity with the jaccard_clip
option on to reduce gene fusions.

If you have any advice on how my gene predictions can be improved, I would
really appreciate it. Have you heard of this kind of problem before? Is
there a way to limit the influence of ESTs without discarding them entirely?

Thanks so much for your help with the fusion bug and for any advice here.
Ben


On Wed, Sep 11, 2013 at 9:27 AM, Benjamin Rubin <brubin at fieldmuseum.org>
wrote:
> Hi Carson,
> 
> OK, I will try it and let you know how it goes. And thanks for the suggestion
> about using always_complete as well.
> 
> Thanks!
> Ben
> 
> 
> On Tue, Sep 10, 2013 at 9:45 PM, Carson Holt <carsonhh at gmail.com> wrote:
>> I think I have it fixed.  Sorry it took so long, but my original fix actually
>> created other odd behaviors so I had to track those down as well.
>> 
>> You can download the test version with the fix by typing this on the command
>> line -->
>> 
>> svn co *********
>> 
>> user: *****
>> password: *****
>> 
>> Test it out and let me know.  On the contig you sent me, I also set
>> always_complete=1 as some of the hint based models were lacking start or stop
>> codons.  The results looked slightly better that way as well.
>> 
>> Thanks,
>> Carson
>> 
>> 
>> 
>> From:  Benjamin Rubin <brubin at fieldmuseum.org>
>> Date:  Wednesday, September 4, 2013 10:07 AM
>> To:  Carson Holt <carsonhh at gmail.com>
>> 
>> Subject:  Re: [maker-devel] Unexpected results with correct_est_fusion
>> 
>> OK, great. Thanks for letting me know.
>> 
>> Ben
>> 
>> 
>> On Wed, Sep 4, 2013 at 9:00 AM, Carson Holt <carsonhh at gmail.com> wrote:
>>> I thought I'd give you an update on this.  I've verified the bug and think
>>> I've identified roughly where it's happening.  I'll have a fix for you to
>>> test soon.
>>> 
>>> --Carson
>>> 
>>> 
>>> From:  Benjamin Rubin <brubin at fieldmuseum.org>
>>> 
>>> Date:  Wednesday, August 28, 2013 4:16 PM
>>> To:  Carson Holt <carsonhh at gmail.com>
>>> Subject:  Re: [maker-devel] Unexpected results with correct_est_fusion
>>> 
>>> Hi Carson,
>>> 
>>> OK, I think I uploaded all of the necessary files. I made a directory named
>>> "rubin_data" for everything. I included both the full genome file
>>> ("ec_patch...") as well as a file for scaffold_1. For this scaffold, I get
>>> 132 genes when correct_est_fusion is off and 35 when it is on. These results
>>> are after running maker a first time with correct_est_fusion on and
>>> retraining SNAP/Augustus on the results. The SNAP file is
>>> "gracilis_round_1.hmm" and I think the necessary Augustus files are in the
>>> "gracilis_jaccard_flank100_corrfusion_round_1_results" directory. I also
>>> included gff files for scaffold_1 with and without correct_est_fusion turned
>>> on.
>>> 
>>> Let me know if there is anything else that I failed to upload. I really
>>> appreciate your time. Thanks so much.
>>> 
>>> Ben
>>> 
>>> 
>>> On Wed, Aug 28, 2013 at 9:59 AM, Benjamin Rubin <brubin at fieldmuseum.org>
>>> wrote:
>>>> Hi Carson,
>>>> 
>>>> Yes, I would be happy to upload the necessary data. Just let me know the
>>>> connection information.
>>>> 
>>>> Thanks!
>>>> Ben
>>>> 
>>>> 
>>>> On Wed, Aug 28, 2013 at 8:09 AM, Carson Holt <carsonhh at gmail.com> wrote:
>>>>> Could you pick one contig where the number of genes shifts dramatically and
>>>>> upload that contig fasta together with your control files and any evidence
>>>>> datasets used to one of our servers (I'm going to send you connection
>>>>> details in a separate e-mail).  I can then run with and without
>>>>> correct_est_fusion to see if there is anything unexpected going on.
>>>>> 
>>>>> --Carson
>>>>> 
>>>>> 
>>>>> 
>>>>> From:  Benjamin Rubin <brubin at fieldmuseum.org>
>>>>> Date:  Tuesday, August 27, 2013 10:59 AM
>>>>> To:  Carson Holt <carsonhh at gmail.com>
>>>>> Cc:  <maker-devel at yandell-lab.org>
>>>>> Subject:  Re: [maker-devel] Unexpected results with correct_est_fusion
>>>>> 
>>>>> Hi Carson,
>>>>> 
>>>>> I increased pred_flank to 200 and reran MAKER with correct_est_fusion, but
>>>>> I still only get ~5,000 genes (5,082 instead of the 5,020 with pred_flank
>>>>> at 100). This is using only the first round with SNAP and Augustus trained
>>>>> on the CEGMA genes. Is there anything else that I might be doing wrong? I
>>>>> have attached my control file in case that could be useful.
>>>>> 
>>>>> Thanks for the help!
>>>>> Ben
>>>>> 
>>>>> 
>>>>> On Mon, Aug 26, 2013 at 2:00 PM, Carson Holt <carsonhh at gmail.com> wrote:
>>>>>> The correct_est_fusion option just clips UTR on overlapping genes.   I
>>>>>> suspect the real problem is setting pred_flank too low.  If your lead in
>>>>>> sequence to a gene is too short, ab initio predictors won't call it.  So
>>>>>> you are probably getting empty reports from SNAP/Augustus for the hint
>>>>>> based predictions.  Try increasing pred_flank to at least 150.  Setting
>>>>>> pred_flank too low will also limit how far MAKER will walk out along the
>>>>>> edges of the initial alignments during the polishing step (exonerate).  So
>>>>>> setting it too low may also be causing you to lose some EST and protein
>>>>>> alignments.
>>>>>> 
>>>>>> --Carson
>>>>>> 
>>>>>> 
>>>>>> From:  Benjamin Rubin <brubin at fieldmuseum.org>
>>>>>> Date:  Monday, August 26, 2013 2:20 PM
>>>>>> To:  <maker-devel at yandell-lab.org>
>>>>>> Subject:  [maker-devel] Unexpected results with correct_est_fusion
>>>>>> 
>>>>>> Hello developers,
>>>>>> 
>>>>>> I am using MAKER 2.28 to annotate an ant genome. I provide protein
>>>>>> sequence evidence from all seven of the other sequenced ant genomes and a
>>>>>> de novo assembled transcriptome as EST evidence. I assembled the
>>>>>> transcriptome using Trinity with the jaccard_clip option turned on to
>>>>>> reduce gene fusions. Despite using this set of hopefully non-fused ESTs,
>>>>>> I still have substantial fusion problems with the final annotation.
>>>>>> Therefore, I reduced pred_flank to 100 and turned on correct_est_fusion.
>>>>>> However, correct_est_fusion leads to the prediction of a much smaller
>>>>>> number of genes (~5,000 instead of ~14,000). I am initially training both
>>>>>> SNAP and Augustus using CEGMA genes and then retraining based on the
>>>>>> first round of annotation. Both rounds of annotation yield the same low
>>>>>> number (~5,000) of genes. It may also be worth mentioning that the number
>>>>>> of exons is also far lower when using correct_est_fusion (~26,000 instead
>>>>>> of ~90,000).
>>>>>> 
>>>>>> Is this the expected behavior of correct_est_fusion? I was surprised that
>>>>>> it reduced the predicted number of genes by such a large margin. I am
>>>>>> concerned that I am using it incorrectly. Do you have any other
>>>>>> suggestions for reducing gene merging?
>>>>>> 
>>>>>> Thanks,
>>>>>> Ben
>>>>>> 
>>>>>> -- 
>>>>>> _____________________________________________________
>>>>>> Benjamin ER Rubin
>>>>>> PhD Candidate
>>>>>> Committee on Evolutionary Biology
>>>>>> University of Chicago
>>>>>> http://www.moreaulab.org/Benjamin_Rubin.html
>>>>>> 
>>>>>> Division of Insects
>>>>>> Zoology Department
>>>>>> Field Museum of Natural History
>>>>>> 1400 South Lake Shore Drive
>>>>>> Chicago, IL 60605
>>>>>> USA
>>>>>> Office: (312) 665-7776
>>>>>> _______________________________________________
>>>>>> maker-devel mailing list
>>>>>> maker-devel at box290.bluehost.com
>>>>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org



-- 
_____________________________________________________
Benjamin ER Rubin
PhD Candidate
Committee on Evolutionary Biology
University of Chicago
benrubin.org <http://benrubin.org>

Division of Insects
Zoology Department
Field Museum of Natural History
1400 South Lake Shore Drive
Chicago, IL 60605
USA
Office: (312) 665-7776

