[maker-devel] Filtering of ab initio gene models

Sat Jun 7 14:03:18 MDT 2014

The problem in the example you sent is the geneseqer entries in the GFF3 you
are passing in.  They are causing gene clusters to merge.  The result is that
UTR is being over-extended and overlaps neighboring models (and probably
some models get merged).  As you noticed, you can't have overlapping models
on the same strand. If you set score_preds=1 in the maker_opts.ctl file, it
will give you AED scores for the rejected ab initio models.  You will notice
that none of them score better than 0.23.
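
To look at those scores in bulk, something like the following could pull
them out of the GFF3 output. This is only a sketch: it assumes the scores
appear as an _AED attribute in column 9, and the sample records and IDs are
made up for illustration.

```python
def aed_scores(gff3_lines):
    """Yield (feature_id, source, aed) for each GFF3 record with an _AED attribute."""
    for line in gff3_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        if "_AED" in attrs:
            yield attrs.get("ID", "?"), cols[1], float(attrs["_AED"])


# Made-up records, for illustration only.
example = [
    "##gff-version 3",
    "scaf1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1",
    "scaf1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=m1;Parent=g1;_AED=0.23",
    "scaf1\tsnap_masked\tmatch\t1200\t2500\t.\t+\t.\tID=p1;_AED=0.41",
]

# Print the best-scoring (lowest AED) records first.
for feature_id, source, aed in sorted(aed_scores(example), key=lambda r: r[2]):
    print(f"{aed:.2f}\t{source}\t{feature_id}")
```

Sorting by AED makes it easy to see where the rejected predictions fall
relative to the model that was kept.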

One thing you can do is set correct_est_fusion=1.  This tries to correct for
erroneous EST/transcript evidence that leads to over-extended UTR and false
gene merging.  You will see in the attached image that it trims back the
overlapping 3' and 5' UTR for the overlapping gene models, given that MAKER
believes the evidence leading to the overlap is likely low confidence and is
a false merge of regions.  I think much of your geneseqer input is more of a
problem than a help for the annotation. Many seem to be spurious alignments.

--Carson

From:  Daniel Standage <daniel.standage at gmail.com>
Date:  Friday, June 6, 2014 at 5:58 PM
To:  Volker Brendel <vbrendel at indiana.edu>
Cc:  Carson Holt <carsonhh at gmail.com>, Maker Mailing List
<maker-devel at yandell-lab.org>
Subject:  Re: [maker-devel] Filtering of ab initio gene models

In the example sent previously, transcript TSA024184 overlaps with the 3'
end of our gene model's CDS by 3 nucleotides. If I manually change the
transcript's end coordinate (6400 to 6100) so that there are two separate
non-overlapping evidence clusters, two models are reported as expected. But
I can even get both models reported with a much smaller change (6400 to
6395), where the UTRs still overlap but the CDS does not overlap with the
UTR. The 5' end of our gene model's CDS also overlaps with another
transcript. Maker has no problem reporting both of these gene models though,
probably since they're on different strands?

So correct me if I'm wrong, but it appears that Maker will report
overlapping gene models if they are on opposite strands or if no CDS is
involved in the overlap. Is there any way this behavior can be configured?
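
Written as a predicate, the rule hypothesized above might look like the
sketch below. It encodes only the behavior observed in this one example and
is not MAKER's actual logic. The Model class, the function names and all
coordinates are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    strand: str
    start: int      # 1-based inclusive, as in GFF3
    end: int
    cds_start: int
    cds_end: int


def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def may_coexist(a, b):
    """Hypothesized rule: opposite strands, or no CDS inside the overlap."""
    if a.strand != b.strand:
        return True
    if not overlaps(a.start, a.end, b.start, b.end):
        return True
    # Same strand and overlapping: fine only if neither CDS reaches the other model.
    a_cds_hit = overlaps(a.cds_start, a.cds_end, b.start, b.end)
    b_cds_hit = overlaps(b.cds_start, b.cds_end, a.start, a.end)
    return not (a_cds_hit or b_cds_hit)


# Made-up coordinates: the right model's CDS starts at 2001.
right = Model("+", 1900, 3000, 2001, 2900)
left_long = Model("+", 1000, 2003, 1200, 1800)    # reaches 3 nt into right's CDS
left_short = Model("+", 1000, 1998, 1200, 1800)   # UTRs overlap, CDS untouched
left_minus = Model("-", 1000, 2003, 1200, 1800)   # same span, opposite strand

print(may_coexist(left_long, right))    # False
print(may_coexist(left_short, right))   # True
print(may_coexist(left_minus, right))   # True
```

The three cases mirror the ones described above: a small overlap into the
CDS, an overlap confined to the UTRs, and an overlap across strands.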

On another note, we're considering your suggestion to integrate EVM with
Maker. One possibility discussed is to run Maker 4 separate times (once for
each of Augustus, GeneMark, SNAP, and our model_gff models), each time with
all our transcript/protein evidence, prior to consensus modeling with EVM.
Would that provide any benefit over running Maker a single time with all
prediction sources simultaneously?

Thanks,
Daniel

--
Daniel S. Standage
Ph.D. Candidate
Computational Genome Science Laboratory
Indiana University

On Fri, Jun 6, 2014 at 5:52 PM, Volker Brendel <vbrendel at indiana.edu> wrote:
>     
>  Hi Carson,
>  is there a way of allowing MAKER to add UTRs to our external models (supplied
> by the pred_gff or model_gff tag)?  This seems to be one problem we are
> running into.  Our external models are high quality, but CDS only.  Thus their
> score gets knocked down relative to ab initio predictions with added UTRs.
>  
>  Daniel will have more questions/observations later with regard to overlapping
> gene models (we definitely need to allow gene models to overlap in the UTRs,
> because transcript evidence clearly shows such negative intergenic spaces).
>  
>  Thanks for all your help!
>  Volker
> 
>  
>  
> On 6/6/2014 11:39 AM, Carson Holt wrote:
>  
>  
>>   
>> snap_masked-$seqid-processed-gene was produced by SNAP on the repeat masked
>> sequence without hints (i.e. the ab initio call).
>>  
>> maker-$seqid-snap-gene was produced by SNAP after receiving hints from MAKER.
>>  
>> 
>>  
>>  
>> In both cases MAKER is allowed to add UTR to the model (hence the 'processed'
>> tag).
>>  
>> 
>>  
>>  
>> --Carson
>>  
>> 
>>  
>>  
>> 
>>  
>>   
>> From:  Daniel Standage <daniel.standage at gmail.com>
>>  Date:  Friday, June 6, 2014 at 10:33 AM
>>  To:  Carson Holt <carsonhh at gmail.com>
>>  Cc:  Maker Mailing List <maker-devel at yandell-lab.org>, Volker Brendel
>> <vbrendel at indiana.edu>
>>  Subject:  Re: [maker-devel] Filtering of ab initio gene models
>>  
>>  
>> 
>>  
>>  
>>  
>>  
>> Another question: is there documentation anywhere for the naming conventions
>> of the genes annotated by Maker? Of course it's easy to spot genes based on a
>> particular ab initio gene predictor, as the names are in the IDs. But what is
>> the significance of, say, "snap_masked-$seqid-processed-gene" in a gene ID vs
>> "maker-$seqid-snap-gene"?
>>  
>>  
>>  Thanks,
>>  
>>  Daniel
>>  
>>  
>> 
>>  
>>  
>> 
>>  --
>>  Daniel S. Standage
>>  Ph.D. Candidate
>>  Computational Genome Science Laboratory
>>  Indiana University
>>  
>>  
>>  
>>  
>>  
>> On Thu, Jun 5, 2014 at 2:05 PM, Daniel Standage <daniel.standage at gmail.com>
>> wrote:
>>  
>>>  
>>>  
>>>  
>>> I have attached data for a small 18kb region with a handful of genes, as
>>> well as the corresponding maker_opts.ctl file. (This is a smaller and
>>> different data set than what I was looking at yesterday, with a more
>>> well-defined problem).
>>>  
>>>  With the data files as is, Maker 2.31.3 reports a model from 4125 to 6400
>>> with an AED of 0.23. If you exclude transcript TSA024184, Maker reports a
>>> different gene from 6111 to 8345 with an AED of 0.01. Both of these genes
>>> have transcript support: will Maker report overlapping genes under any
>>> conditions? And even if Maker is forced to choose only a single gene to
>>> report, why would the model from 4125 to 6400 ever be reported in place of
>>> the one from 6111 to 8345, especially since this is provided in the
>>> model_gff file?
>>>  
>>>  
>>>  Even when transcript TSA024184 is included, Maker 2.10 reports the
>>> high-confidence gene from 6111 to 8345.
>>>  
>>>  
>>>  Any light you could shed would be helpful. Thanks!
>>>  
>>>  
>>>  
>>> 
>>>  
>>>  
>>> 
>>>  --
>>>  Daniel S. Standage
>>>  Ph.D. Candidate
>>>  Computational Genome Science Laboratory
>>>  Indiana University
>>>  
>>>  
>>>  
>>>  
>>>  
>>>  
>>>  
>>>  
>>> On Wed, Jun 4, 2014 at 3:17 PM, Carson Holt <carsonhh at gmail.com> wrote:
>>>  
>>>>  
>>>>  
>>>> Just eAED, but eAED can affect selection of ab initio results.  For
>>>> example, reading frame match of protein evidence also affects whether
>>>> evidence from single_exon=1 and genes with single_exon protein evidence get
>>>> kept.  There is also the assumption that your alignments in GFF3 are
>>>> correctly spliced (like BLAT does).  So giving blastn results as
>>>> precomputed est_gff would create a lot of noise, since maker ignores blastn
>>>> and is using it only to seed the polished exonerate alignments.
>>>>  
>>>> 
>>>>  
>>>>  
>>>> --Carson
>>>>  
>>>> 
>>>>  
>>>>  
>>>> 
>>>>  
>>>>   
>>>> From:  Daniel Standage <daniel.standage at gmail.com>
>>>>  Date:  Wednesday, June 4, 2014 at 1:11 PM
>>>>  To:  Carson Holt <carsonhh at gmail.com>
>>>>  Cc:  Maker Mailing List <maker-devel at yandell-lab.org>
>>>>  Subject:  Re: [maker-devel] Filtering of ab initio gene models
>>>>  
>>>>  
>>>>  
>>>>  
>>>> 
>>>>  
>>>>  
>>>> I do not provide Gap or Target attributes in the GFF3. Will this affect the
>>>> AED as well, or just the eAED?
>>>>  
>>>>  
>>>> 
>>>>  
>>>>  
>>>> 
>>>>  --
>>>>  Daniel S. Standage
>>>>  Ph.D. Candidate
>>>>  Computational Genome Science Laboratory
>>>>  Indiana University
>>>>  
>>>>  
>>>>  
>>>>  
>>>>  
>>>> On Wed, Jun 4, 2014 at 3:09 PM, Carson Holt <carsonhh at gmail.com> wrote:
>>>>  
>>>>>  
>>>>>  
>>>>> Sure.  That would be helpful.  One question. Do you provide the Gap
>>>>> attribute in your precomputed alignments?  Having or not having that
>>>>> attribute affects the eAED score which takes reading frame into account,
>>>>> and may cause some things to be kept that normally would be dropped,
>>>>> because MAKER won't be able to take the points of mismatch of the
>>>>> alignment into account (it just assumes match everywhere).
>>>>>  
>>>>> 
>>>>>  
>>>>>  
>>>>> --Carson
>>>>>  
>>>>> 
>>>>>  
>>>>>  
>>>>> 
>>>>>  
>>>>>   
>>>>> From:  Daniel Standage <daniel.standage at gmail.com>
>>>>>  Date:  Wednesday, June 4, 2014 at 1:03 PM
>>>>>  To:  Maker Mailing List <maker-devel at yandell-lab.org>
>>>>>  Subject:  [maker-devel] Filtering of ab initio gene models
>>>>>  
>>>>>  
>>>>>  
>>>>>  
>>>>> 
>>>>>  
>>>>>  
>>>>>  
>>>>>  
>>>>>  
>>>>> Thanks everyone for your responses recently!
>>>>>  
>>>>>  
>>>>>  The reason for my recent flurry of email activity is that I'm seeing some
>>>>> unexpected trends when running the new version of Maker with precomputed
>>>>> alignments. Compared with an annotation I did a while ago (Maker 2.10,
>>>>> Maker-computed alignments), this new annotation has a substantial number
>>>>> of new genes annotated. If I compare distributions of AED scores between
>>>>> the old and new annotation, it's clear that the new annotation has a lot
>>>>> more low-quality models. If I look at new gene models that do not overlap
>>>>> with any gene model from the old annotation, the likelihood that it's a
>>>>> low-quality model is much higher.
>>>>>  
>>>>>  
>>>>>  I decided to run a little experiment. I annotated a scaffold first using
>>>>> Maker 2.10 and then using Maker 2.31.3. In both cases, I used the same
>>>>> pre-computed transcript and protein alignments and the same (latest)
>>>>> version of SNAP as the only ab initio predictor. Maker 2.10 predicted 44
>>>>> genes while Maker 2.31.3 predicted 63. If we group gene models into loci
>>>>> by overlap, there are 33 loci with gene models from both 2.10 and 2.31.3,
>>>>> 1 locus with only models from 2.10, and 28 loci with only models from
>>>>> 2.31.3.
>>>>>  
>>>>>  
>>>>>  Before this experiment, I assumed the issue was related to providing
>>>>> pre-computed alignments in GFF3 format and perhaps violating some
>>>>> important assumption. However, this experiment makes me wonder whether
>>>>> there have been changes to how Maker filters ab initio gene models between
>>>>> version 2.10 and version 2.31.3? Do you have any ideas? If it would help,
>>>>> I could put together a small data set that reproduces the behavior I just
>>>>> described.
>>>>>  
>>>>>  Thanks!
>>>>>  
>>>>>  
>>>>>  
>>>>>  
>>>>>  
>>>>>  
>>>>> 
>>>>>  --
>>>>>  Daniel S. Standage
>>>>>  Ph.D. Candidate
>>>>>  Computational Genome Science Laboratory
>>>>>  Indiana University
>  
>  
> -- 
> Volker Brendel
> Professor of Biology and Computer Science
> Indiana University
> Department of Biology & School of Informatics and Computing
> Simon Hall 205C
> 212 South Hawthorne Drive
> Bloomington, IN 47405-7003
> 
> Tel.: (812) 855-7074 http://brendelgroup.org/
>  

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 6246A771-E6A9-4875-9362-DC8A7A5BC9C4.png
Type: image/png
Size: 48365 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20140607/f018927a/attachment-0003.png>