[maker-devel] genome duplication?

Mon Feb 2 14:49:02 MST 2015

Thanks Carson. Any suggestion on the size cutoff for separating out the short
contigs, e.g. <500 bp?
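
For reference, the separation step Carson describes below could be sketched roughly like this in Python. It is only a minimal sketch: the function names, the output file names and the command-line layout are made up for illustration, and the cutoff is left as a parameter because no value has been agreed on in this thread.

```python
import sys


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(records, min_len):
    """Return (long, short) lists; 'long' means sequence length >= min_len."""
    long_recs, short_recs = [], []
    for header, seq in records:
        target = long_recs if len(seq) >= min_len else short_recs
        target.append((header, seq))
    return long_recs, short_recs


def write_fasta(records, path, width=60):
    """Write records to path, wrapping sequence lines at 'width' characters."""
    with open(path, "w") as out:
        for header, seq in records:
            out.write(f">{header}\n")
            for i in range(0, len(seq), width):
                out.write(seq[i:i + width] + "\n")


# Hypothetical usage: python split_contigs.py assembly.fa CUTOFF
if __name__ == "__main__" and len(sys.argv) == 3:
    fasta_path, cutoff = sys.argv[1], int(sys.argv[2])
    with open(fasta_path) as fh:
        long_recs, short_recs = split_by_length(read_fasta(fh), cutoff)
    # Output names are placeholders, not anything MAKER expects.
    write_fasta(long_recs, "contigs_long.fa")
    write_fasta(short_recs, "contigs_short.fa")
    print(f"{len(long_recs)} long, {len(short_recs)} short")
```

Each of the two output files would then go through its own MAKER run, with the relaxed maker_bopts.ctl thresholds applied only to the short set.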
On 03/02/2015 3:36 AM, "Carson Holt" <carsonhh at gmail.com> wrote:

> MAKER requires every gene to have at least some evidence support.  This is
> very important for most eukaryotes as false positive predictions will
> dominate what is called by snap/augustus.  However, it is not such a large
> problem in fungi because of their high gene density and less frequent
> introns.  Setting keep_preds=1 will maximize sensitivity at the cost of
> specificity (bad idea in most eukaryotes, but not so much in fungi).  I
> would not be surprised if a bias toward sensitivity is used by most fungi
> annotation projects with every gene that can be annotated being annotated
> (even if it does increase false positives).  It is a tactic that can work
> at least in fungi.
>
> Also if the assembly is fragmented, you will be less likely to have
> evidence support for all genes as the evidence alignments will not meet the
> % coverage thresholds in the maker_bopts.ctl file.  You may want to
> separate out your shorter contigs, and annotate them separately with more
> relaxed thresholds for pcov_blast=, pid_blast=, ep_score_limit=, and
> en_score_limit=.
>
> —Carson
>
>
> On Jan 31, 2015, at 4:21 PM, Jason Stajich <jason.stajich at gmail.com>
> wrote:
>
> Xabier -
>  FYI - though you probably already compared, those stats are on par with
> the Hortaea v1 assembly, (we do have an improved Hortaea assembly now and
> genome size is still same range supporting the duplication hypothesis)
>  Hw version 1 assembly -
> N50 9623; Max 71563
> CEGMA for Hw1
>              #Prots  %Completeness  -  #Total  Average  %Ortho
>
>   Complete      196       79.03      -   498     2.54     81.12
>    Partial      228       91.94      -   673     2.95     95.18
>
>
> Mikael - yes - we should compare notes on the models JGI is calling which
> have little support in MAKER - I am not sure if their pipeline runs with
> augustus/snap using informant hints though usually they are bringing RNAseq
> into the mix - I don't know if your approach for reannotation assembled the
> RNAseq and used it as evidence?
>
> We'll be trying to assess some of this when we compare the proportion of
> shared genes in the first 1KFG paper, so we may be able to say with more
> certainty whether these extra predictions are shared more widely and get a
> handle on singleton/false positive rates.
>
> Jason
>
> Jason Stajich
> jason.stajich at gmail.com
>
> On Sat, Jan 31, 2015 at 12:51 AM, Xabier Vázquez Campos <
> xvazquezc at gmail.com> wrote:
>
>> Thanks Mikael,
>>
>> These are the assembly stats as taken from abyss-fac. Indeed, it isn't a
>> great N50, but it isn't that bad either.
>>
>> n      n:500  n:N50  min  N80   N50    N20    E-size  max     sum
>> 14277  7099   1185   500  4698  10771  20438  14530   154519  42.68e6
>>
>>
>>
>> 2015-01-31 19:42 GMT+11:00 Mikael Brandström Durling <
>> mikael.durling at slu.se>:
>>
>>>  Hi Xabier,
>>>
>>>  On 31 Jan 2015, at 05:48, Xabier Vázquez Campos
>>> <xvazquezc at gmail.com> wrote:
>>>
>>>  Hi all,
>>>
>>> One of the fungal genomes I'm annotating is relatively fragmented, with
>>> many contigs/scaffolds, and the CEGMA analysis alone suggests a potential
>>> widespread duplication of the genome:
>>>
>>>  #      Statistics of the completeness of the genome based on 248 CEGs
>>>>    #
>>>>               #Prots  %Completeness  -  #Total  Average  %Ortho
>>>>
>>>>   Complete      181       72.98      -   365     2.02     67.40
>>>>    Partial      230       92.74      -   528     2.30     77.83
>>>>
>>>
>>>
>>>  Judging from these figures, you seem to have a very fragmented
>>> assembly? What N50 have you reached? According to my experience, assemblies
>>> with an N50 below 5-10 times the average gene length tend to give problems
>>> in producing good gene sets. Not to say that the gene sets are unusable,
>>> but for comparing e.g. gene complements to other species, it will be hard
>>> to draw any conclusions when a high proportion of the genes are incomplete.
>>>
>>>  The expected genome size is relatively low (~42 Mb by abyss-fac) in
>>> comparison with *Hortaea werneckii* (51.6Mb, 23333 genes), a related
>>> fungus with nearly 90% of its genes present in at least two copies.
>>> Paper:
>>> http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0071328
>>>
>>>  Now to the Maker part... So, as part of the Maker annotation, I
>>> trained SNAP and Augustus, and I generated a specific RepeatModeler
>>> library. I recorded the predicted outputs from each Maker run (AED, number
>>> of predicted proteins and transcripts...). Both Augustus and SNAP gave
>>> quite high numbers (~19000 and ~23000 respectively) in comparison with
>>> the xxx.all.maker.proteins.fasta (about 13600). So, my first question is,
>>> how does maker deal with gene duplications? Or is this just a phenomenon
>>> given that there is no support from the protein files provided initially to
>>> Maker? I've used 4 different protein files for the annotation, could it be
>>> that they weren't the best choices? I picked them from the closest
>>> relatives and similar environments
>>>
>>>
>>>  Unless you by mistake filter out duplicated gene families as repeats
>>> with repeat modeler, maker should not care about duplicated genes. However,
>>> maker, without keep_preds=1, reports only genes with some kind of support
>>> (be it EST or protein homology). This is rather conservative, but if you
>>> enable keep_preds, you will get more genes as you have noted. Just for the
>>> sake of comparison, I have reannotated more than ten genomes downloaded from
>>> JGI, providing MAKER with similar evidence as JGI, and consistently, MAKER
>>> is reporting fewer gene models. I have yet to do a more thorough comparison
>>> to tell what genes JGI are reporting that don’t appear in the MAKER
>>> annotations.
>>>
>>>
>>>  So, in my last run I set keep_preds=1 and the proteins in the
>>> xxx.all.maker.proteins.fasta reached to
>>>
>>>  Last question regarding the protein files. I download the annotated
>>> genomes from the JGI and most of them have two annotation folders
>>> "All_models,_Filtered_and_Not" and "Filtered_Models___best__". I've been
>>> using the protein files found in the latter as I expected them to have real
>>> evidence and a lower chance of leading to false gene predictions. Am I right?
>>>
>>>
>>>  Yes, I would say so. The FilteredModels have passed through their
>>> model selection pipeline, while all_models contains models from all
>>> predictors, as well as combinations of predictors and EST evidence.
>>>
>>>  Just some 2 cents of observations of mine,
>>> cheers,
>>> Mikael
>>>
>>>
>>>  Thank you in advance,
>>>
>>>  Xabier
>>>
>>>
>>> --
>>> Xabier Vázquez Campos
>>> PhD Candidate
>>> Water Research Centre
>>> School of Civil and Environmental Engineering
>>> The University of New South Wales
>>> Sydney NSW 2052 AUSTRALIA
>>>  _______________________________________________
>>> maker-devel mailing list
>>> maker-devel at box290.bluehost.com
>>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
>>>
>>>
>>>
>>
>>
>> --
>> Xabier Vázquez Campos
>> *PhD Candidate*
>> Water Research Centre
>> School of Civil and Environmental Engineering
>> The University of New South Wales
>> Sydney NSW 2052 AUSTRALIA
>
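
On Mikael's rule of thumb quoted above (problems tend to appear when N50 is below 5-10 times the average gene length), here is a rough Python sketch of the arithmetic. The function names are made up for illustration, and the average gene length is an input you would have to supply yourself; nobody in this thread has given one for this genome.

```python
def n50(lengths):
    """Return N50: the contig length at which the running sum of lengths,
    taken from longest to shortest, first reaches half of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contig lengths given")


def n50_to_gene_ratio(lengths, mean_gene_len):
    """Ratio of N50 to an assumed average gene length (user-supplied)."""
    return n50(lengths) / mean_gene_len
```

With the abyss-fac N50 of 10771 reported above, the ratio is simply 10771 divided by whatever average gene length is assumed, to be compared against the 5-10 range.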