[maker-devel] genome duplication?

Fri Jan 30 21:48:33 MST 2015

Hi all,

One of the fungal genomes I'm annotating is relatively shattered (?), with
many contigs/scaffolds, and the CEGMA analysis alone may indicate a
potential widespread duplication of the genome:

#      Statistics of the completeness of the genome based on 248 CEGs      #
                #Prots  %Completeness  -  #Total  Average  %Ortho

    Complete      181       72.98      -   365     2.02     67.40
     Group 1       54       81.82      -   105     1.94     66.67
     Group 2       39       69.64      -    86     2.21     71.79
     Group 3       45       73.77      -    86     1.91     57.78
     Group 4       43       66.15      -    88     2.05     74.42
     Partial      230       92.74      -   528     2.30     77.83
     Group 1       61       92.42      -   140     2.30     72.13
     Group 2       53       94.64      -   127     2.40     84.91
     Group 3       56       91.80      -   126     2.25     69.64
     Group 4       60       92.31      -   135     2.25     85.00
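
For reference, the columns in the report above seem to follow simple arithmetic: %Completeness looks like #Prots divided by the 248 CEGs, and Average looks like #Total divided by #Prots, so an Average near 2 means roughly two hits per detected CEG. The sketch below only re-derives the Complete and Partial rows under that assumption; the function names are made up for illustration and are not part of CEGMA.

```python
# Minimal sketch: re-derive two CEGMA report columns from the raw counts.
# Assumption (not taken from CEGMA documentation): %Completeness = #Prots / CEGs
# and Average = #Total / #Prots. Function names are hypothetical.

CEG_COUNT = 248  # number of CEGs stated in the report header


def completeness(prots, ceg_count=CEG_COUNT):
    """Percentage of the CEG set that was detected."""
    return 100.0 * prots / ceg_count


def average_copies(prots, total):
    """Mean number of hits per detected CEG."""
    return total / prots


# (#Prots, #Total) copied from the Complete and Partial rows of the report
rows = {"Complete": (181, 365), "Partial": (230, 528)}

for name, (prots, total) in rows.items():
    print(f"{name}: {completeness(prots):.2f}% complete, "
          f"{average_copies(prots, total):.2f} hits per CEG")
```

The same check holds for the per-group rows, which suggests the values near 2 are not a rounding or reporting artefact.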

The expected genome size is relatively low (~42 Mb by abyss-fac) in
comparison with *Hortaea werneckii* (51.6 Mb, 23333 genes), a related fungus
with nearly 90% of its genes present in at least two copies.
Paper:
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0071328

Now to the Maker part... As part of the Maker annotation, I trained
SNAP and Augustus, and I generated a specific RepeatModeler library. I
recorded the predicted outputs from each Maker run (AED, number of
predicted proteins and transcripts...). Both Augustus and SNAP gave
quite high numbers (~19000 and ~23000 respectively) in comparison with
xxx.all.maker.proteins.fasta (about 13600). So, my first question is: how
does Maker deal with gene duplications? Or is this just a consequence of
the lack of support from the protein files initially provided to
Maker? I've used 4 different protein files for the annotation; could it be
that they weren't the best choices? I picked them from the closest
relatives and from similar environments.
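
For anyone wanting to reproduce the counts above, a minimal sketch of how the number of sequences in each FASTA file can be tallied is shown here. It simply counts header lines starting with ">", and assumes plain uncompressed FASTA; the function names are illustrative only and are not part of Maker.

```python
# Minimal sketch: count sequences in FASTA files by counting '>' header lines.
# Assumes plain-text, uncompressed FASTA. Function names are hypothetical.
import sys


def count_fasta_records(lines):
    """Return the number of header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


def count_fasta_file(path):
    """Open a FASTA file and count its records."""
    with open(path) as handle:
        return count_fasta_records(handle)


if __name__ == "__main__":
    # Usage: python count_fasta.py file1.fasta file2.fasta ...
    for path in sys.argv[1:]:
        print(path, count_fasta_file(path))
```

Running it on xxx.all.maker.proteins.fasta and on the per-predictor protein files gives numbers that can be compared directly.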

So, in my last run I set keep_preds=1, and the number of proteins in
xxx.all.maker.proteins.fasta reached to

Last question, regarding the protein files. I downloaded the annotated genomes
from the JGI, and most of them have two annotation folders,
"All_models,_Filtered_and_Not" and "Filtered_Models___best__". I've been
using the protein files found in the latter, as I expected them to have real
evidence and a lower chance of including falsely predicted genes. Am I right?

Thank you in advance,

Xabier

-- 
Xabier Vázquez Campos
PhD Candidate
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA