[maker-devel] genome duplication?
Xabier Vázquez Campos
xvazquezc at gmail.com
Fri Jan 30 21:48:33 MST 2015
Hi all,
One of the fungal genomes I'm annotating is relatively fragmented, with
many contigs/scaffolds, and the CEGMA analysis alone suggests a potential
widespread duplication of the genome:
# Statistics of the completeness of the genome based on 248 CEGs #

            #Prots  %Completeness  -  #Total  Average  %Ortho

Complete       181          72.98  -     365     2.02   67.40
  Group 1       54          81.82  -     105     1.94   66.67
  Group 2       39          69.64  -      86     2.21   71.79
  Group 3       45          73.77  -      86     1.91   57.78
  Group 4       43          66.15  -      88     2.05   74.42

Partial        230          92.74  -     528     2.30   77.83
  Group 1       61          92.42  -     140     2.30   72.13
  Group 2       53          94.64  -     127     2.40   84.91
  Group 3       56          91.80  -     126     2.25   69.64
  Group 4       60          92.31  -     135     2.25   85.00
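
As a quick sanity check on the numbers above: the Average column appears to be #Total divided by #Prots, so a value near 2 would mean roughly two hits per detected CEG. A minimal Python sketch of that arithmetic, with the Complete and Partial rows copied from the report (the function and variable names are only for illustration, and the reading of "Average" is an assumption):

```python
# Rows copied from the CEGMA report: (label, #Prots, #Total, reported Average)
rows = [
    ("Complete", 181, 365, 2.02),
    ("Partial", 230, 528, 2.30),
]


def copies_per_ceg(n_prots, n_total):
    """Mean number of hits per detected CEG (assumed meaning of 'Average')."""
    return n_total / n_prots


for label, n_prots, n_total, reported in rows:
    computed = round(copies_per_ceg(n_prots, n_total), 2)
    print(f"{label}: computed {computed:.2f}, reported {reported:.2f}")
```

The computed ratios agree with the reported Average values for both rows, which is what makes the table look like a roughly two-copy genome.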
The expected genome size is relatively small (~42 Mb by abyss-fac)
compared with *Hortaea werneckii* (51.6 Mb, 23333 genes), a related fungus
with nearly 90% of its genes present in at least two copies.
Paper:
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0071328
Now to the Maker part... As part of the Maker annotation, I trained
SNAP and Augustus, and I generated a specific RepeatModeler library. I
recorded the outputs from each Maker run (AED, number of predicted
proteins and transcripts...). Both Augustus and SNAP gave quite high
numbers (~19000 and ~23000 respectively) compared with
xxx.all.maker.proteins.fasta (about 13600). So, my first question is: how
does Maker deal with gene duplications? Or is this difference just a
consequence of the missing predictions having no support from the protein
files I provided to Maker? I've used 4 different protein files for the
annotation; could it be that they weren't the best choices? I picked them
from the closest relatives and from organisms in similar environments.
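
One simple way to get counts like the ones above is to count the FASTA header lines in each output file. A minimal Python sketch, assuming plain uncompressed FASTA (the function name is hypothetical, and the file name in the comment is just the placeholder pattern used above, not a real path):

```python
def count_fasta_records(path):
    """Count sequences in a FASTA file by counting lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Hypothetical usage, with the placeholder file name from this message:
# print(count_fasta_records("xxx.all.maker.proteins.fasta"))
```

Running the same function on the ab initio prediction files and on the final Maker protein file gives directly comparable numbers.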
So, in my last run I set keep_preds=1, and the number of proteins in
xxx.all.maker.proteins.fasta reached
Last question, regarding the protein files. I downloaded the annotated genomes
from the JGI, and most of them have two annotation folders:
"All_models,_Filtered_and_Not" and "Filtered_Models___best__". I've been
using the protein files found in the latter, as I expected them to be backed
by real evidence and to carry a lower chance of predicting false genes. Am I
right?
Thank you in advance,
Xabier
--
Xabier Vázquez Campos
PhD Candidate
Water Research Centre
School of Civil and Environmental Engineering
The University of New South Wales
Sydney NSW 2052 AUSTRALIA