[maker-devel] maker on a nematode: few novel proteins

Thu Aug 27 10:36:41 MDT 2015

Hi Carson, all,

I am a new user of Maker and have been overall quite pleased with it. Let
me say also that the information I've gleaned on this forum has been
extremely helpful and important in learning how to successfully use the
software. I'm at a point where I'm planning next steps and hoping to
solicit suggestions and ideas.

Briefly, I'm annotating the genome of a novel nematode that appears to be
reduced in size (~60 Mb, of which about 24% is repetitive based on a custom
RepeatModeler library that we generated). For technical reasons I do not
have access to RNA data, so I'm running the ab initio predictors SNAP and
Augustus supplemented with lots of protein hints: swiss-prot and a combined
library of 28 nematode proteomes.

Maker Round 0: My student and I initially ran Maker without any active
predictors, just to align the swiss-prot database to the genome (no
nematode proteins used in this round), which output a .gff file with
swiss-prot alignments, but no predicted proteins as SNAP and Augustus were
turned off. We consider this 'round 0' of Maker.

Maker Round 1: We then trained SNAP with CEGMA output (which showed the
assembly is 98% complete and identified 243 eukaryotic orthologs in our
genome assembly). We proceeded with Maker round 1, run with CEGMA-trained
SNAP, supplemented with a combined library of 28 nematode proteomes
(protein predictions obtained from wormbase) and inputting the gff file
containing the swiss-prot alignments from 'round 0' of Maker (protein_pass
set to 1). Round 1 produced about 10,000 proteins.

Maker Round 2: The output of Round 1 was then filtered for protein hints
giving a training set of about 3,000 proteins that we used in re-training
SNAP, and in training Augustus. We then re-ran Maker without protein hints
but with SNAP and Augustus re-trained; we did however feed in the .gff file
produced in Round 1, containing all protein alignments (protein_pass set to
1). This gff file has all the swiss-prot alignments as well as the 28
nematode proteome alignments. Round 2 of Maker produced 9,374 proteins. At
first I thought this was only half of what should be produced but now I'm
re-evaluating that notion. I've been comparing known domains and protein
families and the numbers are quite comparable across nematodes. There is a
clear 'core' of conserved nematode proteins (about 5,000) and with C.
elegans we identify 5,684 ortholog clusters by OrthoVenn (and OrthoDB). My
student and I have manually inspected a number of predictions and they
appear quite reasonable, with good numbers of introns, etc. (some introns are
on the small side but that's also consistent with a reduced genome size).
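
To make the "filter Round 1 output for evidence support" step concrete, here
is a minimal Python sketch. It is not the exact filter we used: it assumes the
Round 1 GFF3 carries MAKER's `_AED` attribute on mRNA features and uses an AED
cutoff as a stand-in for "supported by protein hints". The function name
`select_supported_mrnas` and the default cutoff of 0.5 are hypothetical.

```python
def select_supported_mrnas(gff_lines, max_aed=0.5):
    """Return IDs of mRNA features whose _AED attribute is <= max_aed.

    Assumption: AED is used here as a proxy for evidence support; the
    cutoff is illustrative and should be tuned for the data set.
    """
    selected = []
    for line in gff_lines:
        # Skip comments, directives, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        # Anything without 9 tab-separated columns (e.g. FASTA) is ignored.
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        # Column 9 holds key=value pairs separated by semicolons.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        if "_AED" not in attrs:
            continue
        if float(attrs["_AED"]) <= max_aed:
            selected.append(attrs.get("ID", ""))
    return selected
```

The returned IDs could then be used to pull the matching gene models into a
training set for SNAP and Augustus.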

My uncertainty arises from this: I took the 9,374 proteins and ran blastp
against the 28-nematode proteome database from our training, and found that
only 72 are 'novel', lacking a blast match at an e-value cutoff of 1e-5. It is
widely reported in the literature that new nematode projects often identify
about 30% novel proteins with no blast-match to any other organism. I am
concerned that we are missing these novel proteins, perhaps due to our
heavy reliance on 'known' proteins as hints in Maker training. On the other
hand, novel proteins could have been predicted as easily as other proteins
by Maker, as to my knowledge there is no requirement that protein hints
support a given prediction. (About 18% of our proteins have no Pfam domain
predictions, so in some sense they are novel, but they're matching other
'unknown' proteins in other nematodes by blast).
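
For anyone wanting to reproduce the 'novel' count, here is a minimal Python
sketch of the bookkeeping. It assumes the blastp results were written in the
default 12-column tabular format (`-outfmt 6`, with the e-value in column 11);
the function name `find_novel_proteins` is hypothetical and this is not
necessarily how we tallied the original numbers.

```python
def find_novel_proteins(query_ids, blast_tab_lines, max_evalue=1e-5):
    """Return query IDs that have no hit with e-value <= max_evalue.

    Assumes default tabular columns:
    qseqid sseqid pident length mismatch gapopen qstart qend
    sstart send evalue bitscore
    """
    matched = set()
    for line in blast_tab_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        qseqid, evalue = cols[0], float(cols[10])
        if evalue <= max_evalue:
            matched.add(qseqid)
    # Queries absent from the hit table, or with only weak hits, are 'novel'.
    return [q for q in query_ids if q not in matched]
```

The query ID list has to come from the protein FASTA itself, since queries
with no hits at all never appear in the tabular output. With the numbers
above, 72 of 9,374 is roughly 0.8% of the predicted proteins.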

In sum, I'm wondering if we should re-evaluate some step of our Maker
pipeline or whether we are likely to be on safe ground concluding that the
9,374 is relatively representative of the full proteome of this organism.
Interestingly, given the small genome size, there simply isn't much room
for more proteins to be predicted--once you account for the highly repetitive
nature of this genome, the gene density is slightly higher than that of C.
elegans, C. briggsae, and P. pacificus. Looking at the scaffolds, they are
quite densely populated with predictions and there are not obvious 'gaps'
where new predictions might be derived. From these observations I am
tending toward the idea that the set of 9,374 proteins is relatively complete and
that a lack of novel proteins is actually a scientific finding in this
organism (it lives in an unusual environment where this might make sense)
but I'd like to be sure we are seeing a real phenomenon, and not some weird
artifact of the way we predicted the proteins.
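
The gene-density argument above comes down to simple arithmetic, sketched here
in Python. It treats each of the 9,374 proteins as one gene and uses the
approximate genome size (~60 Mb) and repeat fraction (~24%) quoted earlier, so
the result is a rough estimate only; the function name is hypothetical.

```python
def genes_per_nonrepeat_mb(n_genes, genome_mb, repeat_fraction):
    """Gene density per Mb of non-repetitive sequence."""
    nonrepeat_mb = genome_mb * (1.0 - repeat_fraction)
    return n_genes / nonrepeat_mb


# Approximate values from this post; one protein is assumed to be one gene.
raw_density = 9374 / 60.0
adjusted_density = genes_per_nonrepeat_mb(9374, 60.0, 0.24)
print(f"{raw_density:.1f} genes/Mb overall")
print(f"{adjusted_density:.1f} genes/Mb of non-repetitive sequence")
```

That works out to about 156 genes per Mb of assembly, or about 206 genes per
Mb once the repetitive fraction is excluded. The same function can be applied
to other species given their gene counts, genome sizes, and repeat fractions.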

For the record, the N50 of our assembly is about 50kb, so I don't think
we're missing a lot of genes due to fragmentation, and at any rate that
shouldn't preferentially affect 'novel' genes more than 'known' genes.
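
For reference, N50 here means the scaffold length at which scaffolds of that
length or longer contain at least half of the assembled bases. A minimal
Python sketch of that definition (the function name and the toy lengths are
illustrative, not our assembly data):

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no scaffold lengths given")


print(n50([10, 20, 30, 40]))
```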

So: where are the novel proteins? Should we amend our Maker pipeline?
Thanks for any and all ideas, questions, or comments! I'm attaching our
maker_opts.ctl file from the last run (Round 2) in case it is helpful.

Thanks,
John
-------------- next part --------------
A non-text attachment was scrubbed...
Name: maker_opts.ctl
Type: application/octet-stream
Size: 4729 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20150827/bbdb2d67/attachment-0001.obj>