[maker-devel] AUGUSTUS Training and "Off the Shelf" HMMs

Mon Nov 2 12:05:22 MST 2015

Hi Everyone,

I’ve been experimenting with speeding up the HMM training of Augustus on
Amazon, based on a procedure that Kevin Childs has written for “speedy”
Augustus training.  The procedure essentially builds the training set from
a subset of the genes predicted by SNAP rather than from the whole genome,
a good idea that undoubtedly saves a lot of time.  I’ve written some
modifications to the Augustus scripts and dependencies to try to speed this
process up on Amazon, and I’d be happy to share my notes with anyone who is
interested.  I’ve gotten it to the point where the whole AutoAug procedure
can be accomplished in a day on a small cluster.

I think that, by working with the Augustus authors, more improvements could
be made, but the whole experience with Augustus has led me to some more
general questions...

1) One of the things I noticed while monkeying around with this reduced
gene set procedure is that you are unable to do UTR training with Augustus:
the AutoAug script complains that there aren’t enough genes left to make an
adequate training set.  Has anyone else noticed this?  I haven’t seen much
discussion of how important it is that the Augustus HMM be trained for UTRs
when used in the Maker2 pipeline.

2) I’ve been trying to evaluate how good my AUGUSTUS HMM is based on the
training set.  Running the newly trained species file, I see that the
performance on the “exon level” is low (around 5-6%), while sensitivity on
the nucleotide level is in the 89-95% range and specificity is in the
50-60% range, which seems consistent with what other users have reported on
this and the Augustus listserv.  This is assessed on a training set of
approximately 200 genes selected from the output of multiple iterative runs
of the SNAP program, as documented in the MAKER tutorial.  This is all
based on data and genes selected from a “to be published” genome of an
electric fish I’m working on.
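
For reference, here is a minimal sketch of how nucleotide-level and
exon-level numbers like these can be computed.  It assumes the convention
common in gene-prediction evaluation, where sensitivity is TP/(TP+FN) and
"specificity" is TP/(TP+FP).  The function names and the toy coordinates
are hypothetical and are not taken from the Augustus scripts.

```python
def covered_positions(exons):
    """Set of positions covered by 1-based, inclusive (start, end) exons."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) at the nucleotide level."""
    ann = covered_positions(annotated)
    pred = covered_positions(predicted)
    true_pos = len(ann & pred)
    sensitivity = true_pos / len(ann) if ann else 0.0
    # "Specificity" here is TP / (TP + FP), an assumed convention.
    specificity = true_pos / len(pred) if pred else 0.0
    return sensitivity, specificity


def exon_accuracy(annotated, predicted):
    """Return (sensitivity, specificity), counting only exact exon matches."""
    ann = set(annotated)
    pred = set(predicted)
    exact = len(ann & pred)
    sensitivity = exact / len(ann) if ann else 0.0
    specificity = exact / len(pred) if pred else 0.0
    return sensitivity, specificity


# Toy data (made up): two annotated exons, three predicted exons.
annotated = [(1, 100), (201, 300)]
predicted = [(1, 100), (191, 320), (401, 450)]

nt_sn, nt_sp = nucleotide_accuracy(annotated, predicted)
ex_sn, ex_sp = exon_accuracy(annotated, predicted)
print(f"nucleotide: sn={nt_sn:.2f} sp={nt_sp:.2f}")
print(f"exon:       sn={ex_sn:.2f} sp={ex_sp:.2f}")
```

In this toy case the second predicted exon covers the annotated one but
misses both boundaries, so nucleotide sensitivity is 1.0 while exon
sensitivity is only 0.5.  Imprecise exon boundaries are one way a
prediction can score well at the nucleotide level and poorly at the exon
level.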

3) Just for laughs, I tried the HMM trained for zebrafish on the same
training set and found that the performance was slightly better than that
of the species-specific one that I’ve been working so hard on (a few
percentage points on both nucleotide-level sensitivity and specificity).

I’ve reasoned that it might be best in terms of reproducibility to run
Maker one last time with the SNAP HMM from my multiple rounds of training
together with the Augustus zebrafish species file, rather than using my own
custom species training.  Can anyone think of a good reason not to do this?
Are there qualities or benefits of a custom-trained species file that these
sensitivity/specificity measures don’t capture?

What are folks’ experiences with AUGUSTUS in this regard?  Many thanks in
advance for any advice!

Jason Gallant