[maker-devel] SNAP bootstrap training

Carson Holt carsonhh at gmail.com
Tue Apr 10 09:23:02 MDT 2018


If there is something wrong in the assembly (a broken ORF, an altered splice site, or a small run of N’s, which are very common in new assemblies), the gene predictor will alter splicing and intron/exon patterns to get around it. The issue is almost always in the assembly. Also, if you are not masking repeats (i.e. you did not build a species-specific repeat library), the assembly will introduce ORFs from transposons that will confuse gene predictors.
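
As a quick sanity check, you can scan the assembly for runs of N’s and then see whether your training gene models overlap them. Here is a rough Python sketch of the idea; the file name and the 10 bp minimum run length are just placeholders, not part of any MAKER tool:

    # Rough sketch: report runs of N's in an assembly so they can be
    # cross-referenced with training gene models.  'assembly.fasta' and
    # the min_len threshold are placeholders.
    import re

    def find_runs(name, seq, min_len):
        for m in re.finditer('[Nn]{%d,}' % min_len, seq):
            yield name, m.start() + 1, m.end()   # 1-based coordinates

    def n_runs(fasta_path, min_len=10):
        name, chunks = None, []
        with open(fasta_path) as fh:
            for line in fh:
                if line.startswith('>'):
                    if name is not None:
                        yield from find_runs(name, ''.join(chunks), min_len)
                    name, chunks = line[1:].split()[0], []
                else:
                    chunks.append(line.strip())
            if name is not None:
                yield from find_runs(name, ''.join(chunks), min_len)

    for scaffold, start, end in n_runs('assembly.fasta'):
        print(scaffold, start, end, sep='\t')

Training models that sit on or next to these gaps are good candidates to drop before retraining.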

Finally, some predictors don’t work well on some organisms; SNAP has trouble with many vertebrate species, for example.

A high-quality dataset of ~300 genes is good enough for training. If you have more (500-1000), most protocols have you split the dataset into a training set and a test set to evaluate sensitivity/specificity with tools like Eval from WashU (i.e. you train on one half, then predict on the other half and see how well the predictions match the models).
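
If you want to make that split, something as simple as shuffling the gene IDs and writing two lists works. A rough sketch follows; the genes.gff3 file name and the ID= attribute parsing are assumptions about your annotation, and you would still convert each half to whatever format your evaluation tool expects:

    # Rough sketch of a 50/50 train/test split over gene IDs taken from a
    # GFF3 file.  'genes.gff3' and the ID= parsing are assumptions about
    # your annotation, not part of MAKER or Eval.
    import random

    gene_ids = []
    with open('genes.gff3') as fh:
        for line in fh:
            cols = line.rstrip('\n').split('\t')
            if len(cols) == 9 and cols[2] == 'gene':
                attrs = dict(a.split('=', 1) for a in cols[8].split(';') if '=' in a)
                if 'ID' in attrs:
                    gene_ids.append(attrs['ID'])

    random.seed(42)              # fixed seed so the split is reproducible
    random.shuffle(gene_ids)
    half = len(gene_ids) // 2

    with open('train_ids.txt', 'w') as out:
        out.write('\n'.join(gene_ids[:half]) + '\n')
    with open('test_ids.txt', 'w') as out:
        out.write('\n'.join(gene_ids[half:]) + '\n')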

—Carson

> On Apr 9, 2018, at 5:55 AM, Timo Metz <timo.metz at googlemail.com> wrote:
> 
> Hey Carson,
> 
> thanks for your advice. Would you then rather go for a small set of high-quality genes, or for a larger number of genes to feed into MAKER for the training?
> 
> And I have another question, which is not directly related to this topic, but I hope you might still answer it: it sometimes seems as if the hint-based prediction does not work sufficiently well. I can clearly find examples where MAKER infers gene models directly from a prediction even though the evidence clearly indicates something different, and the gene model is then probably wrong (I even find such cases when looking at highly conserved regions where I actually know the structure the gene should have).
> 
> best
> Timo
> 
> 
> 2018-04-06 17:40 GMT+02:00 Carson Holt <carsonhh at gmail.com>:
> More than 2 total training rounds can lead to what is known as the overtraining trap, so I rarely do more than one round of bootstrapping with SNAP. To evaluate the models, look at them in a browser. If the raw models are similar to the final hint-based models, then SNAP is well trained. If not, then SNAP is poorly trained. Don’t use the final models directly to evaluate training. Rather, look at the raw models; they are what comes directly from the HMM. A well trained predictor will perform similarly even outside of MAKER. If it’s over-predicting on its own, you may need to filter or even manually curate a subset of models from the initial training round to get better bootstrap training. Also, if you did not build a species-specific repeat library, you may be under-masking and essentially training SNAP to find transposons with the bootstrapping.
> 
> —Carson
> 
> Sent from my iPhone
> 
> > On Apr 6, 2018, at 7:23 AM, Timo Metz <timo.metz at googlemail.com> wrote:
> >
> > Hello,
> >
> > I am using MAKER for a non-model organism, and I am currently trying to do the bootstrap training for SNAP as outlined in the tutorial and the paper for MAKER.
> >
> > For the training I am using a set of ~300 sequences which are conserved (no gold standard genes are available) and of very high quality, and I stop training after the third round of bootstrap training.
> >
> > However, it seems as if training does not work properly, because when I check the AEDs for each round of bootstrap training, they actually get worse each round. Also, the performance of SNAP after training is practically the same as before training, and significantly worse than when using a training file from a model organism.
> >
> > Are there any suggestions for what could be wrong? Is there anything special to check or look at that is not mentioned in the tutorial?
> >
> > thanks in advance
> >
> > kind regards
> > Timo Metz
> 
