[maker-devel] MAKER training

Thu Sep 13 07:30:22 MDT 2012

Yes, those are the final annotations, and yes, one is derived from the ab
initio model and one from a hint-based model.  The selection between
hint-based models and ab initio models is based on evidence overlap, so
either kind can come out ahead at a given locus.  The best models will have
lower AED scores.  So if for a given locus I have both hint-based and ab
initio models, I keep the one that best matches the evidence (lowest AED
score).
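
A minimal sketch of that selection rule, not MAKER's actual code. The
Model type, its field names, and the locus identifiers are hypothetical;
the only thing taken from the description above is "keep the lowest AED
at each locus" (AED runs from 0, full agreement with the evidence, to 1,
no support).

```python
from typing import Dict, Iterable, NamedTuple


class Model(NamedTuple):
    """Hypothetical record for one candidate gene model."""
    locus: str   # assumed locus identifier (not a MAKER field name)
    name: str    # gene model name
    aed: float   # Annotation Edit Distance: 0.0 = best, 1.0 = worst


def best_per_locus(models: Iterable[Model]) -> Dict[str, Model]:
    """Keep, for each locus, the candidate with the lowest AED."""
    best: Dict[str, Model] = {}
    for model in models:
        current = best.get(model.locus)
        # Replace the stored model only when the new one fits evidence better.
        if current is None or model.aed < current.aed:
            best[model.locus] = model
    return best
```

Whether the winner is hint-based or ab initio does not enter into the
comparison; only the AED value does.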

augustus_masked means the genome was masked for repeats before running
augustus.  Anything with augustus_masked in the second column is an ab
initio model kept for reference purposes.  Every ab initio model produced
by augustus will have an entry there.
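
For sorting a GFF3 file by that second (source) column, something along
these lines may help. It is a sketch that assumes standard tab-delimited
nine-column GFF3 lines; the function name is made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_source(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group GFF3 feature lines by column 2 (e.g. maker, augustus_masked)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # sequence section follows; no more feature lines
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        groups[cols[1]].append(line)
    return dict(groups)
```

With an open file handle, group_by_source(handle)["maker"] would give the
final annotations and group_by_source(handle)["augustus_masked"] the
reference ab initio predictions, provided both sources are present.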

Thanks,
Carson

From:  Sivaranjani Namasivayam <ranjani at uga.edu>
Date:  Wednesday, 12 September, 2012 1:53 PM
To:  Carson Holt <carsonhh at gmail.com>, "maker-devel at yandell-lab.org"
<maker-devel at yandell-lab.org>
Subject:  RE: [maker-devel] MAKER training

I have a few questions based on your comment about augustus/MAKER naming
convention.

I have been sorting the data using the second column of the GFF file, and I
wanted to be sure I have it right.

- Doesn't 'maker' in the second column signify MAKER's final annotations
based on all evidence (EST, protein and ab initio prediction)?
   I noticed two types of gene IDs, example
   1. augustus_masked-scaffold00030-abinit-gene-3.2
   2. maker-scaffold00030-augustus-gene-3.7

Is the first one a direct augustus prediction without a hints file and the
second based on a hints file (made from the EST and protein evidence)? If
this is the case, could 2 be a better annotation than 1?

- In the case of augustus_masked in the 2nd column, I believe all
predictions are made without a hints file.

Thanks,
Ranjani

From: Carson Holt [carsonhh at gmail.com]
Sent: Tuesday, September 11, 2012 12:04 PM
To: Sivaranjani Namasivayam; maker-devel at yandell-lab.org
Subject: Re: [maker-devel] MAKER training

> - I have transcriptome data from 454 and Illumina platforms. Illumina is from
> a single time point and 454 from multiple time points. 454 was assembled using
> Newbler (dataset 1) and Illumina using TopHat-Cufflinks (dataset 2) and the
> de novo Trinity pipeline (dataset 3). I now have 3 assemblies: 454 and
> Illumina will have some redundant transcripts (because of one overlapping time
> point); TopHat-Cufflinks and Trinity will have highly redundant transcripts
> (because they use the same raw reads). Is it OK to provide all 3 datasets as EST
> evidence, and how does it affect the quality of annotation? (For now I have used
> dataset 1 and dataset 2 as EST evidence.)

This is fine.  You can give them as a comma-separated list:
est=file1,file2,file3

> - I used the above model to retrain, I passed through everything except the
> ab initio gene predictions. I also provided a set of manually annotated genes,
> many of which have EST evidence. Is this OK to do? [For protein evidence, I
> gave a set from related organisms, same as above.]
> 
> - In my third retraining, I used the above retrained model, but this time I
> only provided the genome_gff but did not pass through any other data. However
> I did provide the manually annotated genes as EST evidence and related
> proteins as protein_evidence.
> 
> Can you please give me some advice on which of these could give me the best
> prediction, or if I can alter something to get a better prediction.
> 
Everything you've done sounds reasonable.  Better training comes from having
the most correct models to train with, so providing the manual annotations
as training works, or you can also select MAKER models with the lowest AED
score (i.e. models that most closely match evidence).  The goal is to try
and make the process as unbiased as possible, so a consistent, usually
automated, selection method is often the easiest to justify.
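
One way to automate that selection is sketched below. It assumes the mRNA
lines in MAKER's GFF3 output carry an _AED attribute in column 9; the
function name and the 0.25 cutoff are arbitrary choices for illustration,
not recommended values.

```python
from typing import Iterable, List, Tuple


def mrnas_below_aed(lines: Iterable[str],
                    max_aed: float = 0.25) -> List[Tuple[str, float]]:
    """Return (ID, AED) for maker mRNA features with AED <= max_aed."""
    selected: List[Tuple[str, float]] = []
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Only final annotations: source 'maker', feature type 'mRNA'.
        if len(cols) != 9 or cols[1] != "maker" or cols[2] != "mRNA":
            continue
        # Column 9 is a semicolon-separated list of key=value pairs.
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        if "_AED" not in attrs:
            continue  # assumption: models without _AED are skipped
        aed = float(attrs["_AED"])
        if aed <= max_aed:
            selected.append((attrs.get("ID", ""), aed))
    # Best-supported models first.
    return sorted(selected, key=lambda pair: pair[1])
```

Applying the same cutoff to every locus keeps the choice of training
models consistent rather than hand-picked.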
> 

> - A quick question about Augustus - I used an Augustus model (trained for a
> closely related organism) for ab initio prediction. Does MAKER adjust this
> model based on the evidence provided, or use the model as such for a
> prediction?

MAKER will provide hints to Augustus during the run to make it perform
better.  MAKER will report the raw unaided augustus results in the GFF3 file
as a reference, but will use evidence to improve performance where it can.
The gene name will let you know if it is a hint-based or ab initio model
prediction.  When 'maker' is part of the gene name, it is hint-based.
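
As a rough illustration of reading the origin off the name, here is a
small helper. It relies only on the two name patterns seen in this thread
(a leading "maker-" and an "-abinit-" segment); the function and its
return strings are invented for the example, and other name patterns may
exist.

```python
def model_origin(gene_name: str) -> str:
    """Guess from a MAKER gene name whether the model used hints."""
    if gene_name.startswith("maker-"):
        return "hint-based"
    if "-abinit-" in gene_name:
        return "ab initio"
    return "unknown"  # pattern not covered by this sketch


for name in ("augustus_masked-scaffold00030-abinit-gene-3.2",
             "maker-scaffold00030-augustus-gene-3.7"):
    print(name, "->", model_origin(name))
```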

Thanks,
Carson
