[maker-devel] MAKER training

Wed Sep 12 11:53:37 MDT 2012

I have a few questions based on your comment about augustus/MAKER naming convention.

I have been sorting the data using the second column of the GFF file, and I want to be sure I have it right.

- Doesn't 'maker' in the second column signify MAKER's final annotations based on all evidence (EST, protein and ab initio prediction)?
   I noticed two types of gene IDs, for example:
   1. augustus_masked-scaffold00030-abinit-gene-3.2
   2. maker-scaffold00030-augustus-gene-3.7

Is the first one a direct augustus prediction without a hints file, and the second one based on a hints file (made from the EST and protein evidence)? If this is the case, could 2 be a better annotation than 1?

- In the case of augustus_masked in the 2nd column, I believe all are predictions made without a hints file.
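
For reference, sorting by the second column as described above could be sketched roughly like this in Python. This is only an illustration: the function name group_by_source and the two example rows (built around the gene IDs quoted above) are made up, and real MAKER output will contain many more sources and feature types.

```python
from collections import defaultdict


def group_by_source(gff_lines):
    """Group GFF3 feature lines by their second (source) column."""
    groups = defaultdict(list)
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # any embedded sequence section follows; stop here
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed feature line
        groups[fields[1]].append(line)
    return dict(groups)


# Hypothetical rows for illustration only.
example = [
    "##gff-version 3",
    "scaffold00030\tmaker\tgene\t100\t900\t.\t+\t.\t"
    "ID=maker-scaffold00030-augustus-gene-3.7",
    "scaffold00030\taugustus_masked\tmatch\t120\t880\t.\t+\t.\t"
    "ID=augustus_masked-scaffold00030-abinit-gene-3.2",
]
groups = group_by_source(example)
for source in sorted(groups):
    print(source, len(groups[source]))
```

Keeping whole lines per source makes it easy to write each group back out as its own GFF file afterwards.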

Thanks,
Ranjani

________________________________
From: Carson Holt [carsonhh at gmail.com]
Sent: Tuesday, September 11, 2012 12:04 PM
To: Sivaranjani Namasivayam; maker-devel at yandell-lab.org
Subject: Re: [maker-devel] MAKER training

- I have transcriptome data from 454 and Illumina platforms. Illumina is from a single time point and 454 from multiple time points. 454 was assembled using Newbler (dataset 1) and Illumina using TopHat-Cufflinks (dataset 2) and the de novo Trinity pipeline (dataset 3). I now have 3 assemblies - 454 and Illumina will have some redundant transcripts (because of one overlapping time point); TopHat-Cufflinks and Trinity will have highly redundant transcripts (because they use the same raw reads). Is it OK to provide all 3 datasets as EST evidence, and how does it affect the quality of annotation? (For now I have used dataset 1 and dataset 2 as EST evidence.)

This is fine.  You can give them as a comma-separated list, e.g. est=file1,file2,file3

- I used the above model to retrain, and I passed through everything except the ab initio gene predictions. I also provided a set of manually annotated genes, many of which have EST evidence. Is this OK to do? [For protein evidence, I gave a set from related organisms, same as above]

- In my third retraining, I used the above retrained model, but this time I only provided the genome_gff but did not pass through any other data. However I did provide the manually annotated genes as EST evidence and related proteins as protein_evidence.

Can you please give me some advice on which of these could give me the best prediction, or whether I can alter something to get a better prediction?

Everything you've done sounds reasonable.  Better training comes from having the most correct models to train with, so providing the manual annotations as training works, or you can also select MAKER models with the lowest AED score (i.e. models that most closely match evidence).  The goal is to try and make the process as unbiased as possible, so a consistent, usually automated, selection method is often the easiest to justify.
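
As a rough sketch of the automated selection idea, the Python below keeps MAKER mRNA models whose AED is at or below a cutoff. It assumes the AED is stored as an _AED attribute in the ninth column of 'maker' mRNA lines; the function names, the 0.2 cutoff and the example rows are illustrative assumptions, not recommended values.

```python
def parse_attributes(column9):
    """Turn a GFF3 attribute string (key=value;key=value) into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def select_low_aed(gff_lines, max_aed=0.2):
    """Return (ID, AED) for maker mRNA features with AED <= max_aed.

    Assumes the score is stored as an _AED attribute on mRNA lines;
    the default cutoff is arbitrary and only for illustration.
    """
    selected = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[1] != "maker" or fields[2] != "mRNA":
            continue
        attrs = parse_attributes(fields[8])
        if "_AED" not in attrs:
            continue
        aed = float(attrs["_AED"])
        if aed <= max_aed:
            selected.append((attrs.get("ID", ""), aed))
    # Lowest AED first, i.e. the models that best match the evidence.
    return sorted(selected, key=lambda item: item[1])


# Hypothetical rows for illustration only.
example = [
    "scaffold1\tmaker\tmRNA\t1\t500\t.\t+\t.\tID=model-A;_AED=0.35",
    "scaffold1\tmaker\tmRNA\t600\t900\t.\t+\t.\tID=model-B;_AED=0.05",
    "scaffold1\tmaker\tmRNA\t950\t1200\t.\t-\t.\tID=model-C;_AED=0.10",
    "scaffold1\taugustus_masked\tmatch\t1\t500\t.\t+\t.\tID=other;_AED=0.00",
]
training_candidates = select_low_aed(example, max_aed=0.2)
print(training_candidates)
```

Applying one fixed cutoff to every model is what keeps the selection consistent; the cutoff itself would need to be chosen for the genome at hand.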

- A quick question about Augustus - I used an Augustus model (trained for a closely related organism) for ab initio prediction. Does MAKER adjust this model based on the evidence provided, or use the model as such for a prediction?

MAKER will provide hints to Augustus during the run to make it perform better.  MAKER will report the raw unaided Augustus results in the GFF3 file as a reference, but will use evidence to improve performance where it can.  The gene name will let you know if it is a hint-based or ab initio prediction.  When 'maker' is part of the gene name, it is hint-based.
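
The naming rule above can be written as a one-line check. This Python sketch simply tests whether 'maker' appears in the gene name; the function name prediction_kind is made up for illustration, and the two IDs are the ones from the question.

```python
def prediction_kind(gene_name):
    """Classify a gene name using the rule described above:
    names containing 'maker' are hint-based, others are ab initio."""
    return "hint-based" if "maker" in gene_name else "ab initio"


for name in (
    "augustus_masked-scaffold00030-abinit-gene-3.2",
    "maker-scaffold00030-augustus-gene-3.7",
):
    print(name, "->", prediction_kind(name))
```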

Thanks,
Carson