[maker-devel] getting protein sequences from genomes

Barry Moore barry.moore at genetics.utah.edu
Fri May 17 13:02:31 MDT 2013


On May 17, 2013, at 3:45 AM, Luciano Abriata wrote:

> Hello, I am trying to use Maker to annotate genomes from different individuals of a population (D. melanogaster flies).
> 
> My ultimate goal is to get, for each gene, the amino acid sequences of the coded proteins as they are expressed from each genome. My questions are:
> 
> 1) How can I match proteins predicted for the same gene in two genomes?

Use blastp, with its parameters tuned to favor near-perfect matches.
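
As a rough sketch of that idea: if you run blastp with tabular output (for example `blastp -query genomeA.proteins.fasta -subject genomeB.proteins.fasta -outfmt 6`), you can keep only the best near-perfect hit per query. The file names, the 98% identity and 95% coverage cutoffs, and the function name below are illustrative assumptions, not recommended values; the column order is the default for `-outfmt 6`.

```python
# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_near_perfect_hits(lines, query_lengths, min_identity=98.0, min_coverage=0.95):
    """Return {query_id: subject_id} for the top-scoring near-perfect hit per query.

    lines: iterable of tab-separated -outfmt 6 rows.
    query_lengths: dict of query_id -> protein length (needed for coverage).
    The thresholds are assumptions chosen for illustration only.
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        row = dict(zip(COLUMNS, line.split("\t")))
        identity = float(row["pident"])
        # qstart/qend are 1-based and inclusive.
        covered = abs(int(row["qend"]) - int(row["qstart"])) + 1
        coverage = covered / query_lengths[row["qseqid"]]
        if identity < min_identity or coverage < min_coverage:
            continue
        score = float(row["bitscore"])
        if row["qseqid"] not in best or score > best[row["qseqid"]][1]:
            best[row["qseqid"]] = (row["sseqid"], score)
    return {query: subject for query, (subject, _) in best.items()}


# Made-up rows for illustration; the IDs and numbers are not real results.
example = [
    "geneA-mRNA-1\tgeneX-mRNA-1\t99.80\t541\t1\t0\t1\t541\t1\t541\t0.0\t1100",
    "geneA-mRNA-1\tgeneY-mRNA-1\t62.00\t300\t114\t3\t10\t309\t5\t304\t1e-90\t310",
    "geneB-mRNA-1\tgeneY-mRNA-1\t99.00\t100\t1\t0\t1\t100\t1\t100\t1e-50\t200",
]
lengths = {"geneA-mRNA-1": 541, "geneB-mRNA-1": 400}
matches = best_near_perfect_hits(example, lengths)
```

Running the search in both directions and keeping only pairs that pick each other (reciprocal best hits) would be a stricter variant of the same filter.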

> 
> 2) What is the meaning of all the data in a line such as the following one (taken from the protein.fasta output)
> 
> maker-2L-augustus-gene-0.19-mRNA-1 protein AED:0.0322873164323667 eAED:0.0322873164323667 QI:2|1|0.66|1|1|1|3|208|541
> 

AED = Annotation edit distance: describes how closely the prediction matches the evidence.  It is a distance measure, so 0 is a perfect match and 1 is no overlap.
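
To make the 0-to-1 scale concrete, here is a minimal sketch that assumes the commonly cited formulation AED = 1 - (SN + SP) / 2, where SN is the fraction of evidence bases overlapped by the prediction and SP is the fraction of prediction bases overlapped by the evidence. The function name and interval representation are assumptions for illustration; MAKER's own calculation works on its aligned evidence and may differ in detail.

```python
def annotation_edit_distance(prediction, evidence):
    """Nucleotide-level AED sketch.

    prediction, evidence: iterables of (start, end) intervals,
    1-based and inclusive. Assumes AED = 1 - (SN + SP) / 2.
    """
    pred = {pos for start, end in prediction for pos in range(start, end + 1)}
    evid = {pos for start, end in evidence for pos in range(start, end + 1)}
    if not pred or not evid:
        return 1.0  # nothing to compare counts as no overlap
    shared = len(pred & evid)
    sensitivity = shared / len(evid)   # evidence bases covered by the prediction
    specificity = shared / len(pred)   # prediction bases covered by the evidence
    return 1.0 - (sensitivity + specificity) / 2.0


perfect = annotation_edit_distance([(1, 100)], [(1, 100)])
disjoint = annotation_edit_distance([(1, 100)], [(201, 300)])
half = annotation_edit_distance([(1, 100)], [(51, 150)])
```

With these inputs a prediction identical to its evidence scores 0, a prediction with no shared bases scores 1, and a half-overlapping pair falls in between.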

eAED = Exon adjusted annotation edit distance: This metric is the same as AED with a couple of exceptions.  First, for a protein coding exon to be counted as overlapping protein evidence, the reading frame must be the same in the coding exon and the protein evidence.  Second, when mRNA-Seq data is used as evidence and both ends of an exon are supported by splice-site-spanning reads, the middle of that exon is counted as supported as well, even if coverage drops off in the interior of the exon.  For the most part AED and eAED will be the same, but eAED tends to work better on many fringe cases.

The QI values are separated by "|" and are, in order:

1. 5' UTR length.
2. Fraction of splice sites confirmed by an EST alignment.
3. Fraction of exons that overlap an EST alignment.
4. Fraction of exons that overlap an EST or protein alignment.
5. Fraction of splice sites confirmed by an ab initio prediction.
6. Fraction of exons that overlap an ab initio prediction.
7. Number of exons in the transcript.
8. 3' UTR length.
9. Length of the encoded protein.
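
If you want these values in a script, a header line like the one you quoted can be split into named fields. This is only a sketch: the field names and the function name are made up for illustration, and it assumes the header layout shown in your example.

```python
# Names for the nine QI positions, in the order listed above.
# The names themselves are illustrative, not MAKER identifiers.
QI_FIELDS = [
    ("five_prime_utr_length", int),
    ("splice_sites_confirmed_by_est", float),
    ("exons_overlapping_est", float),
    ("exons_overlapping_est_or_protein", float),
    ("splice_sites_confirmed_by_ab_initio", float),
    ("exons_overlapping_ab_initio", float),
    ("exon_count", int),
    ("three_prime_utr_length", int),
    ("protein_length", int),
]


def parse_maker_header(header):
    """Split a protein FASTA header into its ID, AED, eAED and QI values."""
    tokens = header.lstrip(">").split()
    record = {"id": tokens[0]}
    for token in tokens[1:]:
        key, sep, value = token.partition(":")
        if not sep:
            continue  # tokens without a colon, such as the word "protein"
        if key in ("AED", "eAED"):
            record[key] = float(value)
        elif key == "QI":
            parts = value.split("|")
            record["QI"] = {name: cast(part)
                            for (name, cast), part in zip(QI_FIELDS, parts)}
    return record


header = ("maker-2L-augustus-gene-0.19-mRNA-1 protein "
          "AED:0.0322873164323667 eAED:0.0322873164323667 "
          "QI:2|1|0.66|1|1|1|3|208|541")
info = parse_maker_header(header)
```

For your example line this gives a 5' UTR of 2 bases, 3 exons, a 3' UTR of 208 bases and a 541 residue protein.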


> 3) If I include snap and augustus to improve protein predictions, I get several protein.fasta files: augustus_masked.proteins.fasta , snap_masked.proteins.fasta , non_overlapping_ab_initio.proteins.fasta , and proteins.fasta
> 
> Which of these files contains the definite set of predicted protein sequences?

The proteins.fasta file is the final set of proteins for all genes that MAKER created annotations for.
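
Since the goal is one amino acid sequence per transcript, a small reader that maps each ID to its sequence may be enough to get started. This is a plain-Python sketch with no MAKER-specific logic; the function name and the sample records are made up, and in practice you would pass it an open handle on your own proteins.fasta file.

```python
def read_fasta(lines):
    """Return {record_id: sequence} from an iterable of FASTA lines.

    The record ID is the first whitespace-delimited token after ">".
    """
    records = {}
    current_id = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current_id is not None:
                records[current_id] = "".join(chunks)
            current_id = line[1:].split()[0]
            chunks = []
        elif current_id is not None:
            chunks.append(line)
    if current_id is not None:
        records[current_id] = "".join(chunks)
    return records


# Made-up records for illustration only.
sample = """>gene1-mRNA-1 protein AED:0.10 eAED:0.10 QI:0|1|1|1|1|1|2|0|8
MKTAY
IAK
>gene2-mRNA-1 protein AED:0.25 eAED:0.25 QI:0|0|0|1|1|1|1|0|4
MSLQ
""".splitlines()
proteins = read_fasta(sample)
```

Sequences wrapped over several lines are joined, so the result can be written back out or compared between genomes directly.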

> 
> 
> 
> Thanks in advance!
> 
> Luciano

Barry Moore
Research Scientist
Dept. of Human Genetics
University of Utah
Salt Lake City, UT 84112
--------------------------------------------
(801) 585-3543



