[maker-devel] Interpreting Maker Results

Thu Oct 11 16:21:49 MDT 2012

The non_overlapping_ab_initio.proteins.fasta file contains models that were
rejected for lack of homology evidence and that do not overlap the maker
models.  So if you accept everything, the total is not 21,000 proteins but
33,000 (12,000 + 21,000), because the contents of that file do not occupy
the same regions as the gene models (i.e. they are non-overlapping).  Ab
initio predictors have a tendency to overpredict, which is why there are so
many models.  If you think you are missing genes, you can try adding
evidence to the maker run (e.g. all proteins from a few related species).
Also try running CEGMA (http://korflab.ucdavis.edu/datasets/cegma/) to
estimate the completeness of your assembly.  Sometimes a lower gene count
than expected can be attributed to an incomplete assembly.

You can also run something like InterProScan to identify models in the
non_overlapping_ab_initio.proteins.fasta file that contain known protein
domains (these are likely to be real genes), then add them to your results
as a second step using maker's model_gff option.
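
As a rough illustration of the selection step, here is a minimal Python
sketch.  It assumes the domain search produced a tab-separated table whose
first column is the protein ID; check that against the output of whatever
tool and version you actually run.  The function names and the example IDs
are hypothetical and are not part of maker.

```python
import io


def ids_with_domain_hits(tsv_handle):
    """Collect protein IDs from column 1 of a tab-separated hit table.

    Assumption: one hit per line, protein ID in the first column.
    """
    ids = set()
    for line in tsv_handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        ids.add(line.split("\t")[0])
    return ids


def record_id(header):
    """Return the first whitespace-delimited token of a FASTA header."""
    tokens = header.split()
    return tokens[0] if tokens else ""


def filter_fasta(fasta_handle, keep_ids):
    """Yield (header, sequence) for records whose ID is in keep_ids."""
    header, seq = None, []
    for line in fasta_handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None and record_id(header) in keep_ids:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    # Flush the last record in the file.
    if header is not None and record_id(header) in keep_ids:
        yield header, "".join(seq)


# Tiny in-memory example with made-up IDs and sequences.
hits = io.StringIO("pred-1\tx\t120\tPfam\n# comment\npred-3\tx\t80\tPfam\n")
proteins = io.StringIO(">pred-1 desc\nMKV\nLLA\n>pred-2\nMAA\n>pred-3\nMTT\n")
keep = ids_with_domain_hits(hits)
selected = list(filter_fasta(proteins, keep))
```

The same ID list can then be used to pull the matching features out of the
GFF3 before passing them back to maker through model_gff.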

I've attached a couple of scripts that can help with that.  gff3_preds2models
turns a set of match/match_part features into gene/mRNA/exon/CDS features.
It doesn't check the translation of the CDS, though, so only use it on gene
predictions, which should be all CDS.  gff3_select selects a subset of
features from a GFF3 file, which is useful for slicing sections of data out
of a GFF3 file.
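
To make the match/match_part to gene/mRNA/exon/CDS conversion concrete, here
is a minimal Python sketch of the idea.  It is not the attached
gff3_preds2models script, and the ID naming scheme is made up for the
example.  It assumes every match_part is coding and that the first segment
starts on a complete codon (phase 0), and like the script it does not check
the translation.

```python
def preds_to_models(gff3_lines):
    """Rewrite match/match_part features as gene/mRNA/exon/CDS rows.

    Assumptions: 9-column GFF3, each match has an ID attribute, each
    match_part has a Parent pointing at its match, all parts are coding.
    """
    matches, parts, order = {}, {}, []
    for line in gff3_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        if cols[2] == "match" and "ID" in attrs:
            matches[attrs["ID"]] = cols
            order.append(attrs["ID"])
        elif cols[2] == "match_part" and "Parent" in attrs:
            span = (int(cols[3]), int(cols[4]))
            parts.setdefault(attrs["Parent"], []).append(span)

    out = []
    for mid in order:
        seqid, source, _, start, end, _, strand, _, _ = matches[mid]
        gene_id, mrna_id = mid + "-gene", mid + "-mRNA"  # hypothetical naming

        def row(ftype, s, e, phase, attributes):
            fields = [seqid, source, ftype, str(s), str(e), ".", strand,
                      phase, attributes]
            return "\t".join(fields)

        out.append(row("gene", start, end, ".", "ID=" + gene_id))
        out.append(row("mRNA", start, end, ".",
                       "ID=%s;Parent=%s" % (mrna_id, gene_id)))
        # Walk segments in transcript order: descending on the minus strand.
        segments = sorted(parts.get(mid, []), reverse=(strand == "-"))
        phase = 0
        for i, (s, e) in enumerate(segments, 1):
            out.append(row("exon", s, e, ".",
                           "ID=%s:exon:%d;Parent=%s" % (mrna_id, i, mrna_id)))
            out.append(row("CDS", s, e, str(phase),
                           "ID=%s:cds;Parent=%s" % (mrna_id, mrna_id)))
            # Bases to skip at the start of the next segment to reach
            # the first complete codon.
            phase = (phase - (e - s + 1)) % 3
    return out
```

Because nothing here verifies that the CDS translates cleanly, the output is
only as good as the predictions that go in.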

Thanks,
Carson

From:  Kipp Johnson <kippjohnson at uchicago.edu>
Date:  Thursday, 11 October, 2012 3:43 PM
To:  <maker-devel at yandell-lab.org>, Carson Holt <carsonhh at gmail.com>
Subject:  Interpreting Maker Results

Hi Carson,

      I'm trying to get my genetic annotation out of maker. My maker run on
a non-model eukaryote finished, and I used your gff3_merge script to merge
the resulting files. This file is enormous, because I used snap, augustus,
genemark, repeatmasker, exonerate, and blast, and has a lot of entries from
all of these different programs.

      I want to extract only the genetic regions predicted by maker, so I
used the gff3_merge script with the "-g" option. However, when I do this, I
get a maker file that only has about 12,000 genes, while I was expecting
around 20,000 genes for our genome. When I use the fasta merge tool, on the
other hand, I get output files (for example,
"merged.fasta.all.maker.non_overlapping_ab_initio.proteins.fasta") with
about 21,000 proteins, which is closer to the gene number that I was
expecting. Does the "-g" option ignore evidence from blast/exonerate or
similar? How should I extract the complete set of genetic regions to blast
against, so that I can go on working on the annotation?

Also, what is maker using to find these 9,000 extra proteins? Are these
all alternatively spliced or something along those lines? I can't find
any documentation online for how to actually get the final annotations out
of maker correctly.

Thanks for your time!

Best,

     Kipp Johnson
     kippjohnson at uchicago.edu

-------------- next part --------------
A non-text attachment was scrubbed...
Name: gff3_preds2models
Type: application/octet-stream
Size: 4778 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20121011/0b408549/attachment-0006.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gff3_select
Type: application/octet-stream
Size: 3237 bytes
Desc: not available
URL: <http://yandell-lab.org/pipermail/maker-devel_yandell-lab.org/attachments/20121011/0b408549/attachment-0007.obj>