[maker-devel] Too many genes?

Annabel Beichman annabel.beichman at gmail.com
Mon Oct 17 13:20:37 MDT 2016


Hi Carson et al., 

Thanks so much for such a great pipeline, tutorials and advice pages.

I have just finished four rounds of annotation in Maker on the sea otter genome which we assembled using Meraculous shotgun assembly + Dovetail Genomics HiRise scaffolding. 

Rounds I & II: In the first two rounds, I trained Augustus and Snap on 400 scaffolds > 500kb using mRNA-seq data assembled in Trinity, and protein data from Ensembl for ferret, dog and cat. 

Round III: Then, using the trained gene predictors (Augustus showed spec/sens > 90%), I annotated all scaffolds >50kb.

Round IV: Based on reading emails in this group, I then decided to make a custom repeat library, and re-run maker one last time using my trained gene predictors, custom repeat library, and 1200 scaffolds >15kb. 

I found my number of genes dropping each round, as you suggest they should (47465 after Round I, 27289 after round II, 25847 after round III, and 25031 after round IV). 

However, this final gene count (25,031) still seems to high too me, and I was wondering if you had some advice for filtering? Using BUSCO, our assembly is 78% complete, and the final annotation is 72% complete. However, I am getting 25,000+ annotated genes; 22,000+ of which are below an AED and eAED cutoff of 0.5. This seems like far too many genes for a mammal genome that is only ~75% complete. I would have expected to get something more like 15-20,000 genes. 

22870 of the Maker-annotated proteins have BLAST hits to SwissProt/UniProt (e value 1e-03), but only 13,000 annotated proteins have orthologs in the ferret, the otter’s closest relative (e value 1e-05 using ProteinOrtho). 900 genes do not have any BLAST hits in SwissProt/UniProt, but have AED/eAED scores of 0.00 – when I visualize them in Jbrowse they have a Trinity read as evidence, but nothing else. Could these be Trinity artefacts? I also notice that my SNAP tracts are very long (some almost as long as the whole scaffold). 

I am designing an exome-capture array based on this annotation, and so am trying to filter the gene models to have a set of genes that we can be fairly confident in, but also trying not to miss real gene models. Could you please advise me on how to filter down the gene models, or what might be happening to cause the excess of genes? The most conservative gene list would be the 13,000 genes that are ferret orthologs. But I would like to salvage more genes if possible, if you can suggest a way to parse out real genes from among the ones that do not have ferret orthologs, but do have Blast hits to SwissProt? Would you recommend any additional filters on gene length, etc.?


Not sure if this is significant, but one thing I’ve noticed is that many of the genes with Blast hits in SwissProt but no ferret orthologs often have several similar genes in a row along the same scaffold:
ScbS9RH_101185 30796 38760 + ELUT_00004195-RA ELUT_00004195 Name=ELUT_00004195-RA 0.08 0.17 Similar to ANO3: Anoctamin-3 (Homo sapiens)
ScbS9RH_101185 42617 51087 + ELUT_00004196-RA ELUT_00004196 Name=ELUT_00004196-RA 0.25 0.26 Similar to ANO3: Anoctamin-3 (Homo sapiens)
ScbS9RH_101185 87006 87827 + ELUT_00004198-RA ELUT_00004198 Name=ELUT_00004198-RA 0.18 0.18 Similar to Ano3: Anoctamin-3 (Mus musculus)
ScbS9RH_101185 110043 122523 + ELUT_00004199-RA ELUT_00004199 Name=ELUT_00004199-RA 0.09 0.09 Similar to ANO3: Anoctamin-3 (Homo sapiens)

Thank you all so much for your help and advice!

[I also want to report an odd behavior, that may be specific to our server – when the number of scaffolds being annotated using maker drops below the number of cores (e.g. usning openmpi with 45 cores available, but there are only 44 scaffolds left), maker crashes. I then have to restart it with fewer cores, and it will crash again once the number of remaining scaffolds drops below the new lower number of cores. This makes finishing a run of Maker a bit like Zeno’s paradox, where it gets very slow for the last two days of the run due to the stopping and restarting.]

Best wishes,
Annabel Beichman
Wayne Lab/Lohmueller Lab
Ecology & Evolutionary Biology
UCLA
Annabelbeichman.com



More information about the maker-devel mailing list