[maker-devel] repeats masking

Carson Holt carsonhh at gmail.com
Fri Aug 18 09:35:48 MDT 2017


Hi Quanwei,

> (1) We are annotating the genome of a new rodent species and wonder whether we should use the repeat library for "Mammalia" or "rodent". Which is more appropriate if we have not constructed a species-specific repeat library for the new genome?

Overmasking can occur, but you should really only worry about it if you are looking for a specific gene or gene family and you don’t care about false positive gene models. On a genome-wide level, undermasking is almost always the greater danger, so I’d recommend using Mammalia. Also, you should always build a species-specific library when working with repeat-rich organisms like mammals.
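
As a rough sanity check on how much a given library masks, you can count the lowercase bases in a soft-masked FASTA. The sketch below assumes soft-masked output (for example, from running RepeatMasker with `-xsmall`). The function name `masked_fraction` is hypothetical and is not part of MAKER or RepeatMasker.

```python
import io


def masked_fraction(fasta_handle):
    """Return the fraction of lowercase (soft-masked) bases among A/C/G/T bases.

    Ambiguous bases such as N are ignored. Hypothetical helper for illustration.
    """
    masked = 0
    total = 0
    for line in fasta_handle:
        if line.startswith(">"):
            continue  # skip FASTA header lines
        for base in line.strip():
            if base in "acgt":
                masked += 1
                total += 1
            elif base in "ACGT":
                total += 1
    return masked / total if total else 0.0


# Toy input: 8 of the 16 unambiguous bases are lowercase.
example = io.StringIO(">scaffold_1\nACGTacgtACGT\nNNNNacgt\n")
print(f"{masked_fraction(example):.2f}")  # prints 0.50
```

Comparing this number between runs with different libraries only shows how much sequence each one masks. It does not say whether the masked regions are real repeats.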


> (2) Given the concerns discussed in the emails above, we did not train a species-specific repeat library. Since we have already finished the annotation using only the repeat library from RepeatMasker and MAKER2, we wonder whether it is worth first training a species-specific repeat library and then redoing the genome annotation. Would training a species-specific repeat library significantly affect the gene annotation and downstream analyses (e.g., gene family expansion analysis, positive selection)?

It might be OK. Both Mammalia and rodent are already rich in repeats from related species in RepBase. But you may still have a lot of false positives because of missed repeats. Repeats and transposable elements tend to create false regions of high evidence homology: it looks like you are getting evidence for a gene in the region, but when you look at the underlying sequence you realize it is a spurious alignment.
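
One way to triage suspicious evidence is to measure how much of each alignment falls inside annotated repeats. This is a minimal sketch, not something MAKER does for you. The function name, the half-open coordinate convention, and the 0.8 cutoff are all assumptions made for illustration.

```python
def repeat_overlap_fraction(aln_start, aln_end, repeats):
    """Fraction of the half-open interval [aln_start, aln_end) covered by repeats.

    `repeats` is a list of (start, end) half-open intervals on the same
    sequence; overlapping repeat intervals are counted only once.
    """
    length = aln_end - aln_start
    if length <= 0:
        return 0.0
    covered = 0
    cursor = aln_start  # everything left of cursor has been accounted for
    for start, end in sorted(repeats):
        start = max(start, cursor)
        end = min(end, aln_end)
        if end > start:
            covered += end - start
            cursor = end
    return covered / length


# Hypothetical evidence alignments and repeat annotations on one scaffold.
alignments = [("hit_a", 100, 200), ("hit_b", 500, 600)]
repeats = [(90, 180), (150, 190)]
CUTOFF = 0.8  # assumed threshold; tune it for your own data

for name, start, end in alignments:
    frac = repeat_overlap_fraction(start, end, repeats)
    if frac >= CUTOFF:
        print(f"{name}: {frac:.2f} of the alignment is repeat, review manually")
```

An alignment that sits mostly inside repeats is not automatically wrong, so treat the flag as a prompt to look at the underlying sequence.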


> (3) We identified some gene families under contraction, but we want to confirm that those gene families really lost copies in our new genome. Do you think it is worth running the genome annotation without repeat masking, so that no genes are missing from the annotation because of repeat masking?

Without repeat masking you will get a lot of false alignments. If you find anything without repeat masking, you will need to do heavy manual review of the alignments, and perhaps even domain identification, to further weed out the many false positives you are sure to get.

—Carson