<div dir="ltr"><div><div><div><div>Dear Carson:<br><br></div>I am trying to build a species-specific repeat library for our new rodent species, following "<a href="http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic">http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic</a>". Some steps are not clear to us; could you please explain them? Thanks.<br><br></div>(1) For the predicted unknown (unclassified) repeat sequences (those in Modelerunknown.lib), the page says: "Sequences in Modelerunknown.lib were searched against a transposase database (derived from <a rel="nofollow" class="external gmail-text" href="http://www.repeatmasker.org/">RepeatMasker</a>) and sequences matching transposase were considered as transposons belonging to the relevant superfamily".  <br></div><div>How should this search be done? Should I annotate the "unknown" repeat sequences with RepeatMasker? And what should I do when only part of an "unknown" repeat sequence matches known repeat elements?<br><br></div>(2) To exclude gene fragments, I need to map the predicted repeat sequences against a protein database and then run the package "ProtExcluder", right? Where can I get such a protein database? Since I am working on a new rodent species, can I use all rodent proteins from UniProt (both Swiss-Prot and TrEMBL)?<br><br></div>(3) After I generate the species-specific repeat library, do I still need to select a model organism for RepBase masking (as shown below)? <br><div><br></div><div>In the file "maker_opts.ctl":<br>#-----Repeat Masking (leave values blank to skip repeat masking)<br>model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker<br>rmlib=myRepeat.fa #provide an organism specific repeat library in fasta format for RepeatMasker<br><br>
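<div>To make question (3) concrete, here is a minimal Python sketch that reads the two control-file lines quoted above and reports that both repeat-masking options are set at the same time. It is not part of MAKER: <code>parse_ctl</code> is a hypothetical helper, and the parsing rule (everything after "#" is a comment) is an assumption based only on the lines shown here. It says nothing about how MAKER itself combines the two settings, which is the open question.</div>
<pre>
```python
def parse_ctl(text):
    """Parse 'key=value #comment' lines (hypothetical helper, not MAKER code)."""
    opts = {}
    for line in text.splitlines():
        # Assumption: everything after the first '#' is a comment.
        line = line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        opts[key.strip()] = value.strip()
    return opts


# The repeat-masking lines exactly as quoted from maker_opts.ctl above.
CTL = """\
#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=Mammalia #select a model organism for RepBase masking in RepeatMasker
rmlib=myRepeat.fa #provide an organism specific repeat library in fasta format for RepeatMasker
"""

opts = parse_ctl(CTL)
# True when both the RepBase model organism and the custom library are given.
both_set = bool(opts.get("model_org")) and bool(opts.get("rmlib"))
print(opts, both_set)
```
</pre>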


 <div>Many thanks</div><div><br></div><div>Best</div><div>Quanwei<br></div></div></div><div class="gmail_extra"><br><div class="gmail_quote">2017-08-18 11:35 GMT-04:00 Carson Holt <span dir="ltr">&lt;<a href="mailto:carsonhh@gmail.com" target="_blank">carsonhh@gmail.com</a>&gt;</span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Quanwei,<br>

<span class=""><br>

> (1) We are annotating the genome of a new rodent species and wonder whether we should use the repeat library for "Mammalia" or "rodent". Which is more appropriate if we have not constructed a species-specific repeat library for the new genome?<br>

<br>

</span>Overmasking can occur, but you should really only worry about it if you are looking for a specific gene or gene family and you don’t care about false-positive gene models. On a genome-wide level you will find that undermasking is almost always the greater danger, so I’d recommend using Mammalia. Also, you should always build a species-specific library when working with repeat-rich organisms like mammals.<br>

<span class=""><br>

<br>

> (2) Given the concerns discussed in the emails above, we did not train a species-specific repeat library. Since we have already finished the annotation using only the repeat library from RepeatMasker and MAKER2, we wonder whether it is worth first training a species-specific repeat library and then redoing the genome annotation. Would training a species-specific repeat library significantly affect the gene annotation and downstream analyses (e.g., gene family expansion analysis, positive selection)?<br>

<br>

</span>It might be OK. RepBase is already rich in repeats from related species for both Mammalia and rodent. But you may still have a lot of false positives because of missed repeats. Repeats and transposable elements tend to create false regions of high evidence homology (they make it look like you are getting evidence for a gene in the region, but when you look at the underlying sequence you realize it is a spurious alignment).<br>

<span class=""><br>

<br>

> (3) We identified some gene families under contraction, but we want to confirm that those gene families really lost copies in our new genome. Do you think it is worth running the genome annotation without repeat masking, so that no genes are missing from the annotation because of repeat masking?<br>

<br>

</span>Without repeat masking you will get a lot of false alignments. If you find anything without repeat masking you will need to do heavy manual review of the alignment and perhaps even domain identification to further weed out the many false positives you are sure to get.<br>

<span class="HOEnZb"><font color="#888888"><br>

—Carson</font></span></blockquote></div><br></div>