[maker-devel] MAKER and RepeatModeler
Shaun Jackman
sjackman at gmail.com
Thu Jul 30 12:11:48 MDT 2015
Hi, Chris.
Yes, I did get a response from the RepeatModeler author, Robert Hubley (cc’ed). There’s no public mailing list, as far as I know, so it’s all in private communication.
Yes, RepeatModeler is non-deterministic. I suggested that the random seed be added as a parameter to RepeatModeler, and Robert agreed.
I’m still not sure why the results were so variable (between 5 kbp and 30 kbp annotated as repeats, see table far below). Perhaps it’s because my genome is much smaller (6 Mbp) than the size of the random sample (40 Mbp) that RepeatModeler uses. See immediately below. Robert?
Cheers,
Shaun
RepeatModeler Round # 1
========================
Searching for Repeats
-- Sampling from the database...
- Gathering up to 40000000 bp
- Final Sample Size = 6001210 bp ( 5937815 non ambiguous )
- Num Contigs Represented = 38
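
The log above asks for up to 40000000 bp but ends up with 6001210 bp, which is roughly the whole assembly. A minimal sketch for checking this before a run, assuming a plain FASTA input. The names fasta_bp, check_sample and SAMPLE_BP are illustrative and are not part of RepeatModeler; the 40000000 figure is simply copied from the log.

```shell
# Count sequence characters in a FASTA file, skipping ">" header lines.
fasta_bp() {
  awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { print n + 0 }' "$1"
}

# Sample size taken from "Gathering up to 40000000 bp" in the log (assumed fixed here).
SAMPLE_BP=40000000

# Report whether the genome is smaller than the requested sample.
check_sample() {
  bp=$(fasta_bp "$1")
  if [ "$bp" -lt "$SAMPLE_BP" ]; then
    echo "$1: $bp bp, smaller than the $SAMPLE_BP bp sample"
  else
    echo "$1: $bp bp, at least the $SAMPLE_BP bp sample"
  fi
}

# Example: check_sample x.fa
```

This only compares sizes. It does not say anything about how RepeatModeler draws its sample internally.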
--
http://sjackman.ca/
On 2015-July-30 at 9:28:27 , Fields, Christopher J (cjfields at illinois.edu) wrote:
Hi Shaun
Ever get an answer on this one from the RepeatMasker folks? I’ve seen (and expect) non-deterministic results from a few tools but the results shouldn’t change *that* dramatically.
chris
On Jul 17, 2015, at 3:29 PM, Shaun Jackman <sjackman at gmail.com> wrote:
Hi, Carson.
It seems that RepeatModeler is not deterministic. I ran it five times on the same sequence and got very different outputs. Two of these five runs mark atp8 as a repeat, which is why I have genes blinking in and out of existence. How do folks deal with this situation? It seems absurd. What’s the cause of the non-determinism? Random number generator? Threading? Can I get deterministic behaviour if I set the seed of the random number generator and run it single-threaded? I don’t see how I can implement a reproducible pipeline with the situation as it is.
This has become a RepeatModeler question more than a MAKER question, but I thought I’d continue this thread that I’d started here.
n   n:1  L50  min  N80   N50    N20    E-size  max    sum    name
6   6    1    289  7667  12403  12403  9102    12403  24293  RepeatModeler1.fa
6   6    1    332  4023  14769  14769  10920   14769  21738  RepeatModeler2.fa
6   6    1    244  370   2731   2731   1765    2731   4688   RepeatModeler3.fa
10  10   1    354  2114  17134  17134  11354   17134  30782  RepeatModeler4.fa
8   8    3    538  1093  1750   2526   1706    2526   10713  RepeatModeler5.fa
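
To put a number on the variability, here is a small sketch that takes the "sum" column (total bp per library) from the five runs in the table and reports the smallest total, the largest total and their ratio. The function name sum_spread is made up for illustration.

```shell
# The "sum" column from the table, one run per line.
sums='24293
21738
4688
30782
10713'

# Print the smallest and largest totals and the max/min ratio.
sum_spread() {
  printf '%s\n' "$1" | awk '
    NR == 1 { min = max = $1 }
    $1 < min { min = $1 }
    $1 > max { max = $1 }
    END { printf "min=%d max=%d ratio=%.1f\n", min, max, max / min }'
}

sum_spread "$sums"
```

The totals range from 4688 bp to 30782 bp across runs on identical input.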
My command line is
BuildDatabase -name x -engine ncbi x.fa
RepeatModeler -database x
cp -a RM_*/consensi.fa.classified RepeatModeler.fa
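
A quick way to check whether repeated runs agree is to compare checksums of the resulting libraries. This is only a sketch: the function name distinct_outputs is hypothetical, and byte-for-byte identity is a strict test, since two libraries could differ in sequence order or naming and still be equivalent.

```shell
# Print the number of distinct file contents among the given files.
# A result of 1 means every run produced byte-identical output.
distinct_outputs() {
  for f in "$@"; do
    cksum < "$f"
  done | sort -u | wc -l | tr -d ' '
}

# Example (assumes the five libraries from the table are in this directory):
# distinct_outputs RepeatModeler1.fa RepeatModeler2.fa RepeatModeler3.fa \
#                  RepeatModeler4.fa RepeatModeler5.fa
```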
I installed the following software using Homebrew on a Mac.
repeatmodeler 1.0.8
recon 1.07
repeatmasker 4.0.5
repeatscout 1.0.5
rmblast 2.2.28
trf 4.07b
Cheers,
Shaun
--
http://sjackman.ca/
On 2015-July-17 at 10:36:50 , Carson Holt (carsonhh at gmail.com) wrote:
The subset is actually built off of a taxonomy. So you can extract all repeats for a species or genus, for example. If a term doesn’t match the internal taxonomy, it throws an error.
—Carson
On Jul 17, 2015, at 11:24 AM, Carson Holt <carsonhh at gmail.com> wrote:
Yes. It takes a subset of RepBase. If runtime isn’t an issue and you really want to mask as much as possible, you can also set model_org=all. Most of whatever else is in RepBase probably won’t align anywhere, but it may give you marginally better sensitivity.
—Carson
On Jul 17, 2015, at 11:20 AM, Shaun Jackman <sjackman at gmail.com> wrote:
Hi, Carson.
I set model_org=picea. I see that it created a new database in the RepeatModeler folder Libraries/20140131/picea/specieslib. What is the effect of the model_org option? Does it extract sequences from RepBase that match the string picea?
Cheers,
Shaun
--
http://sjackman.ca/
On 2015-July-17 at 9:40:58 , Carson Holt (carsonhh at gmail.com) wrote:
That is weird.
One thought though. When you run MAKER do you supply both rmlib and model_org or just rmlib? If you are only supplying rmlib, you could try supplying both together (RepeatMasker will then run twice). That way some of the edge cases might better be identified.
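
Supplying both options might be scripted along these lines. This is a sketch that assumes the MAKER options file (maker_opts.ctl is assumed here) keeps each option on its own key=value line, optionally followed by a #comment. The helper set_opt is hypothetical; the values RepeatModeler.fa and picea are the ones mentioned elsewhere in this thread.

```shell
# Set KEY=VALUE in a control file, keeping any trailing "#comment" on that line.
# Values containing "|" are not handled by this simple sed expression.
set_opt() {
  file=$1 key=$2 value=$3
  sed "s|^$key=[^ #]*|$key=$value|" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Example:
# set_opt maker_opts.ctl rmlib RepeatModeler.fa
# set_opt maker_opts.ctl model_org picea
```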
—Carson
On Jul 16, 2015, at 5:25 PM, Shaun Jackman <sjackman at gmail.com> wrote:
Hi, Carson.
I removed two small contaminant contigs (~7 kbp) from the assembly (~6 Mbp), and MAKER found four fewer genes (four copies of the same atp8 gene), even though these genes were not in the contaminant contigs. I figured out that it’s because I’m running RepeatModeler to create the rmlib for MAKER. When I remove the contaminant contigs, RepeatModeler now identifies this gene atp8 as being an LTR/Gypsy repeat.
Any thoughts on why removing two contigs would cause RepeatModeler to identify new repeats?
Cheers,
Shaun
--
http://sjackman.ca/
_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org