[maker-devel] MAKER and RepeatModeler

Shaun Jackman sjackman at gmail.com
Thu Jul 30 12:11:48 MDT 2015


Hi, Chris.

Yes, I did get a response from the RepeatModeler author, Robert Hubley (cc’ed). There’s no public mailing list, as far as I know, so it’s all in private communication.

Yes, RepeatModeler is non-deterministic. I suggested that the random seed be added as a parameter to RepeatModeler, and Robert agreed.
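
To illustrate why a seed parameter would help, here is a minimal sketch in Python. It is not RepeatModeler's code; the function name sample_contigs and the contig names are made up for illustration. It only shows that a random sample drawn with a fixed seed can be reproduced exactly.

```python
import random

def sample_contigs(contig_ids, k, seed=None):
    """Pick k contig IDs at random; a fixed seed makes the pick repeatable."""
    rng = random.Random(seed)  # private generator, independent of global state
    return rng.sample(contig_ids, k)

# Hypothetical contig names; 38 contigs, matching the log further down.
contigs = [f"contig{i}" for i in range(1, 39)]
first = sample_contigs(contigs, 5, seed=42)
second = sample_contigs(contigs, 5, seed=42)
print(first == second)  # True: same seed, same sample
```

With seed=None the generator is seeded from the operating system, so repeated calls would generally differ, which is the behaviour a user-settable seed is meant to avoid.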

I’m still not sure why the results were so variable (between 5 kbp and 30 kbp annotated as repeats, see table far below). Perhaps it’s because my genome is much smaller (6 Mbp) than the size of the random sample (40 Mbp) that RepeatModeler uses. See immediately below. Robert?

Cheers,
Shaun

RepeatModeler Round # 1
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 40000000 bp
   - Final Sample Size = 6001210 bp ( 5937815 non ambiguous )
   - Num Contigs Represented = 38




-- 
http://sjackman.ca/

On 2015-July-30 at 9:28:27 , Fields, Christopher J (cjfields at illinois.edu) wrote:

Hi Shaun

Ever get an answer on this one from the RepeatMasker folks?  I’ve seen (and expect) non-deterministic results from a few tools but the results shouldn’t change *that* dramatically.

chris

On Jul 17, 2015, at 3:29 PM, Shaun Jackman <sjackman at gmail.com> wrote:

Hi, Carson.

It seems that RepeatModeler is not deterministic. I ran it five times on the same sequence and got very different outputs. Two of these five runs mark atp8 as a repeat, which is why I have genes blinking in and out of existence. How do folks deal with this situation? It seems absurd. What’s the cause of the non-determinism? Random number generator? Threading? Can I get deterministic behaviour if I set the seed of the random number generator and run it single-threaded? I don’t see how I can implement a reproducible pipeline with the situation as it is.

This has become a RepeatModeler question more than a MAKER question, but I thought I’d continue this thread that I’d started here.

n   n:1  L50  min  N80   N50    N20    E-size  max    sum    name
6   6    1    289  7667  12403  12403  9102    12403  24293  RepeatModeler1.fa
6   6    1    332  4023  14769  14769  10920   14769  21738  RepeatModeler2.fa
6   6    1    244  370   2731   2731   1765    2731   4688   RepeatModeler3.fa
10  10   1    354  2114  17134  17134  11354   17134  30782  RepeatModeler4.fa
8   8    3    538  1093  1750   2526   1706    2526   10713  RepeatModeler5.fa
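
As a quick way to quantify the spread between the five runs, the sketch below copies the sum column (total length in bp) of the table into a Python dictionary and reports the smallest run, the largest run, and their ratio. Nothing here goes beyond the numbers in the table.

```python
# 'sum' column (total length in bp) copied from the table of five runs
sums = {
    "RepeatModeler1.fa": 24293,
    "RepeatModeler2.fa": 21738,
    "RepeatModeler3.fa": 4688,
    "RepeatModeler4.fa": 30782,
    "RepeatModeler5.fa": 10713,
}

smallest = min(sums, key=sums.get)  # run with the shortest library
largest = max(sums, key=sums.get)   # run with the longest library
ratio = sums[largest] / sums[smallest]

print(f"{smallest}: {sums[smallest]} bp")
print(f"{largest}: {sums[largest]} bp")
print(f"ratio: {ratio:.1f}x")
```

The largest library is roughly six and a half times the size of the smallest, from the same input sequence.
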

My command line is:

    BuildDatabase -name x -engine ncbi x.fa
    RepeatModeler -database x
    cp -a RM_*/consensi.fa.classified RepeatModeler.fa
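
One way to check whether repeated runs of the commands above produce the same library is to compare the sets of consensus sequences. The following is a minimal sketch in Python, assuming each run's output was copied to RepeatModeler1.fa through RepeatModeler5.fa as in the table. The helper names read_fasta_sequences and same_library are made up for illustration. Headers and record order are ignored, in case naming differs between runs.

```python
from pathlib import Path

def read_fasta_sequences(path):
    """Return the set of sequences in a FASTA file, ignoring headers and order."""
    sequences, chunks = set(), []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new record starts: store the previous one, if any.
                if chunks:
                    sequences.add("".join(chunks).upper())
                chunks = []
            elif line:
                chunks.append(line)
    if chunks:  # last record in the file
        sequences.add("".join(chunks).upper())
    return sequences

def same_library(paths):
    """True if every file holds exactly the same set of sequences."""
    libraries = [read_fasta_sequences(p) for p in paths]
    return all(lib == libraries[0] for lib in libraries[1:])

# Compare whichever of the five run outputs exist in the current directory.
paths = [p for p in (Path(f"RepeatModeler{i}.fa") for i in range(1, 6)) if p.exists()]
if len(paths) >= 2:
    print("identical" if same_library(paths) else "different")
```

This is a strict test: a single differing base makes two libraries count as different, so it detects non-determinism but says nothing about how large the difference is.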

I installed the following software using Homebrew on a Mac.

repeatmodeler 1.0.8
recon 1.07
repeatmasker 4.0.5
repeatscout 1.0.5
rmblast 2.2.28
trf 4.07b

Cheers,
Shaun




-- 
http://sjackman.ca/

On 2015-July-17 at 10:36:50 , Carson Holt (carsonhh at gmail.com) wrote:

The subset is actually built from a taxonomy, so you can extract all repeats for a species or genus, for example. If a term doesn’t match the internal taxonomy, it throws an error.

—Carson

On Jul 17, 2015, at 11:24 AM, Carson Holt <carsonhh at gmail.com> wrote:

Yes. It takes a subset of RepBase. If runtime isn’t an issue and you really want to mask as much as possible, you can also set model_org=all. Most of whatever else is in RepBase probably won’t align anywhere, but it may give you marginally better sensitivity.

—Carson



On Jul 17, 2015, at 11:20 AM, Shaun Jackman <sjackman at gmail.com> wrote:

Hi, Carson.

I set model_org=picea. I see that it created a new database in the RepeatModeler folder Libraries/20140131/picea/specieslib. What is the effect of the model_org option? Does it extract sequences from RepBase that match the string picea?

Cheers,
Shaun




-- 
http://sjackman.ca/

On 2015-July-17 at 9:40:58 , Carson Holt (carsonhh at gmail.com) wrote:

That is weird.

One thought though.  When you run MAKER do you supply both rmlib and model_org or just rmlib? If you are only supplying rmlib, you could try supplying both together (RepeatMasker will then run twice).  That way some of the edge cases might better be identified.
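
Supplying both options means setting both keys in the MAKER control file. The sketch below is a small Python helper for doing that, assuming the usual key=value #comment layout of maker_opts.ctl. The function name set_ctl_options and the comments in the example text are hypothetical. The values picea and RepeatModeler.fa are the ones used elsewhere in this thread.

```python
def set_ctl_options(ctl_text, options):
    """Return ctl_text with `key=value` lines updated, keeping trailing #comments."""
    remaining = dict(options)
    out = []
    for line in ctl_text.splitlines():
        body, hash_mark, comment = line.partition("#")
        key, equals, _old_value = body.partition("=")
        key = key.strip()
        if equals and key in remaining:
            # Rewrite the value, then re-attach the original comment if present.
            line = f"{key}={remaining.pop(key)}"
            if hash_mark:
                line += f" #{comment}"
        out.append(line)
    # Keys not found in the file are appended at the end.
    out.extend(f"{k}={v}" for k, v in remaining.items())
    return "\n".join(out) + "\n"

# Hypothetical excerpt of a control file; the comments are placeholders.
example = "model_org=all #hypothetical comment\nrmlib= #hypothetical comment\n"
updated = set_ctl_options(example, {"model_org": "picea", "rmlib": "RepeatModeler.fa"})
print(updated)
```

Editing the two lines by hand works just as well; a helper like this is only useful when the control file is generated as part of a scripted pipeline.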

—Carson



On Jul 16, 2015, at 5:25 PM, Shaun Jackman <sjackman at gmail.com> wrote:

Hi, Carson.

I removed two small contaminant contigs (~7 kbp) from the assembly (~6 Mbp), and MAKER found four fewer genes, four copies of the same atp8 gene, but these genes were not in the contaminant contigs. I figured out that it’s because I’m running RepeatModeler to create the rmlib for MAKER. When I remove the contaminant contigs, RepeatModeler now identifies this gene atp8 as being an LTR/Gypsy repeat.
Any thoughts on why removing two contigs would cause RepeatModeler to identify new repeats?

Cheers,
Shaun




-- 
http://sjackman.ca/

_______________________________________________
maker-devel mailing list
maker-devel at box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org




