[maker-devel] Couple quick questions about Maker

Tue Jul 8 10:31:40 MDT 2014

Convert them both to ZFF, then concatenate the ZFF and sequence files.
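
A minimal sketch of the concatenation step, assuming each source (CEGMA and MAKER) has already been converted to a ZFF annotation file plus a matching sequence file. The file names in the usage comment are hypothetical, and the helper names are illustrative only. The annotation files and the sequence files are joined in the same source order so the records stay paired.

```python
from pathlib import Path

def concatenate(parts, out_path):
    """Append each input file to out_path in order, keeping records newline-separated."""
    with open(out_path, "w") as out:
        for part in parts:
            text = Path(part).read_text()
            if text and not text.endswith("\n"):
                text += "\n"  # avoid gluing the last record onto the next header
            out.write(text)

def count_headers(path):
    """Count '>' header lines, which both ZFF and FASTA files use per sequence."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))

# Hypothetical usage (file names are assumptions, not from the thread):
#   concatenate(["cegma.ann", "maker.ann"], "combined.ann")
#   concatenate(["cegma.dna", "maker.dna"], "combined.dna")
#   assert count_headers("combined.ann") == count_headers("combined.dna")
```

Comparing the header counts of the two combined files is a cheap sanity check that no sequence lost its annotation block along the way.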

--Carson

From:  Nathaniel Jue <n.jue at uconn.edu>
Date:  Tuesday, July 8, 2014 at 9:56 AM
To:  Carson Holt <carsonhh at gmail.com>
Cc:  Daniel Ence <dence at genetics.utah.edu>, "<maker-devel at yandell-lab.org>"
<maker-devel at yandell-lab.org>
Subject:  Re: [maker-devel] Couple quick questions about Maker

Carson, one more question: Any suggestions on how to combine the cegma and
maker est2genome/protein2genome results? Can I just concatenate and sort the
gff files or are there specific formatting issues I need to consider? No
overlapping regions or something like that?

Thanks,
Nate

Nathaniel Jue, Ph.D.

Department of Molecular and Cell Biology

University of Connecticut

Storrs, CT 06269


On Mon, Jul 7, 2014 at 12:26 PM, Carson Holt <carsonhh at gmail.com> wrote:
> I don't think RepeatMasker produces GFF3.  I believe it is GFF2 with the -gff
> option (which is pretty different). Also, if you provide GFF3 files for
> repeats, you will still need to turn off repeat masking in the control files by
> blanking out the options.  Also, MAKER uses a step called RepeatRunner against
> an internal transposable element protein database, which is probably still
> running (and is slow because it's a search in translated protein space).
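
A rough way to check which GFF flavor a file is before handing it to MAKER, sketched under the assumption that the `##gff-version` pragma or the style of column 9 is enough to tell them apart. GFF3 attributes are `tag=value` pairs separated by semicolons, while GFF2 uses free-form groups such as `Target "..."`. The function name is illustrative and is not part of MAKER or RepeatMasker.

```python
def gff_flavor(lines):
    """Guess 'gff3', 'gff2', or 'unknown' from a version pragma or column 9."""
    for line in lines:
        line = line.strip()
        if line.startswith("##gff-version"):
            fields = line.split()
            if len(fields) > 1:
                return "gff3" if fields[1].startswith("3") else "gff2"
            continue
        if not line or line.startswith("#"):
            continue  # skip blank lines and other comments
        columns = line.split("\t")
        if len(columns) < 9:
            return "unknown"  # not a nine-column feature line
        first_attr = columns[8].split(";")[0]
        key = first_attr.split("=")[0]
        # GFF3: tag=value with no space in the tag; GFF2: 'Tag "value"' groups.
        if "=" in first_attr and " " not in key:
            return "gff3"
        return "gff2"
    return "unknown"

# Hypothetical usage (the path is an assumption):
#   with open("genome.fasta.out.gff") as handle:
#       print(gff_flavor(handle))
```

This only guesses the flavor from the first informative line; it does not validate the file.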
> 
> For performance, you may want to give a larger max_dna_len for the MAKER run
> given that you have a large RAM machine. Also set all the depth_blast* options
> (depth_blastn, depth_blastx, depth_tblastx) in maker_bopts.ctl to 15 or 20.
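
The control-file changes mentioned here can be scripted. The sketch below assumes each option sits on its own line as `key=value`, optionally followed by a `#` comment, and rewrites only the value. The helper names are illustrative, and the numbers in `BLAST_DEPTHS` are placeholders to adjust for your own run.

```python
import re

def set_ctl_option(text, key, value):
    """Return control-file text with `key` set to `value`, keeping any trailing comment."""
    pattern = re.compile(
        rf"^({re.escape(key)}=)[^#\n]*?([ \t]*(?:#.*)?)$", re.MULTILINE
    )
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}{value}{m.group(2)}", text
    )
    if count == 0:
        raise KeyError(f"option {key!r} not found")
    return new_text

def apply_options(text, options):
    """Apply several key/value changes in turn."""
    for key, value in options.items():
        text = set_ctl_option(text, key, value)
    return text

# Placeholder values; 15 is one of the depths suggested above.
BLAST_DEPTHS = {"depth_blastn": 15, "depth_blastx": 15, "depth_tblastx": 15}

# Hypothetical usage:
#   text = open("maker_bopts.ctl").read()
#   open("maker_bopts.ctl", "w").write(apply_options(text, BLAST_DEPTHS))
```

Passing an empty string as the value blanks an option, which is one way to do the "blanking out" described for the repeat masking settings.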
> 
> CEGMA is convenient for training predictors because it finds genes that will
> always be in every eukaryote (i.e. high confidence).  You can combine these
> with est2genome/protein2genome results from MAKER if you want.  You can then
> use the resulting HMM for a larger MAKER run with experimental evidence, and
> then train again on those results.  But beware that there is rarely any
> benefit from training beyond that second round.  More training actually tends
> to make things worse (the overtraining paradox).
> 
> --Carson
> 
> 
> 
> From:  Daniel Ence <dence at genetics.utah.edu>
> Date:  Monday, July 7, 2014 at 10:00 AM
> To:  Nathaniel Jue <n.jue at uconn.edu>
> Cc:  "<maker-devel at yandell-lab.org>" <maker-devel at yandell-lab.org>
> Subject:  Re: [maker-devel] Couple quick questions about Maker
> 
> Hi Nathaniel, 
> 
> 1) We'll need to see the error messages that MAKER was giving to understand
> what might have gone wrong with the Repeat Masker gff3 file. If you could run
> maker on one of your scaffolds with your current settings and send us the
> complete output, we can start to figure out what happened.
> 
> 2) MAKER interacts with its gene predictors (augustus, snap, and the other
> ones listed in the control files) in a way that improves their performance
> (with the hints and such). When you supply predictions through the pred_gff
> parameter, MAKER can't give that performance improvement, so there's something
> of a tradeoff there. I think the performance improvement is a key part of
> MAKER's success, so I would definitely recommend running the ab-initio tools
> internally. 
> 
> MAKER tries to save you time by saving results from run to run and only
> rerunning tools (usually blast tools) that had their parameters changed in the
> control files. Taking advantage of that will probably be the biggest time
> saver for you. Something else that could save you almost as much time would be
> to set a reasonable lower bound on the size of contigs that maker will try to
> annotate (usually skipping contigs under 5 kbp or 10 kbp, depending on your
> genome). This is set with the min_contig parameter.
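
Before picking a min_contig value, it can help to see how much of the assembly a given cutoff would leave out. The sketch below tallies contigs and bases on each side of a cutoff, assuming a plain FASTA assembly and assuming that contigs shorter than the cutoff are the ones skipped. The function names and the file path in the usage comment are illustrative.

```python
def contig_lengths(fasta_lines):
    """Yield (name, length) for each record in FASTA-formatted lines."""
    name, length = None, 0
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            header = line[1:].split()
            name, length = (header[0] if header else ""), 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length

def summarize_cutoff(fasta_lines, min_contig):
    """Count contigs and bases at or above min_contig versus below it."""
    summary = {"kept": 0, "kept_bp": 0, "skipped": 0, "skipped_bp": 0}
    for _, length in contig_lengths(fasta_lines):
        side = "kept" if length >= min_contig else "skipped"
        summary[side] += 1
        summary[side + "_bp"] += length
    return summary

# Hypothetical usage (the path is an assumption):
#   with open("assembly.fasta") as handle:
#       print(summarize_cutoff(handle, 10000))
```

Running it at a few candidate cutoffs (for example 5000 and 10000) shows the tradeoff between the number of contigs skipped and the sequence given up.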
> 
> I'll have to check with my lab mates about the Repeat ORF searching and how
> they use CEGMA results. I think you can probably just put them all in there at
> once though. 
> 
> ~Daniel
> 
> 
> 
> 
> Daniel Ence
> Graduate Student
> dence at genetics.utah.edu
> Eccles Institute of Human Genetics
> University of Utah
> 15 North 2030 East, Room 2100
> Salt Lake City, UT 84112-5330
> 
> On Jul 7, 2014, at 9:26 AM, Nathaniel Jue <n.jue at uconn.edu>
>  wrote:
> 
>> Hi, 
>> 
>> I'm trying to run maker on a couple genomes right now and was wondering if
>> folks had any thoughts on ways to speed it up a bit. I'm running it on a
>> 48-processor supercomputer (lots of RAM, usually use it for genome assembly).
>> Both these genomes are a little fragmented, so there are lots of contigs,
>> which slows down the whole process. I am looking for ways to speed things up
>> and was wondering about a couple things:
>> 
>> 1) I'm currently just at the first round of maker predictions using EST and
>> protein evidence to build models. I had already done RepeatMasking, so I thought
>> I'd just input the resulting GFF to speed it up. It didn't seem to like the GFF, so
>> two questions: i) any thoughts on why that GFF wasn't acceptable? It's the
>> one that repeatmasker outputs if you ask it to; and ii) Providing this GFF
>> should generally allow the program to bypass the RepeatMasking step, correct?
>> Does it also make it bypass the Repeat ORF searching step?
>> 
>> 2) I plan to run both SNAP and Augustus on these genomes as well. The
>> two-step SNAP training from the tutorials seems straightforward, but I was
>> wondering about the Augustus step. From what I can tell, simply providing an
>> Augustus "trained" species name should turn on Augustus and blast/blat-like
>> hints generated within Maker are then used in gene prediction. Any thoughts
>> on if it's either more accurate or faster to do the Augustus predictions
>> outside of the Maker pipeline and then import them using the pred_gff
>> parameter in the maker_opts file?
>> 
>> 3) Finally, I noticed that you had a script for converting cegma gff files to
>> zff files for snap training? Currently, I am using predicted transcripts for
>> this species and protein sequences from related species for training. Does
>> anyone have any insight into using CEGMA results as well? Do you work
>> iteratively with them? For instance, start with hints from more
>> distant taxa (i.e. CEGMA) and then work your way closer? Or just throw
>> everything in at once and retrain after that?
>> 
>> Thanks in advance for any advice and insight.
>> 
>> Cheers,
>> Nate
>> 
>> 
>> Nathaniel Jue, Ph.D.
>> Department of Molecular and Cell Biology
>> University of Connecticut
>> Storrs, CT 06269
>> 
>>  
>> _______________________________________________
>> maker-devel mailing list
>> maker-devel at box290.bluehost.com
>> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org
> 
