[maker-devel] Couple quick questions about Maker
Nathaniel Jue
n.jue at uconn.edu
Tue Jul 8 09:56:37 MDT 2014
Carson, one more question: Any suggestions on how to combine the CEGMA and
MAKER est2genome/protein2genome results? Can I just concatenate and sort
the GFF files, or are there specific formatting issues I need to consider,
such as avoiding overlapping regions?
Thanks,
Nate
*Nathaniel Jue, Ph.D.*
Department of Molecular and Cell Biology
University of Connecticut
Storrs, CT 06269
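
To make the concatenate-and-sort idea in the question above concrete, here is a minimal Python sketch. It assumes both inputs are tab-delimited, nine-column GFF files; the function name merge_gff is hypothetical and not part of MAKER or CEGMA. It only reorders feature lines. It does not reconcile GFF2 versus GFF3 attribute syntax, rename duplicate IDs, or resolve overlapping features, so it does not settle whether the merged file is suitable for training.

```python
def merge_gff(*sources):
    """Merge iterables of GFF lines, sorted by seqid, then start.

    Ties on start put the longer feature first, then keep input order,
    so a gene line stays ahead of its same-start exons.
    """
    records = []
    for source in sources:
        for line in source:
            line = line.rstrip("\n")
            if line.startswith("##FASTA"):
                break  # embedded sequence follows; no more feature lines
            if not line or line.startswith("#"):
                continue  # skip blank lines, comments, and directives
            cols = line.split("\t")
            if len(cols) != 9:
                raise ValueError(f"expected 9 columns, got {len(cols)}: {line!r}")
            # len(records) is a unique tie-breaker that preserves input order
            records.append((cols[0], int(cols[3]), -int(cols[4]), len(records), line))
    records.sort()
    return [record[-1] for record in records]


if __name__ == "__main__":
    import sys

    # Usage (file names are placeholders): python merge_gff.py a.gff b.gff
    handles = [open(path) for path in sys.argv[1:]]
    try:
        for feature in merge_gff(*handles):
            print(feature)
    finally:
        for handle in handles:
            handle.close()
```

Sorting on coordinates alone is the simplest choice; any tool that expects parent features to precede their children on the same start position is accommodated by the tie-breaking rule, but nothing here checks that parents and children actually end up in the same file.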
On Mon, Jul 7, 2014 at 12:26 PM, Carson Holt <carsonhh at gmail.com> wrote:
> I don't think RepeatMasker produces GFF3. I believe it is GFF2 with the
> -gff option (which is pretty different). Also, if you provide GFF3 files for
> repeats, you will still need to turn off repeat masking in the control files
> by blanking out the options. Also, MAKER uses a step called RepeatRunner
> against an internal transposable element protein database, which is
> probably still running (and is slow because it's a search in translated
> protein space).
>
> For performance, you may want to give a larger max_dna_len for the MAKER
> run given that you have a large RAM machine. Also set all the depth_blast
> in maker_bopts.ctl to 15 or 20.
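
The control-file changes suggested above could be scripted roughly as follows. This is a sketch under stated assumptions: the helper set_ctl_options is hypothetical (not a MAKER tool), it assumes control-file lines of the form `key=value #comment`, the value 300000 for max_dna_len is only an illustrative "larger" number, and the three depth_blastn/depth_blastx/depth_tblastx names are my reading of what "all the depth_blast" refers to in maker_bopts.ctl.

```python
import re


def set_ctl_options(ctl_text, updates):
    """Return ctl_text with `key=value` lines rewritten per `updates`.

    Trailing `#comment` text on a rewritten line is kept; lines whose
    key is not in `updates` (and pure comment lines) are left unchanged.
    """
    out = []
    for line in ctl_text.splitlines():
        match = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if match and match.group(1) in updates:
            key = match.group(1)
            comment = match.group(3) or ""
            line = f"{key}={updates[key]} {comment}".rstrip()
        out.append(line)
    result = "\n".join(out)
    # Preserve a trailing newline if the input had one
    return result + "\n" if ctl_text.endswith("\n") else result


if __name__ == "__main__":
    # Illustrative values only; choose them for your own genome and machine.
    opts = "max_dna_len=100000 #length for dividing up contigs\n"
    bopts = "depth_blastn=0\ndepth_blastx=0\ndepth_tblastx=0\n"
    print(set_ctl_options(opts, {"max_dna_len": 300000}), end="")
    print(set_ctl_options(bopts, {"depth_blastn": 20,
                                  "depth_blastx": 20,
                                  "depth_tblastx": 20}), end="")
```

Editing the files by hand works just as well; a helper like this is mainly useful when the same settings are applied to several genomes.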
>
> CEGMA is convenient for training predictors because it finds genes that
> will always be in every eukaryote (i.e. high confidence). You can combine
> these with est2genome/protein2genome results from MAKER if you want. You
> can then use the resulting HMM for a larger MAKER run with experimental
> evidence, and then train again on those results. But beware that there is
> rarely any benefit from training beyond that second round. More training
> actually tends to make things worse (the overtraining paradox).
>
> --Carson
>
>
>
> From: Daniel Ence <dence at genetics.utah.edu>
> Date: Monday, July 7, 2014 at 10:00 AM
> To: Nathaniel Jue <n.jue at uconn.edu>
> Cc: "<maker-devel at yandell-lab.org>" <maker-devel at yandell-lab.org>
> Subject: Re: [maker-devel] Couple quick questions about Maker
>
> Hi Nathaniel,
>
> 1) We'll need to see the error messages that MAKER was giving to
> understand what might have gone wrong with the Repeat Masker gff3 file. If
> you could run maker on one of your scaffolds with your current settings and
> send us the complete output, we can start to figure out what happened.
>
> 2) MAKER interacts with its gene predictors (augustus, snap, and the other
> ones listed in the control files) in a way that improves their performance
> (with the hints and such). When you supply predictions through the pred_gff
> parameter, MAKER can't give that performance improvement, so there's
> something of a tradeoff there. I think the performance improvement is a key
> part of MAKER's success, so I would definitely recommend running the
> ab-initio tools internally.
>
> MAKER tries to save you time by saving results from run to run and only
> rerunning tools (usually blast tools) that had their parameters changed in
> the control files. Taking advantage of that will probably be the biggest
> time saver for you. Something else that could save you almost as much time
> would be to set a reasonable lower bound on the size of contigs that MAKER
> will try to annotate (usually skipping contigs shorter than 5 kbp or 10 kbp,
> depending on your genome). This lower bound is set with the min_contig
> parameter.
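
Before picking a min_contig value, it can help to see how much of the assembly a given cutoff would skip. The following Python sketch does that for a FASTA file; the function names are hypothetical, and it assumes that contigs strictly shorter than the cutoff are the ones skipped.

```python
def fasta_lengths(lines):
    """Yield (name, length) for each record in an iterable of FASTA lines."""
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Record name is the first word of the header, or "" if empty
            name, length = (line[1:].split() or [""])[0], 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def summarize_cutoff(lines, min_contig):
    """Count contigs and bases kept versus skipped at a length cutoff."""
    summary = {"kept": 0, "kept_bp": 0, "skipped": 0, "skipped_bp": 0}
    for _, length in fasta_lengths(lines):
        bucket = "kept" if length >= min_contig else "skipped"
        summary[bucket] += 1
        summary[bucket + "_bp"] += length
    return summary


if __name__ == "__main__":
    import sys

    # Usage (path is a placeholder): python cutoff.py genome.fasta
    for cutoff in (5000, 10000):
        with open(sys.argv[1]) as handle:
            print(cutoff, summarize_cutoff(handle, cutoff))
```

If a cutoff skips many contigs but few bases, most of the run time saved comes from contigs that were unlikely to hold complete genes anyway.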
>
> I'll have to check with my lab mates about the Repeat ORF searching and
> how they use CEGMA results. I think you can probably just put them all in
> there at once though.
>
> ~Daniel
>
>
>
>
> Daniel Ence
> Graduate Student
> dence at genetics.utah.edu
> Eccles Institute of Human Genetics
> University of Utah
> 15 North 2030 East, Room 2100
> Salt Lake City, UT 84112-5330
>
> On Jul 7, 2014, at 9:26 AM, Nathaniel Jue <n.jue at uconn.edu>
> wrote:
>
> Hi,
>
> I'm trying to run MAKER on a couple of genomes right now and was wondering if
> folks had any thoughts on ways to speed it up a bit. I'm running it on a
> 48-processor supercomputer (lots of RAM, usually use it for genome
> assembly). Both these genomes are a little fragmented, so there are lots of
> contigs, which slows down the whole process. I am looking for ways to speed
> things up and was wondering about a couple things:
>
> 1) I'm currently just at the first round of MAKER predictions using EST
> and protein evidence to build models. I had already done RepeatMasking, so I
> thought I'd just input the resulting GFF to speed it up. MAKER didn't seem to
> like the GFF, so two questions: i) any thoughts on why that GFF wasn't
> acceptable? It's the one that RepeatMasker outputs if you ask it to; and
> ii) providing this GFF should generally allow the program to bypass the
> RepeatMasking step, correct? Does it also make it bypass the Repeat ORF
> searching step?
>
> 2) I plan to run both SNAP and Augustus on these genomes as well. The
> two-step SNAP training from the tutorials seems straightforward, but I was
> wondering about the Augustus step. From what I can tell, simply providing
> an Augustus "trained" species name should turn on Augustus and
> blast/blat-like hints generated within Maker are then used in gene
> prediction. Any thoughts on if it's either more accurate or faster to do
> the Augustus predictions outside of the Maker pipeline and then import them
> using the pred_gff parameter in the maker_opts file?
>
> 3) Finally, I noticed that you had a script for converting CEGMA GFF files
> to ZFF files for SNAP training? Currently, I am using predicted transcripts
> for this species and protein sequences from related species for training.
> Does anyone have any insight into using CEGMA results as well? Do you work
> iteratively with them? For instance, start with hints from more
> distant taxa (i.e. CEGMA) and then work your way closer? Or just throw
> everything in at once and retrain after that?
>
> Thanks in advance for any advice and insight.
>
> Cheers,
> Nate
>
>
> *Nathaniel Jue, Ph.D.*
> Department of Molecular and Cell Biology
> University of Connecticut
> Storrs, CT 06269
>
> _______________________________________________
> maker-devel mailing list
> maker-devel at box290.bluehost.com
> http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org